Diary Of An x264 Developer

10/31/2009 (7:12 pm)

The other hidden cycle-eating demon: code cache misses

Filed under: assembly,gcc,speed ::

A while back I talked about L1 data cache misses which, while individually not a big deal, can slow down your program in the same fashion that a swarm of angry bees stings a honey-thieving bear to death.  This time we'll look at an even more insidious evil: the L1 code cache miss.

With data caches, despite what I said in the previous article, you still have a good bit of control.  You can prefetch data into them explicitly using the prefetch instructions.  You control memory allocation and can make all sorts of changes to potentially improve access patterns.  Every single memory access is made explicitly by you, in your code.
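
For example, gcc exposes the data-prefetch instructions through a builtin; a minimal sketch (the lookahead of 8 elements is a made-up number you would have to tune):

#include <stddef.h>

/* Prefetch data a few iterations ahead so it is already in L1D by the
   time we read it.  __builtin_prefetch arguments: address, 0 = read,
   3 = keep in all cache levels.  Prefetching a little past the end of
   the array is harmless; prefetches never fault. */
void double_all( int *dst, const int *src, size_t n )
{
    size_t i;
    for( i = 0; i < n; i++ )
    {
        __builtin_prefetch( &src[i+8], 0, 3 );
        dst[i] = src[i] * 2;
    }
}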

But it isn't the same with the L1 code cache (L1I).  You can't prefetch to it at all; the prefetch instructions go to the L1 data cache, not the L1 code cache.  Unless you write everything directly in assembly, you can't control the allocation and layout of the code.  And you don't control access to the code at all; it is accessed implicitly whenever it runs.

Many readers may have heard stories of programs running faster with gcc's -Os (optimize for size) than with -O2 or -O3 (optimize for speed); this is why.  Larger code causes more L1I cache misses, puts more load on the L2->L1 memory load unit, and uses up L2 cache as well.  While the naive programmer may see great advantage in lots of loop unrolling or inlining, even timing the code may not be enough to prove that such code-size-increasing optimizations are worthwhile, since other parts of the program called afterwards can suffer once the bloated code has evicted them from the L1 instruction cache.
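
If you do want heavy unrolling or inlining in a hot spot without bloating everything around it, gcc lets you hint code-size decisions per function; a rough sketch (the attributes are real gcc ones, the function names are made up, not from x264):

/* Hot helper: tiny and called constantly, so inlining it is cheap. */
static inline int clip3( int x, int lo, int hi )
{
    return x < lo ? lo : (x > hi ? hi : x);
}

/* Rarely-taken path: forced out of line (and marked cold) so its code
   doesn't sit in the middle of the hot path evicting useful lines
   from L1I. */
__attribute__((noinline,cold)) void report_bitstream_error( const char *msg );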


03/25/2009 (9:01 pm)

An excerpt from objdump

Filed under: assembly,fail,ffmpeg,gcc ::

mov eax,esi
shr r10d,2
movsxd rax,[rax*4+0]

The tags should make this post self-explanatory.

(Yes, I’ll post something else soon.  Bug me about it on IRC or something.)

Update: the braindeadness continues.

cmp    edx,esi
cmovge edi,[esp+0xac]
cmp    edx,esi

05/16/2008 (11:43 pm)

Write combining

Filed under: gcc,speed,ugly code,x264 ::

Let’s say we need to copy a few variables from one array to another. The obvious way is something like this:

/* byte here means an 8-bit unsigned type (uint8_t). */
byte array1[4] = {1,2,3,4};
byte array2[4];
int i;
for(i = 0; i < 4; i++) array2[i] = array1[i];

But this is suboptimal for many reasons. For one, we're doing four separate 8-bit reads and writes, which on 32-bit systems may actually be slower than 32-bit ones; a single 32-bit read/write may cost no more than a single 8-bit read/write, so we're doing four operations where one would do. But the main issue is that we could be doing this:

/* DECLARE_ALIGNED_4 is x264's macro for forcing 4-byte alignment of a
   declaration (gcc's aligned attribute under the hood).  The
   initializer has to stay outside the macro parentheses, since its
   commas would otherwise be taken as macro argument separators. */
DECLARE_ALIGNED_4( byte array1[4] ) = {1,2,3,4};
DECLARE_ALIGNED_4( byte array2[4] );
*(uint32_t*)array2 = *(uint32_t*)array1;

In a single operation instead of 4, we just copied the whole array. Faster speed-wise and shorter code-wise, too. The alignment is to ensure that we don’t copy between unaligned arrays, which could crash on non-x86 architectures (e.g. PowerPC) and would also go slightly slower on x86 (but still faster than the uncombined write). But, one might ask, can’t the compiler do this? Well, there are many reasons it doesn’t happen. We’ll start from the easiest case and go to the hardest case.

The easiest case is a simple zeroing of a struct (say s={a,b} where a and b are 16-bit integers). The struct is likely to be aligned by the compiler to begin with and writing zero to {a,b} is the same as writing a 32-bit zero to the whole struct. But GCC doesn’t even optimize this; it still assigns the zeroes separately! How stupid.
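
Concretely, the kind of transformation meant here (illustrative struct and names, same type-punning style as the snippet above):

#include <stdint.h>

typedef struct { int16_t a, b; } pair16;

/* The compiler lays this struct out itself, so in practice it is
   4-byte aligned (an aligned attribute would guarantee it). */
static pair16 s;

void zero_pair_slow( void )
{
    /* Two separate 16-bit stores, which is what gcc emits here. */
    s.a = 0;
    s.b = 0;
}

void zero_pair_fast( void )
{
    /* One 32-bit store zeroing both fields at once. */
    *(uint32_t*)&s = 0;
}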

The second-easiest case is the generalization of this: if the function accesses the arrays directly (rather than through pointers to arrays, whose alignment it might not know) and assigns zero or some other constant value, write-combining is trivial.  But again, GCC doesn't do it.
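
The array version of the same idea, as a sketch (illustrative names; the aligned attribute is the sort of thing DECLARE_ALIGNED_4 expands to):

#include <stdint.h>

/* A directly accessed array the compiler itself lays out and aligns. */
static uint8_t flags[4] __attribute__((aligned(4)));

void clear_flags_slow( void )
{
    /* Four separate byte stores... */
    flags[0] = 0; flags[1] = 0; flags[2] = 0; flags[3] = 0;
}

void clear_flags_fast( void )
{
    /* ...versus one combined 32-bit store. */
    *(uint32_t*)flags = 0;
}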

Now we get to the harder stuff. What if we're copying between two arrays, both of which are directly accessed? The compiler now has to detect the sequential copying and merge it. This is basically a simple form of autovectorization; it's no surprise at all that GCC doesn't do this.

The hardest, and in fact nearly impossible, case is the one in which we're dealing with pointers to arrays as arguments; the compiler has no reliable way of knowing that the pointers are aligned (though we as programmers might know that they always are).  In some cases it could derive alignment accurately, by annotating pointers passed between functions, and then do write combining, but that would of course be very difficult.  On x86, it's still worthwhile to combine even if there's a misalignment risk, since a misaligned access only goes slightly slower rather than crashing.
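
As an aside (not something this post proposes): when alignment really is unknown, the combined copy can also be written as a fixed-size memcpy, letting the compiler pick the widest access that is safe on the target:

#include <stdint.h>
#include <string.h>

/* dst/src may be unaligned.  A memcpy with a compile-time-constant
   size is typically lowered to a single 32-bit load/store on x86, and
   to smaller accesses on architectures that require alignment. */
void copy4_any_alignment( uint8_t *dst, const uint8_t *src )
{
    memcpy( dst, src, 4 );
}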

The end result of this kind of operation is a massive speed boost in such functions; for example, in the section where motion vectors are cached (in macroblock_cache_save) I got over double the speed by converting 16-bit copies to write-combined copies. That was on a 32-bit system; on a 64-bit system we can do even better. The code actually uses 64-bit copies, so a 64-bit compiled binary does the best it can, and the compiler is smart enough to split them into 32-bit copies on 32-bit systems.
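
The shape of the change was roughly this; a sketch, not the actual x264 code, and it assumes the cached motion-vector arrays are suitably aligned:

#include <stdint.h>

/* Copy two X,Y motion vectors (four int16_t) as one 64-bit chunk.
   A 64-bit build does this as a single load/store; a 32-bit build
   splits it into two 32-bit operations. */
static inline void copy_mv_pair( int16_t *dst, const int16_t *src )
{
    *(uint64_t*)dst = *(const uint64_t*)src;
}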

We could actually do even better if we were willing to use MMX or SSE, since MMX allows 64-bit copies on 32-bit systems and SSE allows 128-bit copies. Unfortunately, this would completely sacrifice portability, and at this point the additional speed boost over the current merged copies would be pretty small.
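
For comparison, the SSE2 version of such a copy would look something like this (an intrinsics sketch; as said above, not worth the portability cost here):

#include <emmintrin.h>
#include <stdint.h>

/* 128-bit copy via SSE2; both pointers assumed 16-byte aligned. */
static inline void copy16_sse2( uint8_t *dst, const uint8_t *src )
{
    _mm_store_si128( (__m128i*)dst, _mm_load_si128( (const __m128i*)src ) );
}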

One of the big tricks currently is the ability to treat two motion vectors as one, and since all motion vectors come in pairs (X and Y, 16-bit signed integers each), it's quite easy to manipulate them as pairs. This allowed me to drastically speed up a lot of the manipulation involved in motion vector prediction and general copying and storing. The result of all the issues described in the article is this massive diff.
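
In that view, the X and Y components of a motion vector pack into one 32-bit word, so comparisons, copies and stores touch both at once; a sketch of the idea (x264's real helpers differ):

#include <stdint.h>

/* A motion vector: X and Y, 16-bit signed each, also viewable as a
   single 32-bit word. */
typedef union { int16_t c[2]; uint32_t u; } mv_t;

/* Compare both components in one 32-bit operation instead of two
   16-bit ones. */
static inline int mv_equal( mv_t a, mv_t b )
{
    return a.u == b.u;
}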