Diary Of An x264 Developer

10/31/2009 (7:12 pm)

The other hidden cycle-eating demon: code cache misses

Filed under: assembly, gcc, speed

A while back I talked about L1 data cache misses which, while individually not a big deal, can slow down your program in the same fashion that a swarm of angry bees stings a honey-thieving bear to death.  This time we’ll look at an even more insidious evil: the L1 code cache miss.

With data caches, despite what I said in the previous article, you still have a good bit of control.  You can explicitly prefetch data into them using the prefetch instructions.  You control memory allocation and can make all sorts of changes to potentially improve access patterns.  Every single memory access is made explicitly by your code.

But it isn’t the same with the L1 code cache (L1I).  You can’t prefetch to it at all; the prefetch instructions go to the L1 data cache, not the L1 code cache.  Unless you write everything directly in assembly, you can’t control the allocation and layout of the code.  And you don’t control access to the code at all; it is accessed implicitly when it is run.
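
For instance, gcc’s __builtin_prefetch (which compiles to the x86 prefetch instructions) can only pull data into the cache hierarchy; there is no analogous intrinsic for code.  A minimal sketch, with a made-up helper function:

    #include <stdint.h>

    /* Data can be prefetched explicitly, ahead of the loads that need it. */
    static void blend_rows(uint8_t *dst, const uint8_t *src,
                           int stride, int rows)
    {
        for (int i = 0; i < rows; i++) {
            __builtin_prefetch(src + stride); /* pull the next row into L1D */
            for (int x = 0; x < stride; x++)
                dst[x] = (dst[x] + src[x] + 1) >> 1;
            src += stride;
            dst += stride;
        }
    }

    /* There is no equivalent for instructions: code only enters L1I when
     * execution actually reaches it. */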

Many readers may have heard stories of programs running faster with gcc’s -Os (optimize for size) than with -O2 or -O3 (optimize for speed); this is why.  Larger code causes more L1I cache misses, puts more load on the L2->L1 load unit, and uses up L2 cache as well.  While the naive programmer may see great advantage in lots of loop unrolling or inlining, even timing the code may not be sufficient to prove that such code-size-increasing optimizations are worthwhile, since other parts of the program called afterwards can suffer from the resulting L1 instruction cache evictions.

How much code size matters depends heavily on the situation.  I’ll go through some cases in x264 to demonstrate.  First, I’ll grab print_sizes.pl, a nice little script that will tell you the size of every function in a binary.

6784 _x264_frame_deblock_row

That’s one huge function, but it doesn’t matter.  frame_deblock_row is called on each row of the frame separately, as opposed to most other functions, which are called at the per-block level.  As such, it’s run a huge number of times in sequence without other parts of the program being active.  This means we don’t care about keeping other functions in the instruction cache, since they’re not being called in the meantime.  As long as this function and the functions it calls total less than the size of the instruction cache (32 kilobytes on a modern Intel chip), we’re good.

128 _x264_pixel_ssd_4x4_mmx
208 _x264_pixel_ssd_8x4_mmx
224 _x264_pixel_ssd_4x8_mmx
368 _x264_pixel_ssd_8x8_mmx
704 _x264_pixel_ssd_16x8_mmx
720 _x264_pixel_ssd_8x16_mmx
1392 _x264_pixel_ssd_16x16_mmx

That’s about 3 kilobytes of assembly functions… which are run a few times each at the end of every single x264_rd_cost_mb() call.  Since the total amount of code that x264_rd_cost_mb() ends up calling can well exceed 32 kilobytes, every single time our RD calls finish, we evict 1-2 kilobytes of data from the instruction cache.

If we look at the code, we find that the SSD functions are completely unrolled; rolling them up saves a great deal of binary size and actually increases performance, despite checkasm’s benching feature telling us the functions got slower.  In real situations, the cost of the larger code size outweighs the cost of the few extra instructions of loop overhead.
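
To make the trade-off concrete, here’s a minimal C sketch of the rolled version (the real x264 functions are hand-written MMX assembly; the name and signature here are just illustrative):

    #include <stdint.h>

    /* Rolled sum of squared differences over a w x h block: one small
     * loop body shared by every block size, at the cost of a little
     * loop overhead per row. */
    static int ssd_wxh(const uint8_t *a, intptr_t a_stride,
                       const uint8_t *b, intptr_t b_stride, int w, int h)
    {
        int sum = 0;
        for (int y = 0; y < h; y++) {
            for (int x = 0; x < w; x++) {
                int d = a[x] - b[x];
                sum += d * d;
            }
            a += a_stride;
            b += b_stride;
        }
        return sum;
    }

    /* The unrolled approach emits a fully expanded body per block size
     * (4x4, 8x8, 16x16, ...), which wins in an isolated benchmark but
     * costs kilobytes of instruction cache in real use. */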

4896 _refine_subpel
18672 _x264_me_search_ref

Holy bejeezus, those are some huge functions, considering that they run dozens of times per macroblock.  But are they really a big problem?  x264 runs all the motion search functions first and then moves on to other parts of analysis: this means that as long as motion search fits in the 32-kilobyte instruction cache, it should be reasonably efficient.  Additionally, me_search_ref actually has 5 parts, one for each motion estimation method, only one of which is used at a time.  Furthermore, it turns out that half the space in me_search_ref ends up being the fault of UMH.

Despite all this, we can’t be sure that there isn’t something to be gained here; the effectiveness of these two commits shows that shrinking a function, even one which is very slow (and thus for which cache eviction is not as important), can still provide great performance benefits.  This could also be a compiler effect; gcc is known to be less efficient at optimizing extremely large functions.

What can we get out of all of this?  There are a few guidelines we can follow:

1.  Whenever possible, try to restrict the code size of your working set to 32 kilobytes (64 kilobytes if you’re optimizing for AMD chips).  If you can’t, try to organize the parts of that working set such that each part is less than 32 kilobytes and is run separately from the other parts.

2.  Whenever you make an optimization that significantly increases code size, try to weigh the gained speed against the lost instruction cache.  This is often a very difficult decision to make; I would tend to err on the side of smaller code size.

3.  Keep a close watch on the compiler’s inlining and loop unrolling habits.  Sometimes it’s better to partially inline a function than to inline it completely.  At other times you may want to use the NOINLINE attribute to force gcc to stop over-inlining.  Sometimes you may even find it useful to use ugly kludges to stop gcc from unrolling loops unnecessarily in particular cases; a sketch of both tricks follows.  Of course, careful benchmarking should accompany any such changes.
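
A minimal sketch of both, assuming gcc (the NOINLINE macro is just the usual wrapper around gcc’s noinline attribute, the empty asm statement is a common anti-unrolling kludge, and the helper functions are made up):

    /* Wrap gcc's attribute so it can be defined away on other compilers
     * or tuned per architecture. */
    #define NOINLINE __attribute__((noinline))

    /* Called from many places: keep exactly one copy of the body instead
     * of letting gcc duplicate it into every caller. */
    NOINLINE static int expensive_helper(int x)
    {
        return x * x + x;
    }

    static int sum_helpers(const int *v, int n)
    {
        int sum = 0;
        for (int i = 0; i < n; i++) {
            /* Empty asm that gcc can't see through: discourages it from
             * unrolling this loop and bloating the function. */
            __asm__ volatile ("");
            sum += expensive_helper(v[i]);
        }
        return sum;
    }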

16 Responses to “The other hidden cycle-eating demon: code cache misses”

  1. Fruit Says:

Probably a dumb question: was the advantage of the smaller code over the unrolled but bigger version measured on AMD too, given that it has 2x bigger L1 caches at the moment…

    Just out of curiosity…

  2. Dark Shikari Says:

Nope, I don’t have an AMD chip to test on. (Well, I have one I can SSH into, but I’m lazy.)

    We have found in the past that the results can differ though; in libavcodec, for example, NOINLINEing most of the cabac functions helped 2-4% on Intel chips but hurt about 0.5% on AMD.

  3. Mans Says:

    With GNU nm, the -S option prints the size of each symbol more efficiently than your objdump/perl hack.

  4. Dark Shikari Says:

    @Mans

    Except that it doesn’t actually work. At best, it ignores the last symbol in each file (which is often a very large and important function). At worst, it gives completely nonsensical results (as on Windows).

  5. Pengvado Says:

Ignoring the last symbol isn’t nm’s doing; it was a bug in Astrange’s script, which used nm’s reported addresses to derive function sizes. He wrote it because yasm doesn’t output whatever info nm uses to determine sizes.

  6. astrange Says:

    I wrote it because darwin nm doesn’t have -S. Might as well ask for it, maybe it’ll happen…

    It’d be nice for yasm files to have debug info; very occasionally I want to see their params in gdb.

  7. Fruit Says:

I did a small benchmark of revision 1318 versus revision 1310 to see how the performance changes…

    CPU: Athlon 64 2Ghz 512kb (Venice/K8).
    Win XP SP3 32bit, official builds, source is 660×320 mpeg2 (cropped dvd, ivtc through tfm, served from dgdecode)
    This is my arbitrary commandline (I averaged results from 10 consecutive runs):
--crf 19 --no-progress --no-mbtree --tune grain --b-pyramid normal --preset slower --output nul

    r1318: 24.851 fps
    r1310: 25.043 fps

(About 0.8% overall slowdown on K8?)

  8. Pengvado Says:

    @Fruit
Benchmarks without error bars are useless. Dark Shikari is allowed to say “0.5%” without further qualifiers, because I trust that he has already done the proper statistics and that this is only a summary. You, however, are presenting a data point. And for all I know, the precision of your measurements might be coarser than 0.8%, making it meaningless. It’s certainly not .01% (the precision with which you wrote your numbers), because timing on windows machines just isn’t that deterministic.

  9. Fruit Says:

    @Pengvado:
    If you want the numbers from each run:
r1318: 24.86 (5x) 24.89 (2x) 24.82 (2x) 24.79
r1310: 25.02 (5x) 24.99 (2x) 25.22 25.05 25.21

There, r1310 was faster in all runs…
Of course, there could have been background services running all the time, but otherwise there was just an IRC client, an Explorer window, and cmd.exe open when running these.

  10. Fruit Says:

Damn, I keep forgetting to add that the source clip was 300 frames long. Sorry for that.
    I also tried preset medium out of curiosity, but when the first results showed about the same difference, I didn’t run it again…

  11. Dark Shikari Says:

    @Fruit

    It could be because the Athlon 64 has a larger code cache size than Intel chips (64kb vs 32kb) and thus an optimization that helps on an Intel chip may hurt on an Athlon 64.

    I have seen this occur myself when trying to optimize libavcodec.

    Since the Athlon 64 is the chip with the *largest* code cache of all chips we care about (Intel’s all being smaller, and ARM usually being even smaller than Intel), I would tend to err on the side of smaller code even if it very slightly hurts Athlon 64 in some cases.

    In theory we could flood the code with ifdefs for inlining decisions, but I don’t feel masochistic enough to do that.

  12. Fruit Says:

    Yeah, I see the reasons (there’s more users of Core 2 and i7, the gain is bigger than the loss with AMD, it would be hard to have important code in two versions).

But you can’t blame me for not liking slowdowns, although I’ll just deal with them. I hope there won’t be too many opportunities for this kind of optimization, either :)

  13. Peter Cordes Says:

    IIRC, call/return costs more on K8 than on core/nehalem.

    I had a look at
    http://www.agner.org/optimize/
    http://www.agner.org/optimize/microarchitecture.pdf

    And K8 doesn’t have a “stack engine” to keep track of %rsp for call/ret, push/pop, so they’re slower. K10 does, though, just like Intel >= core2.

    Fortunately even K8 has a call/return stack to predict return addresses, though.

    On hyperthreading (nehalem), caches are competitively shared between threads, so it’s probably best to think of having ~16kB of L1 I$, except for code that both threads will use regardless of whether they’re in lockstep or not.

    One not-so-painful way to go would be to have a few different levels of NOINLINE macro.

I don’t think cpp is flexible enough to make anything like this work:
[code]
/* Default: forbid inlining at every size level. */
#define NOINLINE_100   __attribute__((noinline))
#define NOINLINE_1000  __attribute__((noinline))
#define NOINLINE_10000 __attribute__((noinline))

/* If this arch's I$ budget is big enough, let the smaller levels inline.
 * (INLINE_BUDGET is a made-up per-arch tuning knob.) */
#if INLINE_BUDGET > 1000
#undef NOINLINE_100
#define NOINLINE_100 /* empty */
/* ditto for NOINLINE_1000 */
#endif

NOINLINE_1000 int foo(void) { return 0; }
NOINLINE_100 int small(void) { return 0; }
NOINLINE_10000 int large(void) { return 0; }
[/code]

The numbers would either be arbitrary, in which case you'd just select a set of functions to NOINLINE based on tuning for each arch, or they would be based on how much I$ is needed to make inlining a given function a good idea.

    Probably still more trouble than it's worth.

It's too bad AMD doesn't do better without inlining. Maybe the extra call instructions put more load on the branch predictor, so they're more expensive than the 2-clock latency they're listed as having. Or it's like you said: without inlining they can't be optimized as well (because of lack of context from the caller). And I guess gcc follows the ABI even for static functions, and the caller has to save and restore registers. And on 32-bit, the x86 calling convention adds the overhead of pushing the args onto the stack, in memory, then reading them out again in the callee. (Yuck. One of AMD64's best features is the chance to redesign the ABI!)

  14. oliver Says:

    Out of curiosity, do you have any experience with using Valgrind/Cachegrind for analyzing this? I’m interested in whether these tools give reasonable and useful results for such complicated optimizations…

  15. Dark Shikari Says:

    @oliver

    Valgrind is pretty useless for profiling because it only counts instructions, not real clock cycles. Cachegrind is good for getting a general idea of the effect of a change on cache utilization, but it doesn’t have anywhere near the same cache characteristics as a real CPU (e.g. prefetch, etc), so it’s really not that useful either.

    In the end, the best approach is to use Oprofile and get the real stats from a real CPU.

  16. Mate Says:

    I find your posts very interesting. Strangely enough, there are very few webpages that talk deeply about CPU pipelines, data cache misses, instruction read misses, etc. So your blogs bring something to light that is extremely important for high-speed computing.

I do research on SAT solvers, a highly competitive field that is basically number-crunching on a very large scale, not unlike x264… though there we worry more about cache misses than SSE instructions. I think your expertise alone could win our research competition, the SAT Race :O
