A while back I talked about L1 data cache misses which, while individually not a big deal, can slow down your program in the same fashion that a swarm of angry bees stings a honey-thieving bear to death. This time we’ll look at an even more insidious evil: the L1 code cache miss.
With data caches, despite what I said in the previous article, you still have a good bit of control. You can prefetch data into them explicitly using the prefetch instructions. You control memory allocation and can make all sorts of changes to potentially improve access patterns. Every single memory access is made explicitly by your own code.
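As a concrete illustration of that control, here’s a minimal sketch of explicit data prefetching using GCC’s __builtin_prefetch; the loop and the 8-element prefetch distance are made up for the example, not tuned values.

```
#include <stddef.h>

/* Minimal sketch: explicitly prefetch data a fixed distance ahead of its use.
 * The 8-element lookahead is illustrative; the right distance depends on
 * memory latency and on how much work each iteration does. */
double sum(const double *a, size_t n)
{
    double s = 0.0;
    for (size_t i = 0; i < n; i++) {
        /* arguments: address, 0 = prefetch for read, 3 = high temporal locality */
        __builtin_prefetch(&a[i + 8], 0, 3);
        s += a[i];
    }
    return s;
}
```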
But it isn’t the same with the L1 code cache (L1I). You can’t prefetch into it at all; the prefetch instructions go to the L1 data cache, not the L1 code cache. Unless you write everything directly in assembly, you can’t control the allocation and layout of the code. And you don’t control access to the code at all; it is accessed implicitly when it is run.
Many readers may have heard stories of programs running faster with gcc’s -Os (optimize for size) than with -O2 or -O3 (optimize for speed); this is why. Larger code causes more L1I cache misses, puts more load on the L2->L1 memory load unit, and uses up L2 cache as well. While the naive programmer may see great advantage in lots of loop unrolling or inlining, even timing the code may not be sufficient to prove that such code-size-increasing optimizations are worthwhile: other parts of the program called afterwards can suffer from the resulting evictions in the L1 instruction cache.
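The tradeoff also doesn’t have to be all-or-nothing. Assuming GCC (the attribute names below are GCC-specific, and the optimize attribute in particular is documented as best-effort), a sketch of building a file with -Os while letting one hot loop get the aggressive treatment might look like this:

```
/* Sketch: build the file with -Os, but opt one hot function into -O3.
 * GCC groups hot functions into the .text.hot section and cold ones into
 * .text.unlikely, which helps keep the code that actually runs packed together. */
__attribute__((hot, optimize("O3")))
void scale_row(float *dst, const float *src, int n)
{
    for (int i = 0; i < n; i++)
        dst[i] = src[i] * 2.0f;
}

/* Rarely-executed code can be marked cold so it stays out of the hot path. */
__attribute__((cold))
void report_error(const char *msg);
```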
How much code size matters depends heavily on the situation. I’ll go through some cases in x264 to demonstrate. First, I’ll grab print_sizes.pl, a nice little script that will tell you the size of every function in a binary.
That’s one huge function, but it doesn’t matter: frame_deblock_row is called on each row of the frame separately, unlike most other functions, which are called per block. As such, it’s run a huge number of times in sequence without other parts of the program being active. This means we don’t care about keeping other functions around in the instruction cache, since they’re not being called at the time. As long as this function and the functions it calls total less than the size of the instruction cache (32 kilobytes on a modern Intel chip), we’re good.
That’s about 3 kilobytes of assembly functions… which are run a few times each at the end of every single x264_rd_cost_mb() call. Since the total amount of code that x264_rd_cost_mb() ends up calling can easily exceed 32 kilobytes, every single time our RD calls finish, these functions evict 1-2 kilobytes of other code from the instruction cache.
If we look at the code, we find that the SSD functions are completely unrolled; rolling them back up into loops saves a great deal of binary size and actually increases performance, despite checkasm’s benching feature telling us the functions got slower. In real situations the cost of the extra code size outweighs the cost of the few instructions of loop overhead.
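To make that concrete, here’s a rough C sketch of what “rolling up” an SSD function means; the real x264 functions are hand-written SIMD assembly, so this only shows the shape of the change, not the actual code.

```
#include <stdint.h>

/* Sketch: a rolled-up 16x16 sum of squared differences. A fully unrolled
 * version duplicates the per-pixel work 256 times with no loop overhead,
 * but occupies far more instruction-cache lines than this does. */
static int ssd_16x16(const uint8_t *a, intptr_t a_stride,
                     const uint8_t *b, intptr_t b_stride)
{
    int ssd = 0;
    for (int y = 0; y < 16; y++) {
        for (int x = 0; x < 16; x++) {
            int d = a[x] - b[x];
            ssd += d * d;
        }
        a += a_stride;
        b += b_stride;
    }
    return ssd;
}
```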
Holy bejeezus, those are some huge functions, considering that they run dozens of times per macroblock. But are they really a big problem? x264 runs all the motion search functions first and then moves on to other parts of analysis: this means that as long as motion search fits in the 32 kilobyte instruction cache, it should be reasonably efficient. Additionally, me_search_ref actually has 5 parts, one for each motion estimation method, only one of which is used at a time. Furthermore, it turns out that half the space in me_search_ref ends up being the fault of UMH.
Despite all this, we can’t be sure that there isn’t something to be gained here; the effectiveness of these two commits shows that shrinking a function, even one which is very slow (and thus for which cache eviction is not as important), can still provide great benefits for performance. This could also be affected by the compiler; gcc is known to be less efficient at optimizing extremely large functions.
What can we get out of all of this? There are a few guidelines we can follow:
1. Whenever possible, try to restrict the code size of your working set to 32 kilobytes (64 kilobytes if you’re optimizing for AMD chips). If you can’t, try to organize the parts of that working set such that each part is less than 32 kilobytes and is run separately from the other parts.
2. Whenever you make an optimization that significantly increases code size, try to weigh the gained speed against the lost instruction cache. This is often a very difficult decision to make; I would tend to err on the side of smaller code size.
3. Keep a close watch on the compiler’s inlining and loop unrolling habits. Sometimes it’s better to partially inline a function than to inline it completely. At other times you may want to use the NOINLINE attribute to force gcc to stop over-inlining. Sometimes you may even find it useful to resort to ugly kludges to stop gcc from unrolling loops unnecessarily in particular cases; a rough sketch of these tools follows below. Of course, careful benchmarking should accompany any such changes.
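By way of illustration, here’s a minimal, GCC-specific sketch of the tools involved. The NOINLINE attribute mentioned above typically boils down to GCC’s noinline function attribute, and the unroll pragma needs a reasonably recent GCC (8 or later); before that, people resorted to kludges like hiding the loop bound behind a volatile read. The function names here are made up for the example.

```
/* Sketch of the GCC-specific knobs involved; the function names are made up. */

/* noinline keeps one out-of-line copy of this helper instead of letting the
 * compiler duplicate its body into every caller. */
__attribute__((noinline))
static int clip_pixel(int v)
{
    return v < 0 ? 0 : (v > 255 ? 255 : v);
}

void clip_row(int *row, int n)
{
    /* Ask GCC (8+) not to unroll this loop so the code stays small even at -O3. */
    #pragma GCC unroll 1
    for (int i = 0; i < n; i++)
        row[i] = clip_pixel(row[i]);
}
```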