The other hidden cycle-eating demon: code cache misses
A while back I talked about L1 data cache misses which, while individually not a big deal, can slow down your program the same way a swarm of angry bees stings a honey-thieving bear to death. This time we’ll look at an even more insidious evil: the L1 code cache miss.
With data caches, despite what I said in the previous article, you still have a good bit of control. You can prefetch data into them explicitly using the prefetch instructions. You control memory allocation and can make all sorts of changes to potentially improve access patterns. Every single memory access is one you wrote explicitly in your code.
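To make the data-side control concrete, here is a minimal sketch of an explicit prefetch in C. It assumes GCC or Clang, whose `__builtin_prefetch(addr, rw, locality)` builtin emits the target's prefetch instruction where one exists. The function name `sum_indirect` and the `PREFETCH_DISTANCE` constant are made up for illustration; the right distance depends on the machine and has to be measured.

```c
#include <stddef.h>

/* Hypothetical tuning constant: how many elements ahead to prefetch.
 * The best value is machine-dependent and must be found by timing. */
#define PREFETCH_DISTANCE 16

/* Sum values[index[i]] for i in [0, n). The indirect access pattern is
 * hard for the hardware prefetcher to predict, so we hint at the data
 * we will need a few iterations from now. */
long sum_indirect(const long *values, const size_t *index, size_t n)
{
    long total = 0;
    for (size_t i = 0; i < n; i++) {
        if (i + PREFETCH_DISTANCE < n) {
            /* GCC/Clang builtin: rw = 0 means "for reading",
             * locality = 3 means "keep in all cache levels". */
            __builtin_prefetch(&values[index[i + PREFETCH_DISTANCE]], 0, 3);
        }
        total += values[index[i]];
    }
    return total;
}
```

The prefetch is only a hint and never changes the result, so the function returns the same sum with or without it. Note that the hint targets the data cache; nothing in this loop gives you a comparable handle on where the loop's own instructions live.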
But it isn’t the same with the L1 code cache (L1I). You can’t prefetch into it at all; the prefetch instructions go to the L1 data cache, not the L1 code cache. Unless you write everything directly in assembly, you can’t control the allocation and layout of the code. And you don’t control access to the code at all; it is accessed implicitly when it is run.
Many readers may have heard stories of programs running faster with gcc’s -Os (optimize for size) than -O2 or -O3 (optimize for speed); this is why. Larger code size causes more L1I cache misses, puts more load on the L2->L1 memory load unit, and uses up L2 cache as well. While the naive programmer may see great advantage in lots of loop unrolling or inlining, even timing the code may not be sufficient to prove that such code-size-increasing optimizations are worthwhile, since other parts of the program called afterwards could suffer due to evictions from the L1 instruction cache.
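One partial lever for keeping inlining from bloating a hot loop is to mark rarely executed code so the compiler keeps it out of the way. The sketch below assumes GCC or Clang: the `cold` attribute asks the compiler to optimize the function for size and, on many targets, to place it in a separate text subsection, `noinline` keeps its body out of callers, and `__builtin_expect` marks the branch as unlikely. The functions `handle_bad_sample` and `sum_clamped` and the 0..255 range are hypothetical, and whether this actually reduces L1I misses in a given program is something only measurement can tell you.

```c
#include <stddef.h>

/* Rare path. 'cold' tells GCC/Clang the function is unlikely to run;
 * 'noinline' prevents its body from being copied into the hot loop. */
__attribute__((cold, noinline))
static int handle_bad_sample(int sample)
{
    /* Hypothetical recovery policy: clamp to the valid range 0..255. */
    return sample < 0 ? 0 : 255;
}

/* Hot path: sums samples, clamping the occasional out-of-range value. */
int sum_clamped(const int *samples, size_t n)
{
    int total = 0;
    for (size_t i = 0; i < n; i++) {
        int s = samples[i];
        /* Hint that out-of-range samples are the uncommon case. */
        if (__builtin_expect(s < 0 || s > 255, 0)) {
            s = handle_bad_sample(s);
        }
        total += s;
    }
    return total;
}
```

The attributes and the branch hint do not change what the code computes; they only influence how much code ends up in the loop body and where the rare path is placed.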