Most of the resources out there about optimizing cache access talk about L2 cache misses. Which is sensible–the cost of an L2 miss is extraordinary, taking hundreds of cycles for an access to main memory. By comparison, an L1 miss, costing just a dozen cycles, is nothing. This is true on-chip as well; the memory->L2 prefetcher in modern processors is extremely sophisticated and is very good at avoiding cache misses. It is also very efficient, making reasonably good use of the limited memory bandwidth available. There are also dedicated prefetch instructions to hint the prefetcher to avoid future L2 misses.
But what about L1 misses? There’s vastly less literature on them and the L2->L1 prefetchers are often barely documented or not even mentioned in official processor literature. Explicit prefetch instructions are vastly less useful because the cost of the misses is low enough that the extra overhead of sending off a set of prefetches is often not worth it. And yet in many cases–such as in x264–much more time is wasted on L1 misses than L2 misses.
The AMD processor documentation says that the L2->L1 prefetcher is not strided, and tests on Intel chips suggest the same. This means that if we are performing, for example, an access of a block of image data that is in L2 but not L1 cache, every single line of data will cause an L1 cache miss. The benchmarks seem to agree; the first chroma motion compensation during qpel in x264 takes more than twice as long as the others!
In this sort of case, it’s impossible to keep all the relevant data in L1 cache; a 16×16 window of chroma data might be used during a motion search, resulting in a 64×16 section of data being stuffed into cache. But all the work done during the rest of analyzing that macroblock is vastly larger than L1 cache, so that chroma data is evicted long before the next macroblock is reached–upon which a whole new 64×16 section needs to be fetched.
Here’s some numbers which express how powerful this effect is. At very fast encoding modes, x264 rounds the predictors for the motion search to fullpel to avoid interpolation during predictor checks. However, halfpel pixels are cached, so it shouldn’t require significantly more work to round to halfpel instead of fullpel. But doing so increases the cycle count of predictor checking from 790 to 1231! This almost entirely comes from extra L1 cache misses, since the halfpel data is 4 times as large as the fullpel data and is stored in 4 separate planes. Of course, this problem extends beyond just predictor checking to motion search in general, especially subpel.
At this point, I’m stumped with regard to how to significantly improve this: without a strided prefetcher, it’s impossible to “jump-start” the prefetcher into fetching all the necessary rows with a couple prefetch instructions. Short of lobbying Intel to improve their chips, I don’t see much of an option–but if an improvement could be made, encoding times, especially with very fast settings, could improve dramatically. Of course, this would also improve all other applications for which their working set is significantly larger than the L1 cache–not just video compression.