About 6 months ago Loren Merritt and I, after relatively exhaustive testing on over a dozen types of processors, concluded that Intel chips in general have a problem in the case of a “cacheline split.”
To begin with, modern processors divide their cache into “lines”; for example, a Pentium 3 has a 32-byte cacheline size, while a Core 2 has a 64-byte cacheline. The way these work is that whenever the processor is told to fetch data from memory, it will fetch a full cacheline at a time. This also affects how the data is organized in the cache.
Now, when you’re loading aligned data, that is, data aligned to a 16-byte boundary, you’re guaranteed never to have that data cross a cacheline, since the data is loaded in 8- or 16-byte chunks. But a motion search inherently reads from arbitrary positions, so unaligned loads are completely unavoidable. In the case of a 16×16 motion search and a 64-byte cacheline, this means that roughly 1 in every 4 motion vectors checked will result in the load spanning a cacheline; that is, some of the data comes from one cacheline, and some from another.
Why is this bad? On AMD chips, it isn’t; the penalty is no different from any other unaligned load. But on Intel chips the penalty is enormous; in fact, the cost of a cacheline-split load is roughly equivalent to the L2 cache latency, suggesting that a cacheline split actually results in a load from L2 cache. This may not seem too bad, but it adds roughly 12 clocks per load on a Core 2… boosting the cost of a single 16×16 SAD from 48 clocks to over 230 clocks. Splits on pagelines are even worse; a single load can cost over 220 clock cycles, roughly equivalent to main memory latency! A 16×16 SAD has now gone from 48 clocks to over 3600 clocks. The penalty on AMD chips for a pageline split exists, but it’s extremely small.
Since cacheline splits happen only some of the time (and only 1 in 64 cachelines are pagelines), it isn’t as bad as it sounds, but the end result is that SADs take an average of about 100 clocks instead of 48. Fortunately, Loren Merritt’s massive cacheline split patch fixed this for SADs, resulting in a considerably faster motion search. Investigating his patch, you can see the four solutions he implemented for this problem:
1. The easiest, SSE3-based. The “lddqu” instruction tells the processor, instead of performing the unaligned load directly, to load two aligned pieces of data and shift them accordingly to give the same result. As a result, it has none of the cost of a cacheline split… but unfortunately it only works on Pentium 4Ds and Core 1s. On Core 2s, lddqu does the same thing as movdqu.
2. The second easiest, MMX-based. Shift amounts are calculated based on the position of the data to be loaded, the data is loaded in two chunks, and they are shifted and OR’d together to get the result. This is best expressed in an image:
As can be seen, this is quite a bit more complex than an ordinary load; it involves precalculating two numbers (the shift values), plus two shifts and an OR for every single load! But it’s still far faster than the 12-14 clock latency of a cacheline split. But wait, you thought this workaround was messy? Just wait until you see the rest…
3. The second hardest, using SSSE3’s palignr. palignr does what the two shifts and the OR did above, all in a single operation! This would be easier than the MMX version, but there’s one complication: the MMX shift described above takes a *variable* value, so we can calculate the shift amount at runtime and then apply it. But palignr takes a constant which must be written into the opcode at compile time! So for this, Loren Merritt made a set of 15 loops, one for each possible alignment, and had the program jump to the correct loop based on the current alignment. Since each loop had a constant size, he could literally calculate the position of the loop in the code and jump to it numerically instead of storing an array of pointers to the various loops. What a mess! But it works, and faster than the MMX version, since palignr is so much simpler.
4. The worst is SSE2; its version of MMX’s full-shift (pslldq/psrldq) has the same problem as palignr: it only takes a constant shift amount. So we have to do the same as palignr’s 15 loops, except with all the same hackiness as in the MMX version.
So there we have it: the four workarounds, all of which give a similar performance boost on Intel processors. Now you understand why coding assembly for Intel chips can be hell; one has to compensate for ridiculous performance penalties like these. I extended the cacheline patch to cover the qpel interpolation as well, which is used throughout the subpel motion search. This gave an overall speed boost of 1-2% or so (25-40% for the qpel interpolation alone). My patch only uses the MMX and SSE3 workarounds, since the others were unnecessarily complex for this smaller case. Loren Merritt’s sse2/ssse3 hpel_filter patch also used this concept, though in that case the misalignments were known precisely, so using palignr was drastically easier. My frame_init_lowres asm patch likewise provides an SSE2- and SSSE3-based solution for cacheline splits of known alignment.