Diary Of An x264 Developer

07/16/2009 (8:47 pm)

Cacheline splits, take two

It has been well over a year since the original cacheline-split patch and my subsequent cacheline-split patch for qpel interpolation.  I never implemented it for chroma, despite the potential benefit, because it required four extra registers, and registers were something chroma MC was seriously short of.  Furthermore, chroma was only width-8 and width-4, and the narrower the load, the smaller the percentage of loads that cross a cacheline, so the smaller the possible benefit relative to the overhead of cacheline-split detection.
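To put rough numbers on the width argument, here is a small sketch in C.  It assumes a 64-byte cacheline and load addresses spread evenly over every byte offset within a line; neither assumption comes from the patch itself, and the function name is made up for illustration.

```c
#include <assert.h>

/* Assumed cacheline size in bytes; common on x86, but an assumption here. */
enum { CACHELINE = 64 };

/* Hypothetical helper: count how many of the CACHELINE possible starting
 * offsets make a load of `width` bytes spill into the next cacheline. */
static int split_offsets(int width)
{
    assert(width >= 1 && width <= CACHELINE);
    int count = 0;
    for (int off = 0; off < CACHELINE; off++)
        if (off + width > CACHELINE)  /* last byte lands in the next line */
            count++;
    return count;
}
```

Under those assumptions a width-8 load splits at 7 of 64 offsets (about 11%) and a width-4 load at 3 of 64 (about 5%), against 15 of 64 (about 23%) for a width-16 load, so the fixed cost of detecting a split is spread over far fewer cases that actually benefit.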

The cacheline split implementations, as can be seen in the original post, vary greatly, but they all have one thing in common: they perform two aligned loads, one on either side of the split, and then use shifts (or palignr) to merge the data together accordingly.  However, there is another possible trick that can be used here.
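For reference, the common approach (two aligned loads merged with shifts, not the alternative trick) can be sketched in plain C on 8-byte words.  This is only an illustration of the idea, not the actual x264 assembly: the function name is hypothetical, it assumes a little-endian machine, and it assumes `base` is 8-byte aligned with the aligned words on both sides of the split readable.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical helper: fetch the 8 bytes starting at base[offset] using
 * only 8-byte-aligned loads.  Assumes little-endian byte order. */
static uint64_t load64_split(const uint8_t *base, size_t offset)
{
    assert(((uintptr_t)base & 7) == 0);          /* base must be aligned */

    size_t aligned = offset & ~(size_t)7;        /* round down to a word boundary */
    unsigned shift = (unsigned)(offset & 7) * 8; /* misalignment in bits */
    uint64_t lo, hi;

    memcpy(&lo, base + aligned, 8);              /* aligned load before the split */
    if (shift == 0)
        return lo;                               /* already aligned, nothing to merge */

    memcpy(&hi, base + aligned + 8, 8);          /* aligned load after the split */

    /* Drop the unwanted low bytes of lo and fill the top from hi. */
    return (lo >> shift) | (hi << (64 - shift));
}
```

The SIMD versions do the same thing with wider registers, where a single palignr can stand in for the shift-and-or pair.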
