Here’s a piece I wrote for Avail Media to explain some of the Nehalem optimizations I made in the past month or two.
Note: “X/Y” instruction timing means a latency of X clocks (after executing that instruction, one must wait X clocks before the result is available), and an inverse throughput of Y clocks (if one runs a long series of that instruction back to back, one can start a new one every Y clocks).
The Nehalem CPU has a number of benefits over the previous Intel generation, the Penryn processor.
First of all, the Nehalem has a much faster SSE unit than the Penryn. A huge number of SSE operations have had their throughput doubled:
All shuffle instructions
All basic math instructions (add, subtract, bitmath)
Many more complex math instructions (sign, absolute value, average, compare)
All unpack/pack instructions
These changes are hard to take explicit advantage of: they naturally sped up a large number of functions, especially the Hadamard transform (which by definition is just a massive series of adds, subtracts, and unpacks). That is, lots of stuff got faster for free, but there’s no obvious way (as far as I’ve found so far) to leverage them for an even larger gain.
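To show concretely what kind of operations doubled in throughput, here is a minimal sketch of one butterfly stage of a Hadamard transform in SSE2 intrinsics. The helper names are hypothetical, not x264’s actual code; a real 4×4 or 8×8 Hadamard is log2(n) of these stages interleaved with unpacks — exactly the paddw/psubw/punpck* mix Nehalem sped up.

```c
#include <emmintrin.h>
#include <stdint.h>

/* One butterfly stage of a Hadamard transform: for each 16-bit lane,
 * produce (a+b, a-b). Hypothetical helper, not x264's actual code. */
static inline void hadamard_butterfly(__m128i *a, __m128i *b)
{
    __m128i sum  = _mm_add_epi16(*a, *b);   /* paddw */
    __m128i diff = _mm_sub_epi16(*a, *b);   /* psubw */
    *a = sum;
    *b = diff;
}

/* Scalar wrapper so a single lane pair can be checked easily. */
static inline void hadamard_butterfly_scalar(int16_t *x, int16_t *y)
{
    __m128i a = _mm_set1_epi16(*x), b = _mm_set1_epi16(*y);
    hadamard_butterfly(&a, &b);
    *x = (int16_t)_mm_extract_epi16(a, 0);
    *y = (int16_t)_mm_extract_epi16(b, 0);
}
```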
The cacheline split problem is basically gone: the penalty for a cacheline-split load is now a mere 2 clocks instead of 12. Combined with the SSE speed improvements, this made it worthwhile to write SSE2 versions of the width-8 SAD functions, despite the fact that they require more instructions than the MMX versions. It also meant that the cacheline-split workaround functions throughout x264 were no longer useful and had to be disabled. The benefit isn’t just faster SAD: every function that makes heavy use of unaligned loads got faster, even those with cacheline workarounds, but especially those without. The biggest examples, as per the graph earlier, are bipred and pixel_avg (qpel). To give an idea of the magnitude of this improvement, luma motion compensation for a 16×16 block took 150 cycles on Penryn without the cacheline-split workaround and 111 cycles with it; on Nehalem it takes 62 cycles.
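To illustrate why unaligned-load-heavy functions benefit so much, here is a simplified 16×16 SAD written in SSE2 intrinsics — a sketch, not x264’s hand-written assembly. The reference block in motion search is almost never 16-byte aligned, so every row issues a movdqu; on Penryn each movdqu that straddled a cacheline cost ~12 extra clocks, while on Nehalem it costs ~2.

```c
#include <emmintrin.h>
#include <stdint.h>

/* Simplified 16x16 sum of absolute differences (a sketch, not x264's
 * actual asm). pix2 comes from motion search and is rarely aligned,
 * so each row is an unaligned load -- the case Nehalem made cheap. */
static int sad_16x16(const uint8_t *pix1, int stride1,
                     const uint8_t *pix2, int stride2)
{
    __m128i acc = _mm_setzero_si128();
    for (int y = 0; y < 16; y++) {
        __m128i a = _mm_loadu_si128((const __m128i *)(pix1 + y * stride1));
        __m128i b = _mm_loadu_si128((const __m128i *)(pix2 + y * stride2));
        acc = _mm_add_epi64(acc, _mm_sad_epu8(a, b));  /* psadbw */
    }
    /* psadbw leaves two 64-bit partial sums; fold the high one in. */
    acc = _mm_add_epi64(acc, _mm_srli_si128(acc, 8));
    return _mm_cvtsi128_si32(acc);
}
```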
Intel has finally come through on their promise to penalize float ops performed on SSE registers containing integer data. Accordingly, we removed a few %defines throughout the code that converted integer ops into equivalent, but shorter, floating-point instructions. Unfortunately, there seems to be no way to completely avoid float ops on integer data, as many of these operations have no integer equivalents. A classic example is “movhps”, which takes an 8-byte value from memory and puts it into the high half of a 16-byte SSE register. With integer ops, one can only move directly into the low 8-byte half of the register. Emulating these float ops with complex series of integer ops is far too slow to be worthwhile, so unfortunately we cannot fully abide by Intel’s advisories.
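The movhps case can be shown in intrinsics. Below, a minimal sketch (hypothetical helper names) compares the one-instruction float op with its integer emulation, which needs an extra scratch register and an extra instruction; both produce the same register contents.

```c
#include <emmintrin.h>
#include <stdint.h>

/* movhps: load 8 bytes from memory into the HIGH half of an XMM
 * register, leaving the low half untouched. A float instruction,
 * here operating on integer data. */
static __m128i load_high_float(__m128i low, const uint64_t *p)
{
    return _mm_castps_si128(
        _mm_loadh_pi(_mm_castsi128_ps(low), (const __m64 *)p));
}

/* Integer emulation: movq into a scratch register, then merge the
 * two low quadwords with punpcklqdq. One more uop and register. */
static __m128i load_high_int(__m128i low, const uint64_t *p)
{
    __m128i hi = _mm_loadl_epi64((const __m128i *)p);
    return _mm_unpacklo_epi64(low, hi);
}
```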
For some bizarre reason, phminposuw, despite being only a few clocks faster than on Penryn (3/1 instead of 4/4), is much faster overall when timed in actual code, and may actually be useful in motion search as Intel originally intended. Using it will require inlining it directly into the code, among other things, but it may be worth it nonetheless. However, there are still some underlying bugs there that we’ll have to work out. To explain what the instruction does: it takes 8 unsigned 16-bit input values and returns two 16-bit values: the minimum of the 8 inputs, and the index of that minimum. This could be broadly useful, as one can see from the description: it finds the minimum of 8 values in just three clocks.
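A minimal sketch of what phminposuw computes, via its SSE4.1 intrinsic (the wrapper name is hypothetical). The gcc/clang `target` attribute is an assumption on my part so the file compiles without an -msse4.1 flag; it is not how x264 does dispatch.

```c
#include <smmintrin.h>
#include <stdint.h>

/* phminposuw: given 8 unsigned 16-bit values, return the minimum in
 * lane 0 and its (lowest) index in lane 1; the other lanes are zeroed.
 * In motion search this picks the best of 8 SAD costs in one shot. */
__attribute__((target("sse4.1")))
static void min_of_8(const uint16_t v[8], uint16_t *min, uint16_t *idx)
{
    __m128i x = _mm_loadu_si128((const __m128i *)v);
    __m128i r = _mm_minpos_epu16(x);   /* phminposuw */
    *min = (uint16_t)_mm_extract_epi16(r, 0);
    *idx = (uint16_t)_mm_extract_epi16(r, 1);
}
```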
The Nehalem still has the same size code cache as previous Intel CPUs despite its much higher execution speed, making it that much more susceptible to unnecessarily large code. One patch soon to be committed removes all code duplication in the SATD functions, creating a core “satd_8x8” function that is called repeatedly to build the 8×16, 16×8, and 16×16 Hadamard transforms. DCT, iDCT, and SA8D are already done this way. This change saves 20 kilobytes of code: an enormous amount when dealing with core functions that are called dozens of times per macroblock and need to always be in cache. Demonstrating how bottlenecked the Nehalem is by its small code cache, this change gives a speedup of about 0.6% on Nehalem (only 0.3% on Penryn), despite having no impact whatsoever on the number of instructions the processor executes, and despite SATD of 8×8 or larger being “only” 10–15% of runtime. This patch will come along with a few other refactoring changes that should speed up SA8D considerably.
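The deduplication pattern itself is simple and can be sketched as follows. The names are hypothetical and the scalar “core” below just sums absolute differences to keep the sketch short; the real satd_8x8 does an 8×8 Hadamard transform of the difference block. The point is the shape: one shared core, so only one function’s worth of code competes for the code cache.

```c
#include <stdint.h>
#include <stdlib.h>

/* Stand-in for the real core (which does an 8x8 Hadamard transform of
 * the difference block); a plain SAD keeps the sketch short. */
static int satd_8x8(const uint8_t *p1, int s1, const uint8_t *p2, int s2)
{
    int sum = 0;
    for (int y = 0; y < 8; y++)
        for (int x = 0; x < 8; x++)
            sum += abs(p1[y * s1 + x] - p2[y * s2 + x]);
    return sum;
}

/* Larger sizes are built by calling the one core on each 8x8 quadrant,
 * instead of duplicating the whole transform at every block size. */
static int satd_16x16(const uint8_t *p1, int s1, const uint8_t *p2, int s2)
{
    return satd_8x8(p1,              s1, p2,              s2)
         + satd_8x8(p1 + 8,          s1, p2 + 8,          s2)
         + satd_8x8(p1 + 8 * s1,     s1, p2 + 8 * s2,     s2)
         + satd_8x8(p1 + 8 * s1 + 8, s1, p2 + 8 * s2 + 8, s2);
}
```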
One interesting feature of the Nehalem is that it appears not to have the “code alignment” problem that the Core 2 had. For a reason we have yet to figure out (though we have much speculation), the Core 2 has an odd habit of execution times changing dramatically based solely on the alignment of the code itself. That is, one could literally speed up a small segment of code just by inserting random numbers of nops before it until it got faster. This wasn’t useful for optimization: not only was the effect essentially random, but misaligning one section of code to speed it up would hurt another section just as much. We suspected some weirdness involving the cache or TLB, but regardless of the cause, the Nehalem seems to be much more consistent in this regard, making measurements of optimizations’ effectiveness much easier.
While horizontal arithmetic is faster on the Nehalem than on the Core 2 (2.8/1.5 instead of 3/2), it’s still not as fast as one would like, especially since transposes (a series of unpacks) and ordinary adds/subtracts are both so much faster than before. As a result, the horizontal-arithmetic version of SA8D (created to avoid having to do a transpose) is still not worth using. The horizontal-arithmetic version of SATD, as on Penryn, is fast enough to be worth using, providing a ~10% performance boost over the regular version.
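“Horizontal arithmetic” here means instructions like phaddw, which add adjacent lane pairs within a register instead of across two registers. A minimal sketch (hypothetical helper, SSSE3 via a gcc/clang `target` attribute so no extra compiler flag is assumed): three phaddw reduce 8 words to one sum, which is what lets SATD skip the transpose.

```c
#include <tmmintrin.h>
#include <stdint.h>

/* Horizontal sum of 8 signed 16-bit values via phaddw. Each phaddw
 * halves the number of partial sums; three of them leave the total in
 * lane 0. phaddw is slower per-op than paddw, so this only wins when
 * it saves enough shuffles elsewhere. */
__attribute__((target("ssse3")))
static int hsum_8x16(const int16_t v[8])
{
    __m128i x = _mm_loadu_si128((const __m128i *)v);
    x = _mm_hadd_epi16(x, x);   /* 4 pairwise sums */
    x = _mm_hadd_epi16(x, x);   /* 2 */
    x = _mm_hadd_epi16(x, x);   /* 1 */
    return (int16_t)_mm_extract_epi16(x, 0);
}
```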
PMOV* instructions (mov with sign/zero extension), despite supposedly being single-clock instructions, are still not useful on the Nehalem, with mov+unpack being inexplicably faster in all situations. I have queried Intel as to why, but have gotten no response since the initial discussion (in which the engineer said he’d look into it).
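The two alternatives look like this in intrinsics — a sketch with hypothetical names, widening 8 unsigned bytes to 8 words. Both produce identical results; the mov+unpack version is the one that measures faster on Nehalem.

```c
#include <smmintrin.h>
#include <stdint.h>

/* SSE4.1 pmovzxbw: zero-extend 8 bytes to 8 words in one instruction.
 * (target attribute so the file compiles without -msse4.1.) */
__attribute__((target("sse4.1")))
static __m128i widen_pmovzx(const uint8_t *p)
{
    return _mm_cvtepu8_epi16(_mm_loadl_epi64((const __m128i *)p));
}

/* Classic SSE2 equivalent: movq, then punpcklbw against zero. */
static __m128i widen_unpack(const uint8_t *p)
{
    __m128i x = _mm_loadl_epi64((const __m128i *)p);     /* movq */
    return _mm_unpacklo_epi8(x, _mm_setzero_si128());    /* punpcklbw */
}
```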
Overall, the changes in Nehalem are extremely beneficial to x264 and have led to an enormous overall performance increase. Furthermore, since the primary speed increase is in SIMD, the more assembly code we write, the more of a boost Nehalem gets over previous processors.