Diary Of An x264 Developer

11/29/2008 (2:32 pm)

Nehalem optimizations: the powerful new Core i7

Filed under: assembly, avail, benchmark, Intel, x264

Here’s a piece I wrote for Avail Media to explain some of the Nehalem optimizations I made in the past month or two.

Pretty graph

Note: “X/Y” instruction timing means a latency of X clocks (after doing that instruction, one has to wait X clocks to get the results), and an inverse throughput of Y clocks (if one runs a ton of that instruction one after another, one can execute that instruction every Y clocks).
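To make the X/Y notation concrete, here is a minimal sketch in C of an idealized timing model. The function names `chain_cycles` and `independent_cycles` are made up for illustration, and the model assumes nothing else in the pipeline is a bottleneck, which is rarely true in real code.

```c
#include <assert.h>

/* Hypothetical helpers: clocks needed to run n copies of one instruction
 * whose timing is latency/inv_throughput (the "X/Y" notation).
 * Idealized model: every other pipeline limit is ignored. */

/* Each instruction consumes the previous result, so each one has to
 * wait out the full latency. */
double chain_cycles(int n, double latency)
{
    assert(n >= 0);
    return n * latency;
}

/* No instruction depends on another: a new one can start every
 * inv_throughput clocks, and the last result is ready one latency
 * after the last one starts. */
double independent_cycles(int n, double latency, double inv_throughput)
{
    assert(n >= 0);
    if (n == 0)
        return 0.0;
    return (n - 1) * inv_throughput + latency;
}
```

In this model, eight dependent copies of a 3/1 instruction take 24 clocks, while eight independent copies take 10. A 4/4 instruction takes 32 clocks either way.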

The Nehalem CPU has a number of benefits over the previous Intel generation, the Penryn processor.

First of all, the Nehalem has a much faster SSE unit than the Penryn. A huge number of SSE operations have had their throughput doubled:

All shuffle instructions
All basic math instructions (add, subtract, bitmath)
Many more complex math instructions (sign, absolute value, average, compare)
All unpack/pack instructions

These changes are hard to take advantage of deliberately: they automatically sped up a large number of functions, especially the Hadamard transform (which is essentially just a massive series of adds, subtracts, and unpacks), but there’s no obvious way (as far as I’ve found so far) to leverage them for an even bigger increase.
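To show why the Hadamard transform benefits so directly from faster adds and subtracts, here is a scalar sketch of a 1-D fast Walsh-Hadamard transform in C. This is a plain reference version, not the x264 assembly; `hadamard_1d` is a name chosen for this example.

```c
#include <assert.h>
#include <stddef.h>

/* In-place fast Walsh-Hadamard transform, natural (Hadamard) ordering.
 * n must be a power of two. Each butterfly is one add and one subtract;
 * there are no multiplies anywhere. */
void hadamard_1d(int *v, size_t n)
{
    assert(n != 0 && (n & (n - 1)) == 0);
    for (size_t half = 1; half < n; half *= 2) {
        for (size_t base = 0; base < n; base += 2 * half) {
            for (size_t i = base; i < base + half; i++) {
                int a = v[i];
                int b = v[i + half];
                v[i]        = a + b;
                v[i + half] = a - b;
            }
        }
    }
}
```

A SIMD version does the same butterflies on many lanes at once, with unpacks (or a transpose) used to line up the data between the horizontal and vertical passes.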

The cacheline split problem is basically gone: the penalty for a cacheline-split load is now a mere 2 clocks instead of 12. This, combined with the SSE speed improvements, made it worthwhile to write SSE2 versions of the width-8 SAD functions, even though they require more instructions than the MMX versions. It also meant that the cacheline-split workaround functions throughout x264 were no longer useful on Nehalem and had to be disabled. The benefit goes beyond faster SAD: every function that makes heavy use of unaligned loads got faster, even those with cacheline optimizations, but especially those without. The biggest examples, as per the graph earlier, are bipred and pixel_avg (qpel). To give an idea of the magnitude of this improvement, luma motion compensation for a 16×16 block took 150 cycles on Penryn without cacheline split optimization, 111 cycles with, and takes 62 cycles on the Nehalem.
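For readers unfamiliar with the term, a load is "cacheline-split" when its bytes straddle two cache lines. The sketch below, in C, tests for that and counts how many rows of a block would split. The 64-byte line size matches the Intel CPUs discussed here; the function names are hypothetical.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Assumed cache line size: 64 bytes. */
#define CACHELINE_BYTES 64

/* Nonzero if a load of `size` bytes starting at `addr` touches two lines. */
int is_cacheline_split(uintptr_t addr, size_t size)
{
    assert(size > 0 && size <= CACHELINE_BYTES);
    return (addr % CACHELINE_BYTES) + size > CACHELINE_BYTES;
}

/* Count split loads when reading `rows` rows of `size` bytes each,
 * starting at `addr` and advancing `stride` bytes per row. */
int count_split_loads(uintptr_t addr, size_t stride, int rows, size_t size)
{
    int splits = 0;
    for (int y = 0; y < rows; y++)
        splits += is_cacheline_split(addr + (uintptr_t)y * stride, size);
    return splits;
}
```

With a stride that is a multiple of 64, a motion vector that lands a 16-byte load at offset 49 or later within a line splits on every single row, which is why the old 12-clock penalty hurt so much.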

Intel has finally come through on their promise to make float-ops-on-SSE-registers-containing-integers have a speed penalty. So, we removed a few %defines throughout the code that converted integer ops into equivalent, but shorter, floating point instructions. Unfortunately, there seems to be no way to completely avoid float ops on integer data, as many of these operations have no integer equivalents. A classic example is “movhps”, which takes an 8-byte value from memory and puts it into the high section of a 16-byte SSE register. In integer ops, one can only directly move into the low 8-byte section of the register. Emulating these float ops with complex series of integer ops is far too slow to be worthwhile, so unfortunately we cannot fully abide by Intel’s advisories.
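As an illustration of what movhps provides, here is a scalar C model of a 16-byte register with the two relevant loads. The `xmm_model` type and `model_*` names are invented for this sketch; it models the data movement only, not timing.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Scalar stand-in for a 128-bit SSE register: 16 bytes. */
typedef struct { uint8_t b[16]; } xmm_model;

/* Model of "movq xmm, m64" (integer op): load bytes 0..7 and zero
 * bytes 8..15. */
void model_movq(xmm_model *reg, const void *mem)
{
    assert(reg && mem);
    memcpy(reg->b, mem, 8);
    memset(reg->b + 8, 0, 8);
}

/* Model of "movhps xmm, m64" (float op): overwrite bytes 8..15 and
 * leave bytes 0..7 untouched. */
void model_movhps(xmm_model *reg, const void *mem)
{
    assert(reg && mem);
    memcpy(reg->b + 8, mem, 8);
}
```

A movq followed by a movhps packs two separate 8-byte rows into one register with two instructions, which is the kind of use that has no equally cheap integer-only replacement.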

For some bizarre reason, phminposuw, despite being only a few clocks faster (3/1 instead of 4/4) than on Penryn, is much faster overall when timed in actual code and may be actually useful in motion search as Intel originally intended. It will require inlining it directly into the code, among other things, but it may be worth it nonetheless. However, there are still some underlying bugs there that we’ll have to work out. By the way, to explain what the instruction does: it takes 8 16-bit unsigned input values and returns two 16-bit values. The first value is the minimum of the 8 values, and the second is the index of the minimum. This may be very useful in code in general, given what it does: find the minimum of 8 values, and where it is, in just three clocks.
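A scalar C reference for what phminposuw computes might look like the following. The struct and function names are made up for this sketch; the tie-breaking rule (lowest index wins) follows Intel's documented behaviour of the instruction.

```c
#include <assert.h>
#include <stdint.h>

typedef struct {
    uint16_t min;    /* smallest of the eight inputs */
    uint16_t index;  /* position of that input, 0..7 */
} minpos_result;

/* Scalar model of phminposuw over eight unsigned 16-bit values. */
minpos_result model_phminposuw(const uint16_t v[8])
{
    minpos_result r = { v[0], 0 };
    assert(v != 0);
    for (uint16_t i = 1; i < 8; i++) {
        if (v[i] < r.min) {   /* strict '<' keeps the lowest index on ties */
            r.min = v[i];
            r.index = i;
        }
    }
    return r;
}
```

In a motion search, the eight inputs would be the costs of eight candidate positions, and the returned index says which candidate won.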

The Nehalem still has the same size code cache as previous Intel CPUs despite its much higher execution speed, making it that much more susceptible to unnecessarily large code. One patch soon to be committed removes all code duplication in the SATD functions, making a core “satd_8x8” function that is called repeatedly to do the 8×16, 16×8, and 16×16 Hadamard transforms. DCT, iDCT, and SA8D are already done this way. This change saves 20 kilobytes of code, an enormous amount when dealing with core functions that are called dozens of times per macroblock and need to always be in cache. Demonstrating how bottlenecked the Nehalem is by its small code cache, this change gives a speedup of about 0.6% on Nehalem (only 0.3% on Penryn), despite not having one iota of impact on the number of instructions that the processor needs to execute, and despite the fact that SATD of 8×8 or larger is “only” 10-15% of runtime. This patch will come along with a few other refactoring changes that should speed up SA8D considerably.
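The shape of that refactoring can be sketched in scalar C as below. This is not the actual patch: it assumes the 8×8 core is itself a sum of 4×4 Hadamard blocks, it skips any normalization of the result, and the signatures are invented for the example.

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>

/* 4x4 SATD: Hadamard-transform the difference of two blocks and sum the
 * absolute coefficients. No normalization is applied in this sketch. */
static int satd_4x4(const uint8_t *a, int sa, const uint8_t *b, int sb)
{
    int t[4][4];
    int sum = 0;

    for (int y = 0; y < 4; y++) {          /* horizontal butterflies */
        int d0 = a[y * sa + 0] - b[y * sb + 0];
        int d1 = a[y * sa + 1] - b[y * sb + 1];
        int d2 = a[y * sa + 2] - b[y * sb + 2];
        int d3 = a[y * sa + 3] - b[y * sb + 3];
        int s01 = d0 + d1, m01 = d0 - d1;
        int s23 = d2 + d3, m23 = d2 - d3;
        t[y][0] = s01 + s23;
        t[y][1] = s01 - s23;
        t[y][2] = m01 + m23;
        t[y][3] = m01 - m23;
    }
    for (int x = 0; x < 4; x++) {          /* vertical butterflies */
        int s01 = t[0][x] + t[1][x], m01 = t[0][x] - t[1][x];
        int s23 = t[2][x] + t[3][x], m23 = t[2][x] - t[3][x];
        sum += abs(s01 + s23) + abs(s01 - s23)
             + abs(m01 + m23) + abs(m01 - m23);
    }
    return sum;
}

/* The one shared core: all larger sizes below reuse this code. */
int satd_8x8(const uint8_t *a, int sa, const uint8_t *b, int sb)
{
    assert(a && b);
    return satd_4x4(a,              sa, b,              sb)
         + satd_4x4(a + 4,          sa, b + 4,          sb)
         + satd_4x4(a + 4 * sa,     sa, b + 4 * sb,     sb)
         + satd_4x4(a + 4 * sa + 4, sa, b + 4 * sb + 4, sb);
}

int satd_8x16(const uint8_t *a, int sa, const uint8_t *b, int sb)
{
    return satd_8x8(a, sa, b, sb)
         + satd_8x8(a + 8 * sa, sa, b + 8 * sb, sb);
}

int satd_16x8(const uint8_t *a, int sa, const uint8_t *b, int sb)
{
    return satd_8x8(a, sa, b, sb)
         + satd_8x8(a + 8, sa, b + 8, sb);
}

int satd_16x16(const uint8_t *a, int sa, const uint8_t *b, int sb)
{
    return satd_16x8(a, sa, b, sb)
         + satd_16x8(a + 8 * sa, sa, b + 8 * sb, sb);
}
```

The larger functions add only a few calls each, so the heavy transform code exists once in the binary instead of once per block size.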

One interesting feature of the Nehalem is that it appears to not have the “code alignment” problem that the Core 2 had. For a reason we have yet to figure out (though we have much speculation), the Core 2 has this odd habit of execution times changing dramatically solely due to alignment of the code itself. That is, one could literally speed up a small segment of code just by inserting random numbers of nops before it until it got faster. This wasn’t useful for optimization, as it was not only random but misaligning one set of code would hurt another set of code just as much. We suspected it was due to some weirdness involving the cache or TLB, but regardless of what it was, the Nehalem seems to be much more consistent in this regard, making measurements of optimizations’ effectiveness much easier.

While horizontal arithmetic is faster on the Nehalem than on Core 2 (2.8/1.5 instead of 3/2), it’s still not as fast as one would like it to be, especially since transpose (a series of unpacks) and ordinary add/subtract are both so much faster than before. As a result, the horizontal-arithmetic version (created to avoid having to do a transpose) of SA8D is still not worth using. The horizontal-arithmetic version of SATD, as with Penryn, is fast enough to be worth using, providing a ~10% performance boost over the regular version.
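For reference, "horizontal arithmetic" means instructions like phaddw, which add neighbouring lanes within a register instead of matching lanes across two registers. A scalar C model of phaddw on 16-bit lanes, with a name invented for this sketch:

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Model of "phaddw dst, src" on eight 16-bit lanes: the low four results
 * are adjacent-pair sums from dst, the high four are adjacent-pair sums
 * from src. Sums wrap modulo 2^16, as in the instruction. */
void model_phaddw(uint16_t dst[8], const uint16_t src[8])
{
    uint16_t out[8];
    assert(dst && src);
    for (int i = 0; i < 4; i++) {
        out[i]     = (uint16_t)(dst[2 * i] + dst[2 * i + 1]);
        out[i + 4] = (uint16_t)(src[2 * i] + src[2 * i + 1]);
    }
    memcpy(dst, out, sizeof(out));
}
```

Because the pairs being combined already sit next to each other, a butterfly stage can be done along a row without transposing first, which is the trade-off being weighed above.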

PMOV* instructions (mov with sign/zero extension), despite supposedly being single-clock instructions, are still not useful on the Nehalem, with mov+unpack being inexplicably faster in all situations. I have queried Intel as to why, but have gotten no response since the initial discussion (where the engineer said he’d look into it).
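To show that the two approaches compute the same thing in the zero-extension case, here is a scalar C model of pmovzxbw next to mov plus punpcklbw against a zeroed register. The function names are hypothetical and the sketch says nothing about speed.

```c
#include <assert.h>
#include <stdint.h>

/* Model of pmovzxbw: zero-extend the low 8 bytes to eight 16-bit lanes. */
void model_pmovzxbw(uint16_t out[8], const uint8_t in[16])
{
    for (int i = 0; i < 8; i++)
        out[i] = in[i];
}

/* Model of "punpcklbw a, b": interleave the low 8 bytes of a and b,
 * giving a0 b0 a1 b1 ... a7 b7. */
void model_punpcklbw(uint8_t out[16], const uint8_t a[16], const uint8_t b[16])
{
    for (int i = 0; i < 8; i++) {
        out[2 * i]     = a[i];
        out[2 * i + 1] = b[i];
    }
}

/* mov + unpack against zero: each (byte, 0) pair, read as a
 * little-endian 16-bit lane, is the zero-extended byte. */
void zero_extend_via_unpack(uint16_t out[8], const uint8_t in[16])
{
    static const uint8_t zero[16] = {0};
    uint8_t bytes[16];
    model_punpcklbw(bytes, in, zero);
    for (int i = 0; i < 8; i++)
        out[i] = (uint16_t)(bytes[2 * i] | (bytes[2 * i + 1] << 8));
}
```

Both paths produce identical lanes; the unpack route costs an extra zeroed register.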

Overall, the changes in Nehalem are extremely beneficial to x264 and have led to an enormous overall performance increase. Furthermore, since the primary speed increase is in SIMD, the more assembly code we write, the more of a boost Nehalem gets over previous processors.

7 Responses to “Nehalem optimizations: the powerful new Core i7”

  1. - Says:

    Well, I’ll bite and ask the obvious question: what about the stuff that got slower? CABAC in particular looks worrying. A 10% slowdown is nothing to sneeze at…

  2. Dark Shikari Says:

    Almost all the stuff that got marginally slower was pmaddwd-limited, and pmaddwd didn’t get any faster, so I would be unsurprised if some minor thing made it a bit slower.

    CABAC being slower did strike me as odd, and while CABAC is extremely sensitive to everything from branch prediction to instruction reordering, I’d be curious exactly why it is slower (I was unable to come up with any explanation).

  3. Esurnir Says:

    If you could break down the time in clock cycles for a GOP in a 1080p max-settings encode, what would it look like? (e.g. 20% spent on CABAC, 10% spent on frame decision, 20% on motion search)

  4. Biggiesized Says:

    Hey, Dark Shikari, all this talk about massive performance increases in x264 with Nehalem CPUs has left me wondering what has happened to Avail’s x264 FPGA hardware encoder. IIRC, it was supposed to bench about twice as fast as an 8-core 3 GHz machine.

    Is it due for release anytime soon (at an attractive price? $200-400 as I remember.)

    Have you taken a look at the Canopus FIRECODER Blu add-in card? It sounded promising but seems flawed in design and implementation. Hopefully Avail can top it.

  5. Dark Shikari Says:

    @Biggiesized: there never was such a thing. Avail dropped plans to do any encoding on FPGA a long time ago, and there were never plans to offload more than a basic motion search.

    @Esurnir:

    Max settings are rather bad for that because they can skew things a lot towards a single thing that uses up tons of time on max settings (--me tesa --merange 32 --ref 16) but normally would be far less.

    Here are the results for subme9/trellis1/meumh/mixedrefs/ref4/8x8dct/weightb/badapt2/bframes 3, a reasonable combination for “high settings”:

    Fullpel motion search: ~30%
    SATD and pixel_avg: ~20% (includes subpel and intra analysis)
    Trellis: ~4%
    Bitstream-writing CABAC: ~1%
    RDO: Basically everything else

  6. Gordon Page Says:

    Thanks for the detailed post about the performance increases. Do you happen to know if the Lynnfield processors (e.g. a Xeon X3460), and future Intel procs, benefit in the same way that the Nehalems have?

    Also, any thoughts on the new 6-core AMDs, performance-per-dollar-wise, vs. these Intel chips?

    I’m trying to find an affordable yet extremely fast x264 encoding solution.

    Regards,
    Gordon

  7. Dark Shikari Says:

    @Gordon

    Sorry for the late response, haven’t been checking my comments lately.

    All Nehalem-alikes seem to have similar characteristics, performance-wise, to the Nehalems.

    I don’t know anything about the new AMD chips yet.
