Diary Of An x264 Developer

09/02/2009 (10:58 pm)

The hidden cycle-eating demon: L1 cache misses

Filed under: Intel,speed,x264 ::

Most of the resources out there about optimizing cache access talk about L2 cache misses.  Which is sensible–the cost of an L2 miss is extraordinary, taking hundreds of cycles for an access to main memory.  By comparison, an L1 miss, costing just a dozen cycles, is nothing.  This is true on-chip as well; the memory->L2 prefetcher in modern processors is extremely sophisticated and is very good at avoiding cache misses.  It is also very efficient, making reasonably good use of the limited memory bandwidth available.  There are also dedicated prefetch instructions to hint the prefetcher to avoid future L2 misses.

But what about L1 misses?  There’s vastly less literature on them and the L2->L1 prefetchers are often barely documented or not even mentioned in official processor literature.  Explicit prefetch instructions are vastly less useful because the cost of the misses is low enough that the extra overhead of sending off a set of prefetches is often not worth it.  And yet in many cases–such as in x264–much more time is wasted on L1 misses than L2 misses.

Read More…

11/29/2008 (2:32 pm)

Nehalem optimizations: the powerful new Core i7

Filed under: assembly,avail,benchmark,Intel,x264 ::

Here’s a piece I wrote for Avail Media to explain some of the Nehalem optimizations I made in the past month or two.

Pretty graph

Note: “X/Y” instruction timing means a latency of X clocks (after doing that instruction, one has to wait X clocks to get the results), and an inverse throughput of Y clocks (if one runs a ton of that instruction one after another, one can execute that instruction every Y clocks).

The Nehalem CPU has a number of benefits over the previous Intel generation, the Penryn processor.

First of all, the Nehalem has a much faster SSE unit than the Penryn. A huge number of SSE operations have had their throughput doubled:

Read More…

05/02/2008 (1:39 am)

Cacheline splits, aka Intel hell

Filed under: assembly,cacheline,Intel,x264 ::

About 6 months ago Loren Merritt and I, after relatively exhaustive testing on over a dozen types of processors, concluded that Intel chips in general have a problem in the case of a “cacheline split.”

To begin with, modern processors divide their cache into “lines”; for example, a Pentium 3 has a 32-byte cacheline size, while a Core 2 has a 64-byte cacheline. The way these work is that whenever the processor is told to fetch data from memory, it will fetch a full cacheline at a time. This also affects how the data is organized in the cache.

Now, when you’re loading aligned data, that is, data aligned to a 16-byte line, you’re guaranteed to never have that data cross a cache line, since the data is loaded in 8 or 16 byte chunks. But if you’re loading unaligned data, which is completely unavoidable in the case of the motion search, you have to do unaligned loads. In the case of a 16×16 motion search and a 64-byte cacheline, this means that 1 in every 4 motion vectors checked will result in the load spanning a cacheline–that is, some of the data comes from one cacheline, and some from another.

Why is this bad? On AMD chips, it isn’t; the penalty is no different from any other unaligned load. But on Intel chips the penalty is enormous; in fact, the cost of a cacheline-split load is roughly equivalent to the L2 cache latency, suggesting that a cacheline split actually results in a load from L2 cache. This may not seem too bad, but this adds roughly 12 clocks per load on a Core 2… boosting the cost of a single 16×16 SAD from 48 clocks to over 230 clocks. Splits on page lines are even worse; a single load can cost over 220 clock cycles, roughly equivalent to the main memory latency! A 16×16 SAD has now gone from 48 clocks to over 3600 clocks. The penalty on AMD chips for a pageline split exists, but its extremely small.

Since cacheline splits happen only some of the time (and 1 in 64 cachelines are pagelines) it isn’t as bad as it sounds, but the end result is that SADs take about an average of 100 clocks instead of 48. Fortunately, Loren Merritt’s massive cacheline split patch fixed this for SADs, resulting in a considerably faster motion search. Investigating his patch, you can see four solutions that he implemented for this problem:

1. The easiest, SSE3-based. The “lddqu” operation tells the processor to, instead of loading the unaligned data, to load two aligned pieces of data and shift them accordingly to give the same result. As a result, it has none of the cost of a cacheline split… but unfortunately it only works on Pentium 4Ds and Core 1s. On Core 2s, lddqu does the same thing as movdqu.

2. The second easiest, MMX-based. Shift amounts are calculated based on the position of the data to be loaded, the data is loaded in two chunks, and they are shifted and OR’d together to get the result. This is best expressed in an image:

As can be seen, this is quite a bit more complex than an ordinary load; it involves precalculation of two numbers (the shift values) and two shifts plus one OR for every single load! But its still far faster than the 12-14 clock latency for cacheline splits. But wait, you thought this workaround was messy? Just wait until you see the rest…

3. The second hardest, using SSSE3′s palignr. palignr does what the two shifts and the OR did above, all in a single operation! This would be easier than the MMX version, but there’s one complication; the MMX shift described above takes a *variable* value, so we can calculate on runtime the shift amount to use, and then do it. But palignr takes a constant which must be written into the opcode on compile-time! So for this, Loren Merritt made a set of 15 loops, one for each possible alignment, and had the program jump to the correct loop based on the current alignment. Since each loop had a constant size, he could literally calculate the position of the loop in the code and jump to its position numerically instead of storing an array of pointers to the various loops. What a mess! But it works, and faster than the MMX version, since palignr is so much simpler.

4. The worst is SSE; SSE’s version of MMX’s full-shift has the same problem as palignr; it only takes a constant shift amount. So we have to do the same as palignr’s 15-loops, except with all the same hackiness as in the MMX version.

So there we have it; the 4 workarounds, all of which give a similar performance boost on Intel processors. Now you understand why coding assembly for Intel chips can be hell; one has to compensate for such ridiculous performance penalties like these. I extended the cacheline patch to work for the qpel interpolation also, which is used in all of the subpel motion search. This gave an overall speed boost of 1-2% or so (25-40% for qpel interpolation only). My patch only uses the MMX and SSE3 workarounds, since the others were unnecessarily complex for this smaller case. Loren Merritt’s sse2/ssse3 hpel_filter patch also used this concept, though in this case the misalignments were precisely known, so using palignr was drastically easier. My frame_init_lowres asm patch also provides an SSE2 and SSSE3-based solution for cacheline splits of known alignment.

05/01/2008 (5:08 pm)

Why, Intel, why?

Filed under: assembly,Intel,stupidity,x264 ::

This diff is highly related to this post.

If one looks in Intel’s documentation of their assembly, one notices a few things. In particular, there are a whole bunch of operations which do exactly the same thing but have different opcodes. Intel introduces “movaps” and “movups” for aligned and unaligned moves in SSE1, and then “movdqa” and “movdqu” in SSE2… to do exactly the same thing. The same situation occurs with pand and andps… etc. The end result is a number of things:
1. Wasted opcode space on opcodes that do exactly the same thing.
2. Wasted executable size, since movdqa is larger than movaps (3 byte vs 2 byte opcode) despite doing exactly the same thing.
3. Loss of our sanity.