Diary Of An x264 Developer

05/03/2008 (1:30 am)

Film grain optimization

Optimizing an encoder for film grain is a tough problem. For one, film grain is, by definition, basically uncorrelated between frames; that is, film grain from a previous frame is totally useless in encoding the current frame’s film grain (at least it would seem!). This would suggest that intra blocks are necessary for encoding film grain, which is what generally results. Yet this encounters another problem: film grain is made up of a whole slew of spacial frequencies, many of which cannot be represented at the quantizers often used in P/B-frames! This makes it extremely difficult to efficiently represent the film grain at reasonable bitrates.

But we can cheat.

Previous frames have a lot of the necessary spatial frequencies–why not steal them? Sure, an inter block won’t be as efficient as an intra block, but it might work better. Indeed, the initial idea I got from glancing at the results of the film grain optimization in Elecard’s encoder (Mainconcept core). Their film grain optimization almost completely disabled I-blocks in P-frames, suggesting that this was indeed the avenue to go down. Of course, their film grain optimization really wasn’t that good–so who knows?

To begin with, I tried the obvious; completely disable intra blocks in P-frames for the hell of it. Surprisingly, this actually worked; in many cases it improved grain retention! But if I was to make this practical, I’d have to find a real way of implementing a metric to decide what block type to use, rather than just brute-force disabling an entire category of blocks.

I eventually came back upon an idea I considered a while back–what about NSSD? NSSD, also known as “noise-retaining sum of squared differences,” is a block comparison metric that is supposed to promote retaining grain/noise. How exactly does it do this? NSSD is equal to the sum of the ordinary SSD and the absolute value of the difference in “noise” values for the two blocks to be compared. “Noise” is abs( x(i,j) – x(i+1,j) – x(i,j+1) + x(i+1,j+1) ) summed up over all pixels x(i,j) in the source block (ignoring pixels that would result in this formula going over the edge of the block). In other words, it doesn’t compare the pixels of the two blocks; it simply measures the “noisiness” of each block, and makes sure that they have a “similar” amount of noise. Keeping the two blocks visually similar is taken care of by the SSD portion of the score.

Amazingly, this worked; replacing the RD metric (SSD) with NSSD, combined with tweaking of the RD thresholds to ensure that modes that tended to retain noise were always analyzed, drastically improved grain retention, and made inter blocks drastically more common in grainy footage. The patch can be found here, complete with mildly optimized MMX assembly for the “noise” operation, ported from ffmpeg (where NSSD is available as a -cmp/-subcmp/-rdcmp option).

05/02/2008 (9:04 am)

Google summer of code

Filed under: google,GSOC,x264 ::

This year, x264 is participating in Google Summer of Code. We have accepted four applications out of a total of roughly 15 applications (out of Videolan’s roughly 80 applications and 14 slots). The projects are as follows:

Robert Deaton (masquerade): Improve B-frame decision, both in terms of simple numbers of B-frames, direct modes, frame ordering, reference frame using, etc.

Joey Degges (keram): Improve inter mode search and decision and experiment with more psychovisual optimizations

Aki Jäntti (Kuukunen): Use a “macroblock tree” structure to measure the temporal importance of various parts of the image.

Holger Lubitz (holger): General assembly optimizations and speed improvements.

Good luck to all participants!

05/02/2008 (3:59 am)

Skipping stuff

Filed under: skip,speed,x264 ::

Its nice to be able to take shortcuts in the encoding process–its even better when we can take a shortcut that doesn’t even change the output.

A recent example was my intra skip patch. The basic idea behind intra skip lies in the method by which intra analysis works. For each block of an i4x4 or i8x8 block, all 9 prediction modes are analyzed and the best chosen. The prediction modes use the top, left, and/or top-right pixels to predict the block–so the blocks to the top, left, and top-right of the current block have to be encoded in order to know which pixels we will be predicting with:

The numbers show the order in which the blocks are encoded, in order to make sure all the necessary edge pixels are available during analysis. The gray blocks are from neighboring macroblocks which have already been encoded. Note that the “16″ block never gets encoded as part of intra analysis, since it isn’t needed for the edges of anything else.

As a result, we have to encode each block as we go along in the analysis process. Once we’re done, we can save the results of the encoding so that later, if that intra mode is chosen, we can just restore the backup, saving over 2500 clocks of predict/DCT/quant/zigzag/dequant/iDCT.

Next, there’s my p/bskip patch. When pskip/bskip analysis is done, motion compensation on the block is done (with respect to the “skip” motion vector) and then analysis follows. If the skip is chosen, the analysis terminates and the block is encoded… where the motion compensation gets done again. But we can’t just turn off all skip-based motion compensation, because there are some cases where you can have a skip in which the motion compensation wasn’t the last thing that was done. So I made a variable that told the encoder whether it could skip it or not.

Finally, there’s my upcoming fast probe_skip patch, which takes a number of shortcuts in the skip check process:

1. The current skip_check does a full 16×16 DCT (16 4×4 DCTs), and then checks each 4×4 block individually. Quite often, it’ll be able to tell after the first block or two that the block won’t be a skip; in this case, doing the full 16×16 DCT is a good waste of a few hundred clock cycles. So I made it to an 8×8 DCT (4 4×4 DCTs) for each 8×8 block. Why didn’t I do each 4×4 DCT as I went along? The 8×4 DCT (2 4×4 DCTs) is much faster than the 4×4 because its SSE2-accelerated, so I just went ahead and did the 8×8, which is coded as two 8x4s.

2. When checking for a skip, its highly likely that if one got past the first DCT block or two without terminating, most of the rest of the blocks will be zero. As a result, its efficient to check if they’re all zero before zigzag/decimate_score; this only takes a few clocks and quite often allows us to skip the aforementioned somewhat time-consuming steps.

As you can see, its quite good when we can skip something entirely; its even better than speeding up a function; its calling a function less often.

05/02/2008 (1:39 am)

Cacheline splits, aka Intel hell

Filed under: assembly,cacheline,Intel,x264 ::

About 6 months ago Loren Merritt and I, after relatively exhaustive testing on over a dozen types of processors, concluded that Intel chips in general have a problem in the case of a “cacheline split.”

To begin with, modern processors divide their cache into “lines”; for example, a Pentium 3 has a 32-byte cacheline size, while a Core 2 has a 64-byte cacheline. The way these work is that whenever the processor is told to fetch data from memory, it will fetch a full cacheline at a time. This also affects how the data is organized in the cache.

Now, when you’re loading aligned data, that is, data aligned to a 16-byte line, you’re guaranteed to never have that data cross a cache line, since the data is loaded in 8 or 16 byte chunks. But if you’re loading unaligned data, which is completely unavoidable in the case of the motion search, you have to do unaligned loads. In the case of a 16×16 motion search and a 64-byte cacheline, this means that 1 in every 4 motion vectors checked will result in the load spanning a cacheline–that is, some of the data comes from one cacheline, and some from another.

Why is this bad? On AMD chips, it isn’t; the penalty is no different from any other unaligned load. But on Intel chips the penalty is enormous; in fact, the cost of a cacheline-split load is roughly equivalent to the L2 cache latency, suggesting that a cacheline split actually results in a load from L2 cache. This may not seem too bad, but this adds roughly 12 clocks per load on a Core 2… boosting the cost of a single 16×16 SAD from 48 clocks to over 230 clocks. Splits on page lines are even worse; a single load can cost over 220 clock cycles, roughly equivalent to the main memory latency! A 16×16 SAD has now gone from 48 clocks to over 3600 clocks. The penalty on AMD chips for a pageline split exists, but its extremely small.

Since cacheline splits happen only some of the time (and 1 in 64 cachelines are pagelines) it isn’t as bad as it sounds, but the end result is that SADs take about an average of 100 clocks instead of 48. Fortunately, Loren Merritt’s massive cacheline split patch fixed this for SADs, resulting in a considerably faster motion search. Investigating his patch, you can see four solutions that he implemented for this problem:

1. The easiest, SSE3-based. The “lddqu” operation tells the processor to, instead of loading the unaligned data, to load two aligned pieces of data and shift them accordingly to give the same result. As a result, it has none of the cost of a cacheline split… but unfortunately it only works on Pentium 4Ds and Core 1s. On Core 2s, lddqu does the same thing as movdqu.

2. The second easiest, MMX-based. Shift amounts are calculated based on the position of the data to be loaded, the data is loaded in two chunks, and they are shifted and OR’d together to get the result. This is best expressed in an image:


As can be seen, this is quite a bit more complex than an ordinary load; it involves precalculation of two numbers (the shift values) and two shifts plus one OR for every single load! But its still far faster than the 12-14 clock latency for cacheline splits. But wait, you thought this workaround was messy? Just wait until you see the rest…

3. The second hardest, using SSSE3′s palignr. palignr does what the two shifts and the OR did above, all in a single operation! This would be easier than the MMX version, but there’s one complication; the MMX shift described above takes a *variable* value, so we can calculate on runtime the shift amount to use, and then do it. But palignr takes a constant which must be written into the opcode on compile-time! So for this, Loren Merritt made a set of 15 loops, one for each possible alignment, and had the program jump to the correct loop based on the current alignment. Since each loop had a constant size, he could literally calculate the position of the loop in the code and jump to its position numerically instead of storing an array of pointers to the various loops. What a mess! But it works, and faster than the MMX version, since palignr is so much simpler.

4. The worst is SSE; SSE’s version of MMX’s full-shift has the same problem as palignr; it only takes a constant shift amount. So we have to do the same as palignr’s 15-loops, except with all the same hackiness as in the MMX version.

So there we have it; the 4 workarounds, all of which give a similar performance boost on Intel processors. Now you understand why coding assembly for Intel chips can be hell; one has to compensate for such ridiculous performance penalties like these. I extended the cacheline patch to work for the qpel interpolation also, which is used in all of the subpel motion search. This gave an overall speed boost of 1-2% or so (25-40% for qpel interpolation only). My patch only uses the MMX and SSE3 workarounds, since the others were unnecessarily complex for this smaller case. Loren Merritt’s sse2/ssse3 hpel_filter patch also used this concept, though in this case the misalignments were precisely known, so using palignr was drastically easier. My frame_init_lowres asm patch also provides an SSE2 and SSSE3-based solution for cacheline splits of known alignment.

05/01/2008 (9:30 pm)

Finite state machines and CABAC

During CABAC binarization, the coefficients in each DCT block are converted to binary symbols which are arithmetically coded into the bitstream. Each symbol has a specific context, the choice of which is calculated through a simple algorithm. This context-choice, of course, has to be the same on encoding and decoding. One can calculate this context for each coefficient–or one can make a finite state machine using a set of states and a transition tables, so that given the current context, the next context can be calculated using merely an array lookup.

This improves performance both in CABAC encoding and CABAC decoding. Note the similarities between the two diffs–despite the fact that one is in x264 and designed for encoding, and the other is in ffmpeg and designed for decoding, they are practically the same code. This is true of most of the CABAC code; in fact, most of x264′s CABAC binarization code is simply the code from ffmpeg, except reversed. Original credit for the finite state machine code goes to Loren Merritt, who wrote it to speed up the calculation of CABAC states in x264′s trellis quantization algorithm.

05/01/2008 (9:04 pm)

Unrolling a loop makes it smaller

Filed under: loops,x264 ::

The topic of this post is this recent diff. It doesn’t at all change the results of the function; I simply unrolled the loop to take advantage of the fact that most of the loop’s variables were dependent on the loop constant itself. The result was only 7 lines (!) down from well over 30. Thanks to an earlier ffmpeg patch for the original idea and some of the code.

Apparently GCC doesn’t actually measure the benefit gained from constant propagation in unrolling loops, so one is forced to manually unroll in this sort of case.

Oh, and I got roughly a 50% speedup in this function by doing this (not counting the encode_decision calls).

05/01/2008 (5:16 pm)

Inter RD refine bugs

I think I’ve found two of these in one week now.

For those who don’t know, on inter blocks in P-frames, subme 7 works by a process called “qpel RD”; that is, it does an ordinary subpixel refinement of the motion vectors except instead of the usual fast metric (SAD or SATD), it does a full rate-distortion comparison on each qpel position using a hexagonal search. Of course, to increase speed, it uses a SATD threshold above which it won’t bother with the whole RD process. The reason for this is obvious when you see the numbers; a full-macroblock SATD takes about 450 clocks, while a full-macroblock RD takes about 10,000 clocks. So even if SATD allowed us to avoid doing an RD 1/20th of the time, it would be worth it; the real numbers are much higher than that, of course.

Now, onto the bugs. The first one was in CAVLC, where in an 8x8dct inter block the numbers for the numbers of non-zero coefficients in each DCT block were not being calculated, resulting in incorrect calculations of bit cost. This wasn’t a problem for CABAC, because CABAC only needs to know whether a block is all-zero or not all-zero; while CAVLC needs to know exactly how many non-zero coefficients there are. This was resolved as part of my overhaul of the nnz code.

The second bug I just ran into today, where I found that when RD was done on 8×16 and 16×8 blocks, for simplicity’s sake, the RD function encoded two 8×8 blocks separately and then summed the resulting scores. This isn’t a problem per se, but the 8×8 RD function went on to check what type of 8×8 block it was: this could be an 8×8, 2 8x4s, 2 4x8s, or 4 4×4 blocks. An 8×16 or 16×8 block can’t have such subblocks, while an 8×8 block can. This again isn’t a problem… but when a 16×8 or 8×16 block type is chosen, it doesn’t reset the subpartition types used for the 8×8 search… so if the 8×8 search chose a sub-8×8 block type, the 8×8 block encode now thinks that we have a sub-8×8 block type even if we cannot possibly have one. The solution was pretty simple.

05/01/2008 (5:08 pm)

Why, Intel, why?

Filed under: assembly,Intel,stupidity,x264 ::

This diff is highly related to this post.

If one looks in Intel’s documentation of their assembly, one notices a few things. In particular, there are a whole bunch of operations which do exactly the same thing but have different opcodes. Intel introduces “movaps” and “movups” for aligned and unaligned moves in SSE1, and then “movdqa” and “movdqu” in SSE2… to do exactly the same thing. The same situation occurs with pand and andps… etc. The end result is a number of things:
1. Wasted opcode space on opcodes that do exactly the same thing.
2. Wasted executable size, since movdqa is larger than movaps (3 byte vs 2 byte opcode) despite doing exactly the same thing.
3. Loss of our sanity.

05/01/2008 (5:03 pm)

Array overflows

Filed under: memory management,ugly code,x264 ::

x264 keeps a set of pointers to various arrays used for custom quantization matrices (CQMs), deadzone biases, etc. These are stored in the primary x264_t struct as follows:

uint16_t (*quant4_bias[4])[16]; /* [4][52][16] */
uint16_t (*quant8_bias[2])[64]; /* [2][52][64] */

They are malloced at the start of the program and deleted when x264 finishes. Unfortunately the delete code looks something like this:

for( i = 0; i < 6; i++ )
{
    …
    x264_free( h->quant4_bias[i] );
}

A small part of me died when I read this code. Yes, that’s right, its overflowing the array of pointers intentionally because they’re arranged sequentially in the struct. Apparently, according to Loren Merritt, this simplifies the deletion code.

« Previous Page