Diary Of An x264 Developer

05/29/2008 (9:32 am)

Working at Avail Media

Filed under: avail,development,x264 ::

As some of you know, this summer I am living in Kalispell, Montana and working at Avail Media, a relatively small broadcast/IPTV company that caters primarily to smaller telecomms looking to set up IPTV and VOD services for reasonable prices. What’s unique about Avail, however, is that they don’t use hardware encoders; they use x264! They’re definitely not the only company using x264–other notable users include Google and Facebook–but I know of nobody else using x264 for live HD television. As the company which Loren Merritt (pengvado) worked for, they are responsible for quite a number of x264′s broadcast-related features, such as its interlaced encoding support.

Of course, broadcast has its own difficulties that make it much more of a challenge than ordinary offline encoding. For one, encoding has to run in realtime, quite a challenge when dealing with 1080i input. This is assisted by a patch written by Loren and improved by me, “speed control,” which automatically reconfigures x264′s settings on the fly to run at exactly the specified speed. It even communicates with the encoding/muxing frontend to know exactly how much time it really has left.

Another major issue is that of the VBV; the stream must strictly obey the buffer size or else packets might end up being dropped, corrupting the video stream. This is a hard problem, considering that x264′s VBV, especially in 1pass mode, is not very compliant. This has been a repeated subject of research by me even before I came here. Gabriel from Joost has also spent quite a bit of time on it in the form of his 2pass VBV patch, which in its latest form also improves 1pass VBV ratecontrol.

The biggest issue is the quality of the sources one comes across in broadcast–or better stated, the lack thereof. Much input is in the form of 18 megabit CBR MPEG-2 streams which are of such low quality that motion search above DIA/HEX is nearly useless. This is because in any scene with sufficient motion to require a complex motion search, the stream has already gone completely blocky. Re-encoding to 6-7 megabit H.264 doesn’t make it much better, either! This mess of input also results in a very large percentage of intra blocks in the output, which makes ratecontrol that much more difficult. Add this to the relative inefficiency of x264′s interlaced encoding and things get even worse.

Yet despite the difficulties of broadcast, Avail has managed to get it to work; quite an amazing accomplishment, proving once again that x264 isn’t just for ordinary offline 2pass encoding.

05/21/2008 (11:57 am)

Eliminating Pentium I support

Filed under: assembly,speed,x264 ::

With this change, we have finally eliminated support for all x86 CPUs that don’t support MMXEXT. More correctly, x264 compiled on an MMXEXT-supporting system will not run a non-MMXEXT system, so people who still use Pentium Is for their encoding will have to compile their own versions.

What exactly did we do? Its quite nice to be able to use MMX, CMOV, and other such extended x86 instructions in ordinary C code. Normal assembly code in x264 uses function pointers so that the best function is selected on runtime. But in a case where a performance gain can only be achieved through inlining, function pointers are useless. This is especially the case with small, simple functions that are called often, such as the x264_median in the above diff, used for motion vector prediction. A more extreme example was the CABAC asm, which got a >20% performance boost for CABAC encoding merely by changing a branch to a cmov (but which required the whole function to be rewritten in assembly).

Now, having officially eliminated support for all x86 CPU architectures prior to MMXEXT, we can feel free to throw MMX/CMOV code basically wherever we want in the code, allowing all kinds of small speedups in cases that groups of simple operations can be easily SIMD’d. Another note: the median assembly referenced at the start of this post is the first use of GCC inline assembly in x264′s history.

Edit: Since it appears that some people aren’t exactly clear on the point of this post, the program will still compile on a non-MMX machine; its just that code compiled on an MMX machine will not work on a non-MMX machine, whereas before it did.

05/16/2008 (11:43 pm)

Write combining

Filed under: gcc,speed,ugly code,x264 ::

Let’s say we need to copy a few variables from one array to another. The obvious way is something like this:

byte array1[4] = {1,2,3,4};
byte array2[4];
int i;
for(i = 0; i < 4; i++) array2[i] = array1[i];

But this is suboptimal for many reasons. For one, we’re doing 8-bit reads and writes, which on 32-bit systems may actually be slower than 32-bit reads and writes; i.e. a single 32-bit read/write may be faster than a single 8-bit read/write. But the main issue is that we could be doing this:

DECLARE_ALIGNED_4(byte array1[4] = {1,2,3,4});
DECLARE_ALIGNED_4(byte array2[4]);
*(uint32_t*)array2 = *(uint32_t*)array1;

In a single operation instead of 4, we just copied the whole array. Faster speed-wise and shorter code-wise, too. The alignment is to ensure that we don’t copy between unaligned arrays, which could crash on non-x86 architectures (e.g. PowerPC) and would also go slightly slower on x86 (but still faster than the uncombined write). But, one might ask, can’t the compiler do this? Well, there are many reasons it doesn’t happen. We’ll start from the easiest case and go to the hardest case.

The easiest case is a simple zeroing of a struct (say s={a,b} where a and b are 16-bit integers). The struct is likely to be aligned by the compiler to begin with and writing zero to {a,b} is the same as writing a 32-bit zero to the whole struct. But GCC doesn’t even optimize this; it still assigns the zeroes separately! How stupid.

The second-easiest case is the generalization of this; if you’re dealing with arrays in which the function is directly accessing them (rather than pointers to arrays, which it might not know whether they’re aligned or not) and assigning zero or constant value, write-combining is trivial. But again, GCC doesn’t do it.

Now, we get to the harder stuff. What if we’re copying between two arrays, both of which are directly accessed? Now, we have to be able to detect this sequential copying and merge it. This basically is a simple form of autovectorization; its no surprise at all that GCC doesn’t do this.

The hardest, and in fact nearly impossible case is the one in which we’re dealing with pointers to arrays as arguments; the compiler really has no reliable way of knowing that the pointers are aligned (though we as programmers might know that they always are). There are cases where it could make accurate derivations (by annotating pointers passed between functions) as to whether they are aligned or not, in which case it might be able to do write combining; this would of course be very difficult.  Of course, on x86, its still worthwhile to combine even if there’s a misalignment risk, since it will only go slightly slower rather than crash.

The end result of this kind of operation is a massive speed boost in such functions; for example, in the section where motion vectors are cached (in macroblock_cache_save) I got over double the speed by converting 16-bit copies to write-combined copies. This of course is only on a 32-bit system; on a 64-bit system we could do even better. The code of course uses 64-bit so that a 64-bit compiled binary will do it as best it can. The compiler is smart enough to split the copies on 32-bit systems, of course.

We could actually do even better if we were willing to use MMX or SSE, since MMX could be used for 64-bit copies on 32-bit systems and SSE could be used for 128-bit copies. Unfortunately, this would completely sacrifice portability and at this point the speed boost would be pretty small from the current merged copies.

One of the big tricks currently is the ability to treat two motion vectors as one, and since all motion vectors come in pairs (X and Y, 16-bit signed integers each), its quite easy to manipulate them as pairs. This allowed me to drastically speed up a lot of manipulation involved in motion vector prediction and general copying and storing. The result of all the issues described in the article is this massive diff.

05/16/2008 (10:02 pm)

Oops

Filed under: bugs,CABAC,chroma,stupidity,x264 ::

Apparently my nnz patch managed to break lossless mode and it took us nearly 2 weeks to realize it! Fortunately, I managed to fix it with this patch only 30 minutes after it was reported and it was soon after put into the official git repository. To begin with, to understand this, one has to understand what my nnz patch did.

CAVLC entropy encoding needs to know the number of nonzero coefficients in a DCT block to choose its variable-length codes properly (among other things, it adapts which VLC table to use based on the characteristics of the DCT block). This is a slightly intensive process; its an extra step to take on every single DCT block to count the number of nonzero values. Its not too slow, but in CABAC its completely unnecessary, so the purpose of the patch was to move all nnz calculations to CAVLC and in the macroblock encoding only calculate a single thing: whether the array had coefficients or not, which is very fast and trivial to calculate.

But in DCT arrays of AC coefficients only, such as in i16x16 and 8×8 chroma blocks (see this post for more info), we need to ignore the first coefficient, the DC coefficient, when doing this math. So the simplest way to do this is to just zero the DC coefficient at the same time we write the coefficients to memory so it doesn’t cause issues later.

But I forgot about lossless mode and forgot to set the DC coefficient to zero in that case. The end result? Well, it broke!

05/11/2008 (12:27 pm)

Blog moved to WordPress on multimedia.cx

Filed under: blog,ffmpeg,Wordpress ::

As you can probably tell, I’m in the process of moving the blog here. I’ll post updates as things happen.

Update: moving seems to be done. Some formatting might be borked, but I’m lazy.

05/10/2008 (10:58 pm)

Introducing Photon

Filed under: codec,ffmpeg,photon ::

A few days ago I was somewhat bored; no particular good ideas for quick x264 improvements, no good games to play, and most of my finals out of the way. So I decided to make my own codec, called Photon.

The basic goal is to make a very fast MPEG-2-like format with better compression and speed than MPEG-2, but without the complexities of things like interlacing and B-frames. My eventual plans for Photon are a bit more fancy than what I have so far, of course; currently the encoder and decoder are intra-only, for example. And honestly, I’ll be shocked if I manage to reach my goals; my main purpose in this is to learn how to write a codec and bitstream format from the ground up and experiment with all sorts of features in the process.

The main philosophy behind Photon is to keep everything as simple as possible; the fewer special cases needed, the better. As such, every macroblock is divided up into 8×8 blocks; 4 for luma and 2 for chroma (the codec uses YV12 colorspace). Every single 8×8 block, luma or chroma, is treated exactly the same with the exact same set of code; the bitstream does not even distinguish them. This makes it possible to have a single loop for decoding all of these blocks. For the transform, the H.264 transform/quant/zigzag process is used.

Preceding the blocks is a 6-bit element, with one bit for each block. For each block, its associated bit in the Coded Block Pattern (CBP) is set to 0 if there’s nothing in the block and set to 1 if there is.

For each block with a
CBP of 1:

1. A delta-quantizer element. Yes, that’s right: each 8×8 block has its own quantizer! This increases the effectiveness of adaptive quantization. The delta-quantizer is done with respect to the same numbered block in the previous macroblock and is coded in unary with one bit for the sign (total bit cost: 1 bit to not code a delta quant, N+1 bits to code a delta quant, where N is the delta).
2. A transform element. Set to 0 if the block uses 4x4dct, set to 1 if the block uses 8x8dct.
3. A 4-bit CBP for the 4 4x4dct blocks, set in the same manner as the macroblock CBP. Omitted in the case of 8x8dct.
4. Residual for the 8×8 block, or for each 4×4 block with a CBP (in raster scan order) as follows:
a. The first coefficient in the block, coded as a signed exponential golomb code.
b. Is this the last coefficient? If so, a 1 is coded and the residual coding ends. Otherwise, a 0 is coded.
c. The run length of 0s until the next coefficient, coded as an unsigned exponential golomb code.
This loop repeats to code the entire block.

The macroblock itself has a 2-to-4-bit header specifying the luma and chroma prediction modes to use, and the frame has a header specifying the frametype and frame quantizer.

… and that’s it. That’s the whole thing, so far. Yes, the entropy coding is absurdly suboptimal and would benefit dramatically from custom variable-length codes instead of universal codes. Yes, I’m not using an ounce of assembly whatsoever and the encoding and decoding are far slower than they should be (decoding is still realtime for most sane resolutions though). But it works.

For those who want to see my hackneyed, half-copy-pasted-from-x264′s-common-library code, the diff so far is here.

05/09/2008 (2:51 pm)

Variable length coding and you

Let’s say you want to write a number to a bitstream. The decoder is expecting a number, so it knows exactly what to look for. But how do you format this number?

You could write 16 bits no matter what, putting a max of 65535 on this number. This would also waste tons of bits though; what if the number you’re writing here with your encoder is almost always 0, 1, or 2? Writing 16 bits would seem ridiculous, so it only makes sense to use a variable-length code (VLC). But how would the decoder know how *long* the code is? Here’s a simple example:

5 -> 101
2 -> 10

Both will look the same to the decoder, since it doesn’t know whether the code ends after the first bit, the second, the third, etc. But there’s a solution to this problem; a common one lies in the form of exponential Golomb codes. Here’s an example:

0 -> 1
2 -> 011
5 -> 00110
8 -> 0001001

The number of zeroes is the number of bits in the code, and a 1 is used to terminate the list of zeroes. This may seem inefficient, but in a sense one has no other choice but to have some sort of coding scheme like this. Of course, exponential golomb codes are only optimal in the case of a specific probability distribution; an optimal variable-length coding system has code lengths chosen based on the probability of each value. This is the primary reason why MPEG-4 Part 2 (Xvid/DivX) is superior to MPEG-2; the VLC tables were much better chosen.

Of course, the ideal system would be to allow custom VLC tables to be put in the header of the video file; its quite trivial to design a twopass encoder to create optimal VLC tables for a given video, or even a given frame. H.264 partially does away with this problem with CABAC; while there would be benefit to including a custom initial CABAC state, the cost of various CABAC modes adapts quickly to the video. Of course, this is not always the case; CABAC still relies on variable-length codes, because it converts numbers to series of binary digits, which are then written into the context-based arithmetic coder (the “binary” part of CABAC). One result of this is that even if one has a B-frame made up entirely of intra blocks, the arithmetic coder won’t be able to completely adapt, and each block will still cost more bits; the variable-length coding for the macroblock type is not chosen in such a way that would allow such “complete” adaptation.

05/06/2008 (8:20 pm)

Test clips

Filed under: test sequences,videos,x264 ::

Many users have wondered what kind of test clips are used to test x264 during development. In this post, I will attempt to enlighten readers on my personal suite of test clips and why I chose each one.

My first test clip is a roughly 3600-frame-long segment from Pirates of the Carribean: The Curse of the Black Pearl. I often just use the first 500 or so frames of this. The primary reason this test clip was chosen was to serve as an “ordinary” standard-definition video with a reasonable amount of film grain and the same sort of imperfections that most DVD sources have. The clip can be downloaded here.

My second test clip is a section from Elephant’s Dream, the free full-CGI short made in Blender and released by the Orange Open Movie Project studio. This sample is useful for a number of reasons: first, it is available in true lossless 1080p, so I can use a completely flawless source as input to the encoder. Second, of course, it is completely free. Finally, it serves as a good example of a CGI source for which to test x264 on. The full lossless original can be found in PNG sequence format here. If I need an extra HD CGI short, I usually use For the Birds or Lifted.

Next, I usually have an anime/cartoon source of some sort. I sometimes use the Flash version of the Azumanga Daioh OP, but this clip is often less useful than one would think; because its Flash, it is completely lossless. Most real anime/cartoon sources are either DVD or TV broadcast and are therefore far less clean than the above clip. Since x264 has to be able to deal with the flaws in its input, I usually use a fansub copy of the Haruhi ED sequence.

My penultimate test clip is from the VC-1 Blu-ray version of “300.” This is a serious torture test for any video encoder; the artificial grain in this source is unbelievable and requires a ridiculously high bitrate to maintain at any resolution. This is also one of the main test clips I used in developing my film grain optimization. While I obviously can’t upload the whole video nor even a reasonably sized segment of the source, a relatively high quality sample rip can be found here.

My final test clip is the ultimate torture test: a set of 640×480 videos that need over 4 megabits to avoid noticeable visual artifacts. These are lossless FRAPS captures of Touhou Project games, a series of “danmaku” vertical-scrolling shmups with a difficulty ranging from high to completely ridiculous. With this difficulty comes incredibly complex bullet patterns that make can make many video encoders completely choke; it is nearly impossible to effectively compress the sharp-edged bullets enough to reach the target bitrate without completely decimating the backgrounds. I have used these clips for a comparison between x264 and Elecard and a sample video for H.264-in-browser using Flash. I also have a short lossless clip available for those interested.

05/06/2008 (2:21 am)

x264 development: a six month retrospective

Filed under: development,summary,x264 ::

These past 6 months have consisted mostly of bugfixes, vast speed improvements, and the beginning of what will hopefully be a series of psychovisual optimizations.

How can I best describe the speed boost? Numbers would do the best job, I think. All values are my internal development build compared to the current version from 6 months ago. Adaptive quantization is disabled to make the results comparable. CRF is used for all encodes.

Max speed settings (no B-frames, subme 1, analyse none, me dia): 29.5% speed boost
Near-max speed settings (3 B-frames, subme 1, analyse none, me dia): 24.5% speed boost
Medium speed settings: (3 B-frames, subme 5): 18.5% speed boost
Slow speed settings (3 b-frames, subme 6, b-rdo,
me umh, ref 4): 35% speed boost
Very slow speed settings (16 b-frames, subme 7, b-rdo, me esa, ref 16, trellis 2, no fast-pskip, partitions all, mixed-refs): 52% speed boost
Lossless: 15% speed boost

Notable new features:

1. Psy-based adaptive quantization, for improving quality in flat areas of the frame by taking bits from more complex areas of the frame.
2. –me tesa, transformed exhaustive search. Converted from a ridiculously slow initial algorithm by me to a highly optimized thresholded solution by Loren Merritt, resulting in an even slower alternative to –me esa.
3. A massive preprocessor-based abstraction layer for assembly, allowing complete abstraction between 32-bit and 64-bit assembly and even automatic handling of everything from stack offsets to macros that permute their arguments and SSE/MMX abstraction. Written from scratch by Loren Merritt and drastically simplifies all assembly development.

Notable speed increases:

1. Altivec implementations of various functions; much faster PowerPC encoding.
2. Cacheline optimization for SAD-based motion search. Also for luma MC.
3. Much faster exhaustive motion search.
4. Lots more SSE2 assembly. And SSSE3 too. And even more SSE2. Oh wait, more
5. Skipping stuff.
6. Much much faster CABAC encoding.
7. Tons of small optimizations all over x264. Yes, there’s lots more of these. And more of these. And even morewait, there’s more here

05/05/2008 (11:53 pm)

Chroma encoding optimizations

Filed under: chroma,speed,x264 ::

This post revolves around this diff.

The chroma encoding process is an often-neglected one; we focus all too much on the complexities of luma encoding while forgetting the chroma encoding process, which is exactly the same regardless of block type. The 8×8 residual blocks for the two chroma planes (from motion compensation or intra prediction) are DCT’d, quantized, and encoded. However, there are some aspects to chroma encoding that make it subtly different:

1. Like i16x16 blocks, chroma encoding is separated into the DC coefficients of each 4×4 DCT block and the AC coefficients. DC coefficients are “flat” frequencies, that is, a number representing a uniform change over the whole block. AC coefficients are non-flat, and represent a spatial frequency. In most block types, DC coefficients are not treated differently from AC; in chroma and i16x16, they are. In chroma in particular, the 4 DC coefficients (for each of the 4 DCT blocks in a color plane) form a 2×2 array, to which a Hadamard transform is applied, and then the result is quantized. This allows the overall “flat” change in color to be coded separately from any more complex change in color.

2. Chroma blocks are generally very simple and have nearly no residual data. This is part of the reason that 1) is how the H.264 standard implements chroma coding; nearly all the data is probably going to be very simple if it is there at all.

Now, this changes things slightly. First of all, it changes how DCT decimation works; normally, decimate works by picking a DCT block and setting its contents to zero if its contents are sufficiently simple that it would save a lot of bits just to not code it at all (and have a relatively small quality loss relative to the bit savings). For chroma in particular, decimate happens a *lot*, because usually there are nearly no AC coefficients.

But what if, like we often do, we decimate the AC coefficients… but there’s DC coefficients left over? We don’t decimate these; the visual effect could potentially be large to do so, and the bit savings is small. But we still have to decode the DC-only chroma residual data, that is, perform an inverse DCT, and store the resulting data to represent the decoded frame.

But if we know that there’s only a DC coefficient and no AC coefficients, why do an entire iDCT? That would be a waste of time, since having only a DC coefficient literally just means we’re adding the same value to every pixel in the block. So I ported the DC-only chroma iDCT from ffh264 and rewrote it to do all 4 4×4 blocks at once, so it can handle an entire 8×8 chroma block in one call. Now, whenever we have a DC-only chroma block, we can skip most of the iDCT process, saving a bit of time. I also cleaned up a lot of the decimate code in the function, speeding things up further. Thanks a lot to Loren for help with the assembly on this one.

Next Page »