Diary Of An x264 Developer

07/10/2008 (2:34 pm)

A summary of recent revisions

I haven’t written a post here for quite some time, so I figured I’d get back into the swing of things with an update on some recent revisions.  Further information can be found on the official git repository page.

r891: Instead of checking for the end of the bitstream on every single write, check once before encoding each macroblock; and instead of silently truncating the frame upon reaching the end of the malloc’d memory, reallocate with more room. (patch by me)
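
A minimal sketch of the buffer-growing side of this (the struct, names, and worst-case constant are all hypothetical, not x264’s actual bitstream API):

#include <stdint.h>
#include <stdlib.h>

#define MAX_MB_BYTES 1024 /* hypothetical worst-case size of one macroblock */

typedef struct {
    uint8_t *buf;  /* start of the bitstream buffer */
    uint8_t *cur;  /* current write position */
    size_t   size; /* total allocated size */
} bitstream_t;

/* Called once per macroblock instead of once per write.
   Returns 0 on success, -1 on allocation failure. */
static int bs_ensure_space( bitstream_t *bs )
{
    size_t used = bs->cur - bs->buf;
    if( bs->size - used >= MAX_MB_BYTES )
        return 0;
    size_t newsize = bs->size * 2 + MAX_MB_BYTES;
    uint8_t *tmp = realloc( bs->buf, newsize );
    if( !tmp )
        return -1; /* grow instead of silently truncating the frame */
    bs->buf  = tmp;
    bs->cur  = tmp + used;
    bs->size = newsize;
    return 0;
}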

r892: A patch by Gabriel for compilation on MSVC and other crappy compilers that don’t maintain mod16 stack alignment: disable all functions that require such alignment on those platforms.  This results in a visible CPU flag (“slow_mod4_stack”) to warn the user that their build is crippled.

r893: Various micro-optimizations by me, along with a very cool little idea that saves quite a bit of time in probe_skip.  The probe_skip function *almost never* terminates during the chroma check, but the check can’t be skipped entirely: if a block is marked as skip without looking at the chroma, and it turns out there was significant detail there, the result can be noticeable artifacting.  The solution was to do a quick Sum of Squared Differences (SSD) check on each chroma plane before the actual probe; if the SSD is low enough, skip the full check.  Even an incredibly low threshold was sufficient to skip the check nearly half the time without ever changing the results.
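
A hedged sketch of the shape of this (all names and the threshold value are hypothetical, not the actual probe_skip code):

#include <stdint.h>

#define CHROMA_SSD_THRESH 8 /* hypothetical; even very low values skip ~half the checks */

/* Stand-in for the expensive transform/quantize chroma probe. */
static int full_chroma_probe( const uint8_t *src, const uint8_t *ref, int stride )
{
    (void)src; (void)ref; (void)stride;
    return 0; /* elided: the real check */
}

static uint64_t ssd_8x8( const uint8_t *a, const uint8_t *b, int stride )
{
    uint64_t ssd = 0;
    for( int y = 0; y < 8; y++ )
        for( int x = 0; x < 8; x++ )
        {
            int d = a[y*stride+x] - b[y*stride+x];
            ssd += d*d;
        }
    return ssd;
}

static int chroma_can_skip( const uint8_t *src, const uint8_t *ref, int stride )
{
    /* Cheap test first: if the planes barely differ, the full probe
       can be skipped without changing the result. */
    if( ssd_8x8( src, ref, stride ) < CHROMA_SSD_THRESH )
        return 1;
    return full_chroma_probe( src, ref, stride );
}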

r894: Assembly for the lowres interpolation filter used for frametype decision, which drastically increases its speed.  The C was modified to match the assembly version, which had slightly different rounding.  I originally wrote an MMX, SSE2, SSE3, and SSSE3 version of this patch, but Loren found an even faster method and rewrote most of it from scratch before committing it.
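
The rounding subtlety is easy to show.  Code built on pavgb (which averages pairs and rounds up at each stage) rounds twice per output pixel, unlike a single box filter, so the C had to match.  Here is a hedged sketch of one downsampled row (a hypothetical simplification, not the actual frame_init_lowres):

#include <stdint.h>

static void lowres_row( const uint8_t *src, int stride,
                        uint8_t *dst, int width )
{
    for( int x = 0; x < width; x++ )
    {
        const uint8_t *s = src + 2*x;
        /* pavgb-style: average each pair with upward rounding, then
           average the results.  This gives slightly different output
           than the single-rounding (a+b+c+d+2)>>2 box filter. */
        int top = ( s[0] + s[1] + 1 ) >> 1;
        int bot = ( s[stride] + s[stride+1] + 1 ) >> 1;
        dst[x] = (uint8_t)(( top + bot + 1 ) >> 1);
    }
}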

r895: Fix a bug in adaptive quantization stemming from the original implementation.  One subtle issue with this patch is that on win32, GCC incorrectly puts arrays of all zeroes into .bss even when told to align them, despite the fact that .bss is only guaranteed to be 4-byte aligned.  This causes a crash, of course.  We resolved this with an ugly hack: tack a “1” onto the end of the array to stop GCC from putting it in .bss.
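
A sketch of the hack (array names hypothetical); an aligned SSE load from the first array crashes because the alignment request is silently ignored:

/* Broken: all zeroes, so win32 GCC moves it to 4-byte-aligned .bss
   despite the requested 16-byte alignment. */
static float weights[16] __attribute__((aligned(16)));

/* The hack: one nonzero element forces the array into .data, where
   the alignment request is honored.  The extra element is never read. */
static float weights_fixed[17] __attribute__((aligned(16))) = { [16] = 1 };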

r896: Improve the permutation macros for easier use throughout x264’s assembly. (patch by Loren)

r897: Port the noise reduction assembly from libavcodec and improve it.  Speed up the C version too.  The SSSE3 assembly is roughly 25 times faster than the original C code.  (patch by me with help from Loren and Holger)
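
The flavor of the algorithm is simple enough to sketch.  This is a hedged approximation of the C version (names hypothetical, not the committed code): each DCT coefficient is shrunk toward zero by a per-position offset while a running sum tracks coefficient magnitudes, suppressing low-level noise without touching strong coefficients.

#include <stdint.h>

static void denoise_dct( int16_t *dct, uint32_t *sum,
                         const uint16_t *offset, int size )
{
    for( int i = 1; i < size; i++ ) /* skip the DC coefficient */
    {
        int level = dct[i];
        int sign = level >> 15;         /* 0 or -1 */
        level = (level + sign) ^ sign;  /* branchless abs() */
        sum[i] += level;                /* accumulate statistics */
        level -= offset[i];             /* shrink toward zero */
        if( level < 0 )
            level = 0;
        dct[i] = (level ^ sign) - sign; /* restore the sign */
    }
}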

r898-r899: Update the copyright headers across x264. (patch by me)

r900: Fix a subtle bug in Loren’s frame_init_lowres patch: in some cases “emms” might not be executed after the MMX code in frame_init_lowres ran, leaving the FPU stuck in MMX state, so an emms was added just in case.

r901: Lots of mini-optimizations, one by Holger, the rest by me.

r902: Considerable cleanup of the SSD assembly functions to simplify the code. (patch by Loren)

r903: A complete rewrite of the bitstream functions from the ground up.  This vastly faster bitstream writer uses 32-bit and 64-bit chunks instead of 8-bit ones, which eliminates the need for any loop in bitstream writing.  The 32-bit writer is quite similar to ffmpeg’s.  The 64-bit writer uses a particularly ingenious system whereby the writer only writes 32 bits at a time, but keeps up to 64 bits in a storage chunk; this means that any variable-length code can be written unconditionally to the chunk, since there are always at least 32 bits of space left, and the write to the bitstream is done afterwards.  Because of this, the 64-bit bitstream writer only needs a single if statement in its write function; not even an else. (patch by me with help from Loren)
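
To make the trick concrete, here’s a minimal sketch of the 64-bit approach (struct and function names are mine, not x264’s actual bs_t API):

#include <stdint.h>

typedef struct {
    uint64_t cur; /* bit accumulator; valid bits live in the low 'n' bits */
    int      n;   /* number of valid bits currently buffered */
    uint8_t *p;   /* output position in the bitstream buffer */
} bs64_t;

/* Append a code of 1..32 bits (value in the low bits of 'val').
   At most 63 bits are ever buffered, so the append itself is always
   unconditional; flushing the oldest 32 bits takes a single 'if'. */
static void bs64_write( bs64_t *s, uint32_t val, int bits )
{
    s->cur = (s->cur << bits) | val;
    s->n  += bits;
    if( s->n >= 32 )
    {
        s->n -= 32;
        uint32_t out = (uint32_t)(s->cur >> s->n); /* oldest 32 bits */
        s->p[0] = out >> 24; /* big-endian store, written portably */
        s->p[1] = out >> 16;
        s->p[2] = out >> 8;
        s->p[3] = out;
        s->p += 4;
    }
}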

r904: The Golomb functions were far more complicated than necessary; with the new bitstream writer, many of the special cases were no longer useful speed-wise, and many of the branches were unnecessary in most calls to those functions.  (patch by Loren with some changes by me)
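
As an illustration of why the special cases evaporate, here’s a hedged sketch of an unsigned exp-Golomb write built on the bs64_write sketch above (not x264’s actual bs_write_ue):

/* Valid for codes up to 32 bits total (i.e. val+1 < 2^16 here);
   longer codes would need a split write. */
static void bs64_write_ue( bs64_t *s, unsigned val )
{
    unsigned tmp = val + 1;
    int len = 32 - __builtin_clz( tmp ); /* number of bits in val+1 */
    /* len-1 leading zeroes followed by the len bits of val+1,
       emitted as one unconditional (2*len-1)-bit write. */
    bs64_write( s, tmp, 2*len - 1 );
}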

r905: When a large static table is stuck in a .h file, it’s duplicated in every single C file that includes and uses it.  This is a waste of memory; it’s better to define it once in a C file and use “extern” to reference it from other locations.  This diff de-duplicates the VLC tables and additionally shrinks them from 16-bit to 8-bit codes, since none of the VLC codes have more than 8 significant bits.  (patch by Loren, change to 8-bit by me)
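
A minimal sketch of the pattern (file and table names hypothetical):

/* vlc_tables.h -- declaration only; safe for every C file to include. */
#include <stdint.h>
extern const uint8_t vlc_table[4];

/* vlc_tables.c -- the single definition, linked exactly once. */
#include "vlc_tables.h"
const uint8_t vlc_table[4] = { 0x01, 0x03, 0x05, 0x0B }; /* 8-bit codes */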

r906: Add support for PCM macroblocks.  This type of macroblock is *completely* uncompressed and bypasses the CABAC entropy coder.  As such, each PCM block is always 384 bytes: 256 luma samples plus 128 chroma samples.  The advantage of this is that sometimes, given the current state of the CABAC encoder, at very low QPs (0-5) and in lossless mode, it’s possible for a coded block to exceed 384 bytes, in which case PCM would have been the better option.  When rate-distortion optimization is enabled, x264 now considers PCM as a possible mode.  (patch by me)
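
The decision itself reduces to a size comparison.  A deliberately simplified sketch (hypothetical helper; the real RD check weighs full rate-distortion cost, where PCM’s distortion is zero):

#define PCM_MB_BITS (384 * 8) /* 256 luma + 128 chroma samples, 8 bits each */

/* If entropy-coding the macroblock would cost more bits than storing
   it raw, PCM wins. */
static int pcm_is_cheaper( int coded_mb_bits )
{
    return coded_mb_bits > PCM_MB_BITS;
}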

05/09/2008 (2:51 pm)

Variable length coding and you

Let’s say you want to write a number to a bitstream. The decoder is expecting a number, so it knows exactly what to look for. But how do you format this number?

You could write 16 bits no matter what, putting a max of 65535 on this number. This would also waste tons of bits though; what if the number you’re writing here with your encoder is almost always 0, 1, or 2? Writing 16 bits would seem ridiculous, so it only makes sense to use a variable-length code (VLC). But how would the decoder know how *long* the code is? Here’s a simple example:

5 -> 101
2 -> 10

Both will look the same to the decoder: after reading “10”, it has no way of knowing whether that was the complete code for 2 or just the first two bits of the code for 5, since nothing marks where a code ends.  But there’s a solution to this problem; a common one lies in the form of exponential Golomb codes.  Here’s an example:

0 -> 1
2 -> 011
5 -> 00110
8 -> 0001001

The code for a number x is x+1 written in binary, preceded by one zero for every bit after the first; the decoder counts the leading zeroes to learn how many bits follow the terminating 1.  This may seem inefficient, but in a sense one has no choice but to have some sort of scheme like this.  Of course, exponential Golomb codes are only optimal for one specific probability distribution; an optimal variable-length coding system has code lengths chosen based on the probability of each value.  This is the primary reason why MPEG-4 Part 2 (Xvid/DivX) is superior to MPEG-2: the VLC tables were much better chosen.
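
Since the construction is easy to get wrong, here’s a tiny self-contained sketch that reproduces the codes above (string output for clarity; a real encoder writes bits):

#include <stdio.h>

static void print_ue( unsigned x )
{
    unsigned tmp = x + 1;
    int len = 0;
    for( unsigned t = tmp; t; t >>= 1 )
        len++;                              /* number of bits in x+1 */
    for( int i = 0; i < len - 1; i++ )
        putchar( '0' );                     /* leading zeroes */
    for( int i = len - 1; i >= 0; i-- )
        putchar( '0' + ((tmp >> i) & 1) );  /* x+1 in binary */
    putchar( '\n' );
}

int main( void )
{
    unsigned vals[] = { 0, 2, 5, 8 };
    for( int i = 0; i < 4; i++ )
        print_ue( vals[i] ); /* prints 1, 011, 00110, 0001001 */
    return 0;
}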

Of course, the ideal system would be to allow custom VLC tables in the header of the video file; it’s quite trivial to design a two-pass encoder that creates optimal VLC tables for a given video, or even a given frame.  H.264 partially does away with this problem with CABAC: while there would still be some benefit to including a custom initial CABAC state, the arithmetic coder adapts quickly to the statistics of the video.  This adaptation isn’t complete, though; CABAC still relies on variable-length codes, because it converts numbers into strings of binary digits (“binarization”), which are then fed to the context-based arithmetic coder (the “binary” part of CABAC).  One result of this is that even if a B-frame is made up entirely of intra blocks, the arithmetic coder can’t completely adapt, and each block will still cost more bits than it would in an I-frame; the binarization of the macroblock type isn’t chosen in a way that would allow such “complete” adaptation.