Diary Of An x264 Developer

06/30/2009 (3:28 pm)

Chroma encoding revisited

Chroma has always been a ripe target for optimization. We have to perform transform+quantization on every block, but the vast majority of blocks end up having not a single nonzero coefficient to code, so it seems as if we wasted our time doing all that arithmetic only to find out that there was no information there anyways. But we can’t just skip it, because the few times that there are coefficients, they are very important. Part of this problem is unique to H.264, which has a quite curious method of encoding its chroma, which I will describe here for those not familiar with it.

For each chroma channel in the current macroblock, four 4×4 transforms are performed on the residual, together covering an 8×8 block. Then, the DC coefficients of each transform are collected and put into a separate 2×2 block, which is transformed again with a Hadamard transform. In the bitstream, the encoder can signal three modes, which apply to both chroma channels. The first mode, 0, simply says there is no chroma data. The second mode, 1, says there is DC data, but not AC data (the rest of the coefficients that weren’t put into that special 2×2 block). The third mode, 2, says that there is both DC and AC data. Since having AC but not DC data is extremely rare, there is no special mode for this.
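As a rough illustration of the scheme just described, here is a minimal C sketch of the 2×2 Hadamard transform on the DC coefficients and of how the three modes could be derived from the quantized coefficients. This is not x264's actual code: the function names, the flat array layout, and the choice of integer types are assumptions made for this example, and quantization itself is left out.

```c
#include <stdint.h>

/* In-place 2x2 Hadamard transform of one channel's four DC coefficients,
 * given in raster order (top-left, top-right, bottom-left, bottom-right). */
static void chroma_dc_hadamard_2x2(int32_t dc[4])
{
    int32_t s01 = dc[0] + dc[1], d01 = dc[0] - dc[1];
    int32_t s23 = dc[2] + dc[3], d23 = dc[2] - dc[3];
    dc[0] = s01 + s23;
    dc[1] = d01 + d23;
    dc[2] = s01 - s23;
    dc[3] = d01 - d23;
}

/* Returns 1 if any of the n coefficients is nonzero, 0 otherwise. */
static int any_nonzero(const int16_t *coef, int n)
{
    for (int i = 0; i < n; i++)
        if (coef[i])
            return 1;
    return 0;
}

/* Chroma mode for the macroblock, shared by both chroma channels.
 * dc: 2 channels x 4 quantized DC coefficients (hypothetical flat layout).
 * ac: 2 channels x 4 blocks x 15 quantized AC coefficients (same assumption). */
static int chroma_mode(const int16_t dc[2 * 4], const int16_t ac[2 * 4 * 15])
{
    if (any_nonzero(ac, 2 * 4 * 15))
        return 2; /* AC data present; DC is signalled along with it */
    if (any_nonzero(dc, 2 * 4))
        return 1; /* DC data only */
    return 0;     /* no chroma data at all */
}
```

Note that the rare "AC but no DC" case simply falls under mode 2 in this sketch, which matches the absence of a dedicated mode for it.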

05/26/2009 (5:21 pm)

The art of commit messages

Filed under: assembly,development,speed,x264 ::

The commit message is one of the most important tools a developer has: in just a few lines, he can communicate a great deal of information to a great variety of people.  This group includes a vast swath of eager but relatively nontechnical users who merely want to know what was improved in the most recent update.  Additionally, this group includes a number of technical users who may look at the code from time to time and perhaps submit patches intermittently.  This group includes the other developers, who on a larger project may not be entirely aware of everything being worked on.  This group even includes the developer himself, as he will probably not remember today’s change in detail a year from now.

So what can a developer do to make a commit message relatively succinct but still satisfy the needs of all of these people?  Let’s take the commit message I wrote for this year’s most significant patch so far, Holger’s overhaul of a large part of x264’s most important assembly code.

11/14/2008 (2:50 am)

A simple optimization

I’ll be posting about the Nehalem optimizations soon, but in the meantime, a short and simple post.

You want to find the last nonzero value in an array of 16 16-bit values (DCT coefficients, in this case). How do you do this really quickly, especially in the case that most of the array is expected to be zero? Well, here’s my way (l[16] is the array):

i_last = 0;
if( *(uint64_t*)(l+i_last+8)|*(uint64_t*)(l+i_last+12) ) i_last += 8;
if( *(uint64_t*)(l+i_last+4) ) i_last += 4;
if( *(uint32_t*)(l+i_last+2) ) i_last += 2;
if( l[i_last+1] ) i_last++;

This assumes the array isn’t all zero, but where I used this code, we already knew that it wasn’t, so that wasn’t an issue.

Only 4 conditionals (cmov and setne in the compiled asm) to find the exact index of the coefficient. Over twice as fast as the previous code:

for( j = i_count - 4; j >= 4; j -= 4 ) if( *(uint64_t*)(l+j) ) break;
for( i = j; i < j+4; i++ ) if( l[i] ) i_last = i;

And I would guess it is another factor of 2 or so faster than the naive implementation:

for( i = 0; i < 16; i++ ) if( l[i] ) i_last = i;
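For anyone who wants to experiment with the branch-light search, the following is a self-contained sketch of the same idea next to the naive loop. It is an adaptation, not the code as used in x264: the helper names `load64` and `load32` are made up for this example, and the wide loads go through `memcpy` rather than pointer casts so the sketch stays well-defined regardless of alignment and aliasing rules (compilers typically turn these into plain loads). As in the original, the array is assumed to contain at least one nonzero value.

```c
#include <stdint.h>
#include <string.h>

/* Hypothetical helpers: read 4 or 2 consecutive coefficients as one integer. */
static uint64_t load64(const int16_t *p)
{
    uint64_t v;
    memcpy(&v, p, sizeof v);
    return v;
}

static uint32_t load32(const int16_t *p)
{
    uint32_t v;
    memcpy(&v, p, sizeof v);
    return v;
}

/* Index of the last nonzero coefficient in l[0..15].
 * Precondition: the array is not all zero. */
static int last_nonzero_16(const int16_t l[16])
{
    int i_last = 0;
    if (load64(l + 8) | load64(l + 12)) i_last += 8; /* anything in the upper half? */
    if (load64(l + i_last + 4))         i_last += 4; /* upper quarter of that half? */
    if (load32(l + i_last + 2))         i_last += 2; /* upper pair of that quarter? */
    if (l[i_last + 1])                  i_last++;    /* second element of the pair? */
    return i_last;
}

/* Reference implementation: scan everything and remember the last hit. */
static int last_nonzero_naive(const int16_t l[16])
{
    int i_last = 0;
    for (int i = 0; i < 16; i++)
        if (l[i])
            i_last = i;
    return i_last;
}
```

Because each test only asks whether a group of coefficients is zero or not, the result does not depend on byte order.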

10/31/2008 (6:17 pm)

x264 revision 1000 party!

Filed under: development,x264 ::

I promised cake in the previous post.

And cake there was.

Read the full story here.

10/22/2008 (12:41 pm)

In the pipeline, part 2

From the local commit git log:

10/13/2008 (8:32 pm)

More updates on x264 development

Filed under: development,GSOC,H.264,speed,x264 ::

It’s been quite some time, hasn’t it? I’ve managed to go a few months without updates here, mostly because of my work at Avail and school, but also because I’ve been intentionally putting it off. So, I’m going to actually put in some updates.

x264 now supports Predictive Lossless, the new lossless format introduced in the 2007 revision of the H.264 standard. The compression improvement is considerable; about 4-25% depending on source, with numbers generally higher in intra-only compression. There is no significant speed difference; if anything, it is faster, since lower bitrate means less time spent in bitstream coding. The downside is that only a single decoder in the entire world currently supports it correctly: CoreAVC. Not even JM supports it correctly (its i16×16 prediction support is bugged). A patch is available on the ffmpeg mailing list (simply search for Predictive Lossless or similar) to add support in libavcodec for the format. The strategy here is that since the main reason any decoder anywhere supported lossless to begin with is likely because of x264, this change should force the adoption of the much-superior Predictive Lossless profile.

I’ve begun merging much of the Google Summer of Code work; next on the menu will be the rest of Holger’s assembly and Joey Degges’ much improved multi-ref p8×8 partition search algorithm.

I also have acquired access to a pre-release Nehalem machine through Avail Media; this will allow for a series of Nehalem-specific speed optimizations to be committed on the day of Nehalem’s official release, giving a significant speed boost to users of the new processor in addition to the already-enormous benefit it gives.

x264 is now used by Facebook for their internet video; at this point in time, 25% of all new Facebook videos are encoded using x264. This will likely soon increase to 100% as VP6 is phased out in favor of H.264.

x264 is now at revision 999: only one left to 1000, and indeed we have special plans… more on that when it happens!

07/10/2008 (2:34 pm)

A summary of recent revisions

I haven’t written a post here for quite some time, so I figured I’d get back into the swing of things with an update on some recent revisions.  Further information can be found on the official git repository page.

r891: Instead of checking for the end of the bitstream on every single write, check before encoding each macroblock–and instead of silently truncating the frame upon reaching the end of the malloced memory, reallocate with more room. (patch by me)

r892: A patch by Gabriel for compilation on MSVC and other crappy compilers that don’t maintain mod16 stack alignment–disable all functions that require it on such platforms.  This results in a visible CPU flag (“slow_mod4_stack”) to warn the user that their build is crippled.

r893: Various micro-optimizations by me, along with a very cool little idea that saves quite a bit of time in probe_skip.  The idea here is that the probe_skip function *almost never* terminates during the chroma check, but you can’t skip it completely, since if you choose to skip a block before looking at the chroma, and it turned out there was significant detail there that you ignored, there might be noticeable artifacting as a result.  The solution was to do a quick Sum of Squared Differences (SSD) check before doing the actual probe_skip on each chroma plane; if low enough, skip the check.  Even an incredibly low threshold was sufficient for skipping the check nearly half the time yet never changing the results.

r894: Assembly for the lowres interpolation filter used for frametype decision–drastically increases its speed.  The C was modified in order to match the assembly version, which had slightly different rounding.  I originally wrote an MMX, SSE2, SSE3, and SSSE3 version of this patch, but Loren found an even faster method of doing it and rewrote most of it from scratch before committing it.

r895: Fix a bug in adaptive quantization stemming from the original implementation.  One subtle issue with this patch is that on win32, GCC incorrectly puts arrays of all zeroes into .bss even when told to align them–despite the fact that .bss is only labeled as 4-byte aligned.  This causes a crash, of course.  We resolved this with an ugly hack–tack a “1” onto the end of the array to stop GCC from putting it in .bss.

r896: Improve the permutation macros for easier use throughout x264’s assembly. (patch by Loren)

r897: Port the noise reduction assembly from libavcodec and improve it.  Speed up the C version too.  The SSSE3 assembly is roughly 25 times faster than the original C code.  (patch by me with help from Loren and Holger)

r898-r899: Update the copyright headers across x264. (patch by me)

r900: Fix a subtle bug in Loren’s frame_init_lowres patch; in some cases the FPU might not be re-initialized with “emms” after running frame_init_lowres, so an emms was added just in case.

r901: Lots of mini-optimizations, one by Holger, the rest by me.

r902: Considerable cleanup of the SSD assembly functions to simplify the code. (patch by Loren)

r903: A complete rewrite of the bitstream functions from the ground up.  This vastly faster bitstream writer uses 32-bit and 64-bit instead of 8-bit chunks.  This eliminates the need for any loop in bitstream writing.  The 32-bit writer is quite similar to ffmpeg’s 32-bit writer.  The 64-bit writer uses a particularly ingenious system whereby the writer only writes 32 bits at a time, but keeps up to 64 bits in a storage chunk–this means that any variable-length code can be written unconditionally to the chunk, since there’s always at least 32 bits of space left: the write to the bitstream is done afterwards.  Because of this, the 64-bit bitstream writer only needs a single if statement in its write function; not even an else. (patch by me with help from Loren)

r904: The golomb functions were far more complicated than necessary–with the new bitstream writer changes many of the special cases were no longer useful speed-wise.  Additionally, many of the branches were not necessary in most calls to those functions.  (patch by Loren with some changes by me)

r905: When a large static table is stuck in a .h file, it’s duplicated in every single C file that includes and uses it.  This is a waste of memory–it’s better to define it once in a C file and use “extern” to address it from other locations.  This diff de-duplicates VLC tables and additionally reduces them from using 16-bit to using 8-bit codes, since none of the VLC codes have more than 8 significant bits.  (patch by Loren, change to 8-bit by me)

r906: Add support for PCM macroblocks.  This type of macroblock is *completely* uncompressed and bypasses the CABAC entropy coder.  As such, each PCM block is always 384 bytes (256 luma samples plus two 8×8 chroma blocks of 64 samples each, at 8 bits per sample).  The advantage is that at very low QPs (0-5) and in lossless mode, depending on the current state of the CABAC encoder, it’s possible for a normally coded block to exceed 384 bytes in size, in which case PCM would have been the better option.  When rate-distortion optimization is enabled, x264 considers PCM as a possible option.  (patch by me)
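To make the 64-bit bitstream writer described under r903 more concrete, here is a minimal sketch of the idea: codes of up to 32 bits are appended unconditionally to a 64-bit chunk, and a single `if` flushes 32 bits whenever the chunk is at least half full. The struct and function names (`bitwriter_t`, `bw_init`, `bw_write`, `bw_flush`) are invented for this example and the code is an illustration under those assumptions, not the writer as committed. Output is written byte by byte in big-endian order so the sketch does not depend on host endianness.

```c
#include <stdint.h>

typedef struct {
    uint8_t *p;    /* next output byte */
    uint64_t cur;  /* storage chunk; newest bits sit at the low end */
    int      left; /* free bits in cur; stays in (32, 64] between calls */
} bitwriter_t;

static void bw_init(bitwriter_t *bw, uint8_t *buf)
{
    bw->p = buf;
    bw->cur = 0;
    bw->left = 64;
}

/* Append the low `count` bits of `bits` (0 <= count <= 32; the bits above
 * `count` must be zero). There is always room for 32 more bits, so the
 * append is unconditional and the only branch is the flush. */
static void bw_write(bitwriter_t *bw, int count, uint32_t bits)
{
    bw->cur = (bw->cur << count) | bits;
    bw->left -= count;
    if (bw->left <= 32) {
        /* At least 32 bits are pending: emit the oldest 32 of them. */
        uint32_t w = (uint32_t)(bw->cur >> (32 - bw->left));
        bw->p[0] = (uint8_t)(w >> 24);
        bw->p[1] = (uint8_t)(w >> 16);
        bw->p[2] = (uint8_t)(w >> 8);
        bw->p[3] = (uint8_t)w;
        bw->p += 4;
        bw->left += 32;
    }
}

/* Write out the remaining 0..31 pending bits, zero-padded to a whole byte.
 * Returns a pointer one past the last byte written. */
static uint8_t *bw_flush(bitwriter_t *bw)
{
    int used = 64 - bw->left;
    uint32_t w = used ? (uint32_t)(bw->cur << (32 - used)) : 0;
    for (; used > 0; used -= 8) {
        *bw->p++ = (uint8_t)(w >> 24);
        w <<= 8;
    }
    bw->cur = 0;
    bw->left = 64;
    return bw->p;
}
```

Since `bw_write` always stores four bytes at a time, the output buffer in this sketch needs a few bytes of slack beyond the final stream length. For example, writing the 3-bit code 101, the 5-bit code 00110, and the 32-bit value 0xDEADBEEF, followed by a flush, yields the five bytes A6 DE AD BE EF.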

05/29/2008 (9:32 am)

Working at Avail Media

Filed under: avail,development,x264 ::

As some of you know, this summer I am living in Kalispell, Montana and working at Avail Media, a relatively small broadcast/IPTV company that caters primarily to smaller telecoms looking to set up IPTV and VOD services for reasonable prices. What’s unique about Avail, however, is that they don’t use hardware encoders; they use x264! They’re definitely not the only company using x264–other notable users include Google and Facebook–but I know of nobody else using x264 for live HD television. As the company which Loren Merritt (pengvado) worked for, they are responsible for quite a number of x264’s broadcast-related features, such as its interlaced encoding support.

Of course, broadcast has its own difficulties that make it much more of a challenge than ordinary offline encoding. For one, encoding has to run in realtime, quite a challenge when dealing with 1080i input. This is assisted by a patch written by Loren and improved by me, “speed control,” which automatically reconfigures x264’s settings on the fly to run at exactly the specified speed. It even communicates with the encoding/muxing frontend to know exactly how much time it really has left.

Another major issue is that of the VBV; the stream must strictly obey the buffer size or else packets might end up being dropped, corrupting the video stream. This is a hard problem, considering that x264’s VBV, especially in 1pass mode, is not very compliant. This has been a repeated subject of research by me even before I came here. Gabriel from Joost has also spent quite a bit of time on it in the form of his 2pass VBV patch, which in its latest form also improves 1pass VBV ratecontrol.

The biggest issue is the quality of the sources one comes across in broadcast–or better stated, the lack thereof. Much input is in the form of 18 megabit CBR MPEG-2 streams which are of such low quality that motion search above DIA/HEX is nearly useless. This is because in any scene with sufficient motion to require a complex motion search, the stream has already gone completely blocky. Re-encoding to 6-7 megabit H.264 doesn’t make it much better, either! This mess of input also results in a very large percentage of intra blocks in the output, which makes ratecontrol that much more difficult. Add this to the relative inefficiency of x264’s interlaced encoding and things get even worse.

Yet despite the difficulties of broadcast, Avail has managed to get it to work; quite an amazing accomplishment, proving once again that x264 isn’t just for ordinary offline 2pass encoding.

05/06/2008 (2:21 am)

x264 development: a six month retrospective

Filed under: development,summary,x264 ::

These past 6 months have consisted mostly of bugfixes, vast speed improvements, and the beginning of what will hopefully be a series of psychovisual optimizations.

How can I best describe the speed boost? Numbers would do the best job, I think. All values are my internal development build compared to the current version from 6 months ago. Adaptive quantization is disabled to make the results comparable. CRF is used for all encodes.

Max speed settings (no B-frames, subme 1, analyse none, me dia): 29.5% speed boost
Near-max speed settings (3 B-frames, subme 1, analyse none, me dia): 24.5% speed boost
Medium speed settings (3 B-frames, subme 5): 18.5% speed boost
Slow speed settings (3 B-frames, subme 6, b-rdo, me umh, ref 4): 35% speed boost
Very slow speed settings (16 B-frames, subme 7, b-rdo, me esa, ref 16, trellis 2, no fast-pskip, partitions all, mixed-refs): 52% speed boost
Lossless: 15% speed boost

Notable new features:

1. Psy-based adaptive quantization, for improving quality in flat areas of the frame by taking bits from more complex areas of the frame.
2. --me tesa, transformed exhaustive search. Converted from a ridiculously slow initial algorithm by me to a highly optimized thresholded solution by Loren Merritt, resulting in an even slower alternative to --me esa.
3. A massive preprocessor-based abstraction layer for assembly, allowing complete abstraction between 32-bit and 64-bit assembly and even automatic handling of everything from stack offsets to macros that permute their arguments and SSE/MMX abstraction. It was written from scratch by Loren Merritt and drastically simplifies all assembly development.

Notable speed increases:

1. Altivec implementations of various functions; much faster PowerPC encoding.
2. Cacheline optimization for SAD-based motion search. Also for luma MC.
3. Much faster exhaustive motion search.
4. Lots more SSE2 assembly. And SSSE3 too. And even more SSE2. Oh wait, more.
5. Skipping stuff.
6. Much much faster CABAC encoding.
7. Tons of small optimizations all over x264. Yes, there’s lots more of these. And more of these. And even more… wait, there’s more here.
