Diary Of An x264 Developer

11/15/2009 (5:15 am)

The spec-violation hall of shame

Filed under: H.264, fail, weighted prediction, x264 ::

People generally make the assumption that when they make an MP3 file, an MP3 player will play it.  When they save a PNG file, a browser will display it.  And when they save a PDF, any PDF application will view it.

And so we assume that H.264 decoders abide by the specification.  If they claim to support Main Profile, they’ll play back Main Profile.  There’s a whole set of rather exhaustive official test cases to validate compliance, which would hopefully ensure that all decoders would work with even the most bizarre combinations of features that fall under the subset of what they claim to support.

Among other things, this makes development dramatically easier: imagine how ridiculous it would be to have to test every change on dozens upon dozens of decoders to make sure that they all still worked.  Instead, we can just test against the official reference to ensure that our streams are still valid.

A week ago Dylan finished up the weighted P-frame prediction Google Summer of Code project.  It passed the JM verification test and it worked with libavcodec too.  We also tested a few other decoders to make sure; all were fine.

And then we committed it.

We immediately got reports of artifacting in a wide variety of decoders, despite all of them claiming to support the feature.  As such, here is a hall of shame of products that are distributed with broken decoders that do not abide by the spec.

Hopefully by making the world’s most popular H.264 encoder generate such problematic bitstreams by default, we will force the makers of these decoders to fix their software.  I’ll update the list as the decoders are fixed (I’ve already contacted DivX [Mainconcept owners] and Adobe).

NB: this is not a new feature; it’s been around for 6 years without any relevant change in the specification, so nobody has any excuse.

Adobe Flash Player [fixed in 10.1, thanks Adobe!]

Apple TV

CoreCodec CoreAVC [fixed in as-yet-unreleased 2.0]

Want to see if your decoder is broken?  Download this sample and watch it, especially frames 310-330.  If you notice obvious artifacting, compare it to the output of a working decoder (e.g. libavcodec) for reference.  If you get issues and your decoder isn’t on the list, post your results in the comments.

Edit/Note: It seems that earlier we were testing with a stream that was actually invalid; the reference decoder didn’t complain, but a stream validator found an obvious issue (which we fixed in x264 r1342).  As such, the list is a lot shorter now.

10/31/2009 (7:12 pm)

The other hidden cycle-eating demon: code cache misses

Filed under: assembly, gcc, speed ::

A while back I talked about L1 data cache misses which, while individually not a big deal, can slow down your program in the same fashion that a swarm of angry bees stings a honey-thieving bear to death.  This time we’ll look at an even more insidious evil: the L1 code cache miss.

With data caches, despite what I said in the previous article, you still have a good bit of control.  You can prefetch data to them explicitly using the prefetch instructions.  You control memory allocation and can make all sorts of changes to potentially improve access patterns.  Every single memory access is made explicitly by your code.

But it isn’t the same with the L1 code cache (L1I).  You can’t prefetch to it at all; the prefetch instructions go to the L1 data cache, not the L1 code cache.  Unless you write everything directly in assembly, you can’t control the allocation and layout of the code.  And you don’t control access to the code at all; it is accessed implicitly when it is run.

Many readers may have heard stories of programs running faster with gcc’s -Os (optimize for size) than -O2 or -O3 (optimize for speed); this is why.  Larger code size causes more L1I cache misses, puts more load on the L2->L1 memory load unit, and uses up L2 cache as well.  While the naive programmer may see great advantage to lots of loop unrolling or inlining, even timing the code may not be sufficient to prove that such code-size-increasing optimizations are worthwhile, since other parts of the program called afterwards could suffer due to evictions from the L1 instruction cache.


10/25/2009 (12:06 pm)

x264’s assembly abstraction layer: now free to use

Filed under: assembly, licensing, speed, x264 ::

About two years ago we decided to merge our two trees of 32-bit and 64-bit assembly; it had become a maintainability nightmare with many functions in one that were not in the other and so forth.  Loren Merritt did this merging by adding an abstraction layer that automatically handles differences in calling convention between platforms (and was extensible to future ones as well).  Thus began the story of common/x86/x86inc.asm.

Over the years, x86inc has grown tremendously as it gained a great deal of functionality.  It now supports x86_32, x86_64, and win64 (thanks to Anton Mitrofanov), which covers the three (by far) most popular x86 C calling conventions.  It also has macros to abstract between MMX and SSE functions, along with automatic handling of register permutations and other such useful features.

All of this serves to make x264ASM, as we call it, by far the best option for writing platform-independent x86 assembly.  It keeps the full optimization capabilities and powerful preprocessor of native assembly while having the platform-independence and convenience of intrinsics and inline assembly.

We’ve received many requests from non-GPL projects (and even commercial proprietary developers) to be able to use this abstraction layer.  And now, at the request of a certain Adobe engineer, it’s available under a permissive BSD-like license (specifically the ISC) for anyone to use.

Before you jump into the x264 code to see how it’s used, however, let’s go over some of the basics.  Note of course this explanation is no substitute for reading x86inc.asm itself.


10/18/2009 (3:04 am)

Open source collaboration done right

Filed under: benchmark, linux, speed, x264 ::

For years I’ve dealt with all sorts of horrific situations when dealing with open source.  Like software modules written by different teams on a badly managed commercial project, different open source projects tend to defensively program around each other’s flaws rather than actually submitting patches to fix them.  There are even entire projects built around providing API wrappers that simplify usage and fix bugs present in the original library.

In many cases people don’t even submit bug reports.  Sometimes they outright patch each other’s libraries–and don’t submit the patches back to the original project.  At best this leads to tons of bugs and security vulnerabilities being overlooked in the original project.  At worst this leads to situations like the Debian OpenSSL fiasco, in which the people patching the code don’t know enough about it to safely work with it (and don’t even talk to the people who do).

But enough ranting–let me talk about a success story.


10/04/2009 (4:43 am)

Why so many H.264 encoders are bad

If one works long enough with a large number of H.264 encoders, one might notice that a large number of them are pretty much awful.  This of course shouldn’t be a surprise: Sturgeon’s Law says that “90% of everything is crap”.  It’s also exacerbated by the fact that H.264 is the most widely-accepted video standard in years and has spawned a huge amount of software that implements it, thus generating more mediocre implementations.

But even this doesn’t really explain the massive gap between good and bad H.264 encoders.  Good H.264 encoders, like x264, can beat previous-generation encoders like Xvid visually at half the bitrate in many cases.  Yet bad H.264 encoders are often so terrible that they lose to MPEG-2!  The disparity wasn’t nearly this large with previous standards… and there’s a good reason for this.

H.264 offers a great variety of compression features, more than any previous standard.  This also greatly increases the number of ways that encoder developers can shoot themselves in the foot.  In this post I’ll go through a sampling of these.  Most of the problems stem from the single fact that blurriness seems good when using mean squared error as a mode decision metric.


09/10/2009 (6:36 pm)

iDCT rounding

Filed under: DCT, H.264, chroma, development, quantization, x264 ::

The quantization process in modern video encoders tends to make a lot of assumptions.  A common one is that of continuity and uniform step size–that, for example, if we are quantizing the value 2.5, both 2 and 3 will give equal distortion, being exactly 0.5 off from the correct value.  But this isn’t always true; in reality, we are working with an 8-bit range in each channel.  The inverse transform has to round our high-precision internal values to a small output range.

Normally, this isn’t a problem.  Since AC coefficients have (by definition) different output values for each output pixel, they serve to effectively dither the output of the iDCT.  But what happens when we don’t have any AC coefficients?
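
As a rough illustration of the rounding problem (not necessarily the exact case the rest of this post examines), consider a block with only a DC coefficient.  The sketch below assumes the final rounding step of the H.264 4x4 inverse transform, (x + 32) >> 6, which for a DC-only block gives every output pixel the same value.  The function names and sample values are made up for illustration.

```python
def dc_only_residual(dc):
    """Residual added to every pixel of a block that has only a DC coefficient.

    Assumption: the final iDCT rounding is (x + 32) >> 6, as in the H.264
    4x4 inverse transform; with no AC coefficients, all pixels get this value.
    """
    return (dc + 32) >> 6


def same_output(dc_a, dc_b):
    """True if two dequantized DC values reconstruct to identical pixels."""
    return dc_only_residual(dc_a) == dc_only_residual(dc_b)


# Far apart in the coefficient domain, identical after rounding.
print(same_output(40, 90))   # True
# Close together in the coefficient domain, different after rounding.
print(same_output(90, 100))  # False
```

Under this model, 64 consecutive DC values collapse onto one output level, so distance in the coefficient domain says little about distortion in the pixel domain when there are no AC coefficients to dither the result.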


09/02/2009 (10:58 pm)

The hidden cycle-eating demon: L1 cache misses

Filed under: Intel, speed, x264 ::

Most of the resources out there about optimizing cache access talk about L2 cache misses.  Which is sensible–the cost of an L2 miss is extraordinary, taking hundreds of cycles for an access to main memory.  By comparison, an L1 miss, costing just a dozen cycles, is nothing.  This is true on-chip as well; the memory->L2 prefetcher in modern processors is extremely sophisticated and is very good at avoiding cache misses.  It is also very efficient, making reasonably good use of the limited memory bandwidth available.  There are also dedicated prefetch instructions to hint the prefetcher to avoid future L2 misses.

But what about L1 misses?  There’s vastly less literature on them and the L2->L1 prefetchers are often barely documented or not even mentioned in official processor literature.  Explicit prefetch instructions are vastly less useful because the cost of the misses is low enough that the extra overhead of sending off a set of prefetches is often not worth it.  And yet in many cases–such as in x264–much more time is wasted on L1 misses than L2 misses.
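
To see how cheap misses can still dominate, here is a back-of-the-envelope sketch.  The 12-cycle L1 penalty comes from the "dozen cycles" above; the 200-cycle L2 penalty is an assumed stand-in for "hundreds of cycles", and the miss counts are purely hypothetical, not measurements from x264.

```python
L1_PENALTY = 12    # "a dozen cycles" per L1 miss
L2_PENALTY = 200   # assumed value for "hundreds of cycles" per L2 miss


def total_miss_cycles(miss_count, penalty_cycles):
    """Total cycles lost to a given number of misses at a fixed penalty."""
    return miss_count * penalty_cycles


def break_even_ratio(l1_penalty, l2_penalty):
    """How many L1 misses cost as much as one L2 miss."""
    return l2_penalty / l1_penalty


# Hypothetical workload: many cheap L1 misses, few expensive L2 misses.
l1_cycles = total_miss_cycles(1_000_000, L1_PENALTY)
l2_cycles = total_miss_cycles(20_000, L2_PENALTY)
print(l1_cycles, l2_cycles)
print(break_even_ratio(L1_PENALTY, L2_PENALTY))
```

With these assumed penalties, L1 misses cost more in total as soon as they outnumber L2 misses by roughly 17 to 1.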


08/24/2009 (7:54 pm)

Announcing ARM support

Filed under: ARM, GSOC, assembly, speed, x264 ::

Thanks to our Google Summer of Code student David Conrad (aka Yuvi), we now have ARM support in x264, along with a significant amount of SIMD acceleration via NEON, available on the Cortex A8 and A9 chips.  Yes, that’s right, x264 can now run on an iPhone.  Total performance increase from the NEON optimizations (so far) is about 280% on default settings.

With low power becoming more important and ARM chips increasing in speed dramatically (multi-core chips are already hitting silicon), being able to do high quality, high-speed realtime video encoding on ARM chips will become more and more important.  Staying ahead of the game as always, x264 will be the premier encoder on ARM as well.

One situation showing the usefulness of low-power encoding was brought up a month or two ago: a remote-control airplane enthusiast wanted to make his airplane broadcast camera footage over the cell network so that he can remote control it many miles away from his current location.  The cell network is generally low bandwidth, so he needs a high-efficiency video encoder.  But he can’t afford a powerful system; his airplane is already extremely low power and he needs an encoder that is both low-power and low-weight.  The ARM chip is perfect: it uses a fraction of a watt, almost no space, and now, he can run x264 on it.

Special thanks to Mans Rullgard for helping with lots of assembly questions and contributing the NEON deblocking code, originally used in the ffmpeg H.264 decoder.

Want to play with x264 on an ARM?  Get a Beagleboard.


08/09/2009 (10:41 pm)

Encoding animation

Filed under: H.264, SSIM, benchmark, ratecontrol, x264 ::

Note: I originally posted this a day earlier, but quickly retracted it when it was pointed out that I made a rather egregious error in the ffmpeg tests’ settings, so I re-did those and added a few more codecs.

Encoder comparisons are a dime a dozen these days, but there’s one thing I’ve almost never seen tested: cartoon sources.  Animated material has totally different characteristics from film and presents a whole separate set of encoding challenges.

First, we’ll start with what makes such video easy to compress.  Animation is mostly static; backgrounds are completely static with characters placed in front of them.  The characters themselves are mostly static as well; modern animation is usually at a significantly lower framerate than the actual video.  Furthermore, characters may stand still with only their mouths moving, dramatically reducing complexity.  Finally, animation is usually very clean, without any film grain.  All of this combines to seemingly make animation compression a very simple task.


08/06/2009 (11:36 pm)

A tree of thought

Filed under: GSOC, development, ratecontrol, x264 ::

“There is nothing like looking, if you want to find something… You certainly usually find something, if you look, but it is not always quite the something you were after.”

– J.R.R. Tolkien

About a year and a half ago, I had an idea: what if we made a graph of how each block of the video referenced other blocks temporally and used this graph to increase quality on blocks which are referenced a lot and lower it on those which are referenced less?  Clearly this would greatly improve average quality… but when I thought through it, the problem became messier and messier.  I decided to put it off until later.  I ended up making it a Google Summer of Code project for 2008, but that student disappeared after a few weeks of relative non-work and a hardly-working initial patch.  I mostly forgot about it; it was in the same category as explicit weighted prediction and MBAFF: messy things that might help, but that I didn’t want to do.  This idea in particular got filed away under the name “MB-tree.”
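
The basic idea can be sketched as a toy model.  This is not x264’s actual MB-tree algorithm, just an illustration of propagating importance backwards through a reference graph; the `references` layout, the function name, and the starting importance of 1.0 per block are all assumptions made for the example.

```python
def propagate_importance(references, num_frames, blocks_per_frame):
    """Count how much future content depends on each block.

    references maps (frame, block) to a list of (ref_frame, ref_block)
    pairs it predicts from.  Hypothetical layout: references must point
    to earlier frames only.
    """
    # Every block starts with an importance of 1.0 (itself).
    importance = [[1.0] * blocks_per_frame for _ in range(num_frames)]

    # Walk backwards so a block's importance is final before it is
    # passed on to the blocks it references.
    for frame in range(num_frames - 1, -1, -1):
        for block in range(blocks_per_frame):
            for ref_frame, ref_block in references.get((frame, block), []):
                if ref_frame >= frame:
                    raise ValueError("references must point to earlier frames")
                importance[ref_frame][ref_block] += importance[frame][block]
    return importance


# Three frames of two blocks each; block 0 is referenced in a chain.
example_refs = {
    (1, 0): [(0, 0)],
    (2, 0): [(1, 0)],
    (2, 1): [(1, 0)],
}
print(propagate_importance(example_refs, num_frames=3, blocks_per_frame=2))
# [[4.0, 1.0], [3.0, 1.0], [1.0, 1.0]]
```

An encoder could then spend more bits on blocks with high importance and fewer on those nothing depends on, which is where the messiness described above begins.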

