Diary Of An x264 Developer

01/17/2010 (10:40 pm)

What’s coming next in x264 development

As seen in the previous post, a whole category of x264 improvements has now been completed and committed.  So, many have asked–what’s next?  Many major features have, from the perspective of users, seemingly come out of nowhere, so hopefully this post will give a better idea as to what’s coming next.

Big thanks to everyone who’s been helping with a lot of the changes outside of the core encoder.  This, by the way, is one of the easiest ways to get involved in x264 development; while learning how the encoder internals work is not necessarily that difficult, understanding enough to contribute useful changes often takes a lot of effort, especially since by and large the existing developers have already eliminated most of the low-hanging fruit.  By comparison, one can work on x264cli or the parts of libx264 outside the analysis sections with significantly less domain knowledge.  This doesn’t mean the work is any less difficult–only that it has a lower barrier to entry.

For most specific examples given below, I’ve put down an estimated time scale as an exercise in project estimation.  This is no guarantee as to when anything will be done, just a wild guess on my part, though it might serve as a personal motivator for the tasks that I’ve assigned to myself.  Don’t harass any of the other developers based on my bad guesses ;)

Do also note that the fact that a project has a time scale doesn’t necessarily mean that it will be finished at all: not everything that we plan ends up happening.  Many features end up sitting on the TODO list for months or even years before someone decides that they’re important enough to implement.  If your company is particularly interested in one of these features, I might be able to offer you a contract to make sure it gets done much sooner and in a way that best fits your use-case; contact me via email.

Read More…

01/13/2010 (10:23 pm)

x264: the best low-latency video streaming platform in the world

x264 has long held the crown as one of the best, if not the best, general-purpose H.264 video encoders.  With state-of-the-art psy optimizations and powerful internal algorithms, its quality and performance in “normal” situations are mostly unrivaled.

But there are many very important use-cases where this simply isn’t good enough.  All the quality and performance in the world do nothing if x264 can’t meet the other requirements of a given business.  Which brings us to today’s topic: low-latency streaming.

The encoding familiar to most users has effectively “infinite” latency: the output file is not needed by the user until the entire encode is completed.  This allows algorithms such as 2-pass encoding, which require that the entire input be processed before even a single frame of the final output is available.  This of course becomes infeasible for any sort of live streaming, in which the viewer must see the video some predictable amount of time after it reaches the encoder.  Which brings us to our first platform: broadcast television.

Read More…

12/06/2009 (1:15 am)

A curious SIMD assembly challenge: the zigzag

Filed under: assembly, development, speed, x264 ::

Most SIMD assembly functions are implemented in a rather straightforward fashion.  An experienced assembly programmer can spend 2 minutes looking at C code and either give a pretty good guess at how one would write SIMD for it–or equally–rule out SIMD as an optimization technique for that code.  There might be a nonintuitive approach that’s somewhat better, but one can usually get very good results merely by following the most obvious method.

But in some rare cases there is no “most obvious method”, even for functions that would seem extraordinarily simple.  These kinds of functions present an unusual situation for the assembly programmer: they find themselves looking at some embarrassingly simple algorithm–one which simply cries out for SIMD–and yet they can’t see an obvious way to do it!  So let’s jump into the fray here and look at one of these cases.

Read More…

11/15/2009 (5:15 am)

The spec-violation hall of shame

Filed under: H.264, fail, weighted prediction, x264 ::

People generally make the assumption that when they make an MP3 file, an MP3 player will play it.  When they save a PNG file, a browser will display it.  And when they save a PDF, any PDF application will view it.

And so we assume that H.264 decoders abide by the specification.  If they claim to support Main Profile, they’ll play back Main Profile.  There’s a whole set of rather exhaustive official test cases to validate compliance, which would hopefully ensure that all decoders would work with even the most bizarre combinations of features that fall under the subset of what they claim to support.

Among other things, this makes development dramatically easier: imagine how ridiculous it would be to have to test every change on dozens upon dozens of decoders to make sure that they all still worked.  Instead, we can just test against the official reference to ensure that our streams are still valid.

A week ago Dylan finished up the weighted P-frame prediction Google Summer of Code project.  It passed the JM verification test and it worked with libavcodec too.  We also tested a few other decoders to make sure; all were fine.

And then we committed it.

We immediately got reports of artifacting in a wide variety of decoders, despite all of them claiming to support the feature.  As such, here is a hall of shame of products that are distributed with broken decoders that do not abide by the spec.

Read More…

10/31/2009 (7:12 pm)

The other hidden cycle-eating demon: code cache misses

Filed under: assembly, gcc, speed ::

A while back I talked about L1 data cache misses which, while individually not a big deal, can slow down your program in the same fashion that a swarm of angry bees stings a honey-thieving bear to death.  This time we’ll look at an even more insidious evil: the L1 code cache miss.

With data caches, despite what I said in the previous article, you still have a good bit of control.  You can prefetch data to them explicitly using the prefetch instructions.  You control memory allocation and can make all sorts of changes to potentially improve access patterns.  Every single memory access is explicit by you in your code.
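As a rough illustration of the explicit control described above, here is a minimal sketch of a software prefetch in C.  It assumes GCC or Clang, whose `__builtin_prefetch` builtin emits the platform’s prefetch instruction; the function name and the `PREFETCH_DISTANCE` value are made up for this example and would need tuning on real hardware.  A linear scan like this is usually handled well by the hardware prefetcher anyway, so treat it purely as a demonstration of the mechanism, not as a recommended optimization.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical tuning constant: how many elements ahead to prefetch. */
#define PREFETCH_DISTANCE 64

/* Sum an array while hinting upcoming elements into the data cache. */
uint64_t sum_with_prefetch(const uint32_t *data, size_t n)
{
    assert(data != NULL || n == 0);
    uint64_t sum = 0;
    for (size_t i = 0; i < n; i++) {
        if (i + PREFETCH_DISTANCE < n) {
            /* GCC/Clang builtin: address, rw (0 = read), locality (0..3). */
            __builtin_prefetch(&data[i + PREFETCH_DISTANCE], 0, 3);
        }
        sum += data[i];
    }
    return sum;
}
```

Note that the prefetch is only a hint: the result of the function is the same with or without it, and only the timing can change.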

But it isn’t the same with the L1 code cache (L1I).  You can’t prefetch to it at all; the prefetch instructions go to the L1 data cache, not the L1 code cache.  Unless you write everything directly in assembly, you can’t control the allocation and layout of the code.  And you don’t control access to the code at all; it is accessed implicitly when it is run.

Many readers may have heard stories of programs running faster with gcc’s -Os (optimize for size) than -O2 or -O3 (optimize for speed); this is why.  Larger code size causes more L1I cache misses, more load on the L2->L1 memory load unit, and uses up L2 cache as well.  While the naive programmer may see great advantage to lots of loop unrolling or inlining, even timing the code may not be sufficient to prove that such code-size-increasing optimizations are worthwhile, since other parts of the program called afterwards could suffer due to evictions from the L1 instruction cache.

Read More…

10/25/2009 (12:06 pm)

x264’s assembly abstraction layer: now free to use

Filed under: assembly, licensing, speed, x264 ::

About two years ago we decided to merge our two trees of 32-bit and 64-bit assembly; it had become a maintainability nightmare with many functions in one that were not in the other and so forth.  Loren Merritt did this merging by adding an abstraction layer that automatically handles differences in calling convention between platforms (and was extensible to future ones as well).  Thus began the story of common/x86/x86inc.asm.

Over the years, x86inc has grown tremendously as it gained a great deal of functionality.  It now supports x86_32, x86_64, and win64 (thanks to Anton Mitrofanov), which covers the three (by far) most popular x86 C calling conventions.  It also has macros to abstract between MMX and SSE functions, along with automatic handling of register permutations and other such useful features.

All of this serves to make x264ASM, as we call it, by far the best option for writing platform-independent x86 assembly.  It keeps the full optimization capabilities and powerful preprocessor of native assembly while having the platform-independence and convenience of intrinsics and inline assembly.

We’ve received many requests from non-GPL projects (and even commercial proprietary developers) to be able to use this abstraction layer.  And now, at the request of a certain Adobe engineer, it’s available under a permissive BSD-like license (specifically the ISC) for anyone to use.

Before you jump into the x264 code to see how it’s used, however, let’s go over some of the basics.  Note of course this explanation is no substitute for reading x86inc.asm itself.

Read More…

10/18/2009 (3:04 am)

Open source collaboration done right

Filed under: benchmark, linux, speed, x264 ::

For years I’ve dealt with all sorts of horrific situations when dealing with open source.  Like software modules written by different teams on a badly managed commercial project, different open source projects tend to defensively program around each other’s flaws rather than actually submitting patches to fix them.  There are even entire projects built around providing API wrappers that simplify usage and fix bugs present in the original library.

In many cases people don’t even submit bug reports.  Sometimes they outright patch each other’s libraries–and don’t submit the patches back to the original project.  At best this leads to tons of bugs and security vulnerabilities being overlooked in the original project.  At worst this leads to situations like the Debian OpenSSL fiasco, in which the people patching the code don’t know enough about it to safely work with it (and don’t even talk to the people who do).

But enough ranting–let me talk about a success story.

Read More…

10/04/2009 (4:43 am)

Why so many H.264 encoders are bad

If one works long enough with a large number of H.264 encoders, one might notice that a large number of them are pretty much awful.  This of course shouldn’t be a surprise: Sturgeon’s Law says that “90% of everything is crap”.  It’s also exacerbated by the fact that H.264 is the most widely-accepted video standard in years and has spawned a huge amount of software that implements it, thus generating more mediocre implementations.

But even this doesn’t really explain the massive gap between good and bad H.264 encoders.  Good H.264 encoders, like x264, can beat previous-generation encoders like Xvid visually at half the bitrate in many cases.  Yet bad H.264 encoders are often so terrible that they lose to MPEG-2!  The disparity wasn’t nearly this large with previous standards… and there’s a good reason for this.

H.264 offers a great variety of compression features, more than any previous standard.  This also greatly increases the number of ways that encoder developers can shoot themselves in the foot.  In this post I’ll go through a sampling of these.  Most of the problems stem from the single fact that blurriness seems good when using mean squared error as a mode decision metric.
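To make the point about mean squared error concrete, here is a toy sketch in C.  The 8-pixel rows are made up for illustration and are not taken from any real encoder: one candidate throws away all detail, the other keeps the texture but displaces it by a pixel.  To a viewer the second arguably looks far more like the source, yet MSE scores it four times worse.

```c
#include <assert.h>
#include <stddef.h>

/* Mean squared error between a source row and a candidate reconstruction. */
double mse(const int *src, const int *rec, size_t n)
{
    assert(n > 0);
    double sum = 0.0;
    for (size_t i = 0; i < n; i++) {
        double d = (double)src[i] - (double)rec[i];
        sum += d * d;
    }
    return sum / (double)n;
}

/* Made-up source: a fine alternating texture. */
static const int SRC[8]     = {  0, 10,  0, 10,  0, 10,  0, 10 };
/* Candidate A: all detail removed, every pixel set to the mean. */
static const int BLURRED[8] = {  5,  5,  5,  5,  5,  5,  5,  5 };
/* Candidate B: the same texture, displaced by one pixel. */
static const int SHIFTED[8] = { 10,  0, 10,  0, 10,  0, 10,  0 };
```

Here `mse(SRC, BLURRED, 8)` is 25 while `mse(SRC, SHIFTED, 8)` is 100, so a mode decision driven purely by MSE would pick the blurred candidate.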

Read More…

09/10/2009 (6:36 pm)

iDCT rounding

Filed under: DCT, H.264, chroma, development, quantization, x264 ::

The quantization process in modern video encoders tends to make a lot of assumptions.  A common one is that of continuity and uniform step size–that, for example, if we are quantizing the value 2.5, both 2 and 3 will give equal distortion, being exactly 0.5 off from the correct value.  But this isn’t always true; in reality, we are working with an 8-bit range in each channel.  The inverse transform has to round our high-precision internal values to a small output range.

Normally, this isn’t a problem.  Since AC coefficients have (by definition) different output values for each output pixel, they serve to effectively dither the output of the iDCT.  But what happens when we don’t have any AC coefficients?
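One way to see the effect of output rounding on a block with no AC coefficients is the toy model below, written in C.  Everything in it is an assumption made for illustration: the flat reconstruction stands in for a DC-only inverse transform, `step` is a hypothetical step size expressed directly in pixel units rather than a real H.264 quantizer scale, and clamping to the 0..255 range is left out.  It is not necessarily the exact case the full post analyzes.

```c
#include <assert.h>
#include <stddef.h>

/* Sum of squared differences between source pixels and one flat value. */
double ssd_flat(const int *src, size_t n, double value)
{
    assert(n > 0);
    double sum = 0.0;
    for (size_t i = 0; i < n; i++) {
        double d = (double)src[i] - value;
        sum += d * d;
    }
    return sum;
}

/* Flat reconstruction of a DC-only block before the final rounding. */
double recon_exact(int pred, int level, double step)
{
    return (double)pred + (double)level * step;
}

/* The same reconstruction after rounding to an integer pixel value. */
int recon_rounded(int pred, int level, double step)
{
    double exact = recon_exact(pred, level, step);
    assert(exact >= 0.0);       /* the cast below truncates, so stay non-negative */
    return (int)(exact + 0.5);  /* round half up */
}
```

Take the source pixels {100, 100, 100, 101} with a prediction of 95 and a step of 2.1, so the ideal quantized level is exactly 2.5.  Before rounding, levels 2 and 3 reconstruct to 99.2 and 101.3 and give the same squared error (5.16 each, up to floating-point noise).  After rounding they become 99 and 101, with squared errors of 7 and 3: the two choices are no longer equivalent.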

Read More…

09/02/2009 (10:58 pm)

The hidden cycle-eating demon: L1 cache misses

Filed under: Intel, speed, x264 ::

Most of the resources out there about optimizing cache access talk about L2 cache misses.  Which is sensible–the cost of an L2 miss is extraordinary, taking hundreds of cycles for an access to main memory.  By comparison, an L1 miss, costing just a dozen cycles, is nothing.  This is true on-chip as well; the memory->L2 prefetcher in modern processors is extremely sophisticated and is very good at avoiding cache misses.  It is also very efficient, making reasonably good use of the limited memory bandwidth available.  There are also dedicated prefetch instructions to hint the prefetcher to avoid future L2 misses.

But what about L1 misses?  There’s vastly less literature on them and the L2->L1 prefetchers are often barely documented or not even mentioned in official processor literature.  Explicit prefetch instructions are vastly less useful because the cost of the misses is low enough that the extra overhead of sending off a set of prefetches is often not worth it.  And yet in many cases–such as in x264–much more time is wasted on L1 misses than L2 misses.
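The arithmetic behind that last claim is simple enough to sketch in C.  The per-miss penalties follow the rough figures above (a dozen cycles for L1, hundreds for L2); the miss counts in the note below are invented purely for illustration and are not measurements from x264.  The model also ignores any overlap between misses and other work, so it is an upper bound at best.

```c
#include <assert.h>
#include <stdint.h>

/* Total stall cycles attributable to one kind of cache miss. */
uint64_t miss_cost_cycles(uint64_t miss_count, uint64_t cycles_per_miss)
{
    assert(cycles_per_miss > 0);
    return miss_count * cycles_per_miss;
}
```

With a hypothetical 2,000,000 L1 misses at 12 cycles each, `miss_cost_cycles` gives 24,000,000 cycles; 50,000 L2 misses at 300 cycles each give 15,000,000.  The individually cheap misses win on sheer volume.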

Read More…
