Diary Of An x264 Developer

10/31/2009 (7:12 pm)

The other hidden cycle-eating demon: code cache misses

Filed under: assembly,gcc,speed ::

A while back I talked about L1 data cache misses which, while individually not a big deal, can slow down your program in the same fashion that a swarm of angry bees stings a honey-thieving bear to death.  This time we’ll look at an even more insidious evil: the L1 code cache miss.

With data caches, despite what I said in the previous article, you still have a good bit of control.  You can prefetch data into them explicitly using the prefetch instructions.  You control memory allocation and can make all sorts of changes to potentially improve access patterns.  Every single memory access is one you wrote explicitly in your code.
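
As a minimal sketch of what explicit data prefetching can look like from C, assuming GCC or Clang (which provide the `__builtin_prefetch` builtin): the function name `gather_sum` and the `PREFETCH_DISTANCE` value are made up for illustration, and a real distance would have to be tuned by measurement.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical tuning knob: how many iterations ahead to prefetch. */
#define PREFETCH_DISTANCE 8

/* Sum table entries selected by an index array.  The accesses to `table`
 * jump around, so we hint the entry needed a few iterations from now
 * into the data cache before we actually read it. */
uint64_t gather_sum(const uint32_t *table, const uint32_t *idx, size_t n)
{
    uint64_t sum = 0;
    for (size_t i = 0; i < n; i++) {
        if (i + PREFETCH_DISTANCE < n) {
            /* Arguments: address, 0 = read access, 3 = keep in all cache levels. */
            __builtin_prefetch(&table[idx[i + PREFETCH_DISTANCE]], 0, 3);
        }
        sum += table[idx[i]];
    }
    return sum;
}
```

The prefetch is only a hint and never changes the result; whether it helps depends entirely on the access pattern and the CPU.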

But it isn’t the same with the L1 code cache (L1I).  You can’t prefetch into it at all; the prefetch instructions go to the L1 data cache, not the L1 code cache.  Unless you write everything directly in assembly, you can’t control the allocation and layout of the code.  And you don’t control access to the code at all; it is accessed implicitly when it is run.

Many readers may have heard stories of programs running faster with gcc’s -Os (optimize for size) than with -O2 or -O3 (optimize for speed); this is why.  Larger code causes more L1I cache misses, puts more load on the L2->L1 memory load unit, and uses up L2 cache as well.  While the naive programmer may see great advantage in lots of loop unrolling or inlining, even timing the code may not be sufficient to prove that such code-size-increasing optimizations are worthwhile, since other parts of the program called afterwards could suffer from evictions from the L1 instruction cache.

10/25/2009 (12:06 pm)

x264’s assembly abstraction layer: now free to use

Filed under: assembly,licensing,speed,x264 ::

About two years ago we decided to merge our two trees of 32-bit and 64-bit assembly; it had become a maintainability nightmare with many functions in one that were not in the other and so forth.  Loren Merritt did this merging by adding an abstraction layer that automatically handles differences in calling convention between platforms (and was extensible to future ones as well).  Thus began the story of common/x86/x86inc.asm.

Over the years, x86inc has grown tremendously as it gained a great deal of functionality.  It now supports x86_32, x86_64, and win64 (thanks to Anton Mitrofanov), which covers the three (by far) most popular x86 C calling conventions.  It also has macros to abstract between MMX and SSE functions, along with automatic handling of register permutations and other such useful features.

All of this serves to make x264ASM, as we call it, by far the best option for writing platform-independent x86 assembly.  It keeps the full optimization capabilities and powerful preprocessor of native assembly while having the platform-independence and convenience of intrinsics and inline assembly.

We’ve received many requests from non-GPL projects (and even commercial proprietary developers) to be able to use this abstraction layer.  And now, at the request of a certain Adobe engineer, it’s available under a permissive BSD-like license (specifically the ISC) for anyone to use.

Before you jump into the x264 code to see how it’s used, however, let’s go over some of the basics.  Note of course this explanation is no substitute for reading x86inc.asm itself.

10/18/2009 (3:04 am)

Open source collaboration done right

Filed under: benchmark,linux,speed,x264 ::

For years I’ve dealt with all sorts of horrific situations when working with open source.  Like software modules written by different teams on a badly managed commercial project, different open source projects tend to defensively program around each other’s flaws rather than actually submitting patches to fix them.  There are even entire projects built around providing API wrappers that simplify usage and fix bugs present in the original library.

In many cases people don’t even submit bug reports.  Sometimes they outright patch each other’s libraries and don’t submit the patches back to the original project.  At best this leads to tons of bugs and security vulnerabilities being overlooked in the original project.  At worst this leads to situations like the Debian OpenSSL fiasco, in which the people patching the code don’t know enough about it to safely work with it (and don’t even talk to the people who do).

But enough ranting–let me talk about a success story.

10/04/2009 (4:43 am)

Why so many H.264 encoders are bad

If one works long enough with a large number of H.264 encoders, one might notice that many of them are pretty much awful.  This of course shouldn’t be a surprise: Sturgeon’s Law says that “90% of everything is crap”.  It’s also exacerbated by the fact that H.264 is the most widely-accepted video standard in years and has spawned a huge amount of software that implements it, thus generating more mediocre implementations.

But even this doesn’t really explain the massive gap between good and bad H.264 encoders.  Good H.264 encoders, like x264, can beat previous-generation encoders like Xvid visually at half the bitrate in many cases.  Yet bad H.264 encoders are often so terrible that they lose to MPEG-2!  The disparity wasn’t nearly this large with previous standards… and there’s a good reason for this.

H.264 offers a great variety of compression features, more than any previous standard.  This also greatly increases the number of ways that encoder developers can shoot themselves in the foot.  In this post I’ll go through a sampling of these.  Most of the problems stem from the single fact that blurriness seems good when using mean squared error as a mode decision metric.
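
To make that last point concrete, here is a toy sketch in C; it is an illustration only, not code from x264 or from the full post, and the sample values and the names `block_mse`, `src_block`, `blurred_block` and `shifted_block` are all invented. The source row alternates between two brightness levels (fine texture). One candidate reconstruction is completely flat at the average; the other keeps the texture at full strength but shifted by one sample, which for noise-like detail such as film grain looks far closer to the original than a flat patch does.

```c
#include <stddef.h>

/* Mean squared error between two blocks of n samples. */
double block_mse(const unsigned char *a, const unsigned char *b, size_t n)
{
    double sum = 0.0;
    for (size_t i = 0; i < n; i++) {
        double d = (double)a[i] - (double)b[i];
        sum += d * d;
    }
    return n ? sum / (double)n : 0.0;
}

/* Hypothetical toy data: one row of a textured source block. */
const unsigned char src_block[8]     = {100, 140, 100, 140, 100, 140, 100, 140};
/* Candidate A: all detail blurred away to the average value. */
const unsigned char blurred_block[8] = {120, 120, 120, 120, 120, 120, 120, 120};
/* Candidate B: same texture strength, but shifted by one sample. */
const unsigned char shifted_block[8] = {140, 100, 140, 100, 140, 100, 140, 100};
```

Here `block_mse(src_block, blurred_block, 8)` is 400 while `block_mse(src_block, shifted_block, 8)` is 1600, so a mode decision driven purely by squared error picks the flat candidate even though it has thrown away all of the detail.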
