Diary Of An x264 Developer

02/15/2010 (9:02 pm)

x264: now with adaptive streaming support

Filed under: ratecontrol,streaming,x264 ::

You’re running a video chat program on a relatively weak upstream connection.  Someone else opens a video chat program on the same connection and your available bandwidth immediately drops.  What do you do?

You’re running a streaming video server that sends live video to an iPhone.  The client moves into an area of weaker reception and the stream begins to break up.  What do you do?

You’re running a streaming video server that has maxed out your connection with its current viewers, but you want another person to be able to connect.  You’d rather not restart the whole server, though.  What do you do?

Read More…

01/17/2010 (10:40 pm)

What’s coming next in x264 development

As seen in the previous post, a whole category of x264 improvements has now been completed and committed.  So, many have asked–what’s next?  Many major features have, from the perspective of users, seemingly come out of nowhere, so hopefully this post will give a better idea as to what’s coming next.

Big thanks to everyone who’s been helping with a lot of the changes outside of the core encoder.  This, by the way, is one of the easiest ways to get involved in x264 development; while learning how the encoder internals work is not necessarily that difficult, understanding them well enough to contribute useful changes often takes a lot of effort, especially since by and large the existing developers have already eliminated most of the low-hanging fruit.  By comparison, one can work on x264cli or on the parts of libx264 outside the analysis sections with significantly less domain knowledge.  This doesn’t mean the work is any less difficult–only that it has a lower barrier to entry.

For most of the specific examples given below, I’ve put down an estimated time scale as an exercise in project estimation.  This is no guarantee as to when they’ll be done; it’s just a wild guess on my part.  It might, though, serve as a personal motivator for the tasks I’ve assigned to myself.  Don’t harass any of the other developers based on my bad guesses ;)

Do also note that giving a project a time scale doesn’t necessarily mean it will be finished at all: not everything that we plan ends up happening.  Many features end up sitting on the TODO list for months or even years before someone decides that they’re important enough to implement.  If your company is particularly interested in one of these features, I might be able to offer you a contract to make sure it gets done much sooner and in a way that best fits your use-case; contact me via email.

Read More…

01/13/2010 (10:23 pm)

x264: the best low-latency video streaming platform in the world

x264 has long held the crown as one of the best, if not the best, general-purpose H.264 video encoders.  With state-of-the-art psy optimizations and powerful internal algorithms, its quality and performance in “normal” situations are mostly unrivaled.

But there are many very important use-cases where this simply isn’t good enough.  All the quality and performance in the world does nothing if x264 can’t meet other requirements necessary for a given business.  Which brings us to today’s topic: low-latency streaming.

The encoding familiar to most users has effectively “infinite” latency: the output file is not needed by the user until the entire encode is completed.  This allows algorithms such as 2-pass encoding, which require that the entire input be processed before even a single frame of the final output is available.  This of course becomes infeasible for any sort of live streaming, in which the viewer must see the video some predictable amount of time after it reaches the encoder.  Which brings us to our first platform: broadcast television.

Read More…

12/06/2009 (1:15 am)

A curious SIMD assembly challenge: the zigzag

Filed under: assembly,development,speed,x264 ::

Most SIMD assembly functions are implemented in a rather straightforward fashion.  An experienced assembly programmer can spend 2 minutes looking at C code and either give a pretty good guess at how one would write SIMD for it or, just as easily, rule out SIMD as an optimization technique for that code.  There might be a nonintuitive approach that’s somewhat better, but one can usually get very good results merely by following the most obvious method.

But in some rare cases there is no “most obvious method”, even for functions that would seem extraordinarily simple.  These kinds of functions present an unusual situation for the assembly programmer: they find themselves looking at some embarrassingly simple algorithm–one which simply cries out for SIMD–and yet they can’t see an obvious way to do it!  So let’s jump into the fray here and look at one of these cases.
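For context, the scan in question fits in a couple of lines of C.  The table below is the standard H.264 4x4 frame zigzag order; the function is an illustrative scalar sketch, not x264’s actual code:

```c
#include <stdint.h>

/* Standard H.264 zigzag scan order for a 4x4 block (frame mode):
 * each entry is the raster-order index of the next coefficient. */
static const uint8_t zigzag4x4_frame[16] = {
    0, 1, 4, 8, 5, 2, 3, 6, 9, 12, 13, 10, 7, 11, 14, 15
};

/* The "embarrassingly simple" scalar version: a pure permutation.
 * Every output element is a single table lookup into the input, which
 * is exactly why SIMD is non-obvious here: the permutation doesn't map
 * cleanly onto fixed-pattern vector shuffle instructions. */
static void zigzag_scan_4x4_scalar(int16_t dst[16], const int16_t src[16])
{
    for (int i = 0; i < 16; i++)
        dst[i] = src[zigzag4x4_frame[i]];
}
```

Sixteen loads, sixteen stores, no arithmetic at all–and yet no single obvious sequence of vector shuffles produces this particular ordering.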

Read More…

11/15/2009 (5:15 am)

The spec-violation hall of shame

People generally make the assumption that when they make an MP3 file, an MP3 player will play it.  When they save a PNG file, a browser will display it.  And when they save a PDF, any PDF application will view it.

And so we assume that H.264 decoders abide by the specification.  If they claim to support Main Profile, they’ll play back Main Profile.  There’s a whole set of rather exhaustive official test cases to validate compliance, which would hopefully ensure that all decoders would work with even the most bizarre combinations of features that fall under the subset of what they claim to support.

Among other things, this makes development dramatically easier: imagine how ridiculous it would be to have to test every change on dozens upon dozens of decoders to make sure that they all still worked.  Instead, we can just test against the official reference to ensure that our streams are still valid.

A week ago Dylan finished up the weighted P-frame prediction Google Summer of Code project.  It passed the JM verification test and it worked with libavcodec too.  We also tested a few other decoders to make sure; all were fine.

And then we committed it.

We immediately got reports of artifacting in a wide variety of decoders, despite all of them claiming to support the feature.  As such, here is a hall of shame of products that are distributed with broken decoders that do not abide by the spec.

Read More…

10/25/2009 (12:06 pm)

x264’s assembly abstraction layer: now free to use

Filed under: assembly,licensing,speed,x264 ::

About two years ago we decided to merge our two trees of 32-bit and 64-bit assembly; it had become a maintainability nightmare with many functions in one that were not in the other and so forth.  Loren Merritt did this merging by adding an abstraction layer that automatically handles differences in calling convention between platforms (and was extensible to future ones as well).  Thus began the story of common/x86/x86inc.asm.

Over the years, x86inc has grown tremendously as it gained a great deal of functionality.  It now supports x86_32, x86_64, and win64 (thanks to Anton Mitrofanov), which covers the three (by far) most popular x86 C calling conventions.  It also has macros to abstract between MMX and SSE functions, along with automatic handling of register permutations and other such useful features.

All of this serves to make x264ASM, as we call it, by far the best option for writing platform-independent x86 assembly.  It keeps the full optimization capabilities and powerful preprocessor of native assembly while having the platform-independence and convenience of intrinsics and inline assembly.

We’ve received many requests by non-GPL projects (and even commercial proprietary developers) to be able to use this abstraction layer.  And now, it’s available under a permissive BSD-like license (specifically the ISC) for anyone to use, on request by a certain Adobe engineer.

Before you jump into the x264 code to see how it’s used, however, let’s go over some of the basics.  Note of course this explanation is no substitute for reading x86inc.asm itself.

Read More…

10/18/2009 (3:04 am)

Open source collaboration done right

Filed under: benchmark,linux,speed,x264 ::

For years I’ve run into all sorts of horrific situations when dealing with open source.  Like software modules written by different teams on a badly managed commercial project, different open source projects tend to defensively program around each other’s flaws rather than actually submitting patches to fix them.  There are even entire projects built around providing API wrappers that simplify usage and fix bugs present in the original library.

In many cases people don’t even submit bug reports.  Sometimes they outright patch each other’s libraries–and don’t submit the patches back to the original project.  At best this leads to tons of bugs and security vulnerabilities being overlooked in the original project.  At worst this leads to situations like the Debian OpenSSL fiasco, in which the people patching the code don’t know enough about it to safely work with it (and don’t even talk to the people who do).

But enough ranting–let me talk about a success story.

Read More…

09/10/2009 (6:36 pm)

iDCT rounding

The quantization process in modern video encoders tends to make a lot of assumptions.  A common one is that of continuity and uniform step size–that, for example, if we are quantizing the value 2.5, both 2 and 3 will give equal distortion, being exactly 0.5 off from the correct value.  But this isn’t always true; in reality, we are working with an 8-bit range in each channel.  The inverse transform has to round our high-precision internal values to a small output range.

Normally, this isn’t a problem.  Since AC coefficients have (by definition) different output values for each output pixel, they serve to effectively dither the output of the iDCT.  But what happens when we don’t have any AC coefficients?
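As a toy illustration of how that final rounding stage can upend the continuous-distortion model (the scalar-quantizer setup and all names here are assumptions for illustration, not x264’s actual code; values are assumed nonnegative):

```c
/* Toy model of the decoder tail, assuming a simple scalar quantizer
 * with step size `qstep`: dequantize a level, then round to the
 * integer 8-bit pixel range the way the iDCT output stage must. */
static int dequant_to_pixel(int level, double qstep)
{
    double recon = level * qstep;
    int pix = (int)(recon + 0.5);   /* round to nearest (nonnegative input) */
    return pix > 255 ? 255 : pix;   /* clip to the 8-bit output range */
}

static double dabs(double x) { return x < 0 ? -x : x; }

/* Distortion under the idealized continuous model... */
static double ideal_distortion(double ideal, int level, double qstep)
{
    return dabs(ideal - level * qstep);
}

/* ...versus distortion measured after the rounding the viewer sees. */
static double pixel_distortion(double ideal, int level, double qstep)
{
    return dabs(ideal - dequant_to_pixel(level, qstep));
}
```

With qstep = 0.9 and an ideal value of 2.5, the continuous model says level 3 (reconstructing 2.7) beats level 2 (reconstructing 1.8) by 0.2 to 0.7; after rounding to integer pixels, both land exactly 0.5 away from the ideal, so in what actually gets displayed the choice is a tie.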

Read More…

09/02/2009 (10:58 pm)

The hidden cycle-eating demon: L1 cache misses

Filed under: Intel,speed,x264 ::

Most of the resources out there about optimizing cache access talk about L2 cache misses.  Which is sensible–the cost of an L2 miss is extraordinary, taking hundreds of cycles for an access to main memory.  By comparison, an L1 miss, costing just a dozen cycles, is nothing.  This is true on-chip as well; the memory->L2 prefetcher in modern processors is extremely sophisticated and is very good at avoiding cache misses.  It is also very efficient, making reasonably good use of the limited memory bandwidth available.  There are also dedicated prefetch instructions to hint the prefetcher to avoid future L2 misses.

But what about L1 misses?  There’s vastly less literature on them and the L2->L1 prefetchers are often barely documented or not even mentioned in official processor literature.  Explicit prefetch instructions are vastly less useful because the cost of the misses is low enough that the extra overhead of sending off a set of prefetches is often not worth it.  And yet in many cases–such as in x264–much more time is wasted on L1 misses than L2 misses.
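One way to reason about this without a profiler is to count cache-line footprints.  The sketch below assumes the 64-byte lines typical of x86 chips of this era; it’s a back-of-the-envelope model, not a measurement:

```c
#include <stddef.h>

#define CACHE_LINE 64   /* bytes per line, a typical x86 value */

/* Count the cache lines touched by n byte accesses starting at byte
 * `start` with a constant byte stride. Assumes a monotonically
 * increasing access pattern, so each line is counted when first
 * entered. */
static size_t lines_touched(size_t start, size_t stride, size_t n)
{
    size_t count = 0;
    size_t prev_line = (size_t)-1;
    for (size_t i = 0; i < n; i++) {
        size_t line = (start + i * stride) / CACHE_LINE;
        if (line != prev_line) {
            count++;
            prev_line = line;
        }
    }
    return count;
}
```

Reading 4096 bytes sequentially touches just 64 lines (a 4KB footprint, easily resident in L1); reading 4096 pixels down a column of a frame whose stride is 64 bytes or more touches 4096 distinct lines, a footprint far larger than any L1, so nearly every access misses.  That’s the kind of pattern that quietly eats cycles a dozen at a time.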

Read More…

08/24/2009 (7:54 pm)

Announcing ARM support

Filed under: ARM,assembly,GSOC,speed,x264 ::

Thanks to our Google Summer of Code student David Conrad (aka Yuvi), we now have ARM support in x264, along with a significant amount of SIMD acceleration via NEON, available on the Cortex A8 and A9 chips.  Yes, that’s right, x264 can now run on an iPhone.  Total performance increase from the NEON optimizations (so far) is about 280% on default settings.

With low power becoming more important and ARM chips increasing dramatically in speed (multi-core chips are already hitting silicon), being able to do high-quality, high-speed realtime video encoding on ARM chips will become more and more important.  Staying ahead of the game as always, x264 will be the premier encoder on ARM as well.

One situation showing the usefulness of low-power encoding was brought up a month or two ago: a remote-control airplane enthusiast wanted his airplane to broadcast camera footage over the cell network so that he could control it from many miles away.  The cell network is generally low-bandwidth, so he needs a high-efficiency video encoder.  But he can’t afford a powerful system; his airplane is already extremely low-power, and he needs an encoder that is both low-power and low-weight.  The ARM chip is perfect: it uses a fraction of a watt, takes up almost no space, and now he can run x264 on it.

Special thanks to Mans Rullgard for helping with lots of assembly questions and contributing the NEON deblocking code, originally used in the ffmpeg H.264 decoder.

Want to play with x264 on an ARM?  Get a Beagleboard.

Commits: 1 2 3 4 5 6 7 8 9 10
