07/23/2010 (4:01 pm)
Back when I originally reviewed VP8, I noted that the official decoder, libvpx, was rather slow. While there was no particular reason that it should be much faster than a good H.264 decoder, it shouldn’t have been that much slower either! So, I set out with Ronald Bultje and David Conrad to make a better one in FFmpeg. This one would be community-developed and free from the beginning, rather than the proprietary code-dump that was libvpx. A few weeks ago the decoder was complete enough to be bit-exact with libvpx, making it the first independent free implementation of a VP8 decoder. Now, with the first round of optimizations complete, it should be ready for primetime. I’ll go into some detail about the development process, but first, let’s get to the real meat of this post: the benchmarks.
07/13/2010 (3:06 am)
I’ve been working the past few weeks to help finish up the ffmpeg VP8 decoder, the first community implementation of On2’s VP8 video format. Now that I’ve written a thousand or two lines of assembly code and optimized a good bit of the C code, I’d like to look back at VP8 and comment on a variety of things — both good and bad — that slipped the net the first time, along with things that have changed since the time of that blog post.
These are mostly not issues of compression; that topic has been beaten to death, particularly in MSU’s recent comparison, where x264 beat the crap out of VP8 and the VP8 developers pulled a Pinocchio in the developer comments. But that was expected and isn’t particularly interesting, so I won’t go into that. VP8 doesn’t have to be the best in the world in order to be useful.
When the ffmpeg VP8 decoder is complete (just a few more asm functions to go), we’ll hopefully be able to post some benchmarks comparing it to libvpx.
05/25/2010 (11:01 pm)
As mentioned in the previous post, H.264 has an adaptive deblocking filter. But what exactly does that mean — and more importantly, what does it mean for performance? And how can we make it as fast as possible? In this post I’ll try to answer these questions, particularly in relation to my recent deblocking optimizations in x264.
H.264’s deblocking filter has two steps: strength calculation and the actual filter. The first step calculates the parameters for the second step. The filter runs on all the edges in each macroblock. That’s 4 vertical edges of length 16 pixels and 4 horizontal edges of length 16 pixels. The vertical edges are filtered first, from left to right, then the horizontal edges, from top to bottom (order matters!). The leftmost edge is the one between the current macroblock and the left macroblock, while the topmost edge is the one between the current macroblock and the top macroblock.
Here’s the formula for the strength calculation in progressive mode. The highest strength that applies is always selected.
If we’re on the edge between an intra macroblock and any other macroblock: Strength 4
If we’re on an internal edge of an intra macroblock: Strength 3
If either side of a 4-pixel-long edge has residual data: Strength 2
If the motion vectors on opposite sides of a 4-pixel-long edge are at least a pixel apart (in either x or y direction) or the reference frames aren’t the same: Strength 1
Otherwise: Strength 0 (no deblocking)
These values are then thrown into a lookup table depending on the quantizer: higher quantizers have stronger deblocking. Then the actual filter is run with the appropriate parameters. Note that Strength 4 is actually a special deblocking mode that performs a much stronger filter and affects more pixels.
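The strength-selection rules above can be sketched as a small C function. This is an illustrative sketch, not x264’s actual code: the struct fields and function name are hypothetical, and motion vectors are assumed to be in quarter-pel units (so a difference of 4 means one full pixel).

```c
#include <assert.h>
#include <stdlib.h>

/* Hypothetical per-block state for one side of a 4-pixel edge segment. */
typedef struct {
    int intra;        /* block is intra-coded */
    int has_residual; /* block has nonzero transform coefficients */
    int mv_x, mv_y;   /* motion vector, quarter-pel units */
    int ref;          /* reference frame index */
} block_info;

/* Select the boundary strength for one edge segment in progressive mode.
 * mb_edge is 1 if the edge lies on a macroblock boundary.
 * The highest strength that applies is always selected, so the checks
 * run from strongest to weakest. */
int boundary_strength( const block_info *a, const block_info *b, int mb_edge )
{
    if( ( a->intra || b->intra ) && mb_edge )
        return 4;   /* edge between an intra MB and any other MB */
    if( a->intra || b->intra )
        return 3;   /* internal edge of an intra MB */
    if( a->has_residual || b->has_residual )
        return 2;   /* either side has residual data */
    if( a->ref != b->ref ||
        abs( a->mv_x - b->mv_x ) >= 4 ||  /* >= 1 pixel, quarter-pel units */
        abs( a->mv_y - b->mv_y ) >= 4 )
        return 1;   /* MVs differ by a pixel or refs differ */
    return 0;       /* no deblocking */
}
```

The ordering of the checks is what implements “the highest strength that applies is always selected”: each rule is only reached if all stronger rules failed.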
05/07/2010 (8:57 am)
For the past few years, various improvements on H.264 have been periodically proposed, ranging from larger transforms to better intra prediction. These finally came together in the JCT-VC meeting this past April, where over two dozen proposals were made for a next-generation video coding standard. Of course, all of these were in very rough-draft form; it will likely take years to filter them down into a usable standard. In the process, they’ll pick the most useful features (hopefully) from each proposal and combine them into something a bit more sane. But, of course, it all has to start somewhere.
A number of features were common: larger block sizes, larger transform sizes, fancier interpolation filters, improved intra prediction schemes, improved motion vector prediction, increased internal bit depth, new entropy coding schemes, and so forth. A lot of these are potentially quite promising and resolve a lot of complaints I’ve had about H.264, so I decided to try out the proposal that appeared the most interesting: the Samsung+BBC proposal (A124), which claims compression improvements of around 40%.
The proposal combines a bouillabaisse of new features, ranging from a 12-tap interpolation filter to 1/12th-pel motion compensation and transforms as large as 64×64. Overall, I would say it’s a good proposal and I don’t doubt their results given the sheer volume of useful features they’ve dumped into it. I was a bit worried about complexity, however, as 12-tap interpolation filters don’t exactly scream “fast”.
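To see why a 12-tap filter worries me, consider the inner loop of a horizontal interpolation pass: every output pixel costs 12 multiply-accumulates plus rounding and a clip, double the 6 taps of H.264’s half-pel filter. Here is a minimal sketch; the coefficient values are left as a parameter because the proposal’s actual taps are not reproduced here, and the function name is my own.

```c
#include <stdint.h>

static inline uint8_t clip_u8( int x )
{
    return x < 0 ? 0 : x > 255 ? 255 : x;
}

/* Horizontal 12-tap FIR interpolation pass (sketch).
 * src must have at least 5 valid pixels to the left of src[0] and
 * 6 to the right of src[width-1], since the taps are centered on
 * the half-sample position between src[x] and src[x+1]. */
void hfilter_12tap( uint8_t *dst, const uint8_t *src, int width,
                    const int16_t coef[12], int shift )
{
    int round = shift > 0 ? 1 << ( shift - 1 ) : 0;
    for( int x = 0; x < width; x++ )
    {
        int sum = 0;
        for( int t = 0; t < 12; t++ )        /* 12 MACs per output pixel */
            sum += coef[t] * src[x + t - 5];
        dst[x] = clip_u8( ( sum + round ) >> shift );
    }
}
```

Even before any SIMD, the per-pixel cost and the wider source footprint (11 extra pixels per row) make this filter markedly heavier than H.264’s.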
I prepared myself for the slowness of an unoptimized encoder implementation, compiled their tool, and started a test encode with their recommended settings.
01/17/2010 (10:40 pm)
As seen in the previous post, a whole category of x264 improvements has now been completed and committed. So, many have asked–what’s next? Many major features have, from the perspective of users, seemingly come out of nowhere, so hopefully this post will give a better idea as to what’s coming next.
Big thanks to everyone who’s been helping with a lot of the changes outside of the core encoder. This, by the way, is one of the easiest ways to get involved in x264 development; while learning how the encoder internals work is not necessarily that difficult, understanding enough to contribute useful changes often takes a lot of effort, especially since by and large the existing developers have already eliminated most of the low-hanging fruit. By comparison, one can work on x264cli or parts of libx264 outside the analysis sections with significantly less domain knowledge. This doesn’t mean the work is any less difficult–only that it has a lower barrier to entry.
For most specific examples given below, I’ve put down an estimated time scale as an exercise in project estimation. This is no guarantee as to when it will be done; it’s just a wild guess on my part, though it might serve as a personal motivator for the tasks that I’ve assigned to myself. Don’t harass any of the other developers based on my bad guesses.
Do also note that giving a project a time scale doesn’t necessarily mean it will be finished at all: not everything that we plan ends up happening. Many features end up sitting on the TODO list for months or even years before someone decides that it’s important enough to implement. If your company is particularly interested in one of these features, I might be able to offer you a contract to make sure it gets done much sooner and in a way that best fits your use-case; contact me via email.
01/13/2010 (10:23 pm)
x264 has long held the crown as one of the best, if not the best, general-purpose H.264 video encoder. With state-of-the-art psy optimizations and powerful internal algorithms, its quality and performance in “normal” situations is mostly unrivaled.
But there are many very important use-cases where this simply isn’t good enough. All the quality and performance in the world does nothing if x264 can’t meet other requirements necessary for a given business. Which brings us to today’s topic: low-latency streaming.
The encoding familiar to most users has effectively “infinite” latency: the output file is not needed by the user until the entire encode is completed. This allows algorithms such as 2-pass encoding, which require that the entire input be processed before even a single frame of the final output is available. This of course becomes infeasible for any sort of live streaming, in which the viewer must see the video some predictable amount of time after it reaches the encoder. Which brings us to our first platform: broadcast television.
12/06/2009 (1:15 am)
Most SIMD assembly functions are implemented in a rather straightforward fashion. An experienced assembly programmer can spend 2 minutes looking at C code and either give a pretty good guess at how one would write SIMD for it or, just as quickly, rule out SIMD as an optimization technique for that code. There might be a nonintuitive approach that’s somewhat better, but one can usually get very good results merely by following the most obvious method.
But in some rare cases there is no “most obvious method”, even for functions that would seem extraordinarily simple. These kinds of functions present an unusual situation for the assembly programmer: they find themselves looking at some embarrassingly simple algorithm–one which simply cries out for SIMD–and yet they can’t see an obvious way to do it! So let’s jump into the fray here and look at one of these cases.
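For contrast, here is what the “most obvious method” looks like when it does exist. Saturating byte addition, a staple of video filtering, maps the entire scalar loop body, clip included, onto one SSE2 instruction per 16 pixels. This is an illustrative sketch using intrinsics rather than x264-style assembly, and the function name is my own; it assumes n is a multiple of 16.

```c
#include <emmintrin.h>  /* SSE2 intrinsics */
#include <stdint.h>

/* dst[i] = min(a[i] + b[i], 255) for n pixels, n a multiple of 16. */
void add_clip_u8( uint8_t *dst, const uint8_t *a, const uint8_t *b, int n )
{
    for( int i = 0; i < n; i += 16 )
    {
        __m128i va = _mm_loadu_si128( (const __m128i*)&a[i] );
        __m128i vb = _mm_loadu_si128( (const __m128i*)&b[i] );
        /* Saturating unsigned add: the add and the clip to [0,255]
         * are a single PADDUSB instruction. */
        _mm_storeu_si128( (__m128i*)&dst[i], _mm_adds_epu8( va, vb ) );
    }
}
```

When the vector ISA hands you the exact operation the C code performs, the translation is mechanical; the interesting cases, the subject of this post, are the ones where it doesn’t.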
10/31/2009 (7:12 pm)
A while back I talked about L1 data cache misses which, while individually not a big deal, can slow down your program in the same fashion that a swarm of angry bees stings a honey-thieving bear to death. This time we’ll look at an even more insidious evil: the L1 code cache miss.
With data caches, despite what I said in the previous article, you still have a good bit of control. You can prefetch data to them explicitly using the prefetch instructions. You control memory allocation and can make all sorts of changes to potentially improve access patterns. Every single memory access is made explicitly in your code.
But it isn’t the same with the L1 code cache (L1I). You can’t prefetch into it at all: the prefetch instructions go to the L1 data cache, not the L1 code cache. Unless you write everything directly in assembly, you can’t control the allocation and layout of the code. And you don’t control access to the code at all; it is accessed implicitly when it is run.
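The data-cache side of this asymmetry is easy to demonstrate. GCC exposes explicit data prefetch as `__builtin_prefetch`; there is no analogous intrinsic for the instruction cache. A minimal sketch, with a hypothetical function name and a lookahead distance of 16 elements chosen arbitrarily for illustration:

```c
#include <stddef.h>

/* Sum an array while explicitly prefetching ahead into the L1 data
 * cache. The compiler emits a PREFETCH instruction for the hint;
 * no equivalent hint exists for code about to be executed. */
long sum_with_prefetch( const int *buf, size_t n )
{
    long sum = 0;
    for( size_t i = 0; i < n; i++ )
    {
        if( i + 16 < n )
            /* args: address, 0 = read, 3 = high temporal locality */
            __builtin_prefetch( &buf[i + 16], 0, 3 );
        sum += buf[i];
    }
    return sum;
}
```

Whether such manual prefetching actually helps depends on the access pattern and the hardware prefetchers; the point here is simply that the option exists for data and not for code.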
Many readers may have heard stories of programs running faster with gcc’s -Os (optimize for size) than -O2 or -O3 (optimize for speed); this is why. Larger code size causes more L1I cache misses, more load on the L2->L1 memory load unit, and uses up L2 cache as well. While the naive programmer may see great advantage to lots of loop unrolling or inlining, even timing the code may not be sufficient to prove that such code-size-increasing optimizations are worthwhile, since other parts of the program called afterwards could suffer due to evictions from the L1 instruction cache.
10/25/2009 (12:06 pm)
About two years ago we decided to merge our two trees of 32-bit and 64-bit assembly; it had become a maintainability nightmare with many functions in one that were not in the other and so forth. Loren Merritt did this merging by adding an abstraction layer that automatically handles differences in calling convention between platforms (and was extensible to future ones as well). Thus began the story of common/x86/x86inc.asm.
Over the years, x86inc has grown tremendously as it gained a great deal of functionality. It now supports x86_32, x86_64, and win64 (thanks to Anton Mitrofanov), which covers the three (by far) most popular x86 C calling conventions. It also has macros to abstract between MMX and SSE functions, along with automatic handling of register permutations and other such useful features.
All of this serves to make x264ASM, as we call it, by far the best option for writing platform-independent x86 assembly. It keeps the full optimization capabilities and powerful preprocessor of native assembly while having the platform-independence and convenience of intrinsics and inline assembly.
We’ve received many requests from non-GPL projects (and even commercial proprietary developers) to be able to use this abstraction layer. And now, at the request of a certain Adobe engineer, it’s available under a permissive BSD-like license (specifically the ISC) for anyone to use.
Before you jump into the x264 code to see how it’s used, however, let’s go over some of the basics. Note, of course, that this explanation is no substitute for reading x86inc.asm itself.
10/18/2009 (3:04 am)
For years I’ve dealt with all sorts of horrific situations when dealing with open source. Much like software modules written by different teams on a badly managed commercial project, different open source projects tend to defensively program around each other’s flaws rather than actually submitting patches to fix them. There are even entire projects built around providing API wrappers that simplify usage and fix bugs present in the original library.
In many cases people don’t even submit bug reports. Sometimes they outright patch each other’s libraries–and don’t submit the patches back to the original project. At best this leads to tons of bugs and security vulnerabilities being overlooked in the original project. At worst this leads to situations like the Debian OpenSSL fiasco, in which the people patching the code don’t know enough about it to safely work with it (and don’t even talk to the people who do).
But enough ranting–let me talk about a success story.