If one works long enough with a large number of H.264 encoders, one might notice that many of them are pretty much awful. This of course shouldn’t be a surprise: Sturgeon’s Law says that “90% of everything is crap”. It’s also exacerbated by the fact that H.264 is the most widely-accepted video standard in years and has spawned a huge amount of software that implements it, thus generating that many more mediocre implementations.
But even this doesn’t really explain the massive gap between good and bad H.264 encoders. Good H.264 encoders, like x264, can beat previous-generation encoders like Xvid visually at half the bitrate in many cases. Yet bad H.264 encoders are often so terrible that they lose to MPEG-2! The disparity wasn’t nearly this large with previous standards… and there’s a good reason for this.
H.264 offers a great variety of compression features, more than any previous standard. This also greatly increases the number of ways that encoder developers can shoot themselves in the foot. In this post I’ll go through a sampling of these. Most of the problems stem from the single fact that blurriness seems good when using mean squared error as a mode decision metric.
Since this post has gotten linked a good bit outside the technical community, I’ll elaborate slightly on some basic terminology that underlies the concepts in this post.
RD = lambda * bits + distortion, a measure of how “good” a decision is. Lambda is how valuable bits are relative to quality (distortion). If something costs very few bits, for example, it might be able to get away with more distortion. Distortion is measured via a mode decision metric, the most common being sum of squared errors.
Visual energy is the amount of apparent detail in an image or video. Part of the job of a good encoder is to retain energy so that the image doesn’t look blurry.
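To make the RD tradeoff concrete, here is a minimal sketch in C. The function names and the plain integer lambda are illustrative only, not any real encoder’s API; real encoders use per-quantizer lambda tables and fixed-point math.

```c
#include <assert.h>
#include <stdint.h>

/* Sum of squared errors between an original and a reconstructed block --
 * the most common distortion metric in RD mode decision. */
static uint64_t sse(const uint8_t *orig, const uint8_t *recon, int n)
{
    uint64_t sum = 0;
    for (int i = 0; i < n; i++) {
        int d = orig[i] - recon[i];
        sum += (uint64_t)(d * d);
    }
    return sum;
}

/* RD cost as defined above: lambda scales bit cost against distortion. */
static uint64_t rd_cost(uint64_t distortion, unsigned bits, unsigned lambda)
{
    return (uint64_t)lambda * bits + distortion;
}
```

Note how a cheap mode can win even with visibly worse distortion: with lambda = 20, a blurry mode costing 6 bits at distortion 500 scores 620, beating a detailed mode costing 40 bits at distortion 300, which scores 1100. This is exactly the trap the rest of this post describes.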
i16x16 mode

The good: i16x16 is very appealing as a mode: it is phenomenally cheap bit-wise due to its hierarchical DC transform. In flatter areas of the frame, this often makes it cost less than a dozen, or even half a dozen, bits per macroblock. As a result, RD mode decision loves this mode.
The bad: It looks like crap. i16x16 is atrocious at maintaining visual energy: it almost never has any AC coefficients when it is used, three out of four of its prediction modes code nearly no energy at all, and the deblocker tends to blur out any details left anyways. Combined with a lack of adaptive quantization, this is the prime cause of ugly 16×16 blocks in flat areas in encodes by crappy H.264 encoders. While the mode isn’t inherently bad, it’s over-emphasized in the spec and makes a great trap for RD to fall into.
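The cheapness comes from that hierarchical DC transform: the sixteen 4×4 luma DC coefficients of an i16x16 macroblock get a second 4×4 Hadamard transform, so a flat macroblock collapses into essentially a single coefficient. A sketch (the spec’s final scaling and quantization steps are omitted for clarity):

```c
#include <assert.h>

/* 4x4 Hadamard transform applied to the 16 luma DC coefficients of an
 * i16x16 macroblock.  For a perfectly flat macroblock, all the energy
 * lands in out[0][0] -- which is why flat i16x16 blocks cost so few bits. */
static void hadamard4x4(const int in[4][4], int out[4][4])
{
    int tmp[4][4];
    for (int i = 0; i < 4; i++) {        /* transform rows */
        int a = in[i][0] + in[i][3], d = in[i][0] - in[i][3];
        int b = in[i][1] + in[i][2], c = in[i][1] - in[i][2];
        tmp[i][0] = a + b;
        tmp[i][1] = d + c;
        tmp[i][2] = a - b;
        tmp[i][3] = d - c;
    }
    for (int j = 0; j < 4; j++) {        /* transform columns */
        int a = tmp[0][j] + tmp[3][j], d = tmp[0][j] - tmp[3][j];
        int b = tmp[1][j] + tmp[2][j], c = tmp[1][j] - tmp[2][j];
        out[0][j] = a + b;
        out[1][j] = d + c;
        out[2][j] = a - b;
        out[3][j] = d - c;
    }
}
```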
qpel motion compensation

The good: Qpel is of course a good thing for compression, and H.264’s qpel is unusual in that it is designed for encoder performance. The hpel filter is slow (a 6-tap filter), but can be precalculated, while the qpel filter is simple (bilinear) and can be done on-the-fly.
The bad: Bilinear interpolation is blurry, thus losing visual energy. But of course RD mode decision loves blurriness and so will pick it happily. Furthermore, the most naive motion search method (fullpel, one iteration of hpel, one iteration of qpel) tends to bias towards qpel instead of hpel. While qpel is still very useful, its overuse is yet another trap for encoders.
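For reference, the two interpolation stages look roughly like this in C (scalar and unoptimized; real encoders vectorize this and precalculate the hpel plane):

```c
#include <assert.h>
#include <stdint.h>

/* H.264's 6-tap half-pel filter, (1,-5,20,20,-5,1)/32, with rounding
 * and clipping to 8 bits.  e..j are six consecutive full-pel samples. */
static uint8_t hpel_6tap(int e, int f, int g, int h, int i, int j)
{
    int v = e - 5*f + 20*g + 20*h - 5*i + j;
    v = (v + 16) >> 5;
    if (v < 0)   v = 0;
    if (v > 255) v = 255;
    return (uint8_t)v;
}

/* Quarter-pel samples are just a rounded average of the two nearest
 * full/half-pel samples -- cheap, but bilinear, hence blurry. */
static uint8_t qpel_avg(uint8_t a, uint8_t b)
{
    return (uint8_t)((a + b + 1) >> 1);
}
```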
4×4 transform

The good: The 4×4 transform is great for coding edges efficiently and helps form the backbone of the highly efficient i4x4 intra mode. It also doesn’t need as fancy an entropy coder (for CAVLC at least) as an 8×8 transform would, thus allowing smaller VLC tables.
The bad: It’s blurry! It has a lower quantization precision at the same quantizer (compared to 8×8 transform); combined with decimation, this results in lots of uncoded blocks, yet another trap for RD. It’s terrible at coding textured areas, especially when the details in the texture are larger than the transform itself. It also gets deblocked more than 8×8. While adaptive transform is good news, the fact that 4×4 was the default (and 8×8 added later) is likely an artifact of the entire specification process being done while optimizing for CIF resolution videos.
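The core 4×4 transform itself is a small integer approximation of the DCT, computable with adds and shifts. A sketch (the quantization/scaling stage, where the lower precision mentioned above comes in, is omitted):

```c
#include <assert.h>

/* H.264's 4x4 integer core transform, Y = Cf * X * Cf^T, with
 * Cf = [1 1 1 1; 2 1 -1 -2; 1 -1 -1 1; 1 -2 2 -1].
 * Normalization is folded into the (omitted) quantization stage. */
static void fdct4x4(const int in[4][4], int out[4][4])
{
    int tmp[4][4];
    for (int i = 0; i < 4; i++) {        /* transform rows */
        int a = in[i][0] + in[i][3], d = in[i][0] - in[i][3];
        int b = in[i][1] + in[i][2], c = in[i][1] - in[i][2];
        tmp[i][0] = a + b;
        tmp[i][1] = 2*d + c;
        tmp[i][2] = a - b;
        tmp[i][3] = d - 2*c;
    }
    for (int j = 0; j < 4; j++) {        /* transform columns */
        int a = tmp[0][j] + tmp[3][j], d = tmp[0][j] - tmp[3][j];
        int b = tmp[1][j] + tmp[2][j], c = tmp[1][j] - tmp[2][j];
        out[0][j] = a + b;
        out[1][j] = 2*d + c;
        out[2][j] = a - b;
        out[3][j] = d - 2*c;
    }
}
```

A texture feature larger than these 4 samples simply cannot be represented within one transform block, which is why textured areas suffer.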
biprediction

The good: Biprediction is at the core of any modern video format: B-frames vastly improve compression efficiency, especially in lower-motion scenes. Biprediction singlehandedly makes possible the high number of skip blocks in B-frames in most sane-bitrate H.264 encodes.
The bad: It’s bilinear interpolation again, so it’s blurry, which acts as a nice RD trap yet again. This leads biprediction to be overused even in non-constant areas of the image, such as film grain, ensuring blurry grain in B-frames and sharp grain in P-frames, alternating frame by frame.
One should note of course that B-frames and thus biprediction are not at all unique to H.264; this has been an ongoing problem for many years and tends to be exacerbated by lower bitrates.
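The default (non-weighted) biprediction is just a rounded average of the two motion-compensated predictions, which is where the low-pass behavior comes from: averaging two slightly misaligned predictions smears out noise-like detail such as grain.

```c
#include <assert.h>
#include <stdint.h>

/* Default (non-weighted) bipredictive merge: a rounded average of the
 * forward and backward motion-compensated predictions. */
static void bipred_avg(const uint8_t *fwd, const uint8_t *bwd,
                       uint8_t *dst, int n)
{
    for (int i = 0; i < n; i++)
        dst[i] = (uint8_t)((fwd[i] + bwd[i] + 1) >> 1);
}
```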
h/v/dc intra prediction modes
The good: These modes are critical to the intra prediction system. DC is similar to the old-style intra coding before spatial intra prediction, and the latter two are very useful for straight edges. These three tend to be overall the most common intra prediction modes.
The bad: They retain energy terribly. The other intra prediction modes (planar and ddl/ddr/vr/hd/vl/hu) effectively predict frequencies that are difficult to code with a DCT, thus increasing visual energy in the resulting reconstructed image. But h/v/dc don’t really do this. Furthermore, because of how the mode prediction system works, they tend to be the cheapest modes to signal (in terms of bits).
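Concretely, here is what the three modes do for a 4×4 block (neighbors are assumed available; the spec’s handling of missing neighbors is omitted):

```c
#include <assert.h>
#include <stdint.h>

/* The three "flat" 4x4 intra predictors.  top[0..3] are the pixels
 * above the block, left[0..3] the pixels to its left. */
static void pred4x4_v(uint8_t dst[4][4], const uint8_t top[4])
{
    for (int y = 0; y < 4; y++)
        for (int x = 0; x < 4; x++)
            dst[y][x] = top[x];          /* columns copied straight down */
}

static void pred4x4_h(uint8_t dst[4][4], const uint8_t left[4])
{
    for (int y = 0; y < 4; y++)
        for (int x = 0; x < 4; x++)
            dst[y][x] = left[y];         /* rows copied straight across */
}

static void pred4x4_dc(uint8_t dst[4][4], const uint8_t top[4],
                       const uint8_t left[4])
{
    int sum = 4;                          /* rounding term */
    for (int i = 0; i < 4; i++)
        sum += top[i] + left[i];
    uint8_t dc = (uint8_t)(sum >> 3);     /* one flat value for the block */
    for (int y = 0; y < 4; y++)
        for (int x = 0; x < 4; x++)
            dst[y][x] = dc;
}
```

Vertical and horizontal can only extend straight lines off the block edge, and DC produces a single flat value, so none of the three can synthesize the sort of frequencies the planar and diagonal modes approximate.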
Of course, x264 uses all of these features while avoiding most of the problems described above. Developers of other encoders: take note.