Diary Of An x264 Developer

10/04/2009 (4:43 am)

Why so many H.264 encoders are bad

If one works long enough with a large number of H.264 encoders, one might notice that a great many of them are pretty much awful.  This of course shouldn’t be a surprise: Sturgeon’s Law says that “90% of everything is crap”.  It’s also exacerbated by the fact that H.264 is the most widely-accepted video standard in years and has spawned a huge amount of software that implements it, thus generating all the more mediocre implementations.

But even this doesn’t really explain the massive gap between good and bad H.264 encoders.  Good H.264 encoders, like x264, can beat previous-generation encoders like Xvid visually at half the bitrate in many cases.  Yet bad H.264 encoders are often so terrible that they lose to MPEG-2!  The disparity wasn’t nearly this large with previous standards… and there’s a good reason for this.

H.264 offers a great variety of compression features, more than any previous standard.  This also greatly increases the number of ways that encoder developers can shoot themselves in the foot.  In this post I’ll go through a sampling of these.  Most of the problems stem from the single fact that blurriness seems good when using mean squared error as a mode decision metric.

Since this post has gotten linked a good bit outside the technical community, I’ll elaborate slightly on some basic terminology that underlies the concepts in this post.

RD = lambda * bits + distortion, a measure of how “good” a decision is.  Lambda is how valuable bits are relative to quality (distortion).  If something costs very few bits, for example, it might be able to get away with more distortion.  Distortion is measured via a mode decision metric, the most common being sum of squared errors.
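As a toy illustration of the formula (the lambda, bit, and distortion numbers below are made up, not taken from any real encoder):

```python
# Lagrangian rate-distortion cost: lower is better.
def rd_cost(lmbda, bits, distortion):
    return lmbda * bits + distortion

# Two hypothetical candidate modes for the same macroblock:
lmbda = 20.0                                            # illustrative lambda
cheap    = rd_cost(lmbda, bits=8,   distortion=900.0)   # 160 + 900  = 1060
faithful = rd_cost(lmbda, bits=120, distortion=100.0)   # 2400 + 100 = 2500
# At this lambda the cheap (but more distorted) candidate wins; a larger
# lambda makes bits even more expensive and widens the gap.
```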

Visual energy is the amount of apparent detail in an image or video.  Part of the job of a good encoder is to retain energy so that the image doesn’t look blurry.

i16x16 macroblocks

The good: i16x16 is very appealing as a mode: it is phenomenally cheap bit-wise due to its hierarchical DC transform.  In flatter areas of the frame, this usually makes it cost less than a dozen or even half a dozen bits per macroblock.  As a result, RD mode decision loves this mode.

The bad: It looks like crap.  i16x16 is atrocious at maintaining visual energy: it almost never has any AC coefficients when it is used, three out of four of its prediction modes code nearly no energy at all, and the deblocker tends to blur out any details left anyways.  Combined with a lack of adaptive quantization, this is the prime cause of ugly 16×16 blocks in flat areas in encodes by crappy H.264 encoders.  While the mode isn’t inherently bad, it’s over-emphasized in the spec and makes a great trap for RD to fall into.

Bilinear qpel

The good: Qpel is of course a good thing for compression, and H.264’s qpel is unusual in that it is designed for encoder performance.  The hpel filter is slow (a 6-tap filter) but can be precalculated, while the qpel filter is simple (bilinear) and can be done on-the-fly.

The bad: Bilinear interpolation is blurry, thus losing visual energy.  But of course RD mode decision loves blurriness and so will pick it happily.  Furthermore, the most naive motion search method (fullpel, one iteration of hpel, one iteration of qpel) tends to bias towards qpel instead of hpel.  While qpel is still very useful, its overuse is yet another trap for encoders.
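The blurriness is easy to see in a toy example; the rounding average below is the same style of bilinear step H.264 uses for qpel positions, though the sample values are made up:

```python
# H.264-style rounding average, as used to build qpel positions.
def bilinear(a, b):
    return (a + b + 1) >> 1

row = [100, 140, 100, 140, 100, 140]   # strong high-frequency detail
half = [bilinear(row[i], row[i + 1]) for i in range(len(row) - 1)]
# half == [120, 120, 120, 120, 120]: the averaging acts as a low-pass
# filter, so the alternating detail is flattened out completely.
```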

4×4 transform

The good: The 4×4 transform is great for coding edges efficiently and helps form the backbone of the highly efficient i4x4 intra mode.  It also doesn’t need as fancy an entropy coder (for CAVLC at least) as an 8×8 transform would, thus allowing smaller VLC tables.

The bad: It’s blurry! It has a lower quantization precision at the same quantizer (compared to 8×8 transform); combined with decimation, this results in lots of uncoded blocks, yet another trap for RD.  It’s terrible at coding textured areas, especially when the details in the texture are larger than the transform itself.  It also gets deblocked more than 8×8.  While adaptive transform is good news, the fact that 4×4 was the default (and 8×8 added later) is likely an artifact of the entire specification process being done while optimizing for CIF resolution videos.
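For reference, the 4×4 core transform itself is just an integer matrix product, Y = C * X * C^T; a minimal sketch (the post-scaling that H.264 folds into quantization is omitted here):

```python
# H.264 4x4 integer core transform matrix.
C = [[1,  1,  1,  1],
     [2,  1, -1, -2],
     [1, -1, -1,  1],
     [1, -2,  2, -1]]

def matmul(a, b):
    return [[sum(a[i][k] * b[k][j] for k in range(4)) for j in range(4)]
            for i in range(4)]

def forward_4x4(block):
    """Forward core transform Y = C * X * C^T (no scaling/quantization)."""
    ct = [list(col) for col in zip(*C)]          # C transposed
    return matmul(matmul(C, block), ct)

flat = [[10] * 4 for _ in range(4)]
coefs = forward_4x4(flat)
# A perfectly flat block yields a single DC coefficient (16 * 10 = 160)
# and no AC at all -- the degenerate case the small transform loves.
```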

Biprediction

The good: Biprediction is at the core of any modern video format: B-frames vastly improve compression efficiency, especially in lower-motion scenes.  Biprediction singlehandedly makes possible the high number of skip blocks in B-frames in most sane-bitrate H.264 encodes.

The bad: It’s bilinear interpolation again, so it’s blurry, which acts as a nice RD trap yet again.  This makes biprediction get overused even in non-constant areas of the image, such as film grain, ensuring blurry grain in B-frames and clear grain in P-frames (nicely alternating as such).

One should note of course that B-frames and thus biprediction are not at all unique to H.264; this has been an ongoing problem for many years and tends to be exacerbated by lower bitrates.

h/v/dc intra prediction modes

The good: These modes are critical to the intra prediction system.  DC is similar to the old-style intra coding before spatial intra prediction, and the latter two are very useful for straight edges.  These three tend to be overall the most common intra prediction modes.

The bad: They retain energy terribly.  The other intra prediction modes (planar and ddl/ddr/vr/hd/vl/hu) effectively predict frequencies that are difficult to code with a DCT, thus increasing visual energy in the resulting reconstructed image.  But h/v/dc don’t really do this.  Furthermore, because of how the mode prediction system works, they tend to be the cheapest modes to signal (in terms of bits).
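A minimal sketch of these three modes for a 4×4 block (border pixel values made up) makes the point concrete: none of them can synthesize any detail that is not already in the border pixels.

```python
# The three "flat" intra prediction modes, for a 4x4 block.
def pred_vertical(top):             # copy the row above downward
    return [list(top) for _ in range(4)]

def pred_horizontal(left):          # copy the left column rightward
    return [[left[i]] * 4 for i in range(4)]

def pred_dc(top, left):             # fill with the rounded mean of the border
    dc = (sum(top) + sum(left) + 4) >> 3
    return [[dc] * 4 for _ in range(4)]

top, left = [10, 20, 30, 40], [50, 60, 70, 80]
# pred_dc gives a constant block of (100 + 260 + 4) >> 3 == 45; every
# predicted row or column is a straight copy or a constant, i.e. zero
# energy beyond what the border already had.
```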

Of course, x264 effectively uses all of these features without most of the aforementioned problems.  Developers of other encoders: take note.

18 Responses to “Why so many H.264 encoders are bad”

  1. JoeH Says:

    Amazing post as always. All the CUDA encoders available fall into these categories. I would love to see a post from you about your OpenCL (or whatever technology you would use) plans for using video cards to speed up x264’s output (obviously splitting up the work between the card and the CPU). I can’t trust anyone else will do it right….

  2. cb Says:

    “Of course, x264 effectively uses all of these features without most of the aforementioned problems.”

    How?

  3. Dark Shikari Says:

    @cb

    By taking energy into account during RD optimization, x264 avoids falling into low-error but low-energy modes.
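    Roughly along these lines (a toy sketch of the idea only, not x264’s actual code; the lambda, bit counts, and psy weight are all made up):

```python
# "Energy" here: total absolute deviation from the block mean.
def energy(block):
    mean = sum(block) / len(block)
    return sum(abs(p - mean) for p in block)

# RD cost with an optional penalty for energy lost between source and recon.
def rd_cost(lmbda, bits, source, recon, psy_weight=0.0):
    sse = sum((s - r) ** 2 for s, r in zip(source, recon))
    return lmbda * bits + sse + psy_weight * abs(energy(source) - energy(recon))

source = [100, 140, 100, 140]     # textured source block
blurry = [120, 120, 120, 120]     # flat prediction, nearly free to code
lmbda = 40.0

# Plain SSE-based RD picks the blurry candidate: 1800 < 2400.
plain_blurry = rd_cost(lmbda, 5, source, blurry)    # 200 + 1600
plain_sharp  = rd_cost(lmbda, 60, source, source)   # 2400 + 0
# With an energy penalty, the faithful candidate wins: 2400 < 2600.
psy_blurry = rd_cost(lmbda, 5, source, blurry, psy_weight=10.0)
psy_sharp  = rd_cost(lmbda, 60, source, source, psy_weight=10.0)
```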

  4. danx0r Says:

    what would be the best way to understand x264’s energy-retaining R/D approach? (I assume RTFC, but if it’s based on published research that would be extremely helpful)

  5. Dark Shikari Says:

    @danx0r

    I’ve never seen published research on the topic, though I haven’t looked that hard.

    Check encoder/rdo.c for some basic information.

  6. Gonzo Bumm Says:

    I really would like to read a beginner’s tutorial by you about how to use x264 – I’ve read many other tutorials and all of them seem to contain misunderstandings of the concepts or problems with the many options. There are also many GUIs for x264 whose authors seem not to understand what they are doing. It would be great to have the one and only real reference. Thanks!

  7. Dark Shikari Says:

    @Gonzo

    1) x264 --help (it even has example usage!)
    2) x264 --longhelp
    3) x264 --fullhelp
    4) http://mewiki.project357.com/wiki/X264_Settings (slightly outdated at times)

  8. Sarang Says:

    Hi,

    I understand that this may be a novice question, but thought you would be the best one to answer:

    1) We want to minimize post-transform (frequency-domain) energy after Motion Estimation.
    2) However, currently ME is done in the spatial domain, although SAD equals the DC term and can give some correlation with minimized transform terms.
    3) For now, if we ignore the extremely expensive computational cost, can’t ME be done in the frequency domain? This would give the exact bit cost, and would (hopefully) give absolutely minimum distortion.
    4) So is this correct: by minimizing the error term of transformed blocks, we can minimize the objective bit cost AND the “subjective distortion”, as well as approximating measures like SSIM?

  9. Dark Shikari Says:

    @Sarang

    Yes, it’s called --me tesa.
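    (For the curious: the transform-domain cost involved is SATD, the sum of absolute Hadamard-transformed residual coefficients. A toy 4×4 version, purely illustrative and not x264’s optimized code:)

```python
# 4x4 Hadamard matrix (symmetric, so H^T == H).
H = [[1,  1,  1,  1],
     [1, -1,  1, -1],
     [1,  1, -1, -1],
     [1, -1, -1,  1]]

def hadamard_4x4(x):
    t = [[sum(H[i][k] * x[k][j] for k in range(4)) for j in range(4)]
         for i in range(4)]
    return [[sum(t[i][k] * H[k][j] for k in range(4)) for j in range(4)]
            for i in range(4)]

def satd(cur, ref):
    """Sum of absolute transformed differences between two 4x4 blocks."""
    resid = [[cur[i][j] - ref[i][j] for j in range(4)] for i in range(4)]
    return sum(abs(c) for row in hadamard_4x4(resid) for c in row)

flat10 = [[10] * 4 for _ in range(4)]
zeros  = [[0] * 4 for _ in range(4)]
# satd(flat10, flat10) == 0; satd(flat10, zeros) == 160 (a pure-DC residual).
```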

  10. Esurnir Says:

    Would a --me tumh be a stupid idea? I assume you already tested it and thought “it is”, but I’m just throwing the idea out there.

  11. Dark Shikari Says:

    @Esurnir

    Yup, we tested it on all the various modes. It didn’t help much and the speed cost was very high, so we restricted it to esa only.

  12. Shevach Riabtsev Says:

    Regarding intra prediction: in H.264 (as well as in AVS), intra prediction is performed in the pixel domain, while in MPEG-4 it is partly executed in the frequency domain (DC/AC prediction). As a result, H.264 intra prediction fails on weak, noisy material due to the loss of correlation. On the other hand, if intra prediction were executed in the frequency domain, at least the low-frequency components would remain correlated. It is also worth mentioning that the dynamic range of intra prediction in the frequency domain would be much larger than in the pixel domain.

  13. Pengvado Says:

    So spatial intra prediction works great for normal content, and only fails when there is no correlation to predict from. Whereas AC prediction is uniformly useless (it makes about 0.1% bitrate difference in MPEG4). Yet another win for H.264.

  14. Shevach Riabtsev Says:

    I would like to stress the following point:
    the spatial correlation is usually good on non-noisy content. On noisy content, the spatial correlation between neighboring pixels is expected to deteriorate, while the correlation of low-frequency AC coefficients remains high, since noise mostly affects the high frequencies (although DCT leakage from high-frequency coefficients might slightly impact the low-frequency harmonics).
    I suppose that on noisy content, MPEG-4 AC prediction shows a gain of more than 0.1%.

  15. Pengvado Says:

    Nope, AC prediction is just as useless at predicting noise as it is for clean content.

  16. skal Says:

    you didn’t talk about in-loop deblocking strength.
    0:0 is too high for my taste, but that’s just me…

  17. Jeremy Noring Says:

    Thanks, this post was really interesting.

    I have a follow-up question: what is a good way of evaluating an encoder’s output? I have an embedded encoder; is there some way to see if it does any of the aforementioned encoding faux pas based on the output?

    Any general strategies you know of here would be welcome. Great blog too, I love it.

  18. Dark Shikari Says:

    @Jeremy

    An embedded encoder is going to be extremely minimal, generally: at best it’ll do SAD mode decision, deadzone quantization, and other extremely simple algorithms. Don’t expect much out of it; at best you’ll get something similar to x264 with --preset veryfast --profile baseline --tune psnr. No point in bothering trying to do fancy evaluation, IMO.
