Diary Of An x264 Developer

06/21/2010 (6:56 am)

How to cheat on video encoder comparisons

Over the past few years, practically everyone and their dog has published some sort of encoder comparison.  Sometimes they’re actually intended to be something for the world to rely on, like the old Doom9 comparisons and the MSU comparisons.  Other times, they’re just to scratch an itch — someone wants to decide for themselves what is better.  And sometimes they’re just there to outright lie in favor of whatever encoder the author likes best.  The latter is practically an expected feature on the websites of commercial encoder vendors.

One thing almost all these comparisons have in common — particularly (but not limited to!) the ones done without consulting experts — is that they are horribly done.  They’re usually easy to spot: for example, two videos at totally different bitrates are being compared, or the author complains about one of the videos being “washed out” (i.e. he screwed up his colorspace conversion).  Or the results are simply nonsensical.  Many of these problems result from the person running the test not “sanity checking” the results to catch mistakes that he made in his test.  Others are just outright intentional.

The result of all these mistakes, both intentional and accidental, is that the results of encoder comparisons tend to be all over the map, to the point of absurdity.  For any pair of encoders, it’s practically a given that a comparison exists somewhere that will “prove” any result you want to claim, even if the result would be beyond impossible in any sane situation.  This often results in the appearance of a “controversy” even if there isn’t any.

Keep in mind that every single mistake I mention in this article has actually been done, usually in more than one comparison.  And before I offend anyone, keep in mind that when I say “cheating”, I don’t mean to imply that everyone that makes the mistake is doing it intentionally.  Especially among amateur comparisons, most of the mistakes are probably honest.

So, without further ado, we will investigate a wide variety of ways, from the blatant to the subtle, with which you too can cheat on your encoder comparisons.

Blatant cheating

1.  Screw up your colorspace conversions.  A common misconception is that converting from YUV to RGB and back is a simple process where nothing can go wrong.  This is quite untrue. There are two primary attributes of YUV: PC range (0-255) vs TV range (16-235 for luma, 16-240 for chroma) and BT.709 vs BT.601 conversion coefficients.  That makes a total of 4 possible different types of YUV.  When people compare encoders, they often use different frontends, some of which make incorrect assumptions about these attributes.

Incorrect assumptions are so common that it’s often a matter of luck whether the tool gets it right or not.  It doesn’t help that most videos don’t even properly signal which they are to begin with!  Often even the tool that the person running the comparison is using to view the source material gets the conversion wrong.

Subsampled YUV (aka what everyone uses) adds yet another dimension to the problem: the location which the chroma data represents (“chroma siting”) isn’t constant.  For example, JPEG and MPEG-2 define different positions.  This is even worse because almost nobody actually handles this correctly; the best approach is to simply make sure none of your software is doing any conversion.  A mistake in chroma siting is what created that infamous PSNR graph showing Theora beating x264, which has been cited for ages since despite the developers themselves retracting it after realizing their mistake.

Keep in mind that the video encoder is not responsible for colorspace conversion — almost all video encoders operate in the YUV domain (usually subsampled 4:2:0 YUV, aka YV12).  Thus any problem in colorspace conversion is usually the fault of the tools used, not the actual encoder.

How to spot it: “The color is a bit off” or “the contrast of the video is a bit duller”.  There were a staggering number of “H.264 vs Theora” encoder comparisons which came out in favor of one or the other solely based on “how well the encoder kept the color” — making the results entirely bogus.
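
To make the four variants concrete, here is a minimal Python sketch that converts a single 8-bit RGB pixel to YUV (strictly speaking Y'CbCr) under each combination of matrix and range.  The function name and structure are purely illustrative and not taken from any particular tool; the luma coefficients are the standard BT.601 and BT.709 values.

```python
# Luma coefficients (Kr, Kb) for the two common conversion matrices.
COEFFS = {"bt601": (0.299, 0.114), "bt709": (0.2126, 0.0722)}

def rgb_to_ycbcr(r, g, b, matrix="bt601", full_range=False):
    """Convert one 8-bit RGB pixel to 8-bit Y'CbCr (illustrative helper)."""
    kr, kb = COEFFS[matrix]
    kg = 1.0 - kr - kb
    rn, gn, bn = r / 255.0, g / 255.0, b / 255.0
    y = kr * rn + kg * gn + kb * bn        # luma, 0..1
    cb = (bn - y) / (2.0 * (1.0 - kb))     # blue difference, -0.5..0.5
    cr = (rn - y) / (2.0 * (1.0 - kr))     # red difference, -0.5..0.5
    if full_range:
        # PC range: all three components span 0-255.
        out = (255.0 * y, 255.0 * cb + 128.0, 255.0 * cr + 128.0)
    else:
        # TV range: luma 16-235, chroma 16-240.
        out = (219.0 * y + 16.0, 224.0 * cb + 128.0, 224.0 * cr + 128.0)
    return tuple(min(255, max(0, round(v))) for v in out)

# The same RGB pixel yields four different YUV triples.
for matrix in ("bt601", "bt709"):
    for full_range in (False, True):
        label = "PC" if full_range else "TV"
        print(matrix, label, rgb_to_ycbcr(200, 60, 30, matrix, full_range))
```

If a tool writes one of these variants and the next tool in the chain assumes another, every pixel shifts.  Black stored as TV-range luma 16 but displayed as if it were PC range comes out dark gray, which is exactly the “washed out” look.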

2.  Don’t compare at the same (or nearly the same) bitrate. I saw a VP8 vs x264 comparison the other day that gave VP8 30% more bitrate and then proceeded to demonstrate that it got better PSNR. You would think this is blindingly obvious, but people still make this mistake!  The most common cause of this is assuming that encoders will successfully reach the target bitrate you ask of them — particularly with very broken encoders that don’t.  Always check the output filesizes of your encodes.

How to spot it: The comparison lists perfectly round bitrates for every single test, as opposed to the actual bitrates achieved by the encoders, which will never be exactly matching in any real test.
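
Checking the achieved bitrate takes only a few lines.  The following is a rough Python sketch; the helper names are made up for illustration, and it assumes the file holds only the video stream (container overhead and any audio track would inflate the number) and that you already know the clip duration.

```python
import os

def actual_bitrate_kbps(path, duration_seconds):
    """Average bitrate of a file in kilobits per second (1 kbit = 1000 bits)."""
    size_bits = os.path.getsize(path) * 8
    return size_bits / duration_seconds / 1000.0

def bitrate_mismatch_percent(bitrate_a, bitrate_b):
    """How much larger bitrate_a is than bitrate_b, in percent."""
    return 100.0 * (bitrate_a - bitrate_b) / bitrate_b
```

Reporting the output of something like `actual_bitrate_kbps` for every encode, rather than the bitrate that was requested, makes a mismatch visible immediately.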

3.  Use unfair encoding settings. This is a bit of a wide topic: there are many ways to do this.  We’ll cover the more blatant ones in this part.  Here’s some common ones:

a.  Simply cheat. Intentionally pick awful settings for the encoder you don’t like.

b.  Don’t consider performance. Pick encoding settings without any regard for some particular performance goal.  For example, it’s perfectly reasonable to say “use the best settings possible, regardless of speed”.  It’s also reasonable to look for a particular encoding speed target.  But what isn’t reasonable is to pick extremely fast settings for one encoder and extremely slow settings for another encoder.

c.  Don’t attempt to match compatibility options when it’s reasonable to do so. Keyframe interval is a classic one of these: shorter values reduce compression but improve seeking.  An easy way to cheat is to simply not set them to the same value, biasing towards whatever encoder has the longer interval.  This is most common as an accidental mistake with comparisons involving ffmpeg, where the default keyframe interval is an insanely low 12 frames.

How to spot it: The comparison doesn’t document its approach regarding choice of encoding settings.

4.  Use ratecontrol methods unfairly. Constant bitrate is not the same as average bitrate — using one instead of the other is a great way to completely ruin a comparison.  Another method is to use 1-pass bitrate mode for one encoder and 2-pass or constant quality for another.  A good general approach is that, for any given encoder, one should use 2-pass if available and constant quality if not (it may take a few runs to get the bitrate you want, of course).

Of course, it’s also fine to run a comparison with a particular mode in mind — for example, a comparison targeted at streaming applications might want to test using 1-pass CBR.  Of course, in such a case, if CBR is not available in an encoder, you can’t compare to that encoder.

How to spot it: It’s usually pretty obvious if the encoding settings are given.
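
The “few runs” needed to hit a bitrate in constant quality mode can be automated with a simple bisection.  This is only a sketch under stated assumptions: `encode` is a hypothetical callback that you would write yourself to run one encode and return the achieved bitrate, the 0 to 51 search range is an assumed quantizer-style scale, and bitrate is assumed to fall as the quality parameter rises.

```python
def find_quality_for_bitrate(encode, target_kbps, lo=0.0, hi=51.0,
                             tolerance=0.01, max_runs=12):
    """Bisect a constant-quality parameter until the bitrate is close to target.

    `encode(q)` is a hypothetical callback: it runs one encode at quality
    parameter q and returns the achieved bitrate in kbps.  Bitrate is assumed
    to decrease as q increases (as with a quantizer-style scale).
    Returns the best (q, kbps) pair found.
    """
    best = None
    for _ in range(max_runs):
        q = (lo + hi) / 2.0
        kbps = encode(q)
        if best is None or abs(kbps - target_kbps) < abs(best[1] - target_kbps):
            best = (q, kbps)
        if abs(kbps - target_kbps) <= tolerance * target_kbps:
            break
        if kbps > target_kbps:
            lo = q   # too many bits: move towards lower quality
        else:
            hi = q   # too few bits: move towards higher quality
    return best
```

Whatever tolerance is chosen, the comparison should still report the bitrate actually reached, not the target.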

5.  Use incredibly old versions of encoders. As it happens, Debian stable is not the best source for the most recent encoding software.  Equally bad is using recent versions known to be buggy.

6.  Don’t distinguish between video formats and the software that encodes them. This is incredibly common: I’ve seen tests that claim to compare “H.264” against something else while in fact actually comparing “Quicktime” against something else.  It’s impossible to compare all H.264 encoders at once, so don’t even try; just call the comparison “Quicktime versus X” instead of “H.264 versus X”.  Or better yet, use a good H.264 encoder, like x264, and don’t bother testing awful encoders to begin with.

Less-obvious cheating

1.  Pick a bitrate that’s way too low. Low bitrate testing is very effective at making differences between encoders obvious, particularly if doing a visual comparison.  But past a certain point, it becomes impossible for some encoders to keep up.  This is usually an artifact of the video format itself — a scalability limitation.  Practically all DCT-based formats have this kind of limitation (wavelets are mostly immune).

In reality, this is rarely a problem, because one could merely downscale the video to resolve the problem — lower resolutions need fewer bits.  But people rarely do this in comparisons (it’s hard to do it fairly), so the best approach is to simply not use absurdly low bitrates.  What is “absurdly low”?  That’s a hard question — it ends up being a matter of using one’s best judgement.

This tends to be less of a problem in larger-scale tests that use many different bitrates.

How to spot it: At least one of the encoders being compared falls apart completely and utterly in the screenshots.

Biases towards, a lot: Video formats with completely scalable coding methods (Dirac, Snow, JPEG-2000, SVC).

Biases towards, a little: Video formats with coding methods that improve scalability, such as arithmetic coding, B-frames, and run-length coding.  For example, H.264 and Theora tend to be more scalable than MPEG-4.

2.  Pick a bitrate that’s way too high. This is a staggeringly common mistake: pick a bitrate so high that all of the resulting encodes look absolutely perfect.  The claim is then made that “there’s no significant difference” between any of the encoders tested.  This is surprisingly easy to do inadvertently on sources like Big Buck Bunny, which looks transparent at relatively low bitrates.  An equally common but similar mistake is to test at a bitrate that isn’t so high that the videos look perfect, but high enough that they all look very good.  The claim is then made that “the difference between these encoders is small”.  Well, of course, if you give everything tons of bitrate, the difference between encoders is small.

How to spot it: You can’t tell which image is the source and which is the encode.

3.  Making invalid comparisons using objective metrics. I explained this earlier in the linked blog post, but in short, if you’re going to measure PSNR, make sure all the encoders are optimized for PSNR.  Equally, if you’re going to leave the encoder optimized for visual quality, don’t measure PSNR — post screenshots instead.  Same with SSIM or any other objective metric.  Furthermore, don’t blindly do metric comparisons — always at least look at the output as a sanity test.  Finally, do not claim that PSNR is particularly representative of visual quality, because it isn’t.

How to spot it: Encoders with psy optimizations, such as x264 or Theora 1.2, do considerably worse than expected in PSNR tests, but look much better in visual comparisons.
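
For reference, PSNR is nothing more than a logarithmic rescaling of the mean squared error.  A minimal Python sketch, treating a frame as a flat sequence of 8-bit samples (the function name and the toy data are illustrative only):

```python
import math

def psnr(reference, distorted, peak=255):
    """PSNR in dB between two equal-length sequences of 8-bit samples."""
    if len(reference) != len(distorted):
        raise ValueError("frames must have the same number of samples")
    mse = sum((a - b) ** 2 for a, b in zip(reference, distorted)) / len(reference)
    if mse == 0:
        return float("inf")   # identical frames
    return 10.0 * math.log10(peak * peak / mse)

reference = [100, 100, 100, 100]
concentrated = [110, 100, 100, 100]   # one sample off by 10
spread = [105, 105, 105, 105]         # every sample off by 5
# Both distortions have the same mean squared error, so both lines
# print the same PSNR even though the errors are distributed differently.
print(psnr(reference, concentrated))
print(psnr(reference, spread))
```

Because the metric only sees the total squared error, it cannot tell where in the picture the error landed or whether a viewer would notice it, which is why an encoder tuned for the eye can lose on PSNR.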

4.  Lying with graphs. Using misleading scales on graphs is a great way to make the differences between encoders seem larger or smaller than they actually are.  A common mistake is to scale SSIM linearly: in fact, 0.99 is about twice as good as 0.98, not 1% better.  One solution for this is to use db to compare SSIM values.
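
A minimal Python sketch of such a conversion, assuming the common definition of -10 * log10(1 - SSIM); the function name and the cap for near-identical images are choices made for this illustration:

```python
import math

def ssim_to_db(ssim):
    """Express SSIM on a logarithmic scale: -10 * log10(1 - SSIM)."""
    inv = 1.0 - ssim
    if inv <= 1e-10:
        return 100.0   # cap the result for (near-)identical images
    return -10.0 * math.log10(inv)

for value in (0.98, 0.99):
    print(value, round(ssim_to_db(value), 2))
```

On this scale 0.98 maps to roughly 17 and 0.99 to roughly 20, a gap of about 3 units, which corresponds to halving the remaining error (1 - SSIM).  Plotted linearly, the same two points would look almost identical.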

5.  Using lossy screenshots. Posting screenshots as JPEG is a silly, pointless way to worsen an encoder comparison.

Subtle cheating

1.  Unfairly pick screenshots for comparison. Comparing based on stills is not ideal, but it’s often vastly easier than comparing videos in motion.  But it also opens up the door to unfairness.  One of the most common mistakes is to pick a frame immediately after (or on) a keyframe for one encoder, but which isn’t for the other encoder.  Particularly in the case of encoders that massively boost keyframe quality, this will unfairly bias in favor of the one with the recent keyframe.

How to spot it: It’s very difficult to tell, if not impossible, unless they provide the video files to inspect.
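
If you do have the keyframe positions for both encodes, it is easy to restrict screenshots to frames that are not close to a keyframe in either one.  A Python sketch follows; the function names and the minimum distance of 10 frames are arbitrary assumptions for illustration, and extracting the keyframe lists from the files is left out.

```python
import bisect

def frames_since_keyframe(frame, keyframes):
    """Distance from `frame` back to the most recent keyframe (0 = is a keyframe).

    `keyframes` is a sorted list of keyframe indices for one encode.
    """
    i = bisect.bisect_right(keyframes, frame) - 1
    if i < 0:
        raise ValueError("no keyframe at or before this frame")
    return frame - keyframes[i]

def fair_screenshot_frames(keyframes_a, keyframes_b, total_frames, min_distance=10):
    """Frames at least `min_distance` frames past a keyframe in both encodes."""
    return [f for f in range(total_frames)
            if frames_since_keyframe(f, keyframes_a) >= min_distance
            and frames_since_keyframe(f, keyframes_b) >= min_distance]
```

Publishing the frame numbers and the distance to the previous keyframe in each encode alongside the screenshots lets readers check this for themselves.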

2.  Cherry-pick source videos. Good source videos are incredibly hard to come by — almost everything is already compressed and what’s left is usually a very poor example of real content.  Here’s some common ways to bias unfairly using cherry-picking:

a.  Pick source videos that are already heavily compressed. Pre-compressed source isn’t much of an issue if your target quality level for testing is much lower than that of the source, since any compression artifacts in the source will be a lot smaller than those created by the encoders.  But if the source is already very compressed, or you’re testing at a relatively high quality level, this becomes a significant issue.

Biases towards: Anything that uses a similar transform to the source content.  For MPEG-2 source material, this biases towards formats that use the 8×8 DCT or a very close approximation: MPEG-1/2/4, H.263, and Theora.  For H.264 source material, this biases towards formats that use a 4×4 transform: H.264 and VP8.

b.  Pick standard test clips that were not intended for this purpose. There are a wide variety of uncompressed “standard test clips”.  Some of these are not intended for general-purpose use, but rather exist to test specific encoder capabilities.  For example, Mobile Calendar (“mobcal”) is extremely sharp and low motion, serving to test interpolation capabilities.  It will bias incredibly heavily towards whatever encoder uses more B-frames and/or has higher-precision motion compensation.  Other test clips are almost completely static, such as the classic “akiyo”.  These are also not particularly representative of real content.

c.  Pick very noisy content. Noise is — by definition — not particularly compressible.  Both in terms of PSNR and visual quality, a very noisy test clip will tend to reduce the differences between encoders dramatically.

d.  Pick a test clip to exercise a specific encoder feature. I’ve often used short clips from Touhou games to demonstrate the effectiveness of x264’s macroblock-tree algorithm.  I’ve sometimes even used it to compare to other encoders as part of such a demonstration.  I’ve also used the standard test clip “parkrun” as a demonstration of adaptive quantization.  But claiming that either is representative of most real content (and thus can be used as a general determinant of how good encoders are) is of course insane.

e.  Simply encode a bunch of videos and pick the one your favorite encoder does best on.

3.  Preprocessing the source. An encoder test is a test of encoders, not preprocessing.  Some encoding apps may add preprocessors to the source, such as noise reduction.  This may make the video look better (possibly even better than the source), but it’s not a fair part of comparing the actual encoders.

4.  Screw up decoding. People often forget that in addition to encoding, a test also involves decoding — a step which is equally possible to do wrong.  One common error caused by this is in tests of Theora on content whose resolution isn’t divisible by 16.  Decoding is often done with ffmpeg — which doesn’t crop the edges properly in some cases.  This isn’t really a big deal visually, but in a PSNR comparison, misaligning the entire frame by 4 or 8 pixels is a great way of completely invalidating the results.
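
A cheap guard against this is to verify dimensions and crop explicitly before computing any metric.  The Python sketch below treats a frame as a list of rows; the helper names are invented for this example, and the offsets default to zero only as an assumption, since where the padding actually sits depends on the format and decoder.

```python
def padded_size(n, block=16):
    """Round n up to the next multiple of `block`."""
    return (n + block - 1) // block * block

def crop(frame, x, y, width, height):
    """Return the width x height window of `frame` (a list of rows) at (x, y)."""
    return [row[x:x + width] for row in frame[y:y + height]]

def align_for_comparison(source, decoded, x_offset=0, y_offset=0):
    """Crop `decoded` to the source dimensions, refusing anything that does not fit."""
    height, width = len(source), len(source[0])
    cropped = crop(decoded, x_offset, y_offset, width, height)
    if len(cropped) != height or any(len(row) != width for row in cropped):
        raise ValueError("decoded frame does not cover the source picture")
    return cropped

# A 1080-line picture is stored as 1088 lines when padded to a multiple of 16.
print(padded_size(1080))
```

If the decoded frame is larger than the source, a wrong guess for the offsets reproduces exactly the misalignment described above, so it is worth confirming that the metric is sensible on a frame you have also inspected by eye.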

The greatest mistake of all

Above all, the biggest and most common mistake (and the one that leads to many of the problems mentioned here) is the mistaken belief that one, or even a few, tests can really represent all usage fairly.  Any comparison has to have some specific goal: to compare something in some particular case, whether it be “maximum offline compression ignoring encoding speed” or “real-time high-speed video streaming” or whatnot.  And even then, no comparison can represent all use-cases in that category alone.  An encoder comparison can only be honest if it’s aware of its limitations.

25 Responses to “How to cheat on video encoder comparisons”

  1. Kuukunen Says:

    One sign of possible cheating or incompetence is that the comparison is not reproducible.

    Like in science in general, a test result does not mean much if others can’t reproduce it.

    In encoder comparisons this basically means providing the exact encoder versions, full settings and access to test clip(s).

  2. Tim Says:

    Great article. By the way, how come doom9 stopped doing their codec comparisons?

  3. Relgoshan Says:

    You missed one under the “subtle” branch – getting someone else to compare your product against another product, preferably using your suggested settings. Bonus points for calling it a “White Paper”.

  4. Multimedia Mike Says:

    Benchmarks are exclusively the domain of people who have a vested interest (usually financial but often emotional) in the results. That’s what I determined when I tried to put together some honest, unbiased benchmarks to compare the speed of code generated by various compilers.

  5. raylu Says:

    “Finally, do not claim that PSNR is particularly representative of visual quality, because it isn’t.”

    Could you elaborate on this?

  6. Dark Shikari Says:

    @raylu

    It has been known for decades that PSNR is not perfectly correlated with visual quality. There are many critical modern psy optimizations, such as complexity masking, which directly work against PSNR. Energy preservation also hurts PSNR while improving visual quality. Thus, there are a large number of widely used encoder features, known to improve visual quality, which make PSNR worse. Accordingly, it isn’t a very good measure of visual quality.

    As explained in my previous post, the reason it’s used is because it’s easy to optimize for, so if you make all your encoders optimize for it, you have a decent measure of how well they can optimize for Some Metric.

  7. Lachlan Stuart Says:

    Nice article. I too am quite frustrated that it’s near impossible to find good encoder benchmarks and I’m glad you’re running one.

    A bit off topic, but I’ve always wondered whether or not HSL or HSV would be more compressible colorspaces than YUV. From the few rough tests I’ve done, it seems that my eyes are relatively unreceptive to hue offsets and sat noise/quantization, which could mean more bits for Luma… Have you, by any chance, seen or run any tests involving HSL or HSV?

  8. Pengvado Says:

    Problem with HSL/HSV is that they’re cylindrical coordinate systems, which makes linear transforms (such as DCT and motion interpolation) on the Hue coordinate annoying to implement and possibly just wrong. I’m not sure they help perceptual uniformity either; the ideal quantization step size for H depends on S, and for S depends on L.

    Linear transforms aren’t perfectly appropriate for YUV or RGB either (ideally the transform would be gamma aware), but at least they have the right topology.

    Therefore, I would suggest trying CIELAB instead.

  9. D3C0D3R Says:

    >>>One solution for this is to use db to compare SSIM values.
    hi Dark.i use 1/(1-SSIM) //structural dissimilarity, but just curious what formula x264 use to convert SSIM to db.

  10. Relgoshan Says:

    Perhaps SSIM could be extrapolated to a simple percentage? Maybe one that falls to 0% at 0.60 SSIM?

  11. Esurnir Says:

    When will you announce the result of the comparison?

  12. D3C0D3R Says:

    >>> what formula x264 use to convert SSIM to db.
    i beat my laziness and check encoder.c

    x264_log( h, X264_LOG_INFO, "SSIM Mean Y:%.7f (%6.3fdb)\n", ssim, x264_ssim( ssim ) );

    thanx

  13. Lu Tze Says:

    Nice list. Most of these are fairly obvious, although there are also a few I would not have thought of so quickly, such as the “make a screenshot directly after an I-frame”.

    But, still, there are some things you can do to prevent some “unintentional cheating”, if you can even call it that way:

    Choose reasonable default settings in your decoder. Newer versions of x264 are quite fine in this regard, but older ones had B-frames disabled, iirc… and I remember a time when people were arguing that “x264 should not have presets” – well of course if you don’t include presets, it makes it much easier for people doing codec comparisons to intentionally or unintentionally screw up, as well as for any one else who just wants to use the codec for that matter… Similarly, the Xvid-Decoder had Deblocking disabled! And you just mentioned the very low I-frame interval of ffmpeg. This should definitely be avoided. For example, I think the standard settings for x264 can still be tweaked a bit, e.g. introduce a default upper limit for quantizer, maybe 40 or 42, but not the entire 51.

  14. Devin Says:

    I’ve noticed some of these in action as well. Thanks for putting them out there, especially at a time when FUD and biased comparisons threaten the future of video standards.

  15. Winston Says:

    I work for a hardware encoder manufacturer for the surveillance industry, and the frequency with which some of our competitors will knowingly mislead with the technical specifications of their video servers is striking.

    We’ve had multiple customers speak to us complaining that a two channel encoder seemingly cannot process 120fps as advertised. It had been marketed this way by the manufacturer to essentially state that a dual streaming 2-channel codec is running 30fps on each of the 4 streams.

    In recent comparison tests with a certain very prominent manufacturer in the surveillance industry (rhymes with ‘posh’ and starts with a B), they outright lied about the streaming capabilities of the codec, claiming 4cif performance on the 2nd stream when in reality the codec is streaming at 2cif max.

    This might seem a small difference, but when these specifications start creeping into tenders for projects requiring 1000s of encoders, these differences can have a major impact.

  16. witek Says:

    Very good article in many ways. What makes it even more important is that many people working on video encoding, and even researchers and developers of such systems, don’t understand (or are cheating about) what they are really comparing. Not very scientific.

  17. Relgoshan Says:

    http://compression.ru/video/codec_comparison/h264_2010/appendixes.html

    Their full report is a little interesting, but I got the most from Appendices 7 and 8. In the Appendix 7, it is shown that SSIM is generally a better indicator of quality than PSNR. Then Appendix 8 introduces VP8 results. Humiliating VP8 results. They do include some bleating from the WebM dev team at the end, though. “We only had three weeks” my ass, On2 invested FIVE YEARS into VP8. PSNR-tuning is part of their problem, also Mobile Calendar (while uncompressed) is an absurd and unrealistic usage case.

    I am left wondering how FRAPS sequences would look after being recompressed?

  18. BrUtE AiD Says:

    Well, I don’t agree much about source usage: DVD/DVB backups are very popular, so many people take MPEG-2 source encoding performance into serious consideration.

    Of course, with upcoming BD standard it should be taken in consideration too (even if seems unfair versus VP8).

    More generally the right choice is to use lossless sources for encoding comparisons…

    So the most comprensible way to make comparisons is to encode sources @ 2-pass/700 Mb/128k_audio target, IMHO.

    “The Good, the Bad and the Ugly” BD (MondoHE version) would be difficult enough (length, types of scenes, etc.) for a fair comparison.

  19. BiffTannen Says:

    One of the funniest tests I’ve seen is the “Till Halbach: COMPARISON OF OPEN AND FREE VIDEO COMPRESSION SYSTEMS” available at: http://etill.net/projects/dirac_theora_evaluation/include/halbach-2009-dirac_theora-paper.pdf

    This test was commented on (jump to “An aside about the Akiyo graph”) by Theora’s Monty:
    http://people.xiph.org/~xiphmont/demo/theora/demo7.html

    The biggest problem here was the use of a broken encoder, which the tester didn’t notice.

  20. R.T. Systems Says:

    http://tech.slashdot.org/article.pl?sid=10/07/07/219210

    You forget an even easier way to cheat on comparisons. Interpret a study to support your viewpoint even when it directly contradicts it.

    “The reference VP8 encoder holds its own against x264 despite the source material offering x264 a slight advantage.”

  21. hurumi Says:

    One comment about Mobile Calendar sequence from my testing results: it is really steady-and-slow-motion sequence. interpolation filter works. In this case, non-usual rate control shows significant gain (almost comparable to b-slice gain), e.g. very high quality key-frame and lower quality coding of consecutive frames. (with always referring to high quality key-frame = VP8 golden frame)

    I think that it is one of the reasons why VP8 works nicely on this sequence. But it should be noted that a golden frame can be used easily in H.264 by using the long-term reference buffer and slice-based QP control.

    Anyway, it is not a good sequence :)

  22. Relgoshan Says:

    Exactly! But we would need multi-TB videos to properly test how uncompressed live-action is handled. Since this is difficult to come by, VP8’s performance in transcoding MPEG-like video (pretty much any handheld camcorder these days, and all video on disc) becomes especially relevant.

  23. Cocobongo Says:

    @ R.T. Systems
    Right you are! It’s like reading the German Auto Bild car magazine. No matter how well the other non-German cars are built, no German car ever loses a comparison test! Ever! :) It’s sooo funny. You got the 2010 Golf pitted against an (arguably) better Opel (GM) Astra 2010 and after first saying “The Astra is bigger, quieter, firmer, better while the Golf shows its aging design traits” they finish off like this “but the Golf conserves all the quintessential German car building qualities therefore it is our pick of the bunch” I’m in tears.

  24. searler Says:

    Hi, hope this thread isn’t quite finished.
    I am looking to set up a test bed to do some benchmarking. I don’t want to cheat. I am definitely a newbie and know that I have a lot to learn. I have downloaded one of the test sequences from the VQEG mirror at xiph.org and I want to view it in its pristine state (as far as I can on a win32 laptop or linux desktop machine). I downloaded pyuv because it seemed to have promise as a viewer and on reference 9, 625 line, 50 Hz file using settings of ST576 (720*576) for size, 25 frames per second interlaced, YUV colour space, 4:2:2 subsampling, UYVY ordering and 8 bits per sample, I almost have an image. The number of frames and duration look right but the colour is off, there appears to be a ghost image and there are a lot of vertical line artefacts. Any assistance with a better viewer or settings or links to explanatory material greatly appreciated. I want to do this right and I want to move from analysis of video (probably using MSU software on reference file or capture from webcam/std def video camera) to appreciation of bit-rate/quality trade-off for standard encoder(s) and packet loss/quality trade-off. I want to find out acceptable bit rates for interactive latency video ( < 500ms delay) and low latency video (around 1 second max delay).

  25. brunogm Says:

    Here’s another VP8 vs x264: http://www.quavlive.com/video_codec_comparison
