Diary Of An x264 Developer

05/07/2010 (8:57 am)

Simply beyond ridiculous

Filed under: H.265, speed

For the past few years, various improvements on H.264 have been periodically proposed, ranging from larger transforms to better intra prediction.  These finally came together in the JCT-VC meeting this past April, where over two dozen proposals were made for a next-generation video coding standard.  Of course, all of these were in very rough-draft form; it will likely take years to filter them down into a usable standard.  In the process, they'll pick the most useful features (hopefully) from each proposal and combine them into something a bit more sane.  But, of course, it all has to start somewhere.

A number of features were common: larger block sizes, larger transform sizes, fancier interpolation filters, improved intra prediction schemes, improved motion vector prediction, increased internal bit depth, new entropy coding schemes, and so forth.  A lot of these are potentially quite promising and resolve a lot of complaints I’ve had about H.264, so I decided to try out the proposal that appeared the most interesting: the Samsung+BBC proposal (A124), which claims compression improvements of around 40%.

The proposal combines a bouillabaisse of new features, ranging from a 12-tap interpolation filter to 1/12th-pel motion compensation and transforms as large as 64×64.  Overall, I would say it's a good proposal, and I don't doubt their results given the sheer volume of useful features they've dumped into it.  I was a bit worried about complexity, however, as 12-tap interpolation filters don't exactly scream "fast".
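
In case the cost of that filter isn't obvious, here's a minimal sketch of horizontal half-pel interpolation with a 12-tap filter.  The coefficients below are illustrative placeholders (symmetric, summing to 64), not the actual A124 filter; the point is simply that every output pixel costs 12 multiply-accumulates, double that of H.264's 6-tap filter, and a separable 2D filter pays that cost again in the vertical direction.

    #include <stdint.h>

    #define NTAPS 12

    /* Hypothetical symmetric coefficients summing to 64; the real A124
     * filter differs. */
    static const int16_t filt[NTAPS] =
        { -1, 2, -5, 9, -18, 45, 45, -18, 9, -5, 2, -1 };

    static uint8_t clamp255(int v)
    {
        return v < 0 ? 0 : v > 255 ? 255 : v;
    }

    /* Horizontal half-pel interpolation of one row; src needs 5 valid
     * pixels of padding on the left and 6 on the right. */
    void interp_hpel_row(const uint8_t *src, uint8_t *dst, int width)
    {
        for (int x = 0; x < width; x++) {
            int acc = 0;
            for (int t = 0; t < NTAPS; t++)
                acc += filt[t] * src[x + t - 5]; /* 12 MACs per pixel */
            dst[x] = clamp255((acc + 32) >> 6);  /* round, normalize by 64 */
        }
    }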

I prepared myself for the slowness of an unoptimized encoder implementation, compiled their tool, and started a test encode with their recommended settings.

I waited.  The first frame, an I-frame, completed.

I took a nap.

I waited. The second frame, a P-frame, was done.

I played a game of Settlers.

I waited. The third frame, a B-frame, was done.

I worked on a term paper.

I waited. The fourth frame, a B-frame, was done.

After a full 6 hours, 8 frames had encoded.  Yes, at this rate, it would take a full two weeks to encode 10 seconds of HD video.  On a Core i7.  This is not merely slow; this is over 1000 times slower than x264 on “placebo” mode.  This is so slow that it is not merely impractical; it is impossible to even test.  This encoder is apparently designed for some sort of hypothetical future computer from space.  And word from other developers is that the Intel proposal is even slower.
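
To spell out the arithmetic: 8 frames in 6 hours is 45 minutes per frame, so a 10-second clip at 50-60 fps (my assumption for the HD test material) is 500-600 frames, or roughly 375-450 hours of encoding.  Over two weeks, on one machine, for 10 seconds of video.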

This has led me to suspect that there is a great deal of cheating going on in the H.265 proposals.  The goal of the proposals, of course, is to pick the best feature set for the next generation video compression standard.  But there is an extra motivation: organizations whose features get accepted get patents on the resulting standard, and thus income.  With such large sums of money in the picture, dishonesty becomes all the more profitable.

There is a set of rules, of course, to limit how the proposals can optimize their encoders.  If different encoders use different optimization techniques, the results will no longer be comparable — remember, they are trying to compare compression features, not methods of optimizing encoder-side decisions.  Thus all encoders are required to use a constant quantizer, specified frame types, and so forth.  But there are no limits on how slow an encoder can be or what algorithms it can use.

It would be one thing if the proposed encoder were a mere 10 times slower than the current reference; that would be reasonable, given the low level of optimization and the higher complexity of the new standard.  But this is beyond ridiculous.  With the prize going to whoever can eke out the best PSNR for the fewest bits at a given quantizer (with no limits on speed), we're just going to get an arms race of slow encoders, with every company trying the most ridiculous optimizations possible, even if they involve encoding the frame 100,000 times over to choose the optimal parameters.  And the end result will be as I encountered here: encoders so slow that they are simply impossible to even test.
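
To illustrate the degenerate strategy the rules permit (a sketch of the failure mode, not anyone's actual code): exhaustively re-encode with every parameter combination and keep whichever maximizes PSNR.  The cost multiplies with every dimension you add, and nothing in the rules penalizes it.

    #include <float.h>

    /* Hypothetical knobs an encoder might brute-force per frame. */
    typedef struct { int mode, filter, partition; } params_t;
    typedef double (*encode_fn)(const params_t *); /* returns PSNR */

    params_t exhaustive_search(encode_fn encode, int n_modes,
                               int n_filters, int n_parts)
    {
        params_t best = {0, 0, 0}, cur;
        double best_psnr = -DBL_MAX;
        for (cur.mode = 0; cur.mode < n_modes; cur.mode++)
            for (cur.filter = 0; cur.filter < n_filters; cur.filter++)
                for (cur.partition = 0; cur.partition < n_parts; cur.partition++) {
                    /* One full re-encode per combination: the runtime is
                     * n_modes * n_filters * n_parts encodes per frame. */
                    double psnr = encode(&cur);
                    if (psnr > best_psnr) {
                        best_psnr = psnr;
                        best = cur;
                    }
                }
        return best;
    }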

Such an arms race certainly does little good in optimizing for reality, where we don't have 30 years to encode an HD movie: a feature that gives great compression improvements is useless if it's impossible to optimize for in a reasonable amount of time.  Certainly, once the standard is finalized, practical encoders will be written; but it makes no sense to optimize the standard for a use case that doesn't exist.  And even attempting to "optimize" anything is difficult when encoding a few seconds of video takes weeks.

Update: The people involved have contacted me and insist that there was in fact no cheating going on.  This is probably correct; the problem appears to be that the rules that were set out were simply not strict enough, making many changes that I would intuitively consider “cheating” to be perfectly allowed, and thus everyone can do it.

I would like to apologize if I implied that the results weren't valid; they are valid: the Samsung+BBC proposal is definitely one of the best, which is why I picked it to test.  It's just that I think any situation in which it's impossible to test your own software is unreasonable, and thus the entire situation is inherently broken, given the lax rules, the slow baseline encoder, and the absence of any restriction on compute time.

24 Responses to “Simply beyond ridiculous”

  1. Zm_Gorynych Says:

    They are actually going to integrate the proposals into a single codebase, and are debating which codebase to start from.
    Do you want to suggest x264? ;-)

  2. Dark Shikari Says:

    I doubt x264 is a good starting point; it’s too heavily designed with H.264 in mind and not flexible enough for the kind of stuff they’re interested in. Not that it couldn’t be converted into an H.265 encoder when the time comes, but rather that they need to be able to make many complex modifications very quickly to compare them.

  3. Wes Felter Says:

    Doesn’t MPEG have a policy of including some feature from each participant to ensure that no one feels left out of the patent pool?

  4. Aaron Says:

    I once saw an early demo of what became ATSC. It took a 40-foot truck full of SPARC-10s (then super speedy) to encode live MPEG-2 video. It took about 4 of those machines just to decode the video off the tape.

    Moore's law is real, and optimization will get better. History has proven this kind of thing to be a short-term obstacle in a standards process that takes years and years. No worries. Just ride the wave.

  5. Dark Shikari Says:

    @Aaron

    But that’s not the problem. The problem is that in order to decide which features are better NOW, we need it to be fast enough to test NOW. It doesn’t matter if it will be faster in 5 years, because the spec isn’t being written in 5 years, it’s being written now.

  6. sigdrak Says:

    I haven't tested either the Samsung or the BBC proposal, but it doesn't surprise me much: the JM-KTA was already glacially slow, and JCT-VC is simply comparing "complexity" by evaluating speed against it.

    The Nokia/Tandberg/Ericsson proposal was touted as much, much faster, but as you may have noticed, it is not particularly efficient. However, all those companies have allied to propose new software called the TMuC (Test Model under Consideration). The presence of HHI in the mix is only political.

    Overall, it doesn't surprise me much, though. The real problem is rather that they want it out fast (by 2012, or 2013 at worst) while the Test Model isn't even available (they even had to rename their 'Core Experiments' to 'Tool Experiments', using whatever software is available for lack of a TM…), and that's going to be messy, even compared to H.26L…

  7. rak Says:

    What’s the compute density of a 40′ trailer full of 6-core procs and support hardware?

  8. Ed Says:

    @Dark

    I’m with Aaron on this (thanks for the ATSC story)

    You’ve shown that you can’t test on a single i7.

    But getting one or a few hundred cores together isn’t a technical problem – it may be a funding problem. A rack or two of 8-core or 12-core machines isn’t that large or expensive for the likes of Samsung or even the BBC. (But this does make it difficult or impossible for the small guy to get involved.)

    I’d be very interested to hear if there are technical problems in parallelising these proposals, because that would be an issue. Moore’s law is likely to give us very many cores in a package, it isn’t likely to give us any more orders of magnitude of per-core performance.

  9. Dark Shikari Says:

    @Ed

    But of course, they didn’t parallelize them. The proposals aren’t multithreaded. So, again, they can’t be tested.

    Perhaps they could be multithreaded at some point in the future, but the purpose of these proposals is to be tested now, not 6 months from now, not 2 years from now.

  10. compn Says:

    psh, run that thing for a week or two and tell us if samsung+bbc got the 40% improvement.

    dont leave us hanging!

  11. Dark Shikari Says:

    @compn

    I will! It'll be in my next encoder comparison: see http://x264dev.multimedia.cx/?p=372 for more details. I'm not running the whole test, though; I'm slightly cheating by using a 50-frame segment, using x264's ratecontrol to decide how many bits to allocate to that segment, and then running the Samsung+BBC encoder to match that within the error that constant-quantizer mode allows.
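
    For the curious, the matching step is basically a bisection over the constant quantizer. Here's a minimal sketch of that idea; the encode callback is a hypothetical wrapper around the proposal binary, and the 0-51 QP range is an assumption carried over from H.264, not something from the A124 docs:

        /* Find the lowest constant QP whose encoded segment fits the bit
         * budget chosen by x264's ratecontrol. Assumes the segment size
         * decreases monotonically as QP rises. */
        typedef long long (*encode_fn)(int qp); /* returns size in bits */

        int match_bit_budget(encode_fn encode_segment, long long target_bits)
        {
            int lo = 0, hi = 51; /* H.264-style QP range; an assumption */
            while (lo < hi) {
                int qp = (lo + hi) / 2;
                if (encode_segment(qp) > target_bits)
                    lo = qp + 1; /* over budget: raise the quantizer */
                else
                    hi = qp;     /* fits: try a lower QP, higher quality */
            }
            return lo;
        }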

  12. skal Says:

    i'm going to encode footage of guys taking naps.

  13. saintdev Says:

    @skal
    Only if it’s cool time-lapse footage.
    Maybe with some Yackety Sax for accompaniment?

  14. posdnya Says:

    The main thing is _decoding_ speed. If there is a way to decode 1080p in real time on a modern Core i7, then _encoding_ time is not a problem.
    BTW, a lot of people are working on cloud computing. They would be happy to have a job like this, which could utilize the computing power of all the CPUs in the world for a couple of years.

  15. Ben Says:

    Please add an entry encoded with x264 --tune psnr to better show the bitstream's potential, since this video benefits a lot from psy, and in theory you could improve the other encoders likewise (I really do hope VP7/8 support changing macroblock QPs; however, I don't think VP5/6 can, so it might be unfixable).

  16. Dark Shikari Says:

    @Ben

    Yes, I will add a video with x264 with no psy optimizations.

    You can already see how important psy is though; Ateme’s v2 encoder core, for example, is roughly on par visually with the Samsung+BBC proposal (IMO). This means that their psy optimizations count for 40-60% PSNR, at least.

  17. Kemal Ugur Says:

    Hello all,

    Please note that the subjective results for some of the proposals showed different behavior than the objective gains. At least that was the case for JCTVC-A119 (Tandberg-Ericsson-Nokia joint proposal), check out the presentation: http://ftp3.itu.int/av-arch/jctvc-site/2010_04_A_Dresden/JCTVC-A119.ppt

    Best Regards,

    Kemal

  18. John Haugeland Says:

    There's absolutely nothing wrong with those encoding times. Remember that these video standards will be used to deploy commercial movies; for a film studio, a one-time investment in a thousand Core i7s, which can be re-used per movie to make radically better-looking DVDs, is a no-brainer.

    Also, frankly, I think you under-estimate the power of a hand-tuned algorithm. Three orders of magnitude performance improvement is not unrealistic, given sufficient engineering time and talent. Companies will come to market with such codecs, and life will go on.

  19. renadom Says:

    A small word: I have worked in the field of media codec algorithms for some years now, and let me tell you that this issue is completely normal in science.

    It's normal procedure to think ahead when doing scientific research. Yes, that shit is slow… but it's not designed to run on the author's "el cheapo" computer. It's designed to save 40% bandwidth on the end result.

    If scientists developed solutions for the future using scenarios from the past, we'd all be fucked… yes, fucked.

    Just take as an example how fast our hardware has developed… some 15 years ago, having 4MB made you the king of all fuckers… and no one ever thought we'd need more.

    You want an example that suits this? I had the honor of getting a beta of the first mp3 encoder from Fraunhofer IIS in the mid-90s… that shit was slow as fuck… heck, even slower than fuck.

    On my then state-of-the-art Pentium 75MHz, that little fucker took some 3 hours to encode a single song. Yet I knew it was a revolution back then, and I knew that those 3 hours didn't matter shit… because I knew that my Pentium 75 would be even slower than "slower than fuck" in a few years. I knew because some 3 years before, I had worked on a 386@16MHz… Heck, even my cell phone is faster than the PC I owned 10 years ago.

    The job of scientists is to think ahead, and that's what is happening here. Yes, that shit isn't going to work on your fancy Core i7… but that piece of metal will be a door stopper soon, and research institutes *do* have farms… heard of them?

    Egoistic closed minds like the author are science's worst enemy.

    "If it doesn't work on my computer, it's no good." Ughh…

    Sorry if I'm a bit aggressive here… but in my own career I had to face guys like this who couldn't think past their own noses (mostly young business hipsters). I am sure most redditors did as well.

  20. Eric Says:

    @Ed/Dark

    I don't think there's anything corrupt going on here; that just seems silly. The goal is to combine many competing desires for the new standard into a sort of one-size-fits-all solution. That's going to take years of compromise, and I'm almost certain the speed issue is going to be a big one. If the computation/electricity costs to encode 10 seconds of video are as ridiculous as you say, then a compromise will be reached somewhere. Give it time.

  21. Watson Says:

    I remember building the H.264 reference code back in ~2003-2004… it was equally slow. Equally slow, with features that x264 now runs at 30 fps.

    I think the guy is rushing to conclusions on this. That code is solely meant to be a proof of concept, nothing else. You can probably speed it up by a few orders of magnitude; there is no indication that making it fast is inherently impossible.

  22. foxyshadis Says:

    Most likely, as with H.264 itself, the slowest features will be thrown out, while engineers figure out what’s optimizable and what isn’t. That doesn’t excuse optimizing for the test sequence, though.

    What's the point of a 12-tap interpolation filter, though? After 4 taps, you've virtually exhausted the gains of that method; you'd be much better served by an advanced EDI method (like Tritical's nnedi3 AviSynth filter).

    In this case, in order to parallelize, you'd have to separate the video into GOPs and generate ratecontrol for them. Maybe you could press x264farm into service as a preprocessor.

    Be glad it’s not implemented in Matlab or lisp, like most research papers.

    @John Haugeland
    Uh, no, studios don't encode their movies themselves. They farm it out to the lowest bidder that doesn't completely screw it up. Often the studios are even less competent than the encoding houses, so I'm not complaining, just pointing out that your scenario is completely unrealistic.

  23. hurumi Says:

    Hi, I am the main developer of the Samsung+BBC proposal. Maybe I should clarify a few things, because I like this site :-)

    1. Our encoder is very slow, but it is actually only about 6 times slower than JM17, which is a level similar to earlier versions of JM. This slowness is mainly due to full RD optimization at every stage. Please don't compare a reference encoder to a state-of-the-art encoder like x264, which has many early-decision mechanisms. In addition, the MPEG test conditions force the use of a 128-pixel search range and 4 reference frames.

    2. Our decoder is only 2 times slower than JM17. Even with the current rough source code, written in C++ with no assembly at all, D1@30Hz can already be decoded in real time on most PCs. That is almost within range of the market (with a 40% bit-rate reduction compared to H.264).

    3. Most importantly, there is no cheating in the BD-rate numbers. The official anchor is JM17, not JM-KTA. Every proposal is measured against JM17 and reports that number, so the numbers are fair. Why should MPEG compare each proposal against a non-standard codebase like JM-KTA? JM-KTA was just one of the proposals. Actually, in H.264 mode, JM17 is much faster and better than JM-KTA, since JM-KTA is based on JM11.

    4. On sequence-based selection: there were two scenarios, random access and low delay. In random access there is no sequence-based selection, so the result is clean. In the low-delay scenario, we select between only two temporal structures (hierarchical-P or IPPP), rather than doing sequence-based parameter optimization, while only hierarchical-P was used for the anchor. This kind of decision is actually not difficult on the encoder side: simply IPPP for fast-motion sequences and hierarchical-P for slow-motion sequences.

    What if we had used hierarchical-P in all cases? Then there would be absolutely no "cheating" anywhere. We already have the data: only a 2% BD-rate drop, from a 40% gain to a 38% gain, which is still the top result, since the second-best proposal provides 32%. The results do not change at all.

    Finally, a large-scale blind visual test was done as part of the MPEG call for proposals, and JM high profile was also included in the test. Not only the objective numbers but also the subjective MOS scores show consistent results.

    Here is the link to the full source code. You can verify anything technically. (Almost every new coding tool can be switched on/off individually. E.g., you can turn off the 12-tap filter, and the H.264 6-tap filter is used instead. Or you can even specify the number of interpolation taps: 4, 6, 8, 10 or 12.)

    http://hevc.kw.bbc.co.uk/git/w/jctvc-a124.git

    You are welcome to request anything related to our proposal. I KNOW this kind of interaction eventually makes for better technology, especially when it comes from experts like you.

  24. hurumi Says:

    One more thing I have to clarify:

    Actually, we submitted two proposals, A124 (Samsung & BBC) and A125 (BBC & Samsung). However, the two proposals were the same software with two different settings. The main target is to show the coding-efficiency/complexity trade-off within one framework: A124 is designed to show the highest coding efficiency, and A125 is designed to show the possibility of a low-complexity mode. (http://ftp3.itu.int/av-arch/jctvc-site/2010_04_A_Dresden/)

    A125 uses a 6-tap luma filter instead of the 12-tap filter in A124, and a 2-tap chroma filter instead of the 6-tap filter in A124. Many complex coding tools, such as the adaptive loop filter and pixel-based template matching, were not used in A125, in order to demonstrate the low-complexity mode.

    The A125 encoder is still about 3 times slower than JM17 (A124 is 6 times slower), but its decoder running time is already the same as JM17's. That means A125 already reaches the H.264 level, at least in a comparison of non-optimized software. (Please don't compare it with CoreAVC.)

    I fully understand that this does not mean a highly optimized A125 would have complexity similar to x264's. However, I just want to say that the expected complexity of our proposal is not very high, at least on the decoder side. I strongly believe that A125, at least, could be implemented on top of very efficient software such as x264 and easily reach commercial level.

    The coding efficiency of A125 is still one of the best. It showed the 2nd-best average MOS in the MPEG CfP subjective test. Objectively, it ranked 2nd in the random-access scenario (a 32% reduction vs. JM) and 4th in the low-delay scenario (a 29% reduction vs. JM), while A124 ranked 1st in both cases (40% vs. JM in both).

    You can verify A125 (both coding efficiency and decoder running time) by simply adding the '-p 0' option on the command line of the A124 binary provided in the link above.
