x264 has long held the crown as one of the best general-purpose H.264 video encoders, if not the best. With state-of-the-art psy optimizations and powerful internal algorithms, its quality and performance in “normal” situations are mostly unrivaled.
But there are many very important use-cases where this simply isn’t good enough. All the quality and performance in the world does nothing if x264 can’t meet other requirements necessary for a given business. Which brings us to today’s topic: low-latency streaming.
The encoding familiar to most users has effectively “infinite” latency: the output file is not needed by the user until the entire encode is completed. This allows algorithms such as 2-pass encoding, which require that the entire input be processed before even a single frame of the final output is available. This of course becomes infeasible for any sort of live streaming, in which the viewer must see the video some predictable amount of time after it reaches the encoder. Which brings us to our first platform: broadcast television.
x264 is used in thousands of servers at hundreds of head-ends for cable and IPTV broadcast, HD and SD, thanks to our good friends at Avail-TVN. In this situation 2-pass is no longer an option: we’re restricted to 1-pass encoding, for obvious reasons. But we still have a lot of flexibility: latency is not particularly critical since the user isn’t interacting with the content he’s viewing. At most our biggest worry is channel-change time, which can be optimized independent of the actual end-to-end latency.
As such, x264 has received many optimizations that assume a few seconds of lookahead. Avail paid me to develop RC-lookahead, which looks ahead a few seconds to plan future bitrate allocation. Other important features, such as macroblock-tree ratecontrol, sync-lookahead, and frame-based threading, all have their own latency requirements. Furthermore, the stream itself inherently has some latency: the VBV buffer is usually around a second long, and B-frames require a delay as well, on both the encoder and decoder side. Even without x264’s lookahead features, we’d still have a good bit of latency. For those unfamiliar with the topic, the VBV buffer stores the compressed video data on the decoder and is used to absorb fluctuations in bitrate, especially those caused by keyframes.
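To make the VBV’s role concrete, here’s a toy decoder-side buffer model. This is my own illustrative sketch, not x264’s actual rate control: bits arrive at the channel rate, each decoded frame drains its coded size, and a buffer too small to absorb a keyframe spike underflows.

```python
def vbv_underflows(frame_bits, bitrate, fps, buffer_bits, initial_fill):
    """Toy decoder-side VBV model (illustrative sketch, not x264's code).

    Bits arrive from the channel at a constant rate; each frame, when
    decoded, removes its coded size from the buffer. Returns True if any
    frame has not fully arrived by its decode time (buffer underflow).
    """
    fill = initial_fill
    arrival_per_frame = bitrate / fps
    for bits in frame_bits:
        fill = min(fill + arrival_per_frame, buffer_bits)
        if bits > fill:          # frame not fully buffered yet
            return True
        fill -= bits
    return False

# A 5000-bit keyframe followed by small frames, with 1000 bits arriving
# per frame period: a roomy buffer absorbs the spike, a tiny one cannot.
frames = [5000] + [500] * 10
big_buffer = vbv_underflows(frames, 30000, 30, 6000, 5000)    # absorbed
tiny_buffer = vbv_underflows(frames, 30000, 30, 1200, 1200)   # underflows
```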
But some use-cases are more extreme. With interactive video, a 2-10 second delay is completely unusable. Videoconferencing requires latencies below 1 second, preferably much lower. If our target is 200ms of encoding latency, not counting transport time, at 30fps that’s a mere 6 frames. Instantly, every lookahead has to be disabled: there is simply no room for it. Even the regular threading model becomes a problem: it adds one frame of latency per thread beyond the first, which with many threads can quickly eat up that 6-frame budget. And each B-frame we allow adds another frame of latency on top.
The total latency of x264, including encoder/decoder-side buffering, is:
B-frame latency (in frames) + threading latency (in frames) + RC-lookahead (in frames) + sync-lookahead (in frames) + VBV buffer size (in seconds) + time to encode one frame (in milliseconds)

(where the frame-denominated terms are converted to time by dividing by the frame rate)
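The formula above makes for a quick back-of-the-envelope calculator. The function below is my own sketch (parameter names are mine); frame-denominated terms are converted to milliseconds via the frame rate, and frame-based threading is charged one frame per thread beyond the first, as described earlier.

```python
def total_latency_ms(fps, bframes, threads, rc_lookahead, sync_lookahead,
                     vbv_buffer_s, encode_time_ms):
    """Sum the latency terms of the formula above, in milliseconds.

    Frame-based threading adds one frame of latency per thread beyond
    the first; all frame-denominated terms are scaled by the frame period.
    """
    frame_ms = 1000.0 / fps
    frame_terms = bframes + max(threads - 1, 0) + rc_lookahead + sync_lookahead
    return frame_terms * frame_ms + vbv_buffer_s * 1000.0 + encode_time_ms

# A typical "normal" configuration at 25fps: several seconds of latency.
normal = total_latency_ms(25, 2, 5, 40, 10, 1.0, 20)   # 3260.0 ms
# Zero-latency style: no B-frames, no lookahead, one-frame VBV.
low = total_latency_ms(30, 0, 1, 0, 0, 1 / 30, 5)
```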
At the start of October 2009, x264 was completely unsuitable for this use-case. Its handling of tiny VBV buffers, especially without the RC-lookahead (which we’re forced to turn off), was disastrous. And the latency added by threading was completely intolerable in many cases, especially considering that we want to use as much of that 200ms as possible for the VBV buffer. None of this was surprising, of course: low-latency is a use-case that requires very specialized features that most encoders don’t have. In short, x264 needed a miracle.
Fortunately, there was a startup–which has requested not to be named–that saw the potential here. With a few features, x264 could be turned into the most powerful low-latency streaming platform in the world. So, in October 2009, we began work.
The prelude to this work was multi-slice encoding support, which I wrote at the end of August. Among other things, it contained a feature that seemed rather useless at the time, but had been requested by a few clients: the ability to cap the size of each output slice, so that each frame is split into slices no larger than a given maximum. One reason to do this is to fit each slice into a single UDP or TCP packet. We’ll come back to this later.
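The slice cap behaves roughly like a greedy bin-packer over macroblocks. The sketch below is purely illustrative (the 1400-byte budget and the per-macroblock sizes are made-up numbers, not x264 internals): a new slice starts whenever the next macroblock’s coded bytes would push the current slice past the cap.

```python
def split_into_slices(mb_sizes, max_slice_bytes=1400):
    """Greedy sketch of a per-slice size cap: start a new slice whenever
    adding the next macroblock would exceed the byte budget. A single
    macroblock larger than the cap still gets a slice of its own, since
    a slice cannot split below macroblock granularity."""
    slices, current, current_bytes = [], [], 0
    for size in mb_sizes:
        if current and current_bytes + size > max_slice_bytes:
            slices.append(current)
            current, current_bytes = [], 0
        current.append(size)
        current_bytes += size
    if current:
        slices.append(current)
    return slices

# Three 600-byte macroblocks under a 1400-byte cap -> two slices,
# each small enough to ride in a single UDP packet.
```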
The first step was single-frame VBV support. With a single-frame VBV, every single frame is capped to the same maximum size. This means that the server can send each frame the instant it’s encoded, and the client can decode each frame the instant it’s received, without buffering. This effectively eliminates the entire VBV buffer latency, and it improves support for small-but-not-nonexistent buffer sizes as well.
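The arithmetic behind a one-frame VBV is simple: each frame is capped at one frame period’s worth of channel bits, i.e. bitrate divided by framerate. A small sketch (function name is mine):

```python
def single_frame_cap_bytes(bitrate_bps, fps):
    """With a VBV buffer exactly one frame long, each frame's coded size
    is capped at one frame period's worth of channel bits."""
    return bitrate_bps / fps / 8.0

# e.g. a 480 kbit/s stream at 30fps allows at most 2000 bytes per frame,
# and the VBV-induced buffering latency drops to a single frame period.
```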
But single-frame VBV support seems useless at first glance. Keyframes are far larger than normal frames, so if every frame is capped to the same size, the image will completely fall apart at every single keyframe! This is completely intolerable, obviously. This means the video will only work if there are no keyframes in the stream other than the first–which basically assumes only one viewer, that nobody would want to seek in a recorded version of the live stream, and that no packet loss ever occurs for any reason. This doesn’t fit most use-cases. We’ll come back to this later, too.
The second step was to bring back a threading model which was discontinued in 2006 due to its inefficiency: slice-based threading. Normal threading, also known as frame-based threading, uses a clever staggered-frame system for parallelism. But it comes at a cost: as mentioned earlier, every extra thread requires one more frame of latency. Slice-based threading has no such issue: every frame is split into slices, each slice encoded on one core, and then the result slapped together to make the final frame. Its maximum efficiency is much lower for a variety of reasons, but it allows at least some parallelism without an increase in latency. This begins to resolve the latency problem mentioned earlier.
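The latency trade-off between the two threading models can be sketched numerically (the one-frame-per-extra-thread cost comes from the text above; the function names are mine):

```python
def frame_threading_latency_ms(threads, fps):
    """Frame-based threading: one extra frame of latency per thread
    beyond the first, converted to milliseconds via the frame period."""
    return (threads - 1) * 1000.0 / fps

def sliced_threading_latency_ms(threads, fps):
    """Slice-based threading: all threads work on slices of the same
    frame, so extra threads add zero frames of latency."""
    return 0.0

# 8 threads at 30fps: frame threading alone costs about 233ms, already
# past a 200ms budget; sliced threading costs no latency at all.
```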
The final step was to bring it all together with Periodic Intra Refresh. Periodic Intra Refresh completely eliminates the concept of keyframes: instead of periodic keyframes, a column of intra blocks moves across the video from one side to the other, “refreshing” the image. In effect, instead of a big keyframe, the keyframe is “spread” over many frames. The video is still seekable: a special header, called the SEI Recovery Point, tells the decoder to “start here, decode X frames, and then start displaying the video”–this hides the “refresh” effect from the user while the frame loads. Motion vectors are restricted so that blocks on one side of the refresh column don’t reference blocks on the other side, effectively creating a demarcation line in each frame.
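The refresh sweep can be modeled in a few lines. This is an illustration under a simplifying assumption of exactly one macroblock column refreshed per frame, wrapping left to right; the frames-to-recover figure plays the role of the recovery count carried by the SEI Recovery Point:

```python
def refreshed_column(frame_idx, mb_width):
    """Which macroblock column the intra refresh wave covers on this
    frame, sweeping left to right and wrapping (illustrative sketch,
    assuming one column per frame)."""
    return frame_idx % mb_width

def frames_to_recover(mb_width):
    """A decoder joining mid-stream must wait one full sweep before
    every column has been refreshed (analogous to the frame count in
    the SEI Recovery Point)."""
    return mb_width

# For 800x600 video (50 macroblock columns), any 50 consecutive frames
# cover every column, so a mid-stream joiner is clean after 50 frames.
```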
Immediately the previous steps become relevant. Without keyframes, it’s feasible to make every frame capped to the same size. With each frame split into packet-sized slices and the image constantly being refreshed by the magic intra refresh column, packet loss resilience skyrocketed, with a videoconference being “watchable” at losses as absurd as 25%.
No longer does 200ms seem out of reach. If anything, it’s now far more than we need. Because with --tune zerolatency, single-frame VBV, and intra refresh, x264 can achieve end-to-end latency (not including transport) of under 10 milliseconds for an 800×600 video stream. And it’s all open source. Furthermore, CELT provides the perfect open-source low-latency audio equivalent for x264’s video. We already have multiple companies building software around these new features.
Videoconferencing? Pah! I’m playing Call of Duty 4 over a live video stream!