Diary Of An x264 Developer

09/10/2009 (6:36 pm)

iDCT rounding

The quantization process in modern video encoders tends to make a lot of assumptions.  A common one is continuity and uniform step size: that, for example, if we are quantizing the value 2.5, both 2 and 3 will give equal distortion, each being exactly 0.5 away from the correct value.  But this isn’t always true; in reality, we are working with an 8-bit range in each channel, and the inverse transform has to round our high-precision internal values to that small output range.

Normally, this isn’t a problem.  Since AC coefficients have (by definition) different output values for each output pixel, they serve to effectively dither the output of the iDCT.  But what happens when we don’t have any AC coefficients?
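To make the DC-only case concrete, here is a minimal sketch in C.  It assumes an H.264-style 4x4 inverse transform whose last step is a `(x + 32) >> 6` rounding shift; the function names `dc_only_residual`, `clip_pixel` and `reconstruct_dc_only` are hypothetical and are not taken from x264.  With only a DC coefficient, the transform spreads the same value to every pixel, so every pixel in the block gets rounded in exactly the same direction.

```c
#include <stdint.h>

/* Hypothetical helper: the residual that every pixel of a 4x4 block
 * receives when the DC coefficient is the only nonzero coefficient.
 * Assumes an H.264-style inverse transform ending in (x + 32) >> 6.
 * Note: >> on negative values is implementation-defined in C; most
 * compilers perform an arithmetic shift. */
int dc_only_residual(int dequantized_dc)
{
    return (dequantized_dc + 32) >> 6;
}

/* Clamp a reconstructed value to the 8-bit output range. */
uint8_t clip_pixel(int x)
{
    if (x < 0)   return 0;
    if (x > 255) return 255;
    return (uint8_t)x;
}

/* Reconstruct one pixel of a DC-only block from its prediction.
 * Every pixel in the block sees the same residual, so nothing
 * dithers the rounding. */
uint8_t reconstruct_dc_only(uint8_t predicted, int dequantized_dc)
{
    return clip_pixel(predicted + dc_only_residual(dequantized_dc));
}
```

Under these assumptions, dequantized DC values from 32 through 95 all produce a residual of exactly 1, while 31 produces 0 and 96 produces 2.  The high-precision value is lost at the output, and with no AC coefficients there is nothing to spread that rounding error across the block.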


09/02/2009 (10:58 pm)

The hidden cycle-eating demon: L1 cache misses

Filed under: Intel, speed, x264

Most of the resources out there about optimizing cache access talk about L2 cache misses.  That is sensible: an L2 miss is extraordinarily expensive, costing hundreds of cycles for an access to main memory.  By comparison, an L1 miss, costing just a dozen cycles, is nothing.  The hardware reflects the same priority; the memory->L2 prefetcher in modern processors is extremely sophisticated and is very good at avoiding cache misses.  It is also very efficient, making reasonably good use of the limited memory bandwidth available.  There are also dedicated prefetch instructions that hint the prefetcher so that future L2 misses can be avoided.

But what about L1 misses?  There’s vastly less literature on them, and the L2->L1 prefetchers are often barely documented or not even mentioned in official processor literature.  Explicit prefetch instructions are far less useful here because the cost of a miss is low enough that the extra overhead of issuing a set of prefetches is often not worth it.  And yet in many cases, such as in x264, much more time is wasted on L1 misses than on L2 misses.
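For readers who haven’t used explicit prefetches, here is a rough sketch of what one looks like in C.  It relies on the GCC/Clang builtin `__builtin_prefetch`; the function name `sum_bytes_with_prefetch`, the 64-byte cache line size and the 256-byte prefetch distance are all assumptions made for illustration and would need tuning on real hardware.  It is not a claim that this helps: the branch and the prefetch itself are exactly the kind of overhead that can outweigh the cost of a cheap miss.

```c
#include <stddef.h>
#include <stdint.h>

#define CACHE_LINE_BYTES  64   /* assumed line size, not queried from the CPU */
#define PREFETCH_DISTANCE 256  /* assumed look-ahead in bytes, needs tuning */

/* Hypothetical example: sum a byte buffer while hinting the cache
 * about data a few lines ahead. __builtin_prefetch is a GCC/Clang
 * builtin; arguments are (address, 0 = read, 3 = keep in all levels). */
uint32_t sum_bytes_with_prefetch(const uint8_t *buf, size_t n)
{
    uint32_t sum = 0;
    for (size_t i = 0; i < n; i++) {
        /* Issue at most one hint per assumed cache line, and only
         * for addresses that lie inside the buffer. */
        if (i % CACHE_LINE_BYTES == 0 && i + PREFETCH_DISTANCE < n)
            __builtin_prefetch(buf + i + PREFETCH_DISTANCE, 0, 3);
        sum += buf[i];
    }
    return sum;
}
```

The prefetch is only a hint and does not change the result, so the only way to know whether it pays for itself is to measure with and without it.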
