Diary Of An x264 Developer

11/29/2008 (2:32 pm)

Nehalem optimizations: the powerful new Core i7

Filed under: assembly,avail,benchmark,Intel,x264 ::

Here’s a piece I wrote for Avail Media to explain some of the Nehalem optimizations I made in the past month or two.

[Image: pretty graph]

Note: an instruction timing of “X/Y” means a latency of X clocks (after issuing that instruction, one has to wait X clocks to get the result) and an inverse throughput of Y clocks (when running a long series of independent copies of that instruction one after another, one can start a new one every Y clocks).
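To make the “X/Y” notation concrete, here is a rough first-order sketch of what it implies for a run of identical instructions. The helper names `cycles_independent` and `cycles_dependent` are hypothetical, and the model is an assumption: it ignores everything else the CPU does (other instructions, cache misses, port contention) and assumes Y is no larger than X.

```c
/* First-order cycle estimates for n back-to-back copies of one
 * instruction whose timing is written "latency/inv_throughput".
 * Hypothetical helpers for illustration only, not a CPU simulator. */

/* Independent instructions: a new one starts every inv_throughput
 * clocks, and the last result arrives latency clocks after it starts. */
static double cycles_independent(unsigned n, double latency, double inv_throughput)
{
    if (n == 0)
        return 0.0;
    return (n - 1) * inv_throughput + latency;
}

/* Dependent chain: each instruction needs the previous result, so the
 * full latency is paid every time and throughput no longer matters. */
static double cycles_dependent(unsigned n, double latency)
{
    return n * latency;
}
```

For example, under this model 100 independent instructions with a 3/1 timing take about 102 clocks, while 100 of them chained on each other's results take about 300. This is why doubling throughput mostly helps code with plenty of independent work.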

The Nehalem CPU has a number of benefits over the previous Intel generation, the Penryn processor.

First of all, the Nehalem has a much faster SSE unit than the Penryn. A huge number of SSE operations have had their throughput doubled:

Read More…

11/14/2008 (2:50 am)

A simple optimization

I’ll be posting about the Nehalem optimizations soon, but in the meantime, a short and simple post.

You want to find the last nonzero value in an array of 16 16-bit values (DCT coefficients, in this case). How do you do this really quickly, especially when most of the array is expected to be zero? Well, here’s my way (l[16] is the array):

i_last = 0;
if( *(uint64_t*)(l+i_last+8)|*(uint64_t*)(l+i_last+12) ) i_last += 8;
if( *(uint64_t*)(l+i_last+4) ) i_last += 4;
if( *(uint32_t*)(l+i_last+2) ) i_last += 2;
if( l[i_last+1] ) i_last++;

This assumes the array isn’t all zero, but where I used this code, we already knew that it wasn’t, so that wasn’t an issue.

This takes only 4 conditionals (cmov and setne in the compiled asm) to find the exact index of the coefficient. It is over twice as fast as the previous code:

for( j = i_count - 4; j >= 4; j -= 4 ) if( *(uint64_t*)(l+j) ) break;
for( i = j; i < j+4; i++ ) if( l[i] ) i_last = i;

And I would guess it is another factor of 2 or so faster than the naive implementation:

for( i = 0; i < 16; i++ ) if( l[i] ) i_last = i;
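For anyone who wants to try the 4-conditional search outside x264, here is a minimal self-contained sketch. It is not the x264 code itself: the function names `last_nonzero16` and `last_nonzero16_naive` and the helpers `load64` and `load32` are made up for this example, and the pointer casts are swapped for `memcpy` loads so the code does not depend on alignment or strict-aliasing behavior (compilers typically turn these into plain loads). It assumes the array holds `int16_t` values and, as above, is not all zero.

```c
#include <stdint.h>
#include <string.h>

/* Read 4 or 2 consecutive 16-bit coefficients as one integer.
 * Only "is it zero?" matters, so byte order is irrelevant. */
static uint64_t load64(const int16_t *p) { uint64_t v; memcpy(&v, p, sizeof v); return v; }
static uint32_t load32(const int16_t *p) { uint32_t v; memcpy(&v, p, sizeof v); return v; }

/* Index of the last nonzero coefficient among 16.
 * The caller guarantees that at least one coefficient is nonzero. */
static int last_nonzero16(const int16_t l[16])
{
    int i_last = 0;
    if (load64(l + 8) | load64(l + 12)) i_last += 8; /* anything in 8..15?  */
    if (load64(l + i_last + 4))         i_last += 4; /* upper 4 of that 8?  */
    if (load32(l + i_last + 2))         i_last += 2; /* upper 2 of that 4?  */
    if (l[i_last + 1])                  i_last++;    /* upper 1 of that 2?  */
    return i_last;
}

/* Naive reference: scan everything and remember the last hit. */
static int last_nonzero16_naive(const int16_t l[16])
{
    int i_last = 0;
    for (int i = 0; i < 16; i++)
        if (l[i])
            i_last = i;
    return i_last;
}
```

Each test halves the remaining range, so the search is a 4-step binary search over 16 entries that always does the same amount of work regardless of where the last coefficient sits. Comparing its result against the naive scan on a few arrays is a cheap way to check it.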