Diary Of An x264 Developer

05/01/2008 (5:08 pm)

Why, Intel, why?

Filed under: assembly, Intel, stupidity, x264

This diff is highly related to this post.

If one looks through Intel’s instruction set documentation, one thing stands out: there are a whole bunch of instructions which do exactly the same thing but have different opcodes. Intel introduced “movaps” and “movups” for aligned and unaligned 128-bit moves in SSE1, and then added “movdqa” and “movdqu” in SSE2… to do exactly the same thing. The same situation occurs with “pand” and “andps”, and so on. The end result is a number of things:
1. Wasted opcode space on opcodes that do exactly the same thing.
2. Wasted executable size, since movdqa is larger than movaps (3 byte vs 2 byte opcode) despite doing exactly the same thing.
3. Loss of our sanity.
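
To make the size complaint concrete, here is a small sketch that tabulates the hand-assembled bytes for the register-to-register form of each pair. The opcode bytes are the ones listed in Intel's instruction set reference; the table layout and the names `ENCODINGS`, `PAIRS`, `bytes_saved`, and `report` are made up for this illustration. Each encoding includes one ModRM byte for `xmm0, xmm1`, so the totals are 3 vs 4 bytes rather than the 2 vs 3 opcode bytes counted above.

```python
# Hand-assembled encodings of "op xmm0, xmm1" (register-to-register form).
# Opcode bytes are taken from Intel's instruction set reference; the names
# and table layout are illustrative only.
MODRM_XMM0_XMM1 = 0xC1  # mod=11 (register direct), reg=xmm0, rm=xmm1

ENCODINGS = {
    "movaps": bytes([0x0F, 0x28, MODRM_XMM0_XMM1]),
    "movdqa": bytes([0x66, 0x0F, 0x6F, MODRM_XMM0_XMM1]),
    "movups": bytes([0x0F, 0x10, MODRM_XMM0_XMM1]),
    "movdqu": bytes([0xF3, 0x0F, 0x6F, MODRM_XMM0_XMM1]),
    "andps":  bytes([0x0F, 0x54, MODRM_XMM0_XMM1]),
    "pand":   bytes([0x66, 0x0F, 0xDB, MODRM_XMM0_XMM1]),
}

# (SSE1 float-typed form, SSE2 integer-typed form)
PAIRS = [("movaps", "movdqa"), ("movups", "movdqu"), ("andps", "pand")]


def hexdump(code):
    """Format raw bytes as space-separated hex, e.g. '0f 28 c1'."""
    return " ".join(f"{b:02x}" for b in code)


def bytes_saved(sse1, sse2):
    """Extra bytes the SSE2 mnemonic costs compared to its SSE1 twin."""
    return len(ENCODINGS[sse2]) - len(ENCODINGS[sse1])


def report():
    """One line per pair, showing both encodings and the size difference."""
    lines = []
    for sse1, sse2 in PAIRS:
        lines.append(
            f"{sse1:<7}{hexdump(ENCODINGS[sse1]):<12} vs  "
            f"{sse2:<7}{hexdump(ENCODINGS[sse2]):<12} "
            f"(+{bytes_saved(sse1, sse2)} byte)"
        )
    return lines


if __name__ == "__main__":
    print("\n".join(report()))
```

In every pair the integer-typed form needs one mandatory prefix byte (0x66 or 0xF3) that the float-typed form does not, which is where the extra byte per instruction comes from.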

4 Responses to “Why, Intel, why?”

  1. Thomas Says:

    I imagine it has much to do with the development of a CPU. Rather than remake every instruction set, Intel is much more likely to “Copy” the circuitry from the last CPU and “Paste” it into new generations.

    Perhaps as part of this process they have an SSE1 section and an SSE2 section, the SSE1 section dating back to the original P3 (I believe, maybe it was P2) and the SSE2 section being new.

    Those would be my thoughts on why it was done this way. Maybe there is some underlying difference going on that we just don’t see (fewer gates to hop through for an instruction, I don’t know). But this is why I would guess they did it that way. My guess is that the transistors added for having the same instruction twice were not that big of a deal for them.

  2. Abao Says:

    haha… seems like the organisation doesn’t know what they know, hence they recreated the same functions for SSE2… -_o

  3. Mike Stoner Says:

    The reason relates to bypass latencies between execution stacks within the out-of-order engine. For example, on Nehalem, using PAND (instead of ANDPS) within a dependency chain of FP operations would cause a 2-cycle bypass penalty on both ends, moving between the FP and SIMD execution stacks. The MOVAPS/MOVAPD/MOVDQA decision is a special case: for memory loads they all have identical bypass latencies feeding to each stack, so we recommend using MOVAPS to save the instruction byte. However, for reg-reg moves (‘MOVAPS xmm, xmm’), mixing data types will incur additional bypass penalties.

  4. Alecco Says:

    Probably a lot of people drop out of learning SSE programming because of its incoherent API and documentation. It’s so frustrating.
