Most SIMD assembly functions are implemented in a rather straightforward fashion. An experienced assembly programmer can spend 2 minutes looking at C code and either give a pretty good guess at how one would write SIMD for it–or equally–rule out SIMD as an optimization technique for that code. There might be a nonintuitive approach that’s somewhat better, but one can usually get very good results merely by following the most obvious method.
But in some rare cases there is no “most obvious method”, even for functions that would seem extraordinarily simple. These kinds of functions present an unusual situation for the assembly programmer: they find themselves looking at some embarrassingly simple algorithm–one which simply cries out for SIMD–and yet they can’t see an obvious way to do it! So let’s jump into the fray here and look at one of these cases.
A shuffle algorithm is anything which takes an input set of values, rearranges them, and outputs them. A zigzag is a particular form of this type of algorithm: it’s a shuffle that rearranges values in a zigzag pattern. The FFmpeg logo represents the 4×4 progressive-scan zigzag for H.264, for example. In x264, the zigzags are done on 16-bit input data, so a 4×4 zigzag is 16*2 = 32 bytes of data (2 registers’ worth) and an 8×8 zigzag is 64*2 = 128 bytes of data (8 registers’ worth). Since the 4×4 zigzag is so small, it’s relatively easy to do; a few shuffles (NB: shuffle in this case being the “shuffle instruction” that acts on a single register, not all of the data) can put everything into place. The 8×8, by comparison, is enormously more difficult. Every single vector of input values gets spread across the output vectors, making it require an enormous amount of creativity to even write a working function.
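To make the idea concrete, here’s a quick Python sketch (not from x264) of a zigzag as a pure index permutation. The order shown is the familiar H.264 4×4 progressive zigzag; treat the exact table as illustrative rather than authoritative.

```python
# The familiar 4x4 progressive zigzag order (illustrative transcription).
zigzag4x4 = [0, 1, 4, 8, 5, 2, 3, 6, 9, 12, 13, 10, 7, 11, 14, 15]

def scan(block, order):
    """Rearrange a flat block of coefficients according to a scan order."""
    return [block[i] for i in order]

# 16 coefficients, 16-bit each in x264: 32 bytes, i.e. two XMM registers.
out = scan(list(range(16)), zigzag4x4)
assert sorted(out) == list(range(16))  # a scan is a pure permutation
```

The whole function is nothing but data movement, which is exactly why it looks so SIMD-friendly on paper.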
Holger Lubitz, one of our 2008 Google Summer of Code students and our resident assembly god, wrote our 8×8 progressive scan zigzag. Making use of all of the MMX and SSE registers, it was an incredibly complicated masterpiece. I tried to replicate his work by writing a genetic algorithm to design shuffle algorithms from a set of limited operations. This turned out to be a miserable failure: even on smaller shuffles the search space was simply far too large. Despite the fact that the problem consisted entirely of moving data–not even arithmetic–it almost surely required the intellect and creativity of a human mind. This doesn’t bode well for compilers trying to autovectorize any of this.
Holger described his approach to the problem as one of pattern-recognition: he looked for patterns in the output pattern that suggested known algorithms. For example, parts of the zigzag looked very similar to a transpose, which was a known and relatively simple algorithm. From there he bit by bit munged parts of the rest into place.
A few days ago I decided to try this challenge myself on a similar, but probably somewhat easier problem: the 8×8 interlaced-scan zigzag. It’s easier than the progressive zigzag because it’s not nearly as “diagonal”, so the total number of output rows that a given input row can be spread among is smaller. Since the “spread” is reduced, less rearrangement has to be done. Here’s the output pattern:
0 1 2 8 9 3 4 10 16 11 5 6 7 12 17 24 18 13 14 15 19 25 32 26 20 21 22 23 27 33 40 34 28 29 30 31 35 41 48 42 36 37 38 39 43 49 50 44 45 46 47 51 56 57 52 53 54 55 58 59 60 61 62 63
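As a sanity check on the “spread” claim–my own sketch, not from the original post–we can transcribe the table above and count how many output rows each input row of 8 coefficients lands in:

```python
# The interlaced 8x8 scan from above, transcribed for checking.
scan8x8_field = [
     0,  1,  2,  8,  9,  3,  4, 10,  16, 11,  5,  6,  7, 12, 17, 24,
    18, 13, 14, 15, 19, 25, 32, 26,  20, 21, 22, 23, 27, 33, 40, 34,
    28, 29, 30, 31, 35, 41, 48, 42,  36, 37, 38, 39, 43, 49, 50, 44,
    45, 46, 47, 51, 56, 57, 52, 53,  54, 55, 58, 59, 60, 61, 62, 63,
]
assert sorted(scan8x8_field) == list(range(64))  # it's a permutation

# "Spread": how many of the 8 output rows each input row contributes to.
spread = [len({pos // 8 for pos, v in enumerate(scan8x8_field) if v // 8 == r})
          for r in range(8)]
assert max(spread) == 4  # no input row lands in more than 4 output rows
```

No input row is scattered across more than half of the output rows, which is what makes this scan more tractable than the progressive one.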
It seems “sort of” like it’s in order, but it bounces around all weirdly. The first approach I tried, since the “spread” of this was so low, was to load the first row (0–7), the second row (8–15), and so forth, then make however many copies of each we need, shuffle each copy appropriately, and blend the results together with bitwise ORs to get the output rows. Example for the first row, where pb_scan8fielda and pb_scan8fieldb are the two shuffle masks that tell pshufb how to mix the input values:
movdqa xmm0, [input+ 0]              ; 0 1 2 3 4 5 6 7
movdqa xmm1, [input+16]              ; 8 9 10 11 12 13 14 15
pshufb xmm0, [pb_scan8fielda GLOBAL] ; 0 1 2 _ _ 3 4 _
pshufb xmm1, [pb_scan8fieldb GLOBAL] ; _ _ _ 8 9 _ _ 10
por    xmm0, xmm1                    ; 0 1 2 8 9 3 4 10
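To see exactly what those two masks do, here’s a small Python model of pshufb (each mask byte with its high bit set produces zero; otherwise it selects a source byte by its low 4 bits). The mask values are reconstructed for this example–the real pb_scan8fielda/pb_scan8fieldb constants live in x264’s source:

```python
import struct

def pshufb(src, mask):
    # Emulate x86 pshufb: high bit set -> 0, else select src[mask & 15].
    return bytes(0 if m & 0x80 else src[m & 0x0F] for m in mask)

row0 = struct.pack('<8h', *range(0, 8))    # xmm0: coefficients 0-7 as int16
row1 = struct.pack('<8h', *range(8, 16))   # xmm1: coefficients 8-15 as int16
target = [0, 1, 2, 8, 9, 3, 4, 10]         # desired first output row

# Build byte masks: coefficient v occupies bytes 2v and 2v+1 of its register;
# lanes sourced from the other register are zeroed (0x80).
mask_a = bytes(b for v in target
               for b in ((2*v, 2*v + 1) if v < 8 else (0x80, 0x80)))
mask_b = bytes(b for v in target
               for b in ((2*(v-8), 2*(v-8) + 1) if v >= 8 else (0x80, 0x80)))

# The por: zeroed lanes in one shuffle result are filled by the other.
out = bytes(a | b for a, b in zip(pshufb(row0, mask_a), pshufb(row1, mask_b)))
assert list(struct.unpack('<8h', out)) == target
```

The zeroing behavior on the high mask bit is what makes the blend a plain OR rather than a masked merge.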
But this creates a few problems which we can see if we count how many “shuffled sources” we need to blend for each row:
Row 0: 2
Row 1: 4
Row 2: 4
Row 3: 4
Row 4: 4
Row 5: 3
Row 6: 3
Row 7: 2
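These counts can be recomputed mechanically from the scan table–each output row needs one shuffled source per distinct input row (value // 8) it draws from:

```python
scan8x8_field = [
     0,  1,  2,  8,  9,  3,  4, 10,  16, 11,  5,  6,  7, 12, 17, 24,
    18, 13, 14, 15, 19, 25, 32, 26,  20, 21, 22, 23, 27, 33, 40, 34,
    28, 29, 30, 31, 35, 41, 48, 42,  36, 37, 38, 39, 43, 49, 50, 44,
    45, 46, 47, 51, 56, 57, 52, 53,  54, 55, 58, 59, 60, 61, 62, 63,
]
# Distinct input rows feeding each output row of 8 values.
sources = [len({v // 8 for v in scan8x8_field[8*r:8*r + 8]}) for r in range(8)]
assert sources == [2, 4, 4, 4, 4, 3, 3, 2]
assert sum(sources) == 26  # total shuffled sources across all output rows
```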
It becomes immediately clear that this approach is ridiculous. That’s 26 shuffles, 26 loads of shuffle masks, 18 bitwise ORs to combine them, and 18 register copies. On a Conroe, it’ll take 52 clock cycles just for the shuffles. So maybe this approach would work for the first and last rows, but for the rest it’s stupid. An inspection of the last row shows that it can be done very easily with no shuffle at all, since all but the first two values are already in order, so that pretty much leaves this approach as useless except for the first row.
But if we take a closer look at the problematic middle rows, we notice a simple pattern that may be to our advantage.
…26 20 21 22 23 27…
…34 28 29 30 31 35…
…42 36 37 38 39 43…
The “range” of these values is exactly 8: if we do a load starting at 20, we’ll get 20-27, and a load starting at 28 gets us 28-35, and so forth. In other words, each of these groups of six values can be derived from shuffling the results of a single load. Of course, these loads and stores will no longer be 16-byte aligned, so there will be a speed sacrifice there.
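A quick sketch of the derivation for the first of those groups (the shuffle indices here are illustrative stand-ins for what would be a pshufb byte mask in the real code):

```python
# One unaligned load starting at coefficient 20 covers 20..27.
load = list(range(20, 28))

# Pick the six values in zigzag order out of that single load.
idx = [6, 0, 1, 2, 3, 7]
group = [load[i] for i in idx]
assert group == [26, 20, 21, 22, 23, 27]

# The same shuffle, applied to a load 8 coefficients later, gives the
# next middle row's group -- so one mask serves all of them.
assert [x + 8 for x in group] == [34, 28, 29, 30, 31, 35]
```

The fact that one shuffle mask is reusable across all the middle rows is what makes this approach cheap despite the unaligned loads.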
If we follow this to its logical conclusion, using our initial trick for the first row, we get the following:
0 1 2 8 9 3 4 10 __ 11 5 6 7 12 __ __ 18 13 14 15 19 __ __ 26 20 21 22 23 27 __ __ 34 28 29 30 31 35 __ __ 42 36 37 38 39 43 49 50 44 45 46 47 51 56 57 52 53 54 55 58 59 60 61 62 63
This leaves only 9 spots, which can easily be filled in one at a time.
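As a quick check–again my own sketch, not from the post–the partially-filled diagram above really does agree with the full scan everywhere it’s filled in, with exactly nine holes:

```python
scan8x8_field = [
     0,  1,  2,  8,  9,  3,  4, 10,  16, 11,  5,  6,  7, 12, 17, 24,
    18, 13, 14, 15, 19, 25, 32, 26,  20, 21, 22, 23, 27, 33, 40, 34,
    28, 29, 30, 31, 35, 41, 48, 42,  36, 37, 38, 39, 43, 49, 50, 44,
    45, 46, 47, 51, 56, 57, 52, 53,  54, 55, 58, 59, 60, 61, 62, 63,
]
_ = None  # the as-yet-unfilled spots from the diagram
partial = [
     0,  1,  2,  8,  9,  3,  4, 10,   _, 11,  5,  6,  7, 12,  _,  _,
    18, 13, 14, 15, 19,  _,  _, 26,  20, 21, 22, 23, 27,  _,  _, 34,
    28, 29, 30, 31, 35,  _,  _, 42,  36, 37, 38, 39, 43, 49, 50, 44,
    45, 46, 47, 51, 56, 57, 52, 53,  54, 55, 58, 59, 60, 61, 62, 63,
]
gaps = [i for i, v in enumerate(partial) if v is None]
assert len(gaps) == 9
assert all(partial[i] == scan8x8_field[i]
           for i in range(64) if partial[i] is not None)
```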
The result of all this? The assembly function added in this commit. Here are the performance numbers on an i7, in cycles:
zigzag_scan_8x8_field_c: 46.2
zigzag_scan_8x8_field_ssse3: 17.2
Unfortunately, performance is drastically worse on Conroe and Penryn, albeit still better than C. The reason is most likely that Nehalem has both faster shuffles and faster unaligned loads, which combine to give it an enormous advantage on this function, which relies heavily on both. Furthermore, as should be clear from the “SSSE3”, there’s still no version of this assembly function for any AMD CPU or any pre-Core 2 Intel chip.
Due to a combination of lack of caring, enormous amounts of time taken up by school final projects, and the fact that my level of assembly skill falls a tad short of “godlike”, I didn’t bother with an MMX or SSE2 version. The primary problem here is that pshufb, the instruction that made up the core of my approach, isn’t available in anything prior to SSSE3: doing this without pshufb would require a completely different approach from the ground up, rendering everything discussed in this post moot.