I haven’t written a post here for quite some time, so I figured I’d get back into the swing of things with an update on some recent revisions. Further information can be found on the official git repository page.
r891: Instead of checking for the end of the bitstream on every single write, check before encoding each macroblock–and instead of silently truncating the frame upon reaching the end of the malloced memory, reallocate with more room. (patch by me)
r892: A patch by Gabriel for compilation on MSVC and other crappy compiles that don’t maintain mod16 stack alignment–disable all functions that require these on such platforms. This results in a visible CPU flag (“slow_mod4_stack”) to warn the user that their build is crippled.
r893: Various micro-optimizations by me, along with a very cool little idea that saves quite a bit of time in probe_skip. The idea here is that the probe_skip function *almost never* terminates during the chroma check, but you can’t skip it completely, since if you choose to skip a block before looking at the chroma, and it turned out there was significant detail there that you ignored, there might be noticeable artifacting as a result. The solution was to do a quick Sum of Squared Differences (SSD) check before doing the actual probe_skip on each chroma plane; if low enough, skip the check. Even an incredibly low threshold was sufficient for skipping the check nearly half the time yet never changing the results.
r894: Assembly for the lowres interpolation filter used for frametype decision–drastically increases its speed. The C was modified in order to match the assembly version, which had slightly different rounding. I originally wrote an MMX, SSE2, SSE3, and SSSE3 version of this patch, but Loren found an even faster method of doing it and rewrote most of it from scratch before committing it.
r895: Fix a bug in adaptive quantization stemming from the original implementation. One subtle issue with this patch is that on win32, GCC incorrectly puts arrays of all zeroes into .bss even when told to align them–despite the fact that .bss is only labeled as 4-byte aligned. This causes a crash, of course. We resolved this with an ugly hack–tack a “1″ onto the end of the array to stop GCC from putting it in .bss.
r896: Improve the permutation macros for easier use throughout x264′s assembly. (patch by Loren)
r897: Port the noise reduction assembly from libavcodec and improve it. Speed up the C version too. The SSSE3 assembly is roughly 25 times faster than the original C code. (patch by me with help from Loren and Holger)
r898-r899: Update the copyright headers across x264. (patch by me)
r900: Fix a subtle bug in Loren’s frame_init_lowres patch; in some cases the FPU might not be re-initialized with “emms” after running frame_init_lowres, so an emms was added just in case.
r901: Lots of mini-optimizations, one by holger, the rest by me.
r902: Considerable cleanup of the SSD assembly functions to simplify the code. (patch by Loren)
r903: A complete rewrite of the bitstream functions from the ground up. This vastly faster bitstream writer uses 32-bit and 64-bit instead of 8-bit chunks. This eliminates the need for any loop in bitstream writing. The 32-bit writer is quite similar to ffmpeg’s 32-bit writer. The 64-bit writer uses a particularly ingenious system whereby the writer only writes 32 bits at a time, but keeps up to 64 bits in a storage chunk–this means that any variable-length code can be written unconditionally to the chunk, since there’s always at least 32 bits of space left: the write to the bitstream is done afterwards. Because of this, the 64-bit bitstream writer only needs a single if statement in its write function; not even an else. (patch by me with help from Loren)
r904: The golomb functions were far more complicated than necessary–with the new bitstream writer changes many of the special cases were no longer useful speed-wise. Additionally, many of the branches were not necessary in most calls to those functions. (patch by Loren with some changes by me)
r905: When a large static table is stuck in a .h file, its duplicated in every single C file that includes and uses it. This is a waste of memory–its better to declare it once in a C file and use “extern” to address it from other locations. This diff de-duplicates VLC tables and additionally reduces them from using 16-bit to using 8-bit codes, since none of the VLC codes have more than 8 significant bits. (patch by Loren, change to 8-bit by me)
r906: Add support for PCM macroblocks. This type of macroblock is *completely* uncompressed and bypasses the CABAC entropy coder. As such, each PCM block is always 384 bytes. The advantage of this is that sometimes, given the current state of the CABAC encoder, at very low QPs (0-5) and in lossless mode, its possible that a block will exceed 384 bytes in size–so PCM would have been a better option. When rate-distortion optimization is enabled, PCM is considered as a possible option by x264. (patch by me)