Run Length Encoding - SIMD

Question

I'll assume that your issue is with RLE in AC block coefficients encoding in JPEG.

Variant of RLE, used in JPEG is quite specific. Each block of 8x8 pixels is transformed with DCT and quantised. Then DCT output (64 coefficients) is separated for first DC coefficient and 63 AC coefficients. 8x8 block of AC coefs is converted to linear array using Zig-Zag pattern, and then RLE-encoded

ZigZag from wikimedia (File:JPEG_ZigZag.svg, by Alex Khristov, PD)

ZigZag is usually combined with RLE run, so we have RLE with non-linear memory access and should optimize both functions.

WARNING! RLE in JPEG is used only for zero elements!, check F.1.2.2.1, F.1.2.2.3 and Figure F.2 of ISO/IEC 10918-1 : 1993 standard.

Some implementations:

There is libjpeg-tubro (Homepage is libjpeg-turbo.org), and some fork of it was optimized by Linaro for ARM in 2010-2011.

There is list of optimizations in this fork:

10 - ARM NEON optimizations for quantization code in 'forward_DCT' (Integer divisions replaced with floating point multiplications. ... This optimized function is now almost 3x faster than original C variant - jcdctmgr.c - Siarhei Siamashka 2010-11-10

9 - ARM assembly optimizations for 'encode_one_block' (ARM assembly optimizations for 'encode_one_block'. Almost 2x faster than original C variant.) - jchuff.c - Siarhei Siamashka 2010-11-10

8 - ARM NEON optimized version of 'rgb_ycc_convert' - Siarhei Siamashka 2010-11-10

7 - ARM NEON optimized version of 'ycc_rgb_convert' - Siarhei Siamashka 2010-11-10

6 - A minor ARM NEON optimization for 'convsamp' - Siarhei Siamashka 2010-11-10

5 - ARM NEON optimized version of 'jpeg_fdct_ifast' - Siarhei Siamashka 2010-11-10

4 - ARM NEON optimized version of 'jpeg_idct_ifast' - Siarhei Siamashka 2010-11-10

3 - ARM NEON optimized version of 'jpeg_idct_4x4' - Siarhei Siamashka 2010-11-10

Revision 9 was ARM optimization of jchuff.c file, for function encode_one_block, which encodes both AC and DC components of block; but it is done without using NEON

/* Encode a single block's worth of coefficients */
LOCAL(boolean)   encode_one_block

Factually, RLE was not optimized; but ZigZag and last zero coef detection was implemented in ARM assembly in the find_last_nonzero_index helper function. (It was generated using ruby generator) or in linaro git. It was scheduled for dual-issue Cortex-A8:

* Find last nonzero coefficient and produce output in natural order,
* instructions are scheduled to make use of ARM Cortex-A8 dual-issue
* capability

This is the corresponding C code of function:

LOCAL(int)
find_last_nonzero_index (JCOEFPTR block, JCOEFPTR out)
{
  int tmp, i, n = 0;
  for (i = 1; i < DCTSIZE2; i++) {
    if ((tmp = block[jpeg_natural_order[i]]) != 0)
      n = i;
    out[i] = tmp;
}
return n;

There is ARM or NEON optimizations for RLE itself here, but I think this asm code helps RLE, converting it to linear memory access version (encode_one_block, under ifdef __arm__):

  for (k = 2; k <= last_nonzero_index; k += 2) {
      innerloop(k);
  }

r used in innerloop macro is the RLE counter.

This linear RLE with hand-coded ZigZag (for Cortex-A8) should be faster than original C version even for Cortex-A9. Thanks to Siarhei Siamashka!

PS: Current version of libjpeg-turbo has no arm optimizations in this RLE: encode_one_block, Line 454 of jchuff.c, rev929 - innerloop was slightly rewritten as kloop, but RLE is still done in non-linear way; zigzag not separated from it.

Some thoughts about NEON

NEON has set of 32 registers each 64-bit wide (D0..D31; according to Anderson @ ELC 2011, page 5) and, in theory, can be used to store 64 of 16-bit coefficients and to implement ZigZag+RLE. Still searching for implementations...

There is similar zig-zag + RLE in MPEG standards, and there are some efforts of SIMD implementation of it for x86 and arm. There is a blog post in x264dev; 8x8 zigzag x264_zigzag_scan_8x8_frame was implemented for x86 with SSSE3. But for ARM there is only 4x4 NEON zigzag in x264.

PS: Just for me and anybody who don't know Jpeg internals. There is short and understandable introduction to JPEG coding in Cardiff CM0340 lecture slides.

PPS (updated at February 18, 2013, 13:30): For optimizing RLE encoding, we can do a pre-scan searching for zeroes in middle of AC coefficients and then work with this precalculated data. We can even save it as half-bytes or as bits in some NEON regs

UPDATE February 18: Patch author says that comment for commit 9 is inaccurate. There was 2x improvement when this code was compared with jpeg6b, not libjpeg-turbo. He says that unroll by 63 (as in libjpeg-tubro) has almost same speed as this asm solution (on some tests it is slightly better on other not).