Question

I wrote this very naive NEON implementation to convert from RGBA to RGB. It works, but I was wondering whether there is anything else I could do to further improve performance.

I tried playing around with the prefetch size and unrolling the loop a bit more, but performance didn't change much. By the way, are there any rules of thumb for sizing the prefetch? I couldn't find anything useful on the net. Furthermore, in the "ARMv8 Instruction Set Overview" I see there's also a prefetch for store; how is that useful?

Currently it takes around 1.7 ms to convert a 1280x720 image on an iPhone 5s.

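// unsigned int * rgba2rgb_neon(unsigned int * pDst, unsigned int * pSrc, unsigned int count);
// x0 = pDst, x1 = pSrc, w2 = count (in pixels)
_rgba2rgb_neon:
    cmp     w2, #0x7
    b.gt    loop                            // enter the loop only if count > 7

    mov     w0, #0                          // too few pixels: return NULL
    ret

loop:
    prfm    pldl1strm, [x1, #64]            // streaming prefetch hint, 64 bytes ahead

    ld4.8b  {v0, v1, v2, v3}, [x1], #32     // de-interleave 8 pixels into R, G, B, A
    ld4.8b  {v4, v5, v6, v7}, [x1], #32     // next 8 pixels

    prfm    pldl1strm, [x1, #64]

    st3.8b  {v0, v1, v2}, [x0], #24         // interleave and store R, G, B; drop A
    st3.8b  {v4, v5, v6}, [x0], #24

    subs    w2, w2, #16                     // 16 pixels per iteration
    b.gt    loop

done:
    ret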

Solution

First (since I assume you’re targeting iOS), vImage (part of Accelerate.framework) provides this conversion for you, as vImageConvert_RGBA8888toRGB888. This has the advantage of being available on all iOS and OS X systems, so you don’t need to write separate implementations for arm64, armv7s, armv7, i386, and x86_64.
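A minimal sketch of calling the vImage routine from C might look like the following. The wrapper name convert_rgba_to_rgb and its parameters are hypothetical, and the scalar fallback for non-Apple platforms is an addition that exists only so the sketch builds elsewhere; it is not part of vImage.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#ifdef __APPLE__
#include <Accelerate/Accelerate.h>
#endif

/* Hypothetical wrapper: converts a width x height RGBA8888 image to RGB888.
   Row strides are given in bytes. Returns 0 on success, -1 on failure. */
int convert_rgba_to_rgb(uint8_t *dst, size_t dst_row_bytes,
                        const uint8_t *src, size_t src_row_bytes,
                        size_t width, size_t height)
{
    assert(dst != NULL && src != NULL);
#ifdef __APPLE__
    /* Describe both images to vImage and let it do the conversion. */
    vImage_Buffer s = { .data = (void *)src, .height = height,
                        .width = width, .rowBytes = src_row_bytes };
    vImage_Buffer d = { .data = dst, .height = height,
                        .width = width, .rowBytes = dst_row_bytes };
    vImage_Error err = vImageConvert_RGBA8888toRGB888(&s, &d, kvImageNoFlags);
    return err == kvImageNoError ? 0 : -1;
#else
    /* Assumed fallback: plain scalar loop that drops the alpha byte. */
    for (size_t y = 0; y < height; y++) {
        const uint8_t *in = src + y * src_row_bytes;
        uint8_t *out = dst + y * dst_row_bytes;
        for (size_t x = 0; x < width; x++) {
            out[3 * x + 0] = in[4 * x + 0];
            out[3 * x + 1] = in[4 * x + 1];
            out[3 * x + 2] = in[4 * x + 2];
        }
    }
    return 0;
#endif
}
```

For a tightly packed 1280x720 image, the row strides would be 1280 * 4 bytes for the source and 1280 * 3 bytes for the destination.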

Now, it may be that you’re writing this conversion as an exercise yourself, and not because you simply didn’t know that one was already available. In that case:

  • Avoid using ld[34] or st[34]. They are convenient but generally slower than using ld1 and a permute.
  • For completely regular data access patterns like this, manual prefetch isn’t necessary.
  • Load four 16-byte RGBA vectors with ld1.16b, extract three 16-byte RGB vectors from them with three tbl.16b instructions, and store them with st1.16b.
  • Alternatively, try using non-temporal loads and stores (ldnp/stnp), as your image size is too large to fit in the caches.
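To make the ld1/tbl/st1 suggestion concrete, here is a portable scalar model in C of what the three table lookups would compute. It is not NEON code: it assumes the four loaded vectors are used together as a single 64-byte table (the four-register form of tbl), and the helper names tbl64, make_rgb_indices and rgba2rgb_model are hypothetical. Its main use is to show which index bytes the three permute vectors would need to hold.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Scalar model of a four-register tbl.16b lookup: each index byte selects
   one byte from a 64-byte table; an out-of-range index yields 0. */
void tbl64(uint8_t out[16], const uint8_t table[64], const uint8_t idx[16])
{
    for (int j = 0; j < 16; j++)
        out[j] = idx[j] < 64 ? table[idx[j]] : 0;
}

/* Build the three 16-byte index vectors. Output byte n (0..47) is channel
   n % 3 of pixel n / 3, which lives at input byte 4 * (n / 3) + n % 3. */
void make_rgb_indices(uint8_t idx[3][16])
{
    for (int n = 0; n < 48; n++) {
        int pixel = n / 3;
        int channel = n % 3;
        idx[n / 16][n % 16] = (uint8_t)(4 * pixel + channel);
    }
}

/* Convert count pixels, 16 at a time: 64 input bytes become 48 output
   bytes per iteration. This sketch requires count to be a multiple of 16. */
void rgba2rgb_model(uint8_t *dst, const uint8_t *src, size_t count)
{
    uint8_t idx[3][16];
    assert(count % 16 == 0);
    make_rgb_indices(idx);
    for (size_t p = 0; p < count; p += 16) {
        for (int k = 0; k < 3; k++)
            tbl64(dst + 16 * k, src, idx[k]);   /* one lookup per output vector */
        src += 64;
        dst += 48;
    }
}
```

In an actual NEON version the three index vectors would presumably be loaded once, outside the loop, and a real implementation would also need to handle a pixel count that is not a multiple of 16.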

Finally, to answer your question: a prefetch hint for stores is primarily useful because some implementations might have a significant stall for a partial line write that misses cache. Especially simple implementations might have a stall for any write that misses cache.

OTHER TIPS

See also vImageFlatten_RGBA8888toRGB888 if you want something interesting done with the alpha channel besides chucking it over your shoulder.

Licensed under: CC-BY-SA with attribution