First (since I assume you’re targeting iOS), vImage (part of Accelerate.framework) already provides this conversion as vImageConvert_RGBA8888toRGB888. It has the advantage of being available on all iOS and OS X systems, so you don’t need to write separate implementations for arm64, armv7s, armv7, i386, and x86_64.
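As a rough illustration of the vImage route, here is a minimal sketch in C. The wrapper name rgba_to_rgb and its parameter layout are hypothetical, made up for this example. Only vImage_Buffer, vImageConvert_RGBA8888toRGB888, kvImageNoFlags and kvImageNoError come from Accelerate. The non-Apple branch is a plain scalar loop added so the sketch compiles elsewhere; it is not part of vImage.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#if defined(__APPLE__)
#include <Accelerate/Accelerate.h>
#endif

/* Hypothetical wrapper (name and signature are assumptions for this sketch).
   Converts an RGBA8888 image to RGB888. Returns 0 on success, -1 on error. */
static int rgba_to_rgb(const uint8_t *src, size_t src_row_bytes,
                       uint8_t *dst, size_t dst_row_bytes,
                       size_t width, size_t height)
{
    assert(src != NULL && dst != NULL);
#if defined(__APPLE__)
    /* vImage_Buffer fields are, in order: data, height, width, rowBytes. */
    vImage_Buffer in  = { (void *)src, height, width, src_row_bytes };
    vImage_Buffer out = { dst, height, width, dst_row_bytes };
    vImage_Error err = vImageConvert_RGBA8888toRGB888(&in, &out, kvImageNoFlags);
    return err == kvImageNoError ? 0 : -1;
#else
    /* Scalar fallback for platforms without Accelerate: drop every 4th byte. */
    for (size_t y = 0; y < height; ++y) {
        const uint8_t *s = src + y * src_row_bytes;
        uint8_t *d = dst + y * dst_row_bytes;
        for (size_t x = 0; x < width; ++x) {
            d[3 * x + 0] = s[4 * x + 0];
            d[3 * x + 1] = s[4 * x + 1];
            d[3 * x + 2] = s[4 * x + 2];
        }
    }
    return 0;
#endif
}
```

For a tightly packed image you would pass 4 * width and 3 * width as the two row strides.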
Now, it may be that you’re writing this conversion as an exercise yourself, and not because you simply didn’t know that one was already available. In that case:
- Avoid using ld[34] or st[34]. They are convenient but generally slower than using ld1 and a permute.
- For completely regular data access patterns like this, manual prefetch isn’t necessary.
- Load four 16b RGBA vectors with ld1.16b, extract three 16b RGB vectors from them with three tbl.16b instructions, and store them with st1.16b.
- Alternatively, try using non-temporal loads and stores (ldnp/stnp), as your image size is too large to fit in the caches.
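The ld1 / tbl / st1 approach from the list above could look roughly like this with NEON intrinsics in C. This is a sketch under assumptions, not a tuned implementation: the function name rgba_to_rgb_tbl and the three index tables are my own, and the compiler decides which instructions the intrinsics actually become. Each table lookup reads from a 32-byte table formed by two adjacent input vectors. The scalar loop handles the tail, and the whole image on non-arm64 targets.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#if defined(__aarch64__)
#include <arm_neon.h>

/* Byte indices into a 32-byte table made of two adjacent RGBA vectors.
   Pixel p of the pair lives at bytes 4p..4p+3; the alpha bytes are skipped. */
static const uint8_t kIdx0[16] = {0, 1, 2, 4, 5, 6, 8, 9, 10, 12, 13, 14, 16, 17, 18, 20};
static const uint8_t kIdx1[16] = {5, 6, 8, 9, 10, 12, 13, 14, 16, 17, 18, 20, 21, 22, 24, 25};
static const uint8_t kIdx2[16] = {10, 12, 13, 14, 16, 17, 18, 20, 21, 22, 24, 25, 26, 28, 29, 30};
#endif

/* Hypothetical helper: converts pixel_count RGBA pixels to packed RGB. */
static void rgba_to_rgb_tbl(const uint8_t *src, uint8_t *dst, size_t pixel_count)
{
    assert(src != NULL && dst != NULL);
    size_t i = 0;
#if defined(__aarch64__)
    const uint8x16_t idx0 = vld1q_u8(kIdx0);
    const uint8x16_t idx1 = vld1q_u8(kIdx1);
    const uint8x16_t idx2 = vld1q_u8(kIdx2);
    /* 16 pixels per iteration: 64 bytes in, 48 bytes out. */
    for (; i + 16 <= pixel_count; i += 16) {
        const uint8_t *s = src + 4 * i;
        uint8_t *d = dst + 3 * i;
        uint8x16x2_t t01, t12, t23;
        t01.val[0] = vld1q_u8(s);
        t01.val[1] = vld1q_u8(s + 16);
        t12.val[0] = t01.val[1];
        t12.val[1] = vld1q_u8(s + 32);
        t23.val[0] = t12.val[1];
        t23.val[1] = vld1q_u8(s + 48);
        vst1q_u8(d,      vqtbl2q_u8(t01, idx0));
        vst1q_u8(d + 16, vqtbl2q_u8(t12, idx1));
        vst1q_u8(d + 32, vqtbl2q_u8(t23, idx2));
    }
#endif
    /* Scalar tail (and the full image when not building for arm64). */
    for (; i < pixel_count; ++i) {
        dst[3 * i + 0] = src[4 * i + 0];
        dst[3 * i + 1] = src[4 * i + 1];
        dst[3 * i + 2] = src[4 * i + 2];
    }
}
```

If you write this by hand in assembly instead, the same three index vectors can be loaded once outside the loop and reused for every block.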
Finally, to answer your question: a prefetch hint for stores is primarily useful because some implementations might have a significant stall for a partial line write that misses cache. Particularly simple implementations might stall on any write that misses cache.
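For reference, this is roughly how a store prefetch hint is expressed from C, using the GCC/Clang builtin __builtin_prefetch with its second argument set to 1 (prepare for write). Everything else here is an assumption made for illustration: the function name, the 64-byte line size, and the prefetch distance are made up, and whether the hint helps at all depends on the implementation you run on.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#if defined(__GNUC__) || defined(__clang__)
/* rw = 1: prepare for a write; locality = 0: no temporal locality expected. */
#define PREFETCH_FOR_STORE(p) __builtin_prefetch((p), 1, 0)
#else
#define PREFETCH_FOR_STORE(p) ((void)0)
#endif

/* Assumed values for this sketch; real hardware may differ. */
enum { kLineBytes = 64, kAheadBytes = 4 * 64, kBlockPixels = 64 };

/* Hypothetical helper: scalar RGBA to RGB copy that hints upcoming output lines. */
static void rgba_to_rgb_store_hint(const uint8_t *src, uint8_t *dst, size_t pixel_count)
{
    assert(src != NULL && dst != NULL);
    const size_t dst_bytes = 3 * pixel_count;
    size_t i = 0;
    while (i < pixel_count) {
        /* 64 pixels produce 192 output bytes, i.e. three assumed 64-byte lines. */
        size_t block_end = i + kBlockPixels < pixel_count ? i + kBlockPixels : pixel_count;
        for (size_t off = 0; off < 3 * kBlockPixels; off += kLineBytes) {
            size_t ahead = 3 * i + kAheadBytes + off;
            if (ahead < dst_bytes) {
                PREFETCH_FOR_STORE(dst + ahead);  /* hint only, stays inside the buffer */
            }
        }
        for (; i < block_end; ++i) {
            dst[3 * i + 0] = src[4 * i + 0];
            dst[3 * i + 1] = src[4 * i + 1];
            dst[3 * i + 2] = src[4 * i + 2];
        }
    }
}
```

The hint never changes the result, only (possibly) the timing, so measure before keeping it.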