Optimize RGBA->RGB arm64 assembly

Question 1

First (since I assume you’re targeting iOS), vImage (part of Accelerate.framework) provides this conversion for you as vImageConvert_RGBA8888toRGB888. It is available on all iOS and OS X systems, so you don’t need to write separate implementations for arm64, armv7s, armv7, i386, and x86_64.

Now, it may be that you’re writing this conversion as an exercise yourself, and not because you simply didn’t know that one was already available. In that case:

Avoid using ld[34] or st[34]. They are convenient but generally slower than using ld1 and a permute.
For completely regular data access patterns like this, manual prefetch isn’t necessary.
Load four 16-byte RGBA vectors with ld1.16b, extract three 16-byte RGB vectors from them with three tbl.16b instructions, and store them with st1.16b.
Alternatively, try using non-temporal loads and stores (ldnp/stnp), as your image size is too large to fit in the caches.
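
The ld1/tbl/st1 approach can be sketched in C with NEON intrinsics. This is a rough sketch under stated assumptions, not tuned assembly: the names build_rgb_indices and rgba_to_rgb_blocks are hypothetical, the pixel count is assumed to be a multiple of 16, and it uses the four-register table form (vqtbl4q_u8) for simplicity, which may not be the fastest encoding of the permute. On non-arm64 targets a scalar loop does the same byte selection, so the index tables can be checked anywhere.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#if defined(__aarch64__)
#include <arm_neon.h>
#endif

/* Hypothetical helper: for each of the 48 RGB output bytes produced from a
   64-byte block of 16 RGBA pixels, record which input byte to copy.
   Output byte i belongs to pixel i/3, channel i%3; alpha is never selected. */
static void build_rgb_indices(uint8_t idx[48]) {
    for (int i = 0; i < 48; i++)
        idx[i] = (uint8_t)((i / 3) * 4 + (i % 3));
}

/* Hypothetical helper: converts `pixels` RGBA pixels to packed RGB.
   Assumes pixels is a multiple of 16 (no tail handling in this sketch). */
void rgba_to_rgb_blocks(const uint8_t *src, uint8_t *dst, size_t pixels) {
    assert(pixels % 16 == 0);

    uint8_t idx[48];
    build_rgb_indices(idx);

#if defined(__aarch64__)
    /* Three index vectors, one per 16-byte output vector. */
    const uint8x16_t i0 = vld1q_u8(idx);
    const uint8x16_t i1 = vld1q_u8(idx + 16);
    const uint8x16_t i2 = vld1q_u8(idx + 32);
#endif

    for (size_t p = 0; p < pixels; p += 16, src += 64, dst += 48) {
#if defined(__aarch64__)
        /* Four plain 16-byte loads (ld1), no de-interleaving. */
        uint8x16x4_t t;
        t.val[0] = vld1q_u8(src);
        t.val[1] = vld1q_u8(src + 16);
        t.val[2] = vld1q_u8(src + 32);
        t.val[3] = vld1q_u8(src + 48);

        /* Three table lookups (tbl) and three plain stores (st1). */
        vst1q_u8(dst,      vqtbl4q_u8(t, i0));
        vst1q_u8(dst + 16, vqtbl4q_u8(t, i1));
        vst1q_u8(dst + 32, vqtbl4q_u8(t, i2));
#else
        /* Portable stand-in for the three tbl lookups. */
        for (int i = 0; i < 48; i++)
            dst[i] = src[idx[i]];
#endif
    }
}
```

Each output vector only draws bytes from two adjacent input vectors, so a two-register table lookup per output would also work; whether that is faster depends on the core.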

Finally, to answer your question: a prefetch hint for stores is primarily useful because some implementations might have a significant stall for a partial line write that misses cache. Particularly simple implementations might stall on any write that misses cache.

Question 2

See also vImageFlatten_RGBA8888ToRGB888 if you want something interesting done with the alpha channel besides chucking it over your shoulder.