NEON optimization of interleaved YUYV to gray

Question 1

Here is a starting point. From here you can add cache preloads, loop unrolling, etc. Performance is usually best when the work is spread across more NEON registers, so that an instruction is not stalled waiting on the result of the one before it.

 .equ CAM_HEIGHT, 480 @ fill in the correct values
 .equ CAM_WIDTH,  640

@
@ Call from C as convert_yuyv_to_y(const void *src, char *dest);
@
 .global convert_yuyv_to_y
convert_yuyv_to_y:
  mov r2,#CAM_HEIGHT     @ r2 = rows remaining
cvtyuyv_top_y:
  mov r3,#CAM_WIDTH      @ r3 = pixels remaining in this row
cvtyuyv_top_x:
  vld2.8 {d0,d1},[r0]!   @ load 16 bytes: d0 = 8 Y bytes, d1 = U/V bytes
  vst1.8 {d0},[r1]!      @ store 8 gray pixels
  subs r3,r3,#8          @ 8 fewer pixels; assumes width is a multiple of 8
  bgt cvtyuyv_top_x
  subs r2,r2,#1          @ one fewer row
  bgt cvtyuyv_top_y
  bx lr
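
To check the output of a routine like the one above, a plain C reference of the same conversion can be handy. This is only a sketch: the name yuyv_to_y_ref and its width/height parameters are made up for illustration, and it assumes the source rows have no padding.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Scalar reference (hypothetical helper, not part of the assembly above):
 * copies the Y byte of every pixel of a packed YUYV image into a tightly
 * packed 8-bit gray image.
 * Assumes no row padding, i.e. the source stride is exactly 2 * width. */
static void yuyv_to_y_ref(const uint8_t *src, uint8_t *dest,
                          size_t width, size_t height)
{
    assert(src != NULL && dest != NULL);

    size_t pixels = width * height;
    for (size_t i = 0; i < pixels; i++) {
        /* Bytes are laid out Y0 U Y1 V ..., so Y sits at even offsets. */
        dest[i] = src[2 * i];
    }
}
```

Running both versions on the same frame and comparing the two gray buffers with memcmp is a quick way to catch an off-by-one in the loop counters.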

Question 2

(Promoting my comment to answer)

The fewest instructions needed to de-interleave the data on NEON is this sequence:

vld2.8 { d0, d1 }, [r0]!
vst1.8 { d0 }, [r1]!

Here r0 is the source pointer, which advances by 16 bytes each time, and r1 is the destination pointer, which advances by 8 bytes.
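
The same load/store pair can be written in C with NEON intrinsics. The sketch below is an illustration rather than a tuned implementation: yuyv_to_y_step8 is a made-up name, the pixel count is assumed to be a multiple of 8, and a scalar fallback is included so the code also builds on machines without NEON.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#if defined(__ARM_NEON) || defined(__ARM_NEON__)
#include <arm_neon.h>
#endif

/* Hypothetical helper: extracts Y from `pixels` packed YUYV pixels.
 * Assumes `pixels` is a multiple of 8. */
static void yuyv_to_y_step8(const uint8_t *src, uint8_t *dest, size_t pixels)
{
    assert(pixels % 8 == 0);

    for (size_t i = 0; i < pixels; i += 8) {
#if defined(__ARM_NEON) || defined(__ARM_NEON__)
        uint8x8x2_t v = vld2_u8(src);   /* val[0] = Y bytes, val[1] = U/V bytes */
        vst1_u8(dest, v.val[0]);        /* store the 8 Y bytes */
#else
        for (int k = 0; k < 8; k++)     /* scalar model of the same two steps */
            dest[k] = src[2 * k];
#endif
        src  += 16;   /* source advances by 16 bytes, like r0 */
        dest += 8;    /* destination advances by 8 bytes, like r1 */
    }
}
```

Whether the compiler emits exactly the two-instruction sequence for the intrinsics depends on the compiler and its flags, so it is worth checking the generated assembly.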

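Unrolling the loop, loading up to four registers per instruction, and storing a register list spaced by two (d0, d2) can give slightly higher peak throughput, especially when the addresses are aligned (the :128 and :256 hints assert 16-byte and 32-byte alignment):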

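add r1, r0, #32                       @ second source pointer, 32 bytes ahead
start:
vld4.8 { d0, d1, d2, d3 }, [r0:256]   @ d0 = even Y, d1 = U, d2 = odd Y, d3 = V
subs r3, r3, #1                       @ r3 = 64-byte blocks remaining
vld4.8 { d4, d5, d6, d7 }, [r1:256]   @ next 32 source bytes
add r0, r0, #64
add r1, r0, #32                       @ keep r1 32 bytes ahead of r0
vst2.8 { d0, d2 }, [r2:128]!          @ re-interleave even/odd Y: 16 pixels
vst2.8 { d4, d6 }, [r2:128]!          @ :256 is not valid for a two-register vst2
bgt start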

(The register post-increment form does exist, but it is written vstx.y {regs}, [rx], ro rather than [rx, ro]. Here ro is an offset register whose value is added to rx after the access.)
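
To see why a four-way load followed by a two-way store of registers 0 and 2 yields the Y bytes in order, here is a C sketch of one 64-byte loop iteration. The function name yuyv_to_y_step32 is invented for this example, the pixel count is assumed to be a multiple of 32, and the alignment hints of the assembly are not modeled. A scalar fallback keeps it buildable without NEON.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#if defined(__ARM_NEON) || defined(__ARM_NEON__)
#include <arm_neon.h>
#endif

/* Hypothetical helper: extracts Y from `pixels` packed YUYV pixels,
 * 32 pixels (64 source bytes) per outer iteration.
 * Assumes `pixels` is a multiple of 32. */
static void yuyv_to_y_step32(const uint8_t *src, uint8_t *dest, size_t pixels)
{
    assert(pixels % 32 == 0);

    for (size_t i = 0; i < pixels; i += 32) {
        for (int half = 0; half < 2; half++) {   /* two 32-byte loads */
#if defined(__ARM_NEON) || defined(__ARM_NEON__)
            /* val[0] = even Y, val[1] = U, val[2] = odd Y, val[3] = V */
            uint8x8x4_t v = vld4_u8(src);
            uint8x8x2_t y;
            y.val[0] = v.val[0];
            y.val[1] = v.val[2];
            vst2_u8(dest, y);   /* re-interleave: Y0 Y1 Y2 ... Y15 */
#else
            for (int k = 0; k < 8; k++) {
                dest[2 * k]     = src[4 * k];       /* lane 0: even Y */
                dest[2 * k + 1] = src[4 * k + 2];   /* lane 2: odd Y */
            }
#endif
            src  += 32;
            dest += 16;
        }
    }
}
```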

While memory transfer optimizations can be useful, it is still better to ask whether the copy can be skipped altogether or merged with some calculation. This is also the place to consider a planar pixel format, which could avoid the copy completely.
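
As a rough illustration of skipping the copy, the sketch below reads the luma samples in place through a small "view" structure. Everything here (gray_view, the two constructors, luma_sum) is hypothetical and only meant to show the idea. It assumes rows without padding and, for the planar case, a layout where the Y plane comes first in the buffer.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical "view" of the luma samples inside a frame buffer. */
typedef struct {
    const uint8_t *y;     /* address of the first Y sample */
    size_t pixel_step;    /* bytes between horizontally adjacent Y samples */
    size_t row_stride;    /* bytes between the starts of consecutive rows */
    size_t width, height;
} gray_view;

/* Packed YUYV: Y samples sit at every second byte. Assumes no row padding. */
static gray_view gray_view_from_yuyv(const uint8_t *frame, size_t w, size_t h)
{
    gray_view v = { frame, 2, 2 * w, w, h };
    return v;
}

/* Planar layout with the Y plane first: the gray image is the plane itself. */
static gray_view gray_view_from_planar(const uint8_t *frame, size_t w, size_t h)
{
    gray_view v = { frame, 1, w, w, h };
    return v;
}

/* Example consumer: sums luma in place, with no intermediate gray buffer. */
static uint64_t luma_sum(gray_view v)
{
    assert(v.y != NULL);

    uint64_t sum = 0;
    for (size_t r = 0; r < v.height; r++) {
        const uint8_t *row = v.y + r * v.row_stride;
        for (size_t c = 0; c < v.width; c++)
            sum += row[c * v.pixel_step];
    }
    return sum;
}
```

The strided read of the packed format costs more memory bandwidth per useful byte than the planar one, so whether this beats a fast NEON copy followed by dense processing is something to measure rather than assume.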