You need a better test. Your test doesn't use the results of the calculation for anything, so the compiler is just going through the motions. It looks like you're compiling with -O0, which produces a lot of unnecessary loads and stores for debugging purposes. If you compiled with -O3 instead, all of your code would be stripped out as dead code. I rewrote your test to preserve the results and compiled it with -O3. Here are the results:
$ cat neon.c
#include <arm_neon.h>
void runTest(const float vector[], float result[])
{
    float32x4_t vA = vld1q_f32(vector);
    vA = vmulq_f32(vA, vA);
    vst1q_f32(result, vA);
}
$ xcrun -sdk iphoneos clang -arch arm64 -S neon.c -O3
$ cat neon.s
.section __TEXT,__text,regular,pure_instructions
.globl _runTest
.align 2
_runTest: ; @runTest
; BB#0:
ldr q0, [x0]
fmul.4s v0, v0, v0
str q0, [x1]
ret lr
This code looks optimal: one 128-bit load, one vector multiply, and one 128-bit store.
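If you want to check the same function outside of an iOS build, here is a rough sketch of how I'd set it up. The scalar fallback branch and the `__ARM_NEON` / `__ARM_NEON__` guard are my own additions (assumptions, not part of your code) so that the file also builds on a non-ARM machine. The important part is unchanged: the result leaves the function through the `result` pointer, so the optimizer has to keep the multiply.

```c
#if defined(__ARM_NEON) || defined(__ARM_NEON__)
#include <arm_neon.h>
#endif

/* Squares four floats from `vector` and writes them to `result`.
   Because the output is stored through a caller-visible pointer,
   the work cannot be removed as dead code, even at -O3. */
void runTest(const float vector[], float result[])
{
#if defined(__ARM_NEON) || defined(__ARM_NEON__)
    float32x4_t vA = vld1q_f32(vector);   /* load 4 floats */
    vA = vmulq_f32(vA, vA);               /* square each lane */
    vst1q_f32(result, vA);                /* store 4 floats */
#else
    /* Scalar fallback: an assumption added only so this sketch
       compiles on targets without NEON. */
    for (int i = 0; i < 4; ++i)
        result[i] = vector[i] * vector[i];
#endif
}
```

Call it with two arrays of at least four floats each and actually read the output afterwards (print it, compare it, return it). A test that computes into a local variable and then throws the value away gives the compiler nothing it is obliged to keep.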