Question

I'm trying to learn more about ARM assembly and understand what exactly is happening behind the scenes with NEON intrinsics. I'm using the latest Xcode LLVM compiler. I find that often, the assembly produced from intrinsics is actually slower than even plain naive C code.

For example this code:

void ArmTest::runTest()
{

    const float vector[4] = {1,2,3,4};
    float result[4];
    float32x4_t vA = vld1q_f32(vector);

    asm("#Begin Test");

    vA = vmulq_f32(vA, vA);

    asm("#End Test");

    vst1q_f32(result, vA);
}

Produces this output:

#Begin Test

ldr q0, [sp, #16]
stp q0, q0, [fp, #-48]
ldur    q1, [fp, #-32]
fmul.4s v0, v1, v0
str q0, [sp, #16]

#End Test

What I fail to understand is why all these loads and stores are hitting memory here. I must be missing something obvious, right? Also, how would one write this in inline assembly so that it is optimal? I would expect just a single instruction, but the output is very different.

Please help me understand.

Thanks!

Solution

You need a better test. Your test doesn't use the result of the calculation for anything, so the compiler is just going through the motions to make you happy. It looks like you're compiling with -O0, which keeps variables on the stack for debugging purposes and so produces a bunch of unnecessary loads and stores. If you compiled with -O3, all of your code would be stripped out, because nothing depends on the result. I rewrote your test to preserve the result and compiled it with -O3. Here are the results:

$ cat neon.c 
#include <arm_neon.h>

void runTest(const float vector[], float result[])
{
    float32x4_t vA = vld1q_f32(vector);
    vA = vmulq_f32(vA, vA);
    vst1q_f32(result, vA);
}

$ xcrun -sdk iphoneos clang -arch arm64 -S neon.c -O3
$ cat neon.s 
    .section    __TEXT,__text,regular,pure_instructions
    .globl  _runTest
    .align  2
_runTest:                               ; @runTest
; BB#0:
    ldr q0, [x0]
    fmul.4s v0, v0, v0
    str q0, [x1]
    ret lr

This code looks optimal: one load, one multiply, and one store.
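
The question also asked about inline assembly. As a rough sketch only (not something I compiled for the answer above), the same multiply could be written with GCC/Clang extended asm on arm64. The helper names square4_intrinsic and square4_asm, the HAVE_NEON macro, and the scalar fallback for non-ARM hosts are all made up for illustration. Both functions write their result through a pointer, so the optimizer cannot discard the multiply.

```c
#if defined(__ARM_NEON) || defined(__ARM_NEON__)
#include <arm_neon.h>
#define HAVE_NEON 1
#else
#define HAVE_NEON 0
#endif

/* Hypothetical helper: squares four floats with the NEON intrinsic.
 * The result is stored through `out`, so the work stays live at -O3. */
void square4_intrinsic(const float in[4], float out[4])
{
#if HAVE_NEON
    float32x4_t v = vld1q_f32(in);
    v = vmulq_f32(v, v);
    vst1q_f32(out, v);
#else
    /* Assumed scalar fallback so the sketch also builds on non-ARM hosts. */
    for (int i = 0; i < 4; ++i)
        out[i] = in[i] * in[i];
#endif
}

/* Hypothetical helper: the same multiply written as inline assembly.
 * "+w" asks the compiler for a SIMD register that is both read and written. */
void square4_asm(const float in[4], float out[4])
{
#if HAVE_NEON && defined(__aarch64__)
    float32x4_t v = vld1q_f32(in);
    __asm__("fmul %0.4s, %0.4s, %0.4s" : "+w"(v));
    vst1q_f32(out, v);
#else
    /* Assumed scalar fallback, as above. */
    for (int i = 0; i < 4; ++i)
        out[i] = in[i] * in[i];
#endif
}
```

Compiled with -O3 for arm64, I would expect both functions to come out close to the three-instruction listing above, but the inline asm version hides the multiply from the optimizer, so the intrinsic is usually the better choice.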

Licensed under: CC-BY-SA with attribution
Not affiliated with StackOverflow