Question

I have a lot of calculations with complex numbers (usually an array of structs, each holding two values for re and im; see below) and want to speed them up with the NEON C intrinsics. It would be great if you could give me an example of how to speed up something like this:

for (n = 0; n < 1024; n++, p++, ptemp++) {  // get cir_abs, also find the biggest point (value and location).
    abs_squared = (Uns32)(((Int32)(p->re)) * ((Int32)(p->re))
                  + ((Int32)(p->im)) * ((Int32)(p->im)));
    // ...
}

p is an array of this kind:

typedef struct {
    Int16 re;
    Int16 im;
} Complex;

I have already read through chapter 12 of "ARM C Language Extensions" but still have trouble understanding how to load and store this kind of struct so that I can do the calculations on it.


Solution

Use the vld2* intrinsics to de-interleave re and im into separate registers on load, and then process them separately, e.g.

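Complex array[16];

// vld2q_s16 de-interleaves 8 complex values: val[0] holds re, val[1] holds im.
const int16x8x2_t vec_complex = vld2q_s16((const int16_t*)array);
const int16x8_t vec_re = vec_complex.val[0];
const int16x8_t vec_im = vec_complex.val[1];
// re*re + im*im, computed in 16-bit lanes (results wrap modulo 2^16).
const int16x8_t vec_abssq = vmlaq_s16(vmulq_s16(vec_re, vec_re), vec_im, vec_im);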

For the above code, clang 3.3 generates:

vld2.16 {d18, d19, d20, d21}, [r0]
vmul.i16 q8, q10, q10
vmla.i16 q8, q9, q9
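
Note that the snippet above keeps the result in 16-bit lanes, while the loop in the question widens each operand to 32 bits before multiplying. If the full 32-bit value is needed, one option is the widening vmull_s16 / vmlal_s16 intrinsics applied to the low and high halves of each register. The following is only a sketch under some assumptions: Int16 is taken to be int16_t, the helper name complex_abs_squared and the separate output array are made up for illustration, and a scalar loop handles the tail (and non-NEON builds).

```c
#include <stddef.h>
#include <stdint.h>
#if defined(__ARM_NEON) || defined(__ARM_NEON__)
#include <arm_neon.h>
#endif

/* Same layout as the question's struct; int16_t stands in for Int16 (assumption). */
typedef struct {
    int16_t re;
    int16_t im;
} Complex;

/* Hypothetical helper: out[i] = re*re + im*im as a 32-bit unsigned value. */
static void complex_abs_squared(const Complex *p, uint32_t *out, size_t n)
{
    size_t i = 0;
#if defined(__ARM_NEON) || defined(__ARM_NEON__)
    /* Process 8 complex values per iteration. */
    for (; i + 8 <= n; i += 8) {
        /* De-interleave: val[0] = 8 x re, val[1] = 8 x im. */
        const int16x8x2_t v = vld2q_s16((const int16_t *)(p + i));
        const int16x8_t re = v.val[0];
        const int16x8_t im = v.val[1];

        /* Widening multiply: 16-bit x 16-bit -> 32-bit, 4 lanes at a time. */
        int32x4_t lo = vmull_s16(vget_low_s16(re), vget_low_s16(re));
        lo = vmlal_s16(lo, vget_low_s16(im), vget_low_s16(im));
        int32x4_t hi = vmull_s16(vget_high_s16(re), vget_high_s16(re));
        hi = vmlal_s16(hi, vget_high_s16(im), vget_high_s16(im));

        /* The sum fits in 32 unsigned bits, so store it as uint32_t. */
        vst1q_u32(out + i, vreinterpretq_u32_s32(lo));
        vst1q_u32(out + i + 4, vreinterpretq_u32_s32(hi));
    }
#endif
    /* Scalar tail; this is also the whole loop when NEON is unavailable. */
    for (; i < n; i++) {
        const int32_t re = p[i].re;
        const int32_t im = p[i].im;
        out[i] = (uint32_t)(re * re) + (uint32_t)(im * im);
    }
}
```

For the 1024-element case in the question this would be called as complex_abs_squared(p, out, 1024), with out being a caller-provided array of 1024 uint32_t. The search for the largest value and its position could then run over out, either as a plain scalar loop or with further NEON code.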
Licensed under: CC-BY-SA with attribution