Question

How could NEON be as slow as C?

I have been trying to build a fast Histogram function that would bucket incoming values into ranges by assigning them a value - which is the range threshold they are closest to. This is something that would be applied to images so it would have to be fast (assume an image array of 640x480 so 300,000 elements) . The histogram range numbers are multiples (0,25,50,75,100) . Inputs would be float and final outputs would obviously be integers

I tested the following versions on xCode by opening a new empty project (no app delegate) and just using the main.m file. I removed all linked libraries with the exception of Accelerate.

Here is the C implementation: the older version was plenty of if then but here is the final optimized logic. it took 11s and 300ms.

int main(int argc, char *argv[])
{
  NSLog(@"starting");

  int sizeOfArray=300000;

  float* inputArray=(float*) malloc(sizeof(float)*sizeOfArray);
  int* outputArray=(int*) malloc(sizeof(int)*sizeOfArray);

  for (int i=0; i<sizeOfArray; ++i)
  {
    inputArray[i]=88.5;
  }

  //Assume range is [0,25,50,75,100]
  int lcd=25;

  for (int j=0; j<1000; ++j)// just to get some good time interval
  {
    for (int i=0; i<sizeOfArray; ++i)
    {
        //a 60.5 would give a 50. An 88.5 would give 100
        outputArray[i]=roundf(inputArray[i]/lcd)*lcd;
    }
  }
NSLog(@"done");
}

Here is the vDSP implementation. Even with some of the tedious floating to integer back and forth, it took only 6s! almost 50% improvement!

//vDSP implementation
 int main(int argc, char *argv[])
 {
   NSLog(@"starting");

   int sizeOfArray=300000;

   float* inputArray=(float*) malloc(sizeof(float)*sizeOfArray);
   float* outputArrayF=(float*) malloc(sizeof(float)*sizeOfArray);//vDSP requires matching of input output
   int* outputArray=(int*) malloc(sizeof(int)*sizeOfArray); //rounded value to the nearest integere
   float* finalOutputArrayF=(float*) malloc(sizeof(float)*sizeOfArray);
   int* finalOutputArray=(int*) malloc(sizeof(int)*sizeOfArray); //to compare apples to apples scenarios output


   for (int i=0; i<sizeOfArray; ++i)
   {
     inputArray[i]=37.0; //this will produce an final number of 25. On the other hand 37.5 would produce 50.
   }


   for (int j=0; j<1000; ++j)// just to get some good time interval
   {
     //Assume range is [0,25,50,75,100]
     float lcd=25.0f;

     //divide by lcd
     vDSP_vsdiv(inputArray, 1, &lcd, outputArrayF, 1,sizeOfArray);

     //Round to nearest integer
     vDSP_vfixr32(outputArrayF, 1,outputArray, 1, sizeOfArray);

     // MUST convert int to float (cannot just cast) then multiply by scalar - This step has the effect of rounding the number to the nearest lcd.
    vDSP_vflt32(outputArray, 1, outputArrayF, 1, sizeOfArray);
    vDSP_vsmul(outputArrayF, 1, &lcd, finalOutputArrayF, 1, sizeOfArray);
    vDSP_vfix32(finalOutputArrayF, 1, finalOutputArray, 1, sizeOfArray);
   }
  NSLog(@"done");
}

Here is the Neon implementation. This is my first so play nice! it was slower than vDSP and took 9 sec and 300ms which did not make sense to me. Either vDSP is better optimized than NEON or I am doing something wrong.

//NEON implementation
int main(int argc, char *argv[])
{
NSLog(@"starting");

int sizeOfArray=300000;

float* inputArray=(float*) malloc(sizeof(float)*sizeOfArray);
float* finalOutputArrayF=(float*) malloc(sizeof(float)*sizeOfArray);

for (int i=0; i<sizeOfArray; ++i)
{
    inputArray[i]=37.0; //this will produce an final number of 25. On the other hand 37.5 would produce 50.
}



for (int j=0; j<1000; ++j)// just to get some good time interval
{
    float32x4_t c0,c1,c2,c3;
    float32x4_t e0,e1,e2,e3;
    float32x4_t f0,f1,f2,f3;

    //ranges of histogram buckets
    float32x4_t buckets0=vdupq_n_f32(0);
    float32x4_t buckets1=vdupq_n_f32(25);
    float32x4_t buckets2=vdupq_n_f32(50);
    float32x4_t buckets3=vdupq_n_f32(75);
    float32x4_t buckets4=vdupq_n_f32(100);

    //midpoints of ranges
    float32x4_t thresholds1=vdupq_n_f32(12.5);
    float32x4_t thresholds2=vdupq_n_f32(37.5);
    float32x4_t thresholds3=vdupq_n_f32(62.5);
    float32x4_t thresholds4=vdupq_n_f32(87.5);


    for (int i=0; i<sizeOfArray;i+=16)
    {
        c0= vld1q_f32(&inputArray[i]);//load
        c1= vld1q_f32(&inputArray[i+4]);//load
        c2= vld1q_f32(&inputArray[i+8]);//load
        c3= vld1q_f32(&inputArray[i+12]);//load


        f0=buckets0;
        f1=buckets0;
        f2=buckets0;
        f3=buckets0;

        //register0
        e0=vcgtq_f32(c0,thresholds1);
        f0=vbslq_f32(e0, buckets1, f0);

        e0=vcgtq_f32(c0,thresholds2);
        f0=vbslq_f32(e0, buckets2, f0);

        e0=vcgtq_f32(c0,thresholds3);
        f0=vbslq_f32(e0, buckets3, f0);

        e0=vcgtq_f32(c0,thresholds4);
        f0=vbslq_f32(e0, buckets4, f0);



        //register1
        e1=vcgtq_f32(c1,thresholds1);
        f1=vbslq_f32(e1, buckets1, f1);

        e1=vcgtq_f32(c1,thresholds2);
        f1=vbslq_f32(e1, buckets2, f1);

        e1=vcgtq_f32(c1,thresholds3);
        f1=vbslq_f32(e1, buckets3, f1);

        e1=vcgtq_f32(c1,thresholds4);
        f1=vbslq_f32(e1, buckets4, f1);


        //register2
        e2=vcgtq_f32(c2,thresholds1);
        f2=vbslq_f32(e2, buckets1, f2);

        e2=vcgtq_f32(c2,thresholds2);
        f2=vbslq_f32(e2, buckets2, f2);

        e2=vcgtq_f32(c2,thresholds3);
        f2=vbslq_f32(e2, buckets3, f2);

        e2=vcgtq_f32(c2,thresholds4);
        f2=vbslq_f32(e2, buckets4, f2);


        //register3
        e3=vcgtq_f32(c3,thresholds1);
        f3=vbslq_f32(e3, buckets1, f3);

        e3=vcgtq_f32(c3,thresholds2);
        f3=vbslq_f32(e3, buckets2, f3);

        e3=vcgtq_f32(c3,thresholds3);
        f3=vbslq_f32(e3, buckets3, f3);

        e3=vcgtq_f32(c3,thresholds4);
        f3=vbslq_f32(e3, buckets4, f3);


        vst1q_f32(&finalOutputArrayF[i], f0);
        vst1q_f32(&finalOutputArrayF[i+4], f1);
        vst1q_f32(&finalOutputArrayF[i+8], f2);
        vst1q_f32(&finalOutputArrayF[i+12], f3);
    }
}
NSLog(@"done");
}

PS: this is my first benchmarking on this scale so I tried to keep it simple (large loops, setup code constant, using NSlog to print start/end time, only accelerate framework linked). If any of these assumptions are significantly impacting the outcome, please critique.

Thanks

Was it helpful?

Solution

First, this is not "NEON" per-se. This is intrinsics. It is almost impossible to get good NEON performance using intrinsics under clang or gcc. If you think you need intrinsics, you should hand-write the assembler.

vDSP is not "better optimized" than NEON. vDSP on iOS uses the NEON processor. vDSP's use of the NEON is much better optimized than your use of the NEON.

I haven't dug through your intrinsics code yet, but the most likely (in fact almost certain) cause of trouble is that you're creating wait states. Writing in assembler (and intrinsics are just assembler written with welding gloves on), is nothing like writing in C. You don't loop the same. You don't compare the same. You need a new way of thinking. In assembly you can do more than one thing at a time (because you have different logic units), but you absolutely have to schedule things in such a way that all those things can run in parallel. Good assembly keeps all those pipelines full. If you can read your code and it makes perfect sense, it's probably crap assembly code. If you never repeat yourself, it's probably crap assembly code. You need to carefully consider what is going into what register and at how many cycles there are until you're allowed to read it.

If it were as easy as transliterating C, then the compiler would do that for you. The moment you say "I'm going to write this in NEON" you're saying "I think I can write better NEON than the compiler," because the compiler uses it too. That said, it often is possible to write better NEON than the compiler (particularly gcc and clang).

If you're ready to go diving into that world (and it's a pretty cool world), you have some reading ahead of you. Here's some places I recommend:

ALL THAT SAID... Always always always start by reconsidering your algorithm. Often the answer is not how to make your loop calculate quickly, it's how to not call the loop so often.

OTHER TIPS

ARM NEON has 32 registers, 64-bits wide (dual view as 16 registers, 128-bits wide). Your neon implementation already uses at least 18 128-bits wide, so compiler would generate code to move them back and forth from the stack, and that's not good - too much extra memory access.

If you plan to play with assembly, I found it best to use a tool to dump instructions in object files. One is called objdump in Linux, I believe it is called otool in Apple world. This way you can actually see how the resulting machine code looks like, and what did compiler do with your functions.

Below is some part of your neon implementation's dump from gcc (-O3) 4.7.1. You can notice loading a quad register via vldmia sp, {d8-d9}.

1a6:    ff24 cee8   vcgt.f32    q6, q10, q12
1aa:    ff64 4ec8   vcgt.f32    q10, q10, q4
1ae:    ff2e a1dc   vbit    q5, q15, q6
1b2:    ff22 ceea   vcgt.f32    q6, q9, q13
1b6:    ff5c 41da   vbsl    q10, q14, q5
1ba:    ff20 aeea   vcgt.f32    q5, q8, q13
1be:    f942 4a8d   vst1.32 {d20-d21}, [r2]!
1c2:    ec9d 8b04   vldmia  sp, {d8-d9}
1c6:    ff62 4ee8   vcgt.f32    q10, q9, q12
1ca:    f942 6a8f   vst1.32 {d22-d23}, [r2]

Of course this all depends on the compiler, a better compiler can avoid this situation by using available registers more clearly.

So at the end you are at the mercy of the compiler if you don't use assembly (inline, standalone) or should continuously check compiler output until you get what you want from it.

As a complement to Rob's answer, that writing NEON is an art in and of itself (thanks for plugging my Wandering Coder posts, by the way), and auselen's answer (that you are indeed having too many registers live at any given time, leading to spilling), I should add that your intrinsics algorithm is more general than the other two: it allows arbitrary ranges, not just multiples, so you are attempting to compare things that are not comparable. Always compare oranges to oranges; with the exception however that it is fair game to compare a custom algorithm more specific than an off-the-shelf generic one if you only need the specific features of the custom one. So that is another way a NEON algorithm can be as slow as a C one: if they are not the same algorithm.

As for your histogramming needs, use what you've constructed with vDSP for the time being and only if the performance is not satisfying for your application, only then, investigate optimizing another way; avenues to do so, besides using NEON instructions, would include avoiding so much memory movement (likely the bottleneck in the vDSP implementation), and incrementing the counters for each bucket as you are browsing the pixels, instead of having this intermediate output made of coerced values. Efficient DSP code is not just about the calculations themselves, but also how to most efficiently use memory bandwidth and so forth. Even more so on mobile: memory I/O, even to the caches, is more power-intensive than in-processor-core operations, so both the memory I/O buses tend to be run at a lower fraction of the processor clock speed, so you don't have that much memory bandwidth to play with, and you should wisely use the memory bandwidth you do have, as any use of it takes power.

Licensed under: CC-BY-SA with attribution
Not affiliated with StackOverflow
scroll top