Question

I tried to reduce the execution time of this function, and I got the execution time down to

Sys:0.001s

Is there any way to reduce the execution time of this function further?

 int function(uint32_t *r, const uint32_t *a, const uint32_t *b, int n)
   {
      int i;
      uint32_t ri, c=0;
      for (i = 0; i < n; i ++)
          {
             ri = a[i] + b[i] + c;
             c = ((ri < a[i]) || ((ri == a[i]) && c));
             r[i] = ri;
          }
      return ((int) c);
   }

No correct solution

OTHER TIPS

I guess, you loose most of the time in your conditional expression: most modern CPU hate branches if they can't predict them correctly most of the time. Consequently, the branches introduce by most loops are fine, because they are only mispredicted once for the entire loop. Branching on a carry condition, however, will likely result in 50% of the branches being mispredicted, and each misprediction is worth 10 to 20 cycles. Even worse, the && and || operators are sequence points, which are a hindrance to the optimizer.

So, I would try to eliminate these conditionals:

int function(uint32_t *r, const uint32_t *a, const uint32_t *b, int n) {
    int i;
    uint64_t ri, c=0;
    for (i = 0; i < n; i ++) {
        ri = (uint64_t)a[i] + (uint64_t)b[i] + c;
        c = ri >> 32;
        r[i] = (uint32_t)ri;
    }
    return ((int) c);
}

Here, I have used 64-bit arithmetic, since modern CPUs do 64-bit arithmetic just as fast as 32-bit arithmetic. However, if 64-bit arithmetic is slow on your hardware, you can fall back to 32-bit arithmetic:

int function(uint32_t *r, const uint32_t *a, const uint32_t *b, int n) {
    int i;
    uint32_t ri, c=0;
    for (i = 0; i < n; i ++) {
        uint32_t curA = a[i], curB = b[i];
        uint32_t lowA = curA & 0xffffu, highA = curA >> 16;
        uint32_t lowB = curB & 0xffffu, highB = curB >> 16;
        uint32_t lowR = lowA + lowB + c;
        uint32_t highR = highA + highB + (lowR >> 16);
        c = highR >> 16;
        r[i] = (highR << 16) + lowR;
    }
    return ((int) c);
}

Even though this looks like a monster, it's only 12 simple operations which should execute with a latency of one cycle on all hardware, i. e. the calculation of the entire loop body should take less than 12 cycles, consequently, the bottleneck should be the memory bus (and you can't avoid that).

you can get rid of the subscript notation and use pointer arithmetic instead which is said to be faster , however i don't know how much CPU time would that actually save .

 int function(uint32_t *r, const uint32_t *a, const uint32_t *b, int n)
   {
      int i;
      uint32_t ri, c=0;
      for (i = 0; i < n; i ++)
          {
             ri = *(a + i) + *(b + i) + c;
             c = ((ri < *(a + i)) || ((ri == *(a +i)) && c));
             *(r + i) = ri;
          }
      return ((int) c);
   }

for reasons see: Accessing array values via pointer arithmetic vs. subscripting in C

c = (ri < a[i]) + ((ri-a[i])*c) might be faster than your code which also test if c==0

Licensed under: CC-BY-SA with attribution
Not affiliated with StackOverflow
scroll top