Question

I have the following code in the innermost loop of my program:

struct V {
  float val [200]; // 0 <= val[i] <= 1
};

V a[600];
V b[250];
V c[250];
V d[350];
V e[350];

// ... init values in a,b,c,d,e ...

int findmax(int ai, int bi, int ci, int di, int ei) {
  float best_val = 0.0;
  int best_ii = -1;

  for (int ii = 0; ii < 200; ii++) {
    float act_val =
      a[ai].val[ii] +
      b[bi].val[ii] +
      c[ci].val[ii] +
      d[ci].val[ii] +
      e[ci].val[ii];

    if (act_val > best_val) {
      best_val = act_val;
      best_ii = ii;
    }
  }

  return best_ii;
}

I don't mind whether the solution is a clever algorithm (that would be the most interesting), C++ tricks, intrinsics, or assembler. I just need to make the findmax function more efficient.

Big thanks in advance.

Edit: It seems that the branch is the slowest operation (misprediction?).


Solution

Well, I see no obvious room for algorithmic optimization. Theoretically one could stop summing the five vectors early once it is clear that the current maximum cannot be reached, but that would add far too much overhead for a sum of only five numbers. You could also try using multiple threads and assigning each thread a range, but you have to weigh the thread-creation overhead against only 200 very short work items.

So I tend to say that assembler with MMX or SSE instructions on x86, or a (machine-specific) C++ library providing access to those instructions, is your best bet.
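To make the SSE suggestion concrete, here is a minimal sketch assuming x86 with SSE2 (baseline on x86-64). The helper names `sum_rows` and `findmax_sse` are illustrative, not from the original post; the idea is to compute the five-way element-wise sum four lanes at a time, then do a scalar max scan:

```cpp
#include <immintrin.h>
#include <cstddef>

// Sum five length-200 float rows into `out`, four floats per iteration.
// 200 is divisible by 4, so no remainder loop is needed.
void sum_rows(const float* a, const float* b, const float* c,
              const float* d, const float* e, float* out) {
    for (std::size_t i = 0; i < 200; i += 4) {
        __m128 s = _mm_loadu_ps(a + i);
        s = _mm_add_ps(s, _mm_loadu_ps(b + i));
        s = _mm_add_ps(s, _mm_loadu_ps(c + i));
        s = _mm_add_ps(s, _mm_loadu_ps(d + i));
        s = _mm_add_ps(s, _mm_loadu_ps(e + i));
        _mm_storeu_ps(out + i, s);
    }
}

int findmax_sse(const float* a, const float* b, const float* c,
                const float* d, const float* e) {
    float sums[200];
    sum_rows(a, b, c, d, e, sums);
    int best_ii = -1;
    float best_val = 0.0f;  // values are in [0, 1], so sums are non-negative
    for (int ii = 0; ii < 200; ++ii) {
        if (sums[ii] > best_val) { best_val = sums[ii]; best_ii = ii; }
    }
    return best_ii;
}
```

The max scan could also be vectorized with `_mm_max_ps` and a parallel index vector, but the summation is where most of the memory traffic is, so this already captures most of the benefit.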

OTHER TIPS

This might help a bit if the compiler is having difficulty optimizing away the jumps:

int findmax(int ai, int bi, int ci, int di, int ei) {
  float best_val = 0.0f;
  int best_ii = -1;

  const float* a_it = &a[ai].val[0];
  const float* b_it = &b[bi].val[0];
  const float* c_it = &c[ci].val[0];
  const float* d_it = &d[di].val[0]; // assume typo ci->di in the question
  const float* e_it = &e[ei].val[0]; // assume typo ci->ei in the question

  for (int ii = 0; ii < 200; ii++) {
    float act_val = *(a_it++) + *(b_it++) + *(c_it++) + *(d_it++) + *(e_it++);
    // update best_ii first, while best_val still holds the old maximum
    best_ii  = (act_val > best_val) ? ii : best_ii;       // branchless select
    best_val = (act_val > best_val) ? act_val : best_val; // becomes _fsel on PowerPC
  }

  return best_ii;
}

Generating a sum table first might also be faster in terms of cache misses:

int findmax(int ai, int bi, int ci, int di, int ei) {
  float best_val = 0.0f;
  int best_ii = -1;

  const float* its[] = { &a[ai].val[0], &b[bi].val[0], &c[ci].val[0],
                         &d[di].val[0], &e[ei].val[0] };

  V sums;
  for (int ii = 0; ii < 200; ii++) {
    sums.val[ii] = its[0][ii];
  }

  for (int iter = 1; iter < 5; ++iter) {
    for (int ii = 0; ii < 200; ii++) {
      sums.val[ii] += its[iter][ii];
    }
  }

  for (int ii = 0; ii < 200; ii++) {
    best_ii  = (sums.val[ii] > best_val) ? ii : best_ii;            // branchless select
    best_val = (sums.val[ii] > best_val) ? sums.val[ii] : best_val; // becomes _fsel
  }
  return best_ii;
}

I don't see any way to do this without examining each sum, making this an O(n) problem. But since your data are laid out linearly, the Intel/AMD MMX or SSE instructions might help. See this link for Microsoft's implementation of intrinsics:

http://msdn.microsoft.com/en-us/library/y0dh78ez(VS.71).aspx

Unless the compiler optimizes them out for you, computing a[ai], etc., inside the loop will cost you some time (however slight), given that they are fixed for the duration of findmax. With that in mind you might try something like:

int findmax(int ai, int bi, int ci, int di, int ei) {
    float    best_val = 0.0f;  // sums are non-negative, so 0 is a safe floor
    int      best_ii = -1;
    const V& ra = a[ai];       // hoist the indexing out of the loop
    const V& rb = b[bi];
    const V& rc = c[ci];
    const V& rd = d[di];
    const V& re = e[ei];

    for (int ii = 0; ii < 200; ++ii) {
        float act_val = ra.val[ii] + rb.val[ii] + rc.val[ii] +
                        rd.val[ii] + re.val[ii];

        if (act_val > best_val) {
            best_val = act_val;
            best_ii = ii;
        }
    }

    return best_ii;
}

Other means of improving the code might be to alter the way the data is represented, leading to a different (but much faster) findmax algorithm.

Try iterating over all vectors at once. Here's an example with two vectors:

const float* ap = a[ai].val;
const float* bp = b[bi].val;
for (int ii = 0; ii < 200; ++ii, ++ap, ++bp) {
    float act_val = *ap + *bp;
    if (act_val > best_val) { best_val = act_val; best_ii = ii; } // as in the original loop
}

Take a look at loop unrolling (also called loop unwinding), and at Duff's device for a specific, but far more complicated, example. Those are the only real micro-optimizations I can come up with:

Loop_unwinding

Duff's_device
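The unrolling idea can be sketched as follows; the 4x factor and the function name are illustrative (200 divides evenly by 4, so no remainder loop is needed), and a modern compiler may well do this on its own at -O2/-O3:

```cpp
// Manual 4x unroll of the sum-and-compare loop: one loop-condition
// check per four elements instead of per element.
int findmax_unrolled(const float* a, const float* b, const float* c,
                     const float* d, const float* e) {
    float best_val = 0.0f;
    int best_ii = -1;
    for (int ii = 0; ii < 200; ii += 4) {
        float s0 = a[ii]   + b[ii]   + c[ii]   + d[ii]   + e[ii];
        float s1 = a[ii+1] + b[ii+1] + c[ii+1] + d[ii+1] + e[ii+1];
        float s2 = a[ii+2] + b[ii+2] + c[ii+2] + d[ii+2] + e[ii+2];
        float s3 = a[ii+3] + b[ii+3] + c[ii+3] + d[ii+3] + e[ii+3];
        if (s0 > best_val) { best_val = s0; best_ii = ii;     }
        if (s1 > best_val) { best_val = s1; best_ii = ii + 1; }
        if (s2 > best_val) { best_val = s2; best_ii = ii + 2; }
        if (s3 > best_val) { best_val = s3; best_ii = ii + 3; }
    }
    return best_ii;
}
```

The four independent sums give the CPU more instruction-level parallelism to work with, at the cost of larger code.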

You can't really get much faster than that without additional information about the data (values) stored in a, b, c, d, and e. You have to inspect every sum to determine which one is the greatest.

It gets a little worse for Nth-element queries, but fortunately, you didn't ask that one.
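For the Nth-element case mentioned above, a standard-library sketch (function name hypothetical) could partially sort the 200 sums with std::nth_element instead of scanning for a single maximum:

```cpp
#include <algorithm>
#include <cstddef>
#include <vector>

// Return the index of the Nth-largest sum (n = 0 means the maximum).
// Assumes `sums` already holds the element-wise sums.
int find_nth_largest(const std::vector<float>& sums, std::size_t n) {
    std::vector<int> idx(sums.size());
    for (std::size_t i = 0; i < idx.size(); ++i) idx[i] = static_cast<int>(i);
    // Partition the index array so that idx[n] is the nth index in
    // descending order of sum; average O(n) rather than a full sort.
    std::nth_element(idx.begin(), idx.begin() + n, idx.end(),
                     [&](int l, int r) { return sums[l] > sums[r]; });
    return idx[n];
}
```

This runs in expected linear time, so even the "worse" query stays cheap at this problem size.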

Licensed under: CC-BY-SA with attribution
Not affiliated with StackOverflow