Question

I'm trying to implement a multichannel integral image algorithm, but it's too slow (8 seconds for 200 images of 640x480, on a Core 2 Quad). I expect it to reach 1 second for 200 images.

This is the profiling result (over 200 images, n_bins=8):

[Image: profiling result of the multichannel integral image code]

How can I optimize *ps = *psu + s?



Solution

Start by checking your compiler settings: is the build configured for maximum optimization?

Then, depending on the architecture, computing an integral image has several possible bottlenecks.

  1. The computation itself: some low-cost CPUs can't perform integer math with good performance. No solution here.

  2. The data flow is not optimal. The solution is to arrange an optimal data flow (a small number of sequential read and write streams). For example, you can process two rows simultaneously.

  3. The data dependencies of the algorithm. On a modern CPU this can be the biggest problem. The solution is to change the processing algorithm, for example by calculating odd and even pixels without a mutual dependency (more calculations, less dependency).

  4. The processing can be offloaded to a GPU.

OTHER TIPS

I have trouble believing that profile result. In this code

16      for (int x = 1; x < w + 1; x++, pg++, ps += n_bins, psu += n_bins) {
17          s += *pg;
18          *ps = *psu + s;
19      }

it says the lion's share of time is on line 18, very little on line 17, and next to nothing on line 16. Yet line 16 is doing a comparison, two increments, and three adds on every iteration. Cache misses might explain it, but there's no harm in double-checking, which I do with this technique.

Regardless, the loop could be unrolled, for example:

int x = w;
while (x >= 4) {            /* main loop: four pixels per iteration */

  s += pg[0];
  ps[n_bins*0] = psu[n_bins*0] + s;

  s += pg[1];
  ps[n_bins*1] = psu[n_bins*1] + s;

  s += pg[2];
  ps[n_bins*2] = psu[n_bins*2] + s;

  s += pg[3];
  ps[n_bins*3] = psu[n_bins*3] + s;

  x -= 4;
  pg += 4;
  ps += n_bins*4;
  psu += n_bins*4;
}
for (; --x >= 0;) {         /* remainder: the last 0-3 pixels */
  s += *pg;
  *ps = *psu + s;
  pg++;
  ps += n_bins;
  psu += n_bins;
}

If n_bins happens to be a compile-time constant, that could enable the compiler to optimize the code in the while loop further.

You probably don't compute integral images just for the sake of computing integral images.

I imagine two situations:

1) you use the integral images on every pixel to compute a box filter or similar.

2) you use them at a much smaller number of places.

In case 1), the computation of the integral images will probably not be the bottleneck in your application.

In case 2), you should ask yourself whether it is worth computing the entire integral images at all.

This said, parallelization with four threads is also an option. The easiest approach is to let each thread compute every fourth image.

You can also split every image into four parts, but then you are penalized both by the need to synchronize the threads and by the fact that prefix sums carry a data dependency. (You can split the image into four parts and compute separate integral images, but after this step you will need to add a constant to three of the partial images.)

Licensed under: CC-BY-SA with attribution
Not affiliated with StackOverflow