Question

I apologize ahead of time for the length of this question...It's a little involved. I am writing a 'weighted sum' operation that's pretty simple; take n images, multiply each image by a specific multiplier and sum them into an output image (by iterating through each pixel). When the number of images is constant, I can hardcode the logic into one iteration, however, I wish to make the method flexible enough to handle a variable number of images. I cannot come up with an equally 'performant' way to accomplish this, e.g. without an additional inner loop when the number of inputs is unknown. Here's my situation:

var rnd = new Random();

//Pixels in input and output images
const int x = 1000000;

//An output composite image
var pixelmap = new int[x];

var tStart = DateTime.Now;

//Known number of inputs
int knownNumberOfInputs = 3;

//Weights to apply to each pixel of the input images
//multipliers[0] applies to all pixels of inputA,
//multipliers[1] applies to all pixels of inputB etc.
var multipliers = new byte[3];
rnd.NextBytes(multipliers);

/* situation 1 
* - I know how many input images
* - Arrays are independent */

//3 (knownNumberOfInputs) input images (we'll use random numbers for filler)
var inputA = new byte[x];
rnd.NextBytes(inputA);
var inputB = new byte[x];
rnd.NextBytes(inputB);
var inputC = new byte[x];
rnd.NextBytes(inputC);

//I can iterate through each pixel of each input image, multiply and sum for pixelmap value.
//Without a nested loop
for (var i = 0; i < x; i++)
{
    pixelmap[i] = (
        (inputA[i]*multipliers[0]) +
        (inputB[i]*multipliers[1]) +
        (inputC[i]*multipliers[2])
                        );
}

Console.WriteLine("Operation took " + DateTime.Now.Subtract(tStart).TotalMilliseconds + " ms");
// Operation took 39 ms

tStart = DateTime.Now;

/* situation 2
* unknown number of inputs
* inputs are contained within jagged array */

/* Caveat - multipliers.Length == inputs.Length */

//var unknownNumberOfInputs = rnd.Next(1, 10);
var unknownNumberOfInputs = 3; //Just happens to be the same number (for performance comparisons)

multipliers = new byte[unknownNumberOfInputs];
rnd.NextBytes(multipliers);

//Jagged array to contain input images
var inputs = new byte[unknownNumberOfInputs][];

//Load unknownNumberOfInputs of input images into jagged array
for (var i = 0; i < unknownNumberOfInputs; i++)
{
    inputs[i] = new byte[x];
    rnd.NextBytes(inputs[i]);
}

// I cannot iterate through each pixel of each input image
// Inner nested loop
for (var i = 0; i < x; i++)
{
    for (var t=0;t<multipliers.Length;t++)
    {
        pixelmap[i] += (inputs[t][i] * multipliers[t]);
    }
}

Console.WriteLine("Operation took " + DateTime.Now.Subtract(tStart).TotalMilliseconds + " ms");
//Operation took 54 ms

//How can I get rid of the inner nested loop and gain the performance of LoopA?
//Or is that the cost of not knowing?

Big ups

A little more info

  • The pixelmap is going into a WriteableBitmap in Silverlight - which, once constructed, takes a 1D array as a pixel source (since height/width is passed into the constructor)
  • Each input image has a specific multiplier, e.g. multiply all the pixels of input 1 by 2, all the pixels of input 2 by 3 etc.
  • There will never be more than 20 inputs.
Was it helpful?

Solution

There are three ways you can improve the performance of this code:

  1. Switch the t and i loops. That way, you're working with the same 2 large arrays, and you can apply item 2:
  2. Use temporary variables to eliminate array accesses in the inner loop.
  3. Use a technique called "loop unrolling": do the calculation in groups of 3 until you have less than 3 inputs left. Then do another loop.

Here's how all this would look:

int t = 0;
for (; t < multipliers.Length - 2; t += 3) {
    var input1 = inputs[t];
    var input2 = inputs[t+1];
    var input3 = inputs[t+2];
    var multiplier1 = multipliers[t];
    var multiplier2 = multipliers[t+1];
    var multiplier3 = multipliers[t+2];
    if (t == 0) {
        for (var i = 0; i < x; i++)
            pixelmap[i] = input1[i] * multiplier1
                + input2[i] * multiplier2
                + input3[i] * multiplier3;
    } else {
        for (var i = 0; i < x; i++)
            pixelmap[i] += input1[i] * multiplier1
                + input2[i] * multiplier2
                + input3[i] * multiplier3;
    }
}
if (multipliers.Length < 3)
    Array.Clear(pixelmap, 0, pixelmap.Length);
for (; t < multipliers.Length; t++) {
    var input = inputs[t];
    var multiplier = multipliers[t];
    for (var i = 0; i < x; i++)
        pixelmap[i] += input[i] * multiplier;
}

There are also some changes I'd make to the way you time the results:

  1. It looks like you're running your benchmarks from within Visual Studio, possibly in debug mode. Make sure you run the benchmarks outside of Visual Studio in release mode.
  2. Only benchmark the code you're interested in. Leave the code that creates the test data out of it.
  3. The Stopwatch class is ideal for timing purposes, especially for very short time spans.

OTHER TIPS

I recommend that you create an Expression that represents your computation. Then you compile that expression.

Your expression would be a lambda. Example for three inputs:

void (byte[] inputA, byte[] inputB, byte[] inputC) {
for (var i = 0; i < x; i++)
{
    pixelmap[i] = (
        (inputA[i]*multipliers0) +
        (inputB[i]*multipliers1) +
        (inputC[i]*multipliers1)
                        );
}
}

Using .NET 4 you have the for loop available as an expression (not in .NET 2).

It sounds hard but it is actually pretty easy to do this.

Just to clarify: You would compile at runtime a function that is specialized to a constant number of inputs.

You can even play tricks like unrolling the loop 2 or 4 times. You can also inline the multipliers as constants like it did in the example. This will be much faster than nested loops.

Notice that the loop is inside the expression tree, not around it. This means that you only have the overhead of a single delegate call (and of the reusable compilation result).

Here is some sample code to get you started:

int inputCount = ...;
var paramExpressions = GenerateArray(inputCount, i => Expression.Parameter(typeof(byte[]), "input" + i);

var summands = GenerateArray(inputCount, i => Expression.Mul(/* inputA[i] HERE */, Expression.Constant(multipliers[i]));

var sum = summands.Aggregate((a,b) => Expression.Add(a,b));

var assignment = /* assign sum to pixelmap[i] here */;

var loop = /* build a loop. ask a new question to find out how to do this, or use google */

var lambda = Expression.Lambda(paramExpressions, loop);

var delegate = lambda.Compile();

//you are done compiling. now invoke:

delegate.DynamicInvoke(arrayOfInputs); //send an object of type byte[][] into the lambda

That's it. You need to fill in the gaps.

You should try swapping the inner and outer loop.

Your pixelmap will probably fit in the Cpu cache, and then it need not hurt as much to write to it multiple times.

Then, you can unroll the inner loop that iterates over the pixels for max performance. Be sure to test release builds, outside of the debugger to get correct timings.

If youre still not satisfied, you can compute one image scanline at a time.

Here is another answer: Use T4 templates to generate all possible functions for 1 to 20 inputs as compile time. This is not as cool as my previous answer but also works just fine.

Licensed under: CC-BY-SA with attribution
Not affiliated with StackOverflow
scroll top