Question

I have a large array of unscaled floats - array length is 40,000,000. In order to scale this array, I thought it would be more performant to use Parallel.For(). Here's the sequential version of the for loop that scales the data:

for (i = 0; i < rawData.Length; i++)
{
    scaledData[i] = rawData[i] * scale + offset;
}

Here is that converted to use Parallel.For() e.g.:

Parallel.For(0, rawData.Length, i => {
    scaledData[i] = rawData[i] * scale + offset;
});

But the performance is worse! My guess, based on observing index/thread combinations, is that Parallel.For() is accessing memory in such a way as to cause excess paging. To test this theory, I tried using Parallel.Invoke() like so:

Parallel.Invoke(
    () => { for (int i =        0; i < 10000000; i++) { dst[i] = src[i] * scale + offset; } },
    () => { for (int i = 10000000; i < 20000000; i++) { dst[i] = src[i] * scale + offset; } },
    () => { for (int i = 20000000; i < 30000000; i++) { dst[i] = src[i] * scale + offset; } },
    () => { for (int i = 30000000; i < 40000000; i++) { dst[i] = src[i] * scale + offset; } }
);

And that performs noticeably better, but I hate the hard-coded nature of this code. I have 4 processors, which is why there are 4 actions passed to Invoke().

Is there a way to cajole Parallel.For() into distributing indices to threads in a fashion that doesn't thrash memory?

Solution

You can use a custom partitioner to get the behavior you want without having to resort to using Parallel.Invoke. A range partitioner, created with Partitioner.Create, is what you want to start with.

    var rangePartitioner = Partitioner.Create(0, rawData.Length);

    Parallel.ForEach(rangePartitioner, (range, loopState) =>
    {
        for (int i = range.Item1; i < range.Item2; i++)
        {
            scaledData[i] = rawData[i] * scale + offset;
        }
    });

You can also create a custom partitioner and override the GetPartitions() method to adjust the block size to something that suits your needs.
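Before writing a full custom partitioner, note that the three-argument overload of Partitioner.Create already lets you set the range size explicitly. As a rough sketch (reusing the rawData, scaledData, scale, and offset names from the question), you could hand each processor one contiguous chunk, which mirrors the hand-rolled Parallel.Invoke version without the hard-coded bounds:

```csharp
using System;
using System.Collections.Concurrent;
using System.Threading.Tasks;

// Split the index space into roughly one contiguous chunk per core.
int chunkSize = Math.Max(1, rawData.Length / Environment.ProcessorCount);
var partitioner = Partitioner.Create(0, rawData.Length, chunkSize);

Parallel.ForEach(partitioner, range =>
{
    // Each task walks a contiguous slice, preserving sequential access.
    for (int i = range.Item1; i < range.Item2; i++)
    {
        scaledData[i] = rawData[i] * scale + offset;
    }
});
```

Because the length may not divide evenly, Partitioner.Create produces a final shorter range for the remainder, so no manual bounds bookkeeping is needed.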

See Custom Partitioners for PLINQ and TPL for an in depth discussion.

Will this improve data locality? Yes, provided your array contains value types, which are laid out as a contiguous block of memory. For reference types the array holds only the references; the objects themselves may be scattered across the heap. FWIW, I've tried to improve memory locality like this, with OK but not astounding improvements. I came to the conclusion that there is potentially a lot of other memory access from the CLR, and that this can make it hard to understand the resulting memory access patterns.

Licensed under: CC-BY-SA with attribution
Not affiliated with StackOverflow