Using local workers in OpenCL for large matrix computation

https://stackoverflow.com/questions/10378068

04-06-2021
|

سؤال

I'm a newbie to using OpenCL (with the OpenCL.NET library) with Visual Studio C#, and am currently working on an application that computes a large 3D matrix. At each pixel in the matrix, 192 unique values are computed and then summed to yield the final value for that pixel. So, functionally, it is like a 4-D matrix, (161 x 161 x 161) x 192.

Right now I'm calling the kernel from my host code like this:

//C# host code
...
float[] BigMatrix = new float[161*161*161]; //1-D result array
CLCalc.Program.Variable dev_BigMatrix = new CLCalc.Program.Variable(BigMatrix);
CLCalc.Program.Variable dev_OtherArray = new CLCalc.Program.Variable(otherArray);
//...load some other variables here too.
CLCalc.Program.Variable[] args = new CLCalc.Program.Variable[7] {//stuff...}

//Here, I execute the kernel, with a 2-dimensional worker pool:
BigMatrixCalc.Execute(args, new int[2]{N*N*N,192});
dev_BigMatrix.ReadFromDeviceTo(BigMatrix);

Sample kernel code is posted below.

__kernel void MyKernel(
__global float * BigMatrix
__global float * otherArray
//various other variables...
)
{
    int N = 161; //Size of matrix edges
    int pixel_id = get_global_id(0); //The location of the pixel in the 1D array
    int array_id = get_global_id(1); //The location within the otherArray


    //Finding the x,y,z values of the pixel_id.
    float3 p;
    p.x = pixel_id % N;    
    p.y = ((pixel_id % (N*N))-p.x)/N;
    p.z = (pixel_id - p.x - p.y*N)/(N*N);

    float result;

    //...
    //Some long calculation for 'result' involving otherArray and p...
    //...

    BigMatrix[pixel_id] += result;
}

My code currently works, however I'm looking for speed for this application, and I'm not sure if my worker/group setup is the best approach (i.e. 161*161*161 and 192 for dimensions of the worker pool).

I've seen other examples of organizing the global worker pool into local worker groups to increase efficiency, but I'm not quite sure how to implement that in OpenCL.NET. I'm also not sure how this is different than just creating another dimension in the worker pool.

So, my question is: Can I use local groups here, and if so how would I organize them? In general, how is using local groups different than just calling an n-dimensional worker pool? (i.e. calling Execute(args, new int[]{(N*N*N),192}), versus having a local workgroup size of 192?)

Thanks for all the help!

المحلول

I think a lot of performance is lost waiting on memory access. I have answered a similar SO question. I hope my post helps you out. Please ask any questions you have.

Optimizations:

The big boost in my version of your kernel comes from reading otherArray into local memory.
each work item computes 4 values in BigMatrix. This means they can be written at the same time, on the same cacheline. There is minimal loss of parallelism because there are still > 1M work items to execute.

...

#define N 161
#define Nsqr N*N
#define Ncub N*N*N
#define otherSize 192

__kernel void MyKernel(__global float * BigMatrix, __global float * otherArray)
{
    //using 1 quarter of the total size of the matrix
    //this work item will be responsible for computing 4 consecutive values in BigMatrix
    //also reduces global size to (N^3)/4  ~= 1043000 for N=161

    int global_id = get_global_id(0) * 4; //The location of the first pixel in the 1D array
    int pixel_id;
    //array_id won't be used anymore. work items will process BigMatrix[pixel_id] entirely

    int local_id = get_local_id(0); //work item id within the group
    int local_size = get_local_size(0); //size of group


    float result[4]; //result cached for 4 global values
    int i, j;
    float3 p;

    //cache the values in otherArray to local memory
    //now each work item in the group will be able to read the values efficently
    //each element in otherArray will be read a total of N^3 times, so this is important
    //opencl specifies at least 16kb of local memory, so up to 4k floats will work fine
    __local float otherValues[otherSize];
    for(i=local_id; i<otherSize; i+= local_size){
        otherValues[i] = otherArray[i];
    }
    mem_fence(CLK_LOCAL_MEM_FENCE);

    //now this work item can compute the complete result for pixel_id 
    for(j=0;j<4;j++){
        result[j] = 0;
        pixel_id = global_id + j;

        //Finding the x,y,z values of the pixel_id.
        //TODO: optimize the calculation of p.y and p.z
        //they will be the same most of the time for a given work item
        p.x = pixel_id % N;    
        p.y = ((pixel_id % Nsqr)-p.x)/N;
        p.z = (pixel_id - p.x - p.y*N)/Nsqr;

        for(i=0;i<otherSize;i++){
            //...
            //Some long calculation for 'result' involving otherValues[i] and p...
            //...
            //result[j] += ...
        }
    }
    //4 consecutive writes to BigMatrix will fall in the same cacheline (faster)
    BigMatrix[global_id] += result[0];
    BigMatrix[global_id + 1] += result[1];
    BigMatrix[global_id + 2] += result[2];
    BigMatrix[global_id + 3] += result[3];
}

Notes:

Global work size needs to be a multiple of four. Ideally, a multiple of 4*workgroupsize. This is because there is no error checking to see if each pixel_id falls within the range: 0..N^3-1. Unprocessed elements can be crunched by the cpu while you wait for the kernel to execute.
The work group size should be fairly large. This will force the cached values to be used more heavily and the benefit of caching the data in LDS will grow.

There is a further optimization to be done with the calculation of p.x/y/z in order to avoid too many costly division and modulo operations. see code below.

__kernel void MyKernel(__global float * BigMatrix, __global float * otherArray)   {
int global_id = get_global_id(0) * 4; //The location of the first pixel in the 1D array
int pixel_id = global_id;

int local_id = get_local_id(0); //work item id within the group
int local_size = get_local_size(0); //size of group


float result[4]; //result cached for 4 global values
int i, j;
float3 p;
//Finding the initial x,y,z values of the pixel_id.
p.x = pixel_id % N;    
p.y = ((pixel_id % Nsqr)-p.x)/N;
p.z = (pixel_id - p.x - p.y*N)/Nsqr;

//cache the values here. same as above...

//now this work item can compute the complete result for pixel_id 
for(j=0;j<4;j++){
    result[j] = 0;
//increment the x,y,and z values instead of computing them all from scratch
    p.x += 1;
    if(p.x >= N){
        p.x = 0;
        p.y += 1;
        if(p.y >= N){
            p.y = 0;
            p.z += 1;
        }
    }

    for(i=0;i<otherSize;i++){
        //same i loop as above...
    }
}

نصائح أخرى

I have a few suggestions for you:

I think your code has a race condition. Your last line of code has the same element of BigMatrix being modified by multiple different work items.
If your matrix is truly 161x161x161, there is plenty of work items here to use those dimensions as your only dimensions. You already have > 4 million work items, which should be plenty of parallelism for your machine. You don't need 192 times that. Plus, if you don't split the computation of an individual pixel into multiple work items, you won't need to synchronize the final add.
If your global work size is not a nice multiple of a big power of 2, you might try to pad it out so that it becomes one. Even if you pass NULL as your local work size, some OpenCL implementations choose inefficient local sizes for global sizes that don't divide well.
If you don't need local memory or barriers for your algorithm, you can pretty much skip local workgroups.

Hope this helps!

مرخصة بموجب: CC-BY-SA مع الإسناد

لا تنتمي إلى StackOverflow