Nested for loop with += assignment
Question
I have the following code in Java:
float in[][] = new float[2][2];
float B[] = new float[2];
float A[] = new float[2];
// res has one entry per column of in; Java already zero-initializes new arrays
float res[] = new float[in[0].length];
for (int i = 0; i < in[0].length; i++) {
    for (int j = 0; j < B.length; j++) {
        res[i] += A[j] * in[j][i];
    }
}
I simplified it as much as possible, so don't look for any real logic in it :).
I have been struggling for hours to convert this to CUDA because of the += statement in the loop.
I started with something like this:
extern "C"
__global__ void filter(float* in, float* A, float* B, float* res, int in_width){
    unsigned int x = blockIdx.x*blockDim.x + threadIdx.x;
    unsigned int y = blockIdx.y*blockDim.y + threadIdx.y;
    res[x] = A[y] * in[x + y * in_width];
}
but quickly realized it can't work, because every thread with the same x tries to write to the same res[x].
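For illustration, here is a minimal sketch of one way to keep that two-dimensional thread layout without losing updates: each thread adds its product into res[x] with atomicAdd. This is not the approach taken further down, only a possible direction. The in_height parameter, the bounds check and the host helper run_filter_atomic are assumptions added for the example, and in is assumed to be stored row-major.

```cuda
#include <cuda_runtime.h>
#include <vector>

// One thread per (x, y) pair. All threads sharing the same x add into res[x];
// atomicAdd makes each read-modify-write indivisible so no update is lost.
extern "C"
__global__ void filter_atomic(const float* in, const float* A, float* res,
                              int in_width, int in_height) {
    int x = blockIdx.x * blockDim.x + threadIdx.x;  // output index (i in the Java code)
    int y = blockIdx.y * blockDim.y + threadIdx.y;  // summed index (j in the Java code)
    if (x < in_width && y < in_height) {
        atomicAdd(&res[x], A[y] * in[x + y * in_width]);
    }
}

// Hypothetical host helper (error checking omitted for brevity).
std::vector<float> run_filter_atomic(const std::vector<float>& in,
                                     const std::vector<float>& A, int in_width) {
    int in_height = static_cast<int>(A.size());
    float *d_in = nullptr, *d_A = nullptr, *d_res = nullptr;
    cudaMalloc(&d_in, in.size() * sizeof(float));
    cudaMalloc(&d_A, A.size() * sizeof(float));
    cudaMalloc(&d_res, in_width * sizeof(float));
    cudaMemcpy(d_in, in.data(), in.size() * sizeof(float), cudaMemcpyHostToDevice);
    cudaMemcpy(d_A, A.data(), A.size() * sizeof(float), cudaMemcpyHostToDevice);
    cudaMemset(d_res, 0, in_width * sizeof(float));  // accumulation needs a zeroed start

    dim3 block(16, 16);
    dim3 grid((in_width + block.x - 1) / block.x,
              (in_height + block.y - 1) / block.y);
    filter_atomic<<<grid, block>>>(d_in, d_A, d_res, in_width, in_height);

    std::vector<float> res(in_width);
    cudaMemcpy(res.data(), d_res, in_width * sizeof(float), cudaMemcpyDeviceToHost);
    cudaFree(d_in);
    cudaFree(d_A);
    cudaFree(d_res);
    return res;
}
```

Atomic additions on the same address are serialized, so this tends to be slow when many threads share one output element, and the order of floating-point additions is not fixed from run to run.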
I read the dot product example in this presentation, but I don't really see how to adapt it to my case, where in has two dimensions.
I am not asking for a complete solution, but any pointers would definitely be appreciated.
Thx,
Solution
Too much CUDA killed my head.
I found a partial solution by running one of the loops inside my kernel, with one thread per output element. Here is what it looks like right now:
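extern "C"
__global__ void filter(float* in, float* A, float* res, const int in_width, const int sizeB){
    // One thread per output element; only this thread touches res[x].
    unsigned int x = blockIdx.x*blockDim.x + threadIdx.x;
    // res must be zeroed before the launch, since the loop accumulates into it.
    for(int i = 0; i < sizeB; i++){
        // With in_width floats per row, this reads element i of row x.
        res[x] += A[i] * in[i + x * in_width];
    }
}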
I am sure I can do better, but I think I'll stick with this for today :)
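As one possible refinement, here is a hedged sketch of the same one-thread-per-output idea with a register accumulator and a bounds check. Note that it indexes in as in[j * in_width + x], following the Java expression in[j][i] under the assumption that in is row-major with in_width columns; the kernel above uses a different index order, so check which layout your data actually has. The names filter_rowsum and run_filter_rowsum are made up for this example.

```cuda
#include <cuda_runtime.h>
#include <vector>

// One thread per output element x: res[x] = sum over j of A[j] * in[j][x].
// Assumption: in is stored row-major with in_width floats per row.
extern "C"
__global__ void filter_rowsum(const float* in, const float* A, float* res,
                              int in_width, int sizeB) {
    int x = blockIdx.x * blockDim.x + threadIdx.x;
    if (x >= in_width) return;               // skip threads past the end of res
    float acc = 0.0f;                        // running sum kept in a register
    for (int j = 0; j < sizeB; ++j) {
        acc += A[j] * in[j * in_width + x];  // adjacent threads read adjacent floats
    }
    res[x] = acc;                            // single global write; res need not be zeroed
}

// Hypothetical host wrapper (error checking omitted for brevity).
std::vector<float> run_filter_rowsum(const std::vector<float>& in,
                                     const std::vector<float>& A, int in_width) {
    int sizeB = static_cast<int>(A.size());
    float *d_in = nullptr, *d_A = nullptr, *d_res = nullptr;
    cudaMalloc(&d_in, in.size() * sizeof(float));
    cudaMalloc(&d_A, A.size() * sizeof(float));
    cudaMalloc(&d_res, in_width * sizeof(float));
    cudaMemcpy(d_in, in.data(), in.size() * sizeof(float), cudaMemcpyHostToDevice);
    cudaMemcpy(d_A, A.data(), A.size() * sizeof(float), cudaMemcpyHostToDevice);

    int threads = 256;
    int blocks = (in_width + threads - 1) / threads;  // round up to cover every x
    filter_rowsum<<<blocks, threads>>>(d_in, d_A, d_res, in_width, sizeB);

    std::vector<float> res(in_width);
    cudaMemcpy(res.data(), d_res, in_width * sizeof(float), cudaMemcpyDeviceToHost);
    cudaFree(d_in);
    cudaFree(d_A);
    cudaFree(d_res);
    return res;
}
```

Accumulating in a local variable avoids one global read and write per iteration, and the bounds check lets the grid be rounded up to a whole number of blocks.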
OTHER TIPS
You can split the multiplications A[j] * in[j][i] across threads, one per (i, j) pair, and then sum the products with a shared-memory reduction, as in the reduction sample in the NVIDIA SDK.
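A minimal sketch of that idea, not the SDK sample itself: one block per output element, each thread accumulating a strided share of the products, followed by a tree reduction in shared memory. The block size kThreads, the row-major layout of in, and the names filter_reduce and run_filter_reduce are assumptions made for the example.

```cuda
#include <cuda_runtime.h>
#include <vector>

constexpr int kThreads = 128;  // assumed block size; must be a power of two

// One block per output element x; the block's threads share the sum over j.
extern "C"
__global__ void filter_reduce(const float* in, const float* A, float* res,
                              int in_width, int sizeB) {
    __shared__ float partial[kThreads];
    int x = blockIdx.x;   // output index handled by this block
    int t = threadIdx.x;

    // Each thread sums the products for j = t, t + blockDim.x, ...
    float acc = 0.0f;
    for (int j = t; j < sizeB; j += blockDim.x) {
        acc += A[j] * in[j * in_width + x];
    }
    partial[t] = acc;
    __syncthreads();

    // Tree reduction: halve the number of active threads at every step.
    for (int stride = blockDim.x / 2; stride > 0; stride /= 2) {
        if (t < stride) partial[t] += partial[t + stride];
        __syncthreads();
    }
    if (t == 0) res[x] = partial[0];  // thread 0 writes the block's total
}

// Hypothetical host wrapper (error checking omitted for brevity).
std::vector<float> run_filter_reduce(const std::vector<float>& in,
                                     const std::vector<float>& A, int in_width) {
    int sizeB = static_cast<int>(A.size());
    float *d_in = nullptr, *d_A = nullptr, *d_res = nullptr;
    cudaMalloc(&d_in, in.size() * sizeof(float));
    cudaMalloc(&d_A, A.size() * sizeof(float));
    cudaMalloc(&d_res, in_width * sizeof(float));
    cudaMemcpy(d_in, in.data(), in.size() * sizeof(float), cudaMemcpyHostToDevice);
    cudaMemcpy(d_A, A.data(), A.size() * sizeof(float), cudaMemcpyHostToDevice);

    filter_reduce<<<in_width, kThreads>>>(d_in, d_A, d_res, in_width, sizeB);

    std::vector<float> res(in_width);
    cudaMemcpy(res.data(), d_res, in_width * sizeof(float), cudaMemcpyDeviceToHost);
    cudaFree(d_in);
    cudaFree(d_A);
    cudaFree(d_res);
    return res;
}
```

This layout is likely to pay off mainly when the summed dimension is long compared with the number of outputs; with a short inner loop, one thread per output element is probably simpler and just as fast.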