Question

I have tried to implement the Haar wavelet transform in CUDA for a 1D array.

ALGORITHM

I have 8 indices in the input array

With the condition if(x_index>=o_width/2 || y_index>=o_height/2) I will have 4 threads, which should map to input indices 0, 2, 4 and 6, and I plan to handle two input indices with each one of them.

I calculate the average. E.g., if my thread id is 0, then avg is (input[0]+input[1])/2, and at the same time I get the difference, which would be input[0]-avg, and so on for the rest of the threads.

Now the important thing is the placement of the output. I created a separate thread id for the output, as using indices 0, 2, 4, 6 was creating difficulties with placing the output at the correct index.

My averages should be placed in the first 4 indices of the output, i.e. 0, 1, 2, 3, so o_thread_id should be 0, 1, 2, 3. Similarly, to place the differences at 4, 5, 6, 7, I have incremented 0, 1, 2, 3 by 4, as shown in the code.

PROBLEM

My output comes out as all zeros! No matter what I change, I still get that.

CODE

__global__ void cal_haar(int input[],float output [],int i_widthstep,int o_widthstep,int o_width,int o_height)
{

    int x_index=blockIdx.x*blockDim.x+threadIdx.x;
    int y_index=blockIdx.y*blockDim.y+threadIdx.y;

    if(x_index>=o_width/2 || y_index>=o_height/2) return;

    int i_thread_id=y_index*i_widthstep+(2*x_index);
    int o_thread_id=y_index*o_widthstep+x_index;

    float avg=(input[i_thread_id]+input[i_thread_id+1])/2;
    float diff=input[i_thread_id]-avg;
    output[o_thread_id]=avg;
    output[o_thread_id+4]=diff;

}

void haar(int input[],float output [],int i_widthstep,int o_widthstep,int o_width,int o_height)
{

    int * d_input;
    float * d_output;

    cudaMalloc(&d_input,i_widthstep*o_height);
    cudaMalloc(&d_output,o_widthstep*o_height);

    cudaMemcpy(d_input,input,i_widthstep*o_height,cudaMemcpyHostToDevice);

    dim3 blocksize(16,16);
    dim3 gridsize;
    gridsize.x=(o_width+blocksize.x-1)/blocksize.x;
    gridsize.y=(o_height+blocksize.y-1)/blocksize.y;

    cal_haar<<<gridsize,blocksize>>>(d_input,d_output,i_widthstep,o_widthstep,o_width,o_height);


    cudaMemcpy(output,d_output,o_widthstep*o_height,cudaMemcpyDeviceToHost);

    cudaFree(d_input);
    cudaFree(d_output);

}

The following is my main function:

void main()
{
    int in_arr[8]={1,2,3,4,5,6,7,8};
    float out_arr[8];
    int i_widthstep=8*sizeof(int);
    int o_widthstep=8*sizeof(float);
    haar(in_arr,out_arr,i_widthstep,o_widthstep,8,1);

    for(int c=0;c<=7;c++)
    {cout<<out_arr[c]<<endl;}
    cvWaitKey();

}

Can you tell me where I am going wrong that it gives me zeros as output? Thank you.

Solution

The problem with your code is the following condition:

if(x_index>=o_width/2 || y_index>=o_height/2) return;

Given o_height = 1, we have o_height/2 = 0 (o_height is an int, so this is integer division, which rounds down), so every thread returns early and none of them writes to the output. To achieve what you want, you can either do floating-point arithmetic here or use (o_height+1)/2 and (o_width+1)/2, which divides with rounding up. The guard then becomes ( x_index >= (8+1)/2 /*= 4*/ || y_index >= (1+1)/2 /*= 1*/ ).
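To make the effect of the guard concrete, here is a minimal host-side sketch in plain C++ (no GPU needed) that counts how many threads of a launched grid get past the early return. The helper name count_active_threads and its parameters are hypothetical and not part of the posted code; it only replays the guard's integer arithmetic.

```cpp
#include <cassert>

// Hypothetical helper (not from the original post): counts how many threads of a
// launched_x by launched_y grid get past the early-return guard of cal_haar.
int count_active_threads(int launched_x, int launched_y,
                         int o_width, int o_height, bool round_up)
{
    assert(o_width > 0 && o_height > 0);

    // The posted guard uses o_width/2 and o_height/2 (integer division, rounds down);
    // the suggested guard uses (n+1)/2, which rounds up.
    const int x_limit = round_up ? (o_width + 1) / 2 : o_width / 2;
    const int y_limit = round_up ? (o_height + 1) / 2 : o_height / 2;

    int active = 0;
    for (int y = 0; y < launched_y; ++y) {
        for (int x = 0; x < launched_x; ++x) {
            // Same condition as in the kernel: a thread returns if either test is true.
            if (!(x >= x_limit || y >= y_limit)) {
                ++active;
            }
        }
    }
    return active;
}
```

For the 16x16 block from the question with o_width = 8 and o_height = 1, calling count_active_threads(16, 16, 8, 1, false) gives 0 active threads, while passing true for round_up gives the 4 threads the question expects.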

Besides, there is a problem with addressing when you have more than one thread in the Y dimension: the i_thread_id and o_thread_id calculations would then be wrong (the _widthstep values are sizes in bytes, yet you use them as array indices).
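As a rough illustration of the addressing point, the following CPU reference in plain C++ converts the byte pitch to an element pitch before indexing. The function name haar_rows_reference and its parameters are hypothetical, and it assumes a tightly packed output of width floats per row. It also divides by 2.0f: in the posted kernel both operands of /2 are int, so the average would be truncated there, which is worth checking once the guard is fixed. The offset width/2 stands in for the hard-coded 4.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical CPU reference (names are illustrative, not from the original post).
// Computes one Haar level per row: averages in the first half of each output row,
// differences in the second half.
std::vector<float> haar_rows_reference(const std::vector<int>& input,
                                       std::size_t width, std::size_t height,
                                       std::size_t i_widthstep_bytes)
{
    assert(width % 2 == 0);

    // Convert the row pitch from bytes to elements before using it as an index.
    const std::size_t i_pitch = i_widthstep_bytes / sizeof(int);
    assert(i_pitch >= width && input.size() >= i_pitch * height);

    const std::size_t half = width / 2;
    std::vector<float> output(width * height, 0.0f);

    for (std::size_t y = 0; y < height; ++y) {
        for (std::size_t x = 0; x < half; ++x) {
            const int a = input[y * i_pitch + 2 * x];
            const int b = input[y * i_pitch + 2 * x + 1];
            const float avg = (a + b) / 2.0f;  // float division, no truncation
            const float diff = a - avg;
            output[y * width + x] = avg;          // averages: first half of the row
            output[y * width + half + x] = diff;  // differences: second half
        }
    }
    return output;
}
```

For the input {1,2,3,4,5,6,7,8} with width 8, height 1 and a pitch of 8*sizeof(int) bytes, this yields the averages 1.5, 3.5, 5.5, 7.5 followed by four differences of -0.5, which can serve as a check for the GPU result.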

Licensed under: CC-BY-SA with attribution