Question

I am writing a cuda program for matching each input image of resolution ~180X180, with about 10,000 template images of resolution ~128*128. The goal is to achieve realtime performance i.e. Template matching of 25~30 input images(each with all 10,000 templates) in 1 second.

currently I am using following approach

  1. Preloaded all templates on GPU global memory to save runtime I/O operations.
  2. Created a single kernel to match one source Image with all the template images, and return an array for all positive matches.
  3. Doing all operations in time-domain (not using FFT). reason being, I tried Radix-4 fft implemenation but it requires a lot of intermediate Global reads and writes ending up taking more time.

so far for 1 input Image to 10,000 templates,it is taking around 2 seconds.

My questions are:

  1. Is there is way to determine if this task is achieveable in realtime or not? I mean with the help of maxmimum FLOPS and I/O bandwidth limitations e.t.c.
  2. How to compute if the GPU is being fully utilitzed at its maximum?
  3. Possible ways to improve the performance?

Machine specs: [i7-4770, 8GB, GTX-680]

Explaination of current kernel code:

  1. all the template images [size is about 128X128 in RGB] are per-loaded on GPU memory. Idea is to save I/O during runtime operation.
  2. Every input image is loaded on Texture memory, reason being Texture are good option for 2D addressing.
  3. Every "Block" has 1024 threads.
  4. Each thread computes the value for each output pixel,size of output is [31X31 = 961 pixels].
  5. Number of Blocks launched are equal to number of template images being matched.

Kernel Code:

__global__ void cudaMatchTemplate(TemplateArray *templates, uchar *Match)
{
    int global = blockIdx.x*blockDim.x + threadIdx.x;

    __shared__ int idx[TEMPLATE_MATCH_DIM];
    __shared__ float out_shared[TEMPLATE_MATCH_DIM];

    //halving the template size....
    int rows = (templates[blockIdx.x].nHeight)/2;
    int cols = (templates[blockIdx.x].nWidth)/2;

    int fullCol = templates[blockIdx.x].nWidth;

    int x = templates[blockIdx.x].nMatchLeft;
    int y = templates[blockIdx.x].nMatchTop;

    int offset_y =  (threadIdx.x/TEMPLATE_MATCH_SIZE);
    int offset_x =  (threadIdx.x - offset_y*TEMPLATE_MATCH_SIZE);

    // *************** Performing match in time domain *****************************//
    int sum = 0;
    float temp;
    int idxXFactor = 3*(2*(offset_x) + x);
    int idxYFactor = 2*(offset_y) + y ;

    for (int i = 0; i < rows; i++)
    {
        int I=3*i*fullCol;
        int sourceIdxY = idxYFactor + 2*i;
        for (int j = 0; j < cols; j++)
        {
            int J=3*j;
            int sourceIdxX = idxXFactor + 2*J;          
            int templateIdx = 2*I+2*J;
            //**** R *****//
            temp = float(tex2D(SourceImgColorTex,sourceIdxX,sourceIdxY)) - float(templates[blockIdx.x].pRAWPixels_gpu[templateIdx]);
            sum = sum + temp*temp;
            //**** G *****//
            temp = float(tex2D(SourceImgColorTex,sourceIdxX+1,sourceIdxY)) - float(templates[blockIdx.x].pRAWPixels_gpu[templateIdx +1]);
            sum = sum + temp*temp;
            //**** B *****//
            temp = float(tex2D(SourceImgColorTex,sourceIdxX+2,sourceIdxY)) - float(templates[blockIdx.x].pRAWPixels_gpu[templateIdx +2]);
            sum = sum + temp*temp;
        }
    }

    __syncthreads();

//placing all values in shared memory for comparison.
    if(threadIdx.x < TEMPLATE_MATCH_DIM)
    {
        idx[threadIdx.x] = threadIdx.x;
        out_shared[threadIdx.x] = sum;
    }
    __syncthreads();


// //computing the Min location.....//

#pragma unroll
    for(int s=512; s>0; s>>=1) 
    {
        if ((threadIdx.x < s) &&((threadIdx.x + s)<TEMPLATE_MATCH_DIM))
        {
            idx[threadIdx.x] = out_shared[threadIdx.x] < out_shared[threadIdx.x + s] ? idx[threadIdx.x] : idx[threadIdx.x + s];
            out_shared[threadIdx.x]  = out_shared[threadIdx.x] < out_shared[threadIdx.x + s] ? out_shared[threadIdx.x] : out_shared[threadIdx.x + s];           
        }

    }

    __syncthreads();

    if(threadIdx.x <1)
    {
        int half_Margin = MARGIN_FOR_TEMPLATE_MATCH/2;
        int matchY = idx[0]/TEMPLATE_MATCH_SIZE ;
        int matchX = idx[0] - matchY * TEMPLATE_MATCH_SIZE;

        int diff = absolute(half_Margin - matchX) + absolute(half_Margin - matchY);
        if(diff < THRESHOLD)
        {
            Match[blockIdx.x] = 1;
        }
        else
            Match[blockIdx.x] = 0;

    }
}
Was it helpful?

Solution

I'll try to answer most of your questions

Is there is way to determine if this task is achieveable in realtime or not? I mean with the help of maxmimum FLOPS and I/O bandwidth limitations e.t.c.

I have no idea how to determine whether or not the kernel is real time achievable , you can maximize your CUDA kernel using CUDA Occupancy Calculator, You may consider using texture , surface memory, constant memory, pinned host memory and more, Those are up to your algorithm implementation.

How to compute if the GPU is being fully utilitzed at its maximum?

You can use CUDA Occupancy Calculator and CUDA visual profiler. I'd highly recommend using visual profiler it would guide you through CUDA understanding.

Possible ways to improve the performance?

There are several interesting method doing so, 1st you can maximize your kernel call using the above method, If that's not enough, try implement pipeline using stream objects in order copy the data and computation jobs at the same time.

if that's not going to work out, try working with latency, operate multiple threads accessing the GPU at the same time, since CC 3.5 CUDA launched HyperQ, this may help you complete several calls in parallel.

If that's not going to work, consider using multiple GPU devices.

Please keep us posted

Licensed under: CC-BY-SA with attribution
Not affiliated with StackOverflow
scroll top