Question

I have a CUDA-related question for you :). Since I am relatively new to CUDA, I would like to know whether this performance is OK or not.

I am using C# with Cudafy.Net!

I have a grayscale image (represented as a float[]) that I calculated from a screenshot (image size: 1920x1018 pixels).

Now I run a Sobel filter on the GPU (through Cudafy.Net), which looks like this:

    [Cudafy]
    public static void PenaltyKernel(GThread thread, Single[] data, Single[] res, Int32 width, Int32 height)
    {
        Single[] shared_data = thread.AllocateShared<Single>("shared_data", BLOCK_WIDTH * BLOCK_WIDTH);
        // Map from threadIdx/blockIdx to pixel position (including the filter halo)
        int x = thread.threadIdx.x - FILTER_WIDTH + thread.blockIdx.x * TILE_WIDTH;
        int y = thread.threadIdx.y - FILTER_WIDTH + thread.blockIdx.y * TILE_WIDTH;
        // Clamp to the image borders so halo threads of edge blocks
        // do not read outside the data array
        int xIn = x < 0 ? 0 : (x >= width ? width - 1 : x);
        int yIn = y < 0 ? 0 : (y >= height ? height - 1 : y);
        shared_data[thread.threadIdx.x + thread.threadIdx.y * BLOCK_WIDTH] = data[xIn + yIn * width];
        thread.SyncThreads();

        if (thread.threadIdx.x >= FILTER_WIDTH && thread.threadIdx.x < (BLOCK_WIDTH - FILTER_WIDTH) &&
            thread.threadIdx.y >= FILTER_WIDTH && thread.threadIdx.y < (BLOCK_WIDTH - FILTER_WIDTH))
        {
            // Horizontal filtering (detects horizontal edges)
            Single diffHorizontal = 0;
            int idx = GetIndex(thread.threadIdx.x - 1, thread.threadIdx.y - 1, BLOCK_WIDTH);
            diffHorizontal -= shared_data[idx];
            idx++;
            diffHorizontal -= 2 * shared_data[idx];
            idx++;
            diffHorizontal -= shared_data[idx];
            idx += 2*BLOCK_WIDTH;
            diffHorizontal += shared_data[idx];
            idx++;
            diffHorizontal += 2 * shared_data[idx];
            idx++;
            diffHorizontal += shared_data[idx];

            // Vertical filtering (detects vertical edges)
            Single diffVertical = 0;
            idx = GetIndex(thread.threadIdx.x - 1, thread.threadIdx.y - 1, BLOCK_WIDTH);
            diffVertical -= shared_data[idx];
            idx += BLOCK_WIDTH;
            diffVertical -= 2 * shared_data[idx];
            idx += BLOCK_WIDTH;
            diffVertical -= shared_data[idx];
            idx = GetIndex(thread.threadIdx.x + 1, thread.threadIdx.y - 1, BLOCK_WIDTH);
            diffVertical += shared_data[idx];
            idx += BLOCK_WIDTH;
            diffVertical += 2 * shared_data[idx];
            idx += BLOCK_WIDTH;
            diffVertical += shared_data[idx];

            // Combine both gradients into an edge magnitude, clamped at 1.0
            Single diff = GMath.Min(1.0f, GMath.Sqrt(diffHorizontal * diffHorizontal + diffVertical * diffVertical));

            // Set the value (skip threads that fall past the image border;
            // this happens in the last row of blocks, since 1018 is not a
            // multiple of TILE_WIDTH)
            if (x < width && y < height)
                res[x + y * width] = diff;
        }
    }

Constant values set before runtime:

    TILE_WIDTH = 16;
    FILTER_WIDTH = 1;
    BLOCK_WIDTH = TILE_WIDTH + 2 * FILTER_WIDTH; // = 18
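For context, the host-side launch for these constants might look roughly like this (a sketch only; the variable names, and the assumption that the grid is rounded up to cover the image in tiles, are mine, not from the original post):

    // Hypothetical host-side launch (assumed names, standard Cudafy.Net API).
    // Each block produces a TILE_WIDTH x TILE_WIDTH tile of output, so the
    // grid must cover the image in tiles, rounding up for the partial row.
    int gridX = (width  + TILE_WIDTH - 1) / TILE_WIDTH;  // 1920 / 16 = 120
    int gridY = (height + TILE_WIDTH - 1) / TILE_WIDTH;  // ceil(1018 / 16) = 64

    float[] devData = gpu.Allocate<float>(data);
    float[] devRes  = gpu.Allocate<float>(res);
    gpu.CopyToDevice(data, devData);

    // BLOCK_WIDTH x BLOCK_WIDTH threads per block: TILE_WIDTH output threads
    // plus a FILTER_WIDTH halo on each side for the shared-memory load.
    gpu.Launch(new dim3(gridX, gridY), new dim3(BLOCK_WIDTH, BLOCK_WIDTH))
       .PenaltyKernel(devData, devRes, width, height);

    gpu.CopyFromDevice(devRes, res);
    gpu.Free(devData);
    gpu.Free(devRes);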

When I run this "PenaltyKernel" function, including the memory allocation for the arrays and the copying of data to and from the device, I measure an average of about 6.2 ms (using a GTX 680 GT!).
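To see where those 6.2 ms actually go, one approach (a sketch, assuming the usual Cudafy.Net host API and the host-side names `gpu`, `devData`, `devRes`) is to time each stage separately, synchronizing before each reading since kernel launches and copies can return before the device is done:

    // Sketch: split the total time into copy-in, kernel, and copy-out.
    var sw = System.Diagnostics.Stopwatch.StartNew();

    gpu.CopyToDevice(data, devData);
    gpu.Synchronize();                      // make sure the copy has finished
    double copyIn = sw.Elapsed.TotalMilliseconds;

    gpu.Launch(gridDim, blockDim).PenaltyKernel(devData, devRes, width, height);
    gpu.Synchronize();                      // kernel launches are asynchronous
    double kernel = sw.Elapsed.TotalMilliseconds - copyIn;

    gpu.CopyFromDevice(devRes, res);
    gpu.Synchronize();
    double copyOut = sw.Elapsed.TotalMilliseconds - copyIn - kernel;

    Console.WriteLine("in: {0:F2} ms  kernel: {1:F2} ms  out: {2:F2} ms",
                      copyIn, kernel, copyOut);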

So my question is whether this speed is OK (it would work out to about 161 frames per second), or whether I am missing something. And is my Sobel filter itself OK (I mean, the result looks good :) )?

Any help is appreciated!


Solution

I think that is an OK speed. A great deal of the time is spent copying data between host and device (the transfer from GPU back to CPU is especially slow).

A note on speed: in general, image processing on a GPU can be slower than on a CPU when the image is small (I haven't tested your code, so I don't know whether this is true in your case), because the fixed cost of the transfers dominates. The larger the image, the more the device outpaces the host.
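A rough back-of-the-envelope check of the transfer cost (the ~6 GB/s effective PCIe bandwidth figure is an assumption, not a measurement; the real number depends on the system and on whether pinned memory is used):

    // Estimate the time for one host<->device copy of the image.
    long bytes = 1920L * 1018L * sizeof(float);   // ~7.8 MB per direction
    double seconds = bytes / 6e9;                 // assumed ~6 GB/s effective
    Console.WriteLine("{0:F2} ms per direction", seconds * 1000.0);

At that rate each direction would take on the order of 1.3 ms, so the two copies alone could plausibly account for a sizeable share of the measured 6.2 ms.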

License: CC-BY-SA with attribution
Not affiliated with Stack Overflow