Cuda - OpenCL CPU 4x faster than OpenCL or CUDA GPU version

https://stackoverflow.com/questions/16430038

14-04-2022
|

سؤال

A wave simulator I've been working on with C# + Cudafy (C# -> CUDA or OpenCL translator) works great, except for the fact that running the OpenCL CPU version (Intel driver, 15" MacBook Pro Retina i7 2.7GHz, GeForce 650M (Kepler, 384 cores)) is roughly four times as fast as the GPU version.

(This happens whether I use the CL or CUDA GPU backend. The OpenCL GPU and CUDA versions perform nearly identically.)

To clarify, for a sample problem:

OpenCL CPU 1200 Hz
OpenCL GPU 320 Hz
CUDA GPU -~330 Hz

I'm at a loss to explain why the CPU version would be faster than the GPU. In this case, the kernel code that's executing (in the CL case) on the CPU and GPU is identical. I select either the CPU or GPU device during initialization, but beyond that, everything is identical.

Edit

Here's the C# code that launches one of the kernels. (The others are very similar.)

    public override void UpdateEz(Source source, float Time, float ca, float cb)
    {
        var blockSize = new dim3(1);
        var gridSize = new dim3(_gpuEz.Field.GetLength(0),_gpuEz.Field.GetLength(1));

        Gpu.Launch(gridSize, blockSize)
            .CudaUpdateEz(
                Time
                , ca
                , cb
                , source.Position.X
                , source.Position.Y
                , source.Value
                , _gpuHx.Field
                , _gpuHy.Field
                , _gpuEz.Field
            );

    }

And, here's the relevant CUDA kernel function generated by Cudafy:

extern "C" __global__ void CudaUpdateEz(float time, float ca, float cb, int sourceX, int sourceY, float sourceValue,  float* hx, int hxLen0, int hxLen1,  float* hy, int hyLen0, int hyLen1,  float* ez, int ezLen0, int ezLen1)
{
    int x = blockIdx.x;
    int y = blockIdx.y;
    if (x > 0 && x < ezLen0 - 1 && y > 0 && y < ezLen1 - 1)
    {
        ez[(x) * ezLen1 + ( y)] = ca * ez[(x) * ezLen1 + ( y)] + cb * (hy[(x) * hyLen1 + ( y)] - hy[(x - 1) * hyLen1 + ( y)]) - cb * (hx[(x) * hxLen1 + ( y)] - hx[(x) * hxLen1 + ( y - 1)]);
    }
    if (x == sourceX && y == sourceY)
    {
        ez[(x) * ezLen1 + ( y)] += sourceValue;
    }
}

Just for completeness, here's the C# that is used to generate the CUDA:

    [Cudafy]
    public static void CudaUpdateEz(
        GThread thread
        , float time
        , float ca
        , float cb
        , int sourceX
        , int sourceY
        , float sourceValue
        , float[,] hx
        , float[,] hy
        , float[,] ez
        )
    {
        var i = thread.blockIdx.x;
        var j = thread.blockIdx.y;

        if (i > 0 && i < ez.GetLength(0) - 1 && j > 0 && j < ez.GetLength(1) - 1)
            ez[i, j] =
                ca * ez[i, j]
                +
                cb * (hy[i, j] - hy[i - 1, j])
                -
                cb * (hx[i, j] - hx[i, j - 1])
                ;

        if (i == sourceX && j == sourceY)
            ez[i, j] += sourceValue;
    }

Obviously, the if in this kernel is bad, but even the resulting pipeline stall shouldn't cause such an extreme performance delta.

The only other thing that jumps out at me is that I'm using a lame grid/block allocation scheme - ie, the grid is the size of the array to be updated, and each block is one thread. I'm sure this has some impact on performance, but I can't see it causing it to be 1/4th of the speed of the CL code running on the CPU. ARGH!

المحلول

Answering this to get it off the unanswered list.

The code posted indicates that the kernel launch is specifying a threadblock of 1 (active) thread. This is not the way to write fast GPU code, as it will leave most of the GPU capability idle.

Typical threadblock sizes should be at least 128 threads per block, and higher is often better, in multiples of 32, up to the limit of 512 or 1024 per block, depending on GPU.

The GPU "likes" to hide latency by having a lot of parallel work "available". Specifying more threads per block assists with this goal. (Having a reasonably large number of threadblocks in the grid may also help.)

Furthermore the GPU executes threads in groups of 32. Specifying only 1 thread per block or a non-multiple of 32 will leave some idle execution slots, in every threadblock that gets executed. 1 thread per block is particularly bad.

مرخصة بموجب: CC-BY-SA مع الإسناد

لا تنتمي إلى StackOverflow