Performance issues with LDS memory in OpenCL

https://stackoverflow.com/questions/21241107

30-09-2022
|

Question

I have a performance problem when using LDS memory with AMD Radeon HD 6850.

I have two kernels as parts of a N-particle simulation. Each work unit has to calculate force which acts on a corresponding particle based on relative position to other particles. The problematic kernel is:

#define UNROLL_FACTOR 8
//Vernet velocity part kernel
__kernel void kernel_velocity(const float deltaTime,
                            __global const float4 *pos,
                           __global float4 *vel,
                           __global float4 *accel,
                           __local float4 *pblock,
                           const float bound)
{
    const int gid = get_global_id(0); //global id of work item
    const int id = get_local_id(0); //local id of work item within work group

    const int s_wg = get_local_size(0); //work group size
    const int n_wg = get_num_groups(0); //number of work groups

    const float4 myPos = pos[gid];
    const float4 myVel = vel[gid];
    const float4 dt = (float4)(deltaTime, deltaTime, 0.0f, 0.0f);
    float4 acc = (float4)0.0f;

    for (int jw = 0; jw < n_wg; ++jw)
    {
        pblock[id] = pos[jw * s_wg + id]; //cache a particle position; position in array: workgroup no. * size of workgroup + local id
        barrier (CLK_LOCAL_MEM_FENCE); //wait for others in the work group

        for (int i = 0; i < s_wg; )
        {
            #pragma unroll UNROLL_FACTOR
            for (int j = 0; j < UNROLL_FACTOR; ++j, ++i)
            {
                float4 r = myPos - pblock[i];

                float rSizeSquareInv = native_recip (r.x*r.x + r.y*r.y + 0.0001f);
                float rSizeSquareInvDouble = rSizeSquareInv * rSizeSquareInv;
                float rSizeSquareInvQuadr = rSizeSquareInvDouble * rSizeSquareInvDouble;
                float rSizeSquareInvHept = rSizeSquareInvQuadr * rSizeSquareInvDouble * rSizeSquareInv;

                acc += r * (2.0f * rSizeSquareInvHept - rSizeSquareInvQuadr);
            }
        }   
        barrier(CLK_LOCAL_MEM_FENCE);
    }
    acc *= 24.0f / myPos.w;

    //update velocity only
    float4 newVel = myVel + 0.5f * dt * (accel[gid] + acc);

    //write to global memory
    vel[gid] = newVel;
    accel[gid] = acc;
}

The simulation runs fine in terms of results, but the problem is in the performance when using the local memory for caching the particle positions to relieve the big amount of reading from the global memory. Actually if the line

float4 r = myPos - pblock[i];

is replaced by

float4 r = myPos - pos[jw * s_wg + i];

the kernel runs faster. I don't really get that since reading from global should be much slower than reading from local.

Moreover, when the line

float4 r = myPos - pblock[i];

is removed completely and all following occurences of r are replaced by myPos - pblock[i], the speed is the same as before as if the line was not there at all. This I don't get even more as accessing private memory in r should be the fastest but the compiler somehow "optimizes" this line out.

Global work size is 4608, local worksize is 192. It is run with AMD APP SDK v2.9 and Catalyst drivers 13.12 in Ubuntu 12.04.

Can anyone please help me with this? Is that my fault or is that a problem of the GPU / drivers / ... ? Or is it a feature? :-)

Solution 2

After some more digging it turned out that the code causes some LDS bank conflicts. The reason is that for AMD there are 32 banks with 4 bytes length, but the float4 covers 16 bytes and therefore the half-wavefront accesses different addresses in the same banks. The solution was to make __local float* for x and y coordinates separately and read them also separately with the proper shift of array index as (id + i) % s_wg. Nevertheless, the overall gain in performance is small, most likely due to the overall latencies described in the link provided by @CaptainObvious (well then one has to increase the global work size to hide them).

OTHER TIPS

I'm gonna make a wild guess:

When using float4 r = myPos - pos[jw * s_wg + i]; the compiler is smart enough to notice that the barrier put after the initialization of pblock[id] is not necessary anymore and remove it. Very likely all these barriers (in the for loop) impact your performances, so removing them is very noticeable.

Yeah but global access cost a lot too...So I'm guessing that behind the scene cache memories are well utilized. There is also the fact that you use vectors and as a matter of fact the architecture of the AMD Radeon HD 6850 uses VLIW processors...maybe it helps also to make a better use of the cache memories...maybe.

EDIT: I've just found out a article benchmarking GPU/APU Cache and Memory Latencies. Your GPU is in the list. You might get some more answers (sorry didn't really read it - too tired).

Licensed under: CC-BY-SA with attribution

Not affiliated with StackOverflow