After some more digging it turned out that the code causes some LDS bank conflicts. The reason is that for AMD there are 32 banks with 4 bytes length, but the float4
covers 16 bytes and therefore the half-wavefront accesses different addresses in the same banks. The solution was to make __local float*
for x and y coordinates separately and read them also separately with the proper shift of array index as (id + i) % s_wg
. Nevertheless, the overall gain in performance is small, most likely due to the overall latencies described in the link provided by @CaptainObvious (well then one has to increase the global work size to hide them).
Performance issues with LDS memory in OpenCL
-
30-09-2022 - |
Question
I have a performance problem when using LDS memory with AMD Radeon HD 6850.
I have two kernels as parts of a N-particle simulation. Each work unit has to calculate force which acts on a corresponding particle based on relative position to other particles. The problematic kernel is:
#define UNROLL_FACTOR 8
//Vernet velocity part kernel
__kernel void kernel_velocity(const float deltaTime,
__global const float4 *pos,
__global float4 *vel,
__global float4 *accel,
__local float4 *pblock,
const float bound)
{
const int gid = get_global_id(0); //global id of work item
const int id = get_local_id(0); //local id of work item within work group
const int s_wg = get_local_size(0); //work group size
const int n_wg = get_num_groups(0); //number of work groups
const float4 myPos = pos[gid];
const float4 myVel = vel[gid];
const float4 dt = (float4)(deltaTime, deltaTime, 0.0f, 0.0f);
float4 acc = (float4)0.0f;
for (int jw = 0; jw < n_wg; ++jw)
{
pblock[id] = pos[jw * s_wg + id]; //cache a particle position; position in array: workgroup no. * size of workgroup + local id
barrier (CLK_LOCAL_MEM_FENCE); //wait for others in the work group
for (int i = 0; i < s_wg; )
{
#pragma unroll UNROLL_FACTOR
for (int j = 0; j < UNROLL_FACTOR; ++j, ++i)
{
float4 r = myPos - pblock[i];
float rSizeSquareInv = native_recip (r.x*r.x + r.y*r.y + 0.0001f);
float rSizeSquareInvDouble = rSizeSquareInv * rSizeSquareInv;
float rSizeSquareInvQuadr = rSizeSquareInvDouble * rSizeSquareInvDouble;
float rSizeSquareInvHept = rSizeSquareInvQuadr * rSizeSquareInvDouble * rSizeSquareInv;
acc += r * (2.0f * rSizeSquareInvHept - rSizeSquareInvQuadr);
}
}
barrier(CLK_LOCAL_MEM_FENCE);
}
acc *= 24.0f / myPos.w;
//update velocity only
float4 newVel = myVel + 0.5f * dt * (accel[gid] + acc);
//write to global memory
vel[gid] = newVel;
accel[gid] = acc;
}
The simulation runs fine in terms of results, but the problem is in the performance when using the local memory for caching the particle positions to relieve the big amount of reading from the global memory. Actually if the line
float4 r = myPos - pblock[i];
is replaced by
float4 r = myPos - pos[jw * s_wg + i];
the kernel runs faster. I don't really get that since reading from global should be much slower than reading from local.
Moreover, when the line
float4 r = myPos - pblock[i];
is removed completely and all following occurences of r
are replaced by myPos - pblock[i]
, the speed is the same as before as if the line was not there at all. This I don't get even more as accessing private memory in r
should be the fastest but the compiler somehow "optimizes" this line out.
Global work size is 4608, local worksize is 192. It is run with AMD APP SDK v2.9 and Catalyst drivers 13.12 in Ubuntu 12.04.
Can anyone please help me with this? Is that my fault or is that a problem of the GPU / drivers / ... ? Or is it a feature? :-)
Solution 2
OTHER TIPS
I'm gonna make a wild guess:
When using float4 r = myPos - pos[jw * s_wg + i];
the compiler is smart enough to notice that the barrier put after the initialization of pblock[id]
is not necessary anymore and remove it. Very likely all these barriers (in the for loop) impact your performances, so removing them is very noticeable.
Yeah but global access cost a lot too...So I'm guessing that behind the scene cache memories are well utilized. There is also the fact that you use vectors and as a matter of fact the architecture of the AMD Radeon HD 6850 uses VLIW processors...maybe it helps also to make a better use of the cache memories...maybe.
EDIT: I've just found out a article benchmarking GPU/APU Cache and Memory Latencies. Your GPU is in the list. You might get some more answers (sorry didn't really read it - too tired).