Question

Is there an initial performance hit when using local memory? I converted an existing kernel that uses global memory to use local memory, and after the conversion succeeded, performance degraded. You may think I have not used local memory correctly; I might even agree, and I could probably find further optimizations. But that is not the question here.

As a side experiment, I took the same global-memory kernel, unchanged and with no accesses to local memory, and only added a kernel parameter in local memory of some 1024 integers. With that change alone, kernel execution took almost twice as long. So does the allocation of local memory itself cause an initial performance hit? Has anybody seen this, and is there an explanation?

[UPDATE] Thank you all for your comments and answers. I wrote a separate test kernel to see whether this behavior was repeatable. It wasn't. I found a post, "Is private memory slower than local memory?", which mentions that excessive use of private memory may spill over into global memory and so slow down kernel execution. This may be specific to NVIDIA cards; I wonder what happens on AMD cards. Could it be that allocating local memory caused private memory to spill in order to make room for it? I am now looking at my implementation from that angle, unless any of you suggest otherwise. Are you aware of any documentation or book that mentions this?

Thanks again.

No correct solution

OTHER TIPS

A performance hit may come from using a work-group of non-optimal size, or from synchronizing work-items within a work-group.
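
As a rough illustration of the work-group size point, here is a minimal host-side sketch in C. The helper names `round_up_global` and `is_multiple_of` are hypothetical, not part of OpenCL. It assumes an OpenCL 1.x style launch, where the global size has to be a multiple of the work-group size, and it assumes you have already obtained the kernel's preferred multiple, for example from clGetKernelWorkGroupInfo with CL_KERNEL_PREFERRED_WORK_GROUP_SIZE_MULTIPLE.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical helper: round the global work size up to the next
 * multiple of the work-group (local) size. Work-items past the real
 * problem size must be masked out inside the kernel. */
static size_t round_up_global(size_t global, size_t local) {
    assert(local > 0);
    return ((global + local - 1) / local) * local;
}

/* Hypothetical helper: returns 1 if the chosen work-group size is a
 * multiple of the preferred multiple reported for the kernel, else 0.
 * The preferred value is an input here; on a real device it would be
 * queried from the OpenCL runtime. */
static int is_multiple_of(size_t local, size_t preferred) {
    assert(preferred > 0);
    return local % preferred == 0;
}
```

For example, `round_up_global(1000, 64)` gives 1024, and a work-group size of 48 fails the check against a preferred multiple of 32. Whether a given size is actually faster still has to be measured on the target device.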

Accessing local memory does not by itself introduce a performance hit: its speed is of the same order as that of private memory, since both are located on chip.

Also, check that your data fits into local memory, which is usually small.
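
The size check can be sketched as plain arithmetic in C. The function names below are hypothetical, and the 32768-byte limit used in the usage note is an assumed value for illustration only; on a real device the limit would come from clGetDeviceInfo with CL_DEVICE_LOCAL_MEM_SIZE. The last helper rests on a further assumption that is not confirmed anywhere in this thread: on many GPUs local memory is shared by all work-groups resident on a compute unit, so a larger allocation per group may lower how many groups run concurrently. That is one possible reason an unused local buffer could still cost time.

```c
#include <assert.h>
#include <stddef.h>

/* Bytes of local memory one work-group requests for `count` elements. */
static size_t local_bytes(size_t count, size_t elem_size) {
    return count * elem_size;
}

/* Returns 1 if the request fits within the device's local memory
 * limit (passed in by the caller), else 0. */
static int fits_in_local(size_t requested, size_t device_local_bytes) {
    return requested <= device_local_bytes;
}

/* Upper bound on work-groups resident on one compute unit at once,
 * counting local memory only. Registers and other hardware limits
 * can lower this further; this is an estimate, not a device query. */
static size_t max_resident_groups(size_t requested, size_t device_local_bytes) {
    assert(requested > 0);
    return device_local_bytes / requested;
}
```

With the 1024 integers from the question and 4-byte ints, the request is 4096 bytes. Against the assumed 32768-byte limit it fits, and at most 8 such work-groups could share a compute unit.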

Licensed under: CC-BY-SA with attribution