How do I use local memory in OpenCL?

https://stackoverflow.com/questions/2541929

opencl

23-09-2019
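Question

I've been playing with OpenCL recently, and I'm able to write simple kernels that use only global memory. Now I'd like to start using local memory, but I can't seem to figure out how to use get_local_size() and get_local_id() together to compute one "chunk" of output at a time.

For example, let's say I wanted to convert Apple's OpenCL Hello World example kernel into something that uses local memory. How would you do it? Here's the original kernel source: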

__kernel void square(
    __global float *input,
    __global float *output,
    const unsigned int count)
{
    int i = get_global_id(0);
    if (i < count)
        output[i] = input[i] * input[i];
}

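If this example can't easily be converted into something that shows how to make use of local memory, any other simple example will do.

Solution

Check out the samples in the NVIDIA or AMD SDKs; they should point you in the right direction. Matrix transpose, for example, uses local memory.

Using your squaring kernel, you could stage the data in an intermediate buffer. Remember to pass in the additional parameter.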

__kernel void square(
    __global float *input,
    __global float *output,
    __local float *temp,
    const unsigned int count)
{
    int gtid = get_global_id(0);
    int ltid = get_local_id(0);
    if (gtid < count)
    {
        temp[ltid] = input[gtid];
        // if the threads were reading data from other threads, then we would
        // want a barrier here to ensure the write completes before the read
        output[gtid] =  temp[ltid] * temp[ltid];
    }
}
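
On the host side, the extra `temp` argument is not backed by a buffer object: `clSetKernelArg` is called with the number of bytes to reserve and a NULL pointer. The following is a minimal sketch in plain C of the two sizes the host would typically compute. The helper names `local_buffer_bytes` and `round_up_global_size` are made up for illustration, and the rounding assumes OpenCL 1.x rules, where the global size must be a multiple of the work-group size (which is also why the kernel keeps its `gtid < count` guard).

```c
#include <stddef.h>

/* Bytes to reserve for the kernel's `__local float *temp` argument:
 * one float per work-item in the work-group. (Hypothetical helper.) */
size_t local_buffer_bytes(size_t local_size)
{
    return local_size * sizeof(float);
}

/* Smallest multiple of local_size that is >= count.
 * Returns 0 if local_size is 0. (Hypothetical helper.) */
size_t round_up_global_size(size_t count, size_t local_size)
{
    if (local_size == 0)
        return 0;
    return ((count + local_size - 1) / local_size) * local_size;
}

/*
 * Assumed host-side usage (needs the OpenCL headers; `kernel`, `queue`,
 * `input_buf`, `output_buf`, `count` and `local_size` are assumed to
 * exist already and are not defined in this sketch):
 *
 *   cl_uint n = (cl_uint)count;
 *   size_t global = round_up_global_size(count, local_size);
 *   clSetKernelArg(kernel, 0, sizeof(cl_mem), &input_buf);
 *   clSetKernelArg(kernel, 1, sizeof(cl_mem), &output_buf);
 *   clSetKernelArg(kernel, 2, local_buffer_bytes(local_size), NULL);
 *   clSetKernelArg(kernel, 3, sizeof(cl_uint), &n);
 *   clEnqueueNDRangeKernel(queue, kernel, 1, NULL, &global, &local_size,
 *                          0, NULL, NULL);
 */
```

For example, with 1000 elements and a work-group size of 64, the global size would be rounded up to 1024 and the local buffer would take 64 * sizeof(float) bytes.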

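Other tips

There is another way to do this if the size of the local memory is constant. Instead of using a pointer in the kernel's parameter list, the local buffer can be declared inside the kernel simply by declaring it __local: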

__local float localBuffer[1024];

This reduces the host code, since fewer clSetKernelArg calls are needed.

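Local memory in OpenCL is meant for sharing data across all work-items in a work-group. It usually requires a barrier call before the local memory data can be used (for example, when one work-item wants to read local memory data written by another work-item). Barriers are costly in hardware. Keep in mind that local memory should be used for data that is read or written repeatedly. Bank conflicts should be avoided as much as possible.

If you are not careful with local memory, you may sometimes end up with worse performance than with global memory.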
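
To make the barrier requirement concrete, here is a rough CPU emulation in plain C (not OpenCL) of a work-group in which every work-item reads a slot written by its neighbour. The function name `emulate_neighbor_sum` and the `MAX_LOCAL_SIZE` cap are assumptions for this sketch. The two inner loops stand in for the code before and after `barrier(CLK_LOCAL_MEM_FENCE)` in a real kernel, `tile` plays the role of `__local` memory, and `base + lid` mirrors how get_global_id(0) relates to get_local_id(0) and get_local_size(0) when there is no global offset.

```c
#include <stddef.h>

#define MAX_LOCAL_SIZE 256  /* assumed cap on work-group size for this sketch */

/*
 * Emulates one kernel launch: each work-item stores its element in the
 * shared tile, then adds its right-hand neighbour's element to its own.
 * The neighbour lookup is clamped to the work-group, because local
 * memory is not shared between work-groups.
 * Returns 0 on success, -1 if local_size is out of range.
 */
int emulate_neighbor_sum(const float *input, float *output,
                         size_t count, size_t local_size)
{
    float tile[MAX_LOCAL_SIZE];  /* stands in for __local memory */

    if (local_size == 0 || local_size > MAX_LOCAL_SIZE)
        return -1;

    size_t num_groups = (count + local_size - 1) / local_size;

    for (size_t group = 0; group < num_groups; ++group) {
        size_t base = group * local_size;  /* global id of local id 0 */

        /* Phase 1: every work-item writes its own slot, padding with 0
         * past the end of the input. */
        for (size_t lid = 0; lid < local_size; ++lid) {
            size_t gid = base + lid;
            tile[lid] = (gid < count) ? input[gid] : 0.0f;
        }

        /* A real kernel needs barrier(CLK_LOCAL_MEM_FENCE) here, reached
         * by all work-items, so phase 1 finishes before phase 2 reads. */

        /* Phase 2: every work-item reads a slot written by another one. */
        for (size_t lid = 0; lid < local_size; ++lid) {
            size_t gid = base + lid;
            if (gid < count) {
                size_t right = (lid + 1 < local_size) ? lid + 1 : lid;
                output[gid] = tile[lid] + tile[right];
            }
        }
    }
    return 0;
}
```

In an actual kernel the barrier would sit outside the `gid < count` check, since every work-item in the group has to reach it.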

Licensed under: CC-BY-SA with attribution

Not affiliated with StackOverflow