Optimizing CUDA with Thrust in a loop

https://stackoverflow.com/questions/2405214

18-09-2019

Question

Given the following piece of code, which builds a kind of code dictionary with CUDA using Thrust (a C++ template library for CUDA):

thrust::device_vector<float> dCodes(codes->begin(), codes->end());
thrust::device_vector<int> dCounts(counts->begin(), counts->end());
thrust::device_vector<int> newCounts(counts->size());

for (int i = 0; i < dCodes.size(); i++) {
    float code = dCodes[i];
    int count = thrust::count(dCodes.begin(), dCodes.end(), code);

    newCounts[i] = dCounts[i] + count;

    //Had we already a count in one of the last runs?
    if (dCounts[i] > 0) {
        newCounts[i]--;
    }

    //Remove
    thrust::detail::normal_iterator<thrust::device_ptr<float> > newEnd = thrust::remove(dCodes.begin()+i+1, dCodes.end(), code);
    int dist = thrust::distance(dCodes.begin(), newEnd);
    dCodes.resize(dist);
    newCounts.resize(dist);
}

codes->resize(dCodes.size());
counts->resize(newCounts.size());

thrust::copy(dCodes.begin(), dCodes.end(), codes->begin());
thrust::copy(newCounts.begin(), newCounts.end(), counts->begin());

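The problem is that, using the CUDA profiler, I have noticed many copies of 4 bytes each. IMO these are caused by:

The loop counter i
float code, int count and dist
Every access to i and the variables noted above

This seems to slow everything down (sequential copying of 4 bytes is no fun...).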

So, how do I tell Thrust that these variables should be handled on the device? Or are they already?

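Using thrust::device_ptr does not seem sufficient to me, because I am not sure whether the surrounding for loop runs on the host or on the device (which could be another reason for the slowness).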

Solution

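For every iteration of i, the size, index, code, etc. have to be copied between host and device. The way your program is written, there is not much you can do about that. For best results, consider moving the whole i loop onto the device; that way you will not have the host-to-device copies.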

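Thrust is great for some things, but where performance is a concern and the algorithm does not quite fit the functions provided, you may have to rewrite it for best performance without using Thrust algorithms explicitly.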

Licensed under: CC-BY-SA with attribution

Not affiliated with StackOverflow