I have a performance problem with my OpenCL C++ wrapper. I need to transfer data from buffer d to buffer b as fast as possible (using map/unmap to reach the ~6 GB/s DMA speed of PCIe), then copy that data to buffer a at device speed (about 40 GB/s).

*********************************
*       device(discrete gpu)    *
*                               *
*     (enqueueCopyBuffer)       *
*   a <---->b                   *
*********************************
            ^
            |(map/unmap)
            |
***************
*   d------>c *
*  (memcpy)   *
*             *
*    host     *
*             *
***************

I tried many combinations of CL_MEM_ALLOC_HOST_PTR, CL_MEM_COPY_HOST_PTR, ... for a, b and c, but couldn't find an optimal solution.

Some of what I tried:

 d----->c (memcpy, 10 GB/s)
 c----->b (map/unmap, CL_MEM_ALLOC_HOST_PTR, 6 GB/s),  b----->a (enqueueCopyBuffer, ~5 GB/s)
                                              (I think ALLOC_HOST_PTR makes b a host-side buffer)

 d----->c (memcpy, 10 GB/s)
 c----->b (map/unmap, CL_MEM_READ_WRITE, 1.4 GB/s),  b----->a (enqueueCopyBuffer, 40 GB/s)
                     (now b is a device buffer, but map/unmap is slow and buggy)

 d----->a (enqueueWriteBuffer, CL_MEM_READ_WRITE, 1.7 GB/s)
          (I started the new project with this)
          (multithreaded read/write does not go beyond 2 GB/s)

but I need:

 d----->c (memcpy, 10 GB/s)
 c----->b (map/unmap, CL_???_PTR, 6 GB/s),  b----->a (enqueueCopyBuffer, 40 GB/s)

The reason to separate a and b is that kernel execution must use device memory.

The reason to separate d and c is that I'm adding GPU acceleration to an open-source project and I don't want to alter the project's own arrays.

There are at least a dozen a/b/c/d sets that I must use: two for velocities, one for pressure, and so on.

Question: which buffer structure should I use so that no stage of the pipeline drops below 6 GB/s? Should I bundle all the b's together (and likewise all the c's) into one bigger buffer, so a single read/write/map/unmap covers all of them?
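
For reference, here is a minimal sketch of the c -> b -> a path I am trying to build with the OpenCL C++ bindings; the flag for b (CL_MEM_ALLOC_HOST_PTR below) is exactly the part I am unsure about, and the buffer names/sizes are just illustrative:

    // Sketch of one c -> b -> a upload; the flag choice for b is the open question.
    #include <CL/cl.hpp>
    #include <cstring>

    void upload(cl::Context& ctx, cl::CommandQueue& q,
                const float* c, size_t bytes)       // c: host staging array (d was memcpy'd into it)
    {
        cl::Buffer b(ctx, CL_MEM_READ_WRITE | CL_MEM_ALLOC_HOST_PTR, bytes); // staging buffer (flag = my guess)
        cl::Buffer a(ctx, CL_MEM_READ_WRITE, bytes);                         // device buffer used by kernels

        // c -> b : map, memcpy, unmap (hoping for ~6 GB/s DMA)
        void* p = q.enqueueMapBuffer(b, CL_TRUE, CL_MAP_WRITE, 0, bytes);
        std::memcpy(p, c, bytes);
        q.enqueueUnmapMemObject(b, p);

        // b -> a : device-side copy (hoping for ~40 GB/s)
        q.enqueueCopyBuffer(b, a, 0, 0, bytes);
        q.finish();
    }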


Solution

Your fixed bandwidth requirement is very strict.

If a & b are both device-side buffers, my advice is to use the pinned-memory technique with plain CL_MEM_READ_WRITE allocation flags. That said, the best results I've ever achieved were around 5.3 GB/s on PCIe 2.0 x16. Since a single transfer usually takes only microseconds, you can also use non-blocking transfers with event waiting on the host side; such pipelined tasking usually yields good bandwidth.
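
A minimal sketch of that pinned-memory technique, assuming the OpenCL C++ bindings (the staging buffer, names and the blocking map are illustrative, not your exact setup):

    // Pinned-memory upload sketch: a plain CL_MEM_READ_WRITE device buffer plus a
    // pinned (CL_MEM_ALLOC_HOST_PTR) staging buffer that stays mapped.
    #include <CL/cl.hpp>
    #include <cstring>

    void pinned_upload(cl::Context& ctx, cl::CommandQueue& q,
                       const float* host_src, size_t bytes)
    {
        cl::Buffer a(ctx, CL_MEM_READ_WRITE, bytes);                              // device-side destination
        cl::Buffer pinned(ctx, CL_MEM_READ_WRITE | CL_MEM_ALLOC_HOST_PTR, bytes); // pinned staging
        void* pinned_ptr = q.enqueueMapBuffer(pinned, CL_TRUE, CL_MAP_WRITE, 0, bytes);

        std::memcpy(pinned_ptr, host_src, bytes);                                 // host -> pinned memory
        cl::Event done;
        q.enqueueWriteBuffer(a, CL_FALSE, 0, bytes, pinned_ptr, nullptr, &done);  // DMA, non-blocking
        done.wait();                                                              // wait on the event, not the queue
    }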

Generally (for the a, b, c & d buffers), my advice is to use a separate command queue for every type of memory traffic to get the benefit of DMA transfers. Usually three command queues are enough: Host-to-Device, Device-to-Device & Host-to-Host.
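
A sketch of that queue layout with the C++ bindings (the queue roles and the commented ordering are illustrative):

    // One command queue per traffic type; cross-queue ordering is expressed with events.
    #include <CL/cl.hpp>

    void make_queues(cl::Context& ctx, cl::Device& dev)
    {
        cl::CommandQueue q_h2d(ctx, dev);  // Host-to-Device: enqueueWriteBuffer / uploads into b
        cl::CommandQueue q_d2d(ctx, dev);  // Device-to-Device: enqueueCopyBuffer(b, a, ...)
        cl::CommandQueue q_h2h(ctx, dev);  // Host-to-Host: map/unmap of ALLOC_HOST_PTR staging buffers

        // e.g. make the device-side copy wait for the upload:
        //   cl::Event uploaded;
        //   q_h2d.enqueueWriteBuffer(b, CL_FALSE, 0, bytes, pinned_ptr, nullptr, &uploaded);
        //   std::vector<cl::Event> deps = { uploaded };
        //   q_d2d.enqueueCopyBuffer(b, a, 0, 0, bytes, &deps, nullptr);
    }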

GPU memory controllers are designed for high peak throughput at high latency, so massive asynchronous memory transfers are generally faster. Try to avoid any synchronous operations; they stall the whole pipeline.
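
For example, the same write issued in a blocking and a non-blocking way (buffer and pointer names assumed from the sketches above):

    #include <CL/cl.hpp>

    void write_async(cl::CommandQueue& q, cl::Buffer& a, const void* src, size_t bytes)
    {
        // Blocking: the host stalls until the DMA transfer has finished.
        // q.enqueueWriteBuffer(a, CL_TRUE, 0, bytes, src);

        // Non-blocking: returns immediately; overlap work on other queues, then
        // synchronize on the event only where it is really needed.
        cl::Event done;
        q.enqueueWriteBuffer(a, CL_FALSE, 0, bytes, src, nullptr, &done);
        // ... enqueue more transfers / device-side copies here ...
        done.wait();
    }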
