Question

I am trying to implement a producer-consumer relationship between two GPUs. My application has the producer GPU record an event, and the consumer GPU then inserts a stream-wait on that event into its command stream. When the consumer GPU encounters the stream-wait, it stops processing its commands until the producer GPU has passed the 'point of execution' where cudaEventRecord was called. I started with a staging-buffer copy like this:

cudaError_t chCpyP2P(void *_dst, int dstDevice, const void *_src, int srcDevice, size_t N) {
    cudaError_t status;
    char *dst = (char*) _dst;
    const char *src = (const char*) _src;
    int stagingIndex = 0;
    size_t copySize = min(N, STAGING_BUFFER_SIZE);
    while ( N ) {   
        cudaSetDevice(srcDevice);
        cudaStreamWaitEvent(0, g_events[dstDevice][stagingIndex],0);
        cudaMemcpyAsync(g_hostBuffers[stagingIndex], src, copySize, cudaMemcpyDeviceToHost, NULL);
        cudaEventRecord(g_events[srcDevice][stagingIndex] );
        cudaSetDevice(dstDevice);
        cudaMemcpyAsync(dst, g_hostBuffers[stagingIndex], copySize, cudaMemcpyHostToDevice, NULL);

        dst += copySize;
        src += copySize;
        N -= copySize;
        stagingIndex = 1 - stagingIndex;
    }

}

But I am somehow missing an essential step, as it is not working as expected. I cannot find where my reasoning goes wrong. Does anyone know what I could do? Thanks for the help; I hope my question isn't too dumb.

Was it helpful?

Solution

On the host, a staging buffer is allocated, and the copy starts with the source GPU copying source data into the staging buffer and recording an event. Unlike a staged host-to-device memcpy, the CPU does not need to synchronize, because all synchronization is done by the GPUs. Since both the memcpy and the event record are asynchronous, directly after the initial memcpy the CPU can ask the destination GPU to wait on that event and begin a memcpy out of the same buffer. So that the two GPUs can use the staging buffers concurrently, two staging buffers and two CUDA events are needed. The CPU loops over the input and output buffers, issuing memcpy and event-record commands until it has requested copies for all bytes, then waits for both GPUs to finish processing.

cudaError_t chCpyP2P(void *_dst, int dstDevice, const void *_src, int srcDevice, size_t N)
{
    cudaError_t status = cudaSuccess;
    char *dst = (char *) _dst;
    const char *src = (const char *) _src;
    int stg_idx = 0; // staging index: ping-pongs between the two buffers
    while (N) {
        size_t sz_cpy = min(N, STAGING_BUFFER_SIZE);
        cudaSetDevice(srcDevice);
        // Wait until the destination GPU has drained this staging buffer.
        cudaStreamWaitEvent(0, g_events[dstDevice][stg_idx], 0);
        cudaMemcpyAsync(g_hostBuffers[stg_idx], src, sz_cpy, cudaMemcpyDeviceToHost, NULL);
        // Signal that the staging buffer has been filled.
        cudaEventRecord(g_events[srcDevice][stg_idx]);
        cudaSetDevice(dstDevice);
        // Wait until the source GPU has filled this staging buffer.
        cudaStreamWaitEvent(0, g_events[srcDevice][stg_idx], 0);
        cudaMemcpyAsync(dst, g_hostBuffers[stg_idx], sz_cpy, cudaMemcpyHostToDevice, NULL);
        // Signal that the staging buffer may be reused.
        cudaEventRecord(g_events[dstDevice][stg_idx]);
        dst += sz_cpy;
        src += sz_cpy;
        N -= sz_cpy;
        stg_idx = 1 - stg_idx; // switch to the other staging buffer
    }
    cudaSetDevice(srcDevice);
    status = cudaDeviceSynchronize();
    if (status != cudaSuccess)
        return status;
    cudaSetDevice(dstDevice);
    status = cudaDeviceSynchronize();
    return status;
}
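The g_hostBuffers and g_events globals are not shown in the post; a minimal setup sketch could look like the following (the names chInitP2P and MAX_DEVICES and the buffer size are assumptions, not from the original code):

```cuda
#include <cuda_runtime.h>

#define STAGING_BUFFER_SIZE (4u * 1024 * 1024)  // 4 MiB per buffer (assumed size)
#define MAX_DEVICES 16                          // assumed upper bound on GPU count

void *g_hostBuffers[2];                 // two pinned host staging buffers
cudaEvent_t g_events[MAX_DEVICES][2];   // one event per device per staging buffer

cudaError_t chInitP2P(int numDevices)
{
    cudaError_t status;
    // Pinned (page-locked) host memory is required for cudaMemcpyAsync
    // to be truly asynchronous with respect to the host.
    for (int i = 0; i < 2; i++) {
        status = cudaHostAlloc(&g_hostBuffers[i], STAGING_BUFFER_SIZE,
                               cudaHostAllocPortable);
        if (status != cudaSuccess)
            return status;
    }
    for (int dev = 0; dev < numDevices; dev++) {
        status = cudaSetDevice(dev);
        if (status != cudaSuccess)
            return status;
        for (int i = 0; i < 2; i++) {
            // Waiting on a never-recorded event is a no-op, so the first
            // cudaStreamWaitEvent in chCpyP2P will not deadlock.
            status = cudaEventCreate(&g_events[dev][i]);
            if (status != cudaSuccess)
                return status;
        }
    }
    return cudaSuccess;
}
```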

You also need to compute size_t sz_cpy inside the loop, not outside it, so that the final, partial chunk copies the right number of bytes ;-)
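Assuming the staging buffers and events have been initialized elsewhere, a hypothetical call copying 64 MiB from device 0 to device 1 might look like this (the pointer names and sizes are illustrative, not from the post):

```cuda
// Hypothetical usage sketch.
size_t n = 64ull * 1024 * 1024;
void *dSrc = NULL, *dDst = NULL;
cudaSetDevice(0);
cudaMalloc(&dSrc, n);   // source buffer on device 0
cudaSetDevice(1);
cudaMalloc(&dDst, n);   // destination buffer on device 1
cudaError_t err = chCpyP2P(dDst, 1, dSrc, 0, n);
// chCpyP2P synchronizes both devices before returning,
// so the copy is complete when err == cudaSuccess.
```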

Licensed under: CC-BY-SA with attribution
Not affiliated with StackOverflow