Any CUDA operation after cudaStreamSynchronize blocks until all streams are finished

StackOverflow https://stackoverflow.com/questions/23203886

07-07-2023

Question

While profiling my CUDA application with the NVIDIA Visual Profiler, I noticed that any operation after cudaStreamSynchronize blocks until all streams have finished. This is very odd behavior, because if cudaStreamSynchronize returns, that means the stream has finished, right? Here is my pseudo-code:

#include <cstring>
#include <list>
#include <thread>
#include <cuda_runtime.h>

// data, size and runKernel are defined elsewhere.

std::list<std::thread> waitingThreads;

void waitForFinish(void* cpuPinnedMemory, void* gpuMemory, cudaStream_t stream);

void startKernelsAsync() {
    for (int i = 0; i < 200; ++i) {
        void* cpuPinnedMemory;
        void* gpuMemory;
        cudaStream_t stream;

        cudaHostAlloc(&cpuPinnedMemory, size, cudaHostAllocDefault);
        memcpy(cpuPinnedMemory, data, size);
        cudaMalloc(&gpuMemory, size);

        cudaStreamCreate(&stream);
        cudaMemcpyAsync(gpuMemory, cpuPinnedMemory, size, cudaMemcpyHostToDevice, stream);
        runKernel<<<32, 32, 0, stream>>>(gpuMemory);
        cudaMemcpyAsync(cpuPinnedMemory, gpuMemory, size, cudaMemcpyDeviceToHost, stream);

        waitingThreads.emplace_back(waitForFinish, cpuPinnedMemory, gpuMemory, stream);
    }

    while (!waitingThreads.empty()) {
        waitingThreads.front().join();
        waitingThreads.pop_front();
    }
}

void waitForFinish(void* cpuPinnedMemory, void* gpuMemory, cudaStream_t stream) {
    cudaStreamSynchronize(stream);
    cudaStreamDestroy(stream);  // <== This blocks until all streams are finished.
    memcpy(data, cpuPinnedMemory, size);
    cudaFreeHost(cpuPinnedMemory);
    cudaFree(gpuMemory);
}

If I put cudaFreeHost before cudaStreamDestroy, then cudaFreeHost becomes the blocking operation instead.

Is there anything conceptually wrong here?

EDIT: I found another weird behavior: sometimes the call un-blocks in the middle of processing the streams and then processes the rest of the streams.

Normal behavior: [profiler timeline image]

Strange behavior (happens quite often): [profiler timeline image]

EDIT2: I am testing on a Tesla K40c card (compute capability 3.5) with CUDA 6.0.

As suggested in the comments, it may be viable to reduce the number of streams; however, in my application the memory transfers are quite fast, and I want to use streams mainly to dynamically schedule work onto the GPU. The problem is that after a stream finishes I need to download the data from pinned memory and free the allocated memory for further streams, and this seems to be a blocking operation.

I am using one stream per data-set because every data-set has a different size and its processing takes an unpredictably long time.
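
(For reference, one way to get a per-stream completion notification without blocking a host thread is cudaStreamAddCallback, which has been available since CUDA 5.0. The sketch below is purely illustrative: CUDA API calls are not allowed inside such a callback, so the frees themselves would still have to be handed off to another thread.)

#include <cstdio>
#include <cuda_runtime.h>

__global__ void dummyKernel(int* p) { *p = 42; }

// Runs on a CUDA-internal thread once all work queued on the stream before
// the callback has completed. No CUDA API calls are allowed in here.
void CUDART_CB onStreamDone(cudaStream_t stream, cudaError_t status, void* userData) {
    printf("stream finished, status=%d\n", (int)status);
}

int main() {
    cudaStream_t stream;
    cudaStreamCreate(&stream);

    int* devInt;
    cudaMalloc(&devInt, sizeof(int));

    dummyKernel<<<1, 1, 0, stream>>>(devInt);
    cudaStreamAddCallback(stream, onStreamDone, nullptr, 0);  // flags must be 0

    cudaStreamSynchronize(stream);  // only so main() does not exit early
    cudaFree(devInt);
    cudaStreamDestroy(stream);
    return 0;
}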

Any ideas how to solve this?


Solution

I haven't found out why the operations block, but I concluded that I cannot do anything about it, so I decided to implement memory and stream pooling (as suggested in the comments) to re-use GPU memory, pinned CPU memory and streams, and to avoid any kind of deletion.

In case anybody is interested, here is my solution. startKernelAsync behaves as an asynchronous operation that schedules a kernel, and a callback is invoked after the kernel has finished.

std::vector<Instance*> m_idleInstances;
std::vector<Instance*> m_workingInstances;

void startKernelAsync(...) {
    // Wait until some previously scheduled stream has finished and its
    // instance (stream + buffers) has been returned to the idle pool.
    while (m_idleInstances.empty()) {
        findFinishedInstance();
        if (m_idleInstances.empty()) {
            std::chrono::milliseconds dur(10);
            std::this_thread::sleep_for(dur);
        }
    }

    Instance* instance = m_idleInstances.back();
    m_idleInstances.pop_back();

    // Fill CPU pinned memory

    cudaMemcpyAsync(..., instance->stream);
    runKernel<<<32, 32, 0, instance->stream>>>(instance->gpuMemory);
    cudaMemcpyAsync(..., instance->stream);

    m_workingInstances.push_back(instance);
}

void findFinishedInstance() {
    for (auto it = m_workingInstances.begin(); it != m_workingInstances.end();) {
        Instance* inst = *it;
        // cudaStreamQuery is non-blocking: cudaSuccess means all work queued
        // on the stream has completed; cudaErrorNotReady means it is still running.
        cudaError_t status = cudaStreamQuery(inst->stream);
        if (status == cudaSuccess) {
            it = m_workingInstances.erase(it);
            m_callback(inst->clusterGroup);
            m_idleInstances.push_back(inst);
        }
        else {
            ++it;
        }
    }
}
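
For context, the answer does not show what an Instance actually holds; here is a minimal sketch of what the pooled resources presumably look like. The struct layout, the pool size and the initialization are my assumptions, not part of the original answer:

// Assumed shape of a pooled instance: one stream plus the pinned host buffer
// and device buffer it re-uses, all allocated once up front so nothing ever
// has to be freed between kernels.
struct Instance {
    cudaStream_t stream;
    void* cpuPinnedMemory;
    void* gpuMemory;
    void* clusterGroup;  // opaque payload handed to m_callback (type assumed)
};

void initPool(size_t instanceCount, size_t bufferSize) {
    for (size_t i = 0; i < instanceCount; ++i) {
        Instance* inst = new Instance();
        cudaStreamCreate(&inst->stream);
        cudaHostAlloc(&inst->cpuPinnedMemory, bufferSize, cudaHostAllocDefault);
        cudaMalloc(&inst->gpuMemory, bufferSize);
        m_idleInstances.push_back(inst);  // every instance starts out idle
    }
}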

And at the end, just wait for everything to finish:

virtual void waitForFinish() {
    while (m_workingInstances.size() > 0) {
        Instance* instance = m_workingInstances.back();
        m_workingInstances.pop_back();
        m_idleInstances.push_back(instance);
        cudaStreamSynchronize(instance->stream);
        finalizeInstance(instance);
    }
}
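
A hypothetical call site, just to show how the pieces fit together (initPool from the sketch above, dataSets and maxBufferSize are assumptions, not part of the original answer):

initPool(8, maxBufferSize);        // e.g. keep 8 streams/instances in flight
for (auto& dataSet : dataSets)
    startKernelAsync(/* dataSet, ... */);  // blocks only until some instance is idle
waitForFinish();                   // drains the remaining working instances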

And here is a graph from the profiler; it works like a charm!

Graph of streams: [profiler timeline image]

Other tips

Check out the list of "Implicit Synchronization" rules in the CUDA C Programming Guide PDF that comes with the toolkit. (Section 3.2.5.5.4 in my copy, but you might have a different version.)

If your GPU is "compute capability 3.0 or lower", there are some special rules that apply. My guess would be that cudaStreamDestroy() is hitting one of those limitations.
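
To make that concrete, here is a minimal sketch (my own illustration, not from the answer) of the kind of implicit synchronization involved: cudaFree waits for all previously issued device work to finish before returning, so it serializes otherwise-independent streams.

#include <cuda_runtime.h>

__global__ void slowKernel(float* p) {
    // stand-in for a long-running kernel
    for (int i = 0; i < (1 << 20); ++i) p[threadIdx.x] += 1.0f;
}

int main() {
    cudaStream_t s1, s2;
    cudaStreamCreate(&s1);
    cudaStreamCreate(&s2);

    float *a, *b;
    cudaMalloc(&a, 1024 * sizeof(float));
    cudaMalloc(&b, 1024 * sizeof(float));

    slowKernel<<<1, 32, 0, s1>>>(a);

    // Implicitly synchronizing: cudaFree blocks until all prior device work
    // (the kernel on s1) has finished, so the launch on s2 below cannot
    // overlap with the kernel on s1.
    cudaFree(a);

    slowKernel<<<1, 32, 0, s2>>>(b);

    cudaDeviceSynchronize();
    cudaFree(b);
    cudaStreamDestroy(s1);
    cudaStreamDestroy(s2);
    return 0;
}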

License: CC-BY-SA with attribution
Not affiliated with StackOverflow