In my answer to your previous post (Concurrently running two for loops with same number of loop cycles involving GPU and CPU tasks on two GPU), I pointed out that you would not have achieved a speedup of 2 when using 2 GPUs.

To explain why, let us consider the following code snippet:
Kernel1<<<...,...>>>(...);                      // assume Kernel1 takes t1 seconds
cudaMemcpy(...,...,...,cudaMemcpyDeviceToHost); // copy the results of Kernel1 to the CPU
CPUFunction(...);                               // assume CPUFunction plus the two cudaMemcpys take tCPU seconds altogether
cudaMemcpy(...,...,...,cudaMemcpyHostToDevice); // copy data from the CPU for Kernel2
Kernel2<<<...,...>>>(...);                      // assume Kernel2 takes t2 seconds
It does not matter whether the synchronization is obtained by cudaDeviceSynchronize() or by cudaMemcpy itself: cudaMemcpy is a blocking call, so it implicitly waits for the previously launched kernel to finish.
The cost of executing the above code snippet in the for loop on one GPU only (the loop body runs twice, once per data instance) is

t1 + tCPU + t2 + t1 + tCPU + t2 = 2*t1 + 2*tCPU + 2*t2
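For reference, here is a minimal sketch of that single-GPU loop (kernel names, buffers, sizes, and launch configurations are placeholders, not taken from your post):

for (int i = 0; i < 2; i++) {                                  // two instances of the work
    Kernel1<<<grid, block>>>(d_in[i], d_mid[i]);               // t1 per iteration
    cudaMemcpy(h_buf, d_mid[i], size, cudaMemcpyDeviceToHost); // blocks until Kernel1 is done
    CPUFunction(h_buf);                                        // tCPU per iteration (including the two copies)
    cudaMemcpy(d_mid[i], h_buf, size, cudaMemcpyHostToDevice);
    Kernel2<<<grid, block>>>(d_mid[i], d_out[i]);              // t2 per iteration
}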
In the case of 2 GPUs, if you were able to achieve perfect concurrency of the execution of Kernel1 and Kernel2 on the two different GPUs, then the cost of executing the above code snippet would be

t1 (concurrent execution of Kernel1 on the two GPUs) + 2*tCPU (two calls to the CPU function are needed, one for each instance of the output of Kernel1) + t2 (concurrent execution of Kernel2 on the two GPUs)
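A minimal sketch of how that concurrency could be arranged (same placeholder names as above, with the buffers now assumed to be allocated per device): kernel launches are asynchronous with respect to the host, so issuing Kernel1 on both devices before the first blocking cudaMemcpy lets the two kernel executions overlap, while the two CPUFunction calls remain serialized on the host.

// Launch Kernel1 on both GPUs; launches return immediately,
// so the two kernels execute concurrently (the t1 term).
for (int dev = 0; dev < 2; dev++) {
    cudaSetDevice(dev);
    Kernel1<<<grid, block>>>(d_in[dev], d_mid[dev]);
}

// Copy each result back and process it on the host; the two
// CPUFunction calls run one after the other (the 2*tCPU term).
for (int dev = 0; dev < 2; dev++) {
    cudaSetDevice(dev);
    cudaMemcpy(h_buf[dev], d_mid[dev], size, cudaMemcpyDeviceToHost);
    CPUFunction(h_buf[dev]);
    cudaMemcpy(d_mid[dev], h_buf[dev], size, cudaMemcpyHostToDevice);
}

// Launch Kernel2 on both GPUs; again concurrent (the t2 term).
for (int dev = 0; dev < 2; dev++) {
    cudaSetDevice(dev);
    Kernel2<<<grid, block>>>(d_mid[dev], d_out[dev]);
}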
Accordingly, the speedup achieved by using two GPUs instead of one would be

(2*(t1 + tCPU + t2)) / (t1 + 2*tCPU + t2)

When tCPU is equal to zero, the speedup becomes 2.
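For example, with made-up numbers t1 = t2 = 10 ms and tCPU = 5 ms, the speedup would be (2*(10 + 5 + 10)) / (10 + 2*5 + 10) = 50/30 ≈ 1.67, noticeably less than 2.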
This is an expression of Amdahl's law.
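Indeed, over the two instances the sequential (CPU) fraction of the single-GPU runtime is f = 2*tCPU / (2*(t1 + tCPU + t2)), and Amdahl's law for N = 2 processors, 1 / (f + (1 - f)/2), reduces exactly to the speedup expression above: as long as tCPU > 0, the sequential CPU work keeps the speedup below 2.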