Code running on two GPUs does not reach concurrent execution and has irrelevant speedup

Question 1

In your comment you say:

RunDll has two kernels and they are being launched one by one. Kernels do have cudaThreadSynchronize()

Note that cudaThreadSynchronize() is equivalent to cudaDeviceSynchronize() (and the former is actually deprecated) which means that you will run on one GPU, synchronise, then run on the other GPU. Also note that cudaMemcpy() is a blocking routine, you would need the cudaMemcpyAsync() version to avoid all blocking (as pointed out by @JackOLantern in comments).
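As a rough illustration of the point above, the following sketch issues the work for both GPUs before waiting on either of them. Since the contents of RunDLL() are not known, everything here is an assumption: `scaleKernel`, the buffer names, the problem size and the use of one stream per device are hypothetical stand-ins, and error checking is omitted for brevity. The host buffers are allocated with cudaMallocHost() because cudaMemcpyAsync() only overlaps with other work when the host memory is page-locked.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Hypothetical kernel standing in for the kernels inside RunDLL().
__global__ void scaleKernel(float *data, int n, float factor) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) data[i] *= factor;
}

int main() {
    const int numGpus = 2;                 // assumption: two CUDA devices are present
    const int n = 1 << 20;                 // arbitrary problem size
    const size_t bytes = n * sizeof(float);

    float *hostBuf[numGpus];
    float *devBuf[numGpus];
    cudaStream_t stream[numGpus];

    // Per-device setup: one stream, one pinned host buffer, one device buffer.
    for (int g = 0; g < numGpus; ++g) {
        cudaSetDevice(g);
        cudaStreamCreate(&stream[g]);
        cudaMallocHost((void **)&hostBuf[g], bytes);  // page-locked host memory
        cudaMalloc((void **)&devBuf[g], bytes);
        for (int i = 0; i < n; ++i) hostBuf[g][i] = 1.0f;
    }

    // Issue copies and kernels for both devices without synchronizing in between.
    for (int g = 0; g < numGpus; ++g) {
        cudaSetDevice(g);
        cudaMemcpyAsync(devBuf[g], hostBuf[g], bytes, cudaMemcpyHostToDevice, stream[g]);
        scaleKernel<<<(n + 255) / 256, 256, 0, stream[g]>>>(devBuf[g], n, 2.0f);
        cudaMemcpyAsync(hostBuf[g], devBuf[g], bytes, cudaMemcpyDeviceToHost, stream[g]);
    }

    // Only now wait for each device to finish its queued work.
    for (int g = 0; g < numGpus; ++g) {
        cudaSetDevice(g);
        cudaStreamSynchronize(stream[g]);
        printf("GPU %d: hostBuf[0] = %f\n", g, hostBuf[g][0]);
    }

    // Release per-device resources.
    for (int g = 0; g < numGpus; ++g) {
        cudaSetDevice(g);
        cudaFree(devBuf[g]);
        cudaFreeHost(hostBuf[g]);
        cudaStreamDestroy(stream[g]);
    }
    return 0;
}
```

The key difference from a launch, synchronise, launch pattern is that the host thread never blocks between the two devices, so both GPUs can be busy at the same time.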

In general, you will need to post more details of what is inside RunDLL(), since without them your question does not contain enough information for a definitive answer. Ideally follow these guidelines.

Question 2

In my answer to your previous post (Concurrently running two for loops with same number of loop cycles involving GPU and CPU tasks on two GPU), I pointed out that you would not have achieved a speedup of 2 when using 2 GPUs.

To explain why, let us consider the following code snippet:

Kernel1<<<...,...>>>(...); // assume Kernel1 takes t1 seconds

// assume CPUFunction + cudaMemcpys take tCPU seconds
cudaMemcpy(...,...,...,cudaMemcpyDeviceToHost); // copy the results of Kernel1 to CPU
CPUFunction(...); // CPU-side processing; its time is included in tCPU
cudaMemcpy(...,...,...,cudaMemcpyHostToDevice); // copy data from the CPU to Kernel2

Kernel2<<<...,...>>>(...); // assume it takes t2 seconds

It doesn't matter if I'm using cudaDeviceSynchronize() or cudaMemcpy to obtain synchronization.

The cost of executing two instances of the above code snippet in the for loop on a single GPU is

t1 + tCPU + t2 + t1 + tCPU + t2 = 2t1 + 2tCPU + 2t2

In the case of 2 GPUs, if you were able to achieve perfect concurrency of the execution of Kernel1 and Kernel2 on the two different GPUs, then the cost of executing the above code snippet would be

t1 (concurrent execution of Kernel1 on the two GPUs) + 2*tCPU (you need two calls to the CPU function, each for a different instance of the output of Kernel1) + t2 (concurrent execution of Kernel2 on the two GPUs)

Accordingly, the speedup achieved by using two GPUs instead of one would be

(2*(t1 + tCPU + t2))/(t1 + 2tCPU + t2)

When tCPU is equal to zero, the speedup becomes 2.

This is an expression of Amdahl's law: the CPU part is not parallelized across the two GPUs, so it limits the speedup you can achieve.
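To make the dependence on tCPU explicit, the speedup expression can be rearranged as follows. This is only a rewriting of the formula above under the same idealised assumptions (perfect concurrency of the kernels on the two GPUs, CPU work fully serial); the numbers in the last line are made up for illustration.

```latex
% Speedup with two GPUs, rewritten to isolate the CPU term
S = \frac{2\,(t_1 + t_{CPU} + t_2)}{t_1 + 2\,t_{CPU} + t_2}
  = 2 - \frac{2\,t_{CPU}}{t_1 + 2\,t_{CPU} + t_2}

% Hence S = 2 only for t_{CPU} = 0, S < 2 for any t_{CPU} > 0, and
\lim_{t_{CPU} \to \infty} S = 1

% Same result in Amdahl's form with N = 2 and parallel fraction p
p = \frac{t_1 + t_2}{t_1 + t_{CPU} + t_2}, \qquad
S = \frac{1}{(1 - p) + p/2}

% Illustrative numbers: t_1 = t_2 = 1, t_{CPU} = 2
S = \frac{2\,(1 + 2 + 1)}{1 + 4 + 1} = \frac{8}{6} \approx 1.33
```

In other words, the more time the loop spends in CPUFunction and the memory copies relative to the kernels, the closer the speedup gets to 1, regardless of how well the kernels overlap.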