Anything I can do to speed it up?
I believe you can replace your copy loop with:
cudaMemcpy2DAsync(dA, num_row*sizeof(double), &A, LDA*sizeof(double), num_row*sizeof(double), ldu, cudaMemcpyHostToDevice, streams[0]);
which should speed things up (at least from a call overhead standpoint) significantly.
You may have to adjust the parameters a bit, as your names are somewhat confusing to me (maybe you are using column-major storage). The cudaMemcpy2DAsync function is documented in the CUDA Runtime API reference.
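For reference, here is a minimal sketch of how the arguments line up, assuming column-major storage. The arguments are, in order: destination pointer, destination pitch in bytes, source pointer, source pitch in bytes, width in bytes, height (number of "rows" of the copy), copy kind, and stream. With column-major data each matrix column is one "row" of the 2D copy, so the width is num_row*sizeof(double) and the height is the number of columns. The sizes, the name num_col, and the helper count_mismatches_after_copy are made up for illustration and are not from your code; I also assumed pinned host memory (cudaMallocHost), since copies from pageable memory may not actually overlap with host work.

```cuda
#include <cstdio>
#include <cstdlib>
#include <cuda_runtime.h>

// Abort with a message if a CUDA runtime call fails.
#define CHECK(call)                                                   \
    do {                                                              \
        cudaError_t err_ = (call);                                    \
        if (err_ != cudaSuccess) {                                    \
            fprintf(stderr, "CUDA error %s at %s:%d\n",               \
                    cudaGetErrorString(err_), __FILE__, __LINE__);    \
            exit(EXIT_FAILURE);                                       \
        }                                                             \
    } while (0)

// Hypothetical helper: copies a num_row x num_col block out of a
// column-major host matrix with leading dimension LDA into a packed
// device buffer, copies it back, and returns the number of mismatches.
int count_mismatches_after_copy() {
    const int LDA = 8;      // assumed leading dimension of the host matrix (in elements)
    const int num_row = 5;  // rows of the block to copy
    const int num_col = 3;  // columns of the block to copy (assumed name)

    double *A = nullptr;    // pinned host source, column-major, padded to LDA
    double *B = nullptr;    // pinned host buffer for the packed result
    double *dA = nullptr;   // packed device destination (leading dimension num_row)
    CHECK(cudaMallocHost((void **)&A, sizeof(double) * LDA * num_col));
    CHECK(cudaMallocHost((void **)&B, sizeof(double) * num_row * num_col));
    CHECK(cudaMalloc((void **)&dA, sizeof(double) * num_row * num_col));

    // Fill every element, including the padding rows, with a distinct value
    // so that a wrong pitch would show up as a mismatch.
    for (int j = 0; j < num_col; ++j)
        for (int i = 0; i < LDA; ++i)
            A[j * LDA + i] = 100.0 * j + i;

    cudaStream_t stream;
    CHECK(cudaStreamCreate(&stream));

    // One call replaces a loop of per-column copies.
    CHECK(cudaMemcpy2DAsync(dA, num_row * sizeof(double),  // dst, dst pitch (bytes)
                            A, LDA * sizeof(double),       // src, src pitch (bytes)
                            num_row * sizeof(double),      // width of each column (bytes)
                            num_col,                       // height: number of columns
                            cudaMemcpyHostToDevice, stream));

    // Copy the packed block back on the same stream and wait for it.
    CHECK(cudaMemcpyAsync(B, dA, sizeof(double) * num_row * num_col,
                          cudaMemcpyDeviceToHost, stream));
    CHECK(cudaStreamSynchronize(stream));

    int mismatches = 0;
    for (int j = 0; j < num_col; ++j)
        for (int i = 0; i < num_row; ++i)
            if (B[j * num_row + i] != A[j * LDA + i]) ++mismatches;

    CHECK(cudaStreamDestroy(stream));
    CHECK(cudaFree(dA));
    CHECK(cudaFreeHost(B));
    CHECK(cudaFreeHost(A));
    return mismatches;
}

int main() {
    int mismatches = count_mismatches_after_copy();
    printf("mismatches: %d\n", mismatches);
    return mismatches != 0;
}
```

If your matrix is row-major instead, the roles swap: the width becomes the number of columns times sizeof(double) and the height becomes the number of rows.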