Multiple matrix-vector calls with CUBLAS

Question 1

Updated

In fact, you have to pay special attention to row-major versus column-major ordering if you want to speed up your code in this case.

As shown in your revised question, you use row-major matrices, so you have a super-matrix A[(2048*128)x8] and a super-vector V[(2048*128)x1]. Here I assume that you want a col-major output matrix output[8x128] (which can be seen as a super-vector [(8*128)x1]), where each column is the result of transpose( miniA[2048x8] ) * miniV[2048x1].

On the other hand, CUBLAS assumes that matrices are stored in column-major order, so some extra matrix transpose calls may be needed to change the ordering.
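To make the ordering issue concrete, here is a minimal CPU-side sketch (plain Python, not CUBLAS) of the reinterpretation that the steps below rely on: a row-major [RxC] buffer, read without any copying as a col-major [CxR] matrix, is exactly the transpose. The helper names `at_row_major` and `at_col_major` are made up for this illustration.

```python
def at_row_major(buf, rows, cols, i, j):
    """Element (i, j) of a rows x cols matrix stored row-major in a flat buffer."""
    return buf[i * cols + j]


def at_col_major(buf, rows, cols, i, j):
    """Element (i, j) of a rows x cols matrix stored col-major in a flat buffer."""
    return buf[i + j * rows]


# A small 3x2 row-major matrix [[1, 2], [3, 4], [5, 6]] as a flat buffer.
A = [1, 2, 3, 4, 5, 6]

# Reading the very same buffer as a col-major 2x3 matrix gives transpose(A):
# element (j, i) of the col-major view equals element (i, j) of the row-major one.
same_buffer_is_transpose = all(
    at_row_major(A, 3, 2, i, j) == at_col_major(A, 2, 3, j, i)
    for i in range(3)
    for j in range(2)
)
```

This is why a row-major matrix can be handed to a column-major library as its own transpose at no cost, and why explicit transposes are only needed where the required layout cannot be obtained by such a reinterpretation.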

Since you need 128 independent [8x1] results, you should be able to compute them in 4 CUBLAS API calls, which should be more efficient than your original 128 calls.

1. Row-major A[(2048*128)x8] can be seen as col-major AA[8x(2048*128)]
   B[8x(2048*128)] = AA[8x(2048*128)] * diag( V[(2048*128)x1] )    by 1 dgmm()

2. C[(2048*128)x8] = transpose( B[8x(2048*128)] )                  by 1 geam()

3. Col-major C[(2048*128)x8] can be seen as col-major CC[2048x(8*128)]
   O[1x(8*128)] = ones[1x2048] * CC[2048x(8*128)]                  by 1 gemv()

4. Row vector O[1x(8*128)] can be seen as col-major matrix OO[128x8]
   output[8x128] = transpose( OO[128x8] )                          by 1 geam()

This col-major output[8x128] is what you want.
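The four steps above can be checked on the CPU with a small sketch. This is not CUBLAS code: it emulates each call on flat Python lists, and it uses small stand-in sizes (n = 4 for 2048, m = 3 for 8, b = 2 for 128). The function names `four_step`, `direct` and `transpose_col_major` are hypothetical and exist only for this illustration.

```python
def transpose_col_major(buf, rows, cols):
    """Col-major buffer of the transpose of a rows x cols col-major matrix."""
    out = [0] * (rows * cols)
    for j in range(cols):
        for i in range(rows):
            out[j + i * cols] = buf[i + j * rows]
    return out


def four_step(A, V, n, m, b):
    """Emulate the dgmm / geam / gemv / geam sequence on flat buffers."""
    total = n * b
    # Step 1 (dgmm): row-major A[total x m] read as col-major AA[m x total];
    # multiplying by diag(V) on the right scales column i by V[i].
    B = [A[j + i * m] * V[i] for i in range(total) for j in range(m)]
    # Step 2 (geam): C[total x m] = transpose(B[m x total]).
    C = transpose_col_major(B, m, total)
    # Step 3 (gemv): read C as col-major CC[n x (m*b)] and sum each column,
    # which is what ones[1 x n] * CC computes.
    O = [sum(C[c * n:(c + 1) * n]) for c in range(m * b)]
    # Step 4 (geam): read O as col-major OO[b x m], transpose to output[m x b].
    return transpose_col_major(O, b, m)


def direct(A, V, n, m, b):
    """Reference: column k of output is transpose(miniA_k) * miniV_k."""
    out = [0] * (m * b)
    for k in range(b):
        for j in range(m):
            out[j + k * m] = sum(
                A[(k * n + r) * m + j] * V[k * n + r] for r in range(n)
            )
    return out


# Small integer test data so the comparison is exact.
n, m, b = 4, 3, 2
A = list(range(1, n * b * m + 1))   # row-major [(n*b) x m]
V = list(range(1, n * b + 1))       # [(n*b) x 1]
steps_match_direct = four_step(A, V, n, m, b) == direct(A, V, n, m, b)
```

With this data the two results agree, which supports the index bookkeeping in steps 1 to 4. It says nothing about performance; that still has to be measured with the real CUBLAS calls.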

Since you need to add to the output rather than replace it, you may need one more call to add the original values to output.

Question 2

I did a very quick run of the batchCUBLAS SDK example, using 128 independent runs for matrices of size 2048x8 and 8x1. Here are the results on an NVIDIA GeForce GT 540M (compute capability 2.1) and on a Kepler K20c (compute capability 3.5).

For the NVIDIA GeForce GT 540M, the "streamed" and "batched" versions show no significant improvement over the "non-streamed" cuBLAS execution.

For the NVIDIA Kepler K20c, I have obtained

sgemm 1.87 GFlops (non-streamed); 3.08 GFlops (streamed); 6.58 GFlops (batched);

dgemm 1.00 GFlops (non-streamed); 1.43 GFlops (streamed); 6.67 GFlops (batched);

On the K20c, the streamed and batched cases noticeably improve on the non-streamed case. Batching gives the largest gain: about 3.5x for sgemm and about 6.7x for dgemm.

Disclaimers

Unlike your code, I'm not accounting for transposition;
The SDK example considers matrix-matrix multiplications, whereas you need matrix-vector multiplications; streaming is possible for gemv, but not batching.

I hope these partial results provide you with some useful information.