Question

I currently have to perform 128 independent, sequential matrix-vector cuBLAS operations. All the matrices and vectors are different. Each matrix is stored right after the next in memory, and the vectors are likewise stored contiguously (all in row-major form).

A bit more context: the matrices are (2048 x 8) and each vector has length 2048. The outputs are all independent. Because the data is laid out contiguously, I effectively have the following super matrix and super vectors:

matrix[(2048*128)x8]
vector[(2048*128)x1]
output[(8*128)x1]

With cublasSgemv I'm doing a transpose on each mini matrix first and then adding (rather than replacing) the result in memory with:

cublasSgemv(*handle, CUBLAS_OP_T, Bdim, Adim, scale1, d_matrix + offset1, Bdim, d_vector + offset2, 1, scale2, out + offset3, 1);

I am making 128 such calls which I would like to do in one.
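
In code, the loop is roughly the following sketch (the offsets simply step through the contiguous layout described above; I assume scale1/scale2 are host floats, since the v2 API takes them by pointer, and the transpose/dimension arguments are kept as in the call above, per the note below):

    #include <cublas_v2.h>

    // Sketch of the current approach: 128 sequential cublasSgemv calls,
    // one per mini matrix/vector. Bdim, Adim and the op flag are as in
    // the call shown above (orientation deliberately left aside); the
    // offsets follow the contiguous row-major layout described earlier.
    void gemv_loop(cublasHandle_t *handle, int Adim, int Bdim,
                   const float *d_matrix, const float *d_vector, float *out,
                   float scale1, float scale2)
    {
        for (int i = 0; i < 128; ++i) {
            size_t offset1 = (size_t)i * 2048 * 8;  // i-th mini matrix
            size_t offset2 = (size_t)i * 2048;      // i-th mini vector
            size_t offset3 = (size_t)i * 8;         // i-th 8x1 output
            cublasSgemv(*handle, CUBLAS_OP_T, Bdim, Adim, &scale1,
                        d_matrix + offset1, Bdim,
                        d_vector + offset2, 1,
                        &scale2, out + offset3, 1);
        }
    }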

The profiler shows significant performance degradation from making these multiple calls. What is the best way to do multiple matrix-vector operations? Is there a way to batch them together into one fast call?

Are streams the best way to go, or is there some way to make a single call with the relevant offsets (to index into my array of matrices and vectors)? The only other efficient option seemed to be to use a cuSPARSE call and stick all the matrices on the diagonal.

NOTE: I'm not interested in getting the transposes or row/column major ordering in the gemv call correct for this particular question.

Solution

Updated

In fact, you have to pay special attention to the row/column-major ordering if you want to speed up your code in this case.

As shown in your revised question, you use row-major matrices, so you have a super-matrix A[(2048*128)x8] and a super-vector V[(2048*128)x1]. Here I assume that you want a col-major matrix output[8x128] (which can be seen as a super-vector [(8*128)x1]), where each column is the result of transpose( miniA[2048x8] ) * miniV[2048x1].

On the other hand, cuBLAS assumes that matrices are stored in column-major order, so it may need some extra matrix-transpose routines to change the ordering.

Since you need 128 independent [8x1] results, you should be able to calculate them in 4 CUDA API calls, which should be more efficient than your original 128 calls.

1. Row-major A[(2048*128)x8] can be seen as column-major AA[8x(2048*128)]
   B[8x(2048*128)] = AA[8x(2048*128)] * diag( V[(2048*128)x1] )    by 1 dgmm()

2. C[(2048*128)x8] = transpose( B[8x(2048*128)] )                  by 1 geam()

3. Col-major C[(2048*128)x8] can be seen as col-major CC[2048x(8*128)]
   O[1x(8*128)] = ones[1x2048] * CC[2048x(8*128)]                  by 1 gemv()

4. Row vector O[1x(8*128)] can be seen as col-major matrix OO[128x8]
   output[8x128] = transpose( OO[128x8] )                          by 1 geam()

This col-major output[8x128] is what you want.

Since you need adding rather than replacing, you may need one more call to add the original values to output.
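
A minimal sketch of those four calls (single precision; the scratch buffers d_B, d_C, d_O and the all-ones vector d_ones are hypothetical names and must be allocated on the device beforehand) could look like this:

    #include <cublas_v2.h>

    /* Sketch of the 4-call scheme described above. All pointers are device
       pointers; d_B (8 x 2048*128), d_C (2048*128 x 8) and d_O (8*128) are
       scratch buffers, d_ones holds 2048 ones.                             */
    void batched_gemv_4calls(cublasHandle_t handle,
                             const float *d_A,    /* row-major [(2048*128) x 8] */
                             const float *d_V,    /* [(2048*128) x 1]           */
                             const float *d_ones, /* 2048 ones                  */
                             float *d_B, float *d_C, float *d_O,
                             float *d_out)        /* col-major [8 x 128]        */
    {
        const int rows = 2048 * 128;
        const float one = 1.0f, zero = 0.0f;

        /* 1. B = AA * diag(V), with AA the row-major super-matrix read as a
              column-major 8 x rows matrix.                                  */
        cublasSdgmm(handle, CUBLAS_SIDE_RIGHT, 8, rows,
                    d_A, 8, d_V, 1, d_B, 8);

        /* 2. C[rows x 8] = transpose(B)                                     */
        cublasSgeam(handle, CUBLAS_OP_T, CUBLAS_OP_N, rows, 8,
                    &one, d_B, 8, &zero, d_C, rows, d_C, rows);

        /* 3. O[1 x (8*128)] = ones[1 x 2048] * CC, with CC being C read with
              leading dimension 2048; computed here as O = CC^T * ones.      */
        cublasSgemv(handle, CUBLAS_OP_T, 2048, 8 * 128,
                    &one, d_C, 2048, d_ones, 1, &zero, d_O, 1);

        /* 4. output[8 x 128] = transpose(OO[128 x 8])                       */
        cublasSgeam(handle, CUBLAS_OP_T, CUBLAS_OP_N, 8, 128,
                    &one, d_O, 128, &zero, d_out, 8, d_out, 8);
    }

In the last geam() you can also pass your existing output as B with a non-zero beta (the documented in-place mode, since transb is CUBLAS_OP_N), which can take care of the adding-rather-than-replacing in the same call.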

OTHER TIPS

I have done a very quick launch of the batchCUBLAS SDK example. I have considered 128 independent runs for matrices of size 2048x8 and 8x1. Here are the results on an NVIDIA GeForce GT 540M (compute capability 2.1) and on a Kepler K20c (compute capability 3.5).

For the NVIDIA GeForce GT 540M, there is no relevant improvement of the "streamed" and "batched" versions over the "non-streamed" cuBLAS execution.

For the NVIDIA Kepler K20c, I have obtained

sgemm 1.87 GFlops (non-streamed); 3.08 GFlops (streamed); 6.58 GFlops (batched);

dgemm 1.00 GFlops (non-streamed); 1.43 GFlops (streamed); 6.67 GFlops (batched);

The streamed and batched cases seem to appreciably improve on the non-streamed case for single precision.

Disclaimers

  1. I'm not accounting for the transposition that you perform;
  2. The SDK example considers matrix-matrix multiplications, whereas you need matrix-vector multiplications; streaming is possible for gemv (see the sketch below), but batching is not.
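
For reference, a streamed gemv version could look roughly like the following sketch (hypothetical names; 16 streams assigned round-robin, and each row-major 2048x8 mini matrix is read as a column-major 8x2048 matrix so that CUBLAS_OP_N already computes transpose(miniA) * miniV):

    #include <cuda_runtime.h>
    #include <cublas_v2.h>

    #define NSTREAMS 16      /* assumed; tune to the device */
    #define NBATCH   128
    #define ROWS     2048
    #define COLS     8

    /* Issue the 128 independent gemv's on a pool of streams so that the
       small kernels can overlap instead of serializing.                 */
    void streamed_gemv(cublasHandle_t handle,
                       const float *d_matrix, const float *d_vector,
                       float *d_out, float alpha, float beta)
    {
        cudaStream_t streams[NSTREAMS];
        for (int s = 0; s < NSTREAMS; ++s)
            cudaStreamCreate(&streams[s]);

        for (int i = 0; i < NBATCH; ++i) {
            cublasSetStream(handle, streams[i % NSTREAMS]);
            cublasSgemv(handle, CUBLAS_OP_N, COLS, ROWS, &alpha,
                        d_matrix + (size_t)i * ROWS * COLS, COLS,
                        d_vector + (size_t)i * ROWS, 1,
                        &beta, d_out + (size_t)i * COLS, 1);
        }

        cudaDeviceSynchronize();
        for (int s = 0; s < NSTREAMS; ++s)
            cudaStreamDestroy(streams[s]);
    }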

I hope these partial results provide you with some useful information.

Licensed under: CC-BY-SA with attribution
Not affiliated with StackOverflow