Question

I'm implementing an algorithm that, in essence, is a series of matrix-matrix multiplications like this:

Res = M1.M2.M3. ... .Mn

My matrices are really small (100x100, single-precision floats), but the sequence is really long, on the order of billions.

I tried using CUBLAS to do the matrix multiplications, but it was slow. However, I did notice something interesting.

Multiplying a 100x100 matrix with a 100x100 matrix was slow, but multiplying a 1,000,000x100 matrix with a 100x100 matrix was relatively fast. This made me think: what if, instead of doing one scan from left to right, I did 10,000 scans in parallel? That should be pretty fast, and if I multiplied the partial results together when done, I would get the same answer -- just faster:

Res1    = M1 . M2 . ... . M(n/1000)
Res2    = M(n/1000 + 1) . M(n/1000 + 2) . ... . M(2n/1000)
...
Res1000 = M(999n/1000 + 1) . M(999n/1000 + 2) . ... . Mn
Res     = Res1 . Res2 . ... . Res1000

It's worth noting that M_1 ... M_n are drawn from a set of about 100 different matrices, so space consumption isn't really a problem; all I need is to be able to do multiple multiplications in one operation.

Now here is my problem. I've written a matrix-matrix (SGEMM) implementation inspired by the one NVIDIA demonstrates in their documentation, but it is about 4 times slower than CUBLAS. Does anyone know how CUBLAS works? And is the code available somewhere?


Solution

Have you looked at the latest CUBLAS (version 4.1)? It includes a new batched GEMM mode specifically intended for large batches of small matrix-matrix multiplies. I would suggest doing a pairwise multiplication tree as Jonathan Dursi suggested in his answer, using the CUBLAS batched API to accelerate it, rather than writing your own custom kernel as he suggests.

CUBLAS 4.1 is included with the CUDA Toolkit v4.1.

CUBLAS batched GEMM API improves performance of batches of small matrices
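
For concreteness, here is a rough sketch (not part of the original answer) of how one level of that pairwise tree could be driven with the batched API (cublasSgemmBatched). The names reduce_level, d_in and d_out are made up, the matrices are assumed to be n x n, column-major and stored back to back, and error checking is omitted.

#include <cublas_v2.h>
#include <cuda_runtime.h>
#include <vector>

/* One reduction level: out[p] = in[2p] * in[2p+1] for p = 0..npairs-1. */
void reduce_level(cublasHandle_t handle, float* d_in, float* d_out,
                  int n, int npairs)
{
    /* Build the per-matrix pointer arrays on the host... */
    std::vector<float*> hA(npairs), hB(npairs), hC(npairs);
    for (int p = 0; p < npairs; ++p) {
        hA[p] = d_in  + (size_t)(2 * p)     * n * n;
        hB[p] = d_in  + (size_t)(2 * p + 1) * n * n;
        hC[p] = d_out + (size_t)p           * n * n;
    }

    /* ...and copy them to the device; the batched API expects device-resident
       arrays of device pointers. */
    float **dA, **dB, **dC;
    cudaMalloc((void**)&dA, npairs * sizeof(float*));
    cudaMalloc((void**)&dB, npairs * sizeof(float*));
    cudaMalloc((void**)&dC, npairs * sizeof(float*));
    cudaMemcpy(dA, hA.data(), npairs * sizeof(float*), cudaMemcpyHostToDevice);
    cudaMemcpy(dB, hB.data(), npairs * sizeof(float*), cudaMemcpyHostToDevice);
    cudaMemcpy(dC, hC.data(), npairs * sizeof(float*), cudaMemcpyHostToDevice);

    const float alpha = 1.0f, beta = 0.0f;
    cublasSgemmBatched(handle, CUBLAS_OP_N, CUBLAS_OP_N,
                       n, n, n, &alpha,
                       (const float**)dA, n,
                       (const float**)dB, n,
                       &beta, dC, n, npairs);

    cudaFree(dA); cudaFree(dB); cudaFree(dC);
}

Calling this log2(N) times, halving npairs and swapping d_in/d_out between levels, gives the full chain product. Later toolkits also add cublasSgemmStridedBatched, which avoids building the pointer arrays when the matrices sit at a fixed stride.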

OTHER TIPS

The problem is that CUBLAS and the like are designed for using all of the SMs to multiply large matrices. That's not what you want; you want to do lots of little matrix multiplications.

There may be some way to cast this into something CUBLAS could do well for you, but I'm not seeing it. My suggestion would be the following:

Write a kernel that uses one thread block to multiply two of your small matrices, and output the result.

Then launch the kernel log2(N) times with tonnes of blocks, tackling the multiplication pairwise:

  • Step 1: multiply M1 x M2, M3 x M4, ..., M(N-1) x M(N), outputting M'1, M'2, ..., M'(N/2)
  • Step 2: multiply M'1 x M'2, M'3 x M'4, ..., M'(N/2-1) x M'(N/2), outputting M''1, M''2, ..., M''(N/4)

etc.

There'll be a factor of 50% memory overhead, but I think you'll make better use of your cores this way.
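
If you go the custom-kernel route, a minimal (untuned) version of that per-level kernel might look like the sketch below: one thread block per pair, each thread striding over the output entries. The kernel name and layout (n x n row-major matrices stored back to back) are assumptions for illustration; a serious version would stage tiles of A and B through shared memory.

__global__ void pairwise_multiply(const float* __restrict__ in,
                                  float* __restrict__ out, int n)
{
    /* Block b computes C = A * B with A = in[2b], B = in[2b+1], C = out[b]. */
    const float* A = in  + (size_t)(2 * blockIdx.x)     * n * n;
    const float* B = in  + (size_t)(2 * blockIdx.x + 1) * n * n;
    float*       C = out + (size_t)blockIdx.x           * n * n;

    /* Threads stride over the n*n entries of the output matrix. */
    for (int idx = threadIdx.x; idx < n * n; idx += blockDim.x) {
        const int row = idx / n, col = idx % n;
        float acc = 0.0f;
        for (int k = 0; k < n; ++k)
            acc += A[row * n + k] * B[k * n + col];
        C[idx] = acc;
    }
}

/* Host side, per level: pairwise_multiply<<<npairs, 256>>>(d_in, d_out, 100);
   then swap d_in and d_out and halve npairs for the next level. */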

Update

OK, if you really don't want to do this in stages, you could do it this way, but it'll require more coding, and performance will probably be worse than what you could get with something like CUBLAS and asynchronous transfers. I'm assuming you're on a Fermi-class GPU and have chosen the 48K shared memory / 16K L1 configuration, so you have 48K of shared memory per SM.

Store the 100 matrices in 2x2-block form, with each block contiguous in memory, so that block (i,j) of matrix matrixnum starts at matrices[matrixnum*100*100 + i*100*50 + j*50*50]. Note that each block is 50*50*4 bytes ~ 10K, so four of them fit comfortably in shared memory.
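
Just to make that layout concrete, here is a hypothetical indexing helper (the function name and signature are mine, not part of the answer):

__host__ __device__ inline const float* block_ptr(const float* matrices,
                                                  int matrixnum, int i, int j)
{
    /* Block (i, j) of matrix 'matrixnum', with i and j in {0, 1}; each
       50x50 block (50*50 floats) is stored contiguously. */
    return matrices + matrixnum * 100 * 100 + i * 100 * 50 + j * 50 * 50;
}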

Assign each group of 4 threadblocks an (Nmatrices/Nblocks)-long chain of the matrices to multiply, with one of the four being responsible for each block of the multiplication.

Let's say you're threadblock 1 of the 4 and the first of the matrices you're to multiply is A x B. You're responsible for block (1,1) of the result: (AB)1,1 = A1,1*B1,1 + A1,2*B2,1. You pre-load A1,1 into myblock[0] in shared memory.

  • load in myblock[1] = B1,1 from global memory
  • myblock[3] = myblock[0] * myblock[1] (matrix mult, all in shared memory)
  • load in myblock[1] = A1,2 from global
  • load in myblock[2] = B2,1 from global
  • myblock[0] = myblock[3] + (myblock[1] * myblock[2]) (matrix mult and addition, all in shared memory).

Now you can repeat this for the rest of the sequence of matrices in your part of the chain, outputting only when done.

When you're done, you'll end up with (#SMs) matrices in global memory, which still have to be multiplied together, but there won't be any additional temporary storage in global memory, and you won't have had to copy anything into global memory other than the original matrices and the list of which ones to tackle.

Again, there's no real reason to do this except that you can't be bothered to ship data to the GPU in stages, and performance will almost certainly be worse; there are fewer global memory writes, but you'll probably pay for that with a hand-rolled GEMM. The good news is that 50 isn't a multiple of 8, so you probably won't have too many shared memory bank conflicts.

Again, for bonus points, you can precompute all the blocks of all pairwise matrix products first and then halve the length of your list.
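
A minimal sketch of that precomputation idea, under the question's assumption of roughly 100 distinct matrices and ignoring the 2x2-block storage for simplicity: compute every ordered pairwise product once, then replace each consecutive pair in the chain by a single lookup, halving the chain length. The names NMAT, DIM, sgemm_100x100 and precompute_pairs are placeholders; sgemm_100x100 stands in for any 100x100 SGEMM (CUBLAS, the batched call above, or a CPU GEMM).

enum { NMAT = 100, DIM = 100 };

void sgemm_100x100(float* C, const float* A, const float* B); /* any GEMM */

/* products is a flat array of NMAT*NMAT matrices of DIM*DIM floats each
   (about 400 MB for 100 matrices of 100x100);
   products + ((size_t)a * NMAT + b) * DIM * DIM holds M[a] * M[b]. */
void precompute_pairs(float* products, const float* const* M)
{
    for (int a = 0; a < NMAT; ++a)
        for (int b = 0; b < NMAT; ++b)
            sgemm_100x100(products + ((size_t)a * NMAT + b) * DIM * DIM,
                          M[a], M[b]);
}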

LIBXSMM, a library targeting Intel architectures for small, dense or sparse matrix multiplications and small convolutions, is meant exactly to exploit the best performance for small matrix multiplications.

In contrast to NVIDIA CUBLAS (or Intel MKL), LIBXSMM does not rely on a batch interface. Instead, one can arrange individual calls and also supply "next locations", i.e., where the operands/matrices of the next multiplications are located (in memory). The advantage is that an explicit data structure or index format describing the batch is not needed.

#include <libxsmm.h>
#include <stdlib.h>

int main()
{
  const libxsmm_gemm_prefetch_type prefetch = LIBXSMM_PREFETCH_AUTO;
  const double alpha = 1.0, beta = 1.0; /* accumulate C-matrix */
  const int m = 23, n = 23, k = 23;     /* some problem size */
  const int nbatch = 1000;              /* hypothetical batch size */
  const int asize = m * k, bsize = k * n, csize = m * n;
  /* input batches (left uninitialized here for brevity) */
  double *const a = malloc(sizeof(double) * asize * nbatch);
  double *const b = malloc(sizeof(double) * bsize * nbatch);
  /* the single, accumulated C-matrix (zero-initialized) */
  double *const c = calloc(csize, sizeof(double));
  libxsmm_dmmfunction xmm = NULL;       /* function pointer */

  /* JIT-dispatch a kernel specialized for this problem size */
  xmm = libxsmm_dmmdispatch(m, n, k,
          NULL/*lda*/, NULL/*ldb*/, NULL/*ldc*/,
          &alpha, &beta, NULL/*flags*/,
          NULL/*&prefetch*/);

  if (xmm) { /* JIT'ted code has been generated */
    /* note: with beta = 1 every iteration accumulates into the same C, so the
       iterations are not independent; drop the pragma (or give each thread
       its own C) when OpenMP is enabled */
#   pragma omp parallel for
    for (int i = 0; i < nbatch; ++i) {
      const double *const ai = a + i * asize;
      const double *const bi = b + i * bsize;
      /* e.g., matrix C is accumulated (instead of streamed) */
      double *const ci = c /*+ i * csize*/;
      /* optionally provide "next locations" for prefetching */
      xmm(ai, bi, ci/*,
          ai + 1 * asize,
          bi + 1 * bsize,
          ci + 0 * csize
      */);
    }
  }

  free(a); free(b); free(c);
  return 0;
}

LIBXSMM produces highly optimized and specialized code (JIT), which exploits the latest instruction set extensions (SSE3, AVX, AVX2, and AVX-512). LIBXSMM is available under a permissive license (BSD 3-clause).

NOTE: This is not about CUBLAS (as originally asked for).

Licensed under: CC-BY-SA with attribution
Not affiliated with StackOverflow