Question

The situation is the following: I have a number (1000s) of elements which are given by small matrices of dimensions 4x2, 9x3 ... you get the idea. All matrices have the same dimensions.

I want to multiply each of these matrices with a fixed vector of precalculated values. In short:

for(i = 1...n)
    X[i] = M[i] . N;

What is the best approach to do this in parallel using Thrust? How do I lay out my data in memory?

NB: There might be specialized, more suitable libraries to do this on GPUs. I'm interested in Thrust because it allows me to deploy to different backends, not just CUDA.

Solution

One possible approach:

  1. Flatten the arrays (matrices) into a single data vector. This is an advantageous step for enabling general Thrust processing anyway.

  2. Use a strided range mechanism to extend your scaling vector to the overall length of your flattened data vector.

  3. Use thrust::transform with thrust::multiplies to multiply the two vectors together.

If you need to access the matrices later out of your flattened data vector (or result vector), you can do so with pointer arithmetic, or a combination of fancy iterators.
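As an illustration of the pointer-arithmetic approach, here is a minimal sketch (the get_matrix helper and its parameters are mine, not part of the original answer): since the matrices are stored back to back in the flattened vector, matrix i starts at offset i*rows*cols and can be copied back to the host with thrust::copy_n:

#include <thrust/device_vector.h>
#include <thrust/host_vector.h>
#include <thrust/copy.h>

// hypothetical helper: copy the i-th matrix out of a flattened device vector,
// assuming matrices of rows*cols elements stored back to back
thrust::host_vector<int> get_matrix(const thrust::device_vector<int> &flat,
                                    int i, int rows, int cols)
{
  thrust::host_vector<int> m(rows*cols);
  thrust::copy_n(flat.begin() + i*rows*cols, rows*cols, m.begin());
  return m;
}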

If you need to re-use the extended scaling vector, you may want to create it explicitly using the method outlined in step 2 (i.e. build an actual vector covering all N matrices, with the scaling values repeated). If you are only doing this once, you can achieve the same effect with a counting iterator, followed by a transform iterator (taking the index modulo the length of one matrix in elements), followed by a permutation iterator that indexes into your original scaling vector (length = 1 matrix).

The following example implements the above, without using the strided range iterator method:

#include <iostream>
#include <stdlib.h>
#include <thrust/device_vector.h>
#include <thrust/host_vector.h>
#include <thrust/functional.h>
#include <thrust/iterator/permutation_iterator.h>
#include <thrust/iterator/counting_iterator.h>
#include <thrust/iterator/transform_iterator.h>
#include <thrust/transform.h>

#define N_MAT 1000
#define H_MAT 4
#define W_MAT 3
#define RANGE 1024

// maps a flat index into the data vector to the corresponding index
// within a single matrix (and thus into the scaling vector)
struct my_modulo_functor : public thrust::unary_function<int, int>
{
  __host__ __device__
  int operator() (int idx) const {
    return idx%(H_MAT*W_MAT);}
};

int main(){

  thrust::host_vector<int> data(N_MAT*H_MAT*W_MAT);
  thrust::host_vector<int> scale(H_MAT*W_MAT);
  // synthetic data; in practice, flatten/copy your matrices into the data vector
  for (int i = 0; i < N_MAT*H_MAT*W_MAT; i++) data[i] = rand()%RANGE;
  for (int i = 0; i < H_MAT*W_MAT; i++) scale[i] = rand()%RANGE;

  thrust::device_vector<int> d_data = data;
  thrust::device_vector<int> d_scale = scale;
  thrust::device_vector<int> d_result(N_MAT*H_MAT*W_MAT);

  // counting -> transform (modulo) -> permutation iterator chain tiles the
  // scaling vector across the full length of the flattened data vector
  thrust::transform(d_data.begin(), d_data.end(),
                    thrust::make_permutation_iterator(d_scale.begin(),
                      thrust::make_transform_iterator(thrust::counting_iterator<int>(0),
                                                      my_modulo_functor())),
                    d_result.begin(), thrust::multiplies<int>());

  thrust::host_vector<int> result = d_result;

  // verify against a host-side reference computation
  for (int i = 0; i < N_MAT*H_MAT*W_MAT; i++)
    if (result[i] != data[i] * scale[i%(H_MAT*W_MAT)]) {
      std::cout << "Mismatch at: " << i
                << " cpu result: " << (data[i] * scale[i%(H_MAT*W_MAT)])
                << " gpu result: " << result[i] << std::endl;
      return 1;}
  std::cout << "Success!" << std::endl;
  return 0;
}

EDIT: Responding to a question below:

The benefit of fancy iterators (i.e. transform(numbers, iterator)) is that they often allow for the elimination of extra data copies/data movement, as compared to assembling the other numbers first (which requires extra steps and data movement) and then passing them to transform(numbers, other numbers). If you're only going to use the other numbers once, then the fancy iterators will generally be better. If you're going to use the other numbers again, then you may want to assemble them explicitly. This presentation is instructive, in particular the "Fusion" section.

For a one-time use of the other numbers, the overhead of assembling them on the fly using fancy iterators and the functor is generally lower than explicitly creating a new vector and then passing that new vector to the transform routine.
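
As an illustration of the "assemble explicitly and reuse" option, here is a minimal sketch (reusing N_MAT, H_MAT, W_MAT, my_modulo_functor, d_scale, d_data and d_result from the example above; the d_scale_full vector is mine, not part of the original answer):

// Sketch only: materialize the tiled scaling vector once...
thrust::device_vector<int> d_scale_full(N_MAT*H_MAT*W_MAT);
thrust::copy_n(thrust::make_permutation_iterator(d_scale.begin(),
                 thrust::make_transform_iterator(thrust::counting_iterator<int>(0),
                                                 my_modulo_functor())),
               N_MAT*H_MAT*W_MAT, d_scale_full.begin());

// ...then reuse it in as many transform calls as needed:
thrust::transform(d_data.begin(), d_data.end(), d_scale_full.begin(),
                  d_result.begin(), thrust::multiplies<int>());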

Other tips

If you are looking for a software library that is specifically made for multiplying small matrices, you may have a look at https://github.com/hfp/libxsmm. The code below requests a specialized matrix kernel according to the typical GEMM parameters (please note that some limitations apply).

#include <libxsmm.h>

double alpha = 1, beta = 1;
const char transa = 'N', transb = 'N';
int flags = LIBXSMM_GEMM_FLAGS(transa, transb);
int prefetch = LIBXSMM_PREFETCH_AUTO;
libxsmm_blasint m = 23, n = 23, k = 23;
libxsmm_dmmfunction xmm = NULL;

/* JIT-generate (or look up) a double-precision kernel for these dimensions */
xmm = libxsmm_dmmdispatch(m, n, k,
  &m/*lda*/, &k/*ldb*/, &m/*ldc*/,
  &alpha, &beta, &flags, &prefetch);

Given the above code, one can proceed to run "xmm" for an entire series of (small) matrices without any particular data structure (the code below also supplies "prefetch locations").

if (0 < n) { /* check that n is at least 1 */
  #pragma omp parallel for private(i)
  for (i = 0; i < (n - 1); ++i) {
    const double *const ai = a + i * asize;
    const double *const bi = b + i * bsize;
    double *const ci = c + i * csize;
    xmm(ai, bi, ci, ai + asize, bi + bsize, ci + csize);
  }
  xmm(a + (n - 1) * asize, b + (n - 1) * bsize, c + (n - 1) * csize,
  /* pseudo prefetch for last element of batch (avoids page fault) */
      a + (n - 1) * asize, b + (n - 1) * bsize, c + (n - 1) * csize);
}
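
For context, here is a sketch of the kind of setup the loop above assumes (illustrative only, not part of the original answer; note that n in the loop denotes the batch size, whereas n in the dispatch call is a GEMM dimension):

/* Illustrative setup: a batch of n small column-major matrices stored back
   to back, with element counts matching the 23x23x23 dispatch call above
   (lda = ldc = m, ldb = k). */
const int asize = 23 * 23;  /* m * k elements per A matrix */
const int bsize = 23 * 23;  /* k * n elements per B matrix */
const int csize = 23 * 23;  /* m * n elements per C matrix */
int i, n = 1000;            /* batch size */
double *a = (double*)malloc(sizeof(double) * asize * n);
double *b = (double*)malloc(sizeof(double) * bsize * n);
double *c = (double*)malloc(sizeof(double) * csize * n);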

In addition to the manual loop control as shown above, libxsmm_gemm_batch (or libxsmm_gemm_batch_omp) can be used (see ReadTheDocs). The latter is useful if data structures exist that describe the series of operands (A, B, and C matrices).

There are two reasons why this library gives superior performance: (1) on-the-fly code specialization using an in-memory code generation technique, and (2) loading the next matrix operands while calculating the current product.

(If you are looking for something that blends well with C/C++, this library supports it. However, it does not target CUDA/Thrust.)
