Multiply one fixed matrix by a huge number of vectors

Question 1

You can pack your vectors into a matrix. 200 * 10^7 values (about 16 GB in double precision) may be too much to hold in memory at once, depending on your system, so you can split it into chunks. Then use any code that is optimized for matrix-matrix multiplication, such as BLAS. There are many implementations for CPUs, GPUs (cuBLAS, MAGMA, ...), multicore machines (PLASMA, ...), or distributed memory. Since you will be working with large matrices, you will get a better speedup than by doing matrix-vector multiplications one at a time.
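As a rough illustration of the chunking idea, here is a minimal sketch in plain C. The names `multiply_chunk` and `multiply_all`, the row-major layout (one vector per row of `V`), and the convention result = M * vector are all assumptions made for this example, not something stated in the question. The inner routine is a naive stand-in; in practice it is the part you would hand to an optimized GEMM.

```c
#include <stddef.h>

/* Naive reference kernel (hypothetical helper): for each of the k vectors
 * stored as rows of V (k x n, row-major), compute R[v] = M * V[v], where
 * M is n x n, row-major. In matrix terms this is R = V * M^T, so with a
 * CBLAS implementation it would presumably become a single call along the
 * lines of:
 *   cblas_dgemm(CblasRowMajor, CblasNoTrans, CblasTrans,
 *               k, n, n, 1.0, V, n, M, n, 0.0, R, n);
 */
void multiply_chunk(const double *M, const double *V, double *R,
                    size_t n, size_t k)
{
    for (size_t v = 0; v < k; ++v) {
        for (size_t i = 0; i < n; ++i) {
            double sum = 0.0;
            for (size_t j = 0; j < n; ++j)
                sum += M[i * n + j] * V[v * n + j];
            R[v * n + i] = sum;
        }
    }
}

/* Process `total` vectors in chunks of at most `chunk` vectors, so only one
 * chunk needs to be in fast memory (or on the GPU) at a time. */
void multiply_all(const double *M, const double *V, double *R,
                  size_t n, size_t total, size_t chunk)
{
    for (size_t start = 0; start < total; start += chunk) {
        size_t k = (total - start < chunk) ? (total - start) : chunk;
        multiply_chunk(M, V + start * n, R + start * n, n, k);
    }
}
```

Storing one vector per row keeps every chunk contiguous in memory, which is why the chunk can be passed on as a plain pointer offset. The chunk size is a tuning parameter that depends on available memory.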

Question 2

You're going to multiply 10 million big vectors by a huge matrix that is the same for all of them. It would be fastest if all possible decision-making could be compiled-out ahead of time. In other words, there are lots of index calculations and loop testing that would be identically repeated millions of times. This sounds like a perfect case for pre-compilation:

Write a small program that would take as input your 200x200 matrix data values, and have it print out a piece of program text defining a function capable of inputting the input vector and outputting the result vector. It could look something like this:

void multTheMatrixByTheVector(double a[200], double b[200]){
  b[0] = 0
    + a[0] * <a constant, the value of mat[0][0]>
    + a[1] * <a constant, the value of mat[1][0]>
    ...
    + a[199] * <a constant, the value of mat[199][0]>
    ;
  b[1] = 0
    + a[0] * <a constant, the value of mat[0][1]>
    + a[1] * <a constant, the value of mat[1][1]>
    ...
    + a[199] * <a constant, the value of mat[199][1]>
    ;
  ...
  b[199] = etc. etc.
}

You see, that function will be around 40000 lines long, but a decent compiler should be able to handle it. Of course, if any of the matrix elements are zero, i.e. there's some sparsity, you can omit those lines (or let the compiler optimizer do it). To do this on CUDA or vectorized instructions, you'd have to modify it accordingly, but that should be do-able.
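The generator itself can be quite short. The following is a minimal sketch, assuming the matrix is already in memory as a row-major array of finite doubles; the name `emit_multiply_function` is made up for this example. It writes the function text in the shape shown above and skips zero entries.

```c
#include <stdio.h>

/* Hypothetical generator: writes C source for a function that computes
 *   b[j] = sum over i of a[i] * mat[i][j]
 * with every matrix value baked in as a literal constant.
 * mat is n x n, row-major, and is assumed to hold finite values only
 * (inf or nan would not print as valid C literals). */
void emit_multiply_function(FILE *out, const double *mat, int n)
{
    fprintf(out, "void multTheMatrixByTheVector(double a[%d], double b[%d]){\n",
            n, n);
    for (int j = 0; j < n; ++j) {
        fprintf(out, "  b[%d] = 0", j);
        for (int i = 0; i < n; ++i) {
            double m = mat[i * n + j];
            if (m != 0.0)  /* sparsity: omit terms that multiply by zero */
                fprintf(out, "\n    + a[%d] * %.17g", i, m);  /* 17 digits round-trip a double */
        }
        fprintf(out, "\n    ;\n");
    }
    fprintf(out, "}\n");
}
```

You would call it once with the output file opened for writing (or `stdout`), then compile the generated file together with your main program.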

When you include that function in your main program, it should be able to run about as fast as the machine can go. It's not wasting any cycles doing index calculations, loop testing, or multiplying by empty matrix cells.

Then if it takes 10 ns per multiply-and-add, my back-of-the-envelope estimate is 200 * 200 = 40000 multiply-adds, so about 400 usec per vector, or 4000 seconds for all 10 million vectors - a little over an hour.
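If you want to plug in your own numbers, the same estimate can be written as a tiny helper. This is only a sketch of the arithmetic above; the function name is invented, and the cost per multiply-add is whatever you measure or assume for your machine.

```c
/* Estimated total run time in seconds for num_vectors products with an
 * n x n matrix, assuming n * n multiply-adds per vector and a fixed cost
 * of ns_per_op nanoseconds per multiply-add. */
double estimate_total_seconds(double n, double num_vectors, double ns_per_op)
{
    double ops_per_vector = n * n;                               /* 200 * 200 = 40000 */
    double seconds_per_vector = ops_per_vector * ns_per_op * 1e-9;
    return seconds_per_vector * num_vectors;
}
```

With n = 200, 10^7 vectors, and 10 ns per operation this gives the 4000 seconds quoted above.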