Pregunta

I've got an algorithm that does tree steps of linear algebra over and over again,

loop{
  first I multiply a Vector and a Matrix, 
  Second I calculate the sum of elements in the Vector 
  and Thirdly I scale the vector using the sum, making sure the vectors elements scale to one.
}

I'm using BLAS to do the operations, and this is somewhat quick, but it takes tree runs over the data, one for each step. Now I'm wondering if there would be something to gain by combining the steps into one, there by only running over the data once.

Do anyone have any exprience on how to implement these calls in an optimal way, my matrix is about 100*100 and the vector is has 100 elements.

I'm thinking that the vector can fit into something like 8 128byte mmx registers. making the multiplications pretty fast, any ideas?

¿Fue útil?

Solución

An optimized BLAS library is very tricky code, and you're unlikely to do better unless you are an expert in asm programming and understand the cache performance of your CPU, and are willing to spend a lot of time on testing various approaches. If you want to see how it's done, you can download and look at the source code for GOTO BLAS (implemented in asm, yes).

I'm not sure how to do any substantial optimization of your code. I suspect that already at N=100, the O(N^2) of the matrix-vector product will dominate the runtime and the second and third steps in your algorithm are rather insignificant. So trying to combine all 3 steps doesn't look that useful.

I suppose one minor thing you can do, unless you're already doing it, is that in the 3rd step multiply by the reciprocal of the sum instead of dividing by the sum; division is a lot more expensive than multiplication. E.g.


double my_sum = sum(my_vector);
double tmp = 1 / my_sum;
for (i=...) {
   my_vector[i] *= tmp;
}

Licenciado bajo: CC-BY-SA con atribución
No afiliado a StackOverflow
scroll top