What is the best matrix multiplication algorithm? [closed]

https://stackoverflow.com/questions/4455645

10-10-2019

Question

What is the best matrix multiplication algorithm? What does "the best" mean for me? It means the fastest, and suited to today's machines.

Please give links to pseudocode if you can.

Solution

BLAS is the best ready-to-use, efficient matrix multiplication library. There are many different implementations. Here is a benchmark I made of some of them on a MacBook Pro with a dual-core 2.66 GHz Intel Core 2 Duo:

gotoBLAS2 (open-source) : https://www.tacc.utexas.edu/research-development/tacc-software/gotoblas2
ATLAS (open-source) : http://math-atlas.sourceforge.net/
Accelerate.framework (Apple) : http://developer.apple.com/performance/accelerateframework.html
a non-optimized, but portable, implementation that I called 'vanilla' (from the GSL)

[Figure: benchmark chart for the implementations listed above; image not preserved]

There are also other commercial implementations that I didn't test here :

MKL (Intel) : http://software.intel.com/en-us/articles/intel-mkl/
ACML (AMD) : http://developer.amd.com/cpu/Libraries/acml/Pages/default.aspx

Other answers

The best matrix multiplication algorithm is the one that someone with detailed architectural knowledge has already hand-tuned for your target platform.

There are lots of good libraries that supply tuned matrix-multiply implementations. Use one of them.

There are probably better ones, but these are the ones I've heard of (better than the standard cubic-complexity algorithm).

Strassen's - O(N^2.8)

Coppersmith Winograd - O(N^2.376)
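To make the Strassen bound concrete: the algorithm splits each matrix into four quadrants and needs only 7 recursive block products instead of 8, which gives O(N^log2(7)), roughly O(N^2.807). Below is a minimal sketch in pure Python, assuming square matrices whose size is a power of two. The helper names (`split`, `mat_add`, `mat_sub`, `naive_mul`) and the `cutoff` parameter are my own choices for illustration, and it is meant to show the structure, not to be fast.

```python
def mat_add(X, Y):
    """Element-wise sum of two equally sized matrices (lists of lists)."""
    return [[x + y for x, y in zip(rx, ry)] for rx, ry in zip(X, Y)]


def mat_sub(X, Y):
    """Element-wise difference of two equally sized matrices."""
    return [[x - y for x, y in zip(rx, ry)] for rx, ry in zip(X, Y)]


def naive_mul(A, B):
    """Standard cubic-time product, used as the recursion's base case."""
    inner, cols = len(B), len(B[0])
    return [[sum(row[k] * B[k][j] for k in range(inner)) for j in range(cols)]
            for row in A]


def split(M):
    """Return the four quadrants (top-left, top-right, bottom-left, bottom-right)."""
    h = len(M) // 2
    return ([r[:h] for r in M[:h]], [r[h:] for r in M[:h]],
            [r[:h] for r in M[h:]], [r[h:] for r in M[h:]])


def strassen(A, B, cutoff=2):
    """Multiply two n x n matrices; n is assumed to be a power of two."""
    n = len(A)
    if n <= cutoff:
        return naive_mul(A, B)

    A11, A12, A21, A22 = split(A)
    B11, B12, B21, B22 = split(B)

    # The seven block products that replace the usual eight.
    M1 = strassen(mat_add(A11, A22), mat_add(B11, B22), cutoff)
    M2 = strassen(mat_add(A21, A22), B11, cutoff)
    M3 = strassen(A11, mat_sub(B12, B22), cutoff)
    M4 = strassen(A22, mat_sub(B21, B11), cutoff)
    M5 = strassen(mat_add(A11, A12), B22, cutoff)
    M6 = strassen(mat_sub(A21, A11), mat_add(B11, B12), cutoff)
    M7 = strassen(mat_sub(A12, A22), mat_add(B21, B22), cutoff)

    # Recombine the products into the quadrants of the result.
    C11 = mat_add(mat_sub(mat_add(M1, M4), M5), M7)
    C12 = mat_add(M3, M5)
    C21 = mat_add(M2, M4)
    C22 = mat_add(mat_add(mat_sub(M1, M2), M3), M6)

    top = [left + right for left, right in zip(C11, C12)]
    bottom = [left + right for left, right in zip(C21, C22)]
    return top + bottom
```

In practice the cutoff would be set much higher, since the extra additions and copies make the recursion slower than the plain loop for small blocks.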

Why pseudocode? Why implement it yourself? If speed is your concern, there are highly optimized implementations available that include optimizations for specific instruction sets (e.g. SIMD). Implementing all of that yourself offers no real benefit (apart from maybe learning).

Take a look at different BLAS implementations, like:

http://www.netlib.org/blas/

http://math-atlas.sourceforge.net/

Here is MIT's algorithms course and its matrix multiplication lecture:

http://ocw.mit.edu/courses/electrical-engineering-and-computer-science/6-046j-introduction-to-algorithms-sma-5503-fall-2005/video-lectures/lecture-19-shortest-paths-iii-all-pairs-shortest-paths-matrix-multiplication-floyd-warshall-johnson/

Standard matrix multiplication - O(n^3)

Strassen’s algorithm - O(n^2.8) http://en.wikipedia.org/wiki/Strassen_algorithm

Coppersmith–Winograd - O(n^2.376) http://en.wikipedia.org/wiki/Coppersmith%E2%80%93Winograd_algorithm

Depends on the size of the matrix, and whether it's sparse or not.

For small-to-medium-sized dense matrices, I believe that some variation on the "naive" O(N^3) algorithm is a win, if you pay attention to cache locality and use the platform's vector instructions.
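One common cache-aware variation on the naive algorithm is loop tiling (blocking): the same O(N^3) arithmetic, but done one small tile at a time so the working set stays in cache. Here is a rough sketch of the loop structure. The function name and the default `block` size are assumptions of mine, and in pure Python you will not see the speedup; the point is the loop order, which carries over to C or Fortran.

```python
def blocked_matmul(A, B, block=32):
    """Tiled O(n^3) product of an n x m matrix A and an m x p matrix B."""
    n, m, p = len(A), len(B), len(B[0])
    C = [[0] * p for _ in range(n)]

    # Outer three loops walk over tiles; inner three loops work inside a tile.
    for ii in range(0, n, block):
        for kk in range(0, m, block):
            for jj in range(0, p, block):
                for i in range(ii, min(ii + block, n)):
                    row_a, row_c = A[i], C[i]
                    for k in range(kk, min(kk + block, m)):
                        a = row_a[k]
                        row_b = B[k]
                        # Innermost loop runs along rows of B and C,
                        # i.e. over contiguous data in a row-major layout.
                        for j in range(jj, min(jj + block, p)):
                            row_c[j] += a * row_b[j]
    return C
```

The best block size depends on the cache sizes of the target machine, so it is something to tune by measurement.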

Data arrangement is important -- for cases where your standard matrix layout is cache-unfriendly (e.g., column-major * row-major), you should try binary decomposition of your matrix multiplication -- even if you don't use Strassen's or other "fast" algorithms, this order of operations can yield a "cache-oblivious" algorithm that automatically makes good use of every level of cache. If you have the luxury to rearrange your matrices, you might try combining this with a bit-interleaved (or "Z-order") ordering of data elements.
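A sketch of both ideas, under the assumption of square matrices with power-of-two size: a recursive (divide-and-conquer) multiply that still does all 8 block products, so it is still O(N^3), plus a bit-interleaved Z-order index. All names here (`cache_oblivious_matmul`, `morton_index`, `to_z_order`, the `leaf` parameter) are mine, chosen for illustration.

```python
def _rec_mul(A, B, C, ai, aj, bi, bj, ci, cj, n, leaf):
    """C[ci:ci+n][cj:cj+n] += A[ai:ai+n][aj:aj+n] * B[bi:bi+n][bj:bj+n]."""
    if n <= leaf:
        for i in range(n):
            for k in range(n):
                a = A[ai + i][aj + k]
                for j in range(n):
                    C[ci + i][cj + j] += a * B[bi + k][bj + j]
        return
    h = n // 2
    # Eight half-size products: C block (di, dj) += A block (di, dk) * B block (dk, dj).
    for di in (0, h):
        for dj in (0, h):
            for dk in (0, h):
                _rec_mul(A, B, C, ai + di, aj + dk, bi + dk, bj + dj,
                         ci + di, cj + dj, h, leaf)


def cache_oblivious_matmul(A, B, leaf=16):
    """Recursive product of two n x n matrices; n is assumed to be a power of two."""
    n = len(A)
    C = [[0] * n for _ in range(n)]
    _rec_mul(A, B, C, 0, 0, 0, 0, 0, 0, n, leaf)
    return C


def morton_index(row, col, bits=16):
    """Interleave the bits of row and col (row bits in the odd positions)."""
    z = 0
    for b in range(bits):
        z |= ((col >> b) & 1) << (2 * b)
        z |= ((row >> b) & 1) << (2 * b + 1)
    return z


def to_z_order(M):
    """Flatten a power-of-two square matrix so each quadrant is contiguous."""
    n = len(M)
    flat = [None] * (n * n)
    for i in range(n):
        for j in range(n):
            flat[morton_index(i, j)] = M[i][j]
    return flat
```

With the Z-order layout, every quadrant the recursion visits occupies one contiguous range of the flat array, which is why the two techniques pair well.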

Finally, remember: premature optimization is the root of all evil. And when it's not premature any more, always profile and benchmark before, during, and after optimizing.
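For the benchmarking part, a small harness around the standard-library `timeit` module is often enough to compare candidates. This is only a sketch: `best_time` and `naive_matmul` are hypothetical helpers of mine, and the matrix size is arbitrary.

```python
import timeit


def naive_matmul(A, B):
    """Plain triple-loop product, used here only as something to time."""
    inner, cols = len(B), len(B[0])
    return [[sum(row[k] * B[k][j] for k in range(inner)) for j in range(cols)]
            for row in A]


def best_time(fn, *args, repeat=5):
    """Run fn(*args) `repeat` times and return the fastest wall-clock time in seconds."""
    return min(timeit.repeat(lambda: fn(*args), repeat=repeat, number=1))


if __name__ == "__main__":
    n = 64
    A = [[float(i + j) for j in range(n)] for i in range(n)]
    B = [[float(i - j) for j in range(n)] for i in range(n)]
    print(f"naive {n}x{n}: {best_time(naive_matmul, A, B):.4f} s")
```

Taking the minimum over several runs filters out most noise from other processes; swap in each implementation you want to compare and keep the inputs identical.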

There is also Cannon's algorithm, a distributed matrix multiplication algorithm.
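To give a feel for how Cannon's algorithm moves data, here is a single-process simulation, assuming a q x q grid of "processors" that each hold one element of A, B and C. A real implementation would hold a block per processor and do the shifts with message passing (for example MPI); the function name is mine.

```python
def cannon_matmul(A, B):
    """Simulate Cannon's algorithm for two q x q matrices on a q x q grid."""
    q = len(A)

    # Initial skew: row i of A is rotated left by i, column j of B is rotated up by j.
    a = [[A[i][(j + i) % q] for j in range(q)] for i in range(q)]
    b = [[B[(i + j) % q][j] for j in range(q)] for i in range(q)]
    c = [[0] * q for _ in range(q)]

    for _ in range(q):
        # Each processor multiplies the pair of values it currently holds.
        for i in range(q):
            for j in range(q):
                c[i][j] += a[i][j] * b[i][j]
        # Then every row of A rotates left by one and every column of B rotates up by one.
        a = [[a[i][(j + 1) % q] for j in range(q)] for i in range(q)]
        b = [[b[(i + 1) % q][j] for j in range(q)] for i in range(q)]

    return c
```

After the skew, processor (i, j) holds A[i][k] and B[k][j] for the same k at every step, and the q rotations cycle k through all its values, so each processor only ever talks to its grid neighbours.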

There is no "best algorithm" for all matrices on all modern CPUs.

You will need to do some research into the many methods available, and then find a best-fit solution to the particular problems you are calculating on the particular hardware you are dealing with.

For example, the "fastest" way on your hardware platform may be to use a "slow" algorithm but ask your GPU to apply it to 256 matrices in parallel. Or a "fast" general-purpose (m x n) algorithm may turn out much slower than an optimised 3x3 matrix multiply. If you really want it to be fast, you may want to consider getting down to the bare metal to make the best use of specific CPU features like SIMD instructions, branch prediction and caches, at the expense of portability.

Licensed under: CC-BY-SA with attribution

Not affiliated with StackOverflow