Question

I have a key algorithm that spends most of its runtime calculating a dense matrix product:

A*A'*Y, where: A is an m-by-n matrix, 
               A' is its conjugate transpose,
               Y is an m-by-k matrix

Typical characteristics:
    - k is much smaller than both m and n (k is typically < 10)
    - m in the range [500, 2000]
    - n in the range [100, 1000]

Based on these dimensions, according to the lessons of the matrix chain multiplication problem, it's clear that it's optimal in a number-of-operations sense to structure the computation as A*(A'*Y). My current implementation does that, and the performance boost from just forcing that associativity on the expression is noticeable.
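For reference, the current implementation is essentially this (a minimal sketch; the complex double scalar type and the function name compute are illustrative assumptions):

    #include <Eigen/Dense>

    // A is m-by-n, Y is m-by-k; the parentheses force evaluation as A * (A' * Y)
    Eigen::MatrixXcd compute(const Eigen::MatrixXcd& A, const Eigen::MatrixXcd& Y) {
        return A * (A.adjoint() * Y);
    }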

My application is written in C++ for the x86_64 platform. I'm using the Eigen linear algebra library, with Intel's Math Kernel Library as a backend. Eigen is able to use IMKL's BLAS interface to perform the multiplication, and the boost from moving from Eigen's native SSE2 implementation to Intel's optimized, AVX-based implementation on my Sandy Bridge machine is also significant.

However, the expression A * (A.adjoint() * Y) (written in Eigen parlance) gets decomposed into two general matrix-matrix products (calls to the xGEMM BLAS routine), with a temporary matrix created in between. I'm wondering if, by going to a specialized implementation for evaluating the entire expression at once, I can arrive at an implementation that is faster than the generic one that I have now. A couple of observations that lead me to believe this are:

  • Using the typical dimensions described above, the input matrix A usually won't fit in cache. Therefore, the specific memory access pattern used to calculate the three-matrix product would be key. Obviously, avoiding the creation of a temporary matrix for the partial product would also be advantageous.

  • A and its conjugate transpose obviously have a very related structure that could possibly be exploited to improve the memory access pattern for the overall expression.

Are there any standard techniques for implementing this sort of expression in a cache-friendly way? Most optimization techniques that I've found for matrix multiplication are for the standard A*B case, not larger expressions. I'm comfortable with the micro-optimization aspects of the problem, such as translating into the appropriate SIMD instruction sets, but I'm looking for any references out there for breaking this structure down in the most memory-friendly manner possible.

Edit: Based on the responses that have come in thus far, I think I was a bit unclear above. The fact that I'm using C++/Eigen is really just an implementation detail from my perspective on this problem. Eigen does a great job of implementing expression templates, but evaluating this type of problem as a simple expression just isn't supported (only products of 2 general dense matrices are).

At a higher level than how the expressions would be evaluated by a compiler, I'm looking for a more efficient mathematical breakdown of the composite multiplication operation, with a bent toward avoiding unneeded redundant memory accesses due to the common structure of A and its conjugate transpose. The result would likely be difficult to implement efficiently in pure Eigen, so I would likely just implement it in a specialized routine with SIMD intrinsics.


Solution

This is not a full answer (yet - and I'm not sure it will become one).

Let's think about the math a little first. Since matrix multiplication is associative, we can either do (A*A')*Y or A*(A'*Y).

Floating point operations for (A*A')*Y

2*m*n*m + 2*m*m*k //the twos come from addition and multiplication

Floating point operations for A*(A'*Y)

2*m*n*k + 2*m*n*k = 4*m*n*k

Since k is much smaller than m and n it's clear why the second case is much faster.
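To make this concrete, take hypothetical mid-range dimensions m = 1000, n = 500, k = 10 (illustrative values within the stated ranges):

(A*A')*Y: 2*1000*500*1000 + 2*1000*1000*10 = 1.02e9 floating point operations
A*(A'*Y): 4*1000*500*10 = 2.0e7 floating point operations

That's roughly a factor of 50 fewer operations for the second case.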

But by symmetry we could in principle cut the number of calculations for A*A' in half (though this might not be easy to do with SIMD; see the Eigen sketch after the derivation below), so we could reduce the number of floating point operations of (A*A')*Y to

m*n*m + 2*m*m*k.

We know that both m and n are larger than k. Let's replace m and n with a single new variable z and find out where case one and case two break even:

z*z*z + 2*z*z*k = 4*z*z*k  //divide both sides by z*z
z + 2*k = 4*k
z = 2*k.

So as long as m and n are both more than twice k, the second case will have fewer floating point operations. In your case m and n are both at least 100 and k is less than 10, so case two uses far fewer floating point operations.
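For reference, Eigen does expose that halved symmetric computation directly through a rank update on a self-adjoint view (with the MKL backend this should map to the xSYRK/xHERK BLAS routines). A sketch, not something you'd actually want here given the break-even analysis above:

    // Sketch: C = A * A' with half the flops; only the lower triangle is
    // computed and stored, the upper triangle of C stays zero.
    Eigen::MatrixXcd C = Eigen::MatrixXcd::Zero(A.rows(), A.rows());
    C.selfadjointView<Eigen::Lower>().rankUpdate(A);  // C += A * A.adjoint()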

In terms of efficient code: if the code is optimized for efficient use of the cache (as MKL and Eigen are), then large dense matrix multiplication is compute bound, not memory bound, so you don't have to worry about the cache. MKL is faster than Eigen because MKL uses AVX (and maybe FMA3 now?).

I don't think you will be able to do this more efficiently than you're already doing using the second case and MKL (through Eigen). Enable OpenMP to get maximum FLOPS.
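As a sketch of the knobs involved (the thread count of 4 is an illustrative placeholder for your physical core count; build with -fopenmp and link against MKL):

    #define EIGEN_USE_MKL_ALL   // route Eigen's dense products through MKL's BLAS
    #include <Eigen/Dense>
    #include <mkl.h>

    int main() {
        mkl_set_num_threads(4);  // hypothetical physical core count
        Eigen::setNbThreads(4);  // threads for any kernels Eigen runs itself
        // ... set up A and Y, then: result = A * (A.adjoint() * Y);
        return 0;
    }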

You should calculate the efficiency by comparing your measured FLOPS to the peak FLOPS of your processor. Assuming you have a Sandy Bridge/Ivy Bridge processor, the peak SP FLOPS is

frequency * number of physical cores * 8 (8-wide AVX SP) * 2 (addition + multiplication)

For double precision, divide by two. If you have Haswell and MKL uses FMA, then double the peak FLOPS. To get the frequency right you have to use the turbo boost value for all cores (it's lower than for a single core). You can look this up if you have not overclocked your system, or use CPU-Z on Windows or Powertop on Linux if you have an overclocked system.
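For example, a hypothetical 4-core Sandy Bridge with a 3.4 GHz all-core turbo (illustrative numbers, not a measurement) would give

3.4e9 * 4 * 8 * 2 = 217.6 GFLOPS peak SP (half that, 108.8 GFLOPS, for DP)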

OTHER TIPS

Use a temporary matrix to compute A'*Y, but make sure you tell Eigen that there's no aliasing going on: temp.noalias() = A.adjoint()*Y. Then compute your result, once again telling Eigen that objects aren't aliased: result.noalias() = A*temp.
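Put together (a minimal sketch; the complex double scalar type is an assumption):

    Eigen::MatrixXcd temp(A.cols(), Y.cols());
    temp.noalias() = A.adjoint() * Y;   // first xGEMM, skips the aliasing check
    Eigen::MatrixXcd result(A.rows(), Y.cols());
    result.noalias() = A * temp;        // second xGEMM, skips the aliasing check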

There would be redundant computation only if you performed (A*A')*Y, since in this case (A*A') is symmetric and only half of the computations are required. However, as you noticed, it is still much faster to perform A*(A'*Y), in which case there are no redundant computations. I confirm that the cost of the temporary creation is completely negligible here.

I would guess that performing the following

result = A * (A.adjoint() * Y)

will be the same as doing this:

temp = A.adjoint() * Y
result = A * temp;

If your matrix Y fits in the cache, you can probably take advantage of doing it like this

result = A * (Y.adjoint() * A).adjoint()

or, if the previous notation is not allowed, like this

temp = Y.adjoint() * A
result = A * temp.adjoint();

Then you don't need to take the adjoint of matrix A or store a temporary adjoint matrix for A, which would be much more expensive than the one for Y.

If your matrix Y fits in the cache, it should be much faster to run a loop over the columns of A for the first multiplication and then over the rows of A for the second multiplication (keeping Y.adjoint() in the cache for the first multiplication and temp.adjoint() for the second), but I guess that internally Eigen is already taking care of those things.
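To illustrate that idea concretely: A*(A'*Y) can be written as a sum of outer products, the sum over i of A.col(i) * (A.col(i)' * Y), so each column of A can be streamed through once per pass while Y and the m-by-k result stay cache-resident. A minimal Eigen sketch (assuming complex double matrices; this naive loop will usually lose to the two xGEMM calls unless you block and vectorize it yourself):

    #include <Eigen/Dense>

    // result = A * (A.adjoint() * Y), fused: for each column of A, a 1-by-k
    // row of A'*Y is formed and immediately consumed by a rank-1 update.
    void fused_product(const Eigen::MatrixXcd& A, const Eigen::MatrixXcd& Y,
                       Eigen::MatrixXcd& result) {
        result.setZero(A.rows(), Y.cols());                  // m-by-k
        for (Eigen::Index i = 0; i < A.cols(); ++i) {
            Eigen::RowVectorXcd t = A.col(i).adjoint() * Y;  // 1-by-k
            result.noalias() += A.col(i) * t;                // rank-1 update
        }
    }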

Licensed under: CC-BY-SA with attribution
Not affiliated with StackOverflow