Question

I have an n-by-n matrix A, a vector k of n coefficients (n-by-1), and a row vector called row (1-by-n). My goal is to subtract row, weighted by the ith coefficient of k, from the ith row of A. I have three questions: why does my for loop outperform the built-in matrix multiply; what, in general, explains each method's performance relative to the others; and is there a better method than the three I came up with?

% Define row & coefficients used in each method
k = (1:1000).';
row = 1:1000;

% Method 1 (matrix multiply) ~15 seconds
A = magic(1e3);
tic;
for z = 1:1e3
    A = A - k*row;
end
toc;
% Method 2 (for loop) ~11 seconds
B = magic(1e3);
tic;
for z = 1:1e3
    for cr = 1:1000
        B(cr,:) = B(cr,:) - k(cr)*row;
    end
end
toc;
% Method 3 (bsxfun) ~4 seconds
C = magic(1e3);
tic;
for z = 1:1e3
    C = C - bsxfun(@times, k, row);
end
toc;

isequal(A,B)
isequal(A,C)

Note: I perform these row subtractions inside a larger algorithm. I simplified the code into this toy test case, but the crux of the computation is still present. Also, to avoid confusion, the outer loop over z is only there to make the timings large enough to measure.

Solution

Short answer: a faster version of your code is this:

% Column-based loop (~2 seconds on my machine)
tic;
for z = 1:1e3
    for cr = 1:1000
        B(:,cr) = B(:,cr) - k*row(cr);
    end
end
toc;

You might want to have a look at my previous answer to this question. In short, MATLAB stores matrices in column-major order, so the elements of a column are contiguous in memory. Your original loop iterated over rows, so every row slice strides through memory, which is inefficient.
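To see the column-major effect in isolation, here is a minimal sketch (the variables D and v are just for illustration; absolute timings will vary by machine): updating row slices strides through memory, while updating column slices touches contiguous memory.

D = magic(1e3);
v = 1:1e3;
tic;                          % row-wise updates: each slice strides through memory
for r = 1:1e3
    D(r,:) = D(r,:) - v;
end
toc;
tic;                          % column-wise updates: each slice is contiguous
for c = 1:1e3
    D(:,c) = D(:,c) - v.';
end
toc;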

Execution times on my computer:

% A - k*row
Elapsed time is 4.370238 seconds.
% B(cr,:) = B(cr,:) - k(cr)*row;
Elapsed time is 9.537338 seconds.
% C = C - bsxfun(@times, k, row);
Elapsed time is 3.039836 seconds.
% B(:,cr) = B(:,cr) - k*row(cr);
Elapsed time is 2.028186 seconds.

Explanation. Your first version is not a matrix multiply but an outer product of two vectors, which results in a matrix of size 1000 x 1000. The whole computation is a BLAS level-2 rank-1 update (A = alpha*x*y' + A, the GER routine). The problem is most likely that MATLAB does not recognize it as such and instead runs the code as it understands it, i.e. explicitly executing all operations, including k*row. That is exactly the problem with this solution: the outer product has to allocate additional memory equal to the size of your matrix, which itself takes time. Consider what this version has to do:

  • memory allocation - since we do not know how MATLAB manages memory internally, this could mean a lot of overhead
  • reading the vectors k and row
  • writing the outer-product result matrix (call it KR)
  • reading KR back to subtract it from A
  • reading and writing A

Two 1000-by-1000 double matrices take 16 MB - I doubt you have that much cache. That is why this version is not the best, and it might actually be slower than the 'memory-inefficient' row loop, depending on the available memory bandwidth and CPU cache size.
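To confirm how large the intermediate actually is, here is a quick check with the same k and row as above (whos simply reports the memory held by the temporary):

KR = k*row;   % the explicit 1000-by-1000 temporary MATLAB has to build
whos KR       % reports 8000000 bytes, i.e. 8 MB for one double matrix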

There is no need to allocate the KR matrix and store its values explicitly in memory - you can compute the needed products inside a loop. Hence you effectively halve the memory bandwidth requirements and remove the memory allocation overhead (see the traffic tally after the following list)! The loop version only has to:

  • read vectors k, row
  • read and write A
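Here is the promised tally of per-iteration memory traffic for both versions - a rough sketch assuming 8-byte doubles and that the full matrix never stays in cache:

n     = 1e3;
sizeA = n*n*8;                 % one 1000-by-1000 double matrix: 8 MB
outer = 4*sizeA;               % k*row version: write KR, read KR back, read A, write A
loop  = 2*sizeA;               % column loop: read A, write A (k and row stay cached)
fprintf('outer product: %g MB/iter, column loop: %g MB/iter\n', outer/1e6, loop/1e6);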

Assuming that the vectors fit into the cache (which they do - 1000*8 bytes is not much), you read k and row from memory only once. Now the algorithm with the loop over columns makes perfect sense (this is likely how BLAS implements this computation):

  • read k and row and keep them in the cache
  • stream A through the CPU at full memory bandwidth, subtract the k*row(cr) products column by column, and stream it back to memory

Now for the final efficiency considerations. Take my loop structure: in every iteration we read and write A and read the vectors. That is 16 MB of data moved per iteration, and 1000 iterations give 16 GB of data moved in total. Two seconds to compute the result corresponds to an effective memory bandwidth of about 8 GB/s. My system has roughly 16 GB/s of STREAM bandwidth when using 2 CPU cores and around 11-12 GB/s when using one. Hence this sequential loop runs at 60-70% efficiency on one core. Not bad, considering this is a MATLAB loop :)
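The same arithmetic in code form, using the 2.03 s column-loop timing reported above (your numbers will differ):

n       = 1e3;
sizeA   = n*n*8;               % 8 MB per copy of the matrix
perIter = 2*sizeA;             % read A + write A once per outer iteration: 16 MB
total   = 1e3*perIter;         % 1000 outer iterations: 16 GB moved in total
t       = 2.028186;            % measured runtime of the column loop, in seconds
fprintf('effective bandwidth: %.1f GB/s\n', total/t/1e9);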

Note also that, at least on my computer, the column-based loop version is (a bit more than) twice as fast as the A - k*row implementation (2 s vs 4.37 s). This strongly indicates that k*row is indeed explicitly constructed by MATLAB, and that the total memory traffic is twice as large as in the looped version - hence roughly twice the runtime.

Edit: You can still try to do it as in the first version by calling the corresponding BLAS function directly. Have a look at this FEX contribution, which allows you to call BLAS and LAPACK functions directly from MATLAB. It might be useful.

Licensed under: CC-BY-SA with attribution