matlab code optimization - multiply all 3x3 matrices in a large cell array with each other

Question 1

As I commented above, the fastest method strongly depends on the properties of your matrices. For example, some algorithm can strongly benefit from the matrix being symmetric, but is rather slow if it is not.

So without further information, I can only make some general statements, and compare some methods on random matrices (which usually does not give a good comparison in the context of matrices inverses).

Depending on your MATLAB version (the JIT in R2011a or so greatly improved it), pre-allocating A and B gives you a large gain in loop performance; growing arrays dynamically inside a loop is usually very inefficient.

Along the same lines is the call to SpinConv: as this is an external function (MEX or m, doesn't matter), JIT cannot compile this loop, so you're limited by the speed of the interpreter. Which is rather low. If at all possible, you can avoid that by simply copy-pasting the relevant parts of SpinConv into the loop body. I know, it's hugely annoying (and I sure hope this is automated in future versions of MATLAB), but for now it is the only way to get JIT understand the loop structure and compile it (really, factors of 100 or more are not uncommon).

So, having said all that, I tested 2 different methods:

compute and store the LU decompositions of C and D, and re-use them in the loop
Solve the systems Cn \ [C{1} C{2} ... C{n-1}] for all n = 2:N, and re-order

In code:

clc
clear all

Nframe = 500;

%// Some random data
C = cellfun(@(~)rand(3), cell(Nframe,1), 'UniformOutput', false); 
D = cellfun(@(~)rand(3), cell(Nframe,1), 'UniformOutput', false);


%// Your original method 
tic

count  = 1;
for i=1:Nframe
    for j=1:Nframe
        if j <= i
            continue;
        else
            %// Main place to optimize (avoid looping?)!!
            %// In DCM representation (ie. each cell is 3x3 matrix)
            A{count,1} = C{j}\C{i}; %=inv(C{i+x})*C{i}
            B{count,1} = D{j}\D{i};

            count=count+1;
        end
    end
end

toc 
A1 = A;



%// First method: compute all LU decompositions and re-use them in the loop
%// ------------------------------------------------------------------------

tic

%// Compute LU decompositions of all C and D
Clu = cell(Nframe, 2);
Dlu = cell(Nframe, 2);
for ii = 1:Nframe
    [Clu{ii,1:2}]  = lu(C{ii});
    [Dlu{ii,1:2}]  = lu(D{ii});
end

%// improvement: pre-allocate A and B
A = cell(Nframe*(Nframe-1)/2, 1);
B = cell(Nframe*(Nframe-1)/2, 1);

%// improvement: don't use i and j as variable names
count  = 1;
for ii = 1:Nframe

    %// improvement: instead of continue if j<=i, just use different range
    for jj = ii+1 : Nframe

        %// mldivide for LU is equal to backwards substitution, which is
        %// trivial and thus fast
        A{count} = Clu{jj,2}\(Clu{jj,1}\C{ii});
        B{count} = Dlu{jj,2}\(Dlu{jj,1}\D{ii});

        count = count+1;

    end
end

toc
A2 = A;



%// Second method: solve all systems simultaneously by concatenation
%// ------------------------------------------------------------------------

tic

% Pre-allocate temporary matrices
Aa = cell(Nframe-1, 1);
Bb = cell(Nframe-1, 1);

for ii = 2:Nframe   
    % Solve Cn \ [C1 C2 C3 ... Cn]
    Aa{ii-1} = C{ii}\[C{1:ii-1}];
    Bb{ii-1} = D{ii}\[D{1:ii-1}];
end
toc

%// Compared to the order in which data is stored in one of the other
%// methods, the order of data in Aa and Bb is different. So, we have to
%// re-order to get the proper order back: 

tic

A = cell(Nframe*(Nframe-1)/2, 1);
B = cell(Nframe*(Nframe-1)/2, 1);
for ii = 1:Nframe-1

     A( (1:Nframe-ii) + (Nframe-1-(ii-2)/2)*(ii-1) ) = ...
         cellfun(@(x) x(:, (1:3) + 3*(ii-1)), Aa(ii:end), 'UniformOutput', false);

     B( (1:Nframe-ii) + (Nframe-1-(ii-2)/2)*(ii-1) ) = ...
         cellfun(@(x) x(:, (1:3) + 3*(ii-1)), Bb(ii:end), 'UniformOutput', false);
end

toc
A3 = A;

% Check validity of outputs
allEqual = all( cellfun(@(x,y,z)isequal(x,y)&&isequal(x,z), A1,A2,A3) )

Results:

Elapsed time is 44.867630 seconds.  %// your original method 
Elapsed time is 1.267333 seconds.   %// with LU decomposition
Elapsed time is 0.183950 seconds.   %// solving en-masse by concatenation
Elapsed time is 1.871149 seconds.   %// re-ordering the output of that

allEqual = 
    1

Note that I'm on R2010a, so that the slowness of your original method is primarily due to A and B not being pre-allocated. Note that performance on newer MATLAB versions will be better in this regard, but still, performance will be better if you pre-allocate.

Intuitively (and as others will probably suggest), you could compute the explicit inverses,

Cinv = cellfun(@inv, C, 'UniformOutput', false);

or even

Cinv = cellfun(@(x) [...
    x(5)*x(9)-x(8)*x(6)  x(7)*x(6)-x(4)*x(9)  x(4)*x(8)-x(7)*x(5) 
    x(8)*x(3)-x(2)*x(9)  x(1)*x(9)-x(7)*x(3)  x(7)*x(2)-x(1)*x(8)
    x(2)*x(6)-x(5)*x(3)  x(4)*x(3)-x(1)*x(6)  x(1)*x(5)-x(4)*x(2)] / ...
        (x(1)*x(5)*x(9) + x(4)*x(8)*x(3) + x(7)*x(2)*x(6) - ...
         x(7)*x(5)*x(3) - x(4)*x(2)*x(9) - x(1)*x(8)*x(6)),...
    C, 'UniformOutput', false);

(which will be faster and more accurate), and then simply multiply inside the loop. As you will see, this is significantly slower than both the en-masse solve Cn\[C1 C2 ... Cn-1] and LU (although that depends on the nature of the matrices). Also, it fails produce allEqual == true; sometimes the differences are small, but often (especially for near-singular matrices and other specials), the differences are huge.

As mentioned in a lot of other questions here on SO, and as any refined Google search or advanced Linear Algebra book will tell you, using explicit inverses in numerical applications will usually be slow, always be inaccurate, and sometimes even dangerous. The inverse is a very nice theoretical construct, but pretty much useless in any practical application of that theory. Therefore, it is better you use one of the other methods mentioned above.

In conclusion:

If you can live with the data out of order (which probably necessitates more complex indexing later on), solving the systems en-masse by concatenation is fastest by far. Granted, the way I re-order the data can be improved upon, but I suspect LU will still always be faster if you need to re-order.
If this is not the case, but your matrices are suited for LU decomposition, use that. To find out if this is the case, just use it on your real data and profile. You can also try the additional outputs of LU (most notably, the permutation matrix P, or for sparse matrices, the column reordering matrix Q).
Of course, if QR decomposition is more appropriate, use qr. Same for chol, or pcg, etc. Experiment with different methods a bit.

EDIT:

As you mention, all the matrices are SO(3) rotation matrices. Wow, is that ever important information!! In that case, the inverse is just the transpose, which is one or two orders of magnitude faster than any variant of the inverse. Also, you indicate that you want to convert those rotation matrices to axis-angle representation. The kernel should then be changed to something along the lines of

A = C{ii}.'*C{jj};
B = D{ii}.'*D{jj};

[V,lam] = eig(A);
axis  = V(:,imag(diag(lam))==0);
theta = acos(min(max(-1, (trace(A)-1)/2), 1));
At{count, :} = {axis theta*180/pi};

[V,lam] = eig(B);
axis  = V(:,imag(diag(lam))==0);
theta = acos(min(max(-1, (trace(B)-1)/2), 1));
Bt{count, :} = {axis theta*180/pi};

This uses only built-in functions, so this should be pretty efficient. At least it's better than copy-pasting SpinConv, since SpinConv uses a lot of non-builtin functions (null, isequalf, acosd, sind). Note that the method above uses the eigenvalue method; you can make it slightly more efficient if you use the determinant method used in the SpinConv function, provided you update it to have no calls to non-built-ins.

BEWARE: it appears that that version of SpinConv has an incorrect sign of the axis; the sign of the axis computed in the elseif is opposite that of the axis computed in the if.

Question 2

I would try computing and storing inv(C{j}) since C{j} appears in a matrix divide multiple times. Ditto D{j}. Or are your 3x3 matrices singular?

matlab code optimization - multiply all 3x3 matrices in a large cell array with each other

--------------------------------------------------------------------------------------------