Optimizing repetitive estimation (currently a loop) in MATLAB

Question 1

Results

I sped up your original version, since your edit 3 was actually not working (and also does something different).

So, on my PC:

Your (original) version: 8.428473 seconds.
My obfuscated one-liner given below: 0.964589 seconds.

First, for no other reason than to impress, I'll give it as I wrote it:

%%// Some example data
xdim = 500; 
ydim = 500; 
n_timepoints = 10; % for example
estimate = zeros(xdim,ydim); %// initialization with explicit size

picture = randn(xdim,ydim,n_timepoints);


%%// Your original solution
%// (slightly altered to make my version's results agree with yours)

tic

Y = randn(n_timepoints,xdim*ydim);
ii = 1;
for x = 1:xdim
    for y = 1:ydim

        X = squeeze(picture(x,y,:)); %// or similar creation of X matrix

        B = (X'*X)^(-1)*X' * Y(:,ii);
        ii = ii+1;

        %// sometimes you keep everything and do
        %// estimate(x,y,:) = B(:);
        %// sometimes just the first element is important and you do
        estimate(x,y) = B(1);

    end
end

toc


%%// My version 

tic

%// UNLEASH THE FURY!!
estimate2 = reshape(sparse(1:xdim*ydim*n_timepoints, ...
    builtin('_paren', ones(n_timepoints,1)*(1:xdim*ydim),:), ...
    builtin('_paren', permute(picture, [3 2 1]),:))\Y(:), ydim,xdim).';  %'

toc

%%// Check for equality

max(abs(estimate(:)-estimate2(:)))  % (always less than ~1e-14)

Breakdown

First, here's the version that you should actually use:

%// Construct sparse block-diagonal matrix
%// (Type "help sparse" for more information)
N  = xdim*ydim;      %// number of columns
M  = N*n_timepoints; %// number of rows
ii = 1:N;
jj = ones(n_timepoints,1)*(1:N);
s  = permute(picture, [3 2 1]);
X  = sparse(ii,jj(:), s(:));

%// Compute ALL the estimates at once
estimates = X\Y(:);

%// You loop through the *second* dimension first, so to make everything
%// agree, we have to extract elements in the "wrong" order, and transpose:
estimate2 = reshape(estimates, ydim,xdim).';  %'

Here's an example of what picture and the corresponding matrix X looks like for xdim = ydim = n_timepoints = 2:

>> clc, picture, full(X)

picture(:,:,1) =
   -0.5643   -2.0504
   -0.1656    0.4497
picture(:,:,2) =
    0.6397    0.7782
    0.5830   -0.3138

ans =
   -0.5643         0         0         0
    0.6397         0         0         0
         0   -2.0504         0         0
         0    0.7782         0         0
         0         0   -0.1656         0
         0         0    0.5830         0
         0         0         0    0.4497
         0         0         0   -0.3138

You can see why sparse is necessary -- it's mostly zeros, but will grow large quickly. The full matrix would quickly consume all your RAM, while the sparse one will not consume much more than the original picture matrix does.

With this matrix X, the new problem

X·b = Y

now contains all the problems

X1 · b1 = Y1
X2 · b2 = Y2
...

where

b = [b1; b2; b3; ...]
Y = [Y1; Y2; Y3; ...]

so, the single command

X\Y

will solve all your systems at once.

This offloads all the hard work to a set of highly specialized, compiled to machine-specific code, optimized-in-every-way algorithms, rather than the interpreted, generic, always-two-steps-away from the hardware loops in MATLAB.

It should be straightforward to convert this to a version where X is a matrix; you'll end up with something like what blkdiag does, which can also be used by mldivide in exactly the same way as above.

Question 2

You can achieve a ~factor of 2 speed up by precomputing the transposition of X. i.e.

for x=1:size(picture,2) % second dimension b/c already transposed

    X = picture(:,x);
    XX = X';
    Y = randn(n_timepoints,1);
    %B = (X'*X)^-1*X'*Y; ;
    B = (XX*X)^-1*XX*Y; 
    est(x) = B(1);

end

Before: Elapsed time is 2.520944 seconds.
After: Elapsed time is 1.134081 seconds.

EDIT: Your code, as it stands in your latest edit, can be replaced by the following

tic
xdim = 500; 
ydim = 500; 
n_timepoints = 10; % for example

% Actual work
picture = randn(xdim,ydim,n_timepoints);
picture = reshape(picture, [xdim*ydim,n_timepoints])'; % note transpose
YR = randn(n_timepoints,size(picture,2));

% (XX*X).^-1 = sum(picture.*picture).^-1;
% XX*Y = sum(picture.*YR);
est = sum(picture.*picture).^-1 .* sum(picture.*YR);

est = reshape(est,[xdim,ydim]);
toc

Elapsed time is 0.127014 seconds.

This is an order of magnitude speed up on the latest edit, and the results are all but identical to the previous method.

EDIT2:

Okay, so if X is a matrix, not a vector, things are a little more complicated. We basically want to precompute as much as possible outside of the for-loop to keep our costs down. We can also get a significant speed-up by computing XT*X manually - since the result will always be a symmetric matrix, we can cut a few corners to speed things up. First, the symmetric multiplication function:

function XTX = sym_mult(X) % X is a 3-d matrix

n = size(X,2);
XTX = zeros(n,n,size(X,3));
for i=1:n
    for j=i:n
        XTX(i,j,:) = sum(X(:,i,:).*X(:,j,:));
        if i~=j
            XTX(j,i,:) = XTX(i,j,:);
        end
    end
end

Now the actual computation script

xdim = 500; 
ydim = 500; 
n_timepoints = 10; % for example

Y = randn(10,xdim*ydim);
picture = randn(xdim,ydim,n_timepoints); % 500x500x10

% Actual work  
tic  % start timing

picture = reshape(picture, [xdim*ydim,n_timepoints])';

% Here we precompute the (XT*Y) calculation to speed things up later
picture_y = [sum(Y);sum(Y.*picture)]; 

% initialize
est = zeros(size(picture,2),1); 

picture = permute(picture,[1,3,2]);
XTX = cat(2,ones(n_timepoints,1,size(picture,3)),picture);
XTX = sym_mult(XTX); % precompute (XT*X) for speed

X = zeros(2,2); % preallocate for speed
XY = zeros(2,1);

for x=1:size(picture,2) % second dimension b/c already transposed

    %For some reason this is a lot faster than X = XTX(:,:,x);
    X(1,1) = XTX(1,1,x);
    X(2,1) = XTX(2,1,x);
    X(1,2) = XTX(1,2,x);
    X(2,2) = XTX(2,2,x);
    XY(1) = picture_y(1,x);
    XY(2) = picture_y(2,x);

    % Here we utilise the fact that A\B is faster than inv(A)*B
    % We also use the fact that (A*B)*C = A*(B*C) to speed things up
    B = X\XY;
    est(x) = B(1); 
end
est = reshape(est,[xdim,ydim]); 
toc % end timing

Before: Elapsed time is 4.56 seconds.
After: Elapsed time is 2.24 seconds.

This is a speed up of about a factor of 2. This code should be extensible to X being any dimensions you want. For instance, in the case where X = [1 x x^2], you would change picture_y to the following

picture_y = [sum(Y);sum(Y.*picture);sum(Y.*picture.^2)];

and change XTX to

XTX = cat(2,ones(n_timepoints,1,size(picture,3)),picture,picture.^2);

You would also change a lot of 2s to 3s in the code, and add XY(3) = picture_y(3,x) to the loop. It should be fairly straight-forward, I believe.

Question 3

I had a wee play around with an idea, and I decided to stick it as a separate answer, as its a completely different approach to my other idea, and I don't actually condone what I'm about to do. I think this is the fastest approach so far:

Orignal (unoptimised): 13.507176 seconds.
Fast Cholesky-decomposition method: 0.424464 seconds

First, we've got a function to quickly do the X'*X multiplication. We can speed things up here because the result will always be symmetric.

function XX = sym_mult(X)

n = size(X,2);
XX = zeros(n,n,size(X,3));
for i=1:n
    for j=i:n
        XX(i,j,:) = sum(X(:,i,:).*X(:,j,:));
        if i~=j
            XX(j,i,:) = XX(i,j,:);
        end
    end
end

The we have a function to do LDL Cholesky decomposition of a 3D matrix (we can do this because the (X'*X) matrix will always be symmetric) and then do forward and backwards substitution to solve the LDL inversion equation

function Y = fast_chol(X,XY)

n=size(X,2);
L = zeros(n,n,size(X,3));
D = zeros(n,n,size(X,3));
B = zeros(n,1,size(X,3));
Y = zeros(n,1,size(X,3));
% These loops compute the LDL decomposition of the 3D matrix
for i=1:n
    D(i,i,:) = X(i,i,:);
    L(i,i,:) = 1;
    for j=1:i-1
        L(i,j,:) = X(i,j,:);
        for k=1:(j-1)
            L(i,j,:) = L(i,j,:) - L(i,k,:).*L(j,k,:).*D(k,k,:);
        end
        D(i,j,:) = L(i,j,:);
        L(i,j,:) = L(i,j,:)./D(j,j,:);
        if i~=j
            D(i,i,:) = D(i,i,:) - L(i,j,:).^2.*D(j,j,:);
        end
    end
end

for i=1:n
    B(i,1,:) = XY(i,:);
    for j=1:(i-1)
        B(i,1,:) = B(i,1,:)-D(i,j,:).*B(j,1,:);
    end
    B(i,1,:) = B(i,1,:)./D(i,i,:);
end

for i=n:-1:1
    Y(i,1,:) = B(i,1,:);
    for j=n:-1:(i+1)
        Y(i,1,:) = Y(i,1,:)-L(j,i,:).*Y(j,1,:);
    end
end

Finally, we have the main script which calls all of this

xdim = 500; 
ydim = 500; 
n_timepoints = 10; % for example

Y = randn(10,xdim*ydim);
picture = randn(xdim,ydim,n_timepoints); % 500x500x10

tic  % start timing

picture = reshape(pr, [xdim*ydim,n_timepoints])';
% Here we precompute the (XT*Y) calculation
picture_y = [sum(Y);sum(Y.*picture)];

% initialize
est2 = zeros(size(picture,2),1); 

picture = permute(picture,[1,3,2]);
% Now we calculate the X'*X matrix
XTX = cat(2,ones(n_timepoints,1,size(picture,3)),picture);
XTX = sym_mult(XTX);

% Call our fast Cholesky decomposition routine
B = fast_chol(XTX,picture_y);
est2 = B(1,:);

est2 = reshape(est2,[xdim,ydim]); 
toc

Again, this should work equally well for a Nx3 X matrix, or however big you want.

Question 4

I use octave, thus I can't say anything about the resulting performance in Matlab, but would expect this code to be slightly faster:

pictureT=picture'
est=arrayfun(@(x)( (pictureT(x,:)*picture(:,x))^-1*pictureT(x,:)*randn(n_ti
mepoints,1)),1:size(picture,2));