How to demean the data for PCA

https://stackoverflow.com/questions/21506994

05-10-2022

Question

I have a 75x60 array in Matlab and I'm trying to do PCA. To check my work, I want to confirm that the largest eigenvalue returned by eig(matrix) equals d(1)*d(1) from [u d v] = svd(matrix). The two are wildly off. The only thing I can see that could be going wrong is the demeaning.

Here is how I'm handling the demeaning:

 %v is a 75x60 array
 %rowS is 75
 avgVector= mean(v,1);
 muMatrix = repmat(avgVector,rowS,1);
 v = v-muMatrix;

Calling svd(v) returns values that are extremely different from those of eig(cov(v)), whether or not v has undergone the above demeaning.

Solution

If your matrices follow the standard Matlab convention that variables are columns and samples are rows, then your approach, basically

v = v - repmat(mean(v, 1), size(v, 1), 1);

is correct. It is more memory-efficient, though, not to expand the means into a full-sized matrix:

v = bsxfun(@minus, v, mean(v, 1));

You can also use

v = detrend(v, 'constant')

which internally uses the previous code.
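As a rough check, the demeaning variants above can be compared on made-up data. This is only a sketch: the matrix X, the seed, and the variable names A to D are illustrative assumptions, and the last variant relies on implicit expansion, which is available from MATLAB R2016b onward.

```matlab
% Sketch: compare ways of removing column means (variables in columns).
rng(0);                                       % fixed seed, illustrative only
X = randn(75, 60) + 5;                        % hypothetical data, nonzero column means

A = X - repmat(mean(X, 1), size(X, 1), 1);    % explicit replication of the means
B = bsxfun(@minus, X, mean(X, 1));            % singleton expansion, no full-sized copy
C = detrend(X, 'constant');                   % removes the mean of each column
D = X - mean(X, 1);                           % implicit expansion (R2016b or later)

% Largest elementwise disagreement between the variants
maxDiff = max([max(abs(A(:) - B(:))), ...
               max(abs(A(:) - C(:))), ...
               max(abs(A(:) - D(:)))]);

% Largest column mean left after demeaning
maxColMean = max(abs(mean(A, 1)));

fprintf('largest disagreement between variants: %g\n', maxDiff);
fprintf('largest remaining column mean: %g\n', maxColMean);
```

Both printed values should be at the level of floating-point rounding error.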

The problem lies elsewhere: the singular values of v are the square roots of the eigenvalues of v'*v. If v is mean-free, then v'*v (in this case called the "scatter matrix") equals the (unbiased-estimator) covariance matrix cov(v) multiplied by size(v, 1) - 1. The squared singular values therefore have to be divided by size(v, 1) - 1 before they can be compared with the eigenvalues of cov(v). If you use the code

[Veig, D] = eig(cov(v));                          % eigenvectors kept separate from the svd output V
[U, S, V] = svd(bsxfun(@minus, v, mean(v, 1)));   % SVD of the demeaned data

you will find that

sort(diag(D), 'descend')

and

diag(S) .^ 2 / (size(v, 1) - 1)

are identical up to rounding error. The additional sort is necessary because eig does not guarantee ordered eigenvalues.
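The comparison can be tried end to end on synthetic data. The following is a minimal sketch, not the asker's actual data: the 75x60 matrix X, the seed, and all variable names are assumptions chosen for illustration.

```matlab
% Sketch: eigenvalues of cov(X) versus scaled squared singular values.
rng(1);                                   % fixed seed, illustrative only
n = 75;  p = 60;                          % samples in rows, variables in columns
X = randn(n, p) * randn(p, p) + 3;        % hypothetical correlated data, nonzero means

Xc = bsxfun(@minus, X, mean(X, 1));       % demean each column

[~, D] = eig(cov(X));                     % cov removes the column means itself
eigVals = sort(diag(D), 'descend');       % eig does not guarantee an order

s = svd(Xc);                              % singular values, largest first
svVals = s .^ 2 / (n - 1);                % rescale to covariance eigenvalues

relErr = max(abs(eigVals - svVals)) / max(eigVals);
fprintf('max relative difference: %g\n', relErr);

% Without the 1/(n - 1) factor the two differ by exactly that factor,
% which is the kind of mismatch described in the question.
ratio = s(1)^2 / eigVals(1);
fprintf('s(1)^2 / largest eigenvalue: %g (n - 1 = %d)\n', ratio, n - 1);
```

The first printed value should be at rounding-error level, and the second should come out close to n - 1, that is 74 for a 75-row matrix.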

Licensed under: CC-BY-SA with attribution
