Question

Let's generate an array:

import numpy as np

data = np.arange(30).reshape(10, 3)
data = data * data
array([[  0,   1,   4],
       [  9,  16,  25],
       [ 36,  49,  64],
       [ 81, 100, 121],
       [144, 169, 196],
       [225, 256, 289],
       [324, 361, 400],
       [441, 484, 529],
       [576, 625, 676],
       [729, 784, 841]])

Then find the eigenvalues of the covariance matrix:

from numpy import linalg as la

mn = np.mean(data, axis=0)
data = data - mn            # not in-place: data is an integer array, mn is float
C = np.cov(data.T)
evals, evecs = la.eig(C)
idx = np.argsort(evals)[::-1]
evecs = evecs[:, idx]
print(evecs)
array([[-0.53926461, -0.73656433,  0.40824829],
       [-0.5765472 , -0.03044111, -0.81649658],
       [-0.61382979,  0.67568211,  0.40824829]])

Now let's run the matplotlib.mlab.PCA function on the data:

import matplotlib.mlab as mlab
mpca = mlab.PCA(data)
print(mpca.Wt)
[[ 0.57731894  0.57740574  0.57732612]
 [ 0.72184459 -0.03044628 -0.69138514]
 [ 0.38163232 -0.81588947  0.43437443]]

Why are the two matrices different? I thought that to perform PCA one first had to find the eigenvectors of the covariance matrix, and that these would be exactly equal to the weights.


Solution

You need to normalize your data, not just center it, and the output of np.linalg.eig has to be transposed to match that of mlab.PCA:

>>> n_data = (data - data.mean(axis=0)) / data.std(axis=0)
>>> evals, evecs = np.linalg.eig(np.cov(n_data.T))
>>> evecs = evecs[:, np.argsort(evals)[::-1]].T
>>> mlab.PCA(data).Wt
array([[ 0.57731905,  0.57740556,  0.5773262 ],
       [ 0.72182079, -0.03039546, -0.69141222],
       [ 0.38167716, -0.8158915 ,  0.43433121]])
>>> evecs
array([[-0.57731905, -0.57740556, -0.5773262 ],
       [-0.72182079,  0.03039546,  0.69141222],
       [ 0.38167716, -0.8158915 ,  0.43433121]])