Question

Based on this answer, we know that we can build the covariance matrix incrementally when there are too many observations, whereas we can use randomized SVD when there are too many variables.

The answer provided is clear and helpful. However, what if we have a large number of observations AND variables? E.g. 500,000 samples with 600,000 variables. In this case the covariance matrix will be huge (e.g. 2,000 GB, assuming 8-byte floats, if my calculation is correct) and it will be impossible to fit it into memory.

In such a scenario, is there anything we can do to compute the PCA, assuming we only want the top PCs (e.g. 15 PCs)?


Solution

There are a couple of things you can do.

  1. Sample a representative but small subset of your data, which will allow you to compute PCA in memory. But seeing as you have 600,000 variables, this will most likely not give meaningful results.
  2. Use incremental PCA; see scikit-learn's IncrementalPCA: http://scikit-learn.org/stable/modules/generated/sklearn.decomposition.IncrementalPCA.html#sklearn.decomposition.IncrementalPCA (a minimal sketch is shown after this list).
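
Here is a minimal sketch of the incremental route, assuming your rows can be streamed in chunks from disk. The `load_chunks` generator is a hypothetical placeholder (it uses random data with a reduced feature count so the snippet runs quickly); in the real problem it would read chunks of the 500,000 × 600,000 matrix from storage.

```python
import numpy as np
from sklearn.decomposition import IncrementalPCA

def load_chunks(n_chunks=10, rows_per_chunk=500, n_features=2_000):
    """Hypothetical stand-in for streaming rows from disk (HDF5, memmap, ...).
    In the real problem n_features would be ~600,000 and the chunks would be
    read from storage instead of generated randomly."""
    rng = np.random.default_rng(0)
    for _ in range(n_chunks):
        yield rng.standard_normal((rows_per_chunk, n_features)).astype(np.float32)

ipca = IncrementalPCA(n_components=15)      # keep only the top 15 PCs

for chunk in load_chunks():
    ipca.partial_fit(chunk)                 # update the estimate one chunk at a time

# Project data onto the learned components -> shape (rows_per_chunk, 15)
reduced = ipca.transform(next(load_chunks()))
print(reduced.shape)
```

Only one chunk plus the running component estimate has to be in memory at any time, which is what makes this feasible when neither the data matrix nor the covariance matrix fits in RAM.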

But the main problem you have is that the number of samples (500,000) is smaller than the number of variables (600,000). I would recommend a different approach to dimensionality reduction: autoencoders. Autoencoders can be trained iteratively on mini-batches, which sidesteps your memory issue, and they can learn more complicated (non-linear) projections than PCA, which is a linear transform. If you want a linear projection, you can use an autoencoder with a single linear hidden layer; such a network learns the same subspace as PCA, although the individual components it finds need not be identical. A minimal sketch follows.
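
For illustration, here is a sketch of a linear autoencoder trained on mini-batches, written in PyTorch as one possible choice of framework. It assumes the data is mean-centered and can be streamed in batches; the `batch_iter` generator and the reduced feature count are hypothetical placeholders for your real data pipeline.

```python
import torch
import torch.nn as nn

N_FEATURES = 2_000      # would be ~600,000 in the real problem
N_COMPONENTS = 15       # bottleneck size = number of "PCs" you want

class LinearAutoencoder(nn.Module):
    def __init__(self, n_features, n_components):
        super().__init__()
        # No non-linearities: encoder/decoder are plain linear maps
        self.encoder = nn.Linear(n_features, n_components, bias=False)
        self.decoder = nn.Linear(n_components, n_features, bias=False)

    def forward(self, x):
        return self.decoder(self.encoder(x))

def batch_iter(n_batches=50, batch_size=256):
    """Hypothetical placeholder: yield mean-centered mini-batches from disk."""
    for _ in range(n_batches):
        yield torch.randn(batch_size, N_FEATURES)

model = LinearAutoencoder(N_FEATURES, N_COMPONENTS)
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)
loss_fn = nn.MSELoss()

for x in batch_iter():
    optimizer.zero_grad()
    loss = loss_fn(model(x), x)   # reconstruction error, the quantity PCA also minimizes
    loss.backward()
    optimizer.step()

# 15-dimensional codes for a batch of new data
codes = model.encoder(next(batch_iter()))
```

With no activation functions and a mean-squared-error loss on centered data, the encoder's weights span the same subspace as the top principal components; adding non-linearities turns this into a more expressive, non-linear dimensionality reduction.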

Here are a couple of links you will find helpful:

Licensed under: CC-BY-SA with attribution