Question

I'm working on neural networks, and to reduce the dimensionality of a term-document matrix, built from the documents and the terms they contain, with tf-idf values as entries, I need to apply PCA. Something like this:

             Term 1     Term 2     Term 3     Term 4   ...
Document 1
Document 2        (tf-idf value of each term in each document)
Document 3
...
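
As a minimal sketch of how such a matrix can be built, assuming scikit-learn is installed (the toy documents below are purely illustrative):

    from sklearn.feature_extraction.text import TfidfVectorizer

    # Illustrative corpus; in practice these would be your real documents.
    docs = [
        "neural networks learn representations",
        "tf idf weights terms per document",
        "pca reduces the dimensionality of the matrix",
    ]

    vectorizer = TfidfVectorizer()
    M = vectorizer.fit_transform(docs)           # shape: (n_documents, n_terms)

    print(M.shape)                               # rows = documents, columns = terms
    print(vectorizer.get_feature_names_out())    # the term (column) labels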

PCA works by computing the mean of the data, subtracting it, and then using the following formula for the covariance matrix.

Let the matrix M be the term-document matrix of dimension NxN

The Covariance matrix becomes

(M × transpose(M)) / (N − 1)

We then calculate the eigenvalues and eigenvectors to feed as feature vectors into the neural network. What I'm not able to comprehend is the importance of the covariance matrix, and which dimensions it is computing the covariance of.
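
For concreteness, here is a rough NumPy sketch of those steps, assuming the rows of the matrix are documents (observations) and the columns are terms (variables); the random matrix simply stands in for a real tf-idf matrix:

    import numpy as np

    rng = np.random.default_rng(0)
    X = rng.random((5, 8))                      # stand-in for a tf-idf matrix

    X_centered = X - X.mean(axis=0)             # subtract the per-term mean
    n_docs = X_centered.shape[0]

    # Covariance of the term dimensions: entry [i, j] measures how the
    # tf-idf weights of term i and term j co-vary across documents.
    cov = (X_centered.T @ X_centered) / (n_docs - 1)

    # Eigendecomposition; eigh is appropriate because cov is symmetric.
    eigenvalues, eigenvectors = np.linalg.eigh(cov)

    # Sort by descending eigenvalue and keep the top k components.
    order = np.argsort(eigenvalues)[::-1]
    k = 3
    components = eigenvectors[:, order[:k]]

    # Project the documents onto the reduced space (the vectors fed to the net).
    X_reduced = X_centered @ components         # shape: (n_docs, k)
    print(X_reduced.shape)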

With simple two-dimensional data, say variables X and Y, this is easy to understand. But what dimensions are being correlated here?

Thank you


The solution

Latent semantic analysis describes this relationship pretty well. It also explains how one first uses the full doc-term matrix, and then the reduced one, to map lists (vectors) of terms to near-match docs -- i.e. why reduce.
See also making-sense-of-PCA-eigenvectors-eigenvalues. (The many different answers there suggest that no single one is intuitive for everybody.)
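
As a hedged sketch of the LSA idea pointed to above, assuming scikit-learn's TfidfVectorizer and TruncatedSVD are available (the corpus and query are made up): the full doc-term matrix is reduced to a latent space, and a term-vector query is matched against documents in that reduced space.

    from sklearn.feature_extraction.text import TfidfVectorizer
    from sklearn.decomposition import TruncatedSVD
    from sklearn.metrics.pairwise import cosine_similarity

    docs = [
        "neural networks and deep learning",
        "tf idf weighting of terms in documents",
        "principal component analysis for dimensionality reduction",
    ]

    vectorizer = TfidfVectorizer()
    doc_term = vectorizer.fit_transform(docs)    # full doc-term matrix

    svd = TruncatedSVD(n_components=2)           # the reduced (latent) space
    doc_reduced = svd.fit_transform(doc_term)    # documents in latent space

    # Map a list of query terms into the same latent space, then rank the
    # documents by cosine similarity to find near matches.
    query = vectorizer.transform(["dimensionality reduction with pca"])
    query_reduced = svd.transform(query)

    similarities = cosine_similarity(query_reduced, doc_reduced)[0]
    print(similarities.argsort()[::-1])          # document indices, best match first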

Licensed under: CC-BY-SA with attribution
Not affiliated with StackOverflow