Question

I'm working on neural networks, and to reduce the dimensionality of a term-document matrix, built from the documents and the terms they contain, with tf-idf values as entries, I need to apply PCA. Something like this:

             Term 1     Term 2     Term 3     Term 4   ...
Document 1
Document 2        (tf-idf value of each term in each document)
Document 3
...
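
As a minimal sketch of how such a matrix can be built, assuming scikit-learn is installed (the toy documents below are purely illustrative):

    from sklearn.feature_extraction.text import TfidfVectorizer

    # Illustrative corpus; in practice these would be your real documents.
    docs = [
        "neural networks learn representations",
        "tf idf weights terms per document",
        "pca reduces the dimensionality of the matrix",
    ]

    vectorizer = TfidfVectorizer()
    M = vectorizer.fit_transform(docs)           # shape: (n_documents, n_terms)

    print(M.shape)                               # rows = documents, columns = terms
    print(vectorizer.get_feature_names_out())    # the term (column) labels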

PCA works by computing the mean of the data, subtracting it, and then using the following formula for the covariance matrix.

Let the matrix M be the term-document matrix of dimension NxN

The Covariance matrix becomes

(M × transpose(M)) / (N − 1)

We then calculate the eigenvalues and eigenvectors to feed as feature vectors into the neural network. What I'm not able to comprehend is the importance of the covariance matrix, and which dimensions it is computing the covariance of.
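
For concreteness, here is a rough NumPy sketch of those steps, assuming the rows of the matrix are documents (observations) and the columns are terms (variables); the random matrix simply stands in for a real tf-idf matrix:

    import numpy as np

    rng = np.random.default_rng(0)
    X = rng.random((5, 8))                      # stand-in for a tf-idf matrix

    X_centered = X - X.mean(axis=0)             # subtract the per-term mean
    n_docs = X_centered.shape[0]

    # Covariance of the term dimensions: entry [i, j] measures how the
    # tf-idf weights of term i and term j co-vary across documents.
    cov = (X_centered.T @ X_centered) / (n_docs - 1)

    # Eigendecomposition; eigh is appropriate because cov is symmetric.
    eigenvalues, eigenvectors = np.linalg.eigh(cov)

    # Sort by descending eigenvalue and keep the top k components.
    order = np.argsort(eigenvalues)[::-1]
    k = 3
    components = eigenvectors[:, order[:k]]

    # Project the documents onto the reduced space (the vectors fed to the net).
    X_reduced = X_centered @ components         # shape: (n_docs, k)
    print(X_reduced.shape)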

With simple two-dimensional data, say variables X and Y, this is easy to understand. But what dimensions are being correlated here?

Thank you


The solution

Latent semantic analysis describes this relationship pretty well. It also explains how one first uses the full doc-term matrix, and then the reduced one, to map lists (vectors) of terms to near-match docs -- i.e. why reduce.
See also making-sense-of-PCA-eigenvectors-eigenvalues. (The many different answers there suggest that no single one is intuitive for everybody.)
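
As a hedged sketch of the LSA idea pointed to above, assuming scikit-learn's TfidfVectorizer and TruncatedSVD are available (the corpus and query are made up): the full doc-term matrix is reduced to a latent space, and a term-vector query is matched against documents in that reduced space.

    from sklearn.feature_extraction.text import TfidfVectorizer
    from sklearn.decomposition import TruncatedSVD
    from sklearn.metrics.pairwise import cosine_similarity

    docs = [
        "neural networks and deep learning",
        "tf idf weighting of terms in documents",
        "principal component analysis for dimensionality reduction",
    ]

    vectorizer = TfidfVectorizer()
    doc_term = vectorizer.fit_transform(docs)    # full doc-term matrix

    svd = TruncatedSVD(n_components=2)           # the reduced (latent) space
    doc_reduced = svd.fit_transform(doc_term)    # documents in latent space

    # Map a list of query terms into the same latent space, then rank the
    # documents by cosine similarity to find near matches.
    query = vectorizer.transform(["dimensionality reduction with pca"])
    query_reduced = svd.transform(query)

    similarities = cosine_similarity(query_reduced, doc_reduced)[0]
    print(similarities.argsort()[::-1])          # document indices, best match first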

Licensed under: CC-BY-SA with attribution
Not affiliated with StackOverflow