Question

I am studying some machine learning and I have come across, in several places, the claim that Latent Semantic Indexing may be used for feature selection. Can someone please provide a brief, simplified explanation of how this is done, ideally both theoretically and in commented code? How does it differ from Principal Component Analysis?

It doesn't really matter to me which language it is written in, just that I can understand both the code and the theory.


Solution

LSA is conceptually similar to PCA, but is used in different settings.

The goal of PCA is to transform data into a new, typically lower-dimensional space. For example, if you wanted to recognize faces and used 640x480 pixel images (i.e. vectors in a 307200-dimensional space), you would probably try to reduce that space to something more manageable, both to make computation simpler and to make the data less noisy. PCA does exactly this: it "rotates" the axes of your high-dimensional space and assigns a "weight" to each of the new axes, so that you can throw away the least important of them.
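Since the question asks for commented code, here is a minimal sketch of that idea using scikit-learn's PCA. The data, dimensions, and number of components are made up purely for illustration; a real face-recognition pipeline would of course use actual images.

```python
# Minimal PCA sketch (illustrative data only).
import numpy as np
from sklearn.decomposition import PCA

# Pretend each row is a flattened grayscale image; here we just use
# random vectors of a small size so the sketch stays runnable.
rng = np.random.default_rng(0)
X = rng.random((100, 300))        # 100 samples, 300 original features

# Keep only the 50 directions (principal components) with the largest variance.
pca = PCA(n_components=50)
X_reduced = pca.fit_transform(X)  # shape: (100, 50)

# explained_variance_ratio_ is the "weight" of each new axis:
# how much of the original variance that axis captures.
print(X_reduced.shape)
print(pca.explained_variance_ratio_[:5])
```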

LSA, on the other hand, is used to analyze the semantic similarity of words. It isn't meant for images, bank data, or other arbitrary datasets; it is designed specifically for text processing and works specifically with term-document matrices. Such matrices, however, are often too large, so they are reduced to lower-rank matrices in a way very similar to PCA (both use SVD). What happens here is not feature selection, though, but feature vector transformation. SVD gives you a transformation matrix (call it S) which, multiplied by an input vector x, yields a new vector x' in a smaller space with a more important basis. That new basis provides your new features. They are not selected from the old features, but obtained by transforming the old, larger basis.
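To make that concrete, here is a minimal sketch of LSA on a term-document matrix using scikit-learn's TfidfVectorizer and TruncatedSVD. The tiny corpus and the choice of 2 components are made up for illustration; TruncatedSVD plays the role of the SVD-based transformation described above.

```python
# Minimal LSA sketch on a toy corpus (illustrative only).
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.decomposition import TruncatedSVD

docs = [
    "the cat sat on the mat",
    "dogs and cats are pets",
    "stock prices fell sharply today",
    "the market rallied after the report",
]

# Build the term-document matrix (documents as rows, terms as columns).
vectorizer = TfidfVectorizer()
X = vectorizer.fit_transform(docs)    # shape: (4 documents, n_terms)

# Reduce to a 2-dimensional "semantic" space via truncated SVD.
svd = TruncatedSVD(n_components=2, random_state=0)
X_lsa = svd.fit_transform(X)          # shape: (4, 2)

# Each document is now a 2-dimensional vector. These coordinates are the
# transformed features, not a selected subset of the original term features.
print(X_lsa)
```

Documents about similar topics end up close together in the reduced space, which is what makes LSA useful for measuring semantic similarity.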

For more details on LSA, as well as implementation tips, see this article.
