Question

I have this SVD decomposition of the document

[Image: SVD Decomposition]

I've read this page, but I don't understand how I can compute the best feature for document separation.

I know that:

S x Vt gives me the relation between documents and features

U x S gives me the relation between terms and features

But what is the key to selecting the best feature?


Solution

SVD is concerned only with the inputs, not with their labels. In other words, it can be seen as an unsupervised technique. As such, it cannot tell you which features are good for separation without further assumptions.

What it does tell you is which 'basis vectors' are more important than others, in terms of reconstructing the original data using only a subset of the basis vectors.
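To make "reconstructing the original data using only a subset of the basis vectors" concrete, here is a minimal NumPy sketch. The matrix `A` and the helper names `rank_k_approximation` and `reconstruction_errors` are made up for illustration; they do not come from the question.

```python
import numpy as np

# Hypothetical term-document matrix (rows = terms, columns = documents).
A = np.array([
    [2.0, 1.0, 0.0, 0.0],
    [1.0, 2.0, 0.0, 0.0],
    [0.0, 0.0, 3.0, 1.0],
    [0.0, 0.0, 1.0, 3.0],
    [1.0, 0.0, 0.0, 1.0],
])

# Thin SVD: A = U @ diag(s) @ Vt, with s sorted from largest to smallest.
U, s, Vt = np.linalg.svd(A, full_matrices=False)

def rank_k_approximation(U, s, Vt, k):
    """Rebuild the matrix from only the k largest singular values and their vectors."""
    return (U[:, :k] * s[:k]) @ Vt[:k, :]

def reconstruction_errors(A, U, s, Vt):
    """Frobenius-norm error of the rank-k approximation for k = 1..len(s)."""
    return [
        np.linalg.norm(A - rank_k_approximation(U, s, Vt, k))
        for k in range(1, len(s) + 1)
    ]

errors = reconstruction_errors(A, U, s, Vt)
for k, err in enumerate(errors, start=1):
    print(f"k={k}: reconstruction error = {err:.4f}")
```

The error can only shrink as more basis vectors are kept, and it is determined by the singular values that were dropped. Nothing in this computation looks at document labels, which is why the ranking it produces is about reconstruction and not about separation.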

Nevertheless, you can think about LSA in the following manner (this is only an interpretation; the math is what matters): a document is generated by a mixture of topics. Each topic is represented by a vector of length n (the number of terms), which tells you how likely each word is in this topic. For example, if the topic is sports, then words like football or game are more likely than bestseller or movie. These topic vectors are the columns of U. In order to generate a document (a column of A), you take a linear combination of topics. The coefficients of the linear combination are the columns of Vt: each column tells you what proportion of each topic to take in order to generate a document. In addition, each topic has an overall 'gain' factor, which tells you how important this topic is in your set of documents (maybe you have just one document about sports out of 1000 total documents). These are the singular values, i.e. the diagonal of S. If you throw away the smaller ones, you can represent your original matrix A with fewer topics and only a small amount of information lost. Of course, 'small' is a matter of application.
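The interpretation above can be checked numerically. The following is a small sketch under stated assumptions: the vocabulary `terms`, the counts in `A`, and the variable names are invented for illustration, and rows of `A` are assumed to be terms while columns are documents.

```python
import numpy as np

# Hypothetical vocabulary and term-document counts (rows = terms, columns = documents).
terms = ["football", "game", "bestseller", "movie"]
A = np.array([
    [3.0, 2.0, 0.0],
    [2.0, 3.0, 1.0],
    [0.0, 0.0, 2.0],
    [0.0, 1.0, 3.0],
])

U, s, Vt = np.linalg.svd(A, full_matrices=False)

# Document j is a linear combination of the topic vectors (columns of U).
# The weights are column j of Vt, scaled by each topic's gain (singular value).
j = 0
weights = s * Vt[:, j]
rebuilt = U @ weights
print("document", j, "recovered from topics:", np.allclose(rebuilt, A[:, j]))

# The two products mentioned in the question.
doc_features = np.diag(s) @ Vt    # S x Vt: one column per document
term_features = U @ np.diag(s)    # U x S: one row per term

# Keeping only the k largest singular values gives a rank-k approximation of A.
k = 2
A_k = U[:, :k] @ doc_features[:k, :]
print("rank of approximation:", np.linalg.matrix_rank(A_k))
```

Here `doc_features[:k, j]` is the k-dimensional representation of document j that is usually fed to a downstream step such as clustering or a classifier.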

One drawback of LSA is that it is not entirely clear how to interpret the numbers - they are not probabilities, for example. It makes sense to have "0.5" units of sports in a document, but what does it mean to have "-1" units?
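A short sketch of why the sign is hard to interpret, using a made-up non-negative matrix `A` (all names here are illustrative): negative coordinates show up even though every count is non-negative, and flipping the sign of a topic together with its coefficients leaves the reconstruction unchanged.

```python
import numpy as np

# Hypothetical non-negative term-document counts (rows = terms, columns = documents).
A = np.array([
    [3.0, 2.0, 0.0],
    [2.0, 3.0, 1.0],
    [0.0, 0.0, 2.0],
    [0.0, 1.0, 3.0],
])

U, s, Vt = np.linalg.svd(A, full_matrices=False)

# Document coordinates in topic space (S x Vt).
doc_coords = np.diag(s) @ Vt
has_negative = bool((doc_coords < 0).any())
print("negative topic weights present:", has_negative)

# Flip the sign of the second topic and of its coefficients.
U_flipped = U.copy()
Vt_flipped = Vt.copy()
U_flipped[:, 1] *= -1
Vt_flipped[1, :] *= -1

# The product is unchanged, so the sign of a single coordinate is arbitrary.
same_reconstruction = np.allclose(U_flipped @ np.diag(s) @ Vt_flipped, A)
print("reconstruction unchanged after sign flip:", same_reconstruction)
```

So a weight of "-1" on a topic cannot be read as an amount of that topic; only the combination of topic vector and weight is meaningful.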

License: CC-BY-SA with attribution
Not affiliated with StackOverflow