Question

I've performed χ² feature selection on my training documents, which were already transformed into TF*IDF feature vectors using sklearn.feature_extraction.text.TfidfVectorizer (which produces L2-normalized vectors by default). However, after selecting the top-K most informative features, the vectors are no longer normalized because dimensions have been removed (all vectors now have a length < 1).
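
Here is a minimal sketch of what I mean (the corpus, labels and k below are just toy placeholders):

import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.feature_selection import SelectKBest, chi2

docs = ["good movie", "bad movie", "good plot bad acting"]
y = [1, 0, 1]

X = TfidfVectorizer().fit_transform(docs)           # rows have L2 norm 1
X_sel = SelectKBest(chi2, k=2).fit_transform(X, y)  # dimensions dropped

print(np.linalg.norm(X.toarray(), axis=1))      # all exactly 1.0
print(np.linalg.norm(X_sel.toarray(), axis=1))  # now <= 1.0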

Is it advisable to re-normalize the feature vectors after feature selection? I'm also not very clear on the main difference between normalization and scaling. Do they serve similar purposes for learners such as SVC?

Thank you in advance for your kind answer!


Solution

This is actually a lot of questions in one. The main reason for doing normalization on tf-idf vectors is so that their dot products (used by SVMs in their decision function) are readily interpretable as cosine similarities, the mainstay of document vector comparisons in information retrieval. Normalization makes sure that

"hello world"             -> [1 2]
"hello hello world world" -> [2 4]

become the same vector, so concatenating a document onto itself doesn't change the decision boundary and the similarity between these two documents is exactly one (although with sublinear scaling, sublinear_tf in the vectorizer constructor, this is no longer true).
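
As a rough check of that point (a toy two-document corpus; TfidfVectorizer uses norm="l2" by default):

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

docs = ["hello world", "hello hello world world"]
X = TfidfVectorizer(norm="l2").fit_transform(docs)

print(X.toarray())                 # both rows are identical after normalization
print(cosine_similarity(X)[0, 1])  # 1.0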

The main reason for doing scaling is to avoid numerical instability issues. Normalization takes care of most of those because the features will already be in the range [0, 1]. (I think it also relates to regularization, but I don't use SVMs that often.)
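
To illustrate the difference on a made-up dense matrix: scaling works per feature (column-wise), while normalization works per sample (row-wise).

import numpy as np
from sklearn.preprocessing import Normalizer, StandardScaler

X = np.array([[1.0, 200.0],
              [2.0, 100.0],
              [3.0, 300.0]])

# For sparse tf-idf matrices you would need StandardScaler(with_mean=False).
print(StandardScaler().fit_transform(X))       # each column: mean 0, std 1
print(Normalizer(norm="l2").fit_transform(X))  # each row: L2 norm 1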

As you've noticed, chi² "denormalizes" the feature vectors, so to answer the original question: you can try renormalizing them. I did that when adding chi² feature selection to the scikit-learn document classification example, and it helped with some estimators and hurt with others. You can also try doing the chi² selection on unnormalized tf-idf vectors (in which case I recommend you try setting sublinear_tf) and doing either scaling or normalization afterwards.
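
One way to experiment with the renormalization idea is to add a Normalizer step after the chi² selection in a Pipeline, roughly like this (k and the classifier are placeholders to tune for your data):

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.feature_selection import SelectKBest, chi2
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import Normalizer
from sklearn.svm import LinearSVC

clf = Pipeline([
    ("tfidf", TfidfVectorizer(sublinear_tf=True)),
    ("chi2", SelectKBest(chi2, k=1000)),
    ("renorm", Normalizer(norm="l2")),  # restore unit length after selection
    ("svc", LinearSVC()),
])
# clf.fit(train_docs, train_labels); clf.predict(test_docs)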

Licensed under: CC-BY-SA with attribution
Not affiliated with StackOverflow