Question

I've performed χ² feature selection on my training documents, which were already transformed into TF*IDF feature vectors using sklearn.feature_extraction.text.TfidfVectorizer (which produces L2-normalized vectors by default). However, after selecting the top-K most informative features, the vectors are no longer normalized because dimensions have been removed (all vectors now have a length < 1).
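
Here is a minimal sketch of what I mean (the corpus, labels and k below are just toy placeholders):

import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.feature_selection import SelectKBest, chi2

docs = ["good movie", "bad movie", "good plot bad acting"]
y = [1, 0, 1]

X = TfidfVectorizer().fit_transform(docs)           # rows have L2 norm 1
X_sel = SelectKBest(chi2, k=2).fit_transform(X, y)  # dimensions dropped

print(np.linalg.norm(X.toarray(), axis=1))      # all exactly 1.0
print(np.linalg.norm(X_sel.toarray(), axis=1))  # now <= 1.0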

Is it advisable to re-normalize the feature vectors after feature selection? I'm also not very clear on the main difference between normalization and scaling. Do they serve similar purposes for learners such as SVC?

Thank you in advance for your kind answer!


Solution

This is actually a lot of questions in one. The main reason for doing normalization on tf-idf vectors is so that their dot products (used by SVMs in their decision function) are readily interpretable as cosine similarities, the mainstay of document vector comparisons in information retrieval. Normalization makes sure that

"hello world"             -> [1 2]
"hello hello world world" -> [2 4]

become the same vector, so concatenating a document onto itself doesn't change the decision boundary and the similarity between these two documents is exactly one (although with sublinear scaling, sublinear_tf in the vectorizer constructor, this is no longer true).
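
As a rough check of that point (a toy two-document corpus; TfidfVectorizer uses norm="l2" by default):

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

docs = ["hello world", "hello hello world world"]
X = TfidfVectorizer(norm="l2").fit_transform(docs)

print(X.toarray())                 # both rows are identical after normalization
print(cosine_similarity(X)[0, 1])  # 1.0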

The main reason for doing scaling is to avoid numerical instability issues. Normalization takes care of most of those because the features will already be in the range [0, 1]. (I think it also relates to regularization, but I don't use SVMs that often.)
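
To illustrate the difference on a made-up dense matrix: scaling works per feature (column-wise), while normalization works per sample (row-wise).

import numpy as np
from sklearn.preprocessing import Normalizer, StandardScaler

X = np.array([[1.0, 200.0],
              [2.0, 100.0],
              [3.0, 300.0]])

# For sparse tf-idf matrices you would need StandardScaler(with_mean=False).
print(StandardScaler().fit_transform(X))       # each column: mean 0, std 1
print(Normalizer(norm="l2").fit_transform(X))  # each row: L2 norm 1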

As you've noticed, chi² "denormalizes" the feature vectors, so to answer the original question: you can try renormalizing them. I did that when adding chi² feature selection to the scikit-learn document classification example, and it helped with some estimators and hurt with others. You can also try doing the chi² selection on unnormalized tf-idf vectors (in which case I recommend you try setting sublinear_tf) and doing either scaling or normalization afterwards.
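
One way to experiment with the renormalization idea is to add a Normalizer step after the chi² selection in a Pipeline, roughly like this (k and the classifier are placeholders to tune for your data):

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.feature_selection import SelectKBest, chi2
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import Normalizer
from sklearn.svm import LinearSVC

clf = Pipeline([
    ("tfidf", TfidfVectorizer(sublinear_tf=True)),
    ("chi2", SelectKBest(chi2, k=1000)),
    ("renorm", Normalizer(norm="l2")),  # restore unit length after selection
    ("svc", LinearSVC()),
])
# clf.fit(train_docs, train_labels); clf.predict(test_docs)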

Licensed under: CC-BY-SA with attribution
Not affiliated with StackOverflow