Question

I am working on text classification with a support vector machine, but I am confused about how to compute the feature vectors for the test set.

For the training features, I computed a TF-IDF vector for each training document and built a feature matrix [docs x terms] from the TF-IDF values.

But how should I compute the test set's feature vectors? Should I just reuse the values from the training set?

For example, suppose the word "apple" has a document frequency of 5 in the training set. For the test set, should I reuse that value of 5 for "apple", or recompute TF-IDF from the test set? Or am I going about computing the feature vector the wrong way altogether?

Thanks in advance!

Solution

Compute the IDF (inverse document frequency) of every term from the training set only, then reuse those same IDF values for the documents in your test set. The TF (term frequency), on the other hand, depends on the specific document you are classifying, so it is computed per document and will differ between documents in both the training and test sets.
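
As a rough illustration of the split described above, here is a minimal pure-Python sketch. The helper names `fit_idf` and `transform`, the toy documents, and the smoothed IDF formula `ln((1 + N) / (1 + df)) + 1` are assumptions made for the example, not something taken from the question; other IDF variants would work the same way.

```python
import math
from collections import Counter


def fit_idf(train_docs):
    """Compute a smoothed IDF per term from the training documents only."""
    n_docs = len(train_docs)
    doc_freq = Counter()
    for doc in train_docs:
        doc_freq.update(set(doc))  # count each term once per document
    # Assumed smoothed variant: idf = ln((1 + N) / (1 + df)) + 1
    return {
        term: math.log((1 + n_docs) / (1 + df)) + 1.0
        for term, df in doc_freq.items()
    }


def transform(doc, idf, vocab):
    """TF comes from the document itself; IDF comes from the training set."""
    tf = Counter(doc)  # missing terms count as 0
    return [tf[term] * idf[term] for term in vocab]


# Hypothetical toy data, already tokenized.
train_docs = [
    "apple banana apple".split(),
    "banana cherry".split(),
    "apple cherry cherry".split(),
]
test_doc = "apple apple durian".split()

idf = fit_idf(train_docs)   # learned from training data only
vocab = sorted(idf)         # fixed column order for the feature matrix

X_train = [transform(doc, idf, vocab) for doc in train_docs]
x_test = transform(test_doc, idf, vocab)  # same idf, same columns
```

Note that "durian" never appeared in the training set, so it has no column and is simply dropped from the test vector; this keeps the test features aligned with the [docs x terms] matrix the SVM was trained on. If you use scikit-learn, the same pattern should correspond to calling `fit_transform` on the training documents and only `transform` on the test documents with a `TfidfVectorizer`.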

Licensed under: CC-BY-SA with attribution
Not affiliated with StackOverflow