I am working on text classification using a support vector machine, but I am confused about how to compute the feature vectors for the test set.

For the training feature vectors, I computed a TF-IDF vector for each training document and constructed a feature matrix [docs x terms] from the TF-IDF values.

But how should I compute the test set's feature vectors? Should I just use the TF-IDF values from the training set?

E.g., in the training set the document frequency of a particular word, "apple", is 5. For the test set, should I use the value 5 for "apple", or recompute the TF-IDF based on the test set? Or am I going about computing the feature vectors the wrong way?

Thanks in advance!

Solution

You should compute the IDF (inverse document frequency) of every term from the training set only, and then reuse those same IDF values for the documents in your test set. So in your example, the document frequency of 5 for "apple" carries over to the test set unchanged. The TF (term frequency), on the other hand, depends on the specific document you are trying to classify, so it is computed per document and will differ from document to document in both the training and test sets.
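To make this concrete, here is a minimal sketch in plain Python. The function names `fit_idf` and `tfidf_vector`, the toy documents, and the plain `log(N / df)` weighting are all assumptions for illustration; your own pipeline may use a smoothed IDF or a normalized TF instead. The point it shows is only that the IDF table is built once from the training documents and then reused for a test document.

```python
import math
from collections import Counter

def fit_idf(train_docs):
    """Learn IDF weights from tokenized training documents only."""
    n_docs = len(train_docs)
    doc_freq = Counter()
    for tokens in train_docs:
        doc_freq.update(set(tokens))  # count each term once per document
    # Assumed weighting: plain log(N / df). Smoothed variants are also common.
    return {term: math.log(n_docs / df) for term, df in doc_freq.items()}

def tfidf_vector(tokens, idf, vocab):
    """TF comes from this document; IDF comes from the training set."""
    tf = Counter(tokens)
    # Terms never seen in training are not in vocab, so they are dropped.
    return [tf[term] * idf[term] for term in vocab]

# Hypothetical toy data, already tokenized.
train_docs = [
    ["apple", "banana", "apple"],
    ["banana", "cherry"],
    ["cherry", "cherry", "date"],
]
test_doc = ["apple", "apple", "apple", "elderberry"]

idf = fit_idf(train_docs)
vocab = sorted(idf)  # fixed column order shared by train and test

X_train = [tfidf_vector(doc, idf, vocab) for doc in train_docs]
x_test = tfidf_vector(test_doc, idf, vocab)
```

Note that `x_test` has the same columns as `X_train`, which is what the SVM needs, and that "elderberry" is ignored because it has no IDF from training. If you happen to use scikit-learn, the equivalent pattern is calling `fit_transform` on the training documents and only `transform` on the test documents with the same `TfidfVectorizer` instance.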

Licensed under: CC-BY-SA with attribution
Not affiliated with StackOverflow