Question

I am trying to build an SVM classifier in SVM Light using the Vector Space Model. I have 1000 documents and a dictionary of terms I will be using to vectorize each document. Of the 1000 documents, 600 will be for my training set, while the remaining 400 will be split evenly (200 each) for my cross-validation set and my test set.

Now suppose that I were to train my SVM classifier using my training set of 600 (vectorized using tf-idf) in order to generate a model for classification.

When I apply the model to my cross-validation set, would I use the same idf (since the model corresponds to my training set), or would I need to compute a new idf based on the cross-validation set? Also, if I were to apply the model to a single document, how would I apply idf, given that this set would contain only one document?


Solution

You compute the idf on your training documents only, and reuse it whenever a new document arrives, whether it belongs to the cross-validation set, the test set, or is a single document on its own. For each new document, count the term frequencies and multiply each by the training idf of that term. If a term does not appear in the training idf, its weight is 0, so it is effectively ignored. The classification is then based on the idf established at training time.
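As a rough illustration of the above, here is a minimal sketch in plain Python. The function names (`fit_idf`, `tfidf_vector`, `to_svmlight_line`), the toy documents, and the plain `log(N / df)` idf formula are all assumptions made for this example, not something from the question; your own tf-idf variant may use different smoothing or normalization. The last helper writes a line in the `label index:value` layout that SVM Light reads, with 1-based feature indices in increasing order.

```python
import math
from collections import Counter

def fit_idf(train_docs):
    """Compute idf per term from the tokenized TRAINING documents only."""
    n_docs = len(train_docs)
    doc_freq = Counter()
    for tokens in train_docs:
        doc_freq.update(set(tokens))  # count each term once per document
    # Assumed formula: idf = log(N / df), no smoothing.
    return {term: math.log(n_docs / df) for term, df in doc_freq.items()}

def tfidf_vector(tokens, idf):
    """Weight one document with a previously fitted idf.

    Terms that never occurred in training have no idf and are dropped,
    which is the same as giving them weight 0.
    """
    counts = Counter(tokens)
    return {term: tf * idf[term] for term, tf in counts.items() if term in idf}

def to_svmlight_line(label, vec, term_index):
    """Format a sparse vector as 'label index:value ...' with ascending indices."""
    pairs = sorted((term_index[t], w) for t, w in vec.items() if w != 0.0)
    return f"{label} " + " ".join(f"{i}:{w:.6f}" for i, w in pairs)

# Hypothetical toy training set (already tokenized).
train_docs = [
    ["cheap", "pills", "offer"],
    ["meeting", "agenda", "offer"],
    ["cheap", "flights", "offer"],
]
idf = fit_idf(train_docs)

# The feature numbering is fixed at training time as well (1-based).
term_index = {term: i for i, term in enumerate(sorted(idf), start=1)}

# A single new document: no new idf is computed; the training idf is reused.
new_doc = ["cheap", "cheap", "unseen", "offer"]
vec = tfidf_vector(new_doc, idf)
line = to_svmlight_line(1, vec, term_index)
print(line)
```

Note that a single document poses no problem here: its idf values come entirely from the training set, so only its own term counts are needed.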

OTHER TIPS

You should use the same idf as for your training set. The classifier was trained on features weighted with that idf, so vectors built with a new idf would be scaled differently from what the model expects, and your results would change.

Licensed under: CC-BY-SA with attribution