Question

I am trying to build an SVM classifier in SVM Light using the Vector Space Model. I have 1000 documents and a dictionary of terms I will be using to vectorize each document. Of the 1000 documents, 600 will be for my training set, while the remaining 400 will be split evenly (200 each) for my cross-validation set and my test set.

Now suppose that I were to train my SVM classifier using my training set of 600 (vectorized using tf-idf) in order to generate a model for classification.

When I apply the model to my cross-validation set, would I use the same idf (since the model corresponds to my training set), or would I need to compute a new idf based on the cross-validation set? Also, if I were to apply the model to a single document, how would I apply idf, given that this set would contain only 1 document?

Solution

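Compute the idf from your training documents only, and reuse it for every new document, whether it comes from the cross-validation set, the test set, or arrives on its own. For each new document, compute the term frequencies from that document and multiply each one by the idf stored for that term during training. This also covers the single-document case: the tf comes from the document itself, while the idf comes from the training set. A term with no idf entry, because it never appeared in the training documents, gets a weight of 0. The classification is then made on vectors weighted with the idf the model was trained on.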
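As a rough illustration of the procedure described above, here is a minimal Python sketch. The function names (fit_idf, tfidf_vector), the toy documents, and the plain log(N / df) form of idf are assumptions made for the example; your own weighting variant may differ, and the resulting vectors would still need to be written out in SVM Light's input format.

```python
import math
from collections import Counter


def fit_idf(train_docs):
    """Compute idf = log(N / df) for every term seen in the training documents."""
    n_docs = len(train_docs)
    doc_freq = Counter()
    for tokens in train_docs:
        # Count each term once per document, however often it occurs there.
        doc_freq.update(set(tokens))
    return {term: math.log(n_docs / df) for term, df in doc_freq.items()}


def tfidf_vector(tokens, idf):
    """Weight one document using the idf table stored from training.

    Terms without an idf entry (never seen in training) are dropped,
    which is equivalent to giving them a weight of 0.
    """
    counts = Counter(tokens)
    return {term: count * idf[term] for term, count in counts.items() if term in idf}


# Hypothetical, already tokenized training documents.
train_docs = [
    ["svm", "kernel", "margin"],
    ["svm", "text", "classification"],
    ["text", "retrieval", "index"],
]
idf = fit_idf(train_docs)

# A single new document: tf comes from the document, idf from training.
new_doc = ["svm", "svm", "margin", "unseen"]
vector = tfidf_vector(new_doc, idf)
```

Because the idf table is stored once and only looked up afterwards, it makes no difference whether one document or two hundred are vectorized at prediction time.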

Other tips

You should use the same idf as for your training set. The classifier was built on features weighted with that idf, so vectors weighted with a newly computed idf would be scaled differently and your results would change.
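To make the point concrete, the following small sketch computes idf separately on two toy collections and compares the weight of the same term. The helper idf_table, the example documents, and the log(N / df) formula are assumptions chosen for illustration only.

```python
import math
from collections import Counter


def idf_table(docs):
    """Return idf = log(N / df) for each term in a tokenized collection."""
    n_docs = len(docs)
    doc_freq = Counter(term for tokens in docs for term in set(tokens))
    return {term: math.log(n_docs / df) for term, df in doc_freq.items()}


# Hypothetical tokenized collections.
train_docs = [["svm", "kernel"], ["svm", "text"], ["svm", "margin"], ["text", "index"]]
validation_docs = [["svm", "text"], ["text", "index"]]

train_idf = idf_table(train_docs)
validation_idf = idf_table(validation_docs)

# The same term receives a different weight depending on the collection:
# "svm" is in 3 of 4 training documents but in 1 of 2 validation documents.
svm_in_train = train_idf["svm"]
svm_in_validation = validation_idf["svm"]
```

A model trained on the first set of weights would see the term "svm" on a different scale if the second table were used, which is why the training idf should be kept.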

Licensed under: CC-BY-SA with attribution
Not affiliated with StackOverflow