Question

In my SVM, I am using tf-idf on the documents for feature extraction. The tf-idf values are calculated over the whole set of training documents.

Now, when I get a test document that I want to classify, how do I generate the vector for it?

I apply stemming before calculating tf-idf, and I can do the same for the test document. I have count_of_words for the training documents.

Should I increment the counts in the training count_of_words for words that appear in the test document before calculating its tf-idf, or should I use count_of_words as is?


Solution

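Calculate them the same way as during training, with one difference: take the idf from the training documents and the tf from the test document. In other words, use the training counts as they are and do not update them with the test document. If many new documents keep coming in, add them to the training data from time to time and retrain your model.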
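A minimal sketch of that idea in plain Python, assuming the documents are already tokenized and stemmed. The helper names fit_idf and tfidf_vector, the toy documents, and the plain log(N / df) weighting are illustrative assumptions, not something taken from the question; libraries often use a smoothed idf instead. The point it shows is that the idf table is built once from the training set and only read when a test document is vectorized.

```python
import math
from collections import Counter


def fit_idf(train_docs):
    """Compute idf per term from tokenized (already stemmed) training documents."""
    n_docs = len(train_docs)
    doc_freq = Counter()
    for tokens in train_docs:
        doc_freq.update(set(tokens))  # count each term once per document
    # Assumption: unsmoothed idf = log(N / df); other variants exist.
    return {term: math.log(n_docs / df) for term, df in doc_freq.items()}


def tfidf_vector(tokens, idf, vocabulary):
    """Build a vector with tf from this document and idf from training."""
    counts = Counter(tokens)
    total = len(tokens) or 1  # avoid division by zero on an empty document
    # Terms outside the training vocabulary have no feature slot and are ignored.
    return [(counts[term] / total) * idf[term] for term in vocabulary]


# Hypothetical toy training set (stemmed tokens).
train_docs = [
    ["cat", "sat", "mat"],
    ["dog", "sat", "log"],
    ["cat", "chase", "dog"],
]
idf = fit_idf(train_docs)
vocabulary = sorted(idf)  # fixed feature order shared by training and test vectors

# "bird" never appeared in training, so it contributes no feature.
test_doc = ["cat", "cat", "bird"]
test_vector = tfidf_vector(test_doc, idf, vocabulary)
```

The test vector has the same length and feature order as the training vectors, so it can be passed to the trained SVM directly. Words seen only in the test document are dropped, since the model has no weight for them until it is retrained.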

Licensed under: CC-BY-SA with attribution
Not affiliated with StackOverflow