Question

I want to classify news articles into the category they belong to. I have 4 news categories, e.g. Technology, Sports, Politics, and Health, and I have collected around 50 documents for each category as a training set.

Is this training data enough for classification? And which algorithm should I use: SVM, Random Forest, k-NN, or something else?

I am using the scikit-learn Python library (http://scikit-learn.org/) for this task.

Thanks

Solution

There are many ways to attack this problem, from CRFs to Random Forests.

With your limited training data, I would suggest going with a high-bias model such as a linear SVM. Start by training one-vs-all models, one per class, and predict the class whose model gives the highest probability (or the highest decision score, since a plain linear SVM does not output probabilities). This will give you a baseline for how hard your problem is with the given training data.
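As a rough illustration of that baseline in scikit-learn, here is a minimal sketch. The toy corpus, the variable names, and the settings (`stop_words="english"`, `C=1.0`, `cv=2`) are placeholders I made up, not part of the question; swap in your own ~50 articles per category. `LinearSVC` trains one-vs-rest models by default and predicts the class with the highest decision score.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.svm import LinearSVC

# Hypothetical toy corpus; replace with your real training documents.
train_texts = [
    "new smartphone chip boosts laptop software performance",
    "startup releases open source software for cloud computing",
    "team wins championship after dramatic final match",
    "striker scores twice as the team wins the league match",
    "parliament passes election reform bill after senate vote",
    "president and senate clash over the election bill",
    "doctors recommend vaccine as hospital reports flu cases",
    "study links diet and exercise to lower disease risk in patients",
]
train_labels = [
    "Technology", "Technology",
    "Sports", "Sports",
    "Politics", "Politics",
    "Health", "Health",
]

# TF-IDF features feeding a linear SVM (one-vs-rest for multiclass).
model = make_pipeline(TfidfVectorizer(stop_words="english"), LinearSVC(C=1.0))
model.fit(train_texts, train_labels)

new_docs = [
    "the senate vote on the election bill",
    "the team wins the final match",
]
predicted = model.predict(new_docs)          # class with the highest score
scores = model.decision_function(new_docs)   # one score per class per document

# Cross-validated accuracy is the "how hard is my problem" baseline.
# cv=2 only because the toy corpus has two documents per class;
# with 50 per class something like cv=5 is more usual.
cv_scores = cross_val_score(model, train_texts, train_labels, cv=2)

print(list(predicted))
print("mean CV accuracy:", cv_scores.mean())
```

If the cross-validated accuracy on your real data is already high, 50 documents per class may well be enough; if it is low or varies a lot between folds, that is a sign to collect more data or try other features.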

OTHER TIPS

I would suggest Naive Bayes classification. There is a tool called LingPipe in which this is already implemented. All you need to do is follow the tutorial at

http://alias-i.com/lingpipe/demos/tutorial/classify/read-me.html

There you will find a small sample program, Classifynews.java. Run it to train on the training data and then apply the test data. The sample training data it uses is the "20 newsgroups" dataset:

http://qwone.com/~jason/20Newsgroups/

You train the classifier on the training data; if needed, you can build an intermediate model and then run the test data through that model. Naive Bayes is a good fit for cases where the training data is small.

But its accuracy increases as the size of the training data increases, so try to include more training documents. Good luck. Try this and let me know how it goes.
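Since the question mentions scikit-learn rather than Java, here is a rough Python sketch of the same Naive Bayes idea. Note this uses scikit-learn's `MultinomialNB` as a stand-in, not LingPipe's classifier, and the toy documents, names, and `alpha=1.0` smoothing value are made-up placeholders.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

# Hypothetical toy corpus of (text, label) pairs; replace with real articles.
docs = [
    ("new smartphone chip boosts laptop software performance", "Technology"),
    ("startup releases open source software for cloud computing", "Technology"),
    ("team wins championship after dramatic final match", "Sports"),
    ("striker scores twice as the team wins the league match", "Sports"),
    ("parliament passes election reform bill after senate vote", "Politics"),
    ("president and senate clash over the election bill", "Politics"),
    ("doctors recommend vaccine as hospital reports flu cases", "Health"),
    ("study links diet and exercise to lower disease risk in patients", "Health"),
]
texts = [text for text, _ in docs]
labels = [label for _, label in docs]

# Word counts feeding multinomial Naive Bayes with Laplace smoothing (alpha=1).
nb_model = make_pipeline(CountVectorizer(stop_words="english"), MultinomialNB(alpha=1.0))
nb_model.fit(texts, labels)

query = ["hospital doctors study flu vaccine for patients"]
nb_prediction = nb_model.predict(query)[0]

# Naive Bayes gives class probabilities directly, unlike a plain linear SVM.
probs = dict(zip(nb_model.classes_, nb_model.predict_proba(query)[0]))

print(nb_prediction)
for label, p in sorted(probs.items()):
    print(f"{label}: {p:.3f}")
```

Keep in mind that Naive Bayes probabilities tend to be overconfident, so treat them as a ranking of classes rather than calibrated estimates.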

Licensed under: CC-BY-SA with attribution
Not affiliated with StackOverflow