Question

Hey, here is my problem:

Given a set of documents, I need to assign each document to a predefined category.

I was going to use an n-gram approach to represent the text content of each document and then train an SVM classifier on the training data I have.
Please correct me if I misunderstood something.

The problem is that the categories need to be dynamic, meaning my classifier should be able to handle new training data that introduces a new category.

For example, say I trained a classifier to label a given document as category A, category B, or category C, and then I was given new training data for category D. I should be able to incrementally train my classifier by feeding it only the new training data for category D.

To summarize, I do NOT want to combine the old training data (with 3 categories) and the new training data (with the new/unseen category) and retrain my classifier from scratch. I want to train my classifier on the fly.

Is this possible to implement with an SVM? If not, could you recommend some classification algorithms, or a book/paper that could help me?

Thanks in advance.


Solution

Naive Bayes is a relatively fast incremental classification algorithm.
KNN is also incremental by nature, and even simpler to implement and understand.

Both algorithms are implemented in the open-source project Weka, as NaiveBayes and IBk (the KNN implementation).
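For illustration, here is a minimal sketch of what incremental training looks like with Weka's Java API (the ARFF file name and attribute layout are placeholders, not part of the original answer). Weka's incremental Naive Bayes is the NaiveBayesUpdateable class; IBk can be updated the same way, since both implement the UpdateableClassifier interface:

```java
import java.io.File;

import weka.classifiers.bayes.NaiveBayesUpdateable;
import weka.core.Instance;
import weka.core.Instances;
import weka.core.converters.ArffLoader;

public class IncrementalWekaSketch {
    public static void main(String[] args) throws Exception {
        // Read only the ARFF header first, then stream instances one at a time.
        ArffLoader loader = new ArffLoader();
        loader.setFile(new File("documents-train.arff")); // placeholder file name
        Instances structure = loader.getStructure();
        structure.setClassIndex(structure.numAttributes() - 1); // class = last attribute

        // Build the model from the header only, then update it instance by instance.
        NaiveBayesUpdateable classifier = new NaiveBayesUpdateable();
        classifier.buildClassifier(structure);

        Instance current;
        while ((current = loader.getNextInstance(structure)) != null) {
            classifier.updateClassifier(current); // incremental update, no retraining
        }

        // When new labelled documents arrive later, keep calling
        // updateClassifier(...) on the same model instead of rebuilding it.
    }
}
```

One caveat for the new-category requirement: Weka fixes the set of class values in the dataset header, so a genuinely new category D generally has to be declared in the class attribute up front (or the header rebuilt) before instances labelled D can be fed in.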

However, from personal experience, both are vulnerable to large numbers of non-informative features (which is usually the case with text classification), so some kind of feature selection is typically used to squeeze better performance out of these algorithms; that step can itself be problematic to implement incrementally.

OTHER TIPS

This blog post by Edwin Chen describes infinite mixture models for clustering. I think this method supports automatically determining the number of clusters, but I am still trying to wrap my head all the way around it.

The class of algorithms that matches your criteria is called "incremental algorithms". There are incremental versions of almost any method. The easiest to implement is Naive Bayes.
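To make that concrete, here is a hedged, minimal sketch (not from the original answer) of an incremental multinomial Naive Bayes: the model is just frequency counts, so updating it with a new document, even one from a previously unseen category such as D, is only a matter of incrementing counters. The token lists and labels below are made-up examples, and the features are plain tokens rather than n-grams for brevity.

```java
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.stream.Collectors;

/** Minimal multinomial Naive Bayes that learns one document at a time. */
public class IncrementalNaiveBayes {
    private final Map<String, Integer> docsPerClass = new HashMap<>();
    private final Map<String, Map<String, Integer>> wordCounts = new HashMap<>();
    private final Map<String, Integer> totalWordsPerClass = new HashMap<>();
    private int totalDocs = 0;

    /** Incremental update: works even if the label has never been seen before. */
    public void update(List<String> tokens, String label) {
        totalDocs++;
        docsPerClass.merge(label, 1, Integer::sum);
        Map<String, Integer> counts = wordCounts.computeIfAbsent(label, k -> new HashMap<>());
        for (String token : tokens) {
            counts.merge(token, 1, Integer::sum);
            totalWordsPerClass.merge(label, 1, Integer::sum);
        }
    }

    /** Picks the class with the highest log-probability, using add-one smoothing. */
    public String classify(List<String> tokens) {
        int vocabulary = wordCounts.values().stream()
                .flatMap(m -> m.keySet().stream())
                .collect(Collectors.toSet())
                .size();
        String best = null;
        double bestScore = Double.NEGATIVE_INFINITY;
        for (String label : docsPerClass.keySet()) {
            double score = Math.log(docsPerClass.get(label) / (double) totalDocs);
            Map<String, Integer> counts = wordCounts.get(label);
            int total = totalWordsPerClass.getOrDefault(label, 0);
            for (String token : tokens) {
                int c = counts.getOrDefault(token, 0);
                score += Math.log((c + 1.0) / (total + vocabulary + 1.0));
            }
            if (score > bestScore) {
                bestScore = score;
                best = label;
            }
        }
        return best;
    }

    public static void main(String[] args) {
        IncrementalNaiveBayes model = new IncrementalNaiveBayes();
        // Train on categories A and B first...
        model.update(List.of("invoice", "payment", "total"), "A");
        model.update(List.of("match", "goal", "team"), "B");
        // ...then add a brand-new category D later, without touching the old data.
        model.update(List.of("quantum", "physics", "energy"), "D");
        System.out.println(model.classify(List.of("physics", "energy"))); // prints D
    }
}
```

The same idea is what makes Naive Bayes attractive for the question above: the old training data never has to be revisited, because all of it is already summarized in the counts.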
