Question

I have about 15k text instances, each hand-categorized into one of 120 categories. The texts are emails from customers. Class frequencies range from 1 to 2000 instances. I would like to train a classifier on this data so that subsequent emails can be classified automatically.

I have tried both Naive Bayes and SVM, but they classify only 51% and 57% of instances correctly, respectively. I applied a stemmer, removed stop words, and converted the text to lower case.

I am sure a text classification task with such a large number of categories and such an uneven distribution has to be approached differently, but I could not find any reference for such a case. Any recommendations?

Thanks in advance!

Solution

I assume that classes are not overlapping (that is, exactly one class per message).

A useful approach for imbalanced classes is to use asymmetric misclassification costs, which force the classifier to focus on the less represented classes: errors on a rare class are assigned a much higher cost than errors on the other classes.
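
To make the idea of per-class costs concrete, here is a minimal sketch that derives one weight per class, inversely proportional to its frequency. The formula n_samples / (n_classes * class_count) is a common heuristic (it is, for example, what scikit-learn's "balanced" class-weight mode computes), not something prescribed above; the function name and the toy labels are made up for illustration.

```python
from collections import Counter

def balanced_class_weights(labels):
    """Return one weight per class, inversely proportional to its frequency."""
    counts = Counter(labels)
    n_samples = len(labels)
    n_classes = len(counts)
    # Rare classes get large weights, frequent classes get small ones.
    return {cls: n_samples / (n_classes * count) for cls, count in counts.items()}

# Toy labels (hypothetical): one frequent class and two rare ones.
labels = ["billing"] * 8 + ["refund"] * 1 + ["complaint"] * 1
weights = balanced_class_weights(labels)

for cls, weight in sorted(weights.items()):
    print(f"{cls}: {weight:.3f}")
```

Weights like these can then be passed to any learner that accepts per-class or per-instance weights, so that a mistake on a rare class costs more than a mistake on a frequent one.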

This is relatively easy to do in WEKA (see e.g. Class imbalanced distribution and WEKA cost sensitive learning) for binary classifiers, but it is much harder to set up with 120 classes. Consequently, one approach would be to turn this problem into 120 binary problems (one-against-the-rest) and set up an appropriate cost matrix for each problem.
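
The one-against-the-rest decomposition can be sketched in plain Python, independent of WEKA. For each class the sketch below builds binary targets and a 2x2 cost matrix. The choice of cost (a missed positive costs negatives/positives, a false alarm costs 1) is only an assumption for illustration; in practice the costs would be tuned per problem. The function name and toy labels are hypothetical.

```python
from collections import Counter

def one_vs_rest_problems(labels):
    """Map each class to (binary_targets, cost_matrix) for its binary problem."""
    counts = Counter(labels)
    n = len(labels)
    problems = {}
    for cls, pos in counts.items():
        neg = n - pos
        # 1 marks "this class", 0 marks "the rest".
        targets = [1 if y == cls else 0 for y in labels]
        # cost[actual][predicted]; correct decisions cost nothing.
        # Assumed costs: false alarm = 1, missed positive = neg / pos.
        cost = [[0.0, 1.0],
                [neg / pos, 0.0]]
        problems[cls] = (targets, cost)
    return problems

# Toy labels (hypothetical).
labels = ["a"] * 6 + ["b"] * 3 + ["c"] * 1
problems = one_vs_rest_problems(labels)

for cls, (targets, cost) in sorted(problems.items()):
    print(cls, sum(targets), "positives, miss cost", round(cost[1][0], 2))
```

With this scheme the rarest class gets the highest penalty for a missed positive, which is the effect the cost matrices are meant to achieve.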

A more viable alternative in my experience, given the high number of classes, is to collapse the infrequent classes into a bigger "other" class. This seems more useful in a practical setting: there is an "other" folder for a human expert to check, while most of the time the classifier correctly assigns the emails to the remaining, well-populated classes.
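
Collapsing the infrequent classes is a small relabeling step before training. A minimal sketch follows; the threshold min_count, the label name "other", and the toy labels are all assumptions chosen for illustration, and a sensible threshold would have to be picked from the real class counts.

```python
from collections import Counter

def collapse_rare_classes(labels, min_count=20, other_label="other"):
    """Relabel every class with fewer than min_count instances as other_label."""
    counts = Counter(labels)
    return [y if counts[y] >= min_count else other_label for y in labels]

# Toy labels (hypothetical): "refund" and "legal" are too rare to keep.
labels = ["billing"] * 5 + ["refund"] * 2 + ["legal"] * 1
collapsed = collapse_rare_classes(labels, min_count=3)

print(Counter(collapsed))
```

The classifier is then trained on the collapsed labels, and anything it assigns to "other" goes to the human-checked folder.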

As a final note, an accuracy of about 50-60% may not be too bad after all, depending on the distribution of classes. For instance, the majority classifier (the one that assigns every instance to the most populated class) on a 99%-1% class split would be 99% accurate; however, it is absolutely useless, because it misses the interesting examples. In real life this happens in email spam filtering, fraud spotting, and many other domains.
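
The 99%-1% example can be checked with a few lines of code. The sketch below computes the majority-classifier accuracy together with per-class recall, which exposes what plain accuracy hides. The function names and the "ham"/"spam" labels are made up for illustration. For the data in the question, if the largest class has about 2000 of roughly 15k emails, the same baseline would be only around 13%, so 51-57% is well above it.

```python
from collections import Counter

def majority_baseline(labels):
    """Return (majority_class, accuracy of always predicting that class)."""
    cls, count = Counter(labels).most_common(1)[0]
    return cls, count / len(labels)

def per_class_recall(y_true, y_pred):
    """Fraction of each true class that was predicted correctly."""
    total = Counter(y_true)
    hits = Counter(t for t, p in zip(y_true, y_pred) if t == p)
    return {cls: hits[cls] / total[cls] for cls in total}

# The 99%-1% split from the text, with hypothetical class names.
y_true = ["ham"] * 99 + ["spam"] * 1
majority, accuracy = majority_baseline(y_true)
y_pred = [majority] * len(y_true)
recall = per_class_recall(y_true, y_pred)

print(majority, accuracy)  # ham 0.99
print(recall)              # {'ham': 1.0, 'spam': 0.0}
```

Reporting per-class recall (or its average over classes) next to accuracy makes it easier to judge whether the rare classes are being handled at all.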

Licensed under: CC-BY-SA with attribution