Question

I have designed and implemented a Naive Bayes Text Classifier (in Java). I am primarily using it to classify tweets into 20 classes. To determine the probability that a document belongs to a class, I use:

foreach (class c)
{
    P(c | document) = (P(bag of words | c) * P(c)) / P(bag of words globally)
}
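
In actual Java it looks roughly like the sketch below (pClass, pWordGivenClass and pBagGlobal are placeholders for my estimators, not real method names; I work with sums of logs to avoid underflow when multiplying many small word probabilities):

import java.util.HashMap;
import java.util.List;
import java.util.Map;

abstract class NaiveBayesSketch
{
    // Placeholders for the estimators the classifier already has.
    // Estimates are assumed to be smoothed, so no probability is exactly zero.
    abstract double pClass(String c);                       // P(class)
    abstract double pWordGivenClass(String word, String c); // P(word | class)
    abstract double pBagGlobal(List<String> bag);           // P(bag of words globally)

    // Score every class for one document. Sums of logs are used instead
    // of raw products to avoid floating-point underflow.
    Map<String, Double> score(List<String> classes, List<String> bag)
    {
        Map<String, Double> logPosterior = new HashMap<>();
        double logEvidence = Math.log(pBagGlobal(bag));
        for (String c : classes)
        {
            double logScore = Math.log(pClass(c));
            for (String w : bag)
            {
                logScore += Math.log(pWordGivenClass(w, c)); // naive independence assumption
            }
            logPosterior.put(c, logScore - logEvidence);
        }
        return logPosterior;
    }
}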

What is the best way to determine that a bag of words really shouldn't belong to any class? I'm aware I could just set a minimum threshold for P(bag of words occurring for class) and, if all classes fall under that threshold, mark the document as unclassified; however, I realise this reduces the classifier's sensitivity.
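
For illustration, the check I have in mind is roughly the following (minProb being a hand-picked constant, which is exactly the part that worries me):

import java.util.Map;

class RejectionSketch
{
    // Sketch of the minimum-threshold idea: if no class clears minProb,
    // the document falls back to "unclassified".
    static String classify(Map<String, Double> classProbabilities, double minProb)
    {
        String best = null;
        double bestProb = Double.NEGATIVE_INFINITY;
        for (Map.Entry<String, Double> e : classProbabilities.entrySet())
        {
            if (e.getValue() > bestProb)
            {
                bestProb = e.getValue();
                best = e.getKey();
            }
        }
        return bestProb >= minProb ? best : "unclassified";
    }
}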

Would an option be to create an Unclassified class and train it with documents I deem to be unclassifiable?

Thanks,

Mark

--- Edit ---

I just had a thought: I could set a maximum threshold on P(bag of words occurring globally) * (number of words in document). This would mean that any document consisting mainly of common words (typically the tweets I want to filter out, e.g. "Yes I agree with you") would be filtered out. Your thoughts on this would be appreciated too.

Or perhaps I should compute the standard deviation of the per-class probabilities and, if it is low, mark the document as unclassified?
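
Sketched in code, these two heuristics would look roughly like the following (the 0.5 threshold is invented and would need tuning on real tweets):

import java.util.Collection;

class FilterSketch
{
    // Heuristic 1: reject documents made up mostly of globally common words,
    // i.e. where P(bag of words globally) * wordCount is high.
    static boolean mostlyCommonWords(double pBagGlobal, int wordCount)
    {
        return pBagGlobal * wordCount > 0.5; // invented threshold, needs tuning
    }

    // Heuristic 2: reject when the per-class probabilities are nearly uniform,
    // i.e. their standard deviation is low, so no class clearly wins.
    static boolean lowSpread(Collection<Double> classProbs, double minStdDev)
    {
        double mean = 0.0;
        for (double p : classProbs) mean += p;
        mean /= classProbs.size();
        double variance = 0.0;
        for (double p : classProbs) variance += (p - mean) * (p - mean);
        variance /= classProbs.size();
        return Math.sqrt(variance) < minStdDev;
    }
}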


Solution

I see two different options, treating the problem as a set of 20 binary classification problems.

  1. You can compute the likelihood ratio P(doc is in class) / P(doc is not in class). Some Naive Bayes implementations use this kind of method.
  2. Assuming that you have some evaluation measure, you can compute a threshold per class and optimise it via cross-validation. This is the standard way of applying text classification: you would still use thresholds (one per class), but they would be derived from your data. In your case SCut or SCutFBR would be the best option, as explained in this paper. A sketch of this tuning loop follows the list.
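
Here is a rough sketch of option 2, assuming you already have per-class scores for a labelled validation set and some metric implementation (all names here are illustrative, not from a particular library):

import java.util.HashMap;
import java.util.List;
import java.util.Map;

class ScutSketch
{
    // Evaluation measure, e.g. F1 of treating "score >= threshold" as a
    // positive prediction for the class over a labelled validation set.
    interface Metric
    {
        double evaluate(String clazz, double threshold);
    }

    // For each class, keep the candidate threshold that maximises the
    // metric on the validation data (the SCut idea).
    static Map<String, Double> tuneThresholds(List<String> classes,
                                              List<Double> candidateThresholds,
                                              Metric metric)
    {
        Map<String, Double> thresholds = new HashMap<>();
        for (String c : classes)
        {
            double bestT = 0.0;
            double bestScore = -1.0;
            for (double t : candidateThresholds)
            {
                double s = metric.evaluate(c, t);
                if (s > bestScore)
                {
                    bestScore = s;
                    bestT = t;
                }
            }
            thresholds.put(c, bestT); // label c is assigned iff score(c) >= bestT
        }
        return thresholds;
    }
}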

Regards,

Licensed under: CC-BY-SA with attribution
Not affiliated with StackOverflow