Naive bayes text classification fails in one category. Why? [closed]

https://stackoverflow.com/questions/16601418

29-05-2022

Question

I am implementing a Naive Bayes classifier for text category detection. I have 37 categories and get about 36% accuracy on my test set.

I want to improve accuracy, so I decided to implement 37 two-way classifiers, as suggested in many sources ("Ways to improve the accuracy of a Naive Bayes Classifier?" is one of them). For a given text, each classifier would answer:

specific_category OR everything_else

and I would determine the text's category by applying them sequentially.

But I have a problem with the first classifier: it always assigns the text to the "specific_category" category.

My training data has 37 categories, with 100 documents of the same size per category. For each category I selected a list of 50 features using the mutual information criterion (features are just words).

For the sake of example, I use two categories "agriculture" and "everything_else" (except agriculture).

For category "agriculture":

number of words in all documents of this class 
(first term in denominator in http://nlp.stanford.edu/IR-book/pdf/13bayes.pdf, (13.7))
W_agriculture = 31649.

Size of vocabulary V_agriculture = 6951.
Log probability of Unknown word (UNK) P(UNK|agriculture) = -10.56
Log probability of class P(agriculture) = log(1/37) = -3.61 (we have 37 categories of same-size documents)

For category "everything_else":

W_everything_else = 1030043
V_everything_else = 44221
P(UNK|everything_else) = -13.89
P(everything_else) = log(36/37) = -0.03
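
The UNK figures above are consistent with add-one (Laplace) smoothing, where a word with zero count gets probability 1 / (W + V). A minimal Python sketch that reproduces them, assuming that is how they were computed (the function name `log_p_unseen` is illustrative, not from the original code):

```python
import math

def log_p_unseen(total_words: int, vocab_size: int) -> float:
    """Log probability of a zero-count word under add-one smoothing."""
    return math.log(1.0 / (total_words + vocab_size))

# W and V for each side, taken from the figures above
log_unk_agri = log_p_unseen(31649, 6951)
log_unk_else = log_p_unseen(1030043, 44221)

# Class priors for 1 category vs. the remaining 36
log_prior_agri = math.log(1 / 37)
log_prior_else = math.log(36 / 37)

print(round(log_unk_agri, 2), round(log_unk_else, 2))
print(round(log_prior_agri, 2), round(log_prior_else, 2))
```

The smaller W + V is, the larger the probability assigned to every unseen word, which is why the small class gets the higher UNK value.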

Then I have a text not related to agriculture that consists mostly of unknown words (UNK). It has 270 words, and most of them are unknown to both "agriculture" and "everything_else". Let's assume 260 words are UNK for "everything_else" and the other 10 are known.

Then, when I calculate probabilities

P(text|agriculture) = P(agriculture) + SUM(P(UNK|agriculture) for 270 times) 
P(text|everything_else) = P(everything_else) + SUM(P(UNK|everything_else) for 260 times) + SUM(P(word|everything_else) for 10 times)

In the last line we counted 260 words as UNK and 10 as known for a category.

Main problem: since P(UNK|agriculture) >> P(UNK|everything_else) (in log terms it is much greater), the influence of those 270 P(UNK|agriculture) terms outweighs the influence of the sum of P(word|everything_else) over the known words in the text. Because

SUM(P(UNK|agriculture) for 270 times) = -2851.2
SUM(P(UNK|everything_else) for 260 times) = -3611.4

and the first sum is much larger. Neither P(agriculture) nor SUM(P(word|everything_else) for 10 words) can correct it, because the difference is huge. So the text always ends up in the "agriculture" category even though it does not belong there.
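
The comparison can be checked with a few lines of Python. This is only a sketch using the rounded figures from this post. The log probabilities of the 10 known words are not given here, so the code assumes the best possible case for "everything_else", a log probability of 0.0 for each of them, which is an upper bound and not a measured value:

```python
# Rounded log values quoted in the question
log_prior = {"agriculture": -3.61, "everything_else": -0.03}
log_unk = {"agriculture": -10.56, "everything_else": -13.89}

# All 270 words are treated as UNK for "agriculture"
score_agri = log_prior["agriculture"] + 270 * log_unk["agriculture"]

# Assumption: each known word contributes log-probability 0.0, the largest
# value a log probability can take, so this is an upper bound on the score
best_known = 10 * 0.0
score_else_upper = (
    log_prior["everything_else"]
    + 260 * log_unk["everything_else"]
    + best_known
)

gap = score_agri - score_else_upper
print(score_agri, score_else_upper, gap)
```

Because log probabilities are never positive, the 10 known words can only lower the "everything_else" score further, so under these figures the gap cannot be closed by them.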

The question is: am I missing something? Or how should I deal with a large number of UNK words whose probability is significantly higher for small categories?

UPD: I tried enlarging the training data for the "agriculture" category (just concatenating the documents 36 times) so that it matches "everything_else" in number of documents. It helped for a few categories but not much for others. I suspect that, because of the smaller word count and dictionary size, P(UNK|specific_category) gets bigger and outweighs P(UNK|everything_else) when summed 270 times.

So it seems this method is very sensitive to the number of words in the training data and to the vocabulary size. How can I overcome this? Maybe bigrams/trigrams would help?

Solution

Right, ok. You're pretty confused, but I'll give you a couple of basic pointers.

Firstly, even if you're following a 1-vs-all scheme, you can't have different vocabularies for the different classes. If you do this, the event spaces of the random variables are different, so probabilities are not comparable. You need to decide on a single common vocabulary for all classes.

Secondly, throw out the unknown token. It doesn't help you. Ignore any words that aren't part of the vocabulary you decide upon.
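
To make these first two points concrete, here is a minimal sketch of a multinomial Naive Bayes scorer that uses one shared vocabulary for every class and simply skips out-of-vocabulary words. The function names, the toy documents and the add-one smoothing are my assumptions for illustration, not your actual setup:

```python
import math
from collections import Counter

def train(docs_by_class, vocab):
    """docs_by_class: {label: [token list, ...]}; vocab: one shared set of words."""
    n_docs = sum(len(docs) for docs in docs_by_class.values())
    model = {}
    for label, docs in docs_by_class.items():
        # Count only in-vocabulary tokens
        counts = Counter(t for doc in docs for t in doc if t in vocab)
        total = sum(counts.values())
        model[label] = {
            "log_prior": math.log(len(docs) / n_docs),
            # Add-one smoothing over the SAME vocabulary for every class
            "log_lik": {
                w: math.log((counts[w] + 1) / (total + len(vocab))) for w in vocab
            },
        }
    return model

def score(model, tokens):
    """Return {label: log p(label) + sum of log p(word|label)}, skipping OOV words."""
    return {
        label: m["log_prior"]
        + sum(m["log_lik"][t] for t in tokens if t in m["log_lik"])
        for label, m in model.items()
    }

# Toy data, made up for this example
docs = {
    "agriculture": [["wheat", "farm", "crop"], ["farm", "soil", "crop"]],
    "everything_else": [["stock", "market", "bank"], ["match", "goal", "team"]],
}
vocab = {"wheat", "farm", "crop", "soil", "stock", "market", "bank", "goal", "team"}

model = train(docs, vocab)
scores = score(model, ["zzz", "qqq", "bank", "market", "xyz"])
print(max(scores, key=scores.get))
```

With this setup the three out-of-vocabulary tokens contribute nothing to either class, so only "bank" and "market" influence the decision.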

Thirdly, I don't know what you're doing with summing probabilities. You're confused about taking logs, I think. This formula is not correct:

P(text|agriculture) = P(agriculture) + SUM(P(UNK|agriculture) for 270 times)

Instead it's:

p(text|agriculture) = p(agriculture) * p(unk|agriculture)^270 * p(all other words in doc|agriculture)

If you take logs, this becomes:

log(p(text|agriculture)) = log(p(agriculture)) + 270*log(p(unk|agriculture)) + log(p(all other words in doc|agriculture))
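
As a side note on why the log form is the one to compute with, here is a small sketch using the -10.56 value from your post (the variable names are mine). In double precision the direct product of 270 such factors underflows, while the log form stays representable:

```python
import math

log_p_unk = -10.56            # log p(unk|agriculture), as quoted in the question
p_unk = math.exp(log_p_unk)   # the probability itself, roughly 2.6e-5

product = p_unk ** 270        # direct product of 270 equal factors
log_sum = 270 * log_p_unk     # the same quantity expressed in log space

print(product)
print(log_sum)
```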

Finally, if your classifier is right, there's no real reason to believe that one-vs-all will work better than just a straight n-way classification. Empirically it might, but theoretically their results should be equivalent. In any case, you shouldn't apply decisions sequentially, but do all n 2-way problems and assign to the class where the positive probability is highest.
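
A rough sketch of that last point, assuming add-one smoothing, a shared vocabulary, and made-up toy data (all names here are illustrative). It trains one "class vs. everything else" model per class, computes the posterior of the positive side for each, and picks the class where that posterior is highest, with no sequential decisions:

```python
import math
from collections import Counter

def train_binary(pos_docs, neg_docs, vocab):
    """Train one 'class vs. everything else' model over a shared vocabulary."""
    n = len(pos_docs) + len(neg_docs)
    model = {}
    for side, docs in (("pos", pos_docs), ("neg", neg_docs)):
        counts = Counter(t for d in docs for t in d if t in vocab)
        total = sum(counts.values())
        log_lik = {w: math.log((counts[w] + 1) / (total + len(vocab))) for w in vocab}
        model[side] = (math.log(len(docs) / n), log_lik)
    return model

def log_posterior_pos(model, tokens):
    """log p(pos | tokens), normalised over the two sides of one binary model."""
    joint = {}
    for side, (log_prior, log_lik) in model.items():
        # Words outside the shared vocabulary are ignored
        joint[side] = log_prior + sum(log_lik[t] for t in tokens if t in log_lik)
    m = max(joint.values())
    log_norm = m + math.log(sum(math.exp(v - m) for v in joint.values()))
    return joint["pos"] - log_norm

def classify(docs_by_class, vocab, tokens):
    """Run all n binary problems and return the label with the highest positive posterior."""
    scores = {}
    for label, pos in docs_by_class.items():
        neg = [d for other, ds in docs_by_class.items() if other != label for d in ds]
        # Retraining per call keeps the sketch short; a real system would cache models
        scores[label] = log_posterior_pos(train_binary(pos, neg, vocab), tokens)
    return max(scores, key=scores.get), scores

# Toy data, made up for this example
docs = {
    "agriculture": [["wheat", "farm", "crop"], ["farm", "soil", "crop"]],
    "finance": [["stock", "market", "bank"], ["bank", "loan", "market"]],
    "sport": [["match", "goal", "team"], ["team", "goal", "league"]],
}
vocab = {t for ds in docs.values() for d in ds for t in d}

label, scores = classify(docs, vocab, ["bank", "loan", "unseen", "market"])
print(label)
```

Normalising within each binary model makes the positive posteriors comparable across the n classifiers, which the raw joint scores would not be.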

Licensed under: CC-BY-SA with attribution
