Question

Hi, I am using Spark ML to optimise a Naive Bayes multi-class classifier.

I have about 300 categories and I am classifying text documents. The training set is fairly balanced, with about 300 training examples for each category.

All looks good and the classifier works with acceptable precision on unseen documents. But what I am noticing is that when classifying a new document, the classifier often assigns a very high probability to one of the categories (the predicted probability is almost equal to 1), while all the other categories receive very low probabilities (close to zero).

What are the possible reasons for this phenomenon?

  • I can think of one possible reason: some of the words in a given document may never have appeared with certain categories in the training dataset. But I am not totally convinced of that, especially since in most of the results there is always one category with a very high probability while all the others have very low probabilities.
  • Are there any other explanations or reasons?

I would like to add that in Spark ML there is something called the "raw prediction". When I look at it, I can see negative numbers of more or less comparable magnitude, so even the category with the very high probability has a comparable raw prediction score, but I am having difficulty interpreting these scores.
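For concreteness, here is a small numeric sketch of what I mean (the raw values are made up). As far as I can tell, Spark's multinomial Naive Bayes treats rawPrediction as per-class log-space scores and converts them to the probability column by exponentiating and normalising, so scores of comparable magnitude in log space still come out as probabilities of almost exactly 0 or 1:

    import numpy as np

    # Hypothetical rawPrediction values for 3 of the ~300 classes:
    # negative log-space scores of comparable magnitude.
    raw = np.array([-100.0, -105.0, -110.0])

    # What Spark appears to do to produce the probability column:
    # subtract the max for numerical stability, exponentiate, and
    # normalise (i.e. a softmax over the log-space scores).
    stable = raw - raw.max()
    prob = np.exp(stable) / np.exp(stable).sum()

    print(prob)  # ~[9.93e-01, 6.69e-03, 4.51e-05]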


Solution

The reason NB is called "Naive" is that it makes the assumption that the predictor variables are all independent. This assumption usually skews the model scores (which, under the above naive assumption, would be unbiased probability estimates) towards 0 or 1.

In your case, for example, the presence of the words "flower" and "petal" indicates the gardening category, but because the presence of these words is not independent (if one is present, the other is likely to be present too), the model will overvalue their appearance. Taking the extreme case, if words A and B only ever appear together, then

P(Category=X | A & B) = P(Category=X | B) = P(Category=X | A)

and thus one should multiply by the odds ratio of A&B once, not twice as the Naive Bayes algorithm does; a toy numeric sketch of this double counting follows.
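To see how this inflates the probabilities, consider a toy two-class case with made-up numbers: word A has a 9:1 likelihood ratio in favour of Category=X, and word B always co-occurs with A, so it carries no extra evidence.

    # Toy illustration of the double counting above (numbers are made up).
    prior_odds = 1.0   # P(Category=X) = 0.5
    lr_a = 9.0         # likelihood ratio P(A | X) / P(A | not X)

    # Correct: the shared evidence of A and B counts only once.
    correct_odds = prior_odds * lr_a
    print(correct_odds / (1 + correct_odds))   # 0.9

    # Naive Bayes: assumes A and B are independent and multiplies twice.
    nb_odds = prior_odds * lr_a * lr_a
    print(nb_odds / (1 + nb_odds))             # ~0.988, pushed toward 1

With hundreds of correlated words per document this effect compounds, which is why the predicted probabilities end up saturating at almost exactly 0 or 1.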

Your remedy is calibration: fit a separate model that maps the classifier's scores to well-calibrated probabilities.
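As far as I know, Spark ML does not ship a built-in calibrator, so here is a minimal sketch of the idea using scikit-learn on synthetic made-up data (CalibratedClassifierCV fits an isotonic or sigmoid mapping from the Naive Bayes scores to probabilities on held-out folds). In Spark, the same idea could be approximated by fitting, for example, pyspark.ml.regression.IsotonicRegression per class on held-out scores.

    import numpy as np
    from sklearn.calibration import CalibratedClassifierCV
    from sklearn.naive_bayes import MultinomialNB

    rng = np.random.default_rng(0)

    # Synthetic bag-of-words counts for 3 classes, with duplicated
    # (perfectly correlated) indicative words so that plain Naive Bayes
    # saturates its probabilities near 0 and 1.
    n, vocab = 600, 20
    y = rng.integers(0, 3, size=n)
    X = rng.poisson(1.0, size=(n, vocab))
    X[:, :3] += 3 * (y[:, None] == np.arange(3))  # class-indicative words
    X[:, 3:6] = X[:, :3]                          # duplicated words

    plain = MultinomialNB().fit(X, y)
    calibrated = CalibratedClassifierCV(
        MultinomialNB(), method="isotonic", cv=5
    ).fit(X, y)

    print(plain.predict_proba(X[:1]).max())       # typically very close to 1
    print(calibrated.predict_proba(X[:1]).max())  # usually far less extreme

Note that the calibrated model no longer double-counts its way to certainty: its top-class probability reflects how often documents with similar scores actually belonged to that class in the held-out folds.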

Licensed under: CC-BY-SA with attribution
Not affiliated with datascience.stackexchange