Question

I am trying to classify a sample using Naive Bayes. My sample has 2.8 million records; 90% of the records have class label (dependent variable) "0" and the rest have "1". The testing set has the same distribution (90% / 10%). The Naive Bayes classifier labels the entire testing set as "0". How do I deal with this case? Are there other algorithms that can be used in such cases?

No correct solution

OTHER TIPS

Your problem may or may not be solved by using a better classifier. The issue here is that your classes are imbalanced. If the data is not separable, then 90% accuracy might be the best achievable performance, and the classifier reaches it by always predicting the majority class. If this is not the behaviour you want, use a cost function that penalises missed positives more heavily, or resample your positives so that the two classes are more evenly represented.
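As a rough illustration of the resampling option, here is a minimal sketch of random oversampling using only the Python standard library. The function name `oversample_minority` and the toy data are made up for this example; it simply duplicates randomly chosen minority rows until the classes are the same size, which is only one of several possible resampling schemes.

```python
import random
from collections import Counter

def oversample_minority(rows, labels, seed=0):
    """Randomly duplicate minority-class rows until every class
    has as many rows as the majority class (hypothetical helper)."""
    rng = random.Random(seed)
    counts = Counter(labels)
    majority_label, majority_count = counts.most_common(1)[0]

    out_rows, out_labels = list(rows), list(labels)
    for label, count in counts.items():
        if label == majority_label:
            continue
        # All original rows that belong to this minority class
        pool = [r for r, l in zip(rows, labels) if l == label]
        # Draw with replacement to make up the difference
        extra = [rng.choice(pool) for _ in range(majority_count - count)]
        out_rows.extend(extra)
        out_labels.extend([label] * len(extra))
    return out_rows, out_labels

# Toy data with the same 90% / 10% split as in the question
rows = [[i] for i in range(100)]
labels = [0] * 90 + [1] * 10

balanced_rows, balanced_labels = oversample_minority(rows, labels)
balanced_counts = Counter(balanced_labels)
print(balanced_counts)
```

Resampling would normally be applied to the training set only, so that the test set keeps the real class distribution and the reported performance stays honest.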

There are dozens of classifiers, including:

  • Logistic regression
  • SVM
  • Decision tree
  • Neural Network
  • Random forest
  • many, many more...

Most of these can handle class imbalance using some built-in technique; for example, for SVMs it is "class weighting" (available in scikit-learn).
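To give a feel for what class weighting does, the sketch below computes inverse-frequency weights by hand. It follows the heuristic `n_samples / (n_classes * class_count)`, which is the formula scikit-learn documents for its `class_weight="balanced"` option; the helper name `balanced_class_weights` is invented for this example.

```python
from collections import Counter

def balanced_class_weights(labels):
    """Weight each class inversely to its frequency:
    n_samples / (n_classes * count_of_that_class)."""
    counts = Counter(labels)
    n_samples = len(labels)
    n_classes = len(counts)
    return {label: n_samples / (n_classes * count)
            for label, count in counts.items()}

# Same 90% / 10% split as in the question
labels = [0] * 90 + [1] * 10
weights = balanced_class_weights(labels)
print(weights)
```

With these weights a misclassified "1" costs the model nine times as much as a misclassified "0", so predicting "0" everywhere is no longer the cheapest solution. In scikit-learn the same effect should be obtainable by passing `class_weight="balanced"` (or an explicit dictionary like the one above) to an estimator that supports it, such as `SVC`.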

So why does NB fail? Naive Bayes is very naive: it assumes that the features are independent of each other given the class, which is rarely the case. It is a simple idea to understand, but in general a rather weak classifier.

Almost all classification methods do not actually return a binary result but a propensity score (usually between 0 and 1) indicating how likely it is that the given case falls within the category. Binary results are then created by picking a cut-off point, usually 0.5.

When you want to identify rare cases using weak predictors, any classification method may be unable to find cases with a propensity score higher than 0.5, resulting in all 0s, as in your case.

There are 3 things you can do in such a situation:

  • I recommend finding stronger predictors if at all possible
  • A different statistical method may be better at identifying patterns in your data set
  • Lowering the cut-off point will increase the number of true positives at the expense of more false positives
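
The effect of lowering the cut-off can be sketched with a few made-up propensity scores. The scores, labels and the helper `confusion_at_cutoff` below are purely illustrative and do not come from the asker's data; the point is only that a cut-off of 0.5 finds nothing when no score reaches it, while a lower cut-off recovers the positives at the price of some false positives.

```python
def confusion_at_cutoff(scores, labels, cutoff):
    """Return (true positives, false positives) when every case with
    score >= cutoff is predicted as class 1 (hypothetical helper)."""
    tp = sum(1 for s, y in zip(scores, labels) if s >= cutoff and y == 1)
    fp = sum(1 for s, y in zip(scores, labels) if s >= cutoff and y == 0)
    return tp, fp

# Made-up scores from a weak model: no case scores above 0.5
scores = [0.05, 0.10, 0.12, 0.20, 0.25, 0.30, 0.35, 0.40, 0.45, 0.15]
labels = [0,    0,    0,    0,    0,    1,    0,    1,    1,    0]

at_half = confusion_at_cutoff(scores, labels, 0.5)  # default cut-off
at_low = confusion_at_cutoff(scores, labels, 0.3)   # lowered cut-off

print("cut-off 0.5:", at_half)
print("cut-off 0.3:", at_low)
```

Which cut-off is appropriate depends on how costly a false positive is compared with a missed positive in your application.
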
Licensed under: CC-BY-SA with attribution
Not affiliated with StackOverflow