How can I change the classification threshold in NaiveBayesMultinomial, or compute the confusion matrix manually, in Weka?

StackOverflow https://stackoverflow.com/questions/22968151

Question

I am working on a spam-filtering project and currently use the NaiveBayesMultinomial classifier to separate spam from non-spam based on word-occurrence frequencies.

The problem is that WEKA sets the classification threshold to 0.5 by default. However, misclassifying a non-spam message as spam is more harmful than the reverse.

I want to adjust the threshold of WEKA's NaiveBayesMultinomial algorithm to see how the confusion matrix changes. If that is not directly possible, how can I use WEKA's output to compute a confusion matrix for a different threshold?


Here is a summary of the project's current results when evaluated on the test split:

Summary:

Correctly Classified Instances        2715               98.4766 %
Incorrectly Classified Instances        42                1.5234 %
Kappa statistic                          0.9679
Mean absolute error                      0.0184
Root mean squared error                  0.1136
Relative absolute error                  3.8317 %
Root relative squared error             23.2509 %
Total Number of Instances             2757

Detailed Accuracy By Class:

               TP Rate   FP Rate   Precision   Recall  F-Measure   ROC Area  Class
                 0.998     0.035      0.978     0.998     0.988      0.998    ham
                 0.965     0.002      0.996     0.965     0.98       0.999    spam
Weighted Avg.    0.985     0.022      0.985     0.985     0.985      0.998

Confusion Matrix:

   a    b   <-- classified as
1669    4 |   a = ham
  38 1046 |   b = spam

Solution

I searched around on Google, and it seems WEKA does not let you do this directly.

It is still feasible, though: under 'Test options' -> 'More options', enable 'Output predictions'. WEKA then outputs the predicted probability for each test instance.

From there I can use another tool for the rest of the work.
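
As a rough sketch of that last step, the snippet below recomputes a confusion matrix at a chosen threshold. It assumes the per-instance output has already been parsed into (actual label, predicted probability of spam) pairs. The parsing itself, the function name, and the sample values are hypothetical; they are not taken from WEKA or from the results in the question.

```python
def confusion_matrix_at_threshold(rows, threshold, positive="spam", negative="ham"):
    """Count outcomes when `positive` is predicted only if its probability exceeds `threshold`.

    rows: iterable of (actual_label, p_positive) pairs, assumed already parsed.
    Returns a dict keyed by (actual, predicted).
    """
    labels = (negative, positive)
    counts = {(actual, predicted): 0 for actual in labels for predicted in labels}
    for actual, p_positive in rows:
        predicted = positive if p_positive > threshold else negative
        counts[(actual, predicted)] += 1
    return counts


# Hypothetical sample data, for illustration only.
sample = [("ham", 0.10), ("ham", 0.62), ("spam", 0.95), ("spam", 0.55), ("spam", 0.30)]

for t in (0.5, 0.7):
    cm = confusion_matrix_at_threshold(sample, t)
    print(f"threshold={t}")
    print(f"  actual ham:  ham={cm[('ham', 'ham')]}, spam={cm[('ham', 'spam')]}")
    print(f"  actual spam: ham={cm[('spam', 'ham')]}, spam={cm[('spam', 'spam')]}")
```

With this sample, raising the threshold from 0.5 to 0.7 removes the one ham message flagged as spam but lets one more spam message through, which is the trade-off the question is about.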

Other tips

You can change it in the cost-benefit analysis screen. Right-click your results in the result list and select "Visualize threshold curve".

That screen has a slider for moving the threshold, and the resulting confusion matrix appears in the bottom left-hand corner.


The probability threshold can be adjusted by using cost-sensitive classification.

If the desired threshold is k, set the cost of false positives μ and the cost of false negatives λ such that:

k = μ / (μ + λ)

For example, if you want a threshold of 0.4, set μ to 2 and λ to 3. In other words, use a cost matrix of:

0 3
2 0

Reference: More Data Mining with Weka — Lesson 4.6 Cost-sensitive classification vs. cost-sensitive learning (slides).
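
To make the arithmetic concrete, here is a small sketch that converts between costs and threshold using the formula k = μ / (μ + λ). The function names are made up for illustration, the costs are assumed to be positive integers, and nothing here calls WEKA.

```python
from fractions import Fraction


def threshold_from_costs(fp_cost, fn_cost):
    """Return k = mu / (mu + lambda) exactly, for positive integer costs."""
    return Fraction(fp_cost, fp_cost + fn_cost)


def integer_costs_for_threshold(k, max_denominator=1000):
    """Return (mu, lambda), small integer costs whose ratio mu / (mu + lambda) matches k.

    k is rounded to the nearest fraction with denominator <= max_denominator.
    """
    if not 0 < k < 1:
        raise ValueError("threshold must be strictly between 0 and 1")
    frac = Fraction(k).limit_denominator(max_denominator)
    return frac.numerator, frac.denominator - frac.numerator


print(threshold_from_costs(2, 3))        # 2/5
print(integer_costs_for_threshold(0.4))  # (2, 3)
```

Only the ratio of the two costs matters: costs of 4 and 6 give the same threshold as 2 and 3.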


Explanation of formula:

In Naive Bayes with two classes, if class A has a probability of p, then class B has a probability of (1 - p).

If the threshold is 0.5, we classify as class A if we get p > 0.5, or in other words, p > (1 - p).

Suppose the cost of misclassifying A as B (a false negative) is Ca, and the cost of misclassifying B as A (a false positive) is Cb; these play the roles of λ and μ above. Predicting B then has an expected cost of Ca * p, and predicting A has an expected cost of Cb * (1 - p). To minimize the expected cost, we classify as class A only if predicting B would cost more, that is, if:

Ca * p > Cb * (1 - p)

Rearranging the inequality, we get:

p > Cb / (Ca + Cb)
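
As a sanity check on the rearrangement, the sketch below compares the minimum-expected-cost decision with the threshold rule over a grid of probabilities. It uses exact fractions to avoid floating-point trouble at the boundary; the function names and the choice Ca = 3, Cb = 2 (the 0.4 example) are illustrative assumptions.

```python
from fractions import Fraction


def min_expected_cost_class(p, ca, cb):
    """Pick the class with the lower expected cost; ties go to B (strict inequality)."""
    cost_of_predicting_b = ca * p        # wrong only when the true class is A
    cost_of_predicting_a = cb * (1 - p)  # wrong only when the true class is B
    return "A" if cost_of_predicting_b > cost_of_predicting_a else "B"


def threshold_class(p, ca, cb):
    """Classify as A when p exceeds Cb / (Ca + Cb)."""
    return "A" if p > Fraction(cb, ca + cb) else "B"


ca, cb = 3, 2  # threshold = 2 / (3 + 2) = 0.4
grid = [Fraction(i, 100) for i in range(101)]
agree = all(min_expected_cost_class(p, ca, cb) == threshold_class(p, ca, cb) for p in grid)
print(agree)
print(min_expected_cost_class(Fraction(45, 100), ca, cb))  # A
print(min_expected_cost_class(Fraction(35, 100), ca, cb))  # B
```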

Licensed under: CC-BY-SA with attribution
Not affiliated with StackOverflow