How can I change the classification threshold in NaiveBayesMultinomial, or compute the confusion matrix manually, in Weka?

StackOverflow https://stackoverflow.com/questions/22968151

Question

I am working on a spam filter mining project and I am currently using the NaiveBayesMultinomial classifier for classifying spam from non-spam by counting the frequency of word occurrences.

The problem is that WEKA sets the threshold for classification to 0.5 by default. However, misclassifying a non-spam as spam is more harmful than vice versa.

I want to adjust the threshold of WEKA's NaiveBayesMultinomial algorithm to see how the confusion matrix changes. If that is not directly possible, how do I utilize the output from WEKA to compute a confusion matrix for a different threshold?


Here is a summary of the project's current results when evaluated on the test split:

Summary:

Correctly Classified Instances        2715               98.4766 %
Incorrectly Classified Instances        42                1.5234 %
Kappa statistic                          0.9679
Mean absolute error                      0.0184
Root mean squared error                  0.1136
Relative absolute error                  3.8317 %
Root relative squared error             23.2509 %
Total Number of Instances             2757

Detailed Accuracy By Class:

               TP Rate   FP Rate   Precision   Recall  F-Measure   ROC Area  Class
                 0.998     0.035      0.978     0.998     0.988      0.998    ham
                 0.965     0.002      0.996     0.965     0.98       0.999    spam
Weighted Avg.    0.985     0.022      0.985     0.985     0.985      0.998

Confusion Matrix:

   a    b   <-- classified as
1669    4 |   a = ham
  38 1046 |   b = spam

Solution

I searched around on Google, and it seems WEKA is unlikely to support this directly.

It is still feasible, though: under 'Test options' -> 'More options', enable 'Output predictions'. WEKA then gives the predicted probability for each test sample.

From there I can use another tool for the rest of the work.
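
As a minimal sketch of that "rest of the work", assuming the per-instance predictions have already been extracted into (actual label, spam probability) pairs, the confusion matrix for any threshold could be computed in Python as below. The function name `confusion_matrix_at_threshold` and the sample data are hypothetical and made up for illustration; parsing WEKA's prediction output is not covered here.

```python
from collections import Counter


def confusion_matrix_at_threshold(rows, threshold, positive="spam", negative="ham"):
    """Build a 2x2 confusion matrix from (actual_label, p_positive) pairs.

    An instance is predicted as `positive` only if its probability
    exceeds `threshold`; otherwise it is predicted as `negative`.
    """
    counts = Counter()
    for actual, p_positive in rows:
        predicted = positive if p_positive > threshold else negative
        counts[(actual, predicted)] += 1
    # Rows are actual classes, columns are predicted classes (ham first, then spam),
    # mirroring the layout of the confusion matrix in the question.
    return [
        [counts[(negative, negative)], counts[(negative, positive)]],
        [counts[(positive, negative)], counts[(positive, positive)]],
    ]


# Made-up illustrative data, not taken from the project in the question.
rows = [
    ("ham", 0.10), ("ham", 0.55), ("ham", 0.70),
    ("spam", 0.60), ("spam", 0.95), ("spam", 0.40),
]

print(confusion_matrix_at_threshold(rows, 0.5))
print(confusion_matrix_at_threshold(rows, 0.8))
```

Raising the threshold makes the filter more reluctant to call something spam, so fewer ham messages are misclassified at the price of more spam getting through.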

Other tips

You can change it in the cost/benefit analysis screen. Right-click your result in the result list and select "Visualize threshold curve".

Inside, there is a slider to move the threshold, and the new confusion matrix is shown in the bottom left-hand corner.


The probability threshold can be adjusted by using cost-sensitive classification.

If the desired threshold is k, set the cost of false positives μ and the cost of false negatives λ such that:

k = μ / (μ + λ)

For example, if you want a threshold of 0.4, set μ to 2 and λ to 3. In other words, use a cost matrix of:

0 3
2 0
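
The arithmetic can be checked with a short Python sketch. The helper names `threshold_from_costs` and `cost_matrix` are hypothetical, and the matrix layout simply copies the one in the example above; exact fractions are used to avoid floating-point rounding.

```python
from fractions import Fraction


def threshold_from_costs(fp_cost, fn_cost):
    """Return k = mu / (mu + lambda), where mu is the false-positive cost
    and lambda is the false-negative cost."""
    mu, lam = Fraction(fp_cost), Fraction(fn_cost)
    return mu / (mu + lam)


def cost_matrix(fp_cost, fn_cost):
    """Lay the two costs out as in the example above:
    first row  [0, lambda], second row [mu, 0]."""
    return [[0, fn_cost], [fp_cost, 0]]


k = threshold_from_costs(2, 3)
print(k, float(k))        # 2/5 0.4
print(cost_matrix(2, 3))  # [[0, 3], [2, 0]]
```

Only the ratio of the two costs matters: costs of 4 and 6 give the same threshold as 2 and 3.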

Reference: More Data Mining with Weka — Lesson 4.6 Cost-sensitive classification vs. cost-sensitive learning (slides).


Explanation of formula:

In Naive Bayes with two classes, if class A has a probability of p, then class B has a probability of (1 - p).

If the threshold is 0.5, we classify as class A if we get p > 0.5, or in other words, p > (1 - p).

Suppose the cost of misclassifying A as B (false negative) is Ca, and the cost of misclassifying B as A (false positive) is Cb. Then, we only classify as class A if the probability-weighted cost of misclassifying A as B is greater than the probability-weighted cost of misclassifying B as A. In other words, classify as A if this is true:

Ca * p > Cb * (1 - p)

Rearranging the inequality, we get:

p > Cb / (Ca + Cb)
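
Spelled out step by step, and assuming both costs are positive so that dividing by (Ca + Cb) keeps the direction of the inequality, the rearrangement is:

```latex
\begin{align*}
C_a \, p &> C_b \, (1 - p) \\
C_a \, p + C_b \, p &> C_b \\
p \, (C_a + C_b) &> C_b \\
p &> \frac{C_b}{C_a + C_b}
\end{align*}
```

With equal costs (Ca = Cb) this reduces to the default threshold p > 0.5, and with Ca = 3, Cb = 2 it gives p > 0.4, matching the cost matrix example.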

License: CC-BY-SA with attribution
Not affiliated with StackOverflow