Kappa near to 60% in unbalanced (1:10) data set
16-10-2019

Question
As mentioned before, I have a classification problem and an unbalanced data set. The majority class contains 88% of all samples.
I have trained a Generalized Boosted Regression model using gbm() from the gbm package in R and get the following output:
interaction.depth n.trees Accuracy Kappa Accuracy SD Kappa SD
1 50 0.906 0.523 0.00978 0.0512
1 100 0.91 0.561 0.0108 0.0517
1 150 0.91 0.572 0.0104 0.0492
2 50 0.908 0.569 0.0106 0.0484
2 100 0.91 0.582 0.00965 0.0443
2 150 0.91 0.584 0.00976 0.0437
3 50 0.909 0.578 0.00996 0.0469
3 100 0.91 0.583 0.00975 0.0447
3 150 0.911 0.586 0.00962 0.0443
Looking at the ~90% accuracy, I assume the model has labeled all the samples as the majority class. That much is clear. What is not transparent is how Kappa is calculated.
- What do these Kappa values (near 60%) really mean? Are they enough to say that the model is not classifying just by chance?
- What do Accuracy SD and Kappa SD mean?
Solution
The Kappa is Cohen's Kappa score for inter-rater agreement. It is a commonly used metric for evaluating the performance of machine learning algorithms and human annotators, particularly when dealing with text/linguistics.
What it does is compare the level of agreement between the output of the (human or algorithmic) annotator and the ground-truth labels to the level of agreement that would occur through random chance: kappa = (p_o - p_e) / (1 - p_e), where p_o is the observed agreement (accuracy) and p_e is the agreement expected by chance given the marginal label frequencies. There's a very good overview of how to calculate Kappa and use it to evaluate a classifier in this stats.stackexchange.com answer, and a more in-depth explanation of Kappa and how to interpret it in the paper "Understanding Interobserver Agreement: The Kappa Statistic" by Viera & Garrett (2005).
The benefit of using Kappa is especially clear in an unbalanced data set like yours: with a 90/10 split between the classes, a classifier can reach 90% accuracy simply by assigning every data point the label of the more commonly occurring class. The Kappa statistic describes how far the classifier's performance rises above that baseline.
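To make that concrete, here is a minimal sketch of the kappa arithmetic for a binary problem (the models in this thread were fit in R, but the calculation is identical in any language; the function name and counts below are just for illustration):

```python
def cohens_kappa(tp, fn, fp, tn):
    """Cohen's kappa from a 2x2 confusion matrix.

    kappa = (p_o - p_e) / (1 - p_e), where p_o is the observed
    agreement (accuracy) and p_e is the agreement expected by
    chance given the marginal label frequencies.
    """
    n = tp + fn + fp + tn
    p_o = (tp + tn) / n  # observed agreement (accuracy)
    # Chance agreement: product of the marginal proportions,
    # summed over both classes.
    p_pos = ((tp + fn) / n) * ((tp + fp) / n)
    p_neg = ((tn + fp) / n) * ((tn + fn) / n)
    p_e = p_pos + p_neg
    return (p_o - p_e) / (1 - p_e)

# Labeling everything as the majority class in a 90/10 data set
# scores 90% accuracy but kappa = 0 -- no better than chance:
print(round(cohens_kappa(tp=0, fn=100, fp=0, tn=900), 3))   # 0.0

# Catching 60 of the 100 minority samples at the cost of 40
# false positives still gives ~92% accuracy, but kappa ~ 0.56:
print(round(cohens_kappa(tp=60, fn=40, fp=40, tn=860), 3))  # 0.556
```

Note how both classifiers have similar accuracy on this imbalanced data, while kappa cleanly separates the trivial one from the one that actually learned the minority class.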
Kappa ranges from -1 to 1, with 0 indicating agreement no better than chance, 1 indicating perfect agreement, and negative numbers indicating systematic disagreement. While interpretation is somewhat arbitrary (and very task-dependent), Landis & Koch (1977) defined the following interpretation system, which can work as a general rule of thumb:
Kappa Agreement
< 0 Less than chance agreement
0.01–0.20 Slight agreement
0.21–0.40 Fair agreement
0.41–0.60 Moderate agreement
0.61–0.80 Substantial agreement
0.81–0.99 Almost perfect agreement
By that rule of thumb, your algorithm is performing moderately well. Accuracy SD and Kappa SD are the respective standard deviations of the Accuracy and Kappa scores across the resampling iterations used to estimate them. I hope this is helpful!
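For reference, the rule-of-thumb table above can be encoded as a small lookup (a hypothetical helper, not part of any package mentioned here):

```python
def kappa_band(kappa):
    """Map a kappa value to its Landis & Koch interpretation band."""
    bands = [
        (0.00, "Less than chance agreement"),
        (0.20, "Slight agreement"),
        (0.40, "Fair agreement"),
        (0.60, "Moderate agreement"),
        (0.80, "Substantial agreement"),
        (1.00, "Almost perfect agreement"),
    ]
    # Return the label of the first band whose upper bound
    # is at or above the given kappa.
    for upper, label in bands:
        if kappa <= upper:
            return label

print(kappa_band(0.586))  # the best Kappa in the tuning grid above
```

Running this on the best Kappa from your tuning grid (0.586) lands in the "Moderate agreement" band.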
OTHER TIPS
This may provide some answers: http://cran.r-project.org/web/packages/caret/vignettes/caret.pdf
You may also check out Max Kuhn's "Applied Predictive Modeling" book. He talks about the caret package at length in this book, including the kappa statistic and how to use it. This may be of some help to you.