Question

I've read plenty of online posts with clear explanations about the difference between accuracy and F1 score in a binary classification context. However, when I came across the concept of balanced accuracy, explained e.g. in the following image (source) or on this scikit-learn page, I was a bit puzzled when trying to compare it with the F1 score.

[image: balanced accuracy formula]

I know that it is probably impossible to establish which is better between balanced accuracy and F1 score as it could be situation-dependent, but I would like to understand some pros/cons of the two performance metrics, as well as some applications in which one could be more suitable and useful than the other (especially in an imbalanced binary classification context).


Solution

One major difference is that the F1 score does not take true negatives into account at all: it ignores both how many negative examples the model classified correctly and how many negative examples are in the dataset in the first place. Balanced accuracy, in contrast, gives half its weight to how many positives you labeled correctly and half to how many negatives you labeled correctly.
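For reference, the standard definitions in terms of the confusion-matrix counts (TP, FP, TN, FN) are:

$Precision = \frac{TP}{TP+FP} \qquad Recall = \frac{TP}{TP+FN}$

$F_1 = 2 * \frac{Precision * Recall}{Precision + Recall} \qquad Balanced\ Acc = \frac{1}{2}(\frac{TP}{TP+FN} + \frac{TN}{TN+FP})$

TN appears only in the balanced accuracy formula; the F1 score is built entirely from TP, FP and FN.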

When working on problems with heavily imbalanced datasets where you care more about detecting positives than detecting negatives (e.g. outlier detection / anomaly detection), you would prefer the F1 score.

Let's say, for example, you have a validation set that contains 1000 negative samples and 10 positive samples. Suppose a model predicts 15 positive examples (5 truly positive and 10 incorrectly labeled) and predicts the rest as negative, so that

TP=5; FP=10; TN=990; FN=5

Then its F1-score and balanced accuracy will be

$Precision = \frac{5}{15}=0.33...$

$Recall = \frac{5}{10}= 0.5$

$F_1 = 2 * \frac{0.5*0.33}{0.5+0.33} = 0.4$

$Balanced\ Acc = \frac{1}{2}(\frac{5}{10} + \frac{990}{1000}) = 0.745$

You can see that balanced accuracy still cares about the negative datapoints unlike the F1 score.
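If you want to check these numbers yourself, a minimal scikit-learn sketch along these lines should reproduce them (the label arrays below are just one way of constructing the confusion matrix above):

```python
import numpy as np
from sklearn.metrics import f1_score, balanced_accuracy_score

# 1000 negatives followed by 10 positives
y_true = np.array([0] * 1000 + [1] * 10)

# Predictions matching TP=5, FP=10, TN=990, FN=5:
# 10 negatives mislabeled as positive, 5 of the 10 positives found
y_pred = np.array([1] * 10 + [0] * 990 + [1] * 5 + [0] * 5)

print(f1_score(y_true, y_pred))                 # ~0.40
print(balanced_accuracy_score(y_true, y_pred))  # ~0.745
```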

For even more analysis, let's see what changes when the model still predicts 15 positives, but one of its false positives is replaced by a true positive, i.e. it now classifies one extra positive example and one extra negative example correctly:

TP=6; FP=9; TN=991; FN=4

$Precision = \frac{6}{15}=0.4$

$Recall = \frac{6}{10}= 0.6$

$F_1 = 2 * \frac{0.6*0.4}{0.6+0.4} = 0.48$

$Balanced\ Acc = \frac{1}{2}(\frac{6}{10} + \frac{991}{1000}) = 0.7955$

Correctly classifying an extra positive example increased the F1 score (from 0.4 to 0.48) a bit more than the balanced accuracy (from 0.745 to 0.7955).
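The same kind of sketch should reproduce this second scenario, with one false positive turned into a true negative and one false negative turned into a true positive:

```python
import numpy as np
from sklearn.metrics import f1_score, balanced_accuracy_score

# Same 1010-sample validation set, now with TP=6, FP=9, TN=991, FN=4
y_true = np.array([0] * 1000 + [1] * 10)
y_pred = np.array([1] * 9 + [0] * 991 + [1] * 6 + [0] * 4)

print(f1_score(y_true, y_pred))                 # 0.48
print(balanced_accuracy_score(y_true, y_pred))  # ~0.7955
```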

Finally, let's look at what happens when the model again predicts 15 positive examples (5 truly positive and 10 incorrectly labeled), but this time on a balanced dataset with exactly 10 positive and 10 negative examples:

TP=5; FP=10; TN=0; FN=5

$Precision = \frac{5}{15}=0.33...$

$Recall = \frac{5}{10}= 0.5$

$F_1 = 2 * \frac{0.5*0.33}{0.5+0.33} = 0.4$

$Balanced\ Acc = \frac{1}{2}(\frac{5}{10} + \frac{0}{10}) = 0.25$

You can see that the F1 score did not change at all (compared to the first example), while the balanced accuracy took a massive hit (dropping from 0.745 to 0.25).
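Again, a quick sketch on the balanced 10-vs-10 set makes the point; note that with TN=0 every negative example is misclassified:

```python
import numpy as np
from sklearn.metrics import f1_score, balanced_accuracy_score

# Balanced set: 10 negatives, 10 positives; TP=5, FP=10, TN=0, FN=5
y_true = np.array([0] * 10 + [1] * 10)
y_pred = np.array([1] * 10 + [1] * 5 + [0] * 5)

print(f1_score(y_true, y_pred))                 # ~0.40 (unchanged)
print(balanced_accuracy_score(y_true, y_pred))  # 0.25
```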

This shows how the F1 score only cares about the points the model labeled as positive and the points that actually are positive, and doesn't care at all about the plethora of points that are negative.

Licensed under: CC-BY-SA with attribution
Not affiliated with datascience.stackexchange