Question

I have summarised this from a lot of blogs about Precision and Recall.

Precision is:

Proportion of the instances predicted as positive by the classifier that are actually positive.

meaning, out of the samples the classifier identified as positive, how many are actually positive?

and Recall is:

Proportion of actual positives that were correctly predicted as positive.

meaning, out of the ground-truth positives, how many were correctly identified as positive by the classifier?


That sounded very confusing to me. I couldn't interpret the difference between the two or relate each to real examples. Some small questions about interpretation I have are:

  1. If avoiding false positives matters the most to me, I should be measuring precision; and if avoiding false negatives matters the most to me, I should be measuring recall. Is my understanding correct?
  2. Suppose I am predicting whether a patient should be given a vaccine that is catastrophic when given to a healthy person and hence should only be given to an affected person; I can't afford giving the vaccine to healthy people. Assuming positive stands for should-give-vaccine and negative for should-not-give-vaccine, should I be measuring the precision or the recall of my classifier?
  3. Suppose I am predicting whether an email is spam (+ve) or non-spam (-ve), and I can't afford a spam email being classified as non-spam, meaning I can't afford false negatives. Should I be measuring the precision or the recall of my classifier?
  4. What does it mean to have high precision (> 0.95) and low recall (< 0.05)? And what does it mean to have low precision (< 0.05) and high recall (> 0.95)?

Put simply, in what kinds of cases is it preferable to use Precision over Recall as a metric, and vice versa? I get the definitions but I can't relate them to real examples to tell when one is preferable over the other, so I would really like some clarification.


Solution

To make sure everything is clear, let me quickly summarize what we are talking about. Precision and recall are evaluation measures for binary classification, in which every instance has a ground truth class (also called gold standard class; I'll call it 'gold') and a predicted class, both being either positive or negative (note that it's important to clearly define which one is the positive one). Therefore there are four possibilities for every instance:

  • gold positive and predicted positive -> TP
  • gold positive and predicted negative -> FN (also called type II errors)
  • gold negative and predicted positive -> FP (also called type I errors)
  • gold negative and predicted negative -> TN

$$Precision=\frac{TP}{TP+FP}\ \ \ Recall=\frac{TP}{TP+FN}$$
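If it helps to see the formulas in code, here is a minimal sketch (with made-up labels) that counts the outcome types directly and computes both measures; the scikit-learn cross-check at the end is optional and assumes the library is available.

```python
# Made-up gold labels and predictions, purely to illustrate the formulas.
gold = [1, 1, 1, 0, 0, 0, 1, 0]
pred = [1, 0, 1, 1, 0, 0, 0, 0]

tp = sum(g == 1 and p == 1 for g, p in zip(gold, pred))  # gold positive, predicted positive
fp = sum(g == 0 and p == 1 for g, p in zip(gold, pred))  # gold negative, predicted positive
fn = sum(g == 1 and p == 0 for g, p in zip(gold, pred))  # gold positive, predicted negative

precision = tp / (tp + fp)
recall = tp / (tp + fn)
print(precision, recall)  # 0.666..., 0.5

# Optional cross-check with scikit-learn:
from sklearn.metrics import precision_score, recall_score
print(precision_score(gold, pred), recall_score(gold, pred))
```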

In case it helps, I think a figure such as the one on the Wikipedia Precision and Recall page summarizes these concepts quite well.

About your questions:

  1. If avoiding false positives matters the most to me, I should be measuring precision; and if avoiding false negatives matters the most to me, I should be measuring recall. Is my understanding correct?

Correct.

  2. Suppose I am predicting whether a patient should be given a vaccine that is catastrophic when given to a healthy person and hence should only be given to an affected person; I can't afford giving the vaccine to healthy people. Assuming positive stands for should-give-vaccine and negative for should-not-give-vaccine, should I be measuring the precision or the recall of my classifier?

Here one wants to avoid giving the vaccine to somebody who doesn't need it, i.e. we need to avoid predicting a positive for a gold negative instance. Since we want to avoid FP errors at all cost, we must have a very high precision -> precision should be used.

  3. Suppose I am predicting whether an email is spam (+ve) or non-spam (-ve), and I can't afford a spam email being classified as non-spam, meaning I can't afford false negatives. Should I be measuring the precision or the recall of my classifier?

We want to avoid false negatives -> recall should be used.

Note: the choice of the positive class is important, here spam = positive. This is the standard way, but sometimes people confuse "positive" with a positive outcome, i.e. mentally associate positive with non-spam.
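To make that note concrete, here is a small made-up sketch using scikit-learn's `pos_label` argument (the labels and predictions are invented): the same predictions yield a very different recall depending on which class you declare positive.

```python
from sklearn.metrics import recall_score

# Invented labels: a lazy filter that lets almost everything through as non-spam.
gold = ['spam', 'spam', 'spam', 'ham', 'ham', 'ham']
pred = ['spam', 'ham',  'ham',  'ham', 'ham', 'ham']

# With spam as the positive class, only 1 of the 3 spam emails is caught.
print(recall_score(gold, pred, pos_label='spam'))  # 0.33...

# With ham as the positive class, every ham email is kept, so recall looks perfect,
# even though the filter is useless at its actual job.
print(recall_score(gold, pred, pos_label='ham'))   # 1.0
```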

  4. What does it mean to have high precision (> 0.95) and low recall (< 0.05)? And what does it mean to have low precision (< 0.05) and high recall (> 0.95)?

Let's say you're a classifier in charge of labeling a set of pictures based on whether they contain a dog (positive) or not (negative). You see that some pictures clearly contain a dog so you label them as positive, and some clearly don't so you label them as negative. Now let's assume that for a large majority of pictures you are not sure: maybe the picture is too dark or blurry, there's an animal but it is masked by another object, etc. For these uncertain cases you have two possible strategies (both outcomes are illustrated in the small sketch after this list):

  • Label them as negative, in other words favor precision. Best case scenario, most of them turn out to be negative so you will get both high precision and high recall. But if most of these uncertain cases turn out to be actually positive, then you have a lot of FN errors: your recall will be very low, but your precision will still be very high since you are sure that all/most of the ones you labeled as positive are actually positive.
  • Label them as positive, in other words favor recall. Now in the best case scenario most of them turn out to be positive, so high precision and high recall. But if most of the uncertain cases turn out to be actually negative, then you have a lot of FP errors: your precision will be very low, but your recall will still be very high since you're sure that all/most the true positive are labeled as positive.
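If it helps, here is a tiny made-up simulation of those two outcomes with scikit-learn (the class counts are invented purely to produce the two extremes from question 4):

```python
from sklearn.metrics import precision_score, recall_score

# Scenario A: the uncertain pictures are labeled negative, but most of them
# actually contain a dog -> many FN, so high precision / low recall.
gold_a = [1] * 10 + [0] * 10 + [1] * 80  # 10 clear dogs, 10 clear non-dogs, 80 uncertain (all dogs)
pred_a = [1] * 10 + [0] * 10 + [0] * 80  # uncertain pictures labeled negative
print(precision_score(gold_a, pred_a), recall_score(gold_a, pred_a))  # 1.0, ~0.11

# Scenario B: the uncertain pictures are labeled positive, but most of them
# contain no dog -> many FP, so low precision / high recall.
gold_b = [1] * 10 + [0] * 10 + [0] * 80  # here the 80 uncertain pictures are all non-dogs
pred_b = [1] * 10 + [0] * 10 + [1] * 80  # uncertain pictures labeled positive
print(precision_score(gold_b, pred_b), recall_score(gold_b, pred_b))  # ~0.11, 1.0
```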

Side note: it's not really relevant to your question, but the spam example is not very realistic for a case where high recall is important. Typically high recall matters in tasks where the goal is to find all the potential positive cases: for instance a police investigation trying to find everybody who might have been at a certain place at a certain time. Here FP errors don't matter much since detectives are going to check afterwards, but FN errors could mean missing a potential suspect.

Licensed under: CC-BY-SA with attribution
Not affiliated with datascience.stackexchange