Question

I am new to data science and I am trying to understand the use/importance of accuracy, precision, recall, sensitivity and f1-score when I have a confusion matrix.

I know how to compute all of them but I cannot really understand which of them to use each time.

Could you give examples where, for instance, precision is a better metric than recall, or where the f1-score gives essential information that I cannot get from the other terms? In other words, in which cases should I use each of the aforementioned terms?


Solution

First, let's be clear that all these measures are only for evaluating binary classification tasks.

The way to understand the differences is to look at examples where the two classes are (very) imbalanced, either in the true (gold) labels or in the predicted labels.

For instance, imagine the task of detecting city names among the words in a text. City names are not very common, so in a test set of 1000 words there may be only 5 city names (the positive class). Now imagine two systems:

  • Dummy system A, which always says "negative" for any word.
  • Real system B (e.g. one which works with a dictionary of city names). Let's say that B misses 2 of the real cities and mistakenly identifies 8 other words as cities.

System A gets an accuracy of 995/1000 = 99.5% even though it does nothing useful, while system B gets (3 + 987)/1000 = 99.0%. By accuracy alone, A looks better, which is why accuracy rarely gives the full picture on imbalanced data.
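
To make the arithmetic concrete, here is a minimal Python sketch (the helper function and its name are mine, not from any library) which reproduces these numbers from the raw confusion-matrix counts:

    # Confusion-matrix counts for the city-name example (1000 words, 5 positives).
    # System A always says "negative": TP=0, FP=0, FN=5, TN=995.
    # System B misses 2 cities and flags 8 extra words: TP=3, FP=8, FN=2, TN=987.

    def accuracy(tp, fp, fn, tn):
        # Fraction of all instances that are classified correctly.
        return (tp + tn) / (tp + fp + fn + tn)

    print(accuracy(tp=0, fp=0, fn=5, tn=995))  # system A: 0.995
    print(accuracy(tp=3, fp=8, fn=2, tn=987))  # system B: 0.990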

Precision represents how correct a system is in its positive predictions, i.e. TP / (TP + FP): system A never predicts positive, so its precision is undefined (conventionally counted as 0%). System B has 3/11 ≈ 27%.

Recall (also known as sensitivity) represents the proportion of truly positive instances which are retrieved by a system, i.e. TP / (TP + FN): system A doesn't retrieve anything, so it has 0% recall. System B has 3/5 = 60%.
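
Continuing the same sketch (same counts as above, helper names are my own), both measures fall out directly from the confusion-matrix cells:

    def precision(tp, fp):
        # Undefined when the system makes no positive predictions; counted as 0 here.
        return tp / (tp + fp) if (tp + fp) > 0 else 0.0

    def recall(tp, fn):
        # Proportion of the real positives that the system retrieves.
        return tp / (tp + fn)

    print(precision(tp=0, fp=0), recall(tp=0, fn=5))  # system A: 0.0 0.0
    print(precision(tp=3, fp=8), recall(tp=3, fn=2))  # system B: ~0.27 0.6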

F1-score is the harmonic mean of precision and recall: $F_1 = 2 \cdot \frac{P \cdot R}{P + R}$. It is used as a "summary" of these two values in a single number, which is convenient when one needs to rank different systems by their performance.
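
For instance, applied to system B's precision and recall (again only a sketch, with a function name of my choosing):

    def f1_score(p, r):
        # Harmonic mean: stays close to 0 whenever either precision or recall is low.
        return 2 * p * r / (p + r) if (p + r) > 0 else 0.0

    print(f1_score(3 / 11, 3 / 5))  # system B: 0.375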

The choice of an evaluation measure depends on the task: for instance, if a false negative (FN) has life-threatening consequences (e.g. cancer detection), then recall is crucial. If, on the contrary, it is very important to avoid false positives (FP), then precision makes more sense (say, for instance, an automatic missile-defense system which might mistakenly identify a commercial flight as a threat). The most common choice, though, is certainly the F1-score (or more generally the $F_\beta$-score), which is suited to most binary classification tasks.
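
To illustrate how $\beta$ shifts the balance between the two, here is a small sketch of the general formula $F_\beta = (1+\beta^2)\,\frac{P \cdot R}{\beta^2 P + R}$ (the function name is mine):

    def fbeta(p, r, beta):
        # beta > 1 weights recall more heavily (e.g. cancer detection);
        # beta < 1 weights precision more heavily (e.g. missile warnings).
        b2 = beta ** 2
        return (1 + b2) * p * r / (b2 * p + r) if (b2 * p + r) > 0 else 0.0

    p, r = 3 / 11, 3 / 5           # system B from the example above
    print(fbeta(p, r, beta=1.0))   # = 0.375, the plain F1
    print(fbeta(p, r, beta=2.0))   # ~ 0.484, pulled toward recall
    print(fbeta(p, r, beta=0.5))   # ~ 0.306, pulled toward precision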
