Question

My Named Entity Recognition (NER) pipeline, built with Apache uimaFIT and DKPro, recognizes named entities (referred to as "datatypes" here) in texts, e.g. persons, locations, organizations and many more.

I have a gold corpus and a result corpus on which I want to calculate precision, recall and F1 score. Currently, I calculate these metrics as follows:

1. Calculate precision, recall and F1 score from TP, FP and FN per datatype per document
2. Average precision, recall and F1 score per datatype for all documents

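To illustrate, here is a minimal Python sketch of these two steps. The `counts` structure (keyed by document, then by datatype, holding (TP, FP, FN) tuples) and the function names are hypothetical and not part of uimaFIT or DKPro:

```python
from collections import defaultdict
from statistics import mean

def prf1(tp, fp, fn):
    """Precision, recall and F1 from raw counts for one datatype in one document."""
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1

def average_per_datatype(counts):
    """Step 1: score each datatype per document; step 2: average over all documents."""
    per_datatype = defaultdict(list)
    for datatypes in counts.values():                      # one entry per document
        for datatype, (tp, fp, fn) in datatypes.items():
            per_datatype[datatype].append(prf1(tp, fp, fn))
    # Average precision, recall and F1 separately across documents, per datatype.
    return {
        dt: tuple(mean(values) for values in zip(*scores))
        for dt, scores in per_datatype.items()
    }
```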
[Table: averaged precision, recall and F1 score per datatype]

In the table you can see the results of step 2 in the corresponding datatype rows.

Regarding step 2: I think this way of calculating the F1 score is neither macro nor micro averaging. I calculate precision and recall in the macro way (as explained here), but I do not calculate the F1 score as the harmonic mean of the averaged precision and recall (the macro way); instead, I take the average of the per-document F1 scores for each datatype. The macro-style F1 score comes out higher than the one I am calculating at the moment.
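To make the two variants concrete, this is how I would write them (my own notation, not from a specific source), where $P_i$, $R_i$ and $F_{1,i}$ are the scores of one datatype in document $i$ out of $N$ documents:

$$F_1^{\text{harmonic}} = \frac{2\,\bar{P}\,\bar{R}}{\bar{P} + \bar{R}}, \qquad \bar{P} = \frac{1}{N}\sum_{i=1}^{N} P_i, \quad \bar{R} = \frac{1}{N}\sum_{i=1}^{N} R_i$$

$$F_1^{\text{averaged}} = \frac{1}{N}\sum_{i=1}^{N} F_{1,i} = \frac{1}{N}\sum_{i=1}^{N} \frac{2\,P_i\,R_i}{P_i + R_i}$$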

Question: What is the correct way to calculate the average F1 score per datatype? Both approaches seem valid to me. Please cite sources in your answers.


Solution

There is quite a detailed comparison with references here: https://towardsdatascience.com/a-tale-of-two-macro-f1s-8811ddcf8f04

Basically, both definitions are in use and both can be considered valid. For the sake of clarity, I would recommend stating which definition you are using when you report your results.
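As a quick illustration (the labels below are toy data, not from the question), scikit-learn's `f1_score(..., average='macro')` implements the "average of per-class F1" definition, which you can compare against the harmonic mean of macro precision and macro recall:

```python
from sklearn.metrics import precision_score, recall_score, f1_score

# Toy multi-class example just to show that the two definitions of
# macro F1 generally yield different numbers.
y_true = [0, 0, 0, 1, 1, 2]
y_pred = [0, 1, 1, 1, 2, 2]

p_macro = precision_score(y_true, y_pred, average="macro")
r_macro = recall_score(y_true, y_pred, average="macro")

# Definition A: harmonic mean of macro-averaged precision and recall
f1_harmonic = 2 * p_macro * r_macro / (p_macro + r_macro)

# Definition B: unweighted mean of per-class F1 scores (what sklearn reports)
f1_averaged = f1_score(y_true, y_pred, average="macro")

print(f1_harmonic)  # ~0.61 on this toy data
print(f1_averaged)  # ~0.52 on this toy data
```

The same contrast applies when averaging per-document scores for one datatype, as described in the question.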

Licensed under: CC-BY-SA with attribution
Not affiliated with datascience.stackexchange