Question

In NLP, there is the concept of Gazetteer which can be quite useful for creating annotations. As far as I understand:

A gazetteer consists of a set of lists containing names of entities such as cities, organisations, days of the week, etc. These lists are used to find occurrences of these names in text, e.g. for the task of named entity recognition.

So it is essentially a lookup. Isn't this kind of a cheat? If we use a gazetteer for detecting named entities, then there is not much Natural Language Processing going on. Ideally, I would want to detect named entities using NLP techniques. Otherwise, how is it any better than a regex pattern matcher?
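To make the "essentially a lookup" point concrete, here is a minimal sketch of gazetteer matching as a plain set lookup over tokens; the lists and sentence are invented for illustration:

```python
# Toy gazetteers: membership lookup, no linguistic analysis involved.
CITY_GAZETTEER = {"paris", "london", "berlin"}
DAY_GAZETTEER = {"monday", "tuesday", "friday"}

def tag_tokens(tokens):
    """Tag each token with an entity label if it appears in a gazetteer."""
    tags = []
    for token in tokens:
        lower = token.lower()
        if lower in CITY_GAZETTEER:
            tags.append((token, "CITY"))
        elif lower in DAY_GAZETTEER:
            tags.append((token, "DAY"))
        else:
            tags.append((token, "O"))
    return tags

print(tag_tokens("I fly to Paris on Friday".split()))
# [('I', 'O'), ('fly', 'O'), ('to', 'O'), ('Paris', 'CITY'), ('on', 'O'), ('Friday', 'DAY')]
```

Nothing here is "smarter" than a regex alternation over the list; the interesting question is what you do with such matches, which the answers below address.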

Was it helpful?

Solution

A gazetteer, or any other intentionally fixed-size feature, is a very popular approach in academic papers when the problem itself is of finite size: for example, NER on a fixed corpus, or POS tagging. I would not consider it cheating unless gazetteer matching is the only feature you use.

However, when you train any kind of NLP model that relies on a dictionary during training, its real-world performance may be far lower than your initial testing reports, unless you can include every object of interest in the gazetteer (and if you can, why do you need the model at all?). The trained model will come to rely on that feature, and in cases where the other features are too weak or not descriptive enough, new objects of interest will not be recognized.

If you do use a gazetteer in your models, make sure that feature has a counter-feature so the model can balance itself: a simple dictionary match should not be the only feature of the positive class (and, more importantly, the gazetteer should match not only positive examples but negative ones as well).

For example, assume you have a complete set of the infinite variations of all person names, which makes general person NER irrelevant, and you now want to decide whether the object mentioned in a text is capable of singing. Relying on membership in your Person gazetteer alone will give you a lot of false positives. So you add a verb-centric feature, "is subject of the verb sing", which by itself will also fire on all kinds of objects: birds, your tummy when you're hungry, and a drunk fellow who thinks he can sing (but let's be honest, he can not). Together, though, the verb-centric feature and the person gazetteer balance each other, assigning the positive class 'Singer' to persons and not to animals or other objects. It still doesn't solve the case of the drunk performer.
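The balancing act described above can be sketched in a few lines. This is a hypothetical toy, not a real NER system: the names, verbs, and the two features are invented to show how a gazetteer match and a context feature only produce the positive class jointly:

```python
# Toy person gazetteer; names are invented.
PERSON_GAZETTEER = {"alice", "bob"}

def features(subject, verb):
    """Extract the two features discussed above for a (subject, verb) pair."""
    return {
        "in_person_gazetteer": subject.lower() in PERSON_GAZETTEER,
        "subject_of_sing": verb.lower() in {"sing", "sings", "sang"},
    }

def is_singer(subject, verb):
    f = features(subject, verb)
    # Positive class only when both features agree; either feature alone
    # produces the false positives described in the text.
    return f["in_person_gazetteer"] and f["subject_of_sing"]

print(is_singer("Alice", "sings"))    # person + sings -> True
print(is_singer("Alice", "walks"))    # gazetteer match alone -> False
print(is_singer("sparrow", "sings"))  # verb feature alone -> False
```

In a real model the two features would be weighted by a trained classifier rather than hard-ANDed, but the interaction is the same: neither feature should be able to decide the class on its own.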

OTHER TIPS

Using a list of entities has a few disadvantages:

  • The list is closed.
  • The list is not context-sensitive. You need context in order to distinguish between "a white house" and "the White House".
  • Building the list requires a lot of labor.
  • The list might also contain errors.
  • It does feel like cheating (or at least, no NLP insights are used).

You can cope with these disadvantages by going in the direction @emre suggested and using the list to learn a classifier.

For example, you can use the tokens near the entity and learn rules such as: "I live at X" indicates a place, and "I talked with X" indicates a person. You can play this game for a few rounds, growing your list with the hits of the rules and using the new list to learn more rules.
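One round of that bootstrapping loop might look like the following toy sketch. The corpus, seed list, and pattern are all invented for illustration; a real system would score candidate patterns rather than trust a single template:

```python
import re

# Tiny invented corpus.
corpus = [
    "I live at Springfield",
    "I talked with Maria",
    "I live at Shelbyville",
]

# Seed list of known places.
place_list = {"Springfield"}

# Suppose the seed hit taught us that "I live at X" indicates a place;
# apply that pattern to the corpus to harvest new candidates.
pattern = re.compile(r"I live at (\w+)")
for sentence in corpus:
    match = pattern.search(sentence)
    if match:
        place_list.add(match.group(1))

print(sorted(place_list))  # ['Shelbyville', 'Springfield']
```

The grown list can then be used to induce further patterns in the next round, which is exactly where the noise mentioned below starts to creep in.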

Please note that this kind of learning introduces noise into the data, so in most cases it will not be so straightforward.

Licensed under: CC-BY-SA with attribution
Not affiliated with datascience.stackexchange