How to find references to dates in natural text?

https://stackoverflow.com//questions/9675633

12-12-2019

Question

What I want to do is to parse raw natural text and find all the phrases that describe dates.

I've got a fairly big corpus with all the references to dates marked up:

I met him <date>yesterday</date>.
Roger Zelazny was born <date>in 1937</date>
He'll have a hell of a hangover <date>tomorrow morning</date>

I don't want to interpret the date phrases, just locate them. The fact that they're dates is irrelevant (in real life they're not even dates, but I don't want to bore you with the details); basically, it's just an open-ended set of possible values. The grammar of the values themselves can be approximated as context-free. However, it's quite complicated to build manually, and as its complexity grows it gets increasingly hard to avoid false positives.

I know this is a bit of a long shot so I'm not expecting an out-of-the-box solution to exist out there, but what technology or research can I potentially use?

Solution

One of the generic approaches used in academia and in industry is based on Conditional Random Fields (CRFs). A CRF is a probabilistic sequence-labeling model: you first train it on your marked-up data, and it can then label entities of the trained types in new text.
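To make the training step more concrete, here is a minimal sketch of how the marked-up corpus from the question could be turned into per-token labels and feature dictionaries, which is the general shape of input that CRF toolkits expect. The BIO labeling scheme, the tokenizer regex, the feature names, and the helpers `tokenize`, `to_bio` and `token_features` are all assumptions made for illustration; they are not part of Stanford NER or of any particular library.

```python
import re

# Hypothetical tokenizer: words (with an optional apostrophe part) or single punctuation marks.
_WORD = re.compile(r"\w+(?:'\w+)?|[^\w\s]")
_SPAN = re.compile(r"<date>(.*?)</date>")


def tokenize(text):
    """Split plain text into word and punctuation tokens."""
    return _WORD.findall(text)


def to_bio(marked_up):
    """Convert a sentence with <date>...</date> markup into (token, label) pairs.

    Tokens inside a span get B-DATE (first token) or I-DATE (the rest);
    everything else gets O.
    """
    pairs = []
    pos = 0
    for match in _SPAN.finditer(marked_up):
        # Text before the marked span is outside any entity.
        pairs += [(tok, "O") for tok in tokenize(marked_up[pos:match.start()])]
        inside = tokenize(match.group(1))
        pairs += [
            (tok, "B-DATE" if i == 0 else "I-DATE")
            for i, tok in enumerate(inside)
        ]
        pos = match.end()
    # Remaining text after the last span.
    pairs += [(tok, "O") for tok in tokenize(marked_up[pos:])]
    return pairs


def token_features(tokens, i):
    """Build an illustrative feature dict for the token at position i."""
    tok = tokens[i]
    return {
        "lower": tok.lower(),
        "is_digit": tok.isdigit(),
        "is_title": tok.istitle(),
        "suffix3": tok[-3:],
        "prev": tokens[i - 1].lower() if i > 0 else "<BOS>",
        "next": tokens[i + 1].lower() if i < len(tokens) - 1 else "<EOS>",
    }


if __name__ == "__main__":
    sentence = "He'll have a hell of a hangover <date>tomorrow morning</date>"
    labeled = to_bio(sentence)
    tokens = [tok for tok, _ in labeled]
    for i, (tok, label) in enumerate(labeled):
        print(tok, label, token_features(tokens, i))
```

The lists of feature dictionaries and label sequences produced this way would then be passed to whatever CRF trainer you choose. Since you only want to locate the phrases, a single B/I/O label set is enough.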

You can even try one of the systems from the Stanford Natural Language Processing Group: the Stanford Named Entity Recognizer.

When you download the tool, note that it ships with several models. You need the last one listed below, the 7 class model, because it is the one that includes Date:

Included with the Stanford NER are a 4 class model trained for CoNLL, a 7 class model trained for MUC, and a 3 class model trained on both data sets for the intersection of those class sets.

3 class: Location, Person, Organization

4 class: Location, Person, Organization, Misc

7 class: Time, Location, Organization, Person, Money, Percent, Date

Update: You can actually try the tool online. Select the muc.7class.distsim.crf.ser.gz classifier and try some text with dates. It doesn't seem to recognize "yesterday", but it does recognize "20th century", for example. In the end, what gets recognized depends on the data the CRF was trained on.
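If you run the tagger yourself and only care about locating the phrases, you can post-process its output. The sketch below assumes the tagger emits one `word/TAG` pair per token, separated by whitespace; check the output format of your own setup before relying on this. The function `extract_spans` and the sample string are hypothetical: the string is hand-written for illustration and is not actual tool output.

```python
def extract_spans(tagged, label="DATE"):
    """Group consecutive tokens tagged with `label` into phrases.

    `tagged` is assumed to look like "word/TAG word/TAG ...".
    Two adjacent entities with the same label are merged into one phrase,
    because this format carries no begin/inside distinction.
    """
    spans = []
    current = []
    for item in tagged.split():
        # Split on the last slash so words containing "/" keep their text.
        word, _, tag = item.rpartition("/")
        if tag == label:
            current.append(word)
        elif current:
            spans.append(" ".join(current))
            current = []
    if current:
        spans.append(" ".join(current))
    return spans


if __name__ == "__main__":
    # Hand-written sample in the assumed format, not real tagger output.
    sample = "It/O happened/O in/O the/O 20th/DATE century/DATE ./O"
    print(extract_spans(sample))
```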


OTHER TIPS

Keep in mind that CRFs are rather slow to train and require human-annotated data, so doing it yourself is not easy. Read the answers to this for another example of how people often do it in practice, which has little in common with current academic research.

Licensed under: CC-BY-SA with attribution

Not affiliated with StackOverflow