Algorithms for named entity recognition

https://stackoverflow.com/questions/1026925

06-07-2019

Question

I would like to use named entity recognition (NER) to find adequate tags for texts in a database.

I know there is a Wikipedia article about this and lots of other pages describing NER, but I would prefer to hear about this topic from you:

What experiences have you had with the various algorithms?
Which algorithm would you recommend?
Which algorithm is the easiest to implement (PHP/Python)?
How do the algorithms work? Is manual training necessary?

Example:

"Last year, I was in London where I saw Barack Obama." => Tags: London, Barack Obama

I hope you can help me. Thank you very much in advance!

Solution

To start with, check out http://www.nltk.org/ if you plan to work in Python. As far as I know the code isn't "industrial strength", but it will get you started.

Check out section 7.5 of http://nltk.googlecode.com/svn/trunk/doc/book/ch07.html, but to understand the algorithms you will probably have to read through a lot of the book.

Also check out http://nlp.stanford.edu/software/CRF-NER.shtml. It's written in Java.

NER isn't an easy subject, and probably nobody will tell you "this is the best algorithm"; most of them have their pros and cons.

My 0.05 of a dollar.

Cheers,

OTHER TIPS

It depends on whether you want:

To learn about NER: An excellent place to start is with NLTK, and the associated book.

To implement the best solution: Here you're going to need to look for the state of the art. Have a look at publications in TREC. A more specialised meeting is Biocreative (a good example of NER applied to a narrow field).

To implement the easiest solution: In this case you basically just want to do simple part-of-speech tagging and pull out the words tagged as nouns. You could use a tagger from NLTK, or even just look up each word in PyWordnet and tag it with its most common word sense.
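As a rough illustration of the "tag everything, keep the nouns" idea, here is a minimal sketch. The lexicon is a hypothetical hand-written stand-in for a real tagger or WordNet lookup, and the names `TOY_LEXICON`, `tag_words` and `extract_nouns` are made up for this example:

```python
import re

# Hypothetical toy lexicon: word -> most common part of speech.
# A real system would get this from a trained tagger or a resource like WordNet.
TOY_LEXICON = {
    "last": "ADJ", "year": "NOUN", "i": "PRON", "was": "VERB", "in": "ADP",
    "london": "NOUN", "where": "ADV", "saw": "VERB",
    "barack": "NOUN", "obama": "NOUN",
}

def tag_words(text):
    """Tag each word with its lexicon entry; unknown words get 'UNK'."""
    words = re.findall(r"[A-Za-z]+", text)
    return [(word, TOY_LEXICON.get(word.lower(), "UNK")) for word in words]

def extract_nouns(text):
    """Return the words tagged as nouns, in order of appearance."""
    return [word for word, tag in tag_words(text) if tag == "NOUN"]

nouns = extract_nouns("Last year, I was in London where I saw Barack Obama.")
print(nouns)
```

On the example sentence this also returns "year" and splits "Barack Obama" into two words, which shows why plain noun extraction is only a starting point for tagging.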

Most algorithms require some sort of training, and perform best when they're trained on content that represents what you're going to ask them to tag.

There are a few tools and APIs out there.

There's a tool built on top of DBPedia called DBPedia Spotlight (https://github.com/dbpedia-spotlight/dbpedia-spotlight/wiki). You can use their REST interface or download and install your own server. The great thing is it maps entities to their DBPedia presence, which means you can extract interesting linked data.
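A minimal sketch of calling the REST interface from Python, using only the standard library. The endpoint URL below is an assumption (the public demo server; a self-hosted install would have its own address), and the helper names are made up for this example. The parsing assumes the JSON response lists entities under a `Resources` key with `@surfaceForm` and `@URI` fields, so check that against the server you actually use:

```python
import json
import urllib.parse
import urllib.request

# Assumption: public demo endpoint; replace with your own server's URL if self-hosting.
SPOTLIGHT_URL = "https://api.dbpedia-spotlight.org/en/annotate"

def build_request(text, confidence=0.5):
    """Build a GET request that asks the service for JSON annotations."""
    query = urllib.parse.urlencode({"text": text, "confidence": confidence})
    return urllib.request.Request(
        f"{SPOTLIGHT_URL}?{query}", headers={"Accept": "application/json"}
    )

def parse_resources(payload):
    """Turn a decoded JSON response into (surface form, DBpedia URI) pairs."""
    return [(r["@surfaceForm"], r["@URI"]) for r in payload.get("Resources", [])]

def annotate(text, confidence=0.5):
    """Send the text to the service and return the linked entities (needs network)."""
    with urllib.request.urlopen(build_request(text, confidence), timeout=10) as resp:
        return parse_resources(json.load(resp))
```

Calling `annotate("Last year, I was in London where I saw Barack Obama.")` should then give back pairs of the matched text and its DBpedia URI, which you can use both as tags and as keys into the linked data.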

AlchemyAPI (www.alchemyapi.com) has an API that does this via REST as well, and they use a freemium model.

I think most techniques rely on a bit of NLP to find entities, then use an underlying database like Wikipedia, DBPedia, Freebase, etc. to do disambiguation and relevance. For instance, when trying to decide whether an article that mentions Apple is about the fruit or the company, we would choose the company if the article includes other entities that are linked to Apple the company.
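The Apple example can be sketched as a simple overlap count. This is only an illustration of the idea, not how any particular service does it; the tiny knowledge base `CANDIDATES` and the function `disambiguate` are hypothetical, and a real system would pull the linked entities from Wikipedia, DBpedia or similar:

```python
# Hypothetical mini knowledge base: each candidate meaning of a mention,
# with the set of entities it is linked to.
CANDIDATES = {
    "Apple": {
        "Apple Inc.": {"iPhone", "Steve Jobs", "Microsoft", "Cupertino"},
        "Apple (fruit)": {"orchard", "cider", "pear", "tree"},
    }
}

def disambiguate(mention, context_entities):
    """Pick the candidate whose linked entities overlap most with the context.

    Returns None for an unknown mention. On a tie, the candidate listed
    first in the knowledge base wins.
    """
    context = set(context_entities)
    candidates = CANDIDATES.get(mention, {})
    if not candidates:
        return None
    return max(candidates, key=lambda name: len(candidates[name] & context))

print(disambiguate("Apple", ["Steve Jobs", "iPhone"]))
print(disambiguate("Apple", ["cider", "orchard"]))
```

Real systems usually weight the links (by popularity or similarity) instead of just counting them, but the principle is the same: the other entities in the article vote for one reading.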

You may want to try Yahoo Research's latest Fast Entity Linking system. The paper also has updated references to new approaches to NER using neural-network-based embeddings:

https://research.yahoo.com/publications/8810/lightweight-multilingual-entity-extraction-and-linking

One can use artificial neural networks to perform named-entity recognition.

Here is an implementation of a bi-directional LSTM + CRF Network in TensorFlow (python) to perform named-entity recognition: https://github.com/Franck-Dernoncourt/NeuroNER (works on Linux/Mac/Windows).

It gives state-of-the-art results (or close to it) on several named-entity recognition datasets. As Ale mentions, each named-entity recognition algorithm has its own downsides and upsides.
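To give a feel for what the CRF part of such a network contributes, here is a small pure-Python sketch of Viterbi decoding, the step that picks the best overall tag sequence from per-token scores plus tag-to-tag transition scores. This is not NeuroNER's code; the function name and all the numbers are made up for illustration, and in a real model the emission scores would come from the bi-directional LSTM and the transition scores would be learned:

```python
def viterbi(emissions, transitions, tags):
    """Return the highest-scoring tag sequence.

    emissions[t][j]   : score of tag j at token t (from the BiLSTM in a real model)
    transitions[i][j] : score of moving from tag i to tag j (the CRF parameters)
    """
    n_tags = len(tags)
    scores = list(emissions[0])
    backpointers = []
    for emission in emissions[1:]:
        new_scores, pointers = [], []
        for j in range(n_tags):
            # Best previous tag to arrive at tag j from.
            best_i = max(range(n_tags), key=lambda i: scores[i] + transitions[i][j])
            pointers.append(best_i)
            new_scores.append(scores[best_i] + transitions[best_i][j] + emission[j])
        scores = new_scores
        backpointers.append(pointers)
    # Walk back from the best final tag.
    best = max(range(n_tags), key=lambda j: scores[j])
    path = [best]
    for pointers in reversed(backpointers):
        best = pointers[best]
        path.append(best)
    path.reverse()
    return [tags[i] for i in path]

tags = ["O", "B-PER", "I-PER"]
# Made-up scores for the tokens "saw", "Barack", "Obama".
emissions = [
    [2.0, 0.1, 0.1],
    [0.5, 2.0, 0.3],
    [1.0, 0.9, 1.1],
]
# Made-up transitions: O -> I-PER is heavily penalised, B-PER -> I-PER is favoured.
transitions = [
    [0.5, 0.0, -10.0],
    [0.0, -1.0, 1.0],
    [0.0, -1.0, 0.5],
]
print(viterbi(emissions, transitions, tags))
```

The last token's own scores are nearly tied, and it is the transition scores that settle on continuing the person name, which is the kind of consistency the CRF layer adds on top of the LSTM.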

[Figures not preserved: the ANN architecture, and the same network as viewed in TensorBoard.]

I don't really know about NER, but judging from that example, you could write an algorithm that searches for words starting with capital letters or something like that. For that I would recommend regex as the easiest solution to implement if you're thinking small.
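A minimal sketch of that capital-letter idea in Python. The pattern and the rule for skipping sentence-initial words are my own assumptions about how one might do it, and the names are made up for this example:

```python
import re

# One or more capitalised words in a row, e.g. "London" or "Barack Obama".
CAPITALISED_RUN = re.compile(r"\b[A-Z][a-z]+(?:\s+[A-Z][a-z]+)*\b")

def capitalised_candidates(text):
    """Return runs of capitalised words, skipping runs that open a sentence."""
    candidates = []
    for match in CAPITALISED_RUN.finditer(text):
        before = text[:match.start()].rstrip()
        starts_sentence = not before or before[-1] in ".!?"
        if not starts_sentence:
            candidates.append(match.group())
    return candidates

tags = capitalised_candidates("Last year, I was in London where I saw Barack Obama.")
print(tags)  # ['London', 'Barack Obama']
```

It works on the example from the question, but it will miss names that open a sentence and names that don't fit the pattern (such as "IBM" or "McDonald"), so it really is only for thinking small.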

Another option is to compare the texts with a database, which would match strings pre-identified as tags of interest.
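The database comparison could look something like the sketch below. The list `KNOWN_TAGS` is a hypothetical stand-in for whatever tags are stored in the database, and the function names are made up for this example:

```python
import re

# Hypothetical list of pre-identified tags; in practice load these from the database.
KNOWN_TAGS = ["London", "Barack Obama", "New York", "New York Times"]

def build_matcher(tags):
    """Compile one case-insensitive pattern that matches any known tag."""
    # Longest names first so "New York Times" wins over "New York".
    ordered = sorted(tags, key=len, reverse=True)
    pattern = r"\b(?:" + "|".join(re.escape(tag) for tag in ordered) + r")\b"
    return re.compile(pattern, re.IGNORECASE)

def find_known_tags(text, tags=KNOWN_TAGS):
    """Return the known tags found in the text, without duplicates."""
    if not tags:
        return []
    canonical = {tag.lower(): tag for tag in tags}
    found = []
    for match in build_matcher(tags).finditer(text):
        tag = canonical[match.group().lower()]
        if tag not in found:
            found.append(tag)
    return found

print(find_known_tags("Last year, I was in London where I saw Barack Obama."))
```

This only finds what is already in the list, so it fits best when the set of interesting tags is fixed and known up front. For a large list, one compiled pattern can get slow, and a trie or a token-based lookup would be the next thing to try.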

my 5 cents.

Licensed under: CC-BY-SA with attribution

Not affiliated with StackOverflow