Question

What are the recommended methods for extracting locations from free text?

One approach I can think of is using regex rules like "words ... in location". But are there better approaches than this?

I can also think of keeping a lookup hash table with country and city names and then comparing every token extracted from the text against that table.

Does anybody know of better approaches?

Edit: I'm trying to extract locations from tweet text, so the high volume of tweets may also affect my choice of method.

Solution

All rule-based approaches will fail (if your text is really "free"). That includes regex, context-free grammars, any kind of lookup... Believe me, I've been there before :-)

This problem is called Named Entity Recognition. Location is one of the three most studied classes (along with Person and Organization). Stanford NLP has an open-source Java implementation that is extremely powerful: http://nlp.stanford.edu/software/CRF-NER.shtml

You can easily find implementations in other programming languages.
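
For instance, here is a minimal sketch in Python using spaCy as one such implementation (the model name en_core_web_sm is only an assumption about what you have installed; Stanford's tagger can be driven in much the same way through its own bindings):

    import spacy

    # Assumes the small English model is installed, e.g.
    #   pip install spacy && python -m spacy download en_core_web_sm
    nlp = spacy.load("en_core_web_sm")

    def extract_locations(text):
        """Return entity spans labelled as geo-political entities or locations."""
        doc = nlp(text)
        return [ent.text for ent in doc.ents if ent.label_ in ("GPE", "LOC")]

    print(extract_locations("I moved from Paris to New York last spring."))
    # e.g. ['Paris', 'New York'] -- exact output depends on the model version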

OTHER TIPS

Put all of your valid locations into a sorted list. If you plan to compare case-insensitively, make sure the case of the list entries is already normalized.

Then all you have to do is loop over the individual "words" in your input text and, at the start of each new word, begin a new binary search in your location list. As soon as you hit a non-match, you can skip the entire word and proceed to the next one.
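
A minimal sketch of that single-word lookup in Python, using the standard bisect module (the gazetteer here is a tiny hypothetical example, already normalized and sorted):

    import bisect

    # Hypothetical, lower-cased and sorted gazetteer.
    LOCATIONS = sorted(["berlin", "london", "new york", "paris"])

    def is_location(word):
        """Binary search for an exact match of one normalized word."""
        i = bisect.bisect_left(LOCATIONS, word)
        return i < len(LOCATIONS) and LOCATIONS[i] == word

    print(is_location("paris"))   # True
    print(is_location("banana"))  # False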

Possible problem: multi-word locations such as "New York", "3rd Street", or "People's Republic of China". Perhaps all it takes, though, is to save the position of the first word whenever the binary search leads you to a (possible!) multi-word result. Then, if the full comparison fails (possibly several words later), all you have to do is revert to the word immediately after the one where you started.

As to what a "word" is: while preparing your location list, build a set of all characters that may appear inside location names. Only sequences made up entirely of characters from this set can be considered a valid 'word'.
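
Putting those pieces together, here is a hedged sketch in Python of the whole scheme: a sorted gazetteer, prefix checks via binary search to spot possible multi-word matches, and a simple character-class tokenizer. The gazetteer entries and the allowed character set are assumptions for illustration only:

    import bisect
    import re

    # Hypothetical gazetteer of normalized (lower-cased) location names.
    LOCATIONS = sorted(["3rd street", "new york", "new york city",
                        "paris", "people's republic of china"])

    # Hypothetical set of characters allowed inside location words.
    WORD_RE = re.compile(r"[a-z0-9'.-]+")

    def has_prefix(phrase):
        """True if some gazetteer entry starts with `phrase` (a possible multi-word match)."""
        i = bisect.bisect_left(LOCATIONS, phrase)
        return i < len(LOCATIONS) and LOCATIONS[i].startswith(phrase)

    def is_location(phrase):
        """Exact binary-search membership test."""
        i = bisect.bisect_left(LOCATIONS, phrase)
        return i < len(LOCATIONS) and LOCATIONS[i] == phrase

    def extract_locations(text):
        """Greedy longest match; on failure, resume at the word after the saved start."""
        words = WORD_RE.findall(text.lower())
        found, i = [], 0
        while i < len(words):
            phrase, best, j = "", None, i
            while j < len(words):
                phrase = words[j] if j == i else phrase + " " + words[j]
                if not has_prefix(phrase):
                    break
                if is_location(phrase):
                    best = (phrase, j)
                j += 1
            if best:
                found.append(best[0])
                i = best[1] + 1   # continue after the matched phrase
            else:
                i += 1            # revert: skip the word that started the failed attempt
        return found

    print(extract_locations("Meet me in New York City near 3rd Street"))
    # -> ['new york city', '3rd street']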

How fast are the tweets coming in? That is, are you consuming the full Twitter firehose or the results of some filtering queries? A somewhat more sophisticated approach, similar to what you described, is to use an NLP tool that is integrated with a gazetteer. Very few NLP tools will keep up with Twitter rates, and very few handle tweets well because of all the leetspeak. The NLP side can be tuned for precision or recall, depending on your needs, which limits the number of lookups performed against the gazetteer. I recommend looking at Rosoka (also available as Rosoka Cloud through Amazon AWS) and GeoGravy.
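
As a rough illustration of how NER output and a gazetteer can be combined, and how the combination can be tilted toward precision or recall, here is a hedged sketch (spaCy stands in for the NLP tool; Rosoka and GeoGravy have their own APIs, which are not shown, and the gazetteer is a tiny hypothetical set):

    import spacy

    nlp = spacy.load("en_core_web_sm")           # stand-in NLP tool
    GAZETTEER = {"new york", "london", "cairo"}  # hypothetical gazetteer

    def tweet_locations(tweet, high_precision=True):
        """Combine NER entities with plain gazetteer string hits."""
        ents = {e.text.lower() for e in nlp(tweet).ents
                if e.label_ in ("GPE", "LOC")}
        gaz_hits = {g for g in GAZETTEER if g in tweet.lower()}
        # Intersection keeps only NER results the gazetteer confirms (precision);
        # union also keeps string hits the model missed in noisy tweets (recall).
        return ents & gaz_hits if high_precision else ents | gaz_hits

    print(tweet_locations("Just landed in New York!"))
    print(tweet_locations("omg new york is freezing rn", high_precision=False))
    # Exact output depends on the model; the second call relies on the gazetteer
    # because NER models often miss lower-cased, slang-heavy tweet text.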

Licensed under: CC-BY-SA with attribution