Question

I have a character vector containing many words (not sentences), and I want to know how I can extract all the words in that vector that correspond to a location, for example:

text<-c("China","Japan","perspective","United Kingdom","formatting","clear","India","Sudan","United States of America","Bagel","Mongolian",...)

The output should be:

 > China, Japan, United Kingdom, Mongolian

or something of that type. Basically, I am looking at extracting locative information from random text. This is a very general problem, and I am looking for guidance on how to model my solution. Is there any dataset or similar resource I can use to compare against or extract information from? I don't want to carry out word-by-word comparison.

I have looked up OpenNLP, but I am not sure how to use its location models to carry out Named Entity Recognition in R. In the above example there are only countries, but I would like to identify other places as well, such as provinces, states, counties, and cities. I am new to machine learning and R programming; any guidance is greatly appreciated.


Solution

This might be a better fit for opendata, but nonetheless, you have a few options. One would be to go to geohive, which has other pages, including this one. There is also the UN categorization, available on Wikipedia, which uses membership within the United Nations system to divide the 206 listed states into three categories: 193 member states, two observer states, and 11 other states. The sovereignty-dispute column indicates which states' sovereignty is undisputed (190 states) and which is disputed (16 states).

You can use read.table() or rvest on those sources and grab the lists at runtime.
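As a minimal sketch of the rvest route, the following pulls the tables from the Wikipedia list of sovereign states and keeps the first column as a lookup vector. The URL and the table index are assumptions; the page layout changes over time, so inspect the returned list and adjust accordingly.

```r
library(rvest)

# Assumed URL; the page structure may differ when you run this.
url <- "https://en.wikipedia.org/wiki/List_of_sovereign_states"
page <- read_html(url)

# html_table() returns every <table> on the page as a data frame;
# inspect the list to find the one holding the country names.
tables <- html_table(page, fill = TRUE)
states <- tables[[1]]

# The first column is assumed to contain the state names, which can
# then serve as a lookup vector for matching against your text.
countries <- states[[1]]
head(countries)
```

Scraping at runtime keeps the list current, at the cost of depending on the page's structure; caching the result locally is usually worthwhile.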

OTHER TIPS

I am not entirely clear on the evaluation criteria that you are using but, given your example, you can use %in% to return matching elements in a vector. The %in% statement returns the same Boolean output that Daniel's grep example does. If you want the actual position of the matching elements, you can combine %in% with which().

text<-c("China","Japan","perspective","United Kingdom","formatting","clear", "India","Sudan","United States of America","Bagel","Mongolian")
eval <- c("China", "Japan", "United Kingdom", "Mongolian")

# Return matching elements
text[text %in% eval] 

# Return the index (position) of matching elements
which(text %in% eval) 

I use the function grepl() for this.

text<-c("China","Japan","perspective","United Kingdom","formatting","clear",
"India","Sudan","United States of America","Bagel","Mongolian")

#Make a logical vector
arg <- grepl("China", text) | grepl("Japan", text) | grepl("United Kingdom", text) |
  grepl("Mongolian", text)

#Pick out from text
text[arg]
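Chaining grepl() calls gets unwieldy as the list of locations grows. A sketch of the same idea with a single alternation pattern (the `locations` vector here is just the example list, not a real gazetteer):

```r
text <- c("China", "Japan", "perspective", "United Kingdom", "formatting",
          "clear", "India", "Sudan", "United States of America", "Bagel",
          "Mongolian")

# Build one regular expression that matches any of the target locations.
locations <- c("China", "Japan", "United Kingdom", "Mongolian")
pattern <- paste(locations, collapse = "|")

# A single grepl() call replaces the chain of |-ed calls above.
text[grepl(pattern, text)]
```

If your location names can contain regex metacharacters (e.g. "Washington, D.C."), wrap them with `fixed = TRUE` matching or escape them before pasting.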

This is an interesting problem! You're right in thinking the easiest approach to this is to use a look-up table for identifying locations. Of course, this approach is only going to be as good as your data set and is still going to be somewhat prone to misclassification. One resource I've found is the Free World Cities Database. One caveat: this is no longer actively maintained, and the countries are listed by country code, which may require further resolution on your part. Another possibility is the geonames data set. There, it looks like you'd want their allCountries.zip dataset.

Licensed under: CC-BY-SA with attribution
Not affiliated with datascience.stackexchange