Question

I am working on entity extraction in R. I have a UniqueID field and a Text field, and I need to extract location information from the Text field. The Text field contains descriptions that include location names:

text <- c("SERANGOON JC","Blk 4","SHELL TAMPINES AVE  4","SENOKO INDUSTRIAL ESTATE","Senoko Estate","Senoko","senok Est.") 

I also have a list of locations:

Loc <- c("SERANGOON JUNIOR COLLEGE","Block 4","SHELL TAMPINES AVENUE 4","SENOKO INDUSTRIAL ESTATE")

I need to match the entries in Loc against the Text field and extract those locations. In the Text field, SENOKO INDUSTRIAL ESTATE is spelled in different ways: as a partial name (Senoko Estate or Senoko) or with a spelling mistake (senok Est.). For all of these misspelled and partial names, I need to get the exact name from Loc, i.e. SENOKO INDUSTRIAL ESTATE.

My output would look like this (locations extracted from the Text field, with partial and misspelled names replaced by the correct ones):

ID   Location
123  SERANGOON JUNIOR COLLEGE|Block 4|SHELL TAMPINES AVENUE 4|SENOKO INDUSTRIAL ESTATE|SENOKO INDUSTRIAL ESTATE|SENOKO INDUSTRIAL ESTATE|SENOKO INDUSTRIAL ESTATE

Answer

I don't think this is the prettiest way to do it, but here is one option:

text <- c("SERANGOON JC","Blk 4","SHELL TAMPINES AVE  4","SENOKO INDUSTRIAL ESTATE","Senoko Estate","Senoko","senok Est.") 

Loc <- c("SERANGOON JUNIOR COLLEGE","Block 4","SHELL TAMPINES AVENUE 4","SENOKO INDUSTRIAL ESTATE")

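# Replace every entry that contains a recognisable fragment with its full name from Loc.
# The leading and trailing ".*" make each match cover the whole entry.
text <- gsub(".*serang.*", "SERANGOON JUNIOR COLLEGE", text, ignore.case=TRUE)
text <- gsub(".*bl.* 4.*", "Block 4", text, ignore.case=TRUE)
text <- gsub(".*shell.*", "SHELL TAMPINES AVENUE 4", text, ignore.case=TRUE)
text <- gsub(".*senok.*", "SENOKO INDUSTRIAL ESTATE", text, ignore.case=TRUE)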


print(text)

I didn't put the result in exactly the format you requested, but this would be the contents of the second column (Location). I used the regex ".*" before and after each string you were looking for, so the whole entry is replaced even when there is other text or a typo around the matched fragment. That makes the patterns more tolerant of variations in the input.
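For the requested layout, the replaced values could be collapsed into a single pipe-delimited string. The following is only a sketch under a few assumptions: the names `rules`, `location` and `result` are made up for illustration, the ID 123 is taken from the sample output in the question, and the patterns are the same hand-written ones as above, just stored in a lookup table so new locations can be added in one place.

```r
text <- c("SERANGOON JC", "Blk 4", "SHELL TAMPINES AVE  4",
          "SENOKO INDUSTRIAL ESTATE", "Senoko Estate", "Senoko", "senok Est.")

# Lookup table: regex pattern (name) -> canonical location (value).
# `rules` is a hypothetical name; the patterns are the ones used above.
rules <- c(
  ".*serang.*" = "SERANGOON JUNIOR COLLEGE",
  ".*bl.* 4.*" = "Block 4",
  ".*shell.*"  = "SHELL TAMPINES AVENUE 4",
  ".*senok.*"  = "SENOKO INDUSTRIAL ESTATE"
)

# Apply each rule in turn; sub() is enough because every pattern
# spans the whole entry, so there is at most one match per string.
location <- text
for (pattern in names(rules)) {
  location <- sub(pattern, rules[[pattern]], location, ignore.case = TRUE)
}

# One row per ID, with all locations joined by "|" (ID 123 assumed from the question).
result <- data.frame(ID = 123,
                     Location = paste(location, collapse = "|"),
                     stringsAsFactors = FALSE)
print(result)
```

Entries that match none of the patterns are left unchanged, so anything unexpected in the input would show up verbatim in the Location string and can be spotted there.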

Hope this helps!

Licensed under: CC-BY-SA with attribution
Not affiliated with StackOverflow