Question

I have a dataset of restaurants and the variable "CONAME" contains the name of each establishment. Unfortunately, there are quite a few misspellings, and I'd like to correct them. I've tried agrep for fuzzy set matching using the following code (which I'll repeat for all major chains):

rest2012$CONAME <- agrep("MC DONALD'S", rest2012$CONAME, ignore.case = FALSE, value = FALSE, max.distance = 3)

I'm getting the following error message: Error in `$<-.data.frame`(`*tmp*`, "CONAME", value = c(35L, 40L, 48L, : replacement has 3074 rows, data has 67424

Is there another way I can replace the misspelled names or am I simply using the agrep function wrong?

No correct solution

OTHER TIPS

When you call agrep with value = FALSE, the result is "a vector giving the indices of the elements that yielded a match", that is, the positions of the matches in the vector of names you passed to agrep. You then try to replace the entire name variable in your data frame (67424 rows) with this shorter vector of indices (3074 of them), which is what triggers the error and is not what you want. Here is a small example that may point you in the right direction. You may also want to read ?Extract. The details of agrep itself (e.g. max.distance) I leave to you.

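# create a data frame with some MC DONALD'S-ish names, and some other names
# stringsAsFactors = FALSE keeps CONAME as character (already the default in R >= 4.0)
rest2012 <- data.frame(CONAME = c("MC DONALD'S", "MCC DONALD'S", "SPSS Café", "GLM RONALDO'S", "MCMCglmm"),
                       stringsAsFactors = FALSE)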
rest2012

# do some fuzzy matching with 'agrep'
# store the indices in an object named 'idx'
idx <- agrep(pattern = "MC DONALD'S", x = rest2012$CONAME, ignore.case = FALSE, value = FALSE, max.distance = 3)
idx

# just look at the rows in the data frame that matched
# indexing with a numeric vector 
rest2012[idx, ]

# replace the matching elements of CONAME only
# (rest2012[idx, ] <- "..." would overwrite every column in those rows)
rest2012$CONAME[idx] <- "MC DONALD'S"
rest2012
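
Since the question mentions repeating this for all major chains, the same index-then-replace step could be wrapped in a loop. What follows is only a minimal sketch under assumptions: the toy data, the `chains` vector and max.distance = 2 are all made up for illustration, and the right distance for the real data has to be found by inspecting what each pattern actually matches.

```r
# Hypothetical toy data (not the asker's real data set)
rest2012 <- data.frame(
  CONAME = c("MC DONALD'S", "MCC DONALD'S", "BURGER KING", "BURGR KING", "PIZZA HUT"),
  stringsAsFactors = FALSE
)

# Assumed list of canonical chain names; extend as needed
chains <- c("MC DONALD'S", "BURGER KING")

for (chain in chains) {
  # indices of the names that approximately match this chain
  idx <- agrep(chain, rest2012$CONAME, max.distance = 2)
  # overwrite only those elements of CONAME with the canonical spelling
  rest2012$CONAME[idx] <- chain
}
rest2012
```

Note that agrep does approximate substring matching, so a loose max.distance can pull in unrelated names, and when two chain names resemble each other the later one in `chains` wins. It is worth checking rest2012$CONAME[idx] for each chain before overwriting anything.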
Licensed under: CC-BY-SA with attribution