Question

I need to do a soft (fuzzy) match of a given input string against one column of a data frame, for example:

col <- c("John Collingson","J Collingson","Dummy Name1","Dummy Name2")

inputText <- "J Collingson"
#Vice-Versa
inputText <- "John Collingson"

I want to retrieve both "John Collingson" and "J Collingson" from the column "col", whichever of the two is given as input.

Kindly help

Solution

agrep is definitely a quick and easy base R solution if you have just a bit of data. If this is a toy example of a larger data frame, you may be interested in a more durable tool. In the past month, learning about the Levenshtein distance noted by @PaulHiemstra (also in these different questions) led me to the RecordLinkage package. The vignettes leave me wanting more examples of "soft" or "fuzzy" matches, particularly across more than one field, but the basic answer to your question could be something like:

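library(RecordLinkage)
# Candidate names to search, as a one-column data frame
col <- data.frame(names1 = c("John Collingson","J Collingson","Dummy Name1","Dummy Name2"))
# The input string, also as a one-column data frame
inputText <- data.frame(names2 = c("J Collingson"))
# Compare the input with every candidate using a string comparator
g1 <- compare.linkage(inputText, col, strcmp = TRUE)
# Compute a matching weight for each pair
g2 <- epiWeights(g1)
# Show only the pairs within the weight threshold (min.weight = 0.6)
getPairs(g2, min.weight=0.6)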
# id          names2 Weight
# 1  1    J Collingson       
# 2  2    J Collingson  1.000
# 3                          
# 4  1    J Collingson       
# 5  1 John Collingson  0.815

# A misspelled input still pairs with both Collingson entries
inputText2 <- data.frame(names2 = c("Jon Collinson"))
g1 <- compare.linkage(inputText2, col, strcmp = TRUE)
g2 <- epiWeights(g1)
getPairs(g2, min.weight=0.6)
# id          names2    Weight
# 1  1   Jon Collinson          
# 2  1 John Collingson 0.9644444
# 3                             
# 4  1   Jon Collinson          
# 5  2    J Collingson 0.7924825

Start with compare.linkage() or compare.dedup(), or with RLBigDataLinkage() or RLBigDataDedup() for large data sets. Hope this helps.

OTHER TIPS

It seems that agrep is the function you are looking for. It does approximate string matching (fuzzy matching). It returns the elements that match the input pattern approximately, that is, within a maximum generalized Levenshtein edit distance set by the max.distance argument. See ?agrep for more details.

agrep("J Collingson", col, value = TRUE)
[1] "John Collingson" "J Collingson"  
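
The vice-versa case from the question (the longer "John Collingson" as input, the shorter "J Collingson" in the column) may need a wider tolerance than agrep's default. Below is a minimal base R sketch, assuming that a fixed limit of 3 edits suits this data. The helper name soft_match and the max_edits threshold are illustrative and not part of the original answers.

```r
col <- c("John Collingson", "J Collingson", "Dummy Name1", "Dummy Name2")

# Hypothetical helper: keep the candidates whose full-string Levenshtein
# distance to the input is at most max_edits (an assumed threshold).
soft_match <- function(input, candidates, max_edits = 3) {
  d <- adist(input, candidates)[1, ]  # distances from input to each candidate
  candidates[d <= max_edits]
}

res_short <- soft_match("J Collingson", col)
res_long  <- soft_match("John Collingson", col)
print(res_short)
print(res_long)

# agrep() accepts a similar tolerance through max.distance; a value of 1 or
# more is read as an absolute limit rather than a fraction of the pattern.
res_agrep <- agrep("John Collingson", col, max.distance = 3, value = TRUE)
print(res_agrep)
```

Both soft_match() calls should return the two Collingson entries, since the two spellings are three edits apart while the dummy names differ in almost every character. For names of very different lengths, a distance scaled by string length may work better than a fixed threshold.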
Licensed under: CC-BY-SA with attribution