How to measure similarity between strings?

https://stackoverflow.com/questions/11342931

19-06-2021
|

Pregunta

I have a bunch of names, and I want to obtain the unique names. However, due to spelling errors and inconsistencies in the data the names might be written down wrong. I am looking for a way to check in a vector of strings if two of them are similair.

For example:

pres <- c(" Obama, B.","Bush, G.W.","Obama, B.H.","Clinton, W.J.")

I want to find that " Obama, B." and "Obama, B.H." are very similar. Is there a way to do this?

Solución

This can be done based on eg the Levenshtein distance. There are multiple implementations of this in different packages. Some solutions and packages can be found in the answers of these questions:

But most often agrep will do what you want :

> sapply(pres,agrep,pres)
$` Obama, B.`
[1] 1 3

$`Bush, G.W.`
[1] 2

$`Obama, B.H.`
[1] 1 3

$`Clinton, W.J.`
[1] 4

Otros consejos

Maybe agrep is what you want? It searches for approximate matches using the Levenshtein edit distance.

lapply(pres, agrep, pres, value = TRUE)

[[1]]
[1] " Obama, B."  "Obama, B.H."

[[2]]
[1] "Bush, G.W."

[[3]]
[1] " Obama, B."  "Obama, B.H."

[[4]]
[1] "Clinton, W.J."

Add another duplicate to show it works with more than one duplicate.

pres <- c(" Obama, B.","Bush, G.W.","Obama, B.H.","Clinton, W.J.", "Bush, G.")

adist shows the string distance between 2 character vectors

adist(" Obama, B.", pres)
#      [,1] [,2] [,3] [,4] [,5]
# [1,]    0    9    3   10    7

For example, to select the closest string to " Obama, B." you can take the one which has the minimal distance. To avoid the identical string, I took only distances greater than zero:

d <- adist(" Obama, B.", pres)
pres[min(d[d>0])]
# [1] "Obama, B.H."

To obtain unique names, taking into account spelling errors and inconsistencies, you can compare each string to all previous ones. Then if there is a similar one, remove it. I created a keepunique() function that performs this. keepunique() is then applied to all elements of the vector successively with Reduce().

keepunique <-  function(previousones, x){
    if(any(adist(x, previousones)<5)){
        x <- NULL
    }
    return(c(previousones, x))
}
Reduce(keepunique, pres)
# [1] " Obama, B."    "Bush, G.W."    "Clinton, W.J."

Licenciado bajo: CC-BY-SA con atribución

No afiliado a StackOverflow