Extracting the top match from string comparison in R

https://stackoverflow.com/questions/21803128

12-10-2022

Question

I am currently using the 'agrep' function with 'lapply' in data.table code to link entries from a user-provided list of VINs to a DMV VIN database. Please see the following two questions for all data/code so far:

Accelerate performance and speed of string match in R

Imperfect string match using data.table in R

Is there a way to extract the "best" match from my list that is being generated by:

dt <- dt[lapply(car.vins, function(x) agrep(x, vin.vins, max.distance=c(cost=2, all=2), value=TRUE)), list(NumTimesFound=.N), vin.names]

As of now, the 'agrep' function gives me multiple matches, even after a lot of tuning of the cost, all, substitutions, etc. arguments.

I have also tried using the 'adist' function instead of 'agrep', but because 'adist' does not have an option for value=TRUE like 'agrep', it throws the same error

Error in `[.data.table`(dt, lapply(vin.vins, function(x) agrep(x,car.vins,  : 
x.'vin.vins' is a character column being joined to i.'V1' which is type 'integer'. 
Character columns must join to factor or character columns.

that I was receiving with 'agrep' before.

Is there perhaps some other package I could use?

Thanks!

Solution

Tom, this isn't strictly a data.table problem. Also, it's hard to know exactly what you want without having the data you are using. I tried to figure out what you want, and I came up with this solution:

vin.match <- vapply(car.vins, function(x) which.min(adist(x, vin.vins)), integer(1L))
data.frame(car.vins, vin.vins=vin.vins[vin.match], vin.names=vin.names[vin.match])
#   car.vins vin.vins vin.names
# 1  abcdekl   abcdef     NAME1
# 2   abcdeF   abcdef     NAME1
# 3  laskdjg  laskdjf     NAME2
# 4  blerghk  blerghk     NAME3

And here is the data:

vin.vins <- c("abcdef", "laskdjf", "blerghk")
vin.names <- paste0("NAME", seq_along(vin.vins))
car.vins <- c("abcdekl", "abcdeF", "laskdjg", "blerghk")

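For every value in car.vins, this finds the closest match in vin.vins, where "closest" means the smallest generalized Levenshtein (edit) distance as computed by adist. Note that adist is case-sensitive by default (which is why "abcdeF" is at distance 1 from "abcdef"), and that which.min returns the first candidate when several are tied. I'm not sure data.table is needed for this particular step. If you provide your actual data (or a representative sample), then I can provide a more targeted answer.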
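One limitation of taking the minimum is that every entry gets a "best" match, even when that match is poor. Here is a minimal sketch of one way to handle that, assuming the same toy data as above: compute the whole distance matrix in a single adist call, keep the distance next to each match, and blank out matches beyond a cutoff. The cutoff max.dist and the names d, best.idx, best.dist and res are made up for illustration, not something from your code.

```r
# Same toy data as above
vin.vins  <- c("abcdef", "laskdjf", "blerghk")
vin.names <- paste0("NAME", seq_along(vin.vins))
car.vins  <- c("abcdekl", "abcdeF", "laskdjg", "blerghk")

# One call gives a length(car.vins) x length(vin.vins) matrix of edit distances
d <- adist(car.vins, vin.vins)

# Column index of the closest candidate for each row (first one wins on ties).
# Assumes no NA strings; an all-NA row would make which.min return integer(0).
best.idx  <- apply(d, 1, which.min)
# Pull the matching distance out of the matrix with (row, column) index pairs
best.dist <- d[cbind(seq_along(car.vins), best.idx)]

max.dist <- 1  # hypothetical cutoff; tune for real VINs

res <- data.frame(car.vins,
                  vin.vins  = vin.vins[best.idx],
                  vin.names = vin.names[best.idx],
                  distance  = best.dist,
                  stringsAsFactors = FALSE)

# Treat anything farther away than the cutoff as "no match"
too.far <- res$distance > max.dist
res$vin.vins[too.far]  <- NA
res$vin.names[too.far] <- NA
res
```

With this cutoff, "abcdekl" ends up unmatched because it is two edits away from "abcdef", while the other three entries keep their matches. For a large DMV table the full distance matrix can get big, so you may need to process car.vins in chunks.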

Licensed under: CC-BY-SA with attribution

Not affiliated with StackOverflow