Question

I have a long list of identification codes. At some point it was discovered that some, but not all, of the codes had been mixed up by mistake. The mistake was mapped out, and the correct ID codes and their incorrect partners were identified. Now everything has to be corrected.

However, the list of codes (both correct and mixed up) is very long, and there are multiple entries for each ID code as well as a lot of ID codes to correct. I have found various solutions for replacing multiple values, but they mostly seem to involve typing in the mapping instead of comparing two vectors; see: Dictionary style replace multiple items in R

That is fine if you can do a 1-to-1 mapping of everything or don't mind writing everything out, but when there are a lot of entries it stops being so great. The solution I have made is the following:

Set up data set and "translation" vectors:

y <- cbind(paste(letters, letters, sep=""), seq(1:26))
y[6,1] <- "a"
current <- c( "aa", "ee", "kk", "mm")
tmp <- c("11", "22", "33", "44")
correct <-c("ee", "mm", "zz", "aa")

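Replacement solution:

# First pass: replace each wrong code with its temporary placeholder
for (i in seq_along(current)) {
  y[, 1] <- sub(current[i], tmp[i], y[, 1])
}
# Second pass: replace each placeholder with the correct code
for (i in seq_along(current)) {
  y[, 1] <- sub(tmp[i], correct[i], y[, 1])
}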

Is there a way to make this more efficient?

Thanks for the help!

Solution

Here is an alternative approach using match that does all the swapping at once, so you don't need the temp variable:

swap <- match(y[,1], current)
y[which(!is.na(swap)),1] <- correct[na.omit(swap)]

This produces the same results as your code. It also appears to be more efficient according to a benchmark.
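To check the claim that both approaches agree, here is a minimal sketch that runs the two-pass loop and the match approach on separate copies of the sample data from the question and compares them. The names y_loop, y_match and hit are introduced only for this illustration.

```r
# Sample data from the question
y <- cbind(paste(letters, letters, sep = ""), seq(1:26))
y[6, 1] <- "a"
current <- c("aa", "ee", "kk", "mm")
tmp     <- c("11", "22", "33", "44")
correct <- c("ee", "mm", "zz", "aa")

# Two-pass loop from the question, applied to a copy
y_loop <- y
for (i in seq_along(current)) y_loop[, 1] <- sub(current[i], tmp[i], y_loop[, 1])
for (i in seq_along(current)) y_loop[, 1] <- sub(tmp[i], correct[i], y_loop[, 1])

# match() approach, applied to another copy
y_match <- y
swap <- match(y_match[, 1], current)  # position in `current`, NA if not listed
hit  <- !is.na(swap)                  # rows whose code needs correcting
y_match[hit, 1] <- correct[swap[hit]]

# Compare the two results
same <- identical(y_loop, y_match)
print(same)
```

Because match looks up every entry in a single vectorised call, no placeholder values are needed: each old code is read before any new code is written.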

OTHER TIPS

Here is one approach:

library(gsubfn)
tmp2 <- as.list(correct)
names(tmp2) <- current

pat <- paste(current, collapse='|')

y[,1] <- gsubfn(pat,tmp2, y[,1])

This looks for any of the wrong codes, then looks up each matched code in the conversion list (tmp2) and replaces it with the correct value.
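One thing to keep in mind with a pattern built this way: it is a regular expression, so it matches anywhere inside a string, not only whole codes. The sketch below uses base R's grepl to show which entries an unanchored and an anchored pattern would touch. The vector codes and the name pat_whole are made up for this illustration and are not part of the question's data.

```r
current <- c("aa", "ee", "kk", "mm")

# Hypothetical codes: two exact matches and two that merely contain a wrong code
codes <- c("aa", "aab", "kk", "xkk")

# Unanchored pattern, as built above: matches anywhere inside a code
pat <- paste(current, collapse = "|")
partial <- grepl(pat, codes)

# Anchored pattern: matches only when the whole code is one of `current`
pat_whole <- paste0("^(", paste(current, collapse = "|"), ")$")
whole <- grepl(pat_whole, codes)

print(data.frame(codes, partial, whole))
```

If the ID codes can contain one another as substrings, an exact-match approach such as match or %in% avoids the problem entirely.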

One way to do this is to set the names of correct to current; you can then look up the new values by name and assign them easily:

names(correct) <- current
y[y[,1] %in% current,1] <- correct[y[y[,1] %in% current,1]]

Breaking this down a bit:

y[,1] %in% current is a logical vector marking which entries need to change.

y[y[,1] %in% current,1] is the vector of values to change.

correct[y[y[,1] %in% current,1]] is the vector of new values to insert, looked up by name and ordered as they appear in y.
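The same steps can be written out with intermediate variables to make each piece visible. This is only a sketch on the question's sample data; the names needs_change, old_values and new_values are introduced here for illustration.

```r
# Sample data from the question
y <- cbind(paste(letters, letters, sep = ""), seq(1:26))
y[6, 1] <- "a"
current <- c("aa", "ee", "kk", "mm")
correct <- c("ee", "mm", "zz", "aa")
names(correct) <- current            # correct["aa"] is now "ee", and so on

needs_change <- y[, 1] %in% current  # logical vector: TRUE where a code is wrong
old_values   <- y[needs_change, 1]   # the wrong codes, in the order they occur in y
new_values   <- correct[old_values]  # named lookup: one correct code per wrong code

# unname() drops the lookup names before writing back into the matrix
y[needs_change, 1] <- unname(new_values)

print(which(needs_change))
print(y[needs_change, 1])
```

Storing the logical index once also avoids evaluating y[,1] %in% current twice, which the one-line version does.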

Licensed under: CC-BY-SA with attribution