R + match values at scale (using apply?)

https://stackoverflow.com/questions/22667470

21-06-2023

Question

Is there a way to make matching values at scale more programmatic? Basically, I want to add a bunch of value-lookup columns to a data frame, but I don't want to write out the match() call every time. It seems like this would be a use case for mapply, but I can't quite figure out how to use it here. Any suggestions?

Here's the data:

data <- data.frame(
    region = sample(c("northeast","midwest","west"), 50, replace = T),
    climate = sample(c("dry","cold","arid"), 50, replace = T),
    industry = sample(c("tech","energy","manuf"), 50, replace = T))

And the corresponding lookup tables:

lookups <- data.frame(
    orig_val = c("northeast","midwest","west","dry","cold","arid","tech","energy","manuf"),
    look_val = c("dir1","dir2","dir3","temp1","temp2","temp3","job1","job2","job3")
    )

So now what I want to do is this: first, add a column to "data" called "reg_lookup" that matches each region to its corresponding value in "lookups". Then do the same for "clim_lookup", and so on.

Right now, I've got this mess:

data$reg_lookup <- lookups$look_val[match(data$region, lookups$orig_val)]
data$clim_lookup <- lookups$look_val[match(data$climate, lookups$orig_val)]
data$indus_lookup <- lookups$look_val[match(data$industry, lookups$orig_val)]

I've tried using a function to do this, but the function doesn't seem to work, so then applying that to mapply is a no-go (plus I'm confused about how the mapply syntax would work here):

match_fun <- function(df, newval, df_look, lookup_val, var, ref_val) {
    df$newval <- df_look$lookup_val[match(df$var, df_look$ref_val)]
    return(df)
}

data2 <- match_fun(data, reg_2, lookups, look_val, region, orig_val)

Solution

I think you're just trying to do this:

data <- merge(data,lookups[1:3,],by.x = "region",by.y = "orig_val",all.x = TRUE)
data <- merge(data,lookups[4:6,],by.x = "climate",by.y = "orig_val",all.x = TRUE)
data <- merge(data,lookups[7:9,],by.x = "industry",by.y = "orig_val",all.x = TRUE)

But it would be much better to store the lookups in separate data frames. That way you can control the names of the new columns more easily. It would also allow you to do something like this:

lookups1 <- split(lookups,rep(1:3,each = 3))
colnames(lookups1[[1]]) <- c('region','reg_lookup')
colnames(lookups1[[2]]) <- c('climate','clim_lookup')
colnames(lookups1[[3]]) <- c('industry','indus_lookup')

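# Note: merge() sorts each result by its key column by default,
# so the row order of each piece differs from that of data.
do.call(cbind,mapply(merge,
        x = list(data[,1,drop = FALSE],data[,2,drop = FALSE],data[,3,drop = FALSE]),
        y = lookups1,
        MoreArgs = list(all.x = TRUE),   # the argument name is MoreArgs (capital M)
        SIMPLIFY = FALSE))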

and you should be able to wrap that do.call bit in a function.

I used data[,1,drop = FALSE] in order to preserve them as one column data frames.
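
To see what drop = FALSE changes, here is a small sketch. The toy data frame df is made up for illustration. Without drop = FALSE, selecting a single column collapses the result to a plain vector, so the column name that merge would join on is lost.

```r
# Toy data frame, made up for illustration
df <- data.frame(region = c("west", "midwest"), climate = c("dry", "cold"))

v <- df[, 1]                 # default drop = TRUE: a plain vector, not a data frame
d <- df[, 1, drop = FALSE]   # stays a one-column data frame

is.data.frame(v)   # FALSE
is.data.frame(d)   # TRUE
names(d)           # "region"
```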

The way you structure mapply calls is to pass the arguments you want to iterate over as named lists (the x = and y = parts). I wanted to be sure to preserve all the rows from data, so I passed all.x = TRUE via MoreArgs (note the capital M), which is handed unchanged to merge each time it is called. Finally, I needed to stitch the results together myself, so I turned off SIMPLIFY.
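
If the goal is just to avoid repeating the match line, a minimal sketch along the lines of the question's own function is below. It is an alternative to the merge-based approach, not a version of it: match keeps data in its original row order, whereas merge sorts by the key column by default. The function in the question does nothing because df$var looks for a column literally named "var"; when a column name is held in a variable, [[ ]] with a character string is needed instead. The helper add_lookup and the new_cols mapping are hypothetical names introduced for illustration, and the small fixed data set stands in for the sampled one.

```r
# Same shape of data as in the question (smaller, with fixed values)
data <- data.frame(
    region   = c("west", "midwest", "northeast", "west"),
    climate  = c("dry", "arid", "cold", "cold"),
    industry = c("tech", "manuf", "energy", "tech"),
    stringsAsFactors = FALSE)

lookups <- data.frame(
    orig_val = c("northeast","midwest","west","dry","cold","arid","tech","energy","manuf"),
    look_val = c("dir1","dir2","dir3","temp1","temp2","temp3","job1","job2","job3"),
    stringsAsFactors = FALSE)

# Hypothetical helper: column names are passed as strings and used with [[ ]]
add_lookup <- function(df, var, newval, df_look,
                       ref_val = "orig_val", lookup_val = "look_val") {
    idx <- match(df[[var]], df_look[[ref_val]])   # row of df_look for each value
    df[[newval]] <- df_look[[lookup_val]][idx]    # NA where there is no match
    df
}

# Hypothetical mapping: source column -> name of the new lookup column
new_cols <- c(region = "reg_lookup", climate = "clim_lookup", industry = "indus_lookup")

# Each step updates the same data frame, so a plain loop is enough here
for (var in names(new_cols)) {
    data <- add_lookup(data, var, new_cols[[var]], lookups)
}

data
```

Adding another lookup column then only means adding one entry to new_cols.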

Licensed under: CC-BY-SA with attribution
