select rows in one data frame that partially match rows in another data frame

https://stackoverflow.com/questions/23145087

05-07-2023
|

Frage

I wish to select rows in one data frame, data.1, that partially match rows in a second data frame, keep.these, to obtain the desired.result. I have found several questions here that match based on one column, but I want to match on three columns: STATE, COUNTY and CITY. I have come up with three solutions so far, but none seem ideal.

Note that each row contains a unique combination of STATE, COUNTY and CITY in my real data.

When I use merge I must re-order. The function match seems to work, but I am not familiar with it and do not know if my use of this function is as intended. The apply solution below is clearly too complex.

The merge approach would be ideal if I did not have to reorder the result. Reordering can be time consuming with large data sets. The match approach seems okay if someone can confirm this is a reasonable approach.

Is there a better solution, ideally in base R?

data.1 <- read.table(text= "
     CITY     COUNTY   STATE        AA
       1          1         1        2
       2          1         1        4
       1          2         1        6
       2          2         1        8
       1          1         2       20
       2          1         2       40
       1          2         2       60
       2          2         2       80
       1          1         3      200
       2          1         3      400
       1          2         3      600
       2          2         3      800
       1          1         4     2000
       2          1         4     4000
       1          2         4     6000
       2          2         4     8000
       1          1         5    20000
       2          1         5    40000
       1          2         5    60000
       2          2         5    80000
", header=TRUE, na.strings=NA)

keep.these <- read.table(text= "
     CITY     COUNTY     STATE      BB
       1          1         2      -10
       2          1         2      -11
       1          2         2      -12
       2          2         2      -13
       1          1         4      -14
       2          1         4      -15
       1          2         4      -16
       2          2         4      -17
", header=TRUE, na.strings=NA)

desired.result <- read.table(text= "
     CITY     COUNTY    STATE       AA
       1          1         2       20
       2          1         2       40
       1          2         2       60
       2          2         2       80
       1          1         4     2000
       2          1         4     4000
       1          2         4     6000
       2          2         4     8000
", header=TRUE, na.strings=NA)

##########

# this works, but I need to reorder

new.data.a <- merge(keep.these[,1:3], data.1, by=c('CITY', 'COUNTY', 'STATE'))

new.data.a <- new.data.a[order(new.data.a$STATE, new.data.a$COUNTY, new.data.a$CITY),]

rownames(desired.result) <- NULL
rownames(new.data.a)     <- NULL

all.equal(desired.result, new.data.a)

##########

# this seems to work, but match is unfamiliar

new.data.2 <- data.1[match(data.1$CITY  , keep.these$CITY  , nomatch=0) & 
                     match(data.1$STATE , keep.these$STATE , nomatch=0) & 
                     match(data.1$COUNTY, keep.these$COUNTY, nomatch=0),]

rownames(desired.result) <- NULL
rownames(new.data.2)     <- NULL

all.equal(desired.result, new.data.2)

##########

# this works, but is too complex

data.1b      <- data.frame(my.group = apply(    data.1[,1:3], 1, paste, collapse = "."),     data.1)
keep.these.b <- data.frame(my.group = apply(keep.these[,1:3], 1, paste, collapse = "."), keep.these)

data.1b <- data.1b[apply(data.1b, 1, function(x) {x[1] %in% keep.these.b$my.group}),]
data.1b <- data.1b[,-1]

rownames(desired.result) <- NULL
rownames(data.1b)        <- NULL

all.equal(desired.result, data.1b)

##########

Lösung

Here is a generic solution for this type of problem which is very efficient:

data.1.ID <- paste(data.1[,1],data.1[,2],data.1[,3])

keep.these.ID <- paste(keep.these[,1],keep.these[,2],keep.these[,3])

desired.result <- data.1[data.1.ID %in% keep.these.ID,]

I have simply created an unique ID for each record, and then searched it. Note: This will change the row names, and you may want to add the following:

row.names(desired.result) <- 1:nrow(desired.result)

EDIT:

Here is another way to solve the same problem.

If you have a very large data set, say millions of rows, another very efficient solution is using the package data.table. It works nearly 50-100 times faster than merge, depending on how much data you have.

All you have to do is the following:

library(data.table)

Step1: Convert data.frame to data.table, with first three columns as keys.

d1 <- data.table(data.1, key=names(data.1)[1:3])
kt <- data.table(keep.these, key=names(keep.these)[1:3])

Step2: A merge using data.table's binary search:

d1[kt]

Note1: The simplicity of execution. Note2: This will sort the data by key. To avoid that try following:

data.1$index <- 1:nrow(data.1)  # Add index to original data
d1 <- data.table(data.1,key=names(data.1)[1:3]) # Step1 as above
kt <- data.table(keep.these,key=names(keep.these)[1:3])  # Step1 as above
d1[kt][order(index)]  # Step2 as above

If you want to remove the last two columns (index, BB), that's straight forward too:

d1[kt][order(index)][,-(5:6),with=F] #Remove index

Try this with large data sets, and compare the timing with merge. It's typically about 50-100 times faster.

To learn more about data.table, try:

vignette("datatable-intro")
vignette("datatable-faq")
vignette("datatable-timings")

Or see it in action:

example(data.table)

Hope this helps!!

Andere Tipps

I'm not sure how this will do in terms of time, compared to reordering, but you can simply add an option to merge to not change the sort.

new.data.a <- merge(keep.these[,1:3], data.1, by=c('CITY', 'COUNTY', 'STATE'), sort = FALSE)
rownames(desired.result) <- NULL
rownames(new.data.a)     <- NULL
all.equal(desired.result, new.data.a)

Lizenziert unter: CC-BY-SA mit Zuschreibung

Nicht verbunden mit StackOverflow