Question

I am trying to populate a binary vector based on the intersection of two data.frames on multiple criteria.

I have the code working but I feel that it is memory excessive just to get the binary vector.

When I apply my code to my full data (40mm+ rows). I begin to have memory problems.

Is there a simpler way to produce the vector?

Here is some sample data (e.g., sub sample will only include obs. in full sample):

ob1_1 <- as.data.frame(cbind(c(1999),c("111","222","666","777")),stringsAsFactors=FALSE)
ob2_1 <- as.data.frame(cbind(c(2000),c("111","333","555","777")),stringsAsFactors=FALSE)
ob3_1 <- as.data.frame(cbind(c(2001),c("111","222","333","777")),stringsAsFactors=FALSE)
ob4_1 <- as.data.frame(cbind(c(2002),c("111","444","555","777")),stringsAsFactors=FALSE)

full_sample <-  rbind(ob1_1,ob2_1,ob3_1,ob4_1)
colnames(full_sample) <- c("yr","ID")

ob1_2 <- as.data.frame(cbind(c(1999),c("111","222","777")),stringsAsFactors=FALSE)
ob2_2 <- as.data.frame(cbind(c(2000),c("333")),stringsAsFactors=FALSE)
ob3_2 <- as.data.frame(cbind(c(2001),c("888")),stringsAsFactors=FALSE)
ob4_2 <- as.data.frame(cbind(c(2002),c("111","444","555","777")),stringsAsFactors=FALSE)

sub_sample <-  rbind(ob1_2,ob2_2,ob3_2,ob4_2)
colnames(sub_sample) <- c("yr","ID")

Here is my working code:

q_intersect <- ""
q_intersect <- paste(q_intersect , "select       a.yr, a.ID       ", sep=" ")
q_intersect <- paste(q_intersect , "from         full_sample a  ", sep=" ")
q_intersect <- paste(q_intersect , "intersect                     ", sep=" ")
q_intersect <- paste(q_intersect , "select       b.yr, b.ID       ", sep=" ")
q_intersect <- paste(q_intersect , "from         sub_sample b  ", sep=" ")
q_intersect <- trim(gsub(" {2,}", " ", q_intersect ))

intersect_temp <- cbind(sqldf(q_intersect ),1)
colnames(intersect_temp ) <- c("yr","ID","in_both")

q_expand <- ""
q_expand <- paste(q_expand , "select       in_both            ", sep=" ")
q_expand <- paste(q_expand , "from         full_sample a      ", sep=" ")
q_expand <- paste(q_expand , "left join    intersect_temp  b  ", sep=" ")
q_expand <- paste(q_expand , "on           a.yr=b.yr          ", sep=" ")
q_expand <- paste(q_expand , "and          a.ID=b.ID          ", sep=" ")
q_expand <- trim(gsub(" {2,}", " ", q_expand ))

solution <- as.integer(sqldf(q_expand)[,1])
solution [is.na(solution )] <- 0 

Thanks ahead of time for any help!

Was it helpful?

Solution

It's not altogether clear what you are trying to accomplish, but I believe something like this would be a lot simpler.

library(data.table)
fullDT <- data.table(full_sample, key=c("yr", "ID"))
subDT  <- data.table(sub_sample,  key=c("yr", "ID"))

fullDT[ , intersect := 0L]
fullDT[subDT, intersect := 1, nomatch=0]

The idea is that you set the key of each data.table to be the columns you want to intersect. When you call full[sub], nomatch=0] you get your inner join, and we set only those values to 1; the values not identified in the inner join are left as 0, as set in the line prior.

fullDT
#        yr  ID intersect
#   1: 1999 111         1
#   2: 1999 222         1
#   3: 1999 666         0
#   4: 1999 777         1
#   5: 2000 111         0
#   6: 2000 333         1
#   7: 2000 555         0
#   8: 2000 777         0
#   9: 2001 111         0
#  10: 2001 222         0
#  11: 2001 333         0
#  12: 2001 777         0
#  13: 2002 111         1
#  14: 2002 444         1
#  15: 2002 555         1
#  16: 2002 777         1

OTHER TIPS

Simpler SQL I gather that you wish to create a one column data frame with the same number of rows as full_sample such that a given row in the output contains 1 if the corresponding row in full_sample has a matching sub_sample row and 0 otherwise.

In that case, the multiple SQL statements can be condensed into a single simpler SQL statement as shown below. The left join ensures that all rows of full_sample are included and the natural join causes the join to occur on all column names that are common between the two input data frames.

sqldf("select s.yr is not null as solution 
       from full_sample f natural left join sub_sample s")

(By the way, note that string literals can flow over multiple lines as this shows so its not necessary to paste multiple lines together.)

Out of Memory Database sqldf by default uses an in memory database but you can specify a file name (which need not exist ahead of time) via the dbname= argument to use as as out of memory database. In that case you won't be limited by memory.

sqldf("select s.yr is not null as solution 
       from full_sample f natural left join sub_sample s", dbname = "mydb")

(Also you can improve performance in some cases by using indexes. See the sqldf home page for examples.)

UPDATE: added simpler sql solution

Licensed under: CC-BY-SA with attribution
Not affiliated with StackOverflow
scroll top