R intersect data.frame on multiple criteria

Question 1

It's not altogether clear what you are trying to accomplish, but I believe something like this would be a lot simpler.

library(data.table)
fullDT <- data.table(full_sample, key=c("yr", "ID"))
subDT  <- data.table(sub_sample,  key=c("yr", "ID"))

# Default every row to 0, then set the rows that match subDT on the key to 1.
fullDT[ , intersect := 0L]
fullDT[subDT, intersect := 1L, nomatch=0]

The idea is that you set the key of each data.table to the columns you want to intersect on. When you call fullDT[subDT, nomatch=0] you get an inner join, and we set only those rows to 1; the rows not matched by the inner join keep the 0 assigned in the preceding line.

fullDT
#        yr  ID intersect
#   1: 1999 111         1
#   2: 1999 222         1
#   3: 1999 666         0
#   4: 1999 777         1
#   5: 2000 111         0
#   6: 2000 333         1
#   7: 2000 555         0
#   8: 2000 777         0
#   9: 2001 111         0
#  10: 2001 222         0
#  11: 2001 333         0
#  12: 2001 777         0
#  13: 2002 111         1
#  14: 2002 444         1
#  15: 2002 555         1
#  16: 2002 777         1
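
For anyone who wants to try the idea without the original data, here is a minimal self-contained sketch. The toy data frames below are made up for illustration (they are not the question's full_sample and sub_sample), and it assumes a data.table version that supports the on= argument, so no keys need to be set and the original row order is kept.

```r
library(data.table)

# Hypothetical toy data standing in for the question's data frames.
full_sample <- data.frame(yr = c(1999L, 1999L, 2000L, 2000L),
                          ID = c(111L, 666L, 111L, 333L))
sub_sample  <- data.frame(yr = c(1999L, 2000L),
                          ID = c(111L, 333L))

fullDT <- as.data.table(full_sample)
subDT  <- as.data.table(sub_sample)

# Start every row at 0, then flip the rows that have a (yr, ID) match in subDT.
fullDT[, intersect := 0L]
fullDT[subDT, on = c("yr", "ID"), intersect := 1L]

print(fullDT)
```

Only the (1999, 111) and (2000, 333) rows appear in the toy sub_sample, so those are the only rows flagged.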

Question 2

Simpler SQL

I gather that you wish to create a one-column data frame with the same number of rows as full_sample, such that a given row in the output contains 1 if the corresponding row in full_sample has a matching sub_sample row and 0 otherwise.

In that case, the multiple SQL statements can be condensed into a single simpler SQL statement as shown below. The left join ensures that all rows of full_sample are included and the natural join causes the join to occur on all column names that are common between the two input data frames.

library(sqldf)
sqldf("select s.yr is not null as solution 
       from full_sample f natural left join sub_sample s")

(By the way, note that string literals can flow over multiple lines, as this shows, so it's not necessary to paste multiple lines together.)
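
If you would rather spell out the join columns than rely on the natural join, something along these lines should behave the same way. This is only a sketch: the toy data frames are made up for illustration, and it assumes yr and ID are the columns to match on and that sub_sample has no duplicate (yr, ID) pairs (duplicates would repeat rows in the output).

```r
library(sqldf)

# Hypothetical toy data standing in for the question's data frames.
full_sample <- data.frame(yr = c(1999, 1999, 2000), ID = c(111, 666, 111))
sub_sample  <- data.frame(yr = 1999, ID = 111)

# The left join keeps every full_sample row; s.yr is NULL where nothing matched,
# so "s.yr is not null" evaluates to 1 for matched rows and 0 otherwise.
res <- sqldf("select f.yr, f.ID, s.yr is not null as solution
              from full_sample f
              left join sub_sample s
                on f.yr = s.yr and f.ID = s.ID")

print(res)
```

Selecting f.yr and f.ID alongside the flag makes it easy to check which rows were matched.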

Out of Memory Database

sqldf by default uses an in-memory database, but you can specify a file name (which need not exist ahead of time) via the dbname= argument to use as an out-of-memory database. In that case you won't be limited by memory.

sqldf("select s.yr is not null as solution 
       from full_sample f natural left join sub_sample s", dbname = "mydb")

(Also you can improve performance in some cases by using indexes. See the sqldf home page for examples.)

UPDATE: added simpler sql solution