Question

I am trying to do data analysis in R on a group of medium-sized datasets. One of the analyses requires a full outer join across roughly 24-48 files, each of which has about 60 columns and up to 450,000 rows, so I have been running into memory issues a lot.

I thought ffbase or sqldf would help, but apparently a full outer join is not possible with either of them.

Is there a workaround? A package I haven't found yet?


Solution

Here is a simple example that illustrates how to do a full outer join of two datasets:

library(sqldf)
dat1 <- data.frame(x = 1:5, y = letters[1:5])
dat2 <- data.frame(w = 3:8, z = letters[3:8])

# List the columns explicitly so both halves of the UNION line up;
# with SELECT * the second query would return them in the order w, z, x, y,
# and SQLite's UNION matches columns by position, not by name.
sqldf("SELECT dat1.x, dat1.y, dat2.w, dat2.z
         FROM dat1 LEFT OUTER JOIN dat2 ON dat1.x = dat2.w
       UNION
       SELECT dat1.x, dat1.y, dat2.w, dat2.z
         FROM dat2 LEFT OUTER JOIN dat1 ON dat1.x = dat2.w")

The result contains every key from both tables, with NA filling the side that has no match (row order is not guaranteed by UNION):

   x    y  w    z
   1    a NA <NA>
   2    b NA <NA>
   3    c  3    c
   4    d  4    d
   5    e  5    e
  NA <NA>  6    f
  NA <NA>  7    g
  NA <NA>  8    h

There it is, a full outer join using sqldf and SQLite as a backend.

As I also mentioned, sqldf supports more back ends than SQLite. Full outer joins can be emulated the same way in MySQL, and PostgreSQL supports FULL OUTER JOIN natively, so they are possible there as well.

OTHER TIPS

Without sqldf, here is a simple base R solution:

merge(a, b, by = "col", all = TRUE)

FX
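Since the question involves joining 24-48 files, the same merge() call can be folded over a list of data frames with Reduce(). A minimal sketch (df_list and the column names are illustrative, not from the original post):

```r
# Full outer join of an arbitrary number of data frames that share a key
# column; Reduce() merges them pairwise, left to right.
df_list <- list(
  data.frame(col = 1:3, a = c("x", "y", "z")),
  data.frame(col = 2:4, b = c(10, 20, 30)),
  data.frame(col = c(1, 4), d = c(TRUE, FALSE))
)

joined <- Reduce(function(x, y) merge(x, y, by = "col", all = TRUE), df_list)
joined   # one row per key 1:4, NA where a file has no match
```

Note that the intermediate results all live in RAM, so with very large inputs this runs into the same memory limits the question describes.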

If you are using ffbase, you can obtain a full outer join by combining expand.ffgrid with merge.ffdf. expand.ffgrid is like expand.grid but works on ff vectors, so it will not blow up your RAM, and merge.ffdf lets you merge with another ffdf while keeping the data on disk. An example is below.

require(ffbase)
x <- ffseq(1, 10000)
y <- ff(factor(LETTERS))
allcombinations <- expand.ffgrid(x, y)
addme <- data.frame(Var1 = c(1, 2), Var2 = c("A","B"), measure = rnorm(2))
addme <- as.ffdf(addme)
myffdf <- merge(allcombinations, addme, by.x=c("Var1","Var2"), by.y=c("Var1","Var2"),  all.x=TRUE)
myffdf[1:10,]

Next, see the ff package documentation on deleting rows to learn how to subset the resulting myffdf.

Do have a look at ?ffbase::expand.ffgrid and ?ffbase::merge.ffdf

This might work (note: the key column must be the first column in every dataset).

library(ff)
library(ffbase)

fullouterjoin <- function(ffdf1, ffdf2){

    # do a left outer join
    leftjoin <- merge(ffdf1, ffdf2, by = "key", all.x = TRUE)

    # do a right outer join (it's just a left outer join with the objects swapped)
    rightjoin <- merge(ffdf2, ffdf1, by = "key", all.x = TRUE)

    # swap the column order (put the ffdf1 columns first, then the ffdf2 columns)
    srightjoin <- rightjoin[c(names(ffdf1), names(ffdf2)[2:length(ffdf2)])]

    # stack left outer join on top of the (swapped) right outer join
    stacked <- rbind(leftjoin, srightjoin)

    # remove duplicate rows
    uniques <- unique(stacked)

    # that's it
    return(uniques)
}

usage:

newffdf <- fullouterjoin(some_ffdf, another_ffdf)

I'm not saying it's fast, but it might overcome the memory barrier.
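The same stack-and-deduplicate logic can be checked on ordinary data frames before pointing it at ffdf objects. This plain base R sketch mirrors the function above under the same assumption (the key column is named key and comes first); the names fullouterjoin_df, a, and b are illustrative:

```r
# Plain data.frame version of the stack-and-deduplicate full outer join:
# left join, swapped left join, rbind, then drop duplicate rows.
fullouterjoin_df <- function(df1, df2) {
  leftjoin  <- merge(df1, df2, by = "key", all.x = TRUE)
  rightjoin <- merge(df2, df1, by = "key", all.x = TRUE)
  # reorder the swapped join so its columns match leftjoin
  srightjoin <- rightjoin[c(names(df1), names(df2)[2:length(df2)])]
  unique(rbind(leftjoin, srightjoin))
}

a <- data.frame(key = 1:5, y = letters[1:5])
b <- data.frame(key = 3:8, z = letters[3:8])
res <- fullouterjoin_df(a, b)
res   # 8 rows: keys 1 through 8, NAs on the unmatched side
```

The rows shared by both joins (here keys 3-5) are exact duplicates after reordering, which is why unique() is enough to collapse the stack into a proper full outer join.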

Licensed under: CC-BY-SA with attribution
Not affiliated with StackOverflow