Question

I am trying to do data analysis in R on a group of medium-sized datasets. One of the analyses requires a full outer join across roughly 24-48 files, each of which has about 60 columns and up to 450,000 rows, so I have been running into memory issues a lot.

I thought ffbase or sqldf would help, but apparently a full outer join is not possible with either of them.

Is there a workaround? A package I haven't found yet?


Solution

Here is a simple example that illustrates how to do a full outer join of two datasets:

library(sqldf)
dat1 <- data.frame(x = 1:5, y = letters[1:5])
dat2 <- data.frame(w = 3:8, z = letters[3:8])

# List the columns explicitly so both halves of the UNION line up;
# with SELECT * the second query would return them in the order w, z, x, y,
# and SQLite's UNION matches columns by position, not by name.
sqldf("SELECT dat1.x, dat1.y, dat2.w, dat2.z
         FROM dat1 LEFT OUTER JOIN dat2 ON dat1.x = dat2.w
       UNION
       SELECT dat1.x, dat1.y, dat2.w, dat2.z
         FROM dat2 LEFT OUTER JOIN dat1 ON dat1.x = dat2.w")

The result contains every key from both tables, with NA filling the side that has no match (row order is not guaranteed by UNION):

   x    y  w    z
   1    a NA <NA>
   2    b NA <NA>
   3    c  3    c
   4    d  4    d
   5    e  5    e
  NA <NA>  6    f
  NA <NA>  7    g
  NA <NA>  8    h

There it is, a full outer join using sqldf and SQLite as a backend.

As I also mentioned, sqldf supports more back ends than SQLite. Full outer joins can be emulated the same way in MySQL, and PostgreSQL supports FULL OUTER JOIN natively, so they are possible there as well.

OTHER TIPS

Without sqldf, here is a simple base R solution:

merge(a, b, by = "col", all = TRUE)

FX
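Since the question involves joining 24-48 files, the same merge() call can be folded over a list of data frames with Reduce(). A minimal sketch (df_list and the column names are illustrative, not from the original post):

```r
# Full outer join of an arbitrary number of data frames that share a key
# column; Reduce() merges them pairwise, left to right.
df_list <- list(
  data.frame(col = 1:3, a = c("x", "y", "z")),
  data.frame(col = 2:4, b = c(10, 20, 30)),
  data.frame(col = c(1, 4), d = c(TRUE, FALSE))
)

joined <- Reduce(function(x, y) merge(x, y, by = "col", all = TRUE), df_list)
joined   # one row per key 1:4, NA where a file has no match
```

Note that the intermediate results all live in RAM, so with very large inputs this runs into the same memory limits the question describes.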

If you are using ffbase, you can obtain a full outer join by combining expand.ffgrid with merge.ffdf. expand.ffgrid is like expand.grid but works on ff vectors, so it will not blow up your RAM, and merge.ffdf lets you merge with another ffdf while keeping the data on disk. An example is below.

require(ffbase)
x <- ffseq(1, 10000)
y <- ff(factor(LETTERS))
allcombinations <- expand.ffgrid(x, y)
addme <- data.frame(Var1 = c(1, 2), Var2 = c("A","B"), measure = rnorm(2))
addme <- as.ffdf(addme)
myffdf <- merge(allcombinations, addme, by.x=c("Var1","Var2"), by.y=c("Var1","Var2"),  all.x=TRUE)
myffdf[1:10,]

Next, see the ff package documentation on deleting rows to learn how to subset the resulting myffdf.

Do have a look at ?ffbase::expand.ffgrid and ?ffbase::merge.ffdf

This might work (note: the key column must be the first column in every dataset).

library(ff)
library(ffbase)

fullouterjoin <- function(ffdf1, ffdf2){

    # do a left outer join
    leftjoin <- merge(ffdf1, ffdf2, by = "key", all.x = TRUE)

    # do a right outer join (it's just a left outer join with the objects swapped)
    rightjoin <- merge(ffdf2, ffdf1, by = "key", all.x = TRUE)

    # swap the column order (put the ffdf1 columns first, then the ffdf2 columns)
    srightjoin <- rightjoin[c(names(ffdf1), names(ffdf2)[2:length(ffdf2)])]

    # stack left outer join on top of the (swapped) right outer join
    stacked <- rbind(leftjoin, srightjoin)

    # remove duplicate rows
    uniques <- unique(stacked)

    # that's it
    return(uniques)
}

usage:

newffdf <- fullouterjoin(some_ffdf, another_ffdf)

I'm not saying it's fast, but it might overcome the memory barrier.
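The same stack-and-deduplicate logic can be checked on ordinary data frames before pointing it at ffdf objects. This plain base R sketch mirrors the function above under the same assumption (the key column is named key and comes first); the names fullouterjoin_df, a, and b are illustrative:

```r
# Plain data.frame version of the stack-and-deduplicate full outer join:
# left join, swapped left join, rbind, then drop duplicate rows.
fullouterjoin_df <- function(df1, df2) {
  leftjoin  <- merge(df1, df2, by = "key", all.x = TRUE)
  rightjoin <- merge(df2, df1, by = "key", all.x = TRUE)
  # reorder the swapped join so its columns match leftjoin
  srightjoin <- rightjoin[c(names(df1), names(df2)[2:length(df2)])]
  unique(rbind(leftjoin, srightjoin))
}

a <- data.frame(key = 1:5, y = letters[1:5])
b <- data.frame(key = 3:8, z = letters[3:8])
res <- fullouterjoin_df(a, b)
res   # 8 rows: keys 1 through 8, NAs on the unmatched side
```

The rows shared by both joins (here keys 3-5) are exact duplicates after reordering, which is why unique() is enough to collapse the stack into a proper full outer join.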

Licensed under: CC-BY-SA with attribution
Not affiliated with StackOverflow