Question

I'm running into an "Error: cannot allocate vector of size ...MB" problem using ff/ffdf and the ffdfdply function.

I'm trying to use the ff and ffdf packages to process a large amount of data that has been keyed into groups. The data (in ffdf format) looks like this:

x = 

id_1    id_2    month    year    Amount    key
   1      13        1    2013     -200      11
   1      13        2    2013      300      54
   2      19        1    2013      300      82
   3      33        2    2013      300      70

.... (10+ Million rows)

The unique keys are created using something like:

x$key = as.ff(as.integer(ikey(x[c("id_1","id_2","month","year")])))

To summarise by grouping using the key variable, I have this command:

summary = ffdfdply(x=x, split=x$key, FUN=function(df) {
  df = data.table(df)
  df = df[, list(id_1 = id_1[1], withdraw = sum(Amount*(Amount>0), na.rm=TRUE)), by = "key"]
  df
},trace=T)

This uses data.table's excellent grouping feature (idea taken from this discussion). In the real code there are more functions applied to the Amount variable, but even with this I cannot process the full ffdf table (a smaller subset of the table works fine).

It seems like ffdfdply is using a huge amount of RAM, giving the error:

Error: cannot allocate vector of size 64MB

Also, BATCHBYTES does not seem to help. Can anyone with ffdfdply experience recommend another way to go about this, without pre-splitting the ffdf table into chunks?

Solution

The most difficult part about using ff/ffbase is making sure your data stays in ff and is not accidentally pulled into RAM. Once data ends up in RAM (mostly due to a misunderstanding of when data is and is not put in RAM), it is hard to get that RAM back from R, and if you are working near your RAM limit, even a small extra request will give you 'Error: cannot allocate vector of size'.
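
As a minimal sketch of that distinction, assuming the ffdf x from the question and standard ff/ffdf subsetting semantics (single-bracket column selection and $ stay on disk; empty or two-index subscripts materialise data in RAM):

# Stays in ff (on disk): single-bracket column selection returns an ffdf,
# and $ returns an ff vector
x_cols <- x[c("id_1", "id_2", "month", "year")]   # ffdf
amt_ff <- x$Amount                                # ff vector

# Pulled into RAM: an empty or two-index subscript materialises the data,
# which is where 'cannot allocate vector of size' typically shows up
amt_ram  <- x$Amount[]    # ordinary R vector in RAM
head_ram <- x[1:10, ]     # ordinary data.frame in RAM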

Now, I think you misspecified the input to ikey. Look at ?ikey: it requires an ffdf as its input argument, not several ff vectors. Passing vectors has probably put your data in RAM, while what you wanted is probably ikey(x[c("id_1","id_2","month","year")]).

I simulated some data on my computer as follows to get an ffdf with 24 million rows, and the following does not give me RAM trouble (it uses approximately 3.5 GB of RAM on my machine):

require(ffbase)
require(data.table)
# Simulate an ffdf with 1000 x 1000 x 2 x 12 = 24 million rows, kept on disk
x <- expand.ffgrid(id_1 = ffseq(1, 1000), id_2 = ffseq(1, 1000), year = as.ff(c(2012,2013)), month = as.ff(1:12))
x$Amount <- ffrandom(nrow(x), rnorm, mean = 10, sd = 5)
# Build the group key on the ffdf directly, then make it character
# so ffdfdply can use it as the split
x$key <- ikey(x[c("id_1","id_2","month","year")])
x$key <- as.character(x$key)
summary <- ffdfdply(x, split=x$key, FUN=function(df) {
  # df is one chunk of groups loaded into RAM; aggregate it with data.table
  df <- data.table(df)
  df <- df[, list(
    id_1 = id_1[1],
    id_2 = id_2[1],
    month = month[1],
    year = year[1],
    withdraw = sum(Amount*(Amount>0), na.rm=TRUE)
  ), by = key]
  df
}, trace=TRUE)
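
As an aside on the BATCHBYTES remark in the question: ffdfdply also takes a BATCHBYTES argument (which, if I recall correctly, defaults to getOption("ffbatchbytes")) that caps how much data is loaded per chunk. If RAM is still tight once everything stays in ff, you can lower it; the value below is purely illustrative:

# Hedged sketch: cap each chunk that ffdfdply pulls into RAM at roughly 100 MB
# (the exact value is illustrative and should be tuned to your machine)
summary <- ffdfdply(x, split = x$key, FUN = function(df) {
  df <- data.table(df)
  df[, list(withdraw = sum(Amount*(Amount>0), na.rm=TRUE)), by = key]
}, BATCHBYTES = 100 * 2^20, trace = TRUE)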

Another reason might be that you have too much other data in RAM which you are not talking about. Note also that in ff, all your factor levels are kept in RAM; this can also be an issue if you are working with a lot of character/factor data, in which case you need to ask yourself whether you really need those data in your analysis or not.
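
For example, to get a rough idea of how much RAM a factor column's levels occupy (a hedged sketch; some_factor is a placeholder for one of your own character/factor columns):

# Levels of an ff factor always live in RAM, only the integer codes stay on disk;
# object.size gives a rough idea of how much RAM the levels alone consume
print(object.size(levels(x$some_factor)), units = "Mb")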

Licensed under: CC-BY-SA with attribution
Not affiliated with StackOverflow