Question

What is the best (fastest) way to implement a sliding window function with the data.table package?

I'm trying to calculate a rolling median but have multiple rows per date (due to 2 additional factors), which I think means that the zoo rollapply function wouldn't work. Here is an example using a naive for loop:

library(data.table)
df <- data.frame(
  id=30000,
  date=rep(as.IDate(as.IDate("2012-01-01")+0:29, origin="1970-01-01"), each=1000),
  factor1=rep(1:5, each=200),
  factor2=1:5,
  value=rnorm(30, 100, 10)
)

dt = data.table(df)
setkeyv(dt, c("date", "factor1", "factor2"))

get_window <- function(date, factor1, factor2) {
  criteria <- data.table(
    date=as.IDate((date - 7):(date - 1), origin="1970-01-01"),
    factor1=as.integer(factor1),
    factor2=as.integer(factor2)
  )
  return(dt[criteria][, value])
}

output <- data.table(unique(dt[, list(date, factor1, factor2)]))[, window_median:=as.numeric(NA)]

for(i in nrow(output):1) {
  print(i)
  output[i, window_median:=median(get_window(date, factor1, factor2))]
}

Solution

data.table doesn't currently have any special features for rolling windows. There is further detail in my answer to a similar question:

Is there a fast way to run a rolling regression inside data.table?

Rolling median is interesting. It would need a specialized function to do it efficiently (same link as in the earlier comment):

Rolling median algorithm in C

The data.table solutions in the question and answers here are all very inefficient relative to a proper specialized rolling median function (which isn't available for R, afaik).
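To make the inefficiency concrete, here is a minimal sketch of a naive trailing rolling median in plain R. The function name `naive_roll_median` and the sample vector are made up for illustration. It recomputes every window's median from scratch, which is exactly the repeated work a specialized algorithm would avoid by updating the window incrementally.

```r
# Naive trailing rolling median (hypothetical helper, for illustration only).
# Each window of k values is handed to median() separately, so nothing is
# reused between neighbouring windows.
naive_roll_median <- function(x, k) {
  n <- length(x)
  out <- rep(NA_real_, n)          # the first k - 1 positions have no full window
  if (n < k) return(out)
  for (i in k:n) {
    out[i] <- median(x[(i - k + 1):i])
  }
  out
}

x <- c(5, 1, 4, 2, 3, 9, 7)
naive_roll_median(x, 3)
# NA NA 4 2 3 3 7
```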

OTHER TIPS

I managed to get the example down to 1.4s by creating a lagged dataset and doing a huge join.

df <- data.frame(
  id=30000,
  date=rep(as.IDate(as.IDate("2012-01-01")+0:29, origin="1970-01-01"), each=1000),
  factor1=rep(1:5, each=200),
  factor2=1:5,
  value=rnorm(30, 100, 10)
)

dt <- data.table(df)  # named dt because the rest of this snippet refers to dt
setkeyv(dt, c("date", "factor1", "factor2"))

unique_set <-  data.table(unique(dt[, list(original_date=date, factor1, factor2)]))
output2 <- data.table()
for(i in 1:7) {
  output2 <- rbind(output2, unique_set[, date:=original_date-i])
}    

setkeyv(output2, c("date", "factor1", "factor2"))
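# each row of dt matches up to 7 lagged rows; newer data.table versions
# refuse a join this large unless allow.cartesian=TRUE is set
output2 <- output2[dt, allow.cartesian=TRUE]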
output2 <- output2[, median(value), by=c("original_date", "factor1", "factor2")]

That works pretty well on this test dataset, but on my real one it fails with 8GB of RAM. I'm going to try moving up to one of the High Memory EC2 instances (with 17, 34 or 68GB of RAM) to get it working. Any ideas on how to do this in a less memory-intensive way would be appreciated.
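One possibly less memory-intensive approach is a non-equi join evaluated with by=.EACHI. This computes each window's median as the join proceeds, so it should not need to materialise the whole lagged join first. The sketch below assumes a data.table version that supports non-equi joins (1.9.8 or later). The toy data and the names `windows`, `win_start`, `win_end` and `res` are made up for illustration, and it has not been tried on the real dataset.

```r
library(data.table)

# Toy data shaped like the example: 30 days x 5 x 5 factor levels,
# with 4 rows per date/factor combination (illustrative only).
set.seed(42)
grid <- CJ(day = 0:29, factor1 = 1:5, factor2 = 1:5, rep = 1:4)
dt <- grid[, .(date = as.IDate("2012-01-01") + day,
               factor1, factor2,
               value = rnorm(.N, 100, 10))]

# One row per date/factor combination, with the bounds of its 7-day look-back window.
windows <- unique(dt[, .(date, factor1, factor2)])
windows[, `:=`(win_start = date - 7L, win_end = date - 1L)]

# For each row of windows, take the median of the dt rows that fall inside its window.
# by = .EACHI evaluates j once per row of windows instead of building the full join.
res <- dt[windows,
          on = .(factor1, factor2, date >= win_start, date <= win_end),
          .(window_median = median(value)),
          by = .EACHI]

# res has one row per row of windows, in the same order.
windows[, window_median := res$window_median]
```

Windows with no earlier data (here, the first date) come out as NA.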

Another solution, which works but takes a while:

df <- data.frame(
  id=30000,
  date=rep(seq.Date(from=as.Date("2012-01-01"),to=as.Date("2012-01-30"),by="d"),each=1000),
  factor1=rep(1:5, each=200),
  factor2=1:5,
  value=rnorm(30, 100, 10)
)

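# dff is one row of df (coerced to character by apply); df is the full data frame
myFun <- function(dff,df){
    # window is the 7 days before the row's date: date-7 through date-1 inclusive
    median(df$value[df$date>as.Date(dff[2])-8 & df$date<=as.Date(dff[2])-1 & df$factor1==dff[3] & df$factor2==dff[4]])
}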

week_Med <- apply(df,1,myFun,df=df)

week_Med_df <- cbind(df,week_Med)
Licensed under: CC-BY-SA with attribution