Question

I have written code to run several time-series rolling regressions for multiple securities. With more than 10,000 securities and more than 200 rolling windows per security, the runtime of the sequential set-up (using foreach %do%) is about 30 minutes.

I would like to switch to foreach %dopar% for parallel computing, using the "doParallel" backend. Simply replacing %do% with %dopar% in the code doesn't do the trick. I am very new to parallel computing and would appreciate some help.

Here is the foreach %do% code:

library(foreach)   ## %do% / %dopar% looping construct
library(zoo)       ## rollapply(), merge() and index() on zoo objects

sec = ncol(ret.zoo)
num.factors = 2
rows = nrow(ret.zoo) - 60 + 1      ## rows produced per security by a 60-period window
beta.temp = matrix(nc = num.factors + 1, nr = sec*rows)
gvkey.vec = matrix(nc = 1, nr = sec*rows)

d = 1
foreach(i = 1:sec) %do% {
    df = merge(ret.zoo[,i], data)
    names(df) <- c("return", names(data))
    gvkey = substr(colnames(ret.zoo)[i], 2, 9)

    ## regression run on each 60-period window; returns NAs if fewer than 30 usable returns
    reg = function(z) {
        z.df = as.data.frame(z)
        ret = z.df[, which(names(z.df) == "return")]
        ret.no.na = ret[!is.na(ret)]
        if(length(ret.no.na) >= 30) {
            coef(lm(return ~ VAL + SIZE, data = as.data.frame(z), na.action = na.omit))
        }
        else {
            as.numeric(rep(NA, num.factors + 1))   ## the "+1" is for the intercept value
        }
    }

    beta = rollapply(df, width = 60, FUN = reg, by.column = FALSE, align = "right")
    beta.temp[d:(d+rows-1),] = beta
    gvkey.vec[d:(d+rows-1),] = gvkey
    d = d + rows
}
## index(beta) is taken from the last iteration; all securities share the same rolling-window dates
beta.df = data.frame(secId = gvkey.vec, date = rep(index(beta), sec), beta.temp)
colnames(beta.df) <- c("gvkey", "date", "intercept", "VAL", "SIZE")

To enable parallel computing with %dopar%, I have loaded and registered the "doParallel" backend.
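
For reference, this is roughly what that registration looks like (a minimal sketch, assuming a cluster of 4 workers; adjust the worker count to the machine):

library(doParallel)      ## parallel backend for foreach's %dopar%
cl <- makeCluster(4)     ## assumed worker count; detectCores() can guide this
registerDoParallel(cl)   ## make the cluster the active %dopar% backend
## ... run the foreach(...) %dopar% loop here ...
stopCluster(cl)          ## release the workers when finished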

Thank you very much!

UPDATE

Here is my first try:

library(doParallel) ## parallel backend for the foreach function
registerDoParallel()

sec = ncol(ret.zoo)
num.factors = 2
rows = nrow(ret.zoo) - 60 + 1

result <- foreach(i = 1:sec) %dopar% {
    library(zoo)     ## each worker runs in its own R session, so packages must be loaded there
    library(stats)

    df = merge(ret.zoo[,i], data)
    names(df) <- c("return", names(data))
    gvkey = substr(colnames(ret.zoo)[i],2,9)

    reg = function(z) {
        z.df = as.data.frame(z)
        ret = z.df[,which(names(z.df) ==  "return")]
        ret.no.na = ret[!is.na(ret)]
        if(length(ret.no.na) >= 30) {
            coef(lm(return ~ VAL + SIZE, data = as.data.frame(z), na.action = na.omit))
        }
        else {
            as.numeric(rep(NA,num.factors + 1))   ## the "+1" is for the intercept value
        }   
    }

    rollapply(df, width = 60, FUN = reg, by.column = FALSE, align = "right")
}
beta.df = do.call('combine', result)

This works perfectly up until the end of the loop. However, beta.df = do.call('combine', result) gives the following error: Error in do.call("combine", result) : could not find function "combine".

How can I combine the output in result? It is currently a list rather than a data frame.

Thanks,


Solution

Here is a way of combining the results from the different workers into a data frame (it is very efficient from a runtime standpoint):

library(data.table)                     ## for rbindlist()

lstData <- Map(as.data.frame, result)   ## coerce each zoo result to a data frame
dfData <- rbindlist(lstData)            ## stack them into one data.table
beta.df = as.data.frame(dfData)
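
Alternatively, foreach can do the combining itself through its .combine argument, and its .packages argument replaces the library() calls inside the loop. Below is a sketch under the assumption that each iteration returns a plain data frame; the final data.frame() wrapper and the gvkey/date columns are additions for illustration, not part of the code above:

result.df <- foreach(i = 1:sec, .combine = rbind,
                     .packages = c("zoo", "stats")) %dopar% {
    df = merge(ret.zoo[,i], data)
    names(df) <- c("return", names(data))
    gvkey = substr(colnames(ret.zoo)[i], 2, 9)
    ## reg() defined here exactly as in the loop above
    beta = rollapply(df, width = 60, FUN = reg, by.column = FALSE, align = "right")
    ## return plain rows so that .combine = rbind stacks them into one data frame
    data.frame(gvkey = gvkey, date = index(beta), coredata(beta))
}

The column names can then be set once on the combined result, e.g. names(result.df) <- c("gvkey", "date", "intercept", "VAL", "SIZE").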