Question

I am using the Hmisc Package to calculate the quantiles of two continous variables and compare the results in a crosstable. You find my code below.

My problem is that the calculation of the quantiles takes a considerable amount of time if the number of observations increases.

Is there any possibility to speed up this procedure by using the data.table, ddply or any other package?

Thanks.

library(Hmisc)

# Set seed
set.seed(123)

# Generate some data
a <- sample(1:25, 1e7, replace=TRUE)
b <- sample(1:25, 1e7, replace=TRUE)
c <- data.frame(a,b)

# Calculate quantiles
c$a.quantile <- cut2(a, g=5)
c$b.quantile <- cut2(b, g=5)

# Output some descriptives
summaryM(a.quantile ~ b.quantile, data=c, overall=TRUE)

# Time spent for calculation:
#       User      System verstrichen 
#      25.13        3.47       28.73 
Was it helpful?

Solution

As stated by jlhoward and Ricardo Saporta data.table doesn't seem to speed up things too much in this case. The cut2 function is clearly the bottleneck here. I used another function to calculate the quantiles (see Is there a better way to create quantile "dummies" / factors in R?) and was able to decrease the calculation time by half:

qcut <- function(x, n) {
  if(n<=2)
    { 
    stop("The sample must be split in at least 3 parts.")
  }
  else{
    break.values <- quantile(x, seq(0, 1, length = n + 1), type = 7)
    break.labels <- c(
      paste0(">=",break.values[1], " & <=", break.values[2]),
      sapply(break.values[3:(n)], function(x){paste0(">",break.values[which(break.values == x)-1], " & <=", x)}),
      paste0(">",break.values[(n)], " & <=", break.values[(n+1)]))
    cut(x, break.values, labels = break.labels,include.lowest = TRUE)
  }
}

c$a.quantile.2 <- qcut(c$a, 5)
c$b.quantile.2 <- qcut(c$b, 5)
summaryM(a.quantile.2 ~ b.quantile.2, data=c, overall=TRUE)

# Time spent for calculation:
#       User      System verstrichen 
#      10.22        1.47       11.70 

Using data.table would reduce the calculation time by another second, but I like the summary by the Hmisc package better.

OTHER TIPS

You can use data.table's .N built in variable, to quickly tabulate.

library(data.table)
library(Hmisc)

DT <- data.table(a,b)
DT[, paste0(c("a", "b"), ".quantile") := lapply(.SD, cut2, g=5), .SDcols=c("a", "b")]

DT[, .N, keyby=list(b.quantile, a.quantile)][, setNames(as.list(N), as.character(b.quantile)), by=a.quantile]

You can break that last line down into two steps, to see what is going on. The second "[ " simply reshapes the data in a clean format.

DT.tabulated <- DT[, .N, keyby=list(b.quantile, a.quantile)]
DT.tabulated

DT.tabulated[, setNames(as.list(N), as.character(b.quantile)), by=a.quantile]

Data tables don't seem to improve things here:

library(Hmisc)
set.seed(123)
a <- sample(1:25, 1e7, replace=TRUE)
b <- sample(1:25, 1e7, replace=TRUE)

library(data.table)
# original approach
system.time({
  c <- data.frame(a,b)
  c$a.quantile <- cut2(a, g=5)
  c$b.quantile <- cut2(b, g=5)
  smry.1 <-summaryM(a.quantile ~ b.quantile, data=c, overall=TRUE)
})
   user  system elapsed 
  72.79    6.22   79.02 

# original data.table approach
system.time({
  DT <- data.table(a,b)
  DT[, paste0(c("a", "b"), ".quantile") := lapply(.SD, cut2, g=5), .SDcols=c("a", "b")]
  smry.2 <- DT[, .N, keyby=list(b.quantile, a.quantile)][, setNames(as.list(N), as.character(b.quantile)), by=a.quantile]
})
   user  system elapsed 
  66.86    5.11   71.98 

# different data.table approach (simpler, and uses table(...))
system.time({
  dt     <- data.table(a,b)
  smry.3 <- table(dt[,lapply(dt,cut2,g=5)])
})
   user  system elapsed 
  67.24    5.02   72.26 
Licensed under: CC-BY-SA with attribution
Not affiliated with StackOverflow
scroll top