Question

I have an application that requires me to bin data according to some number X of tiers. To keep things simple, say that I want to create a new vector that assigns a bin of 1 - 4 to each observation, depending on that observation's quartile.

Here's the solution I've come up with so far:

binner <- function(N){ 
  start <- Sys.time()
  vec <- runif(N)
  cuts <- quantile(vec, seq(0, 0.75, 0.25)) 
  bins <- sapply(vec, function(x) max(which(x >= cuts)))
  end <- Sys.time()

  cat('Run time:', end - start) 
  bins
}
tmp <- binner(100)
tmp

This works great for lightweight implementations, but try experimenting with the values of N. It gets inefficient really quickly (run these one at a time; your computer might start hanging):

tmp <- binner(1000) 
tmp <- binner(10000)
tmp <- binner(100000)
tmp <- binner(1000000)
tmp <- binner(10000000)

I know that a classic "R-like" way to resolve for-loop inefficiencies is through vectorization. This one is stumping me, though, because I'm not sure how to vectorize the application of logic on an element-by-element basis.

Any thoughts? How do we bring the run times down on this other than setting up a bunch of parallel workers?

-Aaron


Solution

How about this with cut()? I've returned a list so that the time comes out as well, but you can just return the bins. Also, I used 5 bins: one from 0 to the minimum, three between the 4 quantile points, and one from the 75% point to Inf:

binner <- function(N = 1000) {
  vec <- runif(N)
  # Time only the binning step; cut() is vectorized over vec
  timer <- system.time(
    ret <- cut(vec, breaks = c(0, quantile(vec, seq(0, 0.75, 0.25)), Inf), labels = 1:5)
  )
  list(ret, timer)
}

binner(10000000)

...
[[2]]
user  system elapsed 
4.55    0.12    4.70 
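
If you want exactly the 1 - 4 bins from the question rather than five labels, here is a minimal sketch of one possible way, assuming the cut points are sorted (which quantile() gives you here). It swaps in findInterval(), which counts how many cut points are <= each value; that count should match max(which(x >= cuts)) from the original loop. The helper names bin_loop and bin_vectorized are made up for illustration and are not from the post above. A cut() call with right = FALSE and labels = FALSE is included for comparison.

```r
# Hypothetical helper names, used only for this illustration.

# The question's element-by-element approach, kept as a reference.
bin_loop <- function(vec, cuts) {
  sapply(vec, function(x) max(which(x >= cuts)))
}

# Vectorized alternative: for each x, findInterval() returns the number
# of (sorted) cut points that are <= x.
bin_vectorized <- function(vec, cuts) {
  findInterval(vec, cuts)
}

set.seed(42)                                 # arbitrary seed, for reproducibility
vec  <- runif(1000)
cuts <- quantile(vec, seq(0, 0.75, 0.25))    # same cut points as the question

bins_loop <- bin_loop(vec, cuts)
bins_vec  <- bin_vectorized(vec, cuts)

# cut() with left-closed intervals and integer codes instead of a factor
bins_cut <- cut(vec, breaks = c(cuts, Inf), right = FALSE, labels = FALSE)

# Check that all three approaches agree on this sample
print(identical(as.integer(bins_loop), bins_vec))
print(identical(bins_cut, bins_vec))
print(table(bins_vec))
```

To compare speed on your own machine, you could wrap each call in system.time(), the same way the timing is done above.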
Licensed under: CC-BY-SA with attribution