Question

I want to quickly generate discrete random numbers where I have a known CDF. Essentially, the algorithm is:

  1. Construct the CDF vector (an increasing vector starting at 0 and end at 1) cdf
  2. Generate a uniform(0, 1) random number u
    • If u < cdf[1] choose 1
    • else if u < cdf[2] choose 2
    • else if u < cdf[3] choose 3 *...

Example

First generate an cdf:

cdf = cumsum(runif(10000, 0, 0.1))
cdf = cdf/max(cdf)

Next generate N uniform random numbers:

N = 1000
u = runif(N)

Now sample the value:

##With some experimenting this seemed to be very quick
##However, with N = 100000 we run out of memory
##N = 10^6 would be a reasonable maximum to cope with
colSums(sapply(u, ">", cdf))
Was it helpful?

Solution 2

How about using cut:

N <- 1e6
u <- runif(N)
system.time(as.numeric(cut(u,cdf)))
   user  system elapsed 
   1.03    0.03    1.07 

head(table(as.numeric(cut(u,cdf))))

  1   2   3   4   5   6 
 51  95 165 172 148  75 

OTHER TIPS

If you know the probability mass function (which you do, if you know the cumulative distribution function), you can use R's built-in sample function, where you can define the probabilities of discrete events with argument prob.

cdf = cumsum(runif(10000, 0, 0.1))
cdf = cdf/max(cdf)

system.time(sample(size=1e6,x=1:10000,prob=c(cdf[1],diff(cdf)),replace=TRUE))
   user  system elapsed 
   0.01    0.00    0.02 

If you have a finite number of possible values then you can use findInterval or cut or better sample as mentioned by @Hemmo.

However, if you want to generate data from a distribution that that theoretically goes to infinity (like the geometric, negative binomial, Poisson, etc.) then here is an algorithm that will work (this will also work with a finite number of values if wanted):

Start with your vector of uniform values and loop through the distribution values subtracting them from the vector of uniforms, the random value is the iteration where the value goes negative. This is a easier to see whith an example. This generates values from a Poisson with mean 5 (replace the dpois call with your calculated values) and compares it to using the inverse CDF (which is more efficient in this case where it exists).

i <- 0
tmp <- tmp2 <- runif(10000)
randvals <- rep(0, length(tmp) )

while( any(tmp > 0) ) {
    tmp <- tmp - dpois(i, 5)
    randvals <- randvals + (tmp > 0)
    i <- i + 1
}

randvals2 <- qpois( tmp2, 5 )

all.equal(randvals, randvals2)
Licensed under: CC-BY-SA with attribution
Not affiliated with StackOverflow
scroll top