Question

How can I speed up probability-weighted sampling in R?

# Let's assume we are considering the following example:
w <- sample(1:4000, size = 2e6, replace = TRUE)

# "w" will be integer, so we convert it to numeric.
w <- as.numeric(w)

# The sampling process actually has to be repeated many times.
M <- matrix(NA, 10, 2000)
system.time(
for (r in 1:10){
  ix <- sample(1:2e6, size = 2000, prob = w/sum(w))
  M[r,] <- ix
})
# It's worth mentioning that without "prob = w/sum(w)" the sampling is considerably faster.
# The main goal is to speed up sampling with probability weights!
system.time(ix <- sample(1:2e6, size = 2000, prob = w/sum(w)))

Weighted sampling takes 9.84 seconds, unweighted sampling 0.01 seconds. If you have any idea how to speed up weighted sampling, please feel free to answer.

Solution

The speed problem is limited to weighted sampling without replacement. Here's your code again, with the parts unrelated to sample moved outside the loop.

normalized_weights <- w/sum(w)
#No weights
system.time(
for (r in 1:10){
  ix <- sample(2e6, size = 2000)
})
#Weighted, no replacement
system.time(
for (r in 1:10){
  ix <- sample(2e6, size = 2000, prob = normalized_weights)
})
#Weighted with replacement
system.time(
for (r in 1:10){
  ix <- sample(2e6, size = 2000, replace = TRUE, prob = normalized_weights)
})

The big problem is that when you do weighted sampling without replacement, each time you pick a value, the weights need to be recalculated. See ?sample:

If 'replace' is false, these probabilities are applied sequentially, that is the probability of choosing the next item is proportional to the weights amongst the remaining items.
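
Conceptually, that sequential rule looks like the sketch below (a hypothetical illustration, not R's actual C implementation, but it follows the same semantics): every draw has to account for all remaining weights, so drawing size items out of n candidates costs on the order of n * size operations rather than a single pass.

# Illustration only: naive sequential weighted sampling without replacement.
naive_weighted_sample <- function(n, size, w) {
  w <- as.numeric(w)
  ix <- integer(size)
  for (i in seq_len(size)) {
    # draw one index with probability proportional to the remaining weights
    ix[i] <- sample.int(n, 1, prob = w)
    # zero out the chosen weight so it cannot be drawn again
    w[ix[i]] <- 0
  }
  ix
}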

There may be faster solutions than using sample (I don't know how well it's been optimized) but it's a fundamentally more computationally intensive task than the unweighted/weighted-with-replacement sampling.
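
One alternative that should reproduce the same sampling scheme (assuming strictly positive weights) is the "exponential keys" trick of Efraimidis & Spirakis: draw one Exp(1) variate per item, divide it by that item's weight, and keep the indices of the size smallest values. The dominant cost is then a single ordering of the keys instead of per-draw reweighting. A minimal sketch, where fast_weighted_sample is just an illustrative helper, not a base R function:

fast_weighted_sample <- function(size, w) {
  # smaller key => more likely to be selected, proportionally to w
  keys <- rexp(length(w)) / w
  # indices of the 'size' smallest keys form the weighted sample without replacement
  head(order(keys), size)
}

system.time(
for (r in 1:10){
  ix <- fast_weighted_sample(2000, w)
})

On this example that should be far faster than the weighted sample() call, while drawing from the same distribution as sample(..., replace = FALSE, prob = w/sum(w)).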

Licensed under: CC-BY-SA with attribution
Not affiliated with StackOverflow