Question

Given a numeric vector with N real numbers, what's the fastest way to sample k values, such that higher values have greater probability of being selected?

Mathematically:

prob(X) > prob(Y) when X > Y (linearly)

This is easy with sample() when all entries are positive, just use the prob arg:

N = 1000
k = 600
x = runif(N, 0, 10)
results = sample(x, k, replace = TRUE, prob = x)

But it doesn't work in my case, because some values might be negative. I cannot drop or ignore the negative values; that's the problem.

So, what's the fastest (code speed) way of doing this? Obviously I know how to solve it; the issue is code speed, since one method should be slower than the other, I guess:

1 - Normalize the x vector (a call to `range()` would be necessary, plus a division)

2 - Add max(x) to x (a call to `max()` then an addition)
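A quick sketch of the two options on hypothetical toy data (not the data from the snippet above). Note, incidentally, that adding max(x) only guarantees non-negative weights when max(x) >= -min(x); shifting by min(x) always works:

```r
set.seed(42)
x <- runif(1000, 0, 10) - 5   # toy vector containing negative values

# Option 1: normalize to [0, 1] using range()
r  <- range(x)
w1 <- (x - r[1]) / (r[2] - r[1])

# Option 2: shift by a constant so every weight is non-negative;
# subtracting min(x) does this for any input vector
w2 <- x - min(x)

# both are valid `prob` arguments for sample()
stopifnot(all(w1 >= 0), all(w2 >= 0))
```

Since w1 is just w2 divided by the constant diff(r), the two vectors are proportional, so after normalization they define exactly the same sampling distribution; only the shift matters, not the division.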

Thanks.



Solution

A few comments. First, it's still not exactly clear what you want. Obviously, you want larger numbers to be chosen with higher probability, but there are a lot of ways of doing this. For example, either rank(x) or x-min(x) will produce a vector of non-negative weights which are monotonic in x.
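For instance, both transformations produce non-negative weights that preserve the ordering of x (a small sketch on made-up data):

```r
set.seed(1)
x <- rnorm(10)            # contains negative values

w_rank  <- rank(x)        # ranks 1..10, always positive
w_shift <- x - min(x)     # shifted so the smallest value maps to 0

# non-negative, and monotonic in x (same ordering as x itself)
stopifnot(all(w_rank > 0), all(w_shift >= 0))
stopifnot(identical(order(w_rank), order(x)),
          identical(order(w_shift), order(x)))
```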

Another point, you don't need to normalize the weights, because sample will do that for you, provided that the weights are non-negative:

> set.seed(1)
> sample(1:10,prob=1:10)
 [1]  9  8  6  2 10  3  1  5  7  4
> set.seed(1)
> sample(1:10,prob=(1:10)/sum(1:10))
 [1]  9  8  6  2 10  3  1  5  7  4

On edit: The OP is now asking for a weighting function which is "linear" in the input vector. Technically this is impossible, because linear functions are of the form f(X)=cX, so if a vector x contains both positive and negative values, then any linear function of x will also contain both positive and negative values, unless c=0, in which case it still does not give a valid vector of probability weights.

I think what you mean by "linear" is simply x-min(x). This is not a linear function, but an affine function. Moreover, even if you had specified that you wanted P(X) to vary as an affine function of X, that still would not have uniquely determined the probability weights, because there are an infinite number of possible affine functions that would yield valid weights (e.g. x-min(x)+1, etc.)
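To see that the choice of affine function matters, compare the normalized weights from two such functions on a tiny made-up vector:

```r
x <- c(-2, 0, 1, 3)

w1 <- x - min(x)       # weights 0, 2, 3, 5
w2 <- x - min(x) + 1   # weights 1, 3, 4, 6

p1 <- w1 / sum(w1)     # 0.0, 0.2, 0.3, 0.5
p2 <- w2 / sum(w2)     # ~0.071, ~0.214, ~0.286, ~0.429

# both are valid and monotone in x, yet they define different
# distributions; notably, w1 gives the minimum of x probability
# zero, while w2 does not
stopifnot(all(w2 > 0), p1[1] == 0, p2[1] > 0)
```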

In any case, assuming x-min(x) is what you want, the question now becomes, what is the fastest way to compute x-min(x) in R. And I'm pretty sure that the answer is just x-min(x).

Finally, for constants anywhere near what you have in your example, there is not much point in trying to optimize the calculation of weights, because the random sampling is going to take much longer anyway. For example:

> library(microbenchmark)
> x<-rnorm(1000)
> k<-600
> p<-x-min(x)
> microbenchmark(x-min(x),sample(x,k,T,p))
Unit: microseconds
               expr   min      lq  median      uq    max neval
         x - min(x)  6.56  6.9105  7.0895  7.2515 13.629   100
 sample(x, k, T, p) 50.30 51.4360 51.7695 52.1970 66.196   100
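Putting it together, a sketch of the full sampling step under the x - min(x) interpretation (using the same toy constants as the benchmark above):

```r
set.seed(1)
N <- 1000
k <- 600
x <- rnorm(N)                     # may contain negative values

p <- x - min(x)                   # non-negative weights, monotone in x
results <- sample(x, k, replace = TRUE, prob = p)

stopifnot(length(results) == k, all(results %in% x))
```

One caveat: min(x) itself receives weight zero and can never be drawn; if every element must have positive probability, use something like x - min(x) + eps with a small positive constant instead.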
Licensed under: CC-BY-SA with attribution