Question

Suppose that we have a random generator that outputs numbers in the range $[0..R-1]$ with uniform distribution and we need to generate random numbers in the range $[0..N-1]$ with uniform distribution.

Suppose that $N < R$ and $N$ does not evenly divide $R$; in order to get a truly uniform distribution we can use the rejection sampling method:

  • let $k$ be the greatest integer such that $k N < R$
  • pick a random number $r$ in $[0..R-1]$
  • if $r < k N$ then output $r \bmod N$, otherwise keep trying with other random numbers $r', r'', \ldots$ until the condition is met (a code sketch follows this list)
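A minimal sketch of the procedure above in Python, assuming a source `rand_R()` that returns uniform values in $[0..R-1]$ (the names and the concrete values of $R$ and $N$ are illustrative, not part of the question):

```python
import random

R = 6  # range of the available generator: uniform over 0..R-1
N = 4  # desired range: uniform over 0..N-1

def rand_R():
    """Stand-in for the given uniform source over [0..R-1]."""
    return random.randrange(R)

def uniform_N():
    """Rejection sampling: accept r only when it falls below k*N."""
    k = R // N          # greatest k with k*N < R (since N does not divide R)
    while True:
        r = rand_R()
        if r < k * N:
            return r % N
        # otherwise reject r and draw again
```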
Is rejection sampling the only way to get a truly uniform discrete distribution?

If the answer is yes, why?

Note: if $N > R$ the idea is the same: generate a random number $r'$ in $[0..R^m-1]$ with $R^m \ge N$, for example $r' = R(\cdots R(R r_1 + r_2)\cdots) + r_m$ where each $r_i$ is a random number in the range $[0..R-1]$.
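A sketch of this combination step, reusing `rand_R` and `R` from the sketch above (illustrative only): $m$ draws are packed into one uniform value in $[0..R^m-1]$ by Horner's rule, after which the same rejection step applies.

```python
def rand_R_power(m):
    """Combine m draws from [0..R-1] into one uniform value in [0..R**m - 1] (Horner's rule)."""
    r = 0
    for _ in range(m):
        r = r * R + rand_R()
    return r
```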


Solution

Yes and no, depending on what you mean by “the only way”. Yes, in that there is no method that is guaranteed to terminate; the best you can do (for generic values of $N$ and $R$) is an algorithm that terminates with probability 1. No, in that you can make the “waste” as small as you like.

Why guaranteed termination is impossible in general

Suppose that you have a deterministic computation engine (a Turing machine or whatever floats your boat), plus an oracle that generates random elements of the $R$-element set $[0..R-1]$. Your goal is to generate an element of the $N$-element set $[0..N-1]$. The output of your engine depends only on the sequence of values returned by the oracle; it is a function $f$ of that potentially infinite sequence $(r_0, r_1, r_2, \ldots)$.

Suppose that your engine calls the oracle at most $m$ times. There might be traces for which the oracle is called less than $m$ times; if so, calling the oracle extra times so that it is always called exactly $m$ times does not change the output. So without loss of generality, we assume that the oracle is called exactly $m$ times. Then the probability of the outcome $x$ is the number of sequences $(r_0, \ldots, r_{m-1})$ such that $f(r_0, \ldots, r_{m-1}) = x$, divided by $R^m$: since the oracle is a uniform random generator, each sequence is equiprobable and has probability $1/R^m$. Hence the probability of each outcome is of the form $A/R^m$ where $A$ is an integer between $0$ and $R^m$.

If $N$ divides $R^m$ for some $m$, then you can generate a uniform distribution over $N$ elements by calling the random generator $m$ times (this is left as an exercise to the reader). Otherwise, this is impossible: there is no way to obtain an outcome with probability $1/N$. Note that the condition is equivalent to saying that all of $N$'s prime factors are also factors of $R$ (this is more permissive than what you wrote in your question; for example you can pick a random element among 4 with a 6-sided fair die, even though 4 does not divide 6).
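To make the die example concrete, here is a small illustrative sketch: two rolls of a fair six-sided die give $6^2 = 36$ equally likely outcomes, and since $4$ divides $36$, reducing modulo $4$ is exactly uniform with no rejection at all.

```python
import random

def d6():
    """One roll of a fair six-sided die, returned as a value in 0..5."""
    return random.randrange(6)

def uniform_4_from_d6():
    """Uniform over {0,1,2,3} using exactly two die rolls: 4 divides 6**2 = 36."""
    r = 6 * d6() + d6()   # uniform over 0..35
    return r % 4          # each residue class has 36/4 = 9 outcomes, so this is exactly uniform
```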

Reducing the waste

In your strategy, when $r \ge k\,N$, you don't have to re-draw immediately. Intuitively, there's a bit of entropy left in $[k\,N .. R-1]$ which you can keep in the mix.

Assume for a moment that you will in fact keep generating random numbers below $N$ forever, and you generate $u$ of them at a time by making $d$ draws. If you do straightforward rejection sampling on this grouped generation, the waste over $d$ draws is $\dfrac{R^d - k\,N^u}{d}$ where $k = \lfloor R^d / N^u \rfloor$, i.e. the remainder $R^d \bmod N^u$ divided by the number of draws. This can be as little as $\gcd(R,N)$. When $R$ and $N$ are coprime, you can make the waste arbitrarily small by picking sufficiently large values of $d$. For general values of $R$ and $N$, the calculation is more complicated because you need to take into account the generation of $\gcd(R,N)$ and $N/\gcd(R,N)$ separately, but again you can make the waste arbitrarily small with large enough groups.
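A sketch of this grouped strategy under the stated assumptions, reusing `R`, `N` and `rand_R_power` from the earlier sketches ($d$ draws produce $u$ outputs at a time; the code is illustrative, not the exact scheme described above):

```python
def uniform_N_batch(d, u):
    """Grouped rejection sampling: d draws from [0..R-1] yield u outputs in [0..N-1].

    Requires N**u <= R**d. The rejection region has size R**d mod N**u,
    which can be made small relative to d by choosing d and u suitably.
    """
    assert N ** u <= R ** d
    k = R ** d // N ** u
    while True:
        r = rand_R_power(d)            # uniform over [0 .. R**d - 1]
        if r < k * N ** u:
            outputs = []
            for _ in range(u):         # decompose the accepted value into u base-N digits
                outputs.append(r % N)
                r //= N
            return outputs             # the leftover r (in [0..k-1]) could be recycled by a fancier scheme
        # otherwise reject the whole group and redraw
```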

In practice, even when random numbers are relatively expensive to generate (e.g. in cryptography), it is rarely worth doing anything but simple rejection sampling, unless $N$ is small. For example, in cryptography, where $R$ is typically a power of 2 and $N$ is typically hundreds or thousands of bits, uniform random number generation usually proceeds by straight rejection sampling in the desired range.
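For instance, when $R$ is a power of two, the usual pattern looks like the following (a sketch using Python's standard secrets module, not any particular cryptographic library's API):

```python
import secrets

def uniform_below(n):
    """Uniform integer in [0, n-1] by rejection sampling on n.bit_length() random bits."""
    k = n.bit_length()           # 2**k >= n and 2**(k-1) <= n
    while True:
        r = secrets.randbits(k)  # uniform over [0 .. 2**k - 1]
        if r < n:
            return r             # accepted with probability n / 2**k, which is at least 1/2
```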

Other tips

Shannon's source coding theorem shows that, in some exact sense, you need $\log N/\log R$ samples (on average) of the type $[0,\ldots,R-1]$ to generate a random number of the type $[0,\ldots,N-1]$. More accurately, Shannon gives an (inefficient) algorithm that given $m$ samples of the first type, outputs $m(\log N/\log R - \epsilon)$ samples of the second type, with high probability. He also shows that outputting $m(\log N/\log R + \epsilon)$ samples with high probability is impossible.

Shannon's theorem also works in the more general case of a skewed input distribution (and probably also skewed output distribution). In that case, you need to replace the logarithm with the entropy. While the algorithm given by the theorem is defined randomly, in some cases it is possible to derandomize it (at the cost of somewhat worse performance).

Actually, no, rejection sampling is far from the only way of proceeding. Unfortunately, since computers store all information as bits, and thus can only consume random bits, any algorithm that draws a uniform random variable over $N$ values cannot have a bounded running time when the binary expansion of $1/N$ is infinite, i.e. whenever $N$ is not a power of two.

This theorem is a classical result by Knuth and Yao (1976), who developed the framework of DDG-trees (discrete distribution generating trees).

The methods described by Gilles are the typical sort of thing that has been done to mitigate the waste incurred by rejection, but of course generating according to Knuth and Yao's trees is much, much more efficient: on average 96% of random bits are saved.
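As an illustration of the flavour of this approach, here is a sketch in the spirit of the Knuth-Yao construction for the uniform case, in the form often called the "Fast Dice Roller": it consumes one random bit at a time and recycles the entropy of rejected draws (illustrative code, not a reference implementation):

```python
import random

def flip():
    """One uniform random bit."""
    return random.getrandbits(1)

def fast_dice_roller(n):
    """Uniform integer in [0, n-1] built from single random bits.

    Invariant: c is uniform over [0, v-1]. Each bit doubles v; once v reaches n,
    either c is a valid output, or the leftover range [n, v-1] is kept and reused
    instead of being thrown away.
    """
    v, c = 1, 0
    while True:
        v, c = 2 * v, 2 * c + flip()
        if v >= n:
            if c < n:
                return c
            v, c = v - n, c - n   # recycle the rejected part of the range
```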

I have given more information on this in the following CStheory post.
