Question

We are given a set of objects, say integers, $S$. In addition, we are given a predicate $P$, for example $P(i): \Leftrightarrow i \geq 0$. We don't know in advance how many elements of $S$ satisfy the predicate $P$, but we would like to sample or choose an element uniformly at random from $S' = \{ i \mid i \in S \wedge P(i) \}$.

The naive approach is to scan $S$ and for example record all the integers or indices for which $P$ holds, then choose one of them uniformly at random. The downside is that in the worst-case, we need $|S|$ space.

For large sets or in say a streaming environment the naive approach is not acceptable. Is there an in-place algorithm for the problem?

Was it helpful?

Solution

The following approach is often referred to as reservoir sampling. Start scanning the set $S$ and maintain an index $j$ to the set. The first element you encounter satisfying the predicate, set $j$ to it with probability $1/1$. Update your choice for $j$ when you encounter the second such element with probability $1/2$. When you encounter the $k$th element satisfying the predicate, set $j$ to it with probability $1/k$.

The algorithm works in $\Theta(n)$ time and in constant space. For correctness, it should be shown that every item has an equal probability of being chosen and that the events they are sampled are independent. The second claim of independency is obvious: our decisions are not based on whether we have previously sampled items. The first claim (equal likelihood) can be proven quite easily with induction.

For details, see J. S. Vitter, Random Sampling With a Reservoir, algorithm R. For a more compact presentation, see for example these slides.

Licensed under: CC-BY-SA with attribution
Not affiliated with cs.stackexchange
scroll top