Question

A source provides a stream of items $x_1, x_2,\dots$ . At each step $n$ we want to save a random sample $S_n \subseteq \{ (x_i, i)|1 \le i \le n\}$ of size $k$, i.e. $S_n$ should be a uniformly chosen sample from all $\tbinom{n}{k}$ possible samples consisting of seen items. So at each step $n \ge k$ we must decide whether to add the next item to $S$ or not. If so we must also decide which of the current items to remove from $S$ .

State an algorithm for the problem. Prove its correctness.

Was it helpful?

Solution

Due to the dubious nature of the question, I only provide hints.

Have you tried the obvious? With probability $\frac{1}{n}$, add the new element to the sample. If it is added, choose one of the elements already in the sample uniformly at random and drop it. Sounds about fair, does it not?

For a proof, you will have to proceed inductively. In the step, you assume that $S_{n-1}$ is indeed a uniform sample. From this and the way you choose whether to include $x_n$ and which element to drop, you have to get that $S_n$ is a uniform sample, too.

Try if you can prove above idea correct. If it is not, find out where the problem is and fix it. See this answer to a similar question for a detailed application of this technique.

OTHER TIPS

The best algorithm for your problem is Reservoir Sampling Algorithm. Read this

Licensed under: CC-BY-SA with attribution
Not affiliated with cs.stackexchange
scroll top