Question

Is there an algorithm for how to perform reservoir sampling when the points in the data stream have associated weights?

Was it helpful?

Solution

The algorithm by Pavlos Efraimidis and Paul Spirakis solves exactly this problem. The original paper with complete proofs is published with the title "Weighted random sampling with a reservoir" in Information Processing Letters 2006, but you can find a simple summary here.

The algorithm works as follows. First observe that another way to solve the unweighted reservoir sampling is to assign to each element a random id R between 0 and 1 and incrementally (say with a heap) keep track of the top k ids. Now let's look at weighted version, and let's say the i-th element has weight w_i. Then, we modify the algorithm by choosing the id of the i-th element to be R^(1/w_i) where R is again uniformly distributed in (0,1).

Another article talking about this algorithm is this one by the Cloudera folks.

OTHER TIPS

You can try the A-ES algorithm from this paper of S. Efraimidis. It's quite simple to code and very efficient.

Hope this helps,

Benoit

Licensed under: CC-BY-SA with attribution
Not affiliated with StackOverflow
scroll top