문제

Is there an algorithm for how to perform reservoir sampling when the points in the data stream have associated weights?

도움이 되었습니까?

해결책

The algorithm by Pavlos Efraimidis and Paul Spirakis solves exactly this problem. The original paper with complete proofs is published with the title "Weighted random sampling with a reservoir" in Information Processing Letters 2006, but you can find a simple summary here.

The algorithm works as follows. First observe that another way to solve the unweighted reservoir sampling is to assign to each element a random id R between 0 and 1 and incrementally (say with a heap) keep track of the top k ids. Now let's look at weighted version, and let's say the i-th element has weight w_i. Then, we modify the algorithm by choosing the id of the i-th element to be R^(1/w_i) where R is again uniformly distributed in (0,1).

Another article talking about this algorithm is this one by the Cloudera folks.

다른 팁

You can try the A-ES algorithm from this paper of S. Efraimidis. It's quite simple to code and very efficient.

Hope this helps,

Benoit

라이센스 : CC-BY-SA ~와 함께 속성
제휴하지 않습니다 StackOverflow
scroll top