Question

This link "http://had00b.blogspot.com/2013/07/random-subset-in-mapreduce.html" describes how to implement reservoir sampling in the MapReduce framework. Their solution seems complicated to me, and I believe the following simpler approach would work.

Problem: Given a very large number of samples, generate a set of size k such that each sample has equal probability of being present in the set.

Proposed solution:

  1. Map operation: For each input number x, output the pair (i, x), where i is chosen uniformly at random from the range 0 to k-1.
  2. Reduce operation: Among all numbers sharing the same key, pick one uniformly at random.
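To make the proposal concrete, here is a minimal local sketch of the two phases in Python (simulating the MapReduce framework with plain lists and dictionaries; the function names are mine, not from the linked post):

```python
import random
from collections import defaultdict

def map_phase(samples, k):
    # Map: tag each sample with a key chosen uniformly from 0..k-1.
    return [(random.randrange(k), x) for x in samples]

def reduce_phase(pairs):
    # Reduce: group values by key, then pick one value per key
    # uniformly at random.
    buckets = defaultdict(list)
    for key, value in pairs:
        buckets[key].append(value)
    return [random.choice(values) for values in buckets.values()]

def proposed_sample(samples, k):
    return reduce_phase(map_phase(samples, k))
```

Note that the output contains one element per *non-empty* key, so it can end up with fewer than k elements.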

Claim: The probability of any number being in the resulting set of size k is k/n (where n is the total number of samples).

Proof intuition:

Since the map operation randomly assigns each input sample to a bucket i (0 <= i <= k-1), the expected size of each bucket is n/k. Each number lands in exactly one bucket, say bucket i. The probability that it gets picked in the reduce operation for bucket i is then 1/(n/k) = k/n.
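For what it's worth, the per-element probability can be computed exactly rather than via the n/k heuristic. Fix an element x mapped to bucket i; the number B of *other* elements landing in bucket i is Binomial(n-1, 1/k), and x is picked with probability 1/(B+1), so

```latex
\Pr[x \text{ is output}]
  = \mathbb{E}\!\left[\frac{1}{B+1}\right]
  = \frac{k}{n}\left(1 - \left(1 - \frac{1}{k}\right)^{n}\right),
  \qquad B \sim \mathrm{Binomial}\!\left(n-1, \frac{1}{k}\right).
```

This is the same for every element, but strictly less than k/n; the shortfall comes from the (1 - 1/k)^n chance that a given bucket receives no element at all.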

I would appreciate any second thoughts on whether my solution is correct.


Solution

There's a small flaw in your argument: your algorithm may not return a sample of size k, since it can happen that no element gets mapped to a particular key. In the extreme case (even though the chance is small), all of the input elements may get mapped to a single key, in which case your algorithm returns only one element.

The event of "missing" a specific key has probability ((k-1)/k)^n = (1 - 1/k)^n, which is approximately e^{-n/k} (by the Taylor approximation). This is negligible if k is much smaller than n, but if k is proportional to n, say k = n/2, then this bad event happens with constant probability.
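A quick simulation makes it easy to check the closed form against the Taylor approximation (Python; the parameter choices below are mine, picked to match the k = n/2 case):

```python
import math
import random

def miss_probability(n, k, trials=20_000):
    """Empirically estimate the probability that key 0 receives no
    element when n items each pick a key uniformly in 0..k-1."""
    misses = sum(
        all(random.randrange(k) != 0 for _ in range(n))
        for _ in range(trials)
    )
    return misses / trials

n, k = 100, 50                    # k proportional to n: k = n/2
exact = (1 - 1 / k) ** n          # ((k-1)/k)^n
approx = math.exp(-n / k)         # e^{-n/k}
print(exact, approx, miss_probability(n, k))
```

With k = n/2, both the exact value and the approximation come out to roughly 0.13, i.e. a constant-probability failure, as claimed.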

Licensed under: CC-BY-SA with attribution