Question

The STL random_sample function use a replace strategy to sample from the given interval. Why we need a decreasing probability, I've seen a similar algorithm without decrease the replacement probability. What is the difference. The code is as:

/*This is an excerpt from STL implementation*/
template <class InputIterator, class RandomAccessIterator, class Distance>
RandomAccessIterator __random_sample(InputIterator first, InputIterator last,
                                         RandomAccessIterator out,
                                 const Distance n)
{
  Distance m = 0;
  Distance t = n;
  for ( ; first != last && m < n; ++m, ++first) //the strategy is also used in mahout
    out[m] = *first;//fill it

    while (first != last) {
        ++t;
        Distance M = lrand48() % t;
        if (M < n)
          out[M] = *first;//replace it with a decreasing probability
        ++first;
    }

  return out + m;
}
Was it helpful?

Solution

It is obvious that the first element (indeed, the first n elements) has to be chosen with the probability of 1 (the "fill it" stage). In order to remain in the final sample, this first element needs then to survive m-n possible replacements - that's what brings the probability of it being in the sample down to n/m. On the other hand, the last element only participates in one replacement; it should thus be added to the sample with the probability of n/m from the start.

For simplicity, suppose that you only need to pick one element, out of m, using this replacement strategy (note that you don't know up front what m is: you just iterate until you suddenly reach the end). You take the first element, and keep it with a probability of 1 (for all you know, this is the only element). Then you discover the second element, and you throw a coin and either keep it or discard it with the probability of 1/2. At this point, each of the first two elements has 1/2 probability to be the chosen one.

Now you see the third element, and you keep it with probability of 1/3. Each of the first two elements had 1/2 probability of participating in this encounter, and 2/3 of surviving it - for a total of 1/2 * 2/3 == 1/3 probability of still being around. So, again, each of the first three elements has 1/3 probability of being chosen at this point.

The inductive step of proving that, after tth element is inspected, each of the first t elements has 1/t probability of being chosen is left as an exercise for the reader. So is the extension of the proof to a sample of size n > 1.

Licensed under: CC-BY-SA with attribution
Not affiliated with StackOverflow
scroll top