Question

I came across a question that was asked in an interview.

Q - Imagine you are given a really large stream of data elements (queries on google searches in May, products bought at Walmart during the Christmas season, names in a phone book, whatever). Your goal is to efficiently return a random sample of 1,000 elements evenly distributed from the original stream. How would you do it?

I am looking for -

  1. What does random sampling of a data set mean? (I mean, I could simply do a coin toss and select a string from the input whenever the outcome is 1, repeating until I have 1,000 samples.)
  2. What do I need to consider while doing so? For example, is it better to pick a contiguous block of 1,000 strings at a random position, or to pick one string at a time with something like a coin toss?

This may be a vague question. I tried googling "randomly sample data set" but did not find any relevant results.


Solution

A binary sample/don't-sample decision may not be the right answer. Suppose you want to sample 1,000 strings and you do it via coin toss; that would mean you are done after visiting roughly 2,000 strings. What about the rest of the strings?

I read this post - http://gregable.com/2007/10/reservoir-sampling.html

which answers this question quite clearly.

Let me put the summary here -

SIMPLE SOLUTION

Assign a random number to every element as you see it in the stream, and keep the 1,000 elements with the largest random numbers at all times.
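
A minimal sketch of that idea in Python (the names are mine, and it assumes the stream is any iterable): keep a min-heap of the 1,000 largest random keys seen so far, so each new element only has to compare against the smallest key currently kept.

    import heapq
    import random

    def random_tag_sample(stream, k=1000):
        # Min-heap of (random_key, tie_breaker, element); never grows beyond k entries.
        heap = []
        for i, element in enumerate(stream):
            key = random.random()                    # the random number assigned to this element
            if len(heap) < k:
                heapq.heappush(heap, (key, i, element))
            elif key > heap[0][0]:
                # This element's key beats the smallest kept key, so it replaces it.
                heapq.heapreplace(heap, (key, i, element))
        return [element for _, _, element in heap]

The tie-breaker index is only there so that two equal keys never force a comparison of the elements themselves.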

RESERVOIR SAMPLING

Make a reservoir (array) of 1,000 elements and fill it with the first 1,000 elements in your stream. Start with i = 1,001. With what probability after the 1,001st step should element 1,001 (or any element, for that matter) be in the set of 1,000 elements? The answer is easy: 1,000/1,001. So, generate a random number between 0 and 1, and if it is less than 1,000/1,001, take element 1,001. If you choose to add it, replace an element in the reservoir (say element #2) chosen at random.

Element #2 is definitely in the reservoir at step 1,000, and the probability of it getting removed is the probability of element 1,001 getting selected multiplied by the probability of #2 getting randomly chosen as the replacement candidate. That probability is 1,000/1,001 * 1/1,000 = 1/1,001. So, the probability that #2 survives this round is 1 minus that, or 1,000/1,001.

This extends to the i'th round: keep the i'th element with probability 1,000/i, and if you keep it, replace a random element from the reservoir. The probability that any earlier element is in the reservoir before this step is 1,000/(i-1). The probability that it gets removed at this step is 1,000/i * 1/1,000 = 1/i, so the probability that it sticks around, given that it is already in the reservoir, is (i-1)/i. Thus each element's overall probability of being in the reservoir after i rounds is 1,000/(i-1) * (i-1)/i = 1,000/i.
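
For reference, here is a small Python sketch of that scheme (the code is mine, not taken from the linked post; the function name and the use of random.random / random.randrange are my own choices):

    import random

    def reservoir_sample(stream, k=1000):
        reservoir = []
        for i, element in enumerate(stream, start=1):      # i is the 1-based position in the stream
            if i <= k:
                reservoir.append(element)                   # fill the reservoir with the first k elements
            elif random.random() < k / i:                   # keep element i with probability k/i
                reservoir[random.randrange(k)] = element    # evict a uniformly chosen slot
        return reservoir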

Other tips

I think you have used the word infinite a bit loosely. The very premise of sampling is that every element has an equal chance to be in the sample, and that is only possible if you go through every element at least once. So I would translate 'infinite' to mean a very large number, indicating that you need a single-pass solution rather than multiple passes.

Reservoir sampling is the way to go, though the analysis from @abipc, while in the right direction, is not completely correct.

It is easier if we are first clear on what we want. Imagine you have N elements (N unknown) and you need to pick 1,000 of them. This means we need to devise a sampling scheme where the probability of any element being in the sample is exactly 1,000/N, so each element has the same probability of being in the sample (no preference to any element based on its position in the original list). The scheme mentioned by @abipc works fine; the probability calculation goes like this:

After the first step you have 1,001 elements, so we need to pick each element with probability 1,000/1,001. We pick the 1,001st element with exactly that probability, so that is fine. Now we also need to show that every other element has the same probability of being in the sample.

p(any other element remaining in the sample) = [1 - p(that element is removed from the sample)]

    = [1 - p(1001st element is selected) * p(the element is picked to be removed)]
    = [1 - (1000/1001) * (1/1000)] = 1000/1001

Great, so now we have proven that every element has probability 1,000/1,001 of being in the sample. This precise argument can be extended to the ith step using induction.
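
Written out, that inductive step looks like this (my notation, with k = 1000, assuming each of the first i-1 elements is already in the sample with probability k/(i-1)):

    % For any element x already in the sample when element i arrives:
    \begin{align*}
    P(x \text{ survives round } i)
      &= 1 - P(\text{element } i \text{ is kept}) \cdot P(x \text{ is chosen for replacement}) \\
      &= 1 - \frac{k}{i} \cdot \frac{1}{k} = \frac{i-1}{i}, \\
    P(x \text{ is in the sample after round } i)
      &= \frac{k}{i-1} \cdot \frac{i-1}{i} = \frac{k}{i}.
    \end{align*}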

As far as I know, this class of algorithms is called Reservoir Sampling algorithms.

I know one of them from data mining, but I don't know its name:

  1. Collect the first S elements of the stream into your storage, which has maximum size S.
  2. Suppose the next element of the stream has index N (so initially N = S + 1).
  3. With probability S/N keep the new element; otherwise discard it.
  4. If you kept element N, replace one of the elements in the sample of size S, picked uniformly at random.
  5. Set N = N + 1, read the next element, and go to step 3.

It can be proved that at any step of such stream processing, your storage of size S contains each element seen so far with equal probability S/N_you_have_seen.

So, for example, S = 10 and N_you_have_seen = 10^6. S is a finite number, while N_you_have_seen can grow without bound.
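
As a sanity check, here is a short Python sketch (my own code, using much smaller numbers than the S = 10, N = 10^6 example above so it runs quickly) that runs the scheme many times and confirms each element lands in the sample roughly S/N of the time:

    import random
    from collections import Counter

    def reservoir_sample(stream, s):
        # The S/N scheme from the numbered steps above.
        sample = []
        for n, x in enumerate(stream, start=1):
            if n <= s:
                sample.append(x)
            elif random.random() < s / n:
                sample[random.randrange(s)] = x
        return sample

    S, N, TRIALS = 10, 1000, 20_000
    counts = Counter()
    for _ in range(TRIALS):
        counts.update(reservoir_sample(range(N), S))

    # Each element should be sampled about TRIALS * S / N = 200 times.
    print("expected hits per element:", TRIALS * S / N)
    print("observed for a few elements:", [counts[i] for i in (0, 1, N // 2, N - 1)])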

Licensed under: CC-BY-SA with attribution
Not affiliated with StackOverflow