Question

For an application I'm working on, I need to sample a small set of values from a very large data set: on the order of a few hundred values taken from about 60 trillion (and growing).

Usually I use the technique of checking whether a uniform random number r (0..1) is less than S/T, where S is the number of sample items I still need and T is the number of items in the set I haven't considered yet.
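For reference, that per-item test is the classic selection-sampling technique, and it looks something like this (a minimal sketch; the function and argument names are mine, and the data is assumed to arrive as an iterable):

import random

def classic_sample(items, T, S):
    """Selection sampling: one die roll per item (too slow at 60 trillion)."""
    for item in items:
        if random.random() < float(S) / T:  # keep with probability S/T
            yield item
            S -= 1
            if S == 0:
                return
        T -= 1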

However, with this new data, I don't have time to roll the die for each value; there are too many. Instead, I want to generate a random number of entries to "skip", pick the value at the next position, and repeat. That way I can just roll the die and access the list S times. (S is the size of the sample I want.)

I'm hoping there's a straightforward way to do that and create an unbiased sample, along the lines of the S/T test.

  • To be honest, approximately unbiased would be OK.

  • This is related (more or less a follow-on) to this person's question:

https://math.stackexchange.com/questions/350041/simple-random-sample-without-replacement

  • One more side question... the person who first showed this to me called it the "mailman's algorithm", but I'm not sure if he was pulling my leg. Is that right?

Solution 2

So I thought about it and got some help from http://math.stackexchange.com.

It boils down to this:

  • If I picked n items randomly all at once, where would the first one land? That is, min({r_1 ... r_n}). A helpful fellow at math.stackexchange boiled it down to this equation:

x = 1 - (1 - r) ** (1 / n)

that is, the CDF of the minimum of n uniform draws is 1 - (1 - x) ** n; set that equal to a uniform draw r and solve for x (inverse-transform sampling), which gives the equation above. Pretty easy.

  • If I generate a uniform random number and plug it in for r, the result is distributed the same as min({r_1 ... r_n}) -- the same way the lowest item would fall. Voila! I've just simulated picking the first item as if I had randomly selected all n. (A quick empirical check of this appears after the list.)

  • So I skip over that many items in the list, pick that one, and then....

  • Repeat until n is 0
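As a sanity check (my own sketch, not from the math.stackexchange thread; the trial count and quantiles are arbitrary), I can compare the two distributions empirically:

import numpy

def check_min_distribution(n=10, trials=100000):
    """Compare min of n uniforms against the inverse-transform formula."""
    direct = numpy.random.uniform(size=(trials, n)).min(axis=1)
    r = numpy.random.uniform(size=trials)
    inverse = 1.0 - (1.0 - r) ** (1.0 / n)
    # quantiles of the two samples should agree up to Monte Carlo noise
    for q in (10, 50, 90):
        print q, numpy.percentile(direct, q), numpy.percentile(inverse, q)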

That way, if I have a big database (like Mongo), I can skip, find_one, skip, find_one, and so on, until I have all the items I need.
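In pymongo terms, that loop might look roughly like this, reusing the `sample` generator from the listing below. The client and collection names are made up, and I'm assuming the collection's natural order is stable between queries:

import pymongo

def sample_mongo(collection, total, n):
    """Fetch each sampled document by its randomly chosen offset."""
    for offset in sample(total, n):
        # find_one forwards `skip` to find(); this reads one document
        # at the given position in the collection's natural order
        yield collection.find_one(skip=offset)

# hypothetical usage:
# client = pymongo.MongoClient()
# docs = list(sample_mongo(client.mydb.points, total_count, 300))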

The only problem I'm having is that my implementation favors the first and last element in the list. But I can live with that.

In Python 2.7, my implementation looks like:

import numpy
import pprint


def skip(n):
    """
    Produce a random number with the same distribution as
    min({r_1, ... r_n}) to see where the next smallest one is.
    """
    r = numpy.random.uniform()
    return 1.0 - (1.0 - r) ** (1.0 / n)


def sample(T, n):
    """
    Yield the indices of n items sampled from a list of size T.
    """
    t = T  # items not yet considered
    i = 0  # current (fractional) position in the list
    while t > 0 and n > 0:
        # scale the unit-interval skip to the remaining items,
        # leaving room for the n - 1 picks still to come
        s = skip(n) * (t - n + 1)
        i += s
        yield int(i) % T  # % T guards against rounding past the end
        i += 1
        t -= s + 1
        n -= 1


if __name__ == '__main__':

    t = [0] * 100
    for c in xrange(10000):
        for i in sample(len(t), 10):
            t[i] += 1  # this is where we would read value i

    pprint.pprint(t)

OTHER TIPS

How about this:

  • precompute S random numbers from 0 to the size of your dataset.
  • order your numbers, low to high
  • store the difference between consecutive numbers as the skip size
  • iterate through the large dataset using the skip size above.

...the assumption being that the order in which you collect the samples doesn't matter.
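A minimal sketch of that idea (the names are mine; note that independently drawn positions can collide, though at 60 trillion items duplicates are vanishingly unlikely):

import random

def precomputed_skips(total, s):
    """Turn s pre-drawn, sorted positions into successive skip sizes."""
    positions = sorted(random.randrange(total) for _ in xrange(s))
    gaps, prev = [], -1
    for p in positions:
        gaps.append(p - prev - 1)  # items to pass over before the next pick
        prev = p
    return gaps

def sample_with_skips(items, gaps):
    """Walk the dataset once, skipping `gap` items before each pick."""
    it = iter(items)
    for gap in gaps:
        for _ in xrange(gap):
            next(it)
        yield next(it)

# e.g. list(sample_with_skips(range(100), precomputed_skips(100, 10)))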

Licensed under: CC-BY-SA with attribution