Question

For an application I'm working on, I need to sample a small set of values from a very large data set: on the order of a few hundred values taken from about 60 trillion (and growing).

Usually I use the technique of checking whether a uniform random number r (0..1) is less than S/T, where S is the number of sample items I still need and T is the number of items in the set I haven't considered yet.
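For reference, that per-item test is the classic selection-sampling technique, and it looks something like this (a minimal sketch; the function and argument names are mine, and the data is assumed to arrive as an iterable):

import random

def classic_sample(items, T, S):
    """Selection sampling: one die roll per item (too slow at 60 trillion)."""
    for item in items:
        if random.random() < float(S) / T:  # keep with probability S/T
            yield item
            S -= 1
            if S == 0:
                return
        T -= 1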

However, with this new data, I don't have time to roll the die for each value; there are too many. Instead, I want to generate a random number of entries to "skip", pick the value at the next position, and repeat. That way I can just roll the die and access the list S times. (S is the size of the sample I want.)

I'm hoping there's a straightforward way to do that and create an unbiased sample, along the lines of the S/T test.

  • To be honest, approximately unbiased would be OK.

  • This is related (more or less a follow-on) to this person's question:

https://math.stackexchange.com/questions/350041/simple-random-sample-without-replacement

  • One more side question... the person who first showed this to me called it the "mailman's algorithm", but I'm not sure if he was pulling my leg. Is that right?

Solution 2

So I thought about it and got some help from http://math.stackexchange.com.

It boils down to this:

  • If I picked n items randomly all at once, where would the first one land? That is, min({r_1 ... r_n}). A helpful fellow at math.stackexchange boiled it down to this equation:

x = 1 - (1 - r) ** (1 / n)

that is, the CDF of the minimum of n uniform draws is 1 - (1 - x) ** n; set that equal to a uniform draw r and solve for x (inverse-transform sampling), which gives the equation above. Pretty easy.

  • If I generate a uniform random number and plug it in for r, the result is distributed the same as min({r_1 ... r_n}) -- the same way the lowest item would fall. Voila! I've just simulated picking the first item as if I had randomly selected all n. (A quick empirical check of this appears after the list.)

  • So I skip over that many items in the list, pick that one, and then....

  • Repeat until n is 0
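As a sanity check (my own sketch, not from the math.stackexchange thread; the trial count and quantiles are arbitrary), I can compare the two distributions empirically:

import numpy

def check_min_distribution(n=10, trials=100000):
    """Compare min of n uniforms against the inverse-transform formula."""
    direct = numpy.random.uniform(size=(trials, n)).min(axis=1)
    r = numpy.random.uniform(size=trials)
    inverse = 1.0 - (1.0 - r) ** (1.0 / n)
    # quantiles of the two samples should agree up to Monte Carlo noise
    for q in (10, 50, 90):
        print q, numpy.percentile(direct, q), numpy.percentile(inverse, q)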

That way, if I have a big database (like Mongo), I can skip, find_one, skip, find_one, and so on, until I have all the items I need.
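In pymongo terms, that loop might look roughly like this, reusing the `sample` generator from the listing below. The client and collection names are made up, and I'm assuming the collection's natural order is stable between queries:

import pymongo

def sample_mongo(collection, total, n):
    """Fetch each sampled document by its randomly chosen offset."""
    for offset in sample(total, n):
        # find_one forwards `skip` to find(); this reads one document
        # at the given position in the collection's natural order
        yield collection.find_one(skip=offset)

# hypothetical usage:
# client = pymongo.MongoClient()
# docs = list(sample_mongo(client.mydb.points, total_count, 300))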

The only problem I'm having is that my implementation favors the first and last element in the list. But I can live with that.

In Python 2.7, my implementation looks like:

import numpy
import pprint


def skip(n):
    """
    Produce a random number with the same distribution as
    min({r_1, ... r_n}) to see where the next smallest one is.
    """
    r = numpy.random.uniform()
    return 1.0 - (1.0 - r) ** (1.0 / n)


def sample(T, n):
    """
    Yield the indices of n items sampled from a list of size T.
    """
    t = T  # items not yet considered
    i = 0  # current (fractional) position in the list
    while t > 0 and n > 0:
        # scale the unit-interval skip to the remaining items,
        # leaving room for the n - 1 picks still to come
        s = skip(n) * (t - n + 1)
        i += s
        yield int(i) % T  # % T guards against rounding past the end
        i += 1
        t -= s + 1
        n -= 1


if __name__ == '__main__':

    t = [0] * 100
    for c in xrange(10000):
        for i in sample(len(t), 10):
            t[i] += 1  # this is where we would read value i

    pprint.pprint(t)

OTHER TIPS

How about this:

  • precompute S random numbers from 0 to the size of your dataset.
  • order your numbers, low to high
  • store the difference between consecutive numbers as the skip size
  • iterate through the large dataset using the skip size above.

...the assumption being that the order in which you collect the samples doesn't matter.
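A minimal sketch of that idea (the names are mine; note that independently drawn positions can collide, though at 60 trillion items duplicates are vanishingly unlikely):

import random

def precomputed_skips(total, s):
    """Turn s pre-drawn, sorted positions into successive skip sizes."""
    positions = sorted(random.randrange(total) for _ in xrange(s))
    gaps, prev = [], -1
    for p in positions:
        gaps.append(p - prev - 1)  # items to pass over before the next pick
        prev = p
    return gaps

def sample_with_skips(items, gaps):
    """Walk the dataset once, skipping `gap` items before each pick."""
    it = iter(items)
    for gap in gaps:
        for _ in xrange(gap):
            next(it)
        yield next(it)

# e.g. list(sample_with_skips(range(100), precomputed_skips(100, 10)))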

Licensed under: CC-BY-SA with attribution