Question

Python question. I'm generating a large array of objects, which I only need to make a small random sample. Actually generating the objects in question takes a while, so I wonder if it would be possible to somehow skip over those objects that don't need generating and only explicitly create those objects that have been sampled.

In other words, I now have

a = createHugeArray()
s = random.sample(a,len(a)*0.001)

which is rather wasteful. I would prefer something more lazy like

a = createArrayGenerator()
s = random.sample(a,len(a)*0.001)

I don't know if this works. The documentation on random.sample isn't too clear, though it mentions xrange as being very fast - which makes me believe it might work. Converting the array creation to a generator would be a bit of work (my knowledge of generators is very rusty), so I want to know if this works in advance. :)

An alternative I can see is to make a random sample via xrange, and only generate those objects that are actually selected by index. That's not very clean though, because the indices generated are arbitrary and unnecessary, and I would need rather hacky logic to support this in my generateHugeArray method.

For bonus points: how does random.sample actually work? Especially, how does it work if it doesn't know the size of the population in advance, as with generators like xrange?

Was it helpful?

Solution

There does not seem a way that avoids figuring out how the indices map to your permutations. If you don't know this, how would you create a random object from your array? You could either use the trick using xrange() you suggested yourself, or implement a class defining the __getitem__() and __len__() methods and pass and object of this class as population argument to random.sample().

Some further comments:

  • Converting createHugeArray() into a generator won't buy you anything -- random.sample() will just not work anymore. It needs an object supporting len().

  • So it does need to know the number of elements in the population right from the beginning.

  • The implementation features two different algorithms and chooses the one that will use less memory. For relatively small k (that is, in the case at hand) it will simply save the indices already chosen in a set and make a new random choice if it hits one of them.

Edit: A completely different approach would be to iterate over all permutations once and decide for each permutation if it should be included. If the total number of permutations is n and you would like to select k of them, you could write

selected = []
for i in xrange(n):
    perm = nextPermutation()
    if random.random() < float(k-len(selected))/(n-i):
        selected.append(perm)

This would choose exactly k permutations randomly.

OTHER TIPS

You could create a list of array indexes with sample and then generate the objects according to the results:

def get_object(index):
    return MyClass(index)

or something like this. Then use sample to generate the indexes you need and call this function with those indexes:

objs = map(get_object, random.sample(range(length), 0.001 * length))

This is a little indirect as in it only chooses from a list of possible array indexes.

Explaining how random.sample works,

random.sample(container, k) will return k number of values randomly from the container. Because a generator is iterable like lists, tuples and the keys or values in dicts it will iterate through the container and then take these random elements.

e.g. random.sample(xrange(111),4) will return something like [33,52,111,1] as k = 4 meaning 4 random numbers from the xrange generator upto 111.

I'm guessing the function createHugeArray() contains a piece of code that is repeated once for each object that is created. And I'm guessing the objects are generated from some sort of initial value or seed, in which case createHugeArray() looks something like this:

def createHugeArray( list_of_seeds ):
  huge_array = []                  
  for i in list_of_seeds:
    my_object = makeObject( i )
    huge_array.append( my_object )           
  return huge_array

(I used lists not arrays, but you get the idea.)

To do the random sampling before actually creating the objects, just add a line that generates a random number, and then only create the object if the random number is below a certain threshold. Say you only want one object in a thousand. random.randint(0,999) gives a number from 0 to 999 - so only generate an object if you get zero. The code above becomes:

import random

def createHugeArray( list_of_seeds ):
  huge_array = [] 

  for i in list_of_seeds:
    die_roll = random.randint(0,999)

    if( die_roll == 0 ):
      my_object = makeObject( i )
      huge_array.append( my_object ) 
  return huge_array

Of course if my guess about how your code works is wrong then this is useless to you, in which case sorry and good luck :-)

Licensed under: CC-BY-SA with attribution
Not affiliated with StackOverflow
scroll top