Question

How do I use random.shuffle() on a generator without initializing a list from the generator? Is that even possible? If not, how else should I use random.shuffle() on my list?

>>> import random
>>> random.seed(2)
>>> x = [1,2,3,4,5,6,7,8,9]
>>> def yielding(ls):
...     for i in ls:
...             yield i
... 
>>> for i in random.shuffle(yielding(x)):
...     print i
... 
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/usr/lib/python2.7/random.py", line 287, in shuffle
    for i in reversed(xrange(1, len(x))):
TypeError: object of type 'generator' has no len()

Note: random.seed() is used here so that the script produces the same output on every run.

Solution

In order to shuffle the sequence uniformly, random.shuffle() needs to know how long the input is. A generator cannot provide this; you have to materialize it into a list:

lst = list(yielding(x))
random.shuffle(lst)
for i in lst:
    print i

You could, instead, use sorted() with random.random() as the key:

for i in sorted(yielding(x), key=lambda k: random.random()):
    print(i)

but since this also produces a list, there is little point in going this route.

Demo:

>>> import random
>>> x = [1,2,3,4,5,6,7,8,9]
>>> sorted(iter(x), key=lambda k: random.random())
[9, 7, 3, 2, 5, 4, 6, 1, 8]

Other tips

It's not possible to randomize the yield of a generator without temporarily saving all the elements somewhere. Luckily, this is pretty easy in Python:

tmp = list(yielding(x))
random.shuffle(tmp)
for i in tmp:
    print i

Note the call to list(), which reads all items and puts them into a list.

If you don't want to or can't store all elements, you will need to change the generator to yield in a random order.

Depending on the case, if you know how much data you have ahead of time, you can index the data and compute/read from it based on a shuffled index. This amounts to: 'don't use a generator for this problem', and without specific use-cases it's hard to come up with a general method.
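
A minimal sketch of that index-shuffling idea, assuming the number of items is known and each element can be produced from its index (compute() below is a hypothetical stand-in for however an element is read or derived):

import random

def compute(idx):
    # Hypothetical stand-in for reading or computing element idx.
    return idx * idx

n = 9  # assumed: the data size is known ahead of time
indices = list(range(n))
random.shuffle(indices)

# Produce elements lazily, but in a pre-shuffled index order.
shuffled = (compute(idx) for idx in indices)
for value in shuffled:
    print(value)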

Alternatively, if you do need to use the generator, it depends on how shuffled you want the data to be. As others have pointed out, a generator has no length, so at some point you will have to evaluate it, which could be expensive. If you don't need perfect randomness, you can introduce a shuffle buffer:

from itertools import islice

import numpy as np


def shuffle(generator, buffer_size):
    while True:
        # Pull the next chunk of items off the generator.
        buffer = list(islice(generator, buffer_size))
        if len(buffer) == 0:
            break
        # Shuffle within this chunk only.
        np.random.shuffle(buffer)
        for item in buffer:
            yield item


shuffled_generator = shuffle(my_generator, 256)

This will shuffle data in chunks of buffer_size, so you can avoid memory issues if that is your limiting factor. Of course, this is not a truly random shuffle, so it shouldn't be used on something that's sorted, but if you just need to add some randomness to your data this may be a good solution.
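
As a quick illustration of the chunking (using the shuffle() generator above; note that a plain iterable such as range must be wrapped in iter() so that successive islice() calls keep consuming the same iterator):

# 0-3 are shuffled among themselves, then 4-7, then 8-9;
# the exact order varies from run to run.
for item in shuffle(iter(range(10)), 4):
    print(item)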

You could sample from arbitrary yielded results, generating a set that is not fully random but somewhat shuffled within a range. This is similar to @sturgemeister's code above, but not chunked: there are no fixed randomness boundaries.

For example:

import random

def scramble(gen, buffer_size):
    buf = []
    i = iter(gen)
    while True:
        try:
            e = next(i)
            buf.append(e)
            if len(buf) >= buffer_size:
                # Swap a random buffered element into the last slot
                # and yield it, keeping the buffer at a fixed size.
                choice = random.randint(0, len(buf) - 1)
                buf[-1], buf[choice] = buf[choice], buf[-1]
                yield buf.pop()
        except StopIteration:
            # Source exhausted: shuffle and drain what is left.
            random.shuffle(buf)
            yield from buf
            return

The results should be fully random within the buffer_size window:

import itertools

for e in scramble(itertools.count(start=0, step=1), 1000):
    print(e)

Any window of 1000 elements in this stream looks random, but the overall trend (beyond 1000 elements) is clearly increasing.
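
One way to see both properties at once, assuming the scramble() generator above (the bound follows directly from the fixed buffer size):

import itertools

out = list(itertools.islice(scramble(itertools.count(), 1000), 2000))
# The k-th output is drawn from the first k + 1000 inputs only, so values
# can lag behind their position but never run far ahead of it.
assert all(out[k] < k + 1000 for k in range(2000))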

To test, check that this yields all 1000 unique elements:

for e in scramble(range(1000), 100):
    print(e)
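
A quick sketch of that check, using the scramble() generator defined above:

out = list(scramble(range(1000), 100))
# Every input element comes out exactly once, just reordered.
assert sorted(out) == list(range(1000))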

I needed a solution to this problem so that I could get expensive-to-compute elements in a shuffled order without wasting computation on values I would never consume. This is what I came up with for your example. It involves making another function that indexes into the first array.

You will need numpy installed

pip install numpy

The Code:

import numpy as np
x = [1, 2, 3, 4, 5, 6, 7, 8, 9]

def shuffle_generator(lst):
    return (lst[idx] for idx in np.random.permutation(len(lst)))

def yielding(ls):
    for i in ls:
        yield i

# for i in random.shuffle(yielding(x)):
#    print i

for i in yielding(shuffle_generator(x)):
    print(i)

For very large sequences, if you know the sequence size in advance:

from random import random

class subset_iterator:
    """
    an iterator class that returns K random samples from another sequence
    that has no random-access. Requires: the sequence length as input

    similar to random.sample

    :param it: iterator to the sequence
    :param seqlen: length of the sequence that :param it: iterates over
    :param K: output sequence size (number of samples in the subset)
    """

    def __init__(self, it, seqlen, K):
        self.it = it
        self.N = seqlen
        self.K = K

    def __iter__(self):
        return self

    def __next__(self):
        while True:
            # Selection sampling: keep the next item with probability K/N.
            r = random()
            nextitem = next(self.it)  # StopIteration here ends the iteration
            if r <= float(self.K) / self.N:
                self.K -= 1
                self.N -= 1
                return nextitem
            else:
                self.N -= 1

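A minimal usage sketch, assuming the sequence length is known up front (the generator here is purely illustrative). Note that the K samples come out in stream order, so shuffle the small result afterwards if you also need a random ordering:

# Draw 10 random samples in a single sequential pass
# over a stream of known length 1000.
gen = (i * i for i in range(1000))
samples = list(subset_iterator(gen, seqlen=1000, K=10))
print(samples)
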
A generator follows a sequential access pattern. Shuffled data requires the exact opposite, a random access pattern.

In many applications, we can get away with local perturbations only, which relaxes the problem quite a lot.

Here is an example of an in-memory shuffle buffer.

from random import randint

domain = (0, 1000)
stream = iter(range(*domain))

# Fill the buffer with the first 50 elements of the stream.
buffer = [next(stream) for _ in range(50)]

# Swap each incoming element into a random slot and emit the one it displaces.
for element in stream:
    idx = randint(0, len(buffer) - 1)
    element, buffer[idx] = buffer[idx], element
    print(element)

# Flush the elements still held in the buffer.
for element in buffer:
    print(element)