Question

I have an 1D array A that represents categorical data (where each entry is the number of element of a certain category):

A = array([ 1, 8, 2, 5, 10, 32, 0, 0, 1, 0])

and I am trying to write a function sample(A, N) to generate an array B that contains N elements generated by randomly drawing elements from A (keeping the categories):

>>> sample(A, 20)
array([ 1, 3, 0, 1, 4, 11, 0, 0, 0, 0])

I wrote this:

def sample(A, N):
    AA = A.astype(float).copy()
    Z = zeros(A.shape)
    for _ in xrange(N):
        drawn = random.multinomial(1, AA/AA.sum())
        Z = Z + drawn
        AA = AA - drawn
    return Z.astype(int)

Probably it is quite naive, is there a better/faster way to do it? Maybe using some fast numpy function? Edit: It was not clear: it has to be without replacement!!!

Was it helpful?

Solution

faster than the other's as far as i can see. But probably uses more memory.

import random 
from collections import Counter

def sample2(A,N):
    distribution = [i for i, j in enumerate(A) for _ in xrange(j)]
    sample = Counter(random.sample(distribution, N))
    return [sample[i] for i in xrange(len(A))]


In [52]: A = np.random.randint(0, 100, 500)

In [53]: %timeit sample(A, 100) #Original
100 loops, best of 3: 2.71 ms per loop

In [54]: %timeit sample2(A, 100) #my function
1000 loops, best of 3: 914 µs per loop

In [55]: %timeit sample3(A, 100) #sftd function
100 loops, best of 3: 8.33 ms per loop

OTHER TIPS

This may not be the most elegant solution, but it's about 3x as fast. It uses numpy.random.choice, which has a Boolean replace option (in this case set to False - i.e. without replacement). The rest of the code is to:

  • Set up the array of choices, which contains A[n] counts of the index n, e.g. for A=[2,0,3,1] you'd get choices=[0,0,2,2,2,3]. Note that each of these will have equal probability so no need to create a probability array.
  • Convert the values chosen by the numpy function call into the output array required. Each element of the vals array will be an index picked from the choices array, so you need to add 1 to the appropriate element of B for each of these chosen indices.

I hope that makes sense! Here is the code:

def sample_2(A, N):
    # Create array of choices (indicies)
    choices = []
    for n in xrange(len(A)):
        for _ in xrange(A[n]):
            choices.append(n)
    # Randomly choose from these indicies
    vals = numpy.random.choice(choices, N, False)
    # Count up the chosen indicies
    B = numpy.zeros(len(A), dtype=int)
    for index in xrange(N):
        B[vals[index]] += 1
    return B

Speed test results for 10000 calls of each function:

Original: 3.0517 s
Method_2: 0.9968 s

Here's how I would do it:

def sample(A, N):
        population = np.zeros(sum(A))
        counter = 0
        for i, x in enumerate(A):
                for j in range(x):
                        population[counter] = i
                        counter += 1

        sampling = population[np.random.randint(0, len(population), N)]
        return np.histogram(sampling, bins = np.arange(len(A)+1))[0]

What we're doing is building a population defined by the histogram A and then randomly sampling from it. If the real-world case has N large and sum(A) small, and/or you need to sample A many times for a fixed A, this should be better. What you'd do is to build the population corresponding to A outside of the function call and define sample(population, N) as just the last two lines of the above.

Licensed under: CC-BY-SA with attribution
Not affiliated with StackOverflow
scroll top