Sampling keys due to their values

https://stackoverflow.com/questions/2305501

21-09-2019
|

문제

I have a dictionary in python with key->value as str->int. If I have to chose a key based on it's own value, then as the value gets larger the key has a lower possibility of being chosen.

For example, if key1=2 and key2->1, then the attitude of key1 should be 2:1.

How can I do this?

해결책

1. Construct a CDF-like list like this:

def build_cdf(distrib):
    cdf = []
    val = 0
    for key, freq in distrib.items():
        val += freq
        cdf.append((val, key))
    return (val, cdf)

This function returns a tuple, the 1st value is the sum of probabilities, and 2nd value is the CDF.

2. Construct the sampler like this:

import random
def sample_from_cdf(val_and_cdf):
    (val, cdf) = val_and_cdf;
    rand = random.uniform(0, val)
    # use bisect.bisect_left to reduce search time from O(n) to O(log n).
    return [key for index, key in cdf if index > rand][0]

Usage:

x = build_cdf({"a":0.2, "b":0.3, "c":0.5});
y = [sample_from_cdf(x) for i in range(0,100000)];
print (len([t for t in y if t == "a"]))   # 19864
print (len([t for t in y if t == "b"]))   # 29760
print (len([t for t in y if t == "c"]))   # 50376

You may want to make this into a class.

다른 팁

If the values are too large for gnibler's approach:

Build a list of tuples (key, index), where index is the sum of all values that come before key in the list (this would be the index of the first occurrence of key gnibler's list c. Also calculate the sum of all values (n).

Now, generate a random number xbetween 0 and n - 1. Find the last entry in the list with index < x. Since the list is sorted by index, you can use binary search to do that efficiently.

Update: KennyTM's code is an implementation of this, except that he uses a brute-force linear search instead of binary search; this will be inefficient if the number of keys are large.

It the values are not too large, you can do it this way

>>> from random import choice
>>> d={"key1":2,"key2":1}
>>> c=[]
>>> for k,v in d.items():
...  c+=[k]*v
... 
>>> choice(c)
'key1'
>>> sum(1 for x in range(100) if choice(c)=="key1")
63
>>> sum(1 for x in range(100) if choice(c)=="key2")
36

A quick and simple version of the algorithm from oefe's and KennyTM's answers:

def select_weighted(d):
   offset = random.randint(0, sum(d.itervalues())-1)
   for k, v in d.iteritems():
      if offset < v:
         return k
      offset -= v

라이센스 : CC-BY-SA ~와 함께 속성

제휴하지 않습니다 StackOverflow