Sampling keys due to their values
-
21-09-2019 - |
문제
I have a dictionary in python with key->value as str->int
. If I have to chose a key based on it's own value, then as the value gets larger the key has a lower possibility of being chosen.
For example, if key1=2
and key2->1
, then the attitude of key1
should be 2:1
.
How can I do this?
해결책
1. Construct a CDF-like list like this:
def build_cdf(distrib):
cdf = []
val = 0
for key, freq in distrib.items():
val += freq
cdf.append((val, key))
return (val, cdf)
This function returns a tuple, the 1st value is the sum of probabilities, and 2nd value is the CDF.
2. Construct the sampler like this:
import random
def sample_from_cdf(val_and_cdf):
(val, cdf) = val_and_cdf;
rand = random.uniform(0, val)
# use bisect.bisect_left to reduce search time from O(n) to O(log n).
return [key for index, key in cdf if index > rand][0]
Usage:
x = build_cdf({"a":0.2, "b":0.3, "c":0.5});
y = [sample_from_cdf(x) for i in range(0,100000)];
print (len([t for t in y if t == "a"])) # 19864
print (len([t for t in y if t == "b"])) # 29760
print (len([t for t in y if t == "c"])) # 50376
You may want to make this into a class.
다른 팁
If the values are too large for gnibler's approach:
Build a list of tuples (key, index)
, where index
is the sum of all values that come before key in the list (this would be the index of the first occurrence of key
gnibler's list c
. Also calculate the sum of all values (n
).
Now, generate a random number x
between 0 and n - 1
. Find the last entry in the list with index < x
. Since the list is sorted by index, you can use binary search to do that efficiently.
Update: KennyTM's code is an implementation of this, except that he uses a brute-force linear search instead of binary search; this will be inefficient if the number of keys are large.
It the values are not too large, you can do it this way
>>> from random import choice
>>> d={"key1":2,"key2":1}
>>> c=[]
>>> for k,v in d.items():
... c+=[k]*v
...
>>> choice(c)
'key1'
>>> sum(1 for x in range(100) if choice(c)=="key1")
63
>>> sum(1 for x in range(100) if choice(c)=="key2")
36
A quick and simple version of the algorithm from oefe's and KennyTM's answers:
def select_weighted(d):
offset = random.randint(0, sum(d.itervalues())-1)
for k, v in d.iteritems():
if offset < v:
return k
offset -= v