I'm generating Huffman codes based off the following input distributions:
a = [(1,0.5),(0,0.25),(0,0.125),(0,0.125)]
b = [(0,0.5),(1,0.25),(0,0.125),(0,0.125)]
Where the only difference is the 1 is in a different bin.
However when I encode these using the following function:
def encode(symbfreq):
tree = [[wt, [sym, ""]] for sym, wt in symbfreq]
heapq.heapify(tree)
while len(tree)>1:
lo, hi = heapq.heappop(tree), heapq.heappop(tree)
for pair in lo[1:]:
pair[1] = '0' + pair[1]
for pair in hi[1:]:
pair[1] = '1' + pair[1]
heapq.heappush(tree, [lo[0] + hi[0]] + lo[1:] + hi[1:])
return sorted(heapq.heappop(tree)[1:], key=lambda p: (len(p[-1]), p))
I get different codewords for the distribution:
a = [[1, '1'], [0, '00'], [0, '010'], [0, '011']]
whilst
b = [[0, '0'], [1, '11'], [0, '100'], [0, '101']]
Why am I getting this difference?
For reference: I'll need to split the tree into left and right branches (based on the left branch starting with a 1, the right a 0) as an attempt to find the 1. In the first case my algorithm should take 1 iterations and the second 2. However, because the codewords returned aren't the same for each bin each time both versions are currently taking 2 iterations to find the 1 - which isn't what I want!