Why does a label permutation produce a different Huffman code?

https://stackoverflow.com/questions/19709896

02-07-2022
|

質問

I'm generating Huffman codes based off the following input distributions:

a = [(1,0.5),(0,0.25),(0,0.125),(0,0.125)]
b = [(0,0.5),(1,0.25),(0,0.125),(0,0.125)]

Where the only difference is the 1 is in a different bin.

However when I encode these using the following function:

def encode(symbfreq):
    tree = [[wt, [sym, ""]] for sym, wt in symbfreq]
    heapq.heapify(tree)
    while len(tree)>1:
        lo, hi = heapq.heappop(tree), heapq.heappop(tree)
        for pair in lo[1:]:
            pair[1] = '0' + pair[1]
        for pair in hi[1:]:
            pair[1] = '1' + pair[1]
        heapq.heappush(tree, [lo[0] + hi[0]] + lo[1:] + hi[1:])
    return sorted(heapq.heappop(tree)[1:], key=lambda p: (len(p[-1]), p))

I get different codewords for the distribution:

a = [[1, '1'], [0, '00'], [0, '010'], [0, '011']]

whilst

b = [[0, '0'], [1, '11'], [0, '100'], [0, '101']]

Why am I getting this difference?

For reference: I'll need to split the tree into left and right branches (based on the left branch starting with a 1, the right a 0) as an attempt to find the 1. In the first case my algorithm should take 1 iterations and the second 2. However, because the codewords returned aren't the same for each bin each time both versions are currently taking 2 iterations to find the 1 - which isn't what I want!

解決

Despite the fact that they looks different, this results are both right and equivalent.

You can make them look the same by sorting lo and hi branches so you always add 1 to the bigger branch by replacing:

lo, hi = heapq.heappop(tree), heapq.heappop(tree)

with:

lo, hi = sorted([heapq.heappop(tree), heapq.heappop(tree)], key=len)

Result

>>> encode(a)
3: [[1, '0'], [0, '10'], [0, '110'], [0, '111']]
>>> encode(b)
4: [[0, '0'], [1, '10'], [0, '110'], [0, '111']]

ライセンス： CC-BY-SA と帰属

所属していません StackOverflow