Question

As an exercise I'm trying to encode some symbols using Huffman trees, but using my own class instead of the built in data types with Python.

Here is my node class:

class Node(object):
    left = None
    right = None
    weight = None
    data = None
    code = ''
    length = len(code)

    def __init__(self, d, w, c):
        self.data = d
        self.weight = w
        self.code = c

    def set_children(self, ln, rn):
        self.left = ln
        self.right = rn

    def __repr__(self):
        return "[%s,%s,(%s),(%s)]" %(self.data,self.code,self.left,self.right)

    def __cmp__(self, a):
        return cmp(self.code, a.code)

    def __getitem__(self):
        return self.code

and here is the encoding function:

def encode(symbfreq):
    tree = [Node(sym,wt,'') for sym, wt in symbfreq]
    heapify(tree)
    while len(tree)>1:
        lo, hi = sorted([heappop(tree), heappop(tree)])
        lo.code = '0'+lo.code
        hi.code = '1'+hi.code
        n = Node(lo.data+hi.data,lo.weight+hi.weight,lo.code+hi.code)
        n.set_children(lo, hi)
        heappush(tree, n)
    return tree[0]

(Note, that the data field will eventually contain a set() of all the items in the children of a node. It just contains a sum for the moment whilst I get the encoding correct).

Here is the previous function I had for encoding the tree:

def encode(symbfreq):
    tree = [[wt, [sym, ""]] for sym, wt in symbfreq]
    heapq.heapify(tree)
    while len(tree)>1:
        lo, hi = sorted([heapq.heappop(tree), heapq.heappop(tree)], key=len)
        for pair in lo[1:]:
            pair[1] = '0' + pair[1]
        for pair in hi[1:]:
            pair[1] = '1' + pair[1]
        heapq.heappush(tree, [lo[0] + hi[0]] + lo[1:] + hi[1:])
    return sorted(heapq.heappop(tree)[1:], key=lambda p: (len(p[-1]), p))

However I've noticed that my new procedure is incorrect: it gives the top nodes the longest codewords instead of the final leaves, and doesn't produce the same tree for permutations of input symbols i.e. the following don't produce the same tree (when run with new encoding function):

input1 = [(1,0.25),(0,0.25),(0,0.25),(0,0.125),(0,0.125)]
input2 = [(0,0.25),(0,0.25),(0,0.25),(1,0.125),(0,0.125)]

I'm finding I'm really bad at avoiding this kind of off-by-one/ordering bugs - how might I go about sorting this out in the future?

Était-ce utile?

La solution

There's more than one oddity ;-) in this code, but I think your primary problem is this:

def __cmp__(self, a):
    return cmp(self.code, a.code)

Heap operations use the comparison method to order the heap, but for some reason you're telling it to order Nodes by the current length of their codes. You almost certainly want the heap to order them by their weights instead, right? That's how Huffman encoding works.

def __cmp__(self, a):
    return cmp(self.weight, a.weight)

For the rest, it's difficult to follow because 4 of your 5 symbols are the same (four 0 and one 1). How can you possibly tell whether it's working or not?

Inside the loop, this is strained:

lo, hi = sorted([heappop(tree), heappop(tree)])

Given the repair to __cmp__, that's easier as:

lo = heappop(tree)
hi = heappop(tree)

Sorting is pointless - the currently smallest element is always popped. So pop twice, and lo <= hi must be true.

I'd say more ;-), but at this point I'm confused about what you're trying to accomplish in the end. If you agree __cmp__ should be repaired, make that change and edit the question to give both some inputs and the exact output you're hoping to get.

More

About:

it gives the top nodes the longest codewords instead of the final leaves,

This isn't an "off by 1" thing, it's more of a "backwards" thing ;-) Huffman coding looks at nodes with the smallest weights first. The later a node is popped from the heap, the higher the weight, and the shorter its code should be. But you're making codes longer & longer as the process goes on. They should be getting shorter & shorter as the process goes on.

You can't do this while building the tree. In fact the codes aren't knowable until the tree-building process has finished.

So, rather than guess at intents, etc, I'll give some working code you can modify to taste. And I'll include a sample input and its output:

from heapq import heappush, heappop, heapify

class Node(object):
    def __init__(self, weight, left, right):
        self.weight = weight
        self.left = left
        self.right = right
        self.code = None

    def __cmp__(self, a):
        return cmp(self.weight, a.weight)

class Symbol(object):
    def __init__(self, name, weight):
        self.name = name
        self.weight = weight
        self.code = None

    def __cmp__(self, a):
        return cmp(self.weight, a.weight)

def encode(symbfreq):
    # return pair (sym2object, tree), where
    # sym2object is a dict mapping a symbol name to its Symbol object,
    # and tree is the root of the Huffman tree
    sym2object = {sym: Symbol(sym, w) for sym, w in symbfreq}
    tree = sym2object.values()
    heapify(tree)
    while len(tree) > 1:
        lo = heappop(tree)
        hi = heappop(tree)
        heappush(tree, Node(lo.weight + hi.weight, lo, hi))
    tree = tree[0]

    def assigncode(node, code):
        node.code = code
        if isinstance(node, Node):
            assigncode(node.left, code + "0")
            assigncode(node.right, code + "1")
    assigncode(tree, "")

    return sym2object, tree

i = [('a', 1), ('b', 2), ('c', 3), ('d', 4), ('e', 5)]
s2o, t = encode(i)
for v in s2o.values():
    print v.name, v.code

That prints:

a 010
c 00
b 011
e 11
d 10

So, as hoped, the symbols with the highest weights have the shortest codes.

Autres conseils

I suspect the problem is in segment: -

lo.code = '0'+lo.code
hi.code = '1'+hi.code

both hi & low can be intermediate nodes where as you need to update the codes at the leaves where the actual symbols are. I think you should not maintain any codes at construction of the huffman tree but get the codes of individual symbols after the construction of huffman tree by just traversing.

Heres my pseudo code for encode: -

 tree = ConstructHuffman(); // Same as your current but without code field

 def encode(tree,symbol): 
     if tree.data == symbol : 
         return None
     else if tree.left.data.contains(symbol) :
         return '0' + encode(tree.left,symbol)
     else : 
        return '1' + encode(tree.right,symbol)

Calculate codes for all symbol using above encode method and then u can use them for encoding.

Note: Change comparision function from comparing codes to comparing weights.

Licencié sous: CC-BY-SA avec attribution
Non affilié à StackOverflow
scroll top