There's more than one oddity ;-) in this code, but I think your primary problem is this:
def __cmp__(self, a):
return cmp(self.code, a.code)
Heap operations use the comparison method to order the heap, but for some reason you're telling it to order Node
s by the current length of their codes. You almost certainly want the heap to order them by their weights instead, right? That's how Huffman encoding works.
def __cmp__(self, a):
return cmp(self.weight, a.weight)
For the rest, it's difficult to follow because 4 of your 5 symbols are the same (four 0
and one 1
). How can you possibly tell whether it's working or not?
Inside the loop, this is strained:
lo, hi = sorted([heappop(tree), heappop(tree)])
Given the repair to __cmp__
, that's easier as:
lo = heappop(tree)
hi = heappop(tree)
Sorting is pointless - the currently smallest element is always popped. So pop twice, and lo <= hi
must be true.
I'd say more ;-), but at this point I'm confused about what you're trying to accomplish in the end. If you agree __cmp__
should be repaired, make that change and edit the question to give both some inputs and the exact output you're hoping to get.
More
About:
it gives the top nodes the longest codewords instead of the final leaves,
This isn't an "off by 1" thing, it's more of a "backwards" thing ;-) Huffman coding looks at nodes with the smallest weights first. The later a node is popped from the heap, the higher the weight, and the shorter its code should be. But you're making codes longer & longer as the process goes on. They should be getting shorter & shorter as the process goes on.
You can't do this while building the tree. In fact the codes aren't knowable until the tree-building process has finished.
So, rather than guess at intents, etc, I'll give some working code you can modify to taste. And I'll include a sample input and its output:
from heapq import heappush, heappop, heapify
class Node(object):
def __init__(self, weight, left, right):
self.weight = weight
self.left = left
self.right = right
self.code = None
def __cmp__(self, a):
return cmp(self.weight, a.weight)
class Symbol(object):
def __init__(self, name, weight):
self.name = name
self.weight = weight
self.code = None
def __cmp__(self, a):
return cmp(self.weight, a.weight)
def encode(symbfreq):
# return pair (sym2object, tree), where
# sym2object is a dict mapping a symbol name to its Symbol object,
# and tree is the root of the Huffman tree
sym2object = {sym: Symbol(sym, w) for sym, w in symbfreq}
tree = sym2object.values()
heapify(tree)
while len(tree) > 1:
lo = heappop(tree)
hi = heappop(tree)
heappush(tree, Node(lo.weight + hi.weight, lo, hi))
tree = tree[0]
def assigncode(node, code):
node.code = code
if isinstance(node, Node):
assigncode(node.left, code + "0")
assigncode(node.right, code + "1")
assigncode(tree, "")
return sym2object, tree
i = [('a', 1), ('b', 2), ('c', 3), ('d', 4), ('e', 5)]
s2o, t = encode(i)
for v in s2o.values():
print v.name, v.code
That prints:
a 010
c 00
b 011
e 11
d 10
So, as hoped, the symbols with the highest weights have the shortest codes.