Huffman encoding issue

Question 1

There's more than one oddity ;-) in this code, but I think your primary problem is this:

def __cmp__(self, a):
    return cmp(self.code, a.code)

Heap operations use the comparison method to order the heap, but for some reason you're telling it to order Nodes by the current length of their codes. You almost certainly want the heap to order them by their weights instead, right? That's how Huffman encoding works.

def __cmp__(self, a):
    return cmp(self.weight, a.weight)

For the rest, it's difficult to follow because 4 of your 5 symbols are the same (four 0 and one 1). How can you possibly tell whether it's working or not?

Inside the loop, this is strained:

lo, hi = sorted([heappop(tree), heappop(tree)])

Given the repair to __cmp__, that's easier as:

lo = heappop(tree)
hi = heappop(tree)

Sorting is pointless - the currently smallest element is always popped. So pop twice, and lo <= hi must be true.

I'd say more ;-), but at this point I'm confused about what you're trying to accomplish in the end. If you agree __cmp__ should be repaired, make that change and edit the question to give both some inputs and the exact output you're hoping to get.

More

About:

it gives the top nodes the longest codewords instead of the final leaves,

This isn't an "off by 1" thing, it's more of a "backwards" thing ;-) Huffman coding looks at nodes with the smallest weights first. The later a node is popped from the heap, the higher the weight, and the shorter its code should be. But you're making codes longer & longer as the process goes on. They should be getting shorter & shorter as the process goes on.

You can't do this while building the tree. In fact the codes aren't knowable until the tree-building process has finished.

So, rather than guess at intents, etc, I'll give some working code you can modify to taste. And I'll include a sample input and its output:

from heapq import heappush, heappop, heapify

class Node(object):
    def __init__(self, weight, left, right):
        self.weight = weight
        self.left = left
        self.right = right
        self.code = None

    def __cmp__(self, a):
        return cmp(self.weight, a.weight)

class Symbol(object):
    def __init__(self, name, weight):
        self.name = name
        self.weight = weight
        self.code = None

    def __cmp__(self, a):
        return cmp(self.weight, a.weight)

def encode(symbfreq):
    # return pair (sym2object, tree), where
    # sym2object is a dict mapping a symbol name to its Symbol object,
    # and tree is the root of the Huffman tree
    sym2object = {sym: Symbol(sym, w) for sym, w in symbfreq}
    tree = sym2object.values()
    heapify(tree)
    while len(tree) > 1:
        lo = heappop(tree)
        hi = heappop(tree)
        heappush(tree, Node(lo.weight + hi.weight, lo, hi))
    tree = tree[0]

    def assigncode(node, code):
        node.code = code
        if isinstance(node, Node):
            assigncode(node.left, code + "0")
            assigncode(node.right, code + "1")
    assigncode(tree, "")

    return sym2object, tree

i = [('a', 1), ('b', 2), ('c', 3), ('d', 4), ('e', 5)]
s2o, t = encode(i)
for v in s2o.values():
    print v.name, v.code

That prints:

So, as hoped, the symbols with the highest weights have the shortest codes.

Question 2

I suspect the problem is in segment: -

lo.code = '0'+lo.code
hi.code = '1'+hi.code

both hi & low can be intermediate nodes where as you need to update the codes at the leaves where the actual symbols are. I think you should not maintain any codes at construction of the huffman tree but get the codes of individual symbols after the construction of huffman tree by just traversing.

Heres my pseudo code for encode: -

 tree = ConstructHuffman(); // Same as your current but without code field

 def encode(tree,symbol): 
     if tree.data == symbol : 
         return None
     else if tree.left.data.contains(symbol) :
         return '0' + encode(tree.left,symbol)
     else : 
        return '1' + encode(tree.right,symbol)

Calculate codes for all symbol using above encode method and then u can use them for encoding.

Note: Change comparision function from comparing codes to comparing weights.