Question

I am encoding a set of items using Huffman codes, and, along with the final codes, I'd like to return the intermediate nodes encoded as well, but with the the data of the child nodes concatenated into the the data of the intermediate node.

For example if I were to encode this set of symbols and probabilities:

tree = [('a',0.25),('b',0,25),('c',0.25),('d',0.125),('e',0.125)]

I'd like the following to be returned:

tree = [['ab','0'],['cde','1'],['a','00'],['b','01'],['c','10'],['de','11'],['d','110'],['e','111']]

I'm using the following code to produce Huffman trees:

import heapq

#symbfreq = list of symbols and associated frequencies
def encode(symbfreq):
    #create a nested list as a tree
    tree = [[wt, [sym, ""]] for sym, wt in symbfreq]
    #turn the tree into a heap
    heapq.heapify(tree)
    while len(tree)>1:
        #pop the lowest two nodes off the heap, sorted on the length
        lo, hi = sorted([heapq.heappop(tree), heapq.heappop(tree)], key=len)
        #add the next bit of the codeword
        for pair in lo[1:]:
            pair[1] = '0' + pair[1]
        for pair in hi[1:]:
            pair[1] = '1' + pair[1]
        #push a new node, which is the sum of the two lowest probability nodes onto the heap 
        heapq.heappush(tree, [lo[0]+hi[0]] + lo[1:] + hi[1:])
    return sorted(heapq.heappop(tree)[1:], key=lambda p: (len(p[-1]), p))

The Huffman algorithm is:

1.Create a leaf node for each symbol and add it to the priority queue.
2.While there is more than one node in the queue:
    3.Remove the node of highest priority (lowest probability) twice to get two nodes.
    4.Create a new internal node with these two nodes as children and with probability equal to the sum of the two nodes' probabilities.
    5.Add the new node to the queue.
6.The remaining node is the root node and the tree is complete.

I can't for the life of me think of a way to stop the intermediate nodes being overwritten (i.e. I want to persist the intermediate nodes created at stage 4).

Était-ce utile?

La solution

I don't know how do it while constructing (a part from building an output tree which is never popped), but you can retrieve the intermediary nodes quite easily :

huffman_tree = encode(tree)
complete_tree = huffman_tree

get_intermediate_node = lambda val, arr :   ''.join( [ char for char,binary in itertools.ifilter( lambda node : node[1].startswith( val ),arr)] ) 

for val in range( next_power_of_two( len(huffman_tree) ) ):
    bvalue = bin(val)[2:] 
    node = [ get_intermediate_node( bvalue , huffman_tree) , bvalue ] 
    if node not in complete_tree:
        complete_tree.append( node)

print sorted( complete_tree , key=lambda p: (len(p[-1]), p) )

>>> [['ab', '0'], ['cde', '1'], ['a', '00'], ['b', '01'], ['c', '10'],
    ['de', '11'], ['', '100'], ['', '101'], ['d', '110'], ['e', '111']]

(You still need to prune the empty nodes )

Licencié sous: CC-BY-SA avec attribution
Non affilié à StackOverflow
scroll top