Algorithm to Generate 'n' Binary Prefix Codes

https://stackoverflow.com/questions/7322703

27-10-2019

Question

A Prefix Code is a set of codes such that no code is a prefix of another code. For example, the following set is a prefix code:

10 11 000 001 0100 0101 0110 0111

With n = 8 members. I think these are usually created with some type of Huffman tree.

My question is: Could you help me create a function that will generate a binary prefix code with 'n' members?

Something like this:

list<int> GenerateBinaryPrefixCodes(int n);

Also, the requirement is that it be "optimal" in the sense that the total sum of bits is minimized.

I would prefer an answer in C/C++/C#/something similar. This isn't really homework, but I tagged it that way because it sounds like it would be a good hw problem.

Thanks!

Solution

Prefix Codes

As you pointed out, a Prefix Code is one where a given code is not a prefix for any other given code. This is a very general definition. A Huffman encoding is a restricted form of Prefix Code.
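
The definition can be checked mechanically. Here is a minimal C++ sketch (the name isPrefixFree is made up for this illustration, it is not from the question) that tests whether a set of bit strings is a prefix code. It relies on the fact that, after lexicographic sorting, a code that is a prefix of some other code is immediately followed by a code that starts with it.

```cpp
#include <algorithm>
#include <cstddef>
#include <string>
#include <vector>

// Returns true when no code in `codes` is a prefix of (or equal to) another.
// Hypothetical helper for illustration only.
bool isPrefixFree(std::vector<std::string> codes) {
    std::sort(codes.begin(), codes.end());
    for (std::size_t i = 0; i + 1 < codes.size(); ++i) {
        const std::string& a = codes[i];
        const std::string& b = codes[i + 1];
        // b starts with a exactly when the first a.size() characters match.
        if (b.compare(0, a.size(), a) == 0) return false;
    }
    return true;
}
```

Calling it on the eight codes from the question should return true, while a set such as 0, 01, 11 should be rejected because 0 is a prefix of 01.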

A common usage for Huffman coding is to minimize (optimize) the total bit count needed to encode a "message". A "message" is typically a sequence of symbols and it is encoded by mapping each symbol occurrence to a specific prefix code and writing out the prefix code in its place. Any set of prefix codes could be used to do this. However, among all such symbol-by-symbol prefix codes, a Huffman encoding will result in the shortest possible message based on bit count.

For example the ASCII character set could be considered as a mapping of symbols to a set of 8 bit prefix codes. This could even be considered a Huffman encoding provided that the encoded message contained exactly the same number of each possible symbol.

The interesting stuff starts when the message to be encoded contains symbol frequencies that are unequal. At this point one can reduce the total bit length of the message by using prefix codes of different lengths. Use short prefix codes for more frequent symbols and longer prefix codes for less frequent symbols.

From your example there are 8 symbols to encode. Symbols mapped to prefix codes '11' and '10' would be the most frequent symbols in the message. Likewise, symbols mapped to '0111', '0110', '0101' and '0100' would be the least frequent. The higher the frequency, the shorter the prefix code.

The "trick" in creating a Huffman coding is to build the set of Prefix Codes such that, after mapping each symbol in the message to its associated prefix code, the message contains as few bits as possible.
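
To make "as few bits as possible" concrete: the size of an encoded message is the sum, over all symbols, of the symbol's frequency times the length of its code. A small C++ sketch of that calculation follows (encodedBits and its parameters are names assumed here for illustration).

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <vector>

// Total bits in a message where symbol i occurs freqs[i] times and is
// written as codes[i]. Hypothetical helper for illustration only.
long long encodedBits(const std::vector<long long>& freqs,
                      const std::vector<std::string>& codes) {
    assert(freqs.size() == codes.size());
    long long total = 0;
    for (std::size_t i = 0; i < freqs.size(); ++i) {
        total += freqs[i] * static_cast<long long>(codes[i].size());
    }
    return total;
}
```

For three symbols with frequencies 5, 1 and 1, the fixed length codes 00, 01, 10 cost 14 bits, while the variable length codes 1, 00, 01 cost 9 bits.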

I find it useful to view prefix codes as a binary tree where each leaf node maps to a symbol. For example, the binary tree corresponding to the prefix codes given in your question (10, 11, 000, 001, 0100, 0101, 0110, 0111) would be:

           +-- (11)
        +--+
        |  +-- (10)
        |
        |        +-- (0111)
      --+     +--+
        |     |  +-- (0110)
        |  +--+
        |  |  |  +-- (0101)
        |  |  +--+
        +--+     +-- (0100)
           |
           |  +-- (001)
           +--+
              +-- (000)

To get the values in brackets you just assign a '1' when the top edge is followed or a '0' if the bottom edge is followed.

How to build such a tree?

Start with data structures to represent a binary tree and a list.

The binary tree will contain two types of node. 1) A leaf node representing a symbol and its frequency and 2) an internal node representing the cumulative frequency of all the nodes below it (it also needs two pointers, one for the left branch and one for the right branch).

The list contains an ordered set of nodes from the binary tree. Nodes in the list are ordered based on the frequency value of the node they point to. Lowest frequency nodes occur at the front of the list and increase toward the end of the list. A linked list of pointers to tree nodes might be a useful implementation - but any ordered list structure will do.

The algorithm below employs two lists: a "reference" and a "working" list. As nodes are processed from the "reference" list new nodes are created and inserted into the "working" list such that the "working" list remains ordered by node frequency.

Use these data structures and the following algorithm to create a Huffman encoding.

0. Initialize the "reference" list by creating a leaf node for each symbol
   then add it into this list such that nodes with the lowest frequency 
   occur at the front of the list and those with the highest frequency
   occur at the back (basically a priority queue).

1. Initialize the "working" list to empty.

2. Repeat until "reference" list contains 1 node

   2.1 Set MaxFrequency to the sum of the first 2 node frequencies

   2.2 Repeat until "reference" list is empty
       If ("reference" list contains 1 node) OR
          (sum of the next two nodes' frequencies > MaxFrequency)
            Move remaining nodes to the "working" list
            Set "reference" list to empty
       Else
          Create a new internal node
          Connect the first "reference" node to the left child
          Connect the second "reference" node to the right child
          Set the new node frequency to the sum of the frequencies of the children
          Insert the new node into the "working" list
          Remove the first and second nodes from the "reference" list

   2.3 Copy the "working" list to the "reference" list
   2.4 Set the "working" list to empty

At the end of this process the single "reference" list item will be the root of a Huffman tree. You can enumerate prefix codes by doing a depth first traversal of the tree. Write out a '0' for every left branch taken and a '1' for every right branch. The code is complete when a leaf is encountered. The symbol at the leaf is encoded by the Huffman code just generated.
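
For comparison, here is a compact C++ sketch of the textbook Huffman construction. Note that it is not a transcription of the two-list procedure above: it swaps in a single std::priority_queue (a min-heap) that always merges the two lowest-frequency nodes. The names Node, assignCodes and huffmanCodes are assumptions made for this sketch, and giving a lone symbol the code "0" is an arbitrary convention chosen here.

```cpp
#include <cstddef>
#include <functional>
#include <queue>
#include <string>
#include <utility>
#include <vector>

struct Node {
    long long freq;
    int left;   // index of the left child, or -1 for a leaf
    int right;  // index of the right child, or -1 for a leaf
};

// Depth-first walk: append '0' for a left branch and '1' for a right branch.
// Leaf i (i < number of symbols) holds symbol i.
void assignCodes(const std::vector<Node>& nodes, int i,
                 const std::string& prefix, std::vector<std::string>& codes) {
    if (nodes[i].left < 0) { codes[i] = prefix; return; }
    assignCodes(nodes, nodes[i].left, prefix + "0", codes);
    assignCodes(nodes, nodes[i].right, prefix + "1", codes);
}

// Returns one code per symbol; freqs[i] is the frequency of symbol i.
std::vector<std::string> huffmanCodes(const std::vector<long long>& freqs) {
    std::vector<std::string> codes(freqs.size());
    if (freqs.empty()) return codes;
    if (freqs.size() == 1) { codes[0] = "0"; return codes; }  // convention

    std::vector<Node> nodes;
    using Item = std::pair<long long, int>;  // (frequency, node index)
    std::priority_queue<Item, std::vector<Item>, std::greater<Item>> pq;
    for (std::size_t s = 0; s < freqs.size(); ++s) {
        nodes.push_back({freqs[s], -1, -1});
        pq.push({freqs[s], static_cast<int>(s)});
    }
    while (pq.size() > 1) {
        Item a = pq.top(); pq.pop();  // lowest frequency
        Item b = pq.top(); pq.pop();  // second lowest
        nodes.push_back({a.first + b.first, a.second, b.second});
        pq.push({a.first + b.first, static_cast<int>(nodes.size()) - 1});
    }
    assignCodes(nodes, pq.top().second, "", codes);
    return codes;
}
```

With frequencies 5, 1, 1 the frequent symbol ends up with a 1-bit code and the other two with 2-bit codes. With eight equal frequencies every code is 3 bits long.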

What is an optimum encoding

An interesting calculation one can perform is to calculate the "bit weight" of a prefix encoding. The bit weight is the total number of bits needed to represent the set of prefix codes.

Look at your original tree above. The weight of this tree is (2 bits * 2) + (3 bits * 2) + (4 bits * 4) = 26 bits. You used 26 bits to represent 8 prefix values. What is the minimal number of bits you could have used? Think about it: as a tree becomes unbalanced the length of the path to some leaves gets longer, and this adds to the weight. For example the worst case for a 4 value prefix tree would be:

                 +-- (1 bit)
               --+                  
                 |  +-- (2 bits)
                 +--+
                    |  +-- (3 bits)
                    +--+
                       +-- (3 bits)

giving a total weight of (1 bit * 1) + (2 bits * 1) + (3 bits * 2) = 9 bits

Balance the tree:

                +-- (2 bits)
             +--+
             |  +-- (2 bits)
           --+  
             |  +-- (2 bits)
             +--+
                +-- (2 bits)

giving a total weight of (2 bits * 4) = 8 bits. Notice that for balanced trees all prefix codes end up having the same number of bits.

Tree bit weight is just the sum of the path lengths to all leaves. You minimize the bit weight by minimizing the total path length - and this is done by balancing the tree.
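
Since each code's length equals the path length to its leaf, the bit weight can be computed straight from the codes. A minimal C++ sketch (bitWeight is a name assumed for this illustration):

```cpp
#include <cstddef>
#include <string>
#include <vector>

// Sum of the lengths of all codes, i.e. the sum of the path lengths
// to all leaves of the corresponding tree. Hypothetical helper.
std::size_t bitWeight(const std::vector<std::string>& codes) {
    std::size_t total = 0;
    for (const std::string& code : codes) total += code.size();
    return total;
}
```

Applied to the two 4-leaf trees above it gives 9 for the unbalanced tree (code lengths 1, 2, 3, 3) and 8 for the balanced one.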

As you can see, there isn't much value in minimizing any given prefix tree, you just end up with a fixed length symbol encoding. The value comes when you consider the bit weight of the resulting encoded message. Minimizing that leads to Huffman encoding.

How many different encodings are there?

Prefix codes may be generated by traversing a binary tree and emitting a '0' for each lower branch followed and a '1' for each upper branch followed until a leaf is encountered. As in:

             +-- (1)
             |  
           --+  
             |  +-- (01)
             +--+
                +-- (00)

Alternatively we could "flip" that rule and assign a '1' for each lower branch and a '0' for the upper branches:

             +-- (0)
             |  
           --+  
             |  +-- (10)
             +--+
                +-- (11)

These generate two different sets of prefix codes. Additional sets can be generated by going through all the possible 1/0 assignments to branches and then traversing the tree. This will give you 2^n sets. But if you do this, you will find the same prefix codes may be generated but in different order. For example, the previous tree would yield the following sets: {(0, 10, 11), (0, 11, 10), (1, 01, 00), (1, 00, 01)}. Then flip the tree to:

                +-- (??)
             +--+
             |  +-- (??)
           --+
             |
             +-- (?)

and you get: {(11, 10, 0), (10, 11, 0), (01, 00, 1), (00, 01, 1)}. Put them both together for 2^3 = 8 sets. However if you want unique sets disregarding order there are only 2 sets: {(0, 10, 11), (1, 00, 01)}. Go through the same exercise for a balanced tree and there is only ever 1 set. All this leads me to believe that the number of unique encodings is related to the balance structure of the tree used to generate prefix codes. Unfortunately, I don't have an exact formula or calculation worked out. On a hunch I would guess the number would be 2^(number of distinct code lengths - 1). For a balanced tree that is: 2^(1 - 1) = 1; for a tree with two distinct code lengths (as in the example above): 2^(2 - 1) = 2; and for your example: 2^(3 - 1) = 4.
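
The counting can be done by brute force for small trees. The C++ sketch below (Shape, leaf, join and codeSets are all names invented for this illustration) tries both 0/1 assignments at every internal node and collects the resulting code sets, ignoring order. It only counts; it does not prove the hunch above.

```cpp
#include <memory>
#include <set>
#include <string>

struct Shape {
    std::shared_ptr<Shape> upper, lower;  // both null for a leaf
};
using ShapePtr = std::shared_ptr<Shape>;
using CodeSet = std::set<std::string>;

ShapePtr leaf() { return std::make_shared<Shape>(); }

ShapePtr join(ShapePtr upper, ShapePtr lower) {
    ShapePtr s = std::make_shared<Shape>();
    s->upper = upper;
    s->lower = lower;
    return s;
}

// Every distinct (order-insensitive) code set obtainable from tree `t`
// by choosing, at each internal node, which branch gets '0' and which '1'.
std::set<CodeSet> codeSets(const ShapePtr& t) {
    std::set<CodeSet> result;
    if (!t->upper) {  // a leaf contributes only the empty suffix
        result.insert(CodeSet{std::string()});
        return result;
    }
    const std::set<CodeSet> ups = codeSets(t->upper);
    const std::set<CodeSet> lows = codeSets(t->lower);
    for (const CodeSet& u : ups) {
        for (const CodeSet& l : lows) {
            for (char upperBit : {'0', '1'}) {
                const char lowerBit = (upperBit == '0') ? '1' : '0';
                CodeSet merged;
                for (const std::string& c : u) merged.insert(upperBit + c);
                for (const std::string& c : l) merged.insert(lowerBit + c);
                result.insert(merged);
            }
        }
    }
    return result;
}
```

For the 3-leaf tree, join(leaf(), join(leaf(), leaf())), this yields 2 sets. For a balanced 4-leaf tree it yields 1, and for the tree shape from the question it yields 4, which agrees with the guess for these three cases.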

OTHER TIPS

The requirement that the sum of the number of bits is minimized is equivalent to requiring the codes to be optimal Huffman codes for a string where each symbol occurs once. So simply create a string with n unique characters and produce a Huffman tree for it. The algorithm is outlined on Wikipedia.

Your example for n=8 doesn't seem to represent an optimal solution.

10 11 000 001 0100 0101 0110 0111 Total bits: 26

000 001 010 011 100 101 110 111 Total bits: 24

When there is a constant frequency and n is a power of 2, the optimal prefix encoding will be fixed length. Each prefix code will be of length log2(n) and be the binary representation of the alphabet from 0..n-1.

EDIT for the case where n is NOT a power of 2.

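// Generate a tree with n leaves. All leaves have equal weight, so a plain
// FIFO queue always has the two lightest nodes at its front.
function PCode(n) {
 const queue = [];
 for (let x = 1; x <= n; x++) {
  queue.push({"v": x});   // leaf nodes carry a value
 }
 for (let x = 0; x < n - 1; x++) {
  // merge the two front nodes into an internal node and queue it at the back
  const node = {"v": null, "l": queue.shift(), "r": queue.shift()};
  queue.push(node);
 }
 return queue.pop();      // the last remaining node is the root
}

// Print the code of every leaf: '0' for a left branch, '1' for a right branch
function Print(node, s) {
 if (node["v"] != null) {
  console.log(s);
 }
 if (node["l"] != null) Print(node["l"], s + "0");
 if (node["r"] != null) Print(node["r"], s + "1");
}

// test
Print(PCode(3), "");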
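
With equal frequencies the minimal total can also be computed without building any tree. Writing k = floor(log2(n)), an optimal code has 2^(k+1) - n codes of length k and the remaining codes of length k + 1. A C++ sketch of that arithmetic (minimalTotalBits is a name assumed here):

```cpp
#include <cassert>

// Minimal total number of bits over all binary prefix codes with n members.
// Hypothetical helper for illustration only.
long long minimalTotalBits(long long n) {
    assert(n >= 1);
    int k = 0;
    while ((2LL << k) <= n) ++k;              // now k == floor(log2(n))
    long long shortCodes = (2LL << k) - n;    // codes of length k
    long long longCodes = n - shortCodes;     // codes of length k + 1
    return shortCodes * k + longCodes * (k + 1);
}
```

For n = 8 this gives 24, matching the fixed length listing above, and for n = 3 it gives 5, the total of the three codes printed by the test call.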

Please take a look at this C++ tutorial site. It will provide helpful C++ structures for you. And I'm seeing other similar SO questions that may be of help at the "Related" section to the right.

I have done this before in C with a recursive algorithm, and yes, it would make a great homework problem.

The generation problem (uniqueness of decoding) can be guaranteed by building a binary tree of n leaf nodes, and enumerating the position of each such node in the tree (0 is left branch, 1 is right branch). And you are right, Huffman Trees have this property. Note that for Huffman Trees, each node is given a weight equal to the frequency of its representative character, and the tree is built with a recursive property that the left-right decision on node joins is based on the sum of the children to that point. This cumulative sum property is also why Fibonacci-distributed frequencies give the worst case for Huffman Trees: the most unbalanced tree, and therefore the longest codes.

Note, Huffman encoding is optimal for variable encoding of fixed alphabets. An example of a non-fixed alphabet is the decision to treat " the " as a single element in your set to be compressed (as opposed to two spaces and one each of the letters).

Your problem appears to not be substitution related. You just want prefix codes for n elements where the sum of the lengths of all prefix codes is minimized. This is the same as building a Huffman tree where every element frequency is 1 (because it guarantees the minimum encoding of the total encoded string, which for you is equal to the sum of the bits of every encoded element exactly once, i.e. minimizing the total bits). Note: this guarantees the minimum encoding, it does not guarantee the fastest implementation. You probably do not need to build a tree for each method call. Unfortunately, I don't know an implementation off the top of my head.

Let's encode a binary string x by the number whose binary representation is 1x. Otherwise, 0 and 00 would map to the same int.

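#include <vector>

// The integers n .. 2n-1, each read as 1x, give n prefix-free strings x:
// if 1a were a proper prefix of 1b, the integer for 1b would be at least
// twice the integer for 1a, which cannot happen inside [n, 2n).
std::vector<int> GenerateBinaryPrefixCodes(int n) {
    std::vector<int> list;
    for (int i = n; i != 2 * n; ++i) list.push_back(i);
    return list;
}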
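
To see the actual bit strings, strip the leading 1 from each integer. A small C++ sketch (bitsFromInt and prefixCodeStrings are names made up for this illustration; the loop bounds repeat the n .. 2n-1 range used above):

```cpp
#include <cassert>
#include <string>
#include <vector>

// Recovers the bit string x from the integer whose binary form is 1x.
std::string bitsFromInt(int value) {
    assert(value >= 1);
    std::string bits;
    while (value > 1) {
        // Peel off the lowest bit and put it at the front of the string.
        bits.insert(bits.begin(), static_cast<char>('0' + (value & 1)));
        value >>= 1;
    }
    return bits;  // the remaining 1 was the marker bit and is dropped
}

// The n codes as strings, using the integers n .. 2n-1.
std::vector<std::string> prefixCodeStrings(int n) {
    assert(n >= 1);
    std::vector<std::string> codes;
    for (int i = n; i != 2 * n; ++i) codes.push_back(bitsFromInt(i));
    return codes;
}
```

For n = 5 the integers 5 to 9 become 01, 10, 11, 000, 001, for a total of 12 bits.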

Licensed under: CC-BY-SA with attribution
