information and also outputs the values of entropy, efficiency and frequency probabilities of characters present in the data stream.

Huffman algorithm is a popular encoding method used in electronics communication systems. It is widely used in all the mainstream compression formats that you might encounter—from GZIP, PKZIP (winzip, etc) and BZIP2, to image formats such as JPEG and PNG. Some programs use just the Huffman coding method, while others use it as one step in a multistep compression process.

Huffman coding & deciding algorithm is used in compressing data with variable-length codes. The shortest codes are assigned to the most frequent characters and the longest codes are assigned to infrequent characters.

Huffman coding is an entropy encoding algorithm used for lossless data compression. Entropy is a measure of the unpredictability of an information stream. Maximum entropy occurs when a stream of data has totally unpredictable bits. A perfectly consistent stream of bits (all zeroes or all ones) is totally predictable (has no entropy).

Huffman tree (Image courtesy: Wikipedia)

The Huffman coding method is somewhat similar to the Shannon–Fano method. The main difference between the two methods is that Shannon–Fano constructs its codes from top to bottom (and the bits of each codeword are constructed from left to right), while Huffman constructs a code tree from the bottom up and the bits of each codeword are constructed from right to left.

The model for Huffman tree is shown here in the figure. It is generated from the sentence “this is an example of a huffman tree” using Huffman algorithm. Here 36 is the root of the tree. Below the root node you can see the leaf nodes 16 and 20. Adding 16 and 20 gives 36. Adding 8 and 8 gives 16, while 4+4=8. On the left-hand side, ‘e’ is attached to 4. Similarly, ‘a’, ‘n’, ‘t’, etc have been attached to form the complete Huffman tree.

For details on Huffman tree formation, please refer the ‘Data Compression and Decompression’ software project published in EFY April 2005.

The simplest tree construction algorithm uses a priority queue or table where the node with the lowest probability or frequency is given the highest priority.

First, create a leaf node for each symbol or character and add it to the priority table. If there is more than one node in the table, remove two nodes of the highest priority (lowest frequency) from the table. Create a new node with these two nodes as sub-nodes and with probability equal to the sum of the two nodes’ probabilities. Continue in this way until you reach the last single node. The last node is the root, so the tree is now complete.

Steps to encode data using Huffman coding

Step 1. Compute the probability of each character in a set of data.
Step 2. Sort the set of data in ascending order.
Step 3. Create a new node where the left sub-node is the lowest frequency in the sorted list and the right sub-node is the second lowest in the sorted list.
Step 4. Remove these two elements from the sorted list as they are now part of one node and add the probabilities. The result is the probability for the new node.
Step 5. Perform insertion sort on the list.
Step 6. Repeat steps 3, 4 and 5 until you have only one node left.[/stextbox]

Now that there is one node remaining, simply draw the tree.

With the above tree, place a ‘0’ on each path going to the left and a ‘1’ on each path going to the right. Now assign the binary code to each of the symbols or characters by counting 0’s and 1’s starting from the root.

Since efficient priority-queue data structures require O(log n) time per insertion, and a tree with ‘n’ leaves has 2n−1 nodes, this algorithm operates in O(n log n) time, where ‘n’ is the number of symbols.

From the above, it is now clear that the encoding method should give rise to a uniquely decodable code so that the original message can be detected uniquely and perfectly without errors. The message generated with the highest probability will be generated more number of times than other messages. In such a case, if you use a variable-length code instead of a fixed-length code, you will be improving the efficiency by assigning fewer bits to the higher-probability messages than the lower-probability messages.