Huffman Coding — A Complete Overview
Huffman coding (also known as Huffman Encoding) is a greedy lossless (a lossless method may compress and decompress data without data loss) data compression algorithm. It was developed by David Huffman in the early 1950s while he was a Ph.D. student at MIT.
It is often beneficial to compress data that contains frequently occurring characters. The aim is to assign variable-length codes to input characters, with the lengths of the issued codes determined by the frequency of the related characters. The most often occurring character is assigned the smallest code, whereas the least frequently occurring character is assigned the largest code.
Huffman Coding eliminates the ambiguity in the decoding process by employing the idea of prefix code, which states that a code associated with a character should not appear in the prefix of any other code.
How Huffman Coding works?
Assume the string below is to be transmitted across a network -
A B C A D B C A A E D A B C A
Each character takes up 8 bits. The above string has a total of 15 characters. As a result, sending this string requires a total of 8 * 15 = 120 bits.
However, we can compress the string using the Huffman Coding approach.
Huffman Coding is divided into two components -
1. Create a Huffman Tree from the provided characters.
2. Navigate the Huffman Tree, assigning codes to characters.
Huffman Coding Steps -
BASIC IDEA: The input is an array of unique characters with their frequency of occurrences, and the output is a Huffman Tree.
→ Create a priority queue Q consisting of each unique character.
→ Sort then in ascending order of their frequencies.
→ For all the unique characters:-
create a newNode
extract minimum value from Q and assign it to the left child of newNode
extract minimum value from Q and assign it to the right child of newNode
calculate the sum of these two minimum values and assign it to the value of newNode
→ insert this newNode into the tree
→ return rootNode
Huffman Coding Complexity -
Encoding each unique character depending on its frequency has a time complexity O(nlog n). The extraction of the minimum frequency from the priority queue takes 2*(n-1) times and has a complexity of O (log n). As a result, the overall complexity is O(nlog n).
Applications of Huffman Coding -
- They are used for transmitting fax and text.
- They are used by conventional compression formats like PKZIP, GZIP, etc.