CS 240: Lecture 21 - Huffman Encoding

Dear students:

Imagine you speak a language whose alphabet consists of just these five characters:

With this alphabet, you say things like FOOO!, GOOO-FOO!, and FOGOFOGO-OOOOO! Your scientific and literary communities have used this alphabet to generate zettabytes of documents, and you have been hired to digitize all of these documents in as compact a form as possible. A more compact form means fewer hard drives are needed and that the documents will be delivered faster across networks. Can you do it? Of course. Can you do it better than your competition? Hmm...

ASCII

Your society's digital platforms all use binary storage devices, so the documents must be encoded using 0s and 1s. But how do you represent each symbol? You could use the ASCII standard invented in the United States of America in the 1960s. This standard assigned a number to the symbols commonly used in American publishing venues. Symbols included uppercase and lowercase letters, digits, punctuation, and certain control characters. Uppercase A was assigned 65, space was assigned 32, and underscore was assigned 95.

Since the ASCII standard acknowledges 128 symbols, legal ASCII values are in [0, 127]. As the world opened up to technology, 128 symbols was far too small to capture the symbols of all the world's languages. Eventually a more expansive standard called Unicode was introduced. The Unicode UTF-8 standard preserves the original ASCII standard, but it assigns numbers to approximately 143000 additional symbols, including emoji.

Fixed-length Encoding

The largest ASCII value is 127. This special value was originally used to mark out a mistyped character on punch cards. In binary, it has the form 1111111, so seven digits are needed to represent it. In a punch card, it is all holes. The smallest ASCII value is 0. It is the null character. You might know it as the null terminator used to mark the end of C strings. How many digits are needed to represent it? 1? 0? 7?

Imagine if each symbol used the least number of digits possible. Suppose you then encounted the sequence 1111111. Is that one ASCII 127? Or seven ASCII 1s? Or one ASCII 7 and one ASCII 15? It's impossible to tell.

To avoid this ambiguity and to make binary sequences have predictable offsets, the ASCII standard imposes that each symbol is represented with seven bits, even the ones that don't need all seven. Forcing all representations to use the same number bits is called fixed-length encoding.

You could used fixed-length encoding for your alphabet. Your alphabet has five symbols, so you'd need 3 bits. You might define this mapping:

000 -> F
001 -> G
010 -> O
011 -> -
100 -> !
101 -> unmapped
110 -> unmapped
111 -> unmapped

Three sequences aren't used. That's okay. If they ever appear in a document, then you know something went wrong.

A document with \(~n~\) symbols in it will be encoded in \(~3n~\) bits. Can you do better than this?

Variable-length Encoding

Many compression algorithms work by encoding the highest frequency terms with the fewest bits. Less frequent terms can use more bits. However, care must be taken that a sequence of bits is never ambiguous. No string of bits can ever be a prefix of some longer sequence. When you use a different number of bits across terms, you have a variable-length encoding.

An optimal variable-length encoding called Huffman coding is possible. The mapping is determined by first composing a tree where the high-frequency terms are at the top of the tree, and the low frequency-terms are at the bottom. The tree is built using this algorithm:

  1. Count up the frequencies of each term.
  2. Populate a min-heap keyed on the frequency-nodes.
  3. Extract two nodes from the heap.
  4. Make the first a left child and the second a right child of a new parent keyed on the sum of their frequencies.
  5. Add the combined node back into the heap.
  6. Repeat until all nodes have been formed into a tree.

The bit sequence for each symbol is a description of the path from the root to the symbol's leaf node. A left path is marked with bit 0 and a right path is marked with bit 1.

Exercises

For the remainder of our time together, you'll work through some exercises on Huffman coding with your neighbors.

TODO

You've got a few things to do before we meet again:

See you next time!

Sincerely,

P.S. It's time for a haiku:

The more you say it The less work it is to say So, here goes: SORRY