Huffman coding

Huffman coding is a form of entropy coding developed in 1952 by David A. Huffman. It assigns code words of variable length to a fixed number of source symbols.


Basics

To represent data with as little redundancy as possible, the source symbols must be encoded with code words of different lengths. The length of a code word should ideally correspond to the information content of its symbol. For a code to be uniquely decodable, it must satisfy the Kraft inequality and additionally be prefix-free, i.e. no code word may be the beginning of another.
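Concretely, for a code alphabet with $M$ different code symbols and code word lengths $l_x$, the Kraft inequality reads:

$$\sum_{x \in X} M^{-l_x} \le 1$$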

The basic idea is to use a k-ary rooted tree (a tree with k children per node) to represent the code. In this so-called Huffman tree, the leaves stand for the characters to be encoded, while the path from the root to a leaf determines the code word. In contrast to Shannon–Fano coding, the tree is created from the leaves to the root (bottom-up).

The tree obtained by Huffman coding is guaranteed to yield an optimal, prefix-free encoding. That is, no symbol-by-symbol coding exists that could generate a shorter code when the occurrence probabilities of the symbols are known.

Algorithm

Definitions

  • X is the source alphabet — the character set consisting of the source symbols
  • $p_x$ is the a priori probability of symbol x (its relative frequency)
  • C is the code alphabet — the character set consisting of the code symbols
  • M is the cardinality of the code alphabet C — the number of different code symbols

Structure of the tree
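In the binary case ($M = 2$), the tree is built bottom-up from the symbol probabilities:

  • Create a leaf node for every source symbol $x$, weighted with its probability $p_x$.
  • Remove the two nodes with the smallest weights and attach them as children of a new inner node whose weight is the sum of the two.
  • Repeat until only a single node, the root, remains.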

Construction of the codebook
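The code words are then read off the finished tree: the edges from each node to its children are labeled with distinct code symbols (0 and 1 in the binary case), and the code word of a source symbol is the sequence of edge labels on the path from the root to its leaf.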

Coding
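Encoding then simply replaces every source symbol by its code word. The following minimal Python sketch implements the three steps above for the binary case; the names build_tree, make_codebook and encode are illustrative, not from any standard library:

    import heapq
    import itertools

    def build_tree(probabilities):
        # Build a binary Huffman tree bottom-up.  A leaf is the 1-tuple
        # (symbol,); an inner node is the 2-tuple (left, right).
        tie = itertools.count()  # tie-breaker so heapq never compares node tuples
        heap = [(p, next(tie), (sym,)) for sym, p in probabilities.items()]
        heapq.heapify(heap)
        while len(heap) > 1:
            p1, _, n1 = heapq.heappop(heap)   # the two least probable nodes ...
            p2, _, n2 = heapq.heappop(heap)
            heapq.heappush(heap, (p1 + p2, next(tie), (n1, n2)))  # ... are merged
        return heap[0][2]

    def make_codebook(node, prefix=""):
        # The path from the root (0 = left, 1 = right) is the code word
        # of the leaf in which it ends.
        if len(node) == 1:                     # leaf
            return {node[0]: prefix or "0"}    # guard for a one-symbol alphabet
        left, right = node
        codebook = make_codebook(left, prefix + "0")
        codebook.update(make_codebook(right, prefix + "1"))
        return codebook

    def encode(text, codebook):
        return "".join(codebook[ch] for ch in text)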

Mean word length

The average length of a code word can be computed in three ways.

  • As the weighted sum of the code word lengths: $\bar{L} = \sum_{x \in X} p_x \cdot l_x$, where $l_x$ is the length of the code word for $x$.
  • By summing the occurrence probabilities at all intermediate nodes of the Huffman tree.
  • If all elements to be encoded occur with the same frequency, the average length is $\bar{L} = \frac{1}{|X|} \sum_{x \in X} l_x$.
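Expressed as code, the first formula is a one-line helper (mean_length is an illustrative name, building on the codebook sketch above):

    def mean_length(probabilities, codebook):
        # weighted sum of the code word lengths: sum over x of p_x * l_x
        return sum(p * len(codebook[sym]) for sym, p in probabilities.items())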

Example

The source alphabet contains the characters {a, b, c, d}. As code alphabet we choose the binary code C = {0, 1}. The text "a ab abc abcd" is to be compressed (without the spaces).

Determine the relative frequencies: $p_a = 0.4$, $p_b = 0.3$, $p_c = 0.2$, $p_d = 0.1$.

Construct the Huffman tree, and then write the code words at the edges.

Code dictionary: a → 1, b → 01, c → 001, d → 000.

Encode the original text: 1 1 01 1 01 001 1 01 001 000 → 1101101001101001000.

Average codeword length: $\bar{L} = 0.4 \cdot 1 + 0.3 \cdot 2 + 0.2 \cdot 3 + 0.1 \cdot 3 = 1.9$ bit.

  • In a naive encoding, each character would be coded with $\lceil \log_2 4 \rceil = 2$ bit.
  • This Huffman coding encodes each character with 1.9 bit on average.
  • The entropy is $H \approx 1.85$ bit.

Because the information content per source symbol is not an integer, the coding retains a residual redundancy.
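The numbers above can be reproduced with a few lines of standalone Python ("aababcabcd" is the example text without spaces):

    from collections import Counter
    from math import log2

    text = "aababcabcd"
    codebook = {"a": "1", "b": "01", "c": "001", "d": "000"}

    probs = {s: n / len(text) for s, n in Counter(text).items()}  # a:0.4 b:0.3 c:0.2 d:0.1
    encoded = "".join(codebook[ch] for ch in text)

    print(encoded)                                               # 1101101001101001000
    print(sum(p * len(codebook[s]) for s, p in probs.items()))   # 1.9 bit per symbol
    print(-sum(p * log2(p) for p in probs.values()))             # ~1.85 bit per symbol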

Decoding

To decode a Huffman-encoded data stream, the code table created in the encoder is required (in the traditional method). Basically, the procedure is the reverse of the encoding step. The Huffman tree is reconstructed in the decoder, and with each incoming bit the corresponding path in the tree is followed, starting from the root, until a leaf is reached. This leaf is the decoded source symbol, and decoding of the next symbol starts again at the root.

Example

The decoder has the code dictionary: a → 1, b → 01, c → 001, d → 000.

It receives the message: 1101101001101001000.

Now, for each received bit, the path in the tree (see above) is followed from the root until a leaf is reached. Once a leaf is reached, the decoder records the symbol of that leaf and starts again from the root for the next symbol. Decoding yields the original text aababcabcd.
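Since the code is prefix-free, the tree walk can equivalently be mimicked by collecting bits until they match an entry of the dictionary. A minimal sketch (decode is an illustrative name):

    def decode(bits, codebook):
        inverse = {code: sym for sym, code in codebook.items()}
        symbols, current = [], ""
        for bit in bits:               # one step along the path per bit
            current += bit
            if current in inverse:     # a leaf has been reached
                symbols.append(inverse[current])
                current = ""           # start again at the root
        return "".join(symbols)

    codebook = {"a": "1", "b": "01", "c": "001", "d": "000"}
    print(decode("1101101001101001000", codebook))  # aababcabcd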

Optimality

For the average code word length $\bar{L}$ of a Huffman code, the following holds: $H(X) \le \bar{L} < H(X) + 1$.

This means that, on average, each code symbol requires at least as many digits as its information content, but at most one more.

Huffman coding is optimal with respect to the entropy ($\bar{L} = H(X)$) if and only if all occurrence probabilities are powers of two ($p_x = 2^{-k_x}$, for a binary code alphabet).

If $n$ source symbols are combined into one hypersymbol $y$, then for the average code word length per source symbol $\bar{l}_y$: $H(X) \le \bar{l}_y < H(X) + \frac{1}{n}$.

That is, as the number of jointly coded source symbols grows, the average code word length converges asymptotically to the entropy; the coding is asymptotically optimal.
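This follows from applying the bound above to the hypersymbol alphabet $Y$, assuming a memoryless source (so that the entropy of a block of $n$ symbols is $n \cdot H(X)$):

$$n \cdot H(X) = H(Y) \le \bar{L}_Y < H(Y) + 1 = n \cdot H(X) + 1 \quad\Rightarrow\quad H(X) \le \frac{\bar{L}_Y}{n} < H(X) + \frac{1}{n}$$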

Adaptive Huffman coding

Adaptive Huffman coding continuously updates the tree. The initial tree is generated by assuming some given probability distribution for all source symbols (in complete ignorance of the source, a uniform distribution). With each new source symbol the tree is updated, so the code words may change. The decoder can reproduce this update step, so transmission of the codebook is not necessary.

With this method, a data stream can be encoded on-the-fly. However, it is considerably more susceptible to transmission errors, since a single error leads to completely wrong decoding from the point of the error onward.
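A deliberately naive sketch of the idea, which rebuilds the tree from running symbol counts after every symbol (production implementations such as the FGK and Vitter algorithms update the tree incrementally instead); build_tree and make_codebook are the sketches from the Algorithm section:

    def adaptive_encode(text, alphabet):
        counts = {sym: 1 for sym in alphabet}  # uniform start: complete ignorance
        out = []
        for ch in text:
            codebook = make_codebook(build_tree(counts))
            out.append(codebook[ch])           # encode with the current tree
            counts[ch] += 1                    # update step, mirrored by the decoder
        return "".join(out)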
