Shannon–Fano coding

Shannon–Fano coding is an entropy encoding: it assigns code words to characters according to their frequency of occurrence so that the average code word length is as small as possible. The entropy, i.e. the average information content per symbol, gives a lower bound on this length (source coding theorem).

The basic idea

From the theory of entropy it is known that characters must be encoded with differing numbers of bits if a memory-efficient representation is desired. However, it is not possible to simply assign arbitrary bit strings to the characters and output them one after the other.

For example, if the letter A is encoded with the bit sequence "10", the letter B with "01" and C with "0", then the string ABC becomes "10010". However, this sequence is also produced by the string "ACA". It is therefore no longer possible to tell which string the bit sequence "10010" stands for.

To avoid this ambiguity, the bit strings assigned to the characters must be prefix-free, i.e. no bit sequence that represents a single character may appear at the beginning of the bit sequence of a different character. This property is not satisfied in the example above, because the bit sequence "0", which encodes C, appears at the beginning of the sequence assigned to B.
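
This property is easy to test mechanically. A minimal Python sketch (the function name is illustrative, not part of any standard library):

    def is_prefix_free(codes):
        # True if no code word is a proper prefix of another code word.
        words = list(codes.values())
        return not any(a != b and b.startswith(a) for a in words for b in words)

    print(is_prefix_free({"A": "10", "B": "01", "C": "0"}))   # False: "0" begins "01"
    print(is_prefix_free({"A": "00", "B": "01", "C": "1"}))   # True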

The basic idea is to use a full binary tree to represent the code. In this tree, the leaves stand for the input characters, while the path from the root to a leaf determines the bit sequence.

The picture shows a tree of the desired form. The internal nodes are numbered so that they can be referred to.

In order to determine the bit sequence that stands for a character, it must be specified how the branches are coded. One possibility is to code left subtrees with a 0 and right subtrees with a 1. For the example tree this yields the following bit strings:

A: 00
B: 01
C: 100
D: 101
E: 11

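This table can also be derived mechanically by walking the tree. A minimal Python sketch, assuming the tree is represented as nested pairs (left subtree, right subtree) with single-character strings as leaves:

    def codes_from_tree(tree, prefix=""):
        # A leaf holds one character; the accumulated prefix is its code word.
        if isinstance(tree, str):
            return {tree: prefix}
        left, right = tree
        # Left branches are coded with 0, right branches with 1.
        return {**codes_from_tree(left, prefix + "0"),
                **codes_from_tree(right, prefix + "1")}

    tree = (("A", "B"), (("C", "D"), "E"))
    print(codes_from_tree(tree))
    # {'A': '00', 'B': '01', 'C': '100', 'D': '101', 'E': '11'}
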
Just as good, however, are many other encodings, since the assignment of bits to branches can be chosen freely. For example, coding left subtrees with 1 and right subtrees with 0 yields A: 11, B: 10, C: 011, D: 010, E: 00; the choice may even differ from node to node, as long as the two branches of each inner node receive different bits.

If you want to encode a string, the bit sequences assigned to its characters are output one after the other. Under the first code, the string "ACDC" thus becomes the bit string "00100101100".

To decode this bit sequence, it is processed bit by bit. The decoder must remember its current position in the tree. It starts at the root, i.e. node 1. Then the first bit is read, a 0, which in this encoding means going to the left child of the current node, i.e. to node 2. The next bit is again a 0, so the current node is now A. A is output and the current node is reset to node 1. Reading the next bit, a 1, makes node 3 the current node, and so on.
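
This tree walk translates directly into code. A minimal sketch, reusing the tuple representation of the tree from above:

    def decode(bits, tree):
        decoded = []
        node = tree                    # start at the root
        for bit in bits:
            # 0 selects the left subtree, 1 the right subtree.
            node = node[0] if bit == "0" else node[1]
            if isinstance(node, str):  # reached a leaf: emit its character
                decoded.append(node)
                node = tree            # and restart at the root
        return "".join(decoded)

    tree = (("A", "B"), (("C", "D"), "E"))
    print(decode("00100101100", tree))   # ACDC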

The problem to be solved is therefore to construct such a tree in which the average number of bits needed until a character is uniquely identified is as small as possible, i.e. as close to the entropy as possible.

Shannon–Fano coding and Huffman coding are two different algorithms for constructing these trees.

Shannon–Fano

The algorithm, named after Claude Shannon and Robert Fano, works according to the following rules:

  • Sort the characters by decreasing frequency of occurrence.
  • Divide the characters along this order into two groups such that the sums of the frequencies in the two groups are as equal as possible. The two groups correspond to the left and right subtree of the Shannon tree being built. The partition produced this way is not necessarily the best that can be reached with the given frequencies; the best partition is not even sought, since it does not necessarily lead to a better result. What matters is that the code lengths deviate as little as possible, on average, from the entropy of the characters.
  • If more than one character remains in one of the resulting groups, apply the algorithm recursively to that group (a sketch of the whole procedure follows this list).
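
These rules translate into a short recursive procedure. A minimal Python sketch, taking a list of (character, frequency) pairs already sorted by descending frequency and returning the code table directly instead of an explicit tree:

    def shannon_fano(symbols):
        # A group with a single character needs no further bits.
        if len(symbols) == 1:
            return {symbols[0][0]: ""}
        total = sum(freq for _, freq in symbols)
        # Find the split point where the two group sums are as equal as possible.
        best_split, best_diff, running = 1, float("inf"), 0
        for i in range(1, len(symbols)):
            running += symbols[i - 1][1]
            diff = abs(total - 2 * running)   # |left sum - right sum|
            if diff < best_diff:
                best_split, best_diff = i, diff
        # Recurse: the left group gets prefix 0, the right group prefix 1.
        codes = {c: "0" + w for c, w in shannon_fano(symbols[:best_split]).items()}
        codes.update({c: "1" + w for c, w in shannon_fano(symbols[best_split:]).items()})
        return codes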

The important element of this algorithm is the first step. It ensures that characters with similar probabilities often end up in the same subtree. This matters because nodes in the same subtree receive bit sequences of similar length (the maximum difference is at most the number of nodes in that subtree). If the probabilities were instead very different within a subtree, bit sequences with the necessary length differences could not be generated. The result would be that rare characters get short bit sequences and frequent characters get long ones.

Example

The example shows the construction of the Shannon–Fano code for a small alphabet. The five input characters have the following frequencies:

A: 15
B: 7
C: 6
D: 6
E: 5

The values are already sorted; next follows the partitioning. Initially, all the characters are in one partition (picture a).

Half of the total frequency is 19.5. The sum 15 lies 4.5 below this half; 15 + 7 = 22, however, lies only 2.5 above it, and a better partition does not exist. The alphabet is thus divided into two parts, so that one part contains the characters A and B and the other the rest (C, D and E) (picture b). Both partitions still contain more than one character, so they must be divided further. The partition with A and B decomposes into two parts of one character each. For the other group there are two options.

The sum 6 + 6 = 12 is farther from the middle (8.5) than 6 alone, so the group is divided into the two partitions {C} and {D, E} (picture c). Finally, D and E are separated. The resulting decision tree is shown in part d of the picture.

The characters A, B and C were thus assigned 2 bits each, D and E 3 bits each. The probability of occurrence of A is 15/39, that of B 7/39, that of C and D 6/39 each, and that of E 5/39. The average code word length is therefore

(2 · 15 + 2 · 7 + 2 · 6 + 3 · 6 + 3 · 5) / 39 = 89/39 ≈ 2.28 bits per character.

Since the assignment of bits to branches is not unique, here is one possible coding under this tree:

A: 00
B: 01
C: 10
D: 110
E: 111
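
For comparison, the recursive sketch given earlier reproduces exactly this assignment and the average code word length:

    freqs = [("A", 15), ("B", 7), ("C", 6), ("D", 6), ("E", 5)]
    codes = shannon_fano(freqs)
    print(codes)    # {'A': '00', 'B': '01', 'C': '10', 'D': '110', 'E': '111'}

    total = sum(f for _, f in freqs)
    average = sum(f * len(codes[c]) for c, f in freqs) / total
    print(average)  # 89/39 ≈ 2.28 bits per character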
