Tunstall coding

Tunstall coding is a form of lossless data compression and entropy coding, developed in 1967 by Brian Parker Tunstall in his doctoral thesis at the Georgia Institute of Technology. In contrast to similar methods such as Huffman coding or Lempel-Ziv 77, Tunstall coding maps a variable-length sequence of source symbols to a code word with a fixed number of bits. The length of each source sequence is determined by the probabilities of occurrence of the individual source symbols, and the sequences are stored in a search tree.

Method

As an example, the string "hello, world" is encoded. For simplicity, it is further assumed that the set of source symbols, the so-called alphabet, comprises only the nine symbols that occur in this string: the space, the comma, and the letters d, e, h, l, o, r, w. In this case, the probability of occurrence of each character can be read directly from the string to be encoded: for example, the character 'l' occurs three times in the 12-character string, which corresponds to a probability of occurrence of 3/12.
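
The relative frequencies used in this example can be reproduced with a few lines of Python. This is only an illustrative sketch; the variable names `text`, `counts` and `probabilities` are chosen here and are not part of the original description.

```python
from collections import Counter
from fractions import Fraction

text = "hello, world"

# Count how often each character occurs in the string.
counts = Counter(text)

# Relative frequency = count / string length, kept as an exact fraction.
probabilities = {symbol: Fraction(n, len(text)) for symbol, n in counts.items()}

# Print the unreduced form (e.g. 3/12 for 'l') to match the text above.
for symbol in sorted(probabilities):
    print(repr(symbol), f"{counts[symbol]}/{len(text)}")
```

Exact fractions are used instead of floating-point numbers so that the probabilities sum to exactly 1.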

In the first iteration, which sets up the search tree, the individual probabilities of occurrence of all nine source symbols are determined, and each of these 9 symbols is assigned a code word of length $\lceil \log_2 9 \rceil = 4$ bits. The notation $\lceil \cdot \rceil$ denotes the ceiling function (Gauss bracket), that is, rounding up to the next integer. For example, as shown in the figure, the symbol 'h' is assigned the code word w3 = "0010".
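
The fixed code word length depends only on the number of entries to be distinguished. The sketch below computes it and hands out consecutive binary code words. The function name `code_length` and the alphabetical ordering of the symbols are assumptions made here for illustration; the assignment in the figure (for instance w3 for 'h') may use a different order.

```python
def code_length(num_entries: int) -> int:
    """Bits needed to give each of num_entries entries a distinct fixed-length code word.

    For num_entries >= 2 this equals ceil(log2(num_entries)), computed with
    integer arithmetic to avoid floating-point rounding.
    """
    return max(1, (num_entries - 1).bit_length())


# Assumed ordering: symbols sorted by character code, for illustration only.
symbols = sorted(set("hello, world"))
bits = code_length(len(symbols))

# Assign the code words 0000, 0001, ... in that order.
codebook = {symbol: format(index, f"0{bits}b") for index, symbol in enumerate(symbols)}

for symbol, word in codebook.items():
    print(repr(symbol), word)
```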

Further iterations improve the entropy coding. In the second iteration, the leaf of the tree with the highest probability of occurrence, here the symbol 'l' with 3/12, is selected and expanded: for each of the nine possible symbols that can follow 'l', a new leaf is created. Again, the probabilities of occurrence of the individual symbols and symbol sequences are computed and sorted in descending order in the tree, as shown in the second figure. The probability of occurrence of the symbol sequence "ll" in this example is thus:

$$P(\text{ll}) = P(\text{l}) \cdot P(\text{l}) = \frac{3}{12} \cdot \frac{3}{12} = \frac{9}{144}$$
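
The expansion step, including the product of probabilities for a sequence such as "ll", can be sketched as follows. This is a minimal illustration under stated assumptions, not a reference implementation: the symbols are treated as independent, the names `build_tunstall`, `parse` and `limit` are hypothetical, and no bit patterns are assigned because the ordering used in the figures is not reproduced here.

```python
from collections import Counter
from fractions import Fraction


def build_tunstall(probabilities, limit):
    """Expand the most probable leaf until the number of leaves reaches or exceeds limit."""
    leaves = dict(probabilities)  # maps symbol sequence -> probability
    while len(leaves) < limit:
        best = max(leaves, key=leaves.get)  # leaf with the highest probability
        p_best = leaves.pop(best)           # it becomes an inner node
        for symbol, p in probabilities.items():
            # Assumption: independent symbols, so probabilities multiply.
            leaves[best + symbol] = p_best * p
    return leaves


def parse(text, leaves):
    """Split text into dictionary sequences, trying the longest candidate first."""
    longest = max(len(sequence) for sequence in leaves)
    pieces, i = [], 0
    while i < len(text):
        for length in range(min(longest, len(text) - i), 0, -1):
            if text[i:i + length] in leaves:
                pieces.append(text[i:i + length])
                i += length
                break
        else:
            # Happens, for example, if the text ends in the middle of a sequence.
            raise ValueError(f"no dictionary entry matches at position {i}")
    return pieces


text = "hello, world"
probabilities = {s: Fraction(n, len(text)) for s, n in Counter(text).items()}
leaves = build_tunstall(probabilities, limit=17)
pieces = parse(text, leaves)
print(len(leaves), leaves["ll"], pieces)
```

With a limit of 17, a single expansion of 'l' takes place: the leaf 'l' is replaced by nine two-symbol sequences, and the example string splits into ten sequences.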

As a result, there are 17 different code words in total, which are encoded with $\lceil \log_2 17 \rceil = 5$ bits each. The process stops when the number of code words has reached or exceeded an initially preset limit. If the Tunstall coding in this example were terminated after the second iteration with a size of 17 code words, the string "hello, world" would be encoded as the following binary code sequence, shown here together with the associated source symbols:

Code word:  01010  01011  00001  00000  01100  01101  01110  00000  01111  00011
Source:     h      e      ll     o      ,      ␣      w      o      r      ld

(␣ denotes the space character.)

Literature

Martin Bossert: Applied Information Theory. University of Ulm, April 2007 (PDF; 815 kB), accessed on April 5, 2013.

