Entropy encoding

Entropy coding is a method of lossless data compression that assigns to each individual character of a text a bit sequence whose length varies from character to character. Typical examples are Huffman coding and arithmetic coding.

In contrast, there are string-replacement methods, which replace a sequence of characters of the original text with a sequence of characters of another alphabet.

Since a certain minimum number of bits is necessary to distinguish all characters from one another, the number of bits assigned to the characters cannot become arbitrarily small. The optimal number of bits that should be assigned to a character with a given probability is determined by the entropy.
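
Expressed in formulas (standard information-theoretic notation, not spelled out in the article itself): a character with probability p_i ideally receives ℓ_i bits, and the entropy H is the lower bound on the average number of bits per character.

```latex
\ell_i = -\log_2 p_i , \qquad H = -\sum_i p_i \log_2 p_i
```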

Entropy coders are often combined with other encoders. Upstream stages can serve to reduce the entropy of the data. Often these are prediction methods such as the Burrows-Wheeler transform, but other compressors are common as well. LHarc, for example, uses an LZ encoder and passes the characters output by that encoder on to a Huffman encoder. ZIP and bzip2 also use an entropy coder as their final stage.


Coding

Entropy coding first determines the frequency of the symbols (characters). Each symbol is then represented by a specific bit pattern, where common symbols receive short bit patterns in order to minimize the total number of bits required.

Mathematical algorithms for estimating the optimal code length of each symbol usually lead to non-integer results. In practice, the lengths must be rounded to integer values, which gives away part of the possible compression.
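
As a small illustration (the sample text is chosen arbitrarily), the ideal lengths −log2(p) are generally not whole numbers and have to be rounded:

```python
import math
from collections import Counter

# Count symbol frequencies in a sample text and compute the ideal
# (entropy-optimal) code length -log2(p) for each symbol.
text = "abracadabra"
counts = Counter(text)
total = sum(counts.values())

for symbol, count in sorted(counts.items()):
    p = count / total
    ideal = -math.log2(p)                      # usually not an integer
    print(symbol, count, round(ideal, 3), math.ceil(ideal))
```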

Integer numbers of bits

Shannon-Fano coding is one possible way of rounding the number of bits to integers. In certain cases, however, this algorithm does not yield the optimal solution. For this reason, the Huffman code was developed, which provably always yields one of the optimal solutions using whole bits. Both algorithms generate a prefix-free variable-length code by constructing a binary tree. In this tree, the leaves stand for the symbols to be encoded and the inner nodes for the bits to be stored; a minimal construction sketch in Python follows below.
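
The following is a minimal, unoptimized illustration of the Huffman construction (function and variable names are my own, not from any particular library), not a production encoder:

```python
import heapq
from collections import Counter

def huffman_codes(text):
    """Build Huffman codes by repeatedly merging the two least frequent subtrees.

    The path from the root to a leaf gives the prefix-free code of that
    leaf's symbol: left branch = "0", right branch = "1".
    """
    counts = Counter(text)
    # Heap entries: (frequency, tie-breaker, tree); a tree is a symbol or (left, right).
    heap = [(f, i, s) for i, (s, f) in enumerate(counts.items())]
    heapq.heapify(heap)
    if len(heap) == 1:                       # degenerate case: only one distinct symbol
        return {heap[0][2]: "0"}
    tie = len(heap)
    while len(heap) > 1:
        f1, _, t1 = heapq.heappop(heap)
        f2, _, t2 = heapq.heappop(heap)
        heapq.heappush(heap, (f1 + f2, tie, (t1, t2)))
        tie += 1
    codes = {}
    def walk(tree, prefix=""):
        if isinstance(tree, tuple):          # inner node: descend into both subtrees
            walk(tree[0], prefix + "0")
            walk(tree[1], prefix + "1")
        else:                                # leaf: store the accumulated bit pattern
            codes[tree] = prefix
    walk(heap[0][2])
    return codes

print(huffman_codes("abracadabra"))          # frequent symbols get shorter codes
```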

In addition to these methods, which create custom tables specifically adapted to the data to be encoded, there are also variants that use fixed code tables. The Golomb code, for example, can be used for data whose frequency distribution follows certain rules. These codes have a parameter for adapting them to the exact parameters of the frequency distribution.
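
As an illustration of such a fixed code, here is a minimal sketch of a Rice code, i.e. the special case of a Golomb code with parameter m = 2^k; the function name and the choice k = 2 are for demonstration only:

```python
def rice_encode(n, k):
    """Rice code (Golomb code with m = 2**k) for a non-negative integer n.

    The quotient n >> k is written in unary (q ones followed by a zero),
    the remainder in k plain binary bits; small values get short codes,
    so k is chosen to match the expected value distribution.
    """
    q, r = n >> k, n & ((1 << k) - 1)
    return "1" * q + "0" + (format(r, "0{}b".format(k)) if k > 0 else "")

for n in range(6):
    print(n, rice_encode(n, 2))
```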

Improvement

However, these methods reach the number of bits prescribed by the entropy only in exceptional cases, which leads to suboptimal compression.

Suppose, for example, that a string containing only two different symbols, A and B, is to be compressed, where one character occurs with probability p and the other with probability 1 − p, and the distribution is strongly skewed. The methods of Shannon-Fano and Huffman both store each character with one bit, so the output contains as many bits as the input contains characters.

An optimal entropy coder, however, would use only −log2(p) bits for the character A and −log2(1 − p) bits for the character B, yielding an output with only −p·log2(p) − (1 − p)·log2(1 − p) bits per symbol on average, i.e. the entropy of the source; the output then carries maximum information per bit.
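
To make the comparison concrete, the following sketch uses an illustrative probability (the article's original figures are not reproduced here) and contrasts the one bit per symbol spent by Huffman/Shannon-Fano with the entropy lower bound:

```python
import math

def binary_entropy(p):
    """Entropy in bits per symbol of a source emitting A with probability p and B with 1 - p."""
    return -(p * math.log2(p) + (1 - p) * math.log2(1 - p))

p = 0.95   # illustrative value only
print("Huffman/Shannon-Fano:", 1.0, "bit per symbol")
print("entropy lower bound: ", round(binary_entropy(p), 3), "bits per symbol")
```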

With an arithmetic coder one can come closer to this optimal coding. The method implicitly distributes the bits more evenly over the characters to be encoded, without explicitly constructing an individual code word for each character. But even with this method the maximum entropy generally cannot be reached, because an overhead remains: only whole numbers of bits can actually occur, while the entropy usually calls for fractional numbers of bits. In the above example, the arithmetic coder likewise reaches a code length of one bit. However, the overhead generally vanishes as the length of the data to be encoded increases, so that in the limit the code length per symbol approaches the entropy.
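
The interval-narrowing idea behind arithmetic coding can be illustrated with a toy, float-based, encode-only sketch (real coders use integer arithmetic with renormalization; names and probabilities here are illustrative). The number of bits needed to identify the final interval is roughly −log2 of its width:

```python
import math

def arithmetic_interval(message, probs):
    """Narrow the interval [low, high) one symbol at a time (toy, float-based)."""
    # cumulative probability ranges per symbol, e.g. {"A": (0.0, 0.9), "B": (0.9, 1.0)}
    cumulative, start = {}, 0.0
    for symbol, p in probs.items():
        cumulative[symbol] = (start, start + p)
        start += p
    low, high = 0.0, 1.0
    for symbol in message:
        span = high - low
        c_low, c_high = cumulative[symbol]
        low, high = low + span * c_low, low + span * c_high
    return low, high

low, high = arithmetic_interval("AAAAABAAAA", {"A": 0.9, "B": 0.1})
bits = math.ceil(-math.log2(high - low)) + 1   # rough bit count to pin down the interval
print(round(low, 6), round(high, 6), bits, "bits for 10 symbols")
```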

Model

In order to specify the number of bits for each character, one must have, at every point of the encoding process, the most precise possible knowledge of the probability of each character. This is the task of the model. The better the model predicts the probabilities, the better the compression. The model must deliver exactly the same values during compression and decompression. Over time, various types of models have been developed.

Static model

In the static model, statistics over the entire sequence are gathered before the character string is encoded. The probabilities obtained in this way are then used to encode the whole string.

Advantages:

  • Encoding tables must be created only once. The process is therefore very fast.
  • Without the statistics required for decompression, the output is guaranteed not to be larger than the original text.

Disadvantages:

  • The statistics must be transmitted to the decoder (either as the statistics themselves or in the form of the Huffman or Shannon-Fano codes), which costs memory.
  • The statistics refer to the entire character string, i.e. the values do not adapt to local conditions within the string.

Dynamic Model

In this model, the probabilities change during the course of the encoding process. There are several variants; in the forward-dynamic variant, for example, the statistics are built up from the characters already processed, so that the decoder can reproduce them on its own.

Advantages:

  • The model adapts to local circumstances.
  • In the forward-dynamic model, no statistics need to be transmitted.

Disadvantages:

  • The tables must be updated after every character, which costs computing time.
  • The statistics lag behind the actual conditions. In the worst case, the statistics never match the true probabilities, so that more bits are needed.

Dynamic models usually do not work with probabilities but with the frequencies of occurrence of the characters, as in the sketch below.
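
A minimal sketch of such a frequency-based dynamic model (the class name is hypothetical; encoder and decoder would each run the same code so their statistics stay in sync):

```python
class AdaptiveModel:
    """Frequency-based (forward-)dynamic model, updated after every processed symbol."""

    def __init__(self, alphabet):
        # start with a count of 1 per symbol so every symbol stays encodable
        self.counts = {s: 1 for s in alphabet}

    def probability(self, symbol):
        return self.counts[symbol] / sum(self.counts.values())

    def update(self, symbol):
        self.counts[symbol] += 1

model = AdaptiveModel("ab")
for ch in "aaab":
    print(ch, round(model.probability(ch), 3))   # probabilities drift toward 'a'
    model.update(ch)
```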

Dynamic models also allow other variations.

The statistics can be aged by halving the character frequencies from time to time, in order to reduce the influence of characters seen long ago.
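
One possible aging step, as a sketch (keeping a minimum count of 1 so every character remains encodable):

```python
def age_counts(counts):
    """Halve all frequencies, keeping a minimum of 1 (illustrative sketch)."""
    return {symbol: max(1, count // 2) for symbol, count in counts.items()}

print(age_counts({"a": 40, "b": 7, "c": 1}))   # {'a': 20, 'b': 3, 'c': 1}
```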

For characters that have never yet occurred, there are several variants:

  • A frequency of at least one is assumed for every character, so that all characters can be encoded.
  • A new character is added to the alphabet that signals a departure from the current context. After this character has been coded, all characters of the alphabet can be written or read with a fixed code. This character is usually called the escape character. Some of the best compression algorithms, those of the PPM family, use this approach; a rough sketch follows after this list.
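
A rough sketch of the escape mechanism (helper names and the fixed code are invented for illustration; real PPM coders are considerably more involved):

```python
ESCAPE = "<ESC>"

def encode_symbol(symbol, counts, fixed_code):
    """Emit either the symbol itself or ESCAPE followed by a fixed code (sketch)."""
    if counts.get(symbol, 0) > 0:
        return [symbol]                        # codable from the adaptive statistics
    return [ESCAPE, fixed_code[symbol]]        # fall back to a fixed code for unseen symbols

fixed_code = {"a": "00", "b": "01", "c": "10", "d": "11"}
counts = {"a": 3, "b": 1}                      # symbols seen so far in this context
print(encode_symbol("a", counts, fixed_code))  # ['a']
print(encode_symbol("d", counts, fixed_code))  # ['<ESC>', '11']
```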

Level of the model

The level of a model refers to how many characters of the history the model uses to calculate the probabilities. A level-0 model considers no history, so the probabilities are global. A level-1 model, by contrast, considers the preceding character and makes its statement about the probability dependent on that character. (If German text is to be coded, for example, the probability of the letter "U" after a "Q" is much higher, and the probability of a capital letter in the middle of a word is much smaller than after a space.) The level can in theory be arbitrarily high, but higher levels require enormous amounts of memory and large amounts of data in order to obtain meaningful statistics.
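
A sketch of the statistics of a level-1 model, which are kept separately for each preceding character (names and the sample text are illustrative):

```python
from collections import Counter, defaultdict

def order1_counts(text):
    """Count each character separately per preceding character (level-1 model sketch)."""
    contexts = defaultdict(Counter)
    for prev, cur in zip(text, text[1:]):
        contexts[prev][cur] += 1
    return contexts

contexts = order1_counts("banana bandana")
print(contexts["n"])   # which characters follow 'n', and how often
print(contexts["a"])   # the distribution after 'a' differs from the global one
```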

The PPM algorithm, for example, uses an arithmetic coder with a dynamic model of level 5.
