Arithmetic coding

Arithmetic coding is a form of entropy encoding that is used, among other things, for lossless data compression.

This article describes only how a given set of character–probability pairs can be encoded so that the smallest possible average code length is achieved. A lower bound for this is given by the entropy (the average information content) of the source (source coding theorem).

The model of the character probabilities that is always associated with an entropy coder is described under entropy encoding (section Model).

Among its founders is Jorma Rissanen, who developed the method in the late 1970s and early 1980s.


Basic principle

In theory, the method works with infinitely precise real numbers; in actual implementations, however, one has to fall back on finite-precision integer, fixed-point, or floating-point arithmetic. This always means that values must be rounded, so the result is no longer optimal.

The input to the arithmetic encoder is referred to below as a character or symbol. The output of the encoder is a real number (denoted here by x).

First, encoder and decoder must agree on an interval in which the number x is to lie. Normally the range between 0 and 1 is used, with 1 excluded, i.e. the half-open interval [0; 1).

In addition, at every point of the encoding or decoding of a character, encoder and decoder must have identical tables with the probabilities of all possible characters. The model is responsible for providing these probabilities.

One possibility is to run a frequency analysis over the input data before encoding and to transmit it to the decoder in addition to the actual message. Encoder and decoder then use this table for all characters.
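A static model of this kind can be built with a short Python sketch (the function name build_model and the use of exact fractions are illustrative choices, not a standard library facility):

```python
from collections import Counter
from fractions import Fraction

def build_model(message: str) -> dict[str, Fraction]:
    """Static probability table from a frequency analysis of the input.

    The table must be transmitted to the decoder along with the
    encoded message, so both sides subdivide intervals identically.
    """
    counts = Counter(message)
    total = len(message)
    # A fixed (here: alphabetical) order, agreed between encoder and decoder.
    return {ch: Fraction(n, total) for ch, n in sorted(counts.items())}

print(build_model("AAABAAAC"))  # {'A': Fraction(3, 4), 'B': Fraction(1, 8), 'C': Fraction(1, 8)}
```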

Encoder
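The encoder starts with the agreed interval [0; 1). For each input character it subdivides the current interval according to the probability table (see the section on the interval division below) and continues with the subinterval belonging to that character. After the last character, an arbitrary number from the final interval is output.

A minimal sketch in Python, continuing the build_model example above (exact fractions avoid rounding; the helper subinterval implements the subdivision rule given below):

```python
def subinterval(low: Fraction, high: Fraction, char: str,
                model: dict[str, Fraction]) -> tuple[Fraction, Fraction]:
    """Return the subinterval of `char` within [low, high)."""
    cum = Fraction(0)
    for ch, p in model.items():  # fixed order agreed with the decoder
        if ch == char:
            width = high - low
            return low + width * cum, low + width * (cum + p)
        cum += p
    raise KeyError(char)

def encode(message: str, model: dict[str, Fraction]) -> tuple[Fraction, Fraction]:
    """Narrow [0, 1) once per character; any number in the final interval encodes the message."""
    low, high = Fraction(0), Fraction(1)
    for char in message:
        low, high = subinterval(low, high, char, model)
    return low, high
```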

Decoder

The decoder can then decode one character at a time by performing the following steps:

  • Subdivide the current interval (initially [0; 1)) into subintervals according to the probability table, exactly as the encoder does.
  • Find the subinterval in which the number x lies and output the corresponding character.
  • Make this subinterval the new current interval and repeat.

Notably, this algorithm does not terminate on its own: the number x alone does not reveal when the last character has been decoded. The decoder must therefore always be told by additional information when its work is done. This is usually realized in the form of an explicitly stored length, but it can also be done with a special character meaning "end" (for example, if only a single pass over the data is desired during encoding).
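Continuing the Python sketches above, a decoder that relies on an explicitly transmitted length might look like this (the length parameter reflects the usual solution just described):

```python
def decode(x: Fraction, length: int, model: dict[str, Fraction]) -> str:
    """Decode `length` characters from the number x in [0, 1).

    The length is transmitted separately, since x alone does not
    reveal where the message ends.
    """
    low, high = Fraction(0), Fraction(1)
    decoded = []
    for _ in range(length):
        for char in model:  # same fixed order as the encoder
            lo, hi = subinterval(low, high, char, model)
            if lo <= x < hi:
                decoded.append(char)
                low, high = lo, hi
                break
    return "".join(decoded)
```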

The interval division

The subintervals must be chosen so that encoder and decoder determine their size and position in the same way. As mentioned above, the sizes of the subintervals follow from the probabilities of the characters.

The arrangement (order) of the intervals, on the other hand, does not matter for the quality of the algorithm, so any fixed order can be specified here; it is simply a matter of agreement (e.g., alphabetical order).

One possible way of calculating the intervals is the following: let $l$ and $h$ be the boundaries of the current interval and $s = h - l$ its length. If the input character has index $k$ in the agreed order, the new interval is

$$l' = l + s \cdot L_k, \qquad h' = l + s \cdot H_k,$$

where $L_k = \sum_{i < k} p_i$ and $H_k = \sum_{i \le k} p_i$ are the sums of the probabilities of all characters whose index is less than, or less than or equal to, the index of the input character.
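For instance, with the probabilities used in the example below ($p_A = \tfrac{3}{4}$, $p_B = p_C = \tfrac{1}{8}$, in alphabetical order) and the starting interval $[0; 1)$, the character B has $L_B = \tfrac{3}{4}$ and $H_B = \tfrac{3}{4} + \tfrac{1}{8} = \tfrac{7}{8}$, so the new interval is

$$l' = 0 + 1 \cdot \tfrac{3}{4} = 0.75, \qquad h' = 0 + 1 \cdot \tfrac{7}{8} = 0.875.$$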

Example

In this example, the string "AAABAAAC" is compressed. First, the frequencies of all characters are needed to fix the sizes of the subintervals. For simplicity, a static probability is used for each character: A occurs six times out of eight (p = 0.75), B and C once each (p = 0.125). In alphabetical order this yields the subintervals A: [0; 0.75), B: [0.75; 0.875) and C: [0.875; 1).

The optimal number of bits follows from the formula for the entropy:

$$H = -\sum_k p_k \log_2 p_k = -\left( \tfrac{3}{4} \log_2 \tfrac{3}{4} + 2 \cdot \tfrac{1}{8} \log_2 \tfrac{1}{8} \right) \approx 1.061 \text{ bits per character}.$$

With this value one can calculate that the information content of the eight-character string corresponds to $8 \cdot H \approx 8.49$ bits.
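This value can be checked quickly in Python:

```python
from math import log2

p = {"A": 6/8, "B": 1/8, "C": 1/8}
entropy = -sum(q * log2(q) for q in p.values())
print(entropy, 8 * entropy)  # ≈ 1.0613 bits per character, ≈ 8.49 bits in total
```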

Now for the encoding itself. The table below shows the exact subinterval selected after each encoded character:

  1. A: [0; 0.75)
  2. A: [0; 0.5625)
  3. A: [0; 0.421875)
  4. B: [0.31640625; 0.369140625)
  5. A: [0.31640625; 0.35595703125)
  6. A: [0.31640625; 0.3460693359375)
  7. A: [0.31640625; 0.338653564453125)
  8. C: [0.335872650146484375; 0.338653564453125)

An arbitrary number from the final interval that is as short as possible is stored, e.g. 0.336.

This corresponds to 8–9 bits. Huffman coding, on the other hand, would require 10 bits for the given string (1 bit for each A and 2 bits each for B and C).

The difference in this example is 10%. The gain is greater the more the number of bits actually used by Huffman coding deviates from the optimum, i.e. when one character is extremely common.

The decoder uses this number for decoding. The following happens:

  • It starts with the initial interval [0; 1). The decoder divides this interval into the three subintervals A: [0; 0.75), B: [0.75; 0.875) and C: [0.875; 1).
  • 0.336 lies in the subinterval of A, [0; 0.75). So the first character is A.
  • The current interval becomes [0; 0.75).
  • [0; 0.75) is subdivided again according to the known scheme.
  • 0.336 again lies in the first subinterval, [0; 0.5625), so an A is output again.
  • ...
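Put together, the Python sketches from above reproduce this example:

```python
model = build_model("AAABAAAC")
low, high = encode("AAABAAAC", model)
print(float(low), float(high))  # 0.335872650146484375 0.338653564453125
print(decode(Fraction(336, 1000), 8, model))  # AAABAAAC
```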

Optimality

Arithmetic coding is asymptotically optimal:

After the last symbol has been processed, one obtains an interval of length

$$s = \prod_{i=1}^{n} p_{x_i}.$$

This corresponds to the probability of obtaining exactly such a sequence $x_1 \dots x_n$ for the given symbol probabilities.

To specify a binary value within the interval, one needs

  • at least $-\log_2 s$ bits,
  • but not more than $\lceil -\log_2(s/2) \rceil = \lceil -\log_2 s \rceil + 1$ bits (to describe the interval with an accuracy of $s/2$, which in the binary system just suffices to decide whether the value lies inside).

Because

$$-\log_2 s \le \lceil -\log_2 s \rceil$$

and

$$\lceil -\log_2 s \rceil + 1 < -\log_2 s + 2,$$

the length $l$ of the arithmetically coded sequence can be estimated:

$$-\log_2 s \le l \le -\log_2 s + 2.$$

This means that one needs at least as many bits as the information content of the sequence, but at most two bits more.
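For the example string from above, this can be checked directly: the final interval has length

$$s = \left(\tfrac{3}{4}\right)^6 \cdot \left(\tfrac{1}{8}\right)^2 = \tfrac{729}{262144} \approx 0.00278, \qquad -\log_2 s \approx 8.49,$$

which matches the information content of 8.49 bits calculated earlier, so at most about $8.49 + 2 \approx 10.49$ bits are needed.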

The average length $\bar{l}$ of an encoded symbol is bounded by

$$\frac{-\log_2 s}{n} \le \bar{l} \le \frac{-\log_2 s}{n} + \frac{2}{n}.$$

For long sequences, these (at most two) additional bits are distributed evenly over all symbols, so that the average length of an encoded symbol asymptotically approaches the true entropy:

$$\lim_{n \to \infty} \bar{l} = H.$$

Comparison with Huffman coding

Huffman coding can in theory achieve an equally optimal code, namely exactly when all symbol probabilities are powers of two (i.e., of the form $2^{-k}$ with integer $k$).

In practice this is rarely the case, so arithmetic coding usually codes more efficiently.
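For illustration, with the probabilities from the example above: Huffman coding assigns the code lengths $(1, 2, 2)$ to $(A, B, C)$, giving an average of $\tfrac{3}{4} \cdot 1 + \tfrac{1}{8} \cdot 2 + \tfrac{1}{8} \cdot 2 = 1.25$ bits per character, while the entropy is only about $1.061$ bits. With probabilities that are powers of two, e.g. $(\tfrac{1}{2}, \tfrac{1}{4}, \tfrac{1}{4})$, the same code lengths would give an average of exactly $1.5$ bits, which equals the entropy.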

Implementation

Since one cannot work with infinitely precise real numbers in a concrete implementation, the algorithm must be implemented somewhat differently. The maximum precision of the numbers is generally fixed in advance (e.g., 32 bits) and cannot be exceeded. A true arithmetic coder therefore cannot be implemented on a real computer.

To circumvent the problem of limited precision, two steps are taken:

  1. The interval boundaries are rounded to the available precision.
  2. Whenever the current interval becomes too small, it is rescaled (renormalized): the leading bits that can no longer change are output, as in the sketch below.
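A sketch of the rescaling in step 2, in the style of the classic integer implementation by Witten, Neal and Cleary (the 16-bit width and all names are illustrative; the surrounding subdivision logic is omitted):

```python
HALF, QUARTER = 1 << 15, 1 << 14  # for a 16-bit coder
THREE_QUARTERS = 3 * QUARTER

def renormalize(low: int, high: int, pending: int, out: list[int]):
    """Emit bits that can no longer change and rescale the interval [low, high]."""
    def emit(bit: int, n: int) -> int:
        out.append(bit)
        out.extend([1 - bit] * n)  # flush deferred opposite bits
        return 0
    while True:
        if high < HALF:                    # entirely in the lower half: next bit is 0
            pending = emit(0, pending)
        elif low >= HALF:                  # entirely in the upper half: next bit is 1
            pending = emit(1, pending)
            low, high = low - HALF, high - HALF
        elif low >= QUARTER and high < THREE_QUARTERS:
            pending += 1                   # straddles the middle: defer the bit
            low, high = low - QUARTER, high - QUARTER
        else:
            break
        low, high = 2 * low, 2 * high + 1  # double the interval
    return low, high, pending
```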

Point 1 in fact means that the algorithm is no longer a true arithmetic coder, but only similar to one. However, there are independent algorithms derived from the arithmetic coder, such as the range coder and IBM's Q-coder.

Despite these measures, various problems remain with arithmetic coding, notably its speed: the subdivision requires multiplications and divisions, which makes it slower than Huffman coding. Its use has also long been complicated by patents.
