Range encoding

The range coding (English range encoding) is a data compression method for entropy coding, which realizes a sort of arithmetic coding.

The coding region is often seen as an alternative description of the arithmetic encoding and as basically identical to this one. It is based on the 1979 published document "Range encoding: an algorithm for removing redundancy from a Digitised message " (English to German about the area coding, an algorithm to redundancy of digitized messages to remove ), by G. N. N. Martin. Due to the age of the document, it is assumed that the method for arithmetic coding implementations described herein are not affected by the patents in the arithmetic coding. This has sparked interest in the art, especially in the free software community.

Operation

The coding region encoded in principle, all symbols of a message to a number - in contrast to the Huffman coding, which assigns each symbol a bit pattern and the bit pattern after another lined up again. Therefore, the coding region is not as such limit the need to use at least one bit of a symbol, and also does not suffer from this such as the inefficiency in the use of probabilities of occurrence which is a multiple of two are not exact. Thus, in order to achieve better compression rates.

The central concept behind the range coding is simplified as follows: If you have a sufficient range (English range) of integer values as symbols and an assessment of the probabilities of occurrence of the symbols, the original area can be easily divided into sub-areas, whose sizes are proportional to the probabilities of occurrence symbols they represent. Thus, each symbol of the message to be coded by the value range is reduced to the sub-area corresponding to the next symbol to be encoded. The decoder has to have the same probability of acceptance as the encoder are available, either previously transferred, derived from data already transferred or may be part of the encoder and decoder.

If all the symbols are coded, it is enough to transmit the message, display the sub-region, provided of course that the decoder gets to know when the message is complete again. A single integer is sufficient to indicate the sub-region, and it must not even be necessary to transfer the whole integer, but maybe there are enough of the first centers in order to describe the original message clearly.

Example

It is the message " AABA " are coded, with the end of the message ( end of message ) indicates. For this example, it is assumed that it is known to the decoder that a decimal system, a range of [0, 100000) and the frequency rating { A: 60; B:, 20; : 20} will be used. The first symbol divides the range [0, 100000) into three sub- areas:

A: [0, 60000 ) B: [ 60000, 80000 ) : [ 80000, 100000) Since the first symbol is a A, this reduces the initial range to [0, 60000 ). The second symbol decision cut that subregion in turn into three sub- areas shown below according to the already coded 'A':

AA: [0, 36000 ) AB: [ 36000, 48000 ) A : [ 48000, 60000 ) Having already encoded two symbols, the range [ 000000 remains, 036000 ) and the third symbol decide between the following areas:

AAA: [0, 21600 ) AAB: [ 21600, 28800 ) AA : [ 28800, 36000 ) This time it is the second of the three options that describes the desired message, and the range is limited to [ 21600, 28800 ). It may now be more difficult for this case, the sub-regions appear to make out but they arise quite simple: By subtracting the lower limit from the top shows that the area includes 7200 numbers, which are divided according to the probability model in 4320 for the first, 60 - and each share in 1400 for the next two ,20- shares of the total (partial) range. Added back to the lower limit value of the total (partial) range results in the sub-areas:

AABA: [ 21600, 25920 ) AABB: [ 25920, 27360 ) AAB : [ 27360, 28800 ) To encode the last symbol of the area is now restricted already up to [ 21600, 25920 ). This breaks up for it now after just described method between the limits shown in the following subsections:

AABAA: [ 21600, 24192 ) AABAB: [ 24192, 25056 ) AABA : [ 25056, 25920 ) The last symbol puts the final range is fixed to [ 25056, 25920 ). Since all numbers that begin with " 251 ", fall within this range, the message is very fact clearly described. ( Since eight of these unique numbers actually exist, the encoding is still not ideal, in that the decimal system instead of, for example, the binary system was used. )

Now it may seem as the main problem to choose at the beginning of a sufficiently large area to encode all the symbols without leaving when dividing integers or to get zero. In practice, this problem is not apparent, however, since the encoder can increase the number gradually and first places that have already been established, already able to issue and no longer needed for the calculations. So is working at all times with a small number range by issued fixed numbers and this on the other side which must be added. In the example was already after processing the first three symbols "2 " as the first digit of the result determined and would no longer need to be included in the calculation.

116643