Run-length encoding

The run-length encoding (English run- length encoding, short RLE ) is a simple lossless compression algorithm and belongs to the entropy coder. It is suitable to compress longer repetition of symbols.

3.1 File Formats

Idea

The basic idea of the algorithm is to replace each sequence of identical symbols by the number and if necessary the symbol. That is, the points are only marked where the icon changes to the message. Since the length specification only grows logarithmically compared to the length of the sequence, you can save disk space significantly, especially in long repeat sequences. Conversely, the saving is smaller, the shorter the repetition.

Run length encoding is now as Vorkodierungsschritt (eg, in image compression, such as JPEG ) used to save effort in the next encoding step. ( For example, you may save with the Huffman coding, the consideration of longer symbols, as these have previously been reduced. )

Example:

Instead of a series of five repetitions of the character 0 and 1 three times

It stores only

The longer a single sequence, the greater the savings because

For 10 reps needed to two decimal places,
For 100 repetitions needed to three decimal places,
For 1000 repetitions needed to four decimal places, etc.

The same applies to any other number systems.

Method

Bit sequences

During the coding of bit strings there are only two possibilities: A sequence of zeros or a sequence of ones. Guaranteed to any sequence of zeros followed by at least a one - and vice versa also. The only exception is when the end of message is reached.

The encoder now agree with the decoder it is started with which bits. This can be either by convention or, for example by an additional bit at the beginning. Then the lengths of the zero and one - sequences are transmitted alternately. The decoder then needs to do but spend than to any received value corresponding number of zero or one bits nothing.

Example:

The output sequence loud:

Encodes this becomes:

Multi-valued symbol sequences

For compression of sequences of symbols consisting of an alphabet having more than two icons, is no longer unique, which symbol follows next (e.g. bytes have an alphabet of 256 characters ). Here, the symbol must also be sent in addition to the number of repetitions, of which the sequence.

A special feature here is that the compressed message may be even greater if the space for the number of repetitions is greater than the sequence itself

Example:

The output sequence loud:

Encodes this becomes:

Generally, a symbol may take place only from one, also consist of two characters:

In the worst case ( no repeats ) would be the " compressed " message is larger than the original. From the sequence

Implementation

The basic algorithm (without optimizations ) is easy to implement:

# include int main () { int n = 1; / * Number of repetitions * / int ch = -1; / * Current character * / int prev_ch = getchar (); / * Previous character * / do { ch = getchar (); if (( ch = prev_ch )! | | ( n == 255 )) / * Symbol / reps -tuple output if a different symbol is reached or the maximum representable number. * / { / * Printf (" % c % c ", prev_ch, n); * / / * Binary output * / printf (" % c ,% d \ n", prev_ch, n); / * Readable output as a 2 -tuple * / n = 0; / * Start of a new sequence * / } prev_ch = ch; n; } While ( ch = EOF! ); return 0; } modifications

Sometimes found in a message, very few repeat sequences. In order to prevent that with a polyvalent alphabet each occurrence is replaced by a tuple with length specified one (eg ABC → A1B1C1 ), coded one only consequences of a certain length (eg from three).

However, then one needs a special character (Escape character), indicating that a compressed tuple follows. This special character or symbol is ideally otherwise not in the message before, otherwise choose a symbol from which it is assumed that it occurs only rarely. The peculiarity of this symbol is that it must be every time encoded as a tuple (even if it occurs only once ), otherwise return can not be made between the symbol and the tuple.

Example:

The original message loud:

As an escape character we choose ( for clarity) the character " s". In addition, we only encode sequences that comprise at least three repeats of:

Repeating itself one letter more than three times or he is the escape character, so is indicated by the output of the escape character that follows a tuple with length specification. This is followed by the number of times and finally the corresponding character.

Through the escape character is additional memory requirement, this effect will be offset by the savings in length -1 sequences again in this case. Naive would be the encoded message:

Applications

Run-length encoding is used in combination with a modified Huffman coding in the fax transmission according to ITU- T Recommendation T.30 ( " G3 fax " ) are used. Especially when transferring black and white pages, the run-length encoding achieves good results, because here long white areas with shorter black areas alternate.

Wherein the lossy compression of images, the run length coding is applied after the transformation into the frequency domain to the individual coefficients. In particular, after quantization usually arise many of the same values, or zeros, which can effectively be compressed with run-length encoding. Then this will be compressed still further by the Huffman coding, " in front of" uncompressed data.

File Formats

Known file formats that use the run-length encoding, some older video formats such as Windows Bitmap, GEM Image, Targa or PCX. Under Microsoft Windows, the file extension is *. Rle commonly used for RLE - compressed images.

Numeral system Filename extension

415403