Lempel–Ziv–Welch

The Lempel–Ziv–Welch algorithm (LZW algorithm or LZW for short) is a data compression algorithm, that is, an algorithm for reducing the amount of data, that is frequently used in graphics formats. Most of the way the algorithm works was developed and published by Abraham Lempel and Jacob Ziv in 1978 (LZ78). Terry A. Welch made some improvements in 1983.

LZW is a lossless compression method. It is used, for example, in the GIF image format developed by CompuServe employees in 1987, and can optionally be used in TIFF. It is suitable for any kind of data, since the dictionary it uses is generated at run time and is therefore independent of the format. LZW is probably the best-known member of the LZ family.

Operation

LZW compresses using dictionaries in which the most frequently occurring character strings, such as "is", "the" and "a", are stored and then only need to be addressed by an abbreviation. The advantage of this algorithm is that the dictionary does not have to be stored separately. It is written implicitly into the file, and the decoder is able to reconstruct it from the data stream.

Entries in the dictionary are usually addressed via a 12-bit index, so at most 2^12 = 4096 entries are possible. The entries with indices 0 to 255 are filled with the corresponding bytes: entry 0 with 00hex, entry 2 with 02hex, ..., entry 255 with FFhex (hexadecimal). Entries inserted at run time must therefore start at index 256. A new entry is generated by storing the entry that was found plus the next character. If the string that was found is only one character long, usually only this character is stored, since a reference to the corresponding entry takes 12 bits while the character itself takes only 8 bits. Whether a reference or a character comes next in the bit stream can be indicated by a flag.

Compression

Algorithm for compression

The algorithm initially emits 9-bit codes. Later the codes can grow to as much as 12 bits wide, unless the alphabet is cleared beforehand by sending a clear code.
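
As a rough illustration of the growing code width, the sketch below computes how many bits are needed to address every index of a table of a given size, clamped to the 9 to 12 bit range mentioned above. The function name `code_width` and its parameters are hypothetical, and the exact moment at which a real encoder switches widths differs between file formats, so this is only an assumption-laden sketch.

```python
def code_width(table_size, min_bits=9, max_bits=12):
    """Bits needed to address indices 0 .. table_size - 1, clamped to a range.

    Hypothetical helper: real formats differ in exactly when they switch
    to the next width.
    """
    needed = max(1, (table_size - 1).bit_length())
    return min(max(needed, min_bits), max_bits)


if __name__ == "__main__":
    for size in (256, 512, 513, 1024, 1025, 4096):
        print(size, "entries ->", code_width(size), "bits")
```

With only the 256 predefined entries the codes are already 9 bits wide, and the width stays at 9 bits until the table grows beyond 512 entries.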

The lowest 256 values of the coding alphabet are predefined and stand for themselves when output. The algorithm searches the input for the longest pattern for which a code already exists in the coding alphabet and outputs the corresponding value. At the very beginning this is just a single byte, which is output as a 9-bit code whose ninth bit is 0. The algorithm then concatenates the next input character to this pattern and adds the result to the alphabet as the next higher entry. This continues until the alphabet is full. The alphabet is maintained internally by the compressor but is not stored explicitly. The decompressor builds it up in turn from its own input and can thus reconstruct it. There is also the KωK case, in which a pattern in the alphabet is not yet known to the decompressor. Even then, the decompressor can reconstruct the value.

Storing a table of 4096 patterns, each up to 4096 characters long, would in general require 16 MB. However, every pattern of length n in the table begins with a sub-pattern of length n − 1 that is also in the table. The whole table can therefore be stored in two fields, prefix and suffix. For a pattern k, the suffix holds the last character of the pattern and the prefix holds the index of the initial pattern (i.e., a reference to another entry in the table). If the pattern has length one, the prefix is set to a constant that stands for the empty pattern. In the algorithm, an entry in the table is written as a pair pattern = (prefix, suffix). The algorithm then operates as follows.
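
The prefix/suffix representation can be sketched as two parallel lists. This is only an illustration under assumptions: the names `EMPTY`, `make_table`, `add_pattern` and `expand` are hypothetical, and `-1` is an arbitrary choice for the constant that marks the empty pattern.

```python
EMPTY = -1  # hypothetical constant standing for the empty pattern


def make_table():
    """Create the 256 single-character entries: prefix is EMPTY, suffix is the byte."""
    prefix = [EMPTY] * 256
    suffix = list(range(256))
    return prefix, suffix


def add_pattern(prefix, suffix, prefix_index, char):
    """Append the pattern (pattern at prefix_index) + char and return its index."""
    prefix.append(prefix_index)
    suffix.append(char)
    return len(suffix) - 1


def expand(prefix, suffix, index):
    """Rebuild the full pattern by following prefix references back to EMPTY."""
    chars = bytearray()
    while index != EMPTY:
        chars.append(suffix[index])
        index = prefix[index]
    chars.reverse()  # characters were collected last-to-first
    return bytes(chars)


if __name__ == "__main__":
    prefix, suffix = make_table()
    lz = add_pattern(prefix, suffix, ord("L"), ord("Z"))
    lz7 = add_pattern(prefix, suffix, lz, ord("7"))
    print(lz, lz7, expand(prefix, suffix, lz7))
```

Each entry costs one index and one character regardless of how long the pattern it represents is, which is why the table stays small.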

      initialize pattern table with (<empty pattern> + character) for all characters
      pattern := <empty pattern>
      while characters are still available
            character := read next character
            if (pattern + character) in pattern table then
                  pattern := (pattern + character)
            otherwise
                  add (pattern + character) to the pattern table
                  output pattern
                  pattern := character
      if pattern is not <empty pattern> then
            output pattern

The variable pattern holds the index of the corresponding pattern in the table, and "output pattern" means that the index of the current pattern is written to the output file. In the statement pattern := character, pattern is set to the index of the entry (<empty pattern> + character). Since the pattern table was initialized with exactly these patterns, this index is equal to the character itself.
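
The compression loop can be sketched in Python as follows. This is a minimal sketch, not a reference implementation: the names `lzw_compress` and `max_entries` are hypothetical, the table stores whole byte strings instead of (prefix, suffix) pairs for brevity, no clear code is sent, and the sketch simply stops adding entries once the table is full, which is only one possible policy.

```python
def lzw_compress(data, max_entries=4096):
    """Return the list of table indices that encodes the byte string `data`."""
    # The first 256 entries are the single bytes themselves.
    table = {bytes([i]): i for i in range(256)}
    pattern = b""  # the empty pattern
    output = []
    for byte in data:
        candidate = pattern + bytes([byte])
        if candidate in table:
            # Keep extending the longest known pattern.
            pattern = candidate
        else:
            # Assumption: once the table is full, no further entries are added.
            if len(table) < max_entries:
                table[candidate] = len(table)
            output.append(table[pattern])
            pattern = bytes([byte])
    if pattern:
        output.append(table[pattern])
    return output


if __name__ == "__main__":
    codes = lzw_compress(b"LZWLZ78LZ77LZCLZMWLZAP")
    # Show single bytes as characters and table references as <index>.
    print(" ".join(chr(c) if c < 256 else "<%d>" % c for c in codes))
```

Run on the string "LZWLZ78LZ77LZCLZMWLZAP", the sketch emits 16 codes, with table references only where "LZ", "LZ7" or "WL" repeat.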

Example of compression

An example with the string "LZWLZ78LZ77LZCLZMWLZAP":

The result is thus the string "LZW<256>78<259>7<256>C<256>M<258>ZAP" (read "output" from top to bottom), which carries the same information in 16 symbols instead of the original 22 characters. Counted in symbols, the example thus achieves a compression rate of about 27%.

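In practice, however, each symbol of the compressed string is encoded with 12 bits (16 symbols × 12 bits / 8 bits per byte = 24 bytes). Compared with the 22 bytes of input, the output in this example is therefore actually about 9% larger rather than smaller. Such a short input does not contain enough repetition to pay for the wider codes.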
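
The byte count for 12-bit codes can be checked by actually packing the 16 output codes. The sketch below is an illustration under assumptions: the name `pack_12bit` is hypothetical, the bits are written most significant bit first, and a trailing partial byte is padded with zero bits. Real file formats define their own bit order.

```python
def pack_12bit(codes):
    """Pack each code into a fixed 12-bit field, most significant bit first."""
    buffer = 0  # bits that have not been written out yet
    bits = 0    # number of valid bits in buffer
    out = bytearray()
    for code in codes:
        if not 0 <= code < 4096:
            raise ValueError("code does not fit into 12 bits")
        buffer = (buffer << 12) | code
        bits += 12
        while bits >= 8:
            bits -= 8
            out.append((buffer >> bits) & 0xFF)
        buffer &= (1 << bits) - 1  # keep only the unwritten bits
    if bits:
        out.append((buffer << (8 - bits)) & 0xFF)  # zero-pad the final byte
    return bytes(out)


if __name__ == "__main__":
    codes = [76, 90, 87, 256, 55, 56, 259, 55, 256, 67, 256, 77, 258, 90, 65, 80]
    packed = pack_12bit(codes)
    original = b"LZWLZ78LZ77LZCLZMWLZAP"
    print(len(original), "input bytes ->", len(packed), "packed bytes")
```

The 16 codes occupy 16 × 12 = 192 bits, that is 24 bytes, compared with 22 bytes of input.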

Decompression

Decompression algorithm

During decompression, exactly the same pattern table can be generated from the code words in sequence, because compression always output only the old pattern and never the new pattern extended by the next character. During compression, each pattern begins with the last character of the pattern previously added to the table. Conversely, during decompression the last character of the pattern that has to be added to the table is equal to the first character of the pattern that is currently being output.

A problem arises when the pattern to be output is not yet in the table. In that case the first character of this pattern cannot be looked up in the table either. This only happens, however, when a pattern occurs several times in direct succession. Then the new pattern is the previous pattern followed by the first character of the previous pattern.

      INITIALIZE pattern table WITH (<empty pattern>, character) FOR ALL characters
      last := read_first_code()
      output(pattern OF last)
      WHILE codes_available() REPEAT:
         next := read_next_code()
         IF next IN pattern table THEN:
            ADD ((pattern OF last), first_character_of(pattern OF next)) TO pattern table
         ELSE:
            ADD ((pattern OF last), first_character_of(pattern OF last)) TO pattern table
         output(pattern OF next)
         last := next

Example for decompression
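
The decompression procedure, including the special case of a code that is not yet in the table, can be sketched in Python and tried on the 16 codes from the compression example. As before, this is a sketch under assumptions: the names `lzw_decompress` and `max_entries` are hypothetical, the table stores whole byte strings, and it is assumed that the compressor stops adding entries once the table is full.

```python
def lzw_decompress(codes, max_entries=4096):
    """Rebuild the original bytes from a sequence of LZW table indices."""
    table = [bytes([i]) for i in range(256)]
    output = bytearray()
    last = None  # index of the previously decoded pattern
    for code in codes:
        if last is None:
            if code >= len(table):
                raise ValueError("invalid first LZW code")
            entry = table[code]
        else:
            previous = table[last]
            if code < len(table):
                entry = table[code]
            elif code == len(table) and len(table) < max_entries:
                # Code not yet in the table: previous pattern plus its own
                # first character.
                entry = previous + previous[:1]
            else:
                raise ValueError("invalid LZW code")
            # Assumption: mirror a compressor that stops adding when full.
            if len(table) < max_entries:
                table.append(previous + entry[:1])
        output += entry
        last = code
    return bytes(output)


if __name__ == "__main__":
    codes = [76, 90, 87, 256, 55, 56, 259, 55, 256, 67, 256, 77, 258, 90, 65, 80]
    print(lzw_decompress(codes).decode("ascii"))
```

The input "aaaa", encoded as the codes 97, 256, 97, exercises the special case: code 256 arrives before the decompressor has created entry 256.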

The symbols are read in sequence. Each symbol, together with the preceding symbol or dictionary entry, yields a new entry in the dictionary.

Reading "output" from top to bottom again yields the previously encoded string "LZWLZ78LZ77LZCLZMWLZAP".

Variants

The LZ78 algorithm works similarly, but starts with an empty dictionary.

LZC is only a slight variation of LZW. The index size, and thus the size of the dictionary, is variable: it starts at 9 bits and can grow up to a size specified by the user. Up to 7% better compression can be expected.

LZMW (Victor S. Miller, Mark N. Wegman, 1985) differs in that, instead of appending just a single character to a string in the dictionary, each string can be extended by the longest known string that can be found immediately following it in the input. This is quite practical for special data (such as a file consisting of 10,000 "a"s), but LZW copes better with general data.

Patents

Several patents were issued for LZW and similar algorithms in the U.S. and other countries. LZ78 was covered by U.S. Patent 4,464,650 of the Sperry Corporation (later merged into Unisys), filed on August 10, 1981 and granted on August 7, 1984, in which Lempel, Ziv, Cohn and Eastman are named as inventors.

Two U.S. patents were issued for the LZW algorithm: No. 4,814,746 to Victor S. Miller and Mark N. Wegman for IBM, filed June 1, 1983, and No. 4,558,302 to Welch for the Sperry Corporation, later Unisys Corporation, filed June 20, 1983.

U.S. Patent 4,558,302 caused the greatest controversy. One of the most widespread applications of LZW was the GIF image format, which became increasingly popular for websites in the 1990s. Unisys had charged license fees for the use of LZW in hardware and hardware-related software since 1987, but had otherwise permitted royalty-free use of the LZW algorithm while GIF developed alongside JFIF into a standard format. In December 1994, however, Unisys, together with CompuServe, began to demand royalties from developers of commercial software that could read and write the GIF format, and in 1999 extended this to free software. This exploitation as a software patent provoked outrage among developers and users worldwide and motivated the rapid development of PNG, a more capable graphics file format based solely on freely available code.

Many legal experts concluded that the patent does not cover devices that can decompress LZW data but cannot compress it. For this reason, the widely used program gzip can read file archives in the Z format but cannot write them.

U.S. Patent 4,558,302 expired on 20 June 2003 after 20 years. The corresponding European, Canadian and Japanese patents followed in June 2004.
