Deflate

Deflate (English let out for air ) is an algorithm for lossless data compression. It was developed by Phil Katz for the ZIP archive format and later fed to the public domain.

Description

LZSS replaces strings that occur more than once. After an entropy coding is carried out according to Huffman.

There are a number of parameters for Deflate with which one can influence execution speed and reduction rate:

Bitstream format

Deflate a data stream ( stream) consists of a sequence of blocks. Each block is preceded by a 3-bit header:

  • 00: a Literal portion which is between 0-65535 bytes long.
  • 01: is a compressed block, which uses a pre-defined static Huffman tree.
  • 10: compressed block, with its own Huffman tree.
  • 11: Reserved, currently not used.

Most blocks are sure ultimately encoded with 10, the dynamic Huffman method. This creates an individually optimized Huffman tree for each block. Instructions to build the necessary Huffman tree follow directly to the block header.

Compression is achieved in two steps:

  • Find and Replace duplicate strings with pointers.
  • Replace symbols ( characters) shorter by part, by the frequency of occurrence weighted symbols.

Eliminating duplicate strings

If within a block to be compressed a sequence of repeating bytes (corresponding to a repeating string) found, it is replaced with a backward reference. This points to a previous occurrence of the string. A hit on a previous string is of length ( 3-258 bytes) and a distance ( 1-32768 bytes). This backward reference can extend over any number of blocks, but must be within the already decompressed data, ie within the sliding window within the distance of 32 KiB.

Bit reduction

The second phase of the compression consists of the replacement of frequently used symbols ( characters) by shorter forms of representation and rarer symbols ( characters) by representational forms that require slightly more space. This method of generating a Huffman coding tree with unprefixed not overlapping intervals in which the length of each bit sequence is inversely proportional to the probability of occurrence of each symbol. The more frequently a symbol occurs, the shorter is the bit sequence can be represented to encode.

It creates a tree that can accommodate 288 symbols:

  • 0-255: represents a literal / Symbol 0-255.
  • 256: At the end of the block - stops the execution if it is the last block, otherwise processing continues with the next block.
  • 257-285: combined with extra bits a hit with 3-258 bytes.
  • 286-287: not used, reserved and invalid, but still part of the tree.

The Hits length always follows the distance. Depending on the code the extra bits may be read to determine the final distance. The distance tree consists of 32 symbols:

  • 0-3: Distance 1-4
  • 4-5: Distance 5-8, 1 extra bit
  • 6-7: Distance 9-16, 2 extra bits
  • 8-9: Distance 17-32, 3 extra bits
  • ...
  • 26-27: Distance 8193-16384, 12 extra bits
  • 28-29: Distance 16385-32768, 13 extra bits
  • 30-31: not used, reserved and invalid, but still part of the tree.

Note that for the symbols 2-29, the number of extra bits can be calculated as follows:.

Use

Deflate is used among others in the following formats and libraries:

  • In the archive format ZIP,
  • The compression program gzip,
  • The image file format PNG,
  • The image file format TIFF,
  • In OpenDocument, the ISO standard format for Office files
  • The Portable Document Format (PDF) and
  • The CAB archive format.

The free library zlib abstracts the use of Deflate for use in other programs. 500 programs use the algorithm in this way.

History

Katz implemented along the lines of the LHA published in 1982, Lempel -Ziv - Storer - Szymanski algorithm, which represented an improvement over the old Lempel- Ziv algorithm from 1977. Deflate was introduced in 1993 with version 2 of the software PKZIP by Phil Katz's company PKWare Inc.. The exact specification of Deflate and the associated bitstream format is laid down in RFC 1951 by May 1996.

225736
de