Sequitur algorithm

Sequitur is an algorithm for lossless data compression, which in the work "Identifying hierarchical structure in sequences: A linear -time algorithm" was described by Craig Nevill - Manning and Ian Witten of the University of Waikato, New Zealand in 1997.

Description

Sequitur be replaced repetitive strings to strings (eg texts) with the help of grammatical rules. This operation is performed recursively. As a result Sequitur provides a hierarchical representation of the original sequence, which gives insight into its lexical structure. There, the scope of the grammar is reduced and as a " by-product ", the sequence structure. The advantage of Sequitur is the iterative procedure.

Sequitur generates one of a plurality of possible context-free grammars for a given string. This grammar has to be the optimum for the purpose of compressing not necessarily. In addition Sequitur can be very memory intensive, so this algorithm is not so well suited for compression as other common methods.

Operation

Using the Sequitur algorithm for a string in a context-free grammar is converted. The grammar is built up gradually. But one character is read. Compete in a subsequence twice, this subsequence is replaced by a variable that is registered in a dictionary ( usually ). During the construction of the grammar not only repeatedly occurring subsequences in the original string, but can ( in whole or in part ) of variables existing sub-sequences are replaced by variables and hence redundancies are removed.

As a result, the algorithm provides a Sequitur grammar representing the input string with fewer symbols. The compression rate depends on the encoding of the result of production. For this, the arithmetic coding is used, for example.

Strings or part of strings are called " n-grams " means. A " n-gram " of Length 2 will be referred to as the " bigram " or " Digramm ".

The following mandatory rules of the algorithm generate the aforementioned hierarchical structure of the result string:

Digrammeindeutigkeit: every digram in the string to be compressed may occur at most once. Otherwise, a replacement by an (already existing or newly created ) variable.

Usefulness variables: Each variable, which replaces a substring of the original string must be used at least twice.

Generating a Sequitur grammar

The yellow shaded areas symbolize replace operations and therefore always result in a call to step 4 If a previously used variable namely by one of these operations no longer exist, the required variables usefulness must be restored.

All three color-coded points cause recursive calls of Step 3, as a replacement process always causes new digrams occur. It must therefore always be checked whether the required Digrammeindeutigkeit is still present, or may need to be restored this.

Example

The table below shows a simple example of how the Sequitur algorithm works. In the example, the input string is " abcdbcabcd " losslessly compressed using the Sequitur algorithm. It is shown step by step how to go through the input string, while the new grammar is generated.

A = bc

B is used only once

A = C = bc aad

The " abcdbcabcd " string one character is read and run. Thus, only at the beginning of the character "a " is considered. Since the string is checked at this time only one character, it can of course also be no repeats, so it can still be no replacement by a variable.

String a Dictionary ( empty)

Thereafter, the character "b " is added. In the string "ab" is still not a string occurs at least twice, so no replacement is also possible here. The same is true for the string "abc ", " abcd " and " abcDB ".

String from, abc, abcd, abcDB Dictionary ( empty)

After reading the 6th character of string is created " abcdbc ". This string contains the string " bc" occurs twice ( " abcdbc "). The digram "bc " is now replaced by the character "A". It is a new variable " A → bc " is generated and the string " abcdbc " as " aadA " saved.

String aadA Dictionary { A → bc }

Now the character "a " is read it. The result is the string " aAdAa ". This string contains no new string occurs twice.

String aAdAa Dictionary { A → bc }

After reading in the 8th character in the string " aAdAab " arises. Here, too, no new string at least twice.

String aAdAab Dictionary { A → bc }

Now the character " c" is read it. The result is the string " aAdAabc ". This string is included registered string " bc" already in the dictionary as a variable "A". You will now be replaced by the variable. The result is the new string " aAdAaA ". The digram " aA " now occurs twice and is replaced by the character "B ". It creates a dictionary entry "B → aA ". The variable " B " stands now for the string "abc". The string is stored as " BDAB ".

String BDAB Dictionary { A → bc }, { B → aA }

Finally, the last character "d" is read. The result is the string " BdABd ". This string contains the string " Bd " occurs twice. For this string the new variable C (C → Bd) is defined. The string " Bd " is replaced by the variable C. The result is the new string " CAC ". The variable " B" comes in this string no longer, and in the dictionary only once before. The condition of the utility variables (r2) is no longer fulfilled. The variable "b " is therefore deleted. The variable " C", which was previously saved as " Bd " is changed by eliminating the variable " B " in " AAD " (ie, it creates a long dictionary entry ).

String CAC Dictionary { A → bc }, { C → Aad }

The Sequitur with lossless compressed version of the string " abcdbcabcd " is therefore " CAC " with the dictionary { A → bc }, { C } → Aad.

Compressed string: CAC

Dictionary entries: A → C → bc aad

Reconstructed original string: abcdbcabcd

Analysis

In order to realize the individual replace operations quickly when a digram occurs repeatedly, the productions of the grammar are stored as linked lists. These lists are also linked with each other. This can, starting from the point of use of the variable, fast definitions of variables (ie, the dictionary entry ) are found. In addition, a digram index is managed. This enables rapid checking whether a digram already even exist. The index is stored as a hash table. Thus, it is possible to find in constant time a sought Digramm the positions of all occurrences in the productions. When a compound as described herein implementation of the Sequitur algorithm running time and required additional memory are linear, ie depending on the length of the output string. The runtime behavior therefore corresponds to O ( n ) or O according to the variable used here (z).

Recursion Run time (program lifecycle phase)

723371