Burrows–Wheeler transform

The Burrows–Wheeler transform (BWT) is an algorithm used in data compression techniques such as bzip2, although it performs no compression itself. The transform was developed by Michael Burrows and David Wheeler in 1994 at the DEC Systems Research Center (SRC) in Palo Alto and is based on an unpublished transformation devised by Wheeler in 1983.

Properties

The BWT is an algorithm that takes a block of data (the input) and generates an equally large block of data (the output) plus a small amount of additional information (an index). The output is a permutation of the input: the character frequencies of input and output are identical, but the order may differ. The output can generally be compressed better, because identical characters follow one another more often than in the input. The input can be recovered from the output and the index, that is, the transformation can be reversed.

The Burrows–Wheeler transform depends on a number of parameters:

  • The character set used. In most cases, the BWT is applied to 8-bit bytes.
  • The sort order of the characters. It is relatively unimportant, as long as the characters are totally ordered.
  • The index produced as output of the BWT, which can be 1-based or 0-based.

It is important that these parameters are the same for the forward and reverse transformations.

Forward transformation

Input:

  • The data to be encoded

Output:

  • The encoded data
  • An index

First, all rotations of the data to be encoded are produced by repeatedly moving one character from the beginning of the data to the end. All these rotations are written to a table, and the table is sorted lexicographically. The last characters of the rows, read from top to bottom, form the encoded text. The index is the number of the row in which the data to be encoded appears in the sorted table.

Principle

The effectiveness of the Burrows–Wheeler transform rests on the fact that, in any language, certain word fragments or syllables are usually preceded by only a few different letters. Sorting the table brings the rows that begin with the same fragment together. Since the last column of the table holds the character that precedes each row, the output of the algorithm collects, run by run, exactly those characters that most likely precede each fragment. In the following excerpt from such a sorted table, the character before the hyphen comes from the last column and the text after it is the beginning of the row. The rows beginning with "ie", for example, are preceded only by "D", "d", "w" and "s".

    D-ie droplet diameter within a N
    D-ie distinction of mists in certain
    d-ie availability of water vapor and
    d-ie the air can hold, without con
    w-ie mainly thermal Oberflächene
    s-ie often in the direction of the built environment
    D-ie highest frequency of fog appears dab
    w-ie in buoyant areas of the case. The wa
    s-ie but also greatly between individual
    d-ie maximum amount of water vapor that the Luf
    d-ie not poss without existing surfaces.
     -....
     -....
     -...
    l-lead egos fog density, one speaks of
    l-is increased egos frequency of fog or you
    l-egos scale areas can greatly s
    l-egos causes and is described in section Nebe
    e-egos of the case. The perceived Nebelhä
    l-Icher causes reaches the dew point. the
    e-Icher assess. The spatial Ska
    n-happens maybe too often. If mist b
    n-ot is possible. So then even before a

Sample code in Lua

    function BWT_vorwaerts(text)
      local len = string.len(text)

      -- Create a table with all rotations of the text
      local matrix = {}
      for i = 1, len do
        matrix[i] = string.sub(text, i) .. string.sub(text, 1, i - 1)
      end

      -- Sort the table
      for i = 1, len do
        for j = i + 1, len do
          if matrix[i] > matrix[j] then
            matrix[i], matrix[j] = matrix[j], matrix[i]
          end
        end
      end

      -- Take the last character of each row and note the row of the input text
      local encoded = ""
      local index = -1
      for i = 1, len do
        encoded = encoded .. string.sub(matrix[i], -1)
        if matrix[i] == text then
          index = i
        end
      end

      return encoded, index
    end

Optimization options

It is not necessary to store the entire table, because all rows contain the same text and merely start at different points. It is therefore sufficient to store the text only once. Each table row then consists only of a pointer to its starting position in the text. The larger the block of data to be transformed, the more this optimization pays off. For example, if 500 kilobytes are transformed and each row is stored separately, the table needs 500,000 rows of 500,000 bytes each, that is, 250 GB of memory. With pointers, only 500,000 bytes are needed for the text plus 4 bytes per row for the pointer (with 32-bit memory addresses), which adds up to 2.5 megabytes.
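The pointer-based variant can be sketched as follows. This is a minimal illustration, not a tuned implementation: the function name `bwt_by_offsets` is made up for this sketch, rows are represented by their 1-based start offsets, and characters are compared by byte value.

```lua
-- Sketch: forward BWT that sorts rotation start offsets instead of full rows.
-- The name bwt_by_offsets is illustrative and not taken from the article.
function bwt_by_offsets(text)
  local len = #text

  -- Byte at column k of the rotation that starts at offset `start` (1-based).
  local function byte_at(start, k)
    return string.byte(text, (start + k - 2) % len + 1)
  end

  -- One integer per row: where that rotation starts in the shared text.
  local offsets = {}
  for i = 1, len do
    offsets[i] = i
  end

  -- Compare two rotations byte by byte without building the row strings.
  table.sort(offsets, function(a, b)
    for k = 1, len do
      local x, y = byte_at(a, k), byte_at(b, k)
      if x ~= y then
        return x < y
      end
    end
    return false -- identical rotations: neither sorts before the other
  end)

  local out, index = {}, -1
  for row, start in ipairs(offsets) do
    -- The last character of a rotation is the one just before its start.
    local last = (start - 2) % len + 1
    out[row] = string.sub(text, last, last)
    if start == 1 then
      index = row -- the unrotated input text sits in this row
    end
  end
  return table.concat(out), index
end

print(bwt_by_offsets("banana")) --> nnbaaa  4
```

Only the text and one number per row are kept in memory. The price is a slower comparison, since each comparison recomputes the characters from the offsets.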

Example

Input:

  • Data to be encoded: ".ANANAS."

First, all rotations are generated and written to a table:

    1: .ANANAS.
    2: ..ANANAS
    3: S..ANANA
    4: AS..ANAN
    5: NAS..ANA
    6: ANAS..AN
    7: NANAS..A
    8: ANANAS..

Then this table is sorted:

    1: ANANAS..
    2: ANAS..AN
    3: AS..ANAN
    4: NANAS..A
    5: NAS..ANA
    6: S..ANANA
    7: .ANANAS.
    8: ..ANANAS

The last characters of the rows are written one after another and form the output text. The input text is in row 7, so the index is 7.

Output:

  • Coded data: ".NNAAA.S"
  • Index: 7

Notes:

  • The sort order is chosen here so that the period sorts after the letters.
  • In the output text, identical characters follow one another more often than in the input text.
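The example can be reproduced with a comparison function that places the period after the letters. The following is a sketch under that assumption: the helper names `key`, `row_less` and `bwt_forward_dot_last` are invented for illustration, and the input is taken to be ".ANANAS.", the text whose rotations yield the coded data ".NNAAA.S".

```lua
-- Hypothetical sort key: "." ranks after every other byte, as in the example.
local function key(c)
  if c == "." then
    return 256
  end
  return string.byte(c)
end

-- Lexicographic comparison of two equally long rows under that key.
local function row_less(a, b)
  for i = 1, #a do
    local ka, kb = key(string.sub(a, i, i)), key(string.sub(b, i, i))
    if ka ~= kb then
      return ka < kb
    end
  end
  return false
end

function bwt_forward_dot_last(text)
  -- Build all rotations, then sort them with the custom order.
  local rows = {}
  for i = 1, #text do
    rows[i] = string.sub(text, i) .. string.sub(text, 1, i - 1)
  end
  table.sort(rows, row_less)

  -- Collect the last column and find the row holding the input text.
  local out, index = {}, -1
  for i, row in ipairs(rows) do
    out[i] = string.sub(row, -1)
    if row == text then
      index = i
    end
  end
  return table.concat(out), index
end

print(bwt_forward_dot_last(".ANANAS.")) --> .NNAAA.S  7
```

With plain byte order the period would sort before the letters, and both the output text and the index would differ. This is why the forward and inverse transformations must agree on the sort order.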

Inverse transformation

Input:

  • Encoded text
  • An index

Output:

  • Decoded text

There are several methods for the inverse transformation, which are explained below. Given only the encoded text, one can determine the cyclic order of the characters in the original text. That still leaves as many possibilities as the text has characters. To make the reversal unambiguous, a number is also needed that indicates where in the encoded text the original text begins. This number can be calculated from the index.

Original method

To transform the text back, the encoded text is sorted with a stable sorting method, and during sorting it is recorded which character ends up at which position. This yields a mapping between the coded position (in the last column of the coding table) and the sorted position (in the first column of the coding table). This mapping is a permutation with a single cycle: starting at any position and following the permutation, one reaches every element exactly once and then arrives back at the starting position. The inverse transformation does exactly this, because the walk passes all characters of the text in the order in which they stood before the forward transformation. Starting at the index ensures that the characters come out in exactly the original order, beginning with the first character of the original text.

Sample code in Lua

    function BWT_rueckwaerts(text, index)
      local len = string.len(text)

      -- Store the characters together with their positions in a table
      local entries = {}
      for i = 1, len do
        entries[i] = { position = i, char = string.sub(text, i, i) }
      end

      -- Sort this table by character. It is important
      -- to use a stable sorting method here (bubble sort is stable).
      for i = 1, len - 1 do
        for j = 1, len - 1 do
          if entries[j].char > entries[j + 1].char then
            entries[j], entries[j + 1] = entries[j + 1], entries[j]
          end
        end
      end

      -- Starting at the index, one walk through the table
      -- collects all characters.
      local decoded = ""
      local idx = index
      for i = 1, len do
        decoded = decoded .. entries[idx].char
        idx = entries[idx].position
      end

      return decoded
    end

Example

The data (text: "a!iepdWkii", index: 2) is to be transformed back. The sort order is: exclamation mark, capital letters, lowercase letters (as in ASCII).

The data is sorted with a stable sorting method, and during sorting it is recorded which character ends up at which position. In the table below, the number after each sorted character is the position it came from.

    Position:     1    2    3    4    5    6    7    8    9    10
    Coded text:   a    !    i    e    p    d    W    k    i    i
    Sorted:       !2   W7   a1   d6   e4   i3   i9   i10  k8   p5

Example: in the coded text, the capital "W" is at position 7. After sorting it is at position 2, together with the information that it comes from position 7. The stable sorting method is important so that the order of the three "i" characters is not disturbed. In row 2 they are at positions 3, 9 and 10, and they appear in exactly this order in row 3. Starting at index 2 and following the recorded positions yields "Wikipedia!".
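The walk through this example can be checked with a short sketch. It assumes the coded text is "a!iepdWkii" with index 2 and uses a made-up function name, `bwt_inverse`. Because Lua's `table.sort` is not stable, the sketch uses the original position as a tie-breaker, which gives the same order as a stable sort by character.

```lua
-- Sketch of the original inverse method. table.sort is not stable, so the
-- original position is used as a tie-breaker to emulate a stable sort.
function bwt_inverse(text, index)
  -- Remember for every character the position it had in the coded text.
  local entries = {}
  for i = 1, #text do
    entries[i] = { position = i, byte = string.byte(text, i) }
  end

  table.sort(entries, function(a, b)
    if a.byte ~= b.byte then
      return a.byte < b.byte
    end
    return a.position < b.position
  end)

  -- Walk the permutation once, starting at the index.
  local out, idx = {}, index
  for i = 1, #text do
    out[i] = string.char(entries[idx].byte)
    idx = entries[idx].position
  end
  return table.concat(out)
end

print(bwt_inverse("a!iepdWkii", 2)) --> Wikipedia!
```

Comparing byte values rather than strings keeps the order independent of the current locale, which matches the ASCII order stated for this example.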

Detailed example

The text ".NNAAA.S" is to be transformed back. In the sorted table, the original text was in row 7.

From this data, the complete sorted matrix can be reconstructed step by step. Once that is done, the original text is found in row 7.

The first column of the matrix can be reconstructed easily, because it contains all the characters of the text, simply sorted:

    1: A______.
    2: A______N
    3: A______N
    4: N______A
    5: N______A
    6: S______A
    7: .______.
    8: .______S

Keeping in mind that every row contains the same text, merely rotated, the matrix can be completed step by step. For a better overview, the last column can be written once more in front of the first column.

       8 │ 12345678
      ───┼─────────
    1: . │ A______.
    2: N │ A______N
    3: N │ A______N
    4: A │ N______A
    5: A │ N______A
    6: A │ S______A
    7: . │ .?_____.
    8: S │ .?_____S

We now see that a "." is followed either by an "A" (row 1) or by another "." (row 7). These two characters must therefore stand in rows 7 and 8 at position 2 (marked with a question mark). Since, as already mentioned, the matrix is sorted and the first character in these two rows is the same, the "A" must be in row 7 and the "." in row 8. The same procedure applies to the other characters: collect all the characters that follow a given character, sort them, and enter them from top to bottom into the rows that begin with this character. This gives the second column.

  • "A" is followed by "N", "N" and "S", which go into rows 1, 2 and 3
  • "N" is followed by "A" and "A", which go into rows 4 and 5
  • "S" is followed by ".", which goes into row 6
  • "." is followed by "A" and ".", which go into rows 7 and 8

       8 │ 12345678
      ───┼─────────
    1: . │ AN_____.
    2: N │ AN_____N
    3: N │ AS_____N
    4: A │ NA_____A
    5: A │ NA_____A
    6: A │ S._____A
    7: . │ .A_____.
    8: S │ .._____S

For the other columns, this method no longer works. (Example: "A" is followed by "N", "N" and "S". If these are entered from top to bottom into the rows whose filled-in part ends in "A", row 7 would read ".AS____.", which cannot be right.) By looking closely, however, one can see that the string "S.." occurs somewhere in the text (row 8). There is only one way to continue this string, namely "..A" (row 7). Then comes ".AN" (row 1), and now the next problem arises with "AN": either row 4 or row 5 could follow, because both contain the characters "ANA". There is, however, a solution to this problem.

If, while reconstructing column 2, one records from which row each character was taken and in which row it was used, the following table is obtained:

    Used in row:     1  2  3  4  5  6  7  8
    Taken from row:  4  5  6  2  3  8  1  7

This table agrees remarkably well with the successors that were found by looking closely. Indeed, the successor of row 8 is row 7 and the successor of row 7 is row 1, exactly as the table says. The table also resolves the ambiguity. The correct order of the rows can now be read off from the table: start with 7 (the number of the row in which the original text stood) in the top row and write down the number below it. Then look up that number in the top row, and so on. This gives the sequence 1, 4, 2, 5, 3, 6, 8, 7.

In the last step, we look up the character in the last column of each of these rows. In row 1 this is a ".", in row 4 an "A", in row 2 an "N", and so on. Concatenating all these characters gives the text ".ANANAS.".
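The bookkeeping of this detailed example can be reproduced with a few lines of Lua. The sketch below assumes the coded text ".NNAAA.S", index 7 and a sort order with the period after the letters. The names `key`, `source_rows` and `walk` are invented for illustration.

```lua
-- Hypothetical sort key: "." ranks after the letters, as in the example.
local function key(c)
  if c == "." then
    return 256
  end
  return string.byte(c)
end

-- For each row of the sorted table, find the row of the coded text from
-- which its first character is taken (a stable sort of the last column).
function source_rows(coded)
  local from = {}
  for i = 1, #coded do
    from[i] = i
  end
  table.sort(from, function(a, b)
    local ka = key(string.sub(coded, a, a))
    local kb = key(string.sub(coded, b, b))
    if ka ~= kb then
      return ka < kb
    end
    return a < b -- tie-break on position keeps the sort stable
  end)
  return from
end

-- Follow the table starting at `index`, collecting the visited row numbers
-- and the last-column character of every visited row.
function walk(coded, index)
  local from = source_rows(coded)
  local rows, chars = {}, {}
  local row = index
  for i = 1, #coded do
    row = from[row]
    rows[i] = row
    chars[i] = string.sub(coded, row, row)
  end
  return table.concat(rows, ", "), table.concat(chars)
end

print(table.concat(source_rows(".NNAAA.S"), " ")) --> 4 5 6 2 3 8 1 7
print(walk(".NNAAA.S", 7)) --> 1, 4, 2, 5, 3, 6, 8, 7  .ANANAS.
```

The first line of output lists, for rows 1 to 8, the row from which each second-column character is taken. The second line shows the order in which the rows are visited and the text read from their last column.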

Alternative method

This method is computationally more expensive than the original method and is therefore better suited for demonstration than for implementation in computer programs. It is based on the idea of starting from the encoded text, which stood in the last column of the table during the forward transformation, and reconstructing all other columns step by step. Once this is done, the decoded text can be read from the row specified by the index.

Explanation

The table used in the forward transformation has an important property that is exploited in this inverse transformation (and in every other): the first column contains the same characters with the same frequencies as the last column, only sorted. That is, if the first column is not known but any other column is, the first column can be constructed from it.

After the last column has been filled in (step 1.1), the completed part of the table corresponds to the table that was used for the forward transformation. Rotating and then sorting (steps 1.2 and 1.3) fills in the first columns of the table correctly, because the table was also sorted for the forward transformation. Rotating (step 1.2) frees the last column again, so that it can be refilled with data that matches the front columns already completed.

Sample code in Lua

    function BWT_rueckwaerts(text, index)
      local len = string.len(text)

      -- At the beginning, the table is empty
      local rows = {}
      for i = 1, len do
        rows[i] = ""
      end

      for n = 1, len do
        -- Put the coded text, character by character, in front of the first column
        for i = 1, len do
          rows[i] = string.sub(text, i, i) .. rows[i]
        end

        -- Sort the table
        for i = 1, len do
          for j = i + 1, len do
            if rows[i] > rows[j] then
              rows[i], rows[j] = rows[j], rows[i]
            end
          end
        end
      end

      return rows[index]
    end

See also

  • Move-to-front, a coding method that is often used after the Burrows–Wheeler transform.