Suffixarray

In computer science, a suffix array is an array that specifies the suffixes of a string into lexicographical order.

Details

Consider the string " abracadabra " the length 11 He has eleven suffixes: " abracadabra ", " bracadabra ", " racadabra ", and so on up to "a". Sorted in lexicographical order, there are the following suffixes:

A abra abracadabra acadabra adabra bra bracadabra cadabra dabra ra racadabra Its first character: ( Indexing can start with zero or with one, here it starts with one note.) Be specified if the original string is available, every suffix by the index. The suffix array is an array of indices in lexicographical order. For the string " abracadabra " is the suffix array } { 11,8,1,4,6,9,2,5,7,10,3 because "a" begins with the eleventh sign, " abra " the eighth letter and so further.

Strictly speaking, the string " abracadabra " a twelfth suffix: the empty string (with index 12). But since the empty string is lexicographically sorted before other, is lost by neglecting this suffix no information.

Algorithms

The easiest way to design a suffix array is to use an effective comparison-based sorting algorithm. This requires Suffixvergleiche, but a Suffixvergleich takes time, so that the total run time of this method is. Improve on the algorithms developed by utilizing the results of the comparing part so as to avoid redundant comparisons. Some algorithms are known to do with a low memory requirement of small constants.

Applications

The suffix array of a string can be used as an index to quickly locate any substring within the string. To find every occurrence of a substring is equivalent to finding all suffixes that begin with that substring. Thanks to the lexicographical ordering these suffixes are grouped together in the suffix array, and can be efficiently found using a binary search. If it is implemented naively, requires the binary search time, the length of the substring is. To avoid repeated comparisons, additional data structures, the information on the longest common prefix ( longest common prefix engl.: in short LCP) are suffixes deliver constructed. Thus, a seek time of can be achieved.

Suffixsortierende algorithms can be used to perform the Burrows-Wheeler Transform ( BWT). At BWT but its cyclic permutations are sorted instead of the suffixes of a string. This can be rectified by a special character that is not found in the string and is the lexicographically smallest character is appended to the end of the string. Sort Cyclic permutations is then equivalent to sorting suffixes.

History

Suffixarrays were originally developed in 1990 by Gene Myers and Udi Manber, to reduce the memory consumption compared to Suffixbäumen. Approximately a decade later, began the development of compressed Suffixarrays and BWT -based compressed full-text indexes.

753811
de