Suffix tree

A suffix tree is a versatile data structure that enables efficient solutions of many problems in the area of ​​string processing. A suffix tree stores all suffixes ( endings) of a string. It can be constructed in linear time with linear memory consumption and allows, once established, the fast performing many operations such as searching for words or phrases in longer texts.

Definition

A suffix tree T for a string S with m symbols is a directed tree with m leaves. Each edge is labeled with a substring of S. Each internal node of T has at least two children, never begin their edge labels with the same symbol. For each leaf i in T the labels of the edges found on the path from the root to i concatenated the suffix of S that starts at index i. Thus T contains all the suffixes of S, where multiple occurrences of substrings are included only once in T ( see figure).

Construction

The tree is at the beginning of a root node and a list of ( pointers to ) all continuable points in the tree, at the beginning of that is simply a pointer to the root node. The tree is built for a gradually elongating word. For all nodes from the list of continuable Make an edge is drawn to a new node. This edge is labeled with the last appended to the word letters. This new node is then added to a list of continuable points for the next iteration. Finally, it is always the start node to be included in the new list (because the empty word is always a valid suffix). If there is a continuable place already an edge whose label corresponds to the last suffix letter, no node is added, and instead added the already existing target node in the list of resumable sites.

By the addition of the start node in each step, the list of locations continuable after n iterations and n 1 elements is long, resulting in a square term. There exists an algorithm with linear running time, which is based on the suffix array.

Applications

Thus, a suffix tree enables the efficient solution of many problems in the area of ​​string processing. The construction of the tree is possible to linearly growing term and linearly increasing memory requirements. A suffix tree allows approximately in the range of exact string comparison in a text of length m after preprocessing in O (m) the search of patterns of length k n with a runtime complexity of O ( k n). In the area of ​​soft string comparison allows a suffix tree efficient methods, such as when matching with wildcards or k- mismatch (see Gusfield 1999:199 ff ) or in the form of acceleration determining the Levenshtein distance on hybrid dynamic programming (see Gusfield 1999:279 ff. ).

For a detailed presentation and more than 20 other applications of Suffixbäumen see Gusfield ( 1999:89 ff.)

History

  • Morrison ( 1968): Patricia trie.
  • Weiner (1973 ): First algorithm for constructing Suffixbäumen with linearly growing term.
  • McCreight (1976 ): Further development of higher space efficiency.
  • Ukkonen (1995 ): Conceptually new, simpler algorithm with running time and space complexity O (n), also enables on-line construction of the tree.
  • Giegerich & Kurtz (1995 ): Accelerated algorithm by means of functional language and lazy evaluation: structure during the search, only relevant parts of the tree are created if not already present.
753808
de