String searching algorithm

In computer science, string matching algorithms are a set of algorithms that describe the retrieval of text segments in a string ( engl. string) based on a predetermined search pattern. They thus belong to the class of string algorithms.

These algorithms search in the strict sense for exact matches (English matches). In a broader sense, algorithms are meant that allow approximate matches, the term needs to be about exactly defined by a tolerance criterion.

Exact search

Problem

Basically, two situations must be distinguished:

In the following, however, deals only with the first situation.

Solution methods

Naive algorithm

The simplest algorithm is to move a so-called search window of the length of the search on the text. The symbols of the mask are compared with those of the underlying text in any position of the search. When a non- matching symbol is found, the window is shifted one position, and a comparison is once again; if all symbols in the window match, the search is found. The algorithm ends when all of the text has been scanned by the window.

This algorithm has a running time of order, if m is the length of the search, and n is the length of the text.

Pseudocode:

Input: string T = T1 ... Tn and P = P1 ... Pm Output: q the points at which P occurs in T for q = 0 to n - m do if P = T [q 1] and P = T [q 2] and ... and P [m ] = T [ q m ] then write q Surprisingly, the naive approach in practice is very fast, since errors in natural language texts appear after 1 to 2 characters. For the English language there is a probability of 1:07 mark. Thus, the naive approach is nearly linear fast.

This is also evident when one looks at the worst case itself. It reads

Text: aaa aab ... Pattern: from Such cases are extremely unlikely in natural language texts.

Finite automaton

In the string - matching algorithm by means of finite automata is an automaton with states, with with, created as the length of the pattern. This represents the number of matching characters in the pattern string from the outset considered dar. For simplicity, it is the prefix of the pattern to the letter placed. The transition function is now showing a condition in which the maximum number of letters is that which is a prefix of the word. So.

Python implementation

The Knuth - Morris -Pratt algorithm

The Knuth - Morris -Pratt algorithm is based on the naive search algorithm. The main difference is that the window of comparison is not always advanced by only one position, but possibly by more than one position.

For this purpose, the search must be analyzed at the beginning, so that it is known at each partial match, as the first k symbols, whether the beginning of the search to the end of the last matching subform matches. The displacement of the search carried out according to the overlapping match; additional advantage is that the symbols already being compared need not be again compared.

Search the suffix tree

Especially when the text to be searched is known in advance, and should be sought in this later after many different patterns, the construction of a suffix tree offers. This construction can take place in. Then, each pattern can without re- preparation of the text be sought in: If it exists, you can reach the corresponding node of the source of the suffix tree, otherwise the search fails (there is no such node exists).

Survey

Where m is the length of the search, and n is the length of the text.

Other algorithms

Boyer- Moore algorithm:
Skip -Search algorithm
Baeza -Yates - Gonnet algorithm ( "shift- or")
Soundex
Cologne Phonetic

Pattern matching search

The search for patterns is somewhere between fuzzy and exact search, since the user must explicitly specify the leeway it allows for certain character classes at certain string positions.

In the fuzzy search usually decides the algorithm by specifying a quality or distance criterion, how great must go the deviation of hits.

Finite-state machine Boyer–Moore string search algorithm Bitap algorithm Phonetic algorithm Digital Object Identifier

752172