Knuth–Morris–Pratt algorithm

The Knuth–Morris–Pratt algorithm, named after Donald Ervin Knuth, James Hiram Morris and Vaughan Ronald Pratt, is a string-matching algorithm. Its asymptotic running time is linear in the length of the pattern (also called the search word) being searched for plus the length of the input text.


Description

The Knuth–Morris–Pratt algorithm builds on the naive search algorithm. The main difference is that the comparison window is not always advanced by just one position but, where possible, by more than one.

To this end, the pattern must be analyzed at the outset for substrings that are, as far as possible, prefixes of the pattern itself. The algorithm then builds a table that contains, for each prefix of length j of the pattern, the length of its proper border, that is, the maximum number of characters at the start of the prefix that coincide with the characters at its end, the prefix as a whole being excluded. By definition, the border length of the prefix of length zero is -1. This will prove helpful later during the search.
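
The definition of the border length can be checked directly with a brute-force sketch. The function names `border_length` and `border_table` are illustrative choices, not part of the algorithm's original description, and the pattern "ababcabab" is used only as sample input. This version is deliberately naive and is not the linear-time construction.

```python
def border_length(prefix: str) -> int:
    """Length of the longest proper border of `prefix` (-1 for the empty string)."""
    if prefix == "":
        return -1  # convention: the prefix of length zero has border length -1
    # Try candidate lengths from longest to shortest; a proper border must be
    # strictly shorter than the string itself.
    for k in range(len(prefix) - 1, 0, -1):
        if prefix[:k] == prefix[-k:]:
            return k
    return 0  # only the empty border exists


def border_table(pattern: str) -> list[int]:
    """Border length of every prefix of `pattern`, for lengths 0..len(pattern)."""
    return [border_length(pattern[:j]) for j in range(len(pattern) + 1)]


print(border_table("ababcabab"))  # [-1, 0, 0, 1, 2, 0, 1, 2, 3, 4]
```

The table has one entry more than the pattern has characters, because the empty prefix is included.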

As an illustration: the prefix "abab" has the border "ab" and therefore the border length 2, whereas the prefix "ababc" has only the empty border of length 0.

During the search, the algorithm at first proceeds as in the naive search: starting at position 0, it compares the characters of the text and the pattern until two characters differ or the whole pattern has been matched. If the pattern has been matched, an occurrence has been found. If a mismatch occurs before a full match, the pattern is shifted to the right by the difference between the number of matching characters and the border length of the matched prefix. Here the definition of the border length of the prefix of length zero comes into play: the difference between 0 and -1 is 1, so on a mismatch at the very first character the pattern is shifted one position to the right. The comparison then resumes in the pattern at the position given by the border length, or at position zero if the border length is less than zero.

For each partial match, say of the first k symbols, it is therefore known that the beginning of the pattern coincides with the end of the matched portion. The pattern is shifted according to this overlap; an additional advantage is that symbols that have already been compared need not be compared again.

The information obtained in this way is used during the search to avoid repeated comparisons.

Consequently, the algorithm is divided into two phases: the prefix analysis, which builds the table, and the search itself.
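
The shift rule can be written down as a small sketch. The helper name `shift_after_mismatch` and the constant `BORDER` are hypothetical; the table is assumed to have been computed beforehand for the sample pattern "ababcabab".

```python
def shift_after_mismatch(window_start: int, matched: int, border: list[int]) -> tuple[int, int]:
    """Return (new window start, pattern index at which comparison resumes)."""
    b = border[matched]                       # border length of the matched prefix
    new_start = window_start + (matched - b)  # shift by matched - border length
    resume_at = max(b, 0)                     # -1 (empty prefix) means: restart at 0
    return new_start, resume_at


# Assumed precomputed border lengths for the prefixes of "ababcabab" (lengths 0..9).
BORDER = [-1, 0, 0, 1, 2, 0, 1, 2, 3, 4]

print(shift_after_mismatch(0, 4, BORDER))  # (2, 2)
print(shift_after_mismatch(2, 5, BORDER))  # (7, 0)
print(shift_after_mismatch(7, 0, BORDER))  # (8, 0)
```

In the last call no character matched, so the difference 0 - (-1) = 1 yields the shift by a single position.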

Search

As an example, we search the text "abababcbababcababcab ..." for the pattern "ababcabab".

As with the naive algorithm, the pattern is written left-aligned below the text, and the pairs of characters are compared from left to right until pattern and text no longer agree (a mismatch occurs).

The first mismatch is found at position 4 in the text. Looking up the precomputed table of border lengths at the entry "prefix (0 to 3)", we find the length 2. The pattern is therefore shifted 2 characters to the right, so that its beginning overlaps with the matching suffix (i.e. the "second part" of the border); the search also resumes immediately after the border, since we already know that the two parts agree (this is the great strength of the KMP algorithm).

The algorithm thus continues the comparisons at position 4, exactly where the mismatch was found. In particular, the border "ab" is not checked again.

The next mismatch occurs at position 7 in the text, and thus at position 5 in the pattern. We consult our table again at "prefix (0 to 4)"; it states that no characters here form a border (length zero). We can therefore be sure that there are no matches that we would have found up to position 7 by naively shifting one character to the right at a time. The pattern can thus be moved directly to position 7 in the text; the new position is the current position plus (number of matching characters - border length of the prefix), i.e. 2 + (5 - 0) = 7.

The algorithm then continues the comparisons at position 7.

Sometimes, as here, a mismatch occurs at the very first character of the pattern. In this case the pattern is shifted one position to the right: current position plus (number of matching characters - border length of the prefix), i.e. 7 + (0 - (-1)) = 8, and the algorithm continues the comparisons at the next position in the text (8).

Once all characters of the pattern have been found in the text, a hit is reported.

Since the four most recently checked characters "abab" at positions 13 to 16 form a border of length 4, the pattern is moved to position 13. The comparison continues at position 17 (= 13 + 4).

The algorithm ends when the pattern cannot be shifted further to the right without extending beyond the end of the text, or when the end of the text has been reached.
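
The walk-through above can be replayed with a short sketch that records every alignment of the pattern. It is written in the "shifting window" style of the description rather than in the index-based style of an implementation; the names `border_table` and `trace_search` are illustrative, and the text is assumed to end after the 20 characters shown (the "..." is dropped).

```python
def border_table(p: str) -> list[int]:
    """Brute-force border lengths for the prefixes of p (lengths 0..len(p))."""
    return [-1] + [
        max(k for k in range(j) if p[:k] == p[j - k:j])
        for j in range(1, len(p) + 1)
    ]


def trace_search(text: str, pattern: str):
    """Yield (window start, matched characters, outcome) for every alignment."""
    border = border_table(pattern)
    n, m = len(pattern), len(text)
    start, j = 0, 0  # window start in the text, characters already known to match
    while start + n <= m:
        while j < n and text[start + j] == pattern[j]:
            j += 1
        yield start, j, "hit" if j == n else "mismatch"
        b = border[j]
        start += j - b  # shift by matched characters minus border length
        j = max(b, 0)   # the border is known to match and is not rechecked


for step in trace_search("abababcbababcababcab", "ababcabab"):
    print(step)
# (0, 4, 'mismatch')
# (2, 5, 'mismatch')
# (7, 0, 'mismatch')
# (8, 9, 'hit')
```

After the hit at position 8 the pattern would move to position 13, but in this truncated text it would then extend beyond the end, so the trace stops.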

Observations

Prefix analysis

Before the actual search, the pattern is analyzed, and the longest borders occurring in it are determined.

To do this, -1 is written under the first character of the pattern, and under each further character the number of immediately preceding characters that form a prefix of the pattern without themselves starting at the beginning of the pattern.

As an example, we again use the pattern "ababcabab". The following prefix table results; note that the last row of the table is one field longer than the pattern.

    Position:   0   1   2   3   4   5   6   7   8   9
    Pattern:    a   b   a   b   c   a   b   a   b
    Value:     -1   0   0   1   2   0   1   2   3   4

Implementation

The algorithm is given here in pseudocode.

The inputs are

  • A text t of length m, and
  • A pattern w of length n

For each occurrence of the pattern w in the text t, the starting position of the occurrence in the text is to be output.

We treat pattern and text as arrays that are indexed starting with zero. So, for example, w[0] is the first character of the pattern, and t[8] is the ninth character of the text. It is common practice to start numbering at 0.

Prefix analysis

First, the prefix analysis is performed. It creates the prefix table discussed above, but only its last row, as an array N that contains, for each character of the pattern, the length of the immediately preceding prefix.

Input is

  • A pattern w of length n

Output is

  • The array N of length n + 1.
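
For readers who prefer runnable code to pseudocode, the prefix analysis might be sketched in Python as follows. The function name `prefix_table` is an illustrative choice; the variable names i, j, w and N follow the description above.

```python
def prefix_table(w: str) -> list[int]:
    """Build the array N of length n + 1 for the pattern w."""
    n = len(w)
    N = [0] * (n + 1)
    i = 0    # current position in the pattern
    j = -1   # length of the border found so far
    N[0] = j  # the first value is always -1
    while i < n:
        # If the border found so far cannot be extended by w[i],
        # fall back to the next shorter border.
        while j >= 0 and w[j] != w[i]:
            j = N[j]
        # Here j == -1 or w[i] == w[j].
        i += 1
        j += 1
        N[i] = j  # value found (at least 0)
    return N


print(prefix_table("ababcabab"))  # [-1, 0, 0, 1, 2, 0, 1, 2, 3, 4]
```

The inner loop reuses entries of N that were computed earlier, which is what keeps the construction linear in the length of the pattern.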

    i := 0            // variable i always points to the current position
    j := -1           // in the pattern; variable j holds the length of the
                      // border found so far.

    N[i] := j         // the first value is always -1

    while i < n       // as long as the end of the pattern is not reached
    |
    |   while j >= 0 and w[j] != w[i]   // if a border that was found
    |   |                               // cannot be extended,
    |   |   j := N[j]                   // look for a shorter one.
    |   |
    |   end
    |
    |   // at this point j = -1 or w[i] = w[j]
    |
    |   i := i + 1    // under the next character in the pattern,
    |   j := j + 1    // enter the value found (at least 0)
    |   N[i] := j     // in the prefix table.
    |
    end

Search

The second phase is the search. Since the pattern is of course not actually written beneath the text and shifted, two variables i and j are used. Here, i is the current position in the text and j is the current position in the pattern. The meaning is that position j of the pattern always lies "under" position i of the text.

The inputs are

  • The pattern w of length n,
  • The array N of length n + 1 from the first phase, and
  • A text t of length m.

The output consists of

  • All positions where w occurs in t.
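
A Python sketch of the search phase, under the assumption that the array N has already been computed in the first phase and is passed in, could look like this. The function name `kmp_search` and the decision to reject an empty pattern are choices made for this example only.

```python
def kmp_search(t: str, w: str, N: list[int]) -> list[int]:
    """Return all positions at which the pattern w occurs in the text t.

    N is the prefix table of w (length len(w) + 1), assumed precomputed.
    """
    n, m = len(w), len(t)
    if n == 0:
        raise ValueError("pattern must not be empty")  # assumption of this sketch
    hits = []
    i = 0  # current position in the text
    j = 0  # current position in the pattern
    while i < m:
        # Shift the pattern, using N, until text and pattern agree at (i, j)
        # or the pattern has been moved entirely past position i (j == -1).
        while j >= 0 and t[i] != w[j]:
            j = N[j]
        i += 1
        j += 1
        if j == n:              # end of the pattern reached: report a hit,
            hits.append(i - n)  # which began n characters earlier
            j = N[j]            # shift the pattern
    return hits


N = [-1, 0, 0, 1, 2, 0, 1, 2, 3, 4]  # prefix table of "ababcabab"
print(kmp_search("abababcbababcababcab", "ababcabab", N))  # [8]
```

Because j is reset to N[j] after a hit instead of to 0, overlapping occurrences are reported as well.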

    i := 0            // variable i always points to the current position in the text,
    j := 0            // variable j to the current position in the pattern.

    while i < m       // as long as the end of the text is not reached
    |
    |   while j >= 0 and t[i] != w[j]   // shift the pattern until
    |   |                               // text and pattern agree at
    |   |   j := N[j]                   // positions i, j; array N is
    |   |                               // used for this.
    |   end
    |
    |   i := i + 1    // advance by one position each
    |   j := j + 1    // in the text and in the pattern.
    |
    |   if j == n then                  // if the end of the pattern is reached,
    |   |                               // report a hit. It began
    |   |   print i - n                 // n characters earlier.
    |   |
    |   |   j := N[j]                   // shift the pattern.
    |   |
    |   end if
    |
    end

Runtime analysis

The running time is, as usual, given in Landau notation.

Running time of the prefix analysis

The outer while loop is executed at most n times, because initially i = 0 and i is incremented by 1 in each iteration.

In each iteration of the inner while loop, j is set to a previously computed, smaller value N[j]. This loop can therefore be executed at most as often as j is increased. Since j is increased only in the outer loop, the inner loop is executed at most n times in total.

However, the whole pattern must also be traversed. The running time of the prefix analysis is therefore in $\Theta(n)$.

Running time of the search

The outer while loop is executed at most m times, because initially i = 0 and i is incremented by 1 in each iteration.

In analogy to the prefix analysis, the inner while loop is executed at most m times in total.

Since the entire text is traversed here, the running time is in $\Theta(m)$.

Together

Since prefix analysis and search are carried out one after the other, the running time of the entire algorithm is in $\Theta(n + m)$. Overall, at most $2m$ comparisons between characters of the pattern and the text are performed during the search.

Thus the algorithm of Knuth, Morris and Pratt can guarantee a better worst-case running time than the algorithm of Boyer and Moore.

However, whereas Boyer–Moore can under certain circumstances carry out a search with fewer than linearly many comparisons, Knuth–Morris–Pratt always requires linearly many comparisons.
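
The counting argument can be checked empirically with an instrumented sketch that tallies the character comparisons of both phases. The function name `kmp_comparison_counts` is hypothetical, a non-empty pattern is assumed, and the bounds asserted in the comments (at most 2n and 2m comparisons) are the ones that follow from the loop analysis above.

```python
def kmp_comparison_counts(t: str, w: str) -> tuple[int, int]:
    """Return (comparisons in the prefix analysis, comparisons in the search).

    Assumes a non-empty pattern w.
    """
    n, m = len(w), len(t)

    # Phase 1: prefix analysis, counting pattern/pattern comparisons.
    N = [-1] * (n + 1)
    i, j, prefix_cmp = 0, -1, 0
    while i < n:
        while j >= 0:
            prefix_cmp += 1
            if w[j] == w[i]:
                break
            j = N[j]
        i += 1
        j += 1
        N[i] = j

    # Phase 2: search, counting pattern/text comparisons.
    i, j, search_cmp = 0, 0, 0
    while i < m:
        while j >= 0:
            search_cmp += 1
            if t[i] == w[j]:
                break
            j = N[j]
        i += 1
        j += 1
        if j == n:
            j = N[j]
    return prefix_cmp, search_cmp


text, pattern = "abababcbababcababcab", "ababcabab"
p_cmp, s_cmp = kmp_comparison_counts(text, pattern)
print(p_cmp, s_cmp)  # 9 22
assert p_cmp <= 2 * len(pattern) and s_cmp <= 2 * len(text)
```

Each outer iteration ends with at most one successful comparison, and every failed comparison lowers j, which is raised by only one per iteration; this is why neither counter can exceed twice the respective length.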
