Boyer–Moore string search algorithm

The Boyer -Moore algorithm is a string matching algorithm. The algorithm is used in order to find in a text T is a specific part of text ( pattern M ) and was developed in 1977 by Robert S. Boyer and J Strother Moore.

  • 3.1 bath -character heuristic

Algorithm

The pattern is left -justified at the top under the text and then compared from right to left character by character with the text. When a mismatch occurs, calculate two heuristics how far the pattern can be shifted to the right.

It happens that the two heuristics charge different shifts. The algorithm always chooses the maximum of two proposals to move the pattern to the right.

Examples

Bad -character heuristic

String: Hoola - Hoola girls like hooligans

Search for: Hooligan

Hoola - Hoola girls like hooligans Hooligan The last character does not agree with the last of the search pattern, so you can search pattern to the first " o" ( the search string, read from the rear) move:

Hoola - Hoola girls like hooligans Hooligan       Hooligan         Hooligan The "r" in the string is not mentioned at all in the pattern. This allows you to move around the entire length of the search string, there is absolutely excluded here that the pattern matcht in one place.

Hoola - Hoola girls like hooligans Hooligan       Hooligan         Hooligan                 Hooligan                         Hooligan Pattern was found.

The Boyer- Moore algorithm works most efficiently when it encounters a character that does not appear in the search pattern. The bad -character rule comes into play. This is most likely at a relatively small sample and a large alphabet, which makes it especially suitable for such a case. In this case, the algorithm works with an efficiency of compare.

Good suffix heuristics

String: reinesupersauersupesupersupe

Please search supersupe

Reinesupersauersupesupersupe supersupe Only the last 4 letters are the same ( " supe "). This suffix occurs in the pattern before the very beginning, so you can move the pattern to get there:

Reinesupersauersupesupersupe       supersupe Only the last letter "e" is the same. We can use the pattern until the next occurrence of "e" move in supersupe:

Reinesupersauersupesupersupe            supersupe Only the last letter " ersupe " match, which occur nowhere else in the pattern more. However, the suffix " supe " occurs both at the end of " ersupe " and at the beginning of the pattern.

Reinesupersauersupesupersupe                 supersupe "E" and "r" are not the same, we move up one position. This occurs several times in succession:

Reinesupersauersupesupersupe                  supersupe reinesupersauersupesupersupe                   supersupe reinesupersauersupesupersupe                    supersupe reinesupersauersupesupersupe                     supersupe Pattern was found. Together with the "bad -character heuristic " could last 3 iterations are skipped because we can move to the next "r" in the pattern.

Term

In the following we use the Landau notation to specify the asymptotic behavior of the term. Addiction the Boyer- Moore algorithm only he has a worst-case complexity of the first experience any of the pattern. If, however, he used to find all matches of the pattern, the worst-case complexity is. This complexity can, however, by an additional rule for the case that in the last step the sample was found to be reduced again. If one uses the algorithm but on a relatively small sample and a large alphabet, you get an average-case complexity of.

Bad -character heuristic

If one uses only the bad -character heuristic gives still an average-case complexity of needs but has a worst-case complexity of take into account. A worst case example, the text in conjunction with the pattern. Here is always the whole pattern with the text compared to the first point, a mismatch occurs. After such a mismatch can be postponed only one place to the right of the pattern (via Bad -character heuristic ).

Sample code in C

In practice, the algorithm applies to both rules and always uses the rule that you can jump the furthest the pattern for the "good rule" and for the "Bad " rule you created to start looking each a jump table (see "skip table " and " next table " in the code).

The following source code of the application is done " Good control " table ( next [ ] ) in the (unlikely) worst case, which is negligible at not too large patterns. The search for the suffix for the "Bad " rule table can be, for example, about the KMP algorithm to make, but the clarity is avoided due here. Thus, the following algorithm is.

If you immerse yourself output the number of required comparisons, these are at a large alphabet surprisingly few and the Boyer- Moore algorithm, for example, the Knuth - Morris -Pratt algorithm is preferable.

# include # include # include # include     char * boyerMoore (char * p / * pattern * /, char * s ) {        int plen = strlen ( p), slen = strlen ( s );        if ( plen > slen ) {          return NULL;      }        int skip [ UCHAR_MAX 1];      int i, j, k, * next;        / * Calc skip table ( "bad rule" ) * /      for (i = 0; i <= UCHAR_MAX; i ) {           skip [ i] = plen;      }        for (i = 0; i < plen; i ) {           skip [ (unsigned char) p [ i] ] = plen - i - 1;      }          / * Calc next table ( "good rule" ) * /      next = (int *) malloc ( ( plen 1) * sizeof ( int) );        for (j = 0; j < = plen; j ) {          for (i = plen - 1; i> = 1; i - ) {              for ( k = 1, k < = j, k ) {                  if ( i - k < 0) {                      break;                  }                  if ( p [ plen - k] = p [i - k]! ) {                      goto nexttry;                  }              }              goto matched; nexttry:             ;          } matched:          next [ j ] = plen - i;      }        plen -;      i = plen; / * Position of last letter p in s * /        while (i < slen ) {          j = 0; / * Matched letter count * /          while ( j < = plen ) {              if ( s [i - j] == p [ plen - j ]) {                  j ;              Else { }                  i = skip [ (unsigned char) s [i - j] ]> next [ j]? skip [ (unsigned char) s [i - j] ] - j: next [ j];                  goto newi;              }          }          free (next);          return s i - plen; newi:         ;      }      free (next);      return NULL; }   int main () {      char * s = " ababbbcabaabbbabababbbababbba ";      char * p = " ababbba ";      char * found = boyerMoore (p, s);        while ( found! = NULL) {          printf ("% .7 s found @ position % d \ n", found, found - s 1);          found = boyerMoore (p, found 1);      }      return 0; } literature

  • Robert S. Boyer, J. Strother Moore: A fast string searching algorithm. In: Communications of the ACM. Vol 20, No. 10, ISSN 0001-0782, pp. 762-772, online version (PDF; 1:19 MB), doi: 10.1145/359842.359859.
  • Robert S. Boyer, J. Strother Moore: A Lemma Driven Automatic Theorem Prover for Recursive Function Theory. In: 5th International Joint Conference on Artificial Intelligence, 1977, IJCAI -77. Proceedings of the Conference. Massachusetts Institute of Technology, Cambridge, Massachusetts, USA, August 22-25 1977. Band 2 Morgan Kaufmann Publishers Inc., Los Altos CA 1977, ISBN 0-934613-48-6, pp. 511-519, online version (PDF, 240 KB).
141973
de