Bitap algorithm

The Baeza-Yates-Gonnet algorithm, also known as the shift-or, shift-and, or bitap algorithm, solves the string matching problem by simulating a non-deterministic automaton. Among other things, a modification of this algorithm is used in the Unix tool grep.

Because the implementation can be reduced to bit operations, the algorithm is already very efficient in itself. Combined with its simple structure (a single loop over the pattern during preprocessing, and a single loop over the text during the search), this results in an extremely efficient algorithm.

Basis

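The algorithm is based on a sequence of bit vectors R_j, one for each text position j, with the following definition:

R_j[i] = 1 if p_1 ... p_i = t_(j-i+1) ... t_j, and R_j[i] = 0 otherwise,

where p_1 ... p_m is the search pattern and t_1 t_2 ... is the text.

Put plainly, R_j[i] = 1 exactly when, after processing j characters of the text, the last i characters read match the first i characters of the search pattern.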
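The definition can be checked directly with a brute-force sketch. The function name `state_vector` and the sample strings are assumptions chosen for illustration; the point is only to make the meaning of R_j[i] concrete, not to show how the algorithm computes it efficiently.

```python
def state_vector(pattern: str, text: str, j: int) -> list:
    """Return [R_j[1], ..., R_j[m]] computed straight from the definition.

    Assumes 0 <= j <= len(text). Position i is 1 when the last i characters
    of the first j text characters equal the first i pattern characters.
    """
    m = len(pattern)
    read = text[:j]  # the j characters processed so far
    return [
        1 if i <= j and read[j - i:] == pattern[:i] else 0
        for i in range(1, m + 1)
    ]


# Hypothetical sample data: after 4 characters of "ababab", the prefixes
# "ab" (i = 2) and "abab" (i = 4) of the pattern "abab" end at position 4.
print(state_vector("abab", "ababab", 4))  # [0, 1, 0, 1]
```

Recomputing every vector this way would be far too slow for searching; it only serves as a reference for what the bit operations are meant to produce.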

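A match for a search pattern of length m is therefore found exactly when R_j[m] = 1, that is, when the last m characters read equal the whole pattern.

Furthermore, the characteristic vectors of all characters that can occur in the text are needed. For a character x, the characteristic vector C_x is defined by C_x[i] = 1 if p_i = x, and C_x[i] = 0 otherwise.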

Example:

Search pattern length

Characteristic vectors:
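As a minimal sketch, characteristic vectors can be built as integer bit masks in one pass over the pattern. The pattern "abab", the function names, and the convention that pattern position i is stored in bit i-1 (least significant bit first) are assumptions made for this illustration.

```python
def characteristic_vectors(pattern: str) -> dict:
    """Map each pattern character x to a bit mask for C_x.

    Bit i-1 of the mask is set when pattern position i holds x.
    Characters absent from the pattern have no entry (their vector is 0).
    """
    masks = {}
    for i, ch in enumerate(pattern):  # i = 0 corresponds to position 1
        masks[ch] = masks.get(ch, 0) | (1 << i)
    return masks


def as_positions(mask: int, m: int) -> list:
    """Expand a mask into the list [C[1], ..., C[m]] for display."""
    return [(mask >> i) & 1 for i in range(m)]


vectors = characteristic_vectors("abab")
print(as_positions(vectors["a"], 4))  # [1, 0, 1, 0]
print(as_positions(vectors["b"], 4))  # [0, 1, 0, 1]
```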

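Procedure (exact matching)

To simplify the procedure, a special bit operation is introduced first, the shift of a vector: Shift(R)[1] = 1 and Shift(R)[i] = R[i-1] for i = 2, ..., m. Every entry moves one position further, and a 1 enters at the first position.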

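The algorithm for exact matching can now be reduced to a few simple steps:

(1) Initialize the vector with zeros: R_0 = 0.
(2) For each text character t_j (j = 1, 2, ...), shift the previous vector by one position, setting the first position to 1: R' = Shift(R_(j-1)).
(3) Combine the result with the characteristic vector of the current text character: R_j = R' AND C_(t_j).
(4) If R_j[m] = 1, report a match.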

On closer inspection, steps (2) and (3) together yield exactly the rule for computing R_j[i]. The shift carries the 1 from the "old" entry R_(j-1)[i-1] to position i, which is equivalent to the condition that the preceding i-1 characters already matched. The characteristic vector of the current text character contains a 1 at position i exactly when pattern and text agree there, that is, when p_i = t_j. The AND operation links the two conditions.
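The procedure can be sketched in a few lines using Python integers as bit vectors. This is an illustrative shift-and variant, not a reference implementation: the function name `shift_and_search`, the least-significant-bit-first layout, and the choice to return 1-based start positions are all assumptions.

```python
def shift_and_search(pattern: str, text: str) -> list:
    """Return the 1-based start positions of all exact matches."""
    m = len(pattern)
    if m == 0:
        return []

    # Preprocessing: one loop over the pattern builds the characteristic vectors.
    masks = {}
    for i, ch in enumerate(pattern):
        masks[ch] = masks.get(ch, 0) | (1 << i)

    hit_bit = 1 << (m - 1)  # position m of the state vector
    state = 0               # R_0: nothing matched yet
    hits = []

    # Search: one loop over the text.
    for j, ch in enumerate(text, start=1):
        # Shift (with a 1 entering at position 1), then AND with C_{t_j}.
        state = ((state << 1) | 1) & masks.get(ch, 0)
        if state & hit_bit:
            hits.append(j - m + 1)  # end position minus length, plus 1
    return hits


print(shift_and_search("abab", "xababab"))  # [2, 4]
```

Because the mask never has more than m bits set, the AND keeps the state within m bits, so no extra truncation is needed here. For patterns no longer than a machine word, the same loop maps onto a handful of register operations per text character.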

Example ( exact matching)

Pattern:

Text:

Since R_j[m] = 1 at this point, there is a match at text position j - m + 1 (end position minus pattern length, plus 1 as a correction for the first character).

Extension (approximate matching)

With slight modifications, the algorithm can also perform a fuzzy search. For this purpose, the vector R_j is split into several vectors: R^0_j for exact matches (no errors), R^1_j for matches with up to one error, and so on up to R^k_j for matches with up to k errors.

Note: for the error vectors, the interpretation given above ("after j characters, the last i characters read match the first i characters of the pattern") is harder to apply and not necessarily obvious.

The formula for computing R^0_j remains unchanged. For the error vectors R^d_j with d >= 1, the formula depends on the kind of error that is allowed:

Inserting a character into the search pattern

R^d_j = (Shift(R^d_(j-1)) AND C_(t_j)) OR R^(d-1)_(j-1)

Explanation: The first part of the expression covers the case that d errors are already present, but the current characters of text and pattern match. The second part covers the error case: so far (in R^(d-1)_(j-1)) there were only d-1 errors, so the current text character can be inserted into the pattern.

Interpretation: R^d_j[i] = 1 if, after j characters of the input, the most recently read characters match the first i characters of the search pattern except for at most d extra text characters, which can be accommodated by inserting them into the pattern.

Deleting a character from the search pattern

R^d_j = (Shift(R^d_(j-1)) AND C_(t_j)) OR Shift(R^(d-1)_j)

Explanation: The first part of the expression covers the case that d errors are already present, but the current characters of text and pattern match. The second part covers the error case: if one considers not the first i characters of the pattern but only the first i-1 (in the vector, the position just before), the pattern matches with d-1 errors. The i-th character of the pattern is then simply deleted.

Replacing a character in the pattern

R^d_j = (Shift(R^d_(j-1)) AND C_(t_j)) OR Shift(R^(d-1)_(j-1))

Explanation: The first part of the expression covers the case that d errors are already present, but the current characters of text and pattern match. The second part covers the error case: after j-1 characters, the last i-1 characters read matched the pattern with d-1 errors. If the i-th character of the pattern is now replaced by the current character of the text, then after j characters the last i characters read also match the first i characters of the "new" pattern.

The variants can be combined as desired by joining their error terms with OR.
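A sketch of the combined variant, with insertion, deletion, and replacement all allowed and joined by OR, might look as follows. The function name `fuzzy_search`, the return value (1-based end positions of approximate matches), and the initialization of the error vectors are assumptions for this illustration. The start vector for d errors has its first d positions set, since a pattern prefix of length d can be matched against empty text by d deletions.

```python
def fuzzy_search(pattern: str, text: str, k: int) -> list:
    """Return 1-based end positions where the pattern matches with <= k errors."""
    m = len(pattern)
    if m == 0:
        return []

    masks = {}
    for i, ch in enumerate(pattern):
        masks[ch] = masks.get(ch, 0) | (1 << i)

    full = (1 << m) - 1     # keeps every vector within m bits
    hit_bit = 1 << (m - 1)  # position m
    # R[d] holds the vector for up to d errors; start: first d positions set.
    R = [((1 << d) - 1) & full for d in range(k + 1)]
    ends = []

    for j, ch in enumerate(text, start=1):
        c = masks.get(ch, 0)
        old = R[:]  # the vectors for text position j-1
        R[0] = ((old[0] << 1) | 1) & c  # exact rule, unchanged
        for d in range(1, k + 1):
            match = ((old[d] << 1) | 1) & c       # characters agree
            insert = old[d - 1]                   # insert t_j into the pattern
            replace = (old[d - 1] << 1) | 1       # replace p_i by t_j
            delete = (R[d - 1] << 1) | 1          # delete p_i (uses position j)
            R[d] = (match | insert | replace | delete) & full
        if R[k] & hit_bit:
            ends.append(j)
    return ends


print(fuzzy_search("abcd", "xxabxdxx", 1))  # [6]  ("abxd": one replacement)
```

Dropping one of the three error terms from the OR restricts the search to the remaining kinds of error; if deletion is dropped, the error vectors would start at 0 instead.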
