Rabin–Karp algorithm

The Rabin - Karp algorithm is a search algorithm for texts, which was developed by Michael O. Rabin and Richard M. Karp. The algorithm searches for a pattern, such as a character string within another string using hash values ​​. In the single pattern recognition, the algorithm is not widespread, but it is of considerable theoretical importance and very effective to repeatedly look for a pattern in a text. N For a text of length and a pattern of length m is its average and best running time of O ( n ), the (very atypical ) running time in the worst case ( worst-case running time), however, is O ( nm). This is one of the reasons why the algorithm is not often used. Nevertheless, he has the unique advantage of any number of strings in a single run, on average, in O (n) or to find less.

One of the simplest practical applications of Rabin - Karp is the detection of plagiarism: Take the example of a student who writes a paper on Moby Dick. A professor may from different sources about Moby Dick automates a list of all the sets to create. Rabin - Karp could now quickly search the given document and find every occurrence of one of the sets of the source material. In order to avoid the simple outbraking the system with small changes to the original source, details, such as punctuation, previously removed and ignores it. Due to the number of the desired strings single string search algorithms would be impractical in this case.

Other algorithms to search for partial strings

The basic problem, which is referred to the algorithm, finding a fixed string of length m, pattern is called, in a string of length n An example would be to find the word "sun" in the sentence " Hello Sunshine in the Valley of Tears. " A very simple algorithm (see the following listing) for this case would simply search for the string in all possible positions:

1 function Naive search (string s [ 1. N], string sub [ 1. M] )   2 for i from 1 to n   3 for j from 1 to m   4 if s [i j -1] ≠ sub [ j]   5 jump to the next iteration of the outer loop   6 return i   7 not return found This algorithm works well in many practical cases, however, suggests the absence of a few examples. As a typical example, the following case may be mentioned: Search a pattern consisting of 10,000 "a" s followed by a " b" in a string s consists of 10 million "a". In this case, the algorithm would achieve its worst case Θ (Mn).

The Knuth - Morris -Pratt algorithm reduces this to Θ ( n ) by each individual letter by comparing pre-calculations only once. The Boyer- Moore algorithm jumps not only by one letter forward, but to as many letters as possible to let run the search is successful. It decreases as the actual number of iterations of the outer loop, so that the number of characters being examined, is small. At best, they can then be n / m. Karp - Rabin algorithm, however, focuses attention on the acceleration of the line 3-6 of the above example.

Accelerating the search by means of hash values

Instead of reducing the number of comparisons by more sophisticated procedures, Rabin - Karp algorithm speeds up the comparison between the pattern and the fragmental character strings by using a hash function. Rabin - Karp exploits the idea that the same pieces of text have equal hash values ​​. So the idea is to calculate the hash value of the desired character string and then search for a substring with the same hash value.

Here are two points to consider. First, the same hash values ​​do not come necessarily of identical strings. If they match, the strings themselves must be so even compared to what may take some time to complete on large strings. Fortunately, ensures a good hash function for different meaningful input also largely different hash values ​​, so that the average seek time by this circumstance did not significantly increase.

The algorithm would be as follows:

1 function RabinKarp (string s [ 1. N], string sub [ 1. M] )   2 H sub: = hash ( sub [ 1 m ]. )   3 hs: = hash (s [ 1 m ]. )   4 for i from 1 to n - m 1   5 if hs = H sub   6 if s [ i. I m -1] = sub   7 return i   8 hs: = hash (s [ i 1 .. i m] )   9 return not found Lines 2, 3 and 6 each run in Ω (m). Lines 2 and 3 are executed only once. Line 6 is executed only if the hash value matches. This will occur only a few times usually. Line 5 is executed n times, but requires only a constant time.

Now comes the second point: The line 8 If just the hash value for each sub- text s [i 1 .. i m] would be calculated, then the time required for this would be Ω (m). As this is necessary in each loop, the algorithm Ω (mn) would need, just like the above simplest search algorithm Naive search.

An optimization based on the fact that the hs variable already contains the hash value of s [ i. I m -1]. When you select a suitable hash function (rolling hash) can be calculated from this in constant time the next hash value. In the simplest case, the individual character values ​​of the sub- text to be added. Then:

S [i 1 .. i m] = s [ i, i m -1. ] - s [i ] s [i m ]

In the worst case, or when a bad hash function which produces a constant, for example, line 6 is to be carried out n times in each iteration of the loop. As this Ω (m) takes time, the entire algorithm is still the worst-case run-time is required Ω (Mn).

Used hash functions

The key to Rabin - Karp embodiment is the efficient calculation of the hash values ​​of the successive sub- text of the text. A popular and effective rolling hash function treats each part of text as a number to a base. This base is usually a large prime number. For example, if a part of the text " Hi" and the base is 101, the hash value would be 104 105 × 1011 × 1010 = 10609 (ASCII "H" is 104 and "i" is 105).

For example, if we " abracadabra " a text and are looking for a pattern of length 3, we can use the hash value of "bra " from the hash value of " abr " ( the previous text part ) calculated by number the that was added for the first "a" of " abr " pull off, eg 97 × 1012 (97 is ASCII for ' a' and 101 is the base we use ), multiplied by the base and add for the last of a "bra ", eg 97 × 1010 = 97, added. If the strings are long, this algorithm will achieve great profit in comparison to many other hashing schemes.

Theoretically, there are other algorithms suitable calculations, for example, multiplying the values ​​of all the ASCII characters, so that the displacement of the string would only require the parts by the first letter and the last letter of the multiplying. The limit of this would be the maximum size of an integer and the condition modulo calculations use the hashes to keep small.

Rabin - Karp and multiple pattern search

Rabin - Karp is Knuth - Morris - Pratt, Boyer -Moore, and inferior to other faster single pattern text search algorithms due to its slow worst-case behavior in the single - pattern search. Nevertheless, Rabin - Karp a good algorithm for multiple search pattern.

If we search each occurrence of a large number of fixed pattern lengths in a text, such as k, we can use a simple variant of the Rabin - Karp that uses a hash table or create any other suitable data structure. This checks whether a hash value of a given text string by which we seek to fit a collection of hash values ​​of patterns.

Function RabinKarpSet (string s [ 1. n], set of string subs, m) {       set hsubs: = emptyset       for each sub in subs           insert hash ( sub [ 1. m] ) into hsubs       hs: = hash (s [ 1 m ]. )       for i from 1 to n           if hs ∈ hsubs               if s [ i. i m -1] = a substring with hash hs                       return i           hs: = hash (s [ i 1 .. i m] )       return not found   } We assume that all strings have a fixed length m. This assumption, however, can be eliminated by simply comparing the current hash value with the hash values ​​of all strings. At the same time, we perform a quick lookup in our data structure by, and then check every hit we find all strings with this hash value.

Other algorithms can search for a single pattern in O (n), so they are to k patterns in O ( nk ) are looking for. The above mentioned Rabin - Karp variant is still in O (n) look up whether a substring hash equals a pattern hash in the best and in the average case, since it allows a hash table in O (1) or not. We can proceed further by O ( mnlogk ) in the worst case, if m is the length of the longest of the k- texts by the hash values ​​are stored in an AVL tree, rather than in a hash table.

Credentials

  • Karp and Rabin 's original paper: Karp, Richard M.; Rabin, Michael O. ( March 1987). "Efficient randomized pattern -matching algorithms". IBM Journal of Research and Development 31 ( 2), 249-260.
  • Search algorithm
467009
de