Approximate string matching

The fuzzy search, even fuzzy search or fuzzy string search, called in computer science includes a class of string matching algorithms to search and find a specific string ( engl. string) in a longer string or a text should.

Typical of the "fuzzy " (English fuzzy) search method is that not the exact string as the search criterion must be based, but also similar strings are to be found. A well-known measure to calculate this similarity is the so-called Levenshtein distance; It indicates how many operations - such as replacing, moving characters in words - are needed to derive a string of the other of the fewer operations are required, the more similar are both strings. Another possibility is based on so-called n-grams, which is calculated by means of certain probability, which could follow the letter or character string in a different combination.

Another approach is not based directly on the graphical representation of a word, but it looks for strings that sound the same: the phonetic search. A well-known method in this context, the words indexed by their sound, is for the English the Soundex algorithm.

Both approaches make it possible to find searched string even if, for example, the exact spelling of a name or expression is not known, found inflected forms of a word or error-tolerant search results should be accepted. Use the fuzzy search, for example in databases, search engines and computational linguistic applications.

Practical Example

When searching in databases error-tolerant search tools can compensate for typos and spelling errors using string matching algorithms. Similarities between the search term entered and the entries in the database can also be determined without stored word lists. Results can be output sorted by relevance and proximity to the search term. The search for the term " Levenstein " for example, would find entries for " Levenshtein ". Become a synonym lists stored, find the fuzzy search, for example, to the term " television ", terms such as " TV ".

The use of complex string matching methods such as the Levenshtein algorithm, is usually accompanied with an enormous amount of calculation and results in large amounts of data to an often high temporal delay.

357495
de