Levenshtein distance

The Levenshtein distance (also edit distance ) between two strings is the minimum number of insert, delete and replace to operations to convert the first string into the second. Named is the distance after the Russian scientist Vladimir Lewenstein, who introduced it in 1965. Mathematically, the Levenshtein distance is a metric on the space of symbol sequences.

For example, the Levenshtein distance between "animal" to "gate" 2 is a possible sequence of two operations:

In practice, the Levenshtein distance to determine the similarity of strings is used for example to check spelling or duplicate detection.

Algorithm

In the Levenshtein article of 1965, the Levenshtein distance is defined, but it is not specified algorithm for computing the distance. In the literature, an algorithm for calculating the Levenshtein distance, uses the method of dynamic programming, referred to as Levenshtein algorithm.

The algorithm is specified by the following matrix recurrences designate where and are the two input strings:

After the matrix is calculated, is the Levenshtein distance, ie the minimum number of edit operations, the matrix cell. In each cell is the Levenshtein distance of prefixes and. In a further deletion or insertion is now just another sign is consumed by or. That the result is stored in or. So the cost of a deletion or insertion in the cell can by recurrent or calculate. The case of the substitution is analogous to explain.

If not only the Levenshtein distance is to be calculated, but also the sequence of operations that led to the conclusion that must be performed on the calculated matrix a " backtrace ". That beginning of the optimization decisions are recursively traced. In pseudo- code is the backtrace algorithm:

String backtrace (i, j):    if i > 0 && D [i- 1, j ] 1 == D [i, j]      return backtrace (i-1, j ) " Del "    if j > 0 && D [i, j-1 ] 1 == D [i, j]      return backtrace (i, j-1 ) " ins"    if i > 0 && j > 0 && D [ i-1, j- 1] 1 == D [i, j]      return backtrace (i-1, j-1 ) " Rep "    if i > 0 && j > 0 && D [ i-1, j-1 ] == D [i, j]      return backtrace (i-1, j-1 ) " Eq"    return " " To simplify the pseudo-code, is only one possible backtrace is calculated.

Example

The procedure can now be easily performed in a table. Here is an example with the two strings "animal" and "gate".

ε t o r ε 0 1 2 3 T 1 0 1 2 i 2 1 1 2 e 3 2 2 2 r 4 3 3 2 The distance between the two strings is 2 ε stands for the empty string " ".

Complexity

The running time of the algorithm is in O, since all cells of the matrix are calculated and the computational cost per cell is constant.

The memory requirement is because the matrix rows and columns has. However, if no backtracing is done, then only the last row or column is needed to calculate the next row or column to row- or column-wise calculation of the matrix. The memory requirement is in the case, in.

The Hirschberg algorithm allows calculation of the Levenshtein distance and a " backtrace " in quadratic time and linear memory consumption.

Lists

For the Levenshtein distance some upper and lower bounds are valid:

  • It is at least the difference between the lengths of the two strings
  • They exceed the length of the longer string
  • It is then and only then 0 if the two strings are identical (as defined metric )
  • Is not greater than the Hamming distance, plus the difference in length of the strings

Demarcation

The Levenshtein distance can be viewed as an extension of Hamming distance, which is limited to substitutions and therefore only strings can measure the same length. A phonetic search can use the Levenshtein distance to allow for error. For comparative words can be transferred, for example, with the Cologne phonetics or the Soundex algorithm in a phonetic symbol - string in front of a distance calculation.

The DP algorithm for calculating the Levenshtein distance is related to the Needleman - Wunsch algorithm for calculating the sequence alignments of two strings. The Needleman - Wunsch algorithm is in and can be arbitrary cost functions for insert and delete operations to the length. The Levenshtein algorithm is an insert or delete operation of a maximum of one character. Variants of the Needleman-Wunsch algorithm limit the choice of the cost function, so that their running time is in.

The Smith -Waterman algorithm to optimize the cost of the edit operations by the same scheme as the DP Needleman-Wunsch or the Levenshtein algorithm, however, insert and delete operations at the beginning or end of a sequence of rated and ignored in the backtrace. That the Smith -Waterman algorithm calculates the local "sequence alignment". Its running time is.

The Levenshtein distance can be regarded as a special form of Dynamic Time Warping distance ( DTW).

Variants

Weighted Levenshtein distance

The cost of each insert, delete and replace operations can also be different or even dependent weighted by the respective participating characters. The so generalized method is referred to as weighted Levenshtein distance Weighted Levenshtein Distance, or shortly WLD algorithm.

An example of weighted costs of spacer operations can be minimized with the TCD algorithm, the cost of the writing distance.

Damerau - Levenshtein distance

Damerau - Levenshtein Levenshtein extends the functionality of the ability to identify two transposed characters, such as " Raisch " ↔ " Rasich ". In order to convert a string into the other two operations would be necessary with Levenshtein. Damerau - Levenshtein requires only one.

The following matrix recurrences specify the algorithm variant.

Wherein the cost of an interchange of two numerals.

254631
de