Smith–Waterman algorithm

The Smith -Waterman algorithm is an algorithm which calculates the optimal local alignment score ( similarity score) and the optimal local alignment between the two sequences. Sequence alignment is a sequence of editing operations (such as character substitution, insertion, deletion ), which transfers the sequence to the other. The individual operations have a score and the alignment score is defined as the sum of the edit operation scores. A local alignment is a sequence of edit operations to a subsequence of the first sequence to be converted into a partial sequence of another sequence, ie can be ignored in the optimization of a sequence of insert and delete operations at the beginning and the end, if this improves the alignment score. These operations are not ignored part of the local alignment.

The input sequences may be strings to different alphabets, e.g., in Bioinformatics the Smith -Waterman algorithm is applied to DNA sequences or amino acid sequences. A use case is, for example, the search for genes ( in newly - sequenced genomes ), the sequence of a known gene sequence is similar in another organism, the Edit Operations model approximates biological changes during evolution.

The algorithm uses the method of dynamic programming and its running time is quadratic. It was developed in 1981 by Temple Smith and Michael S. Waterman, is a version of the Needleman - Wunsch algorithm, which calculates the global alignment.

Local alignment problem

The Smith -Waterman algorithm solves the local alignment problem:

Motivation

The calculation of the optimal local alignment has an application other than the calculation of the optimal global alignment.

The consideration of global alignments is useful if it can be assumed that the relatively similar sequences to be compared, for example, sequences of the same length from a protein family.

However, if you for local matches ( = Similarities ) wants to search into sequences that can be very different in other areas, so the consideration of local alignment is more useful. For an optimal global alignment may cover these local matches in this case, since it has to maximize its score with respect to the entire sequence, for example, individual motifs in different protein sequences.

Demarcation to the Needleman - Wunsch algorithm

The Needleman - Wunsch algorithm calculates the global alignment of two sequences. In order to solve the local alignment problem which Needleman - Wunsch algorithm, two modifications are necessary:

The local alignment score is not in the lower right corner of the score matrix, but somewhere in the matrix. It is the entry with the largest value in the matrix.

The optimal local alignment is obtained by backtracking from the matrix entry with the greatest value to a 0 - entry in the matrix.

As with the calculation of the global alignment multiple optimal local alignment of two sequences may exist. So may be several maximum values ​​in the score matrix, and for each optimum value, several different backtraces are possible.

Matrix recurrences

Specification of the algorithm by matrix recurrences:

Input

  • ... Strings over an alphabet with
  • Gap - character ...

Recurrences

  • Indicates the maximum alignment score between a suffix of the first sign of and a suffix of the first signs of at

Efficiency

The runtime complexity of the Smith -Waterman algorithm is in and the memory requirement in. This can be easily deduced from the matrix recurrences.

Because the score matrix can be calculated rows or columns, you need only the current and the last row or column to store if you do not want to calculate the alignment only the score and. In the case where the memory requirement is O (n) and O (m).

In linear memory requirements can also calculate the local alignment using the programming method divide-and -conquer. See Hirschberg algorithm.

Example

Input

  • Sequence a = TCCG
  • Sequence b = ACGA

For the optimal local alignment is started at the number and wandered back diagonally, which ( in sequence) with CG supplies as a result of the alignment CG ( out of sequence). This seems trivial in this simple example, with longer sequences, however, the result is no longer read at a glance from the specification.

734977
de