Needleman–Wunsch algorithm

The Needleman - Wunsch algorithm calculates the optimal global Similarity score or with the help of backtracking one or more optimal global alignment between two sequences. It is an optimization algorithm from bioinformatics. There it is used for comparing two nucleotide or amino acid sequences. The Similarity score is a measure of the similarity between two sequences; the higher the score, the more similar the sequences are under a scoring model. The optimization of the algorithm consists in the maximization of Similarity scores. In this case, an alignment of a sequence of editing operations, in order to transfer the first sequence to the second. For two sequences, there are many different alignments each - an alignment is optimal if it has maximum Similarity score. The algorithm uses the method of dynamic programming and allows arbitrary scoring models (ie, the use of general gap costs) for insert and delete operations.

The procedure

The Needleman-Wunsch -DP algorithm calculates in a matrix for all pairs of the possible prefixes of the sequences a and b are the optimal global Similarity alignment score. The element of the matrix containing the optimal score for the optimal global alignment of the partial sequence of a and of b. The spelling corresponding to the i th prefix of A. When m is the sequence length of a and n is the length of sequence b, then the score matrix comprises M rows and columns. The alignment score of the complete sequences is after the execution of the algorithm contained within.

The score matrix is ​​calculated recurrent. For the element (i, j ) of the matrix M three cases is maximized. The extension of the previous alignment of the sequences and to a match or mis-match corresponds to the addition of the previously calculated score and cost for the replacement of through. The extension of an already computed alignments to a final deletion corresponds to the addition of the general gap costs the length of the deletion to the score of the optimal alignment of the sequences and the length of the deletion indicated. Analogously to the deletion corresponds to the extension of an optimal alignment of the sequences and to a final insertion of the addition of the scores of this alignment and the cost of the gap length of the insert. The maximum value of these three alternatives is stored in the element.

The gap cost function may be general. That it is not assumed that uniform cost or affine gap costs are used.

The previously described informally matrix recurrences are written formally in the next section. In order not to violate the dependencies of the recurrence, the score matrix to be row- or column-wise calculated.

The alignment can be output by backtracking. As a possible backtracking method you can store in an auxiliary matrix for each element during the calculation of the score matrix, if the maximum value was calculated by an insertion, deletion or by a match. From the element of the matrix can after the full calculation of the matrix the path to the calculation of the maximum traced, while the optimal alignment can be constructed. For two sequences exist in a matrix or the lead several optimal paths in a score matrix for the optimal score, as well as two sequences can exist multiple alignments, which have the optimal score.

Matrix recurrences

Specification of the algorithm by matrix recurrences:

  • A, b = strings over an alphabet
  • M = length ( a)
  • N = length ( b )
  • - Is the maximum score Similarity between a prefix of a, terminating in I and a prefix of B, which ends in j
  • , Similarity Score function
  • F, general gap penalty function

Example

The steps of the algorithm are ' presented by a small example here.

As an evaluation function, the following function is used:

A = b = ACGTC and AGTC

For a better understanding, one can imagine that the rows are labeled with the letters of the sequence a and the column with the letters of sequence b. Mathematically, this makes no sense within the matrix, so this is only for contemplation.

0 step: Initialization

The entries of the matrix for the first row and the first column is filled as described above. The rating for the entry is calculated from the overlying assessment and score on the site. So the other values ​​are now calculated analogously.

Step 1: Calculation of:

→ The maximum occurs in the first case, that is, A is aligns with A.

→ Increase j by 1, i remains the same

Step 2: Calculation of:

→ The maximum arises from the third case, since the maximum of the calculation, namely 0 arises, ie a gap ( -) were aligns with C.

The filled matrix looks after the complete execution of the above Steps as follows:

The assessment of this alignment is 3

The corresponding alignment looks like this:

Calculates there is a traceback.

Choice of evaluation function

The choice of the evaluation function has a major impact on the results obtained by the Needleman - Wunsch algorithm. A simple evaluation function selected as above in no way reflects the biological background of alignments and is therefore less suitable for practical purposes. Read the most common at the moment scoring functions the evaluation of a so-called scoring matrix of. For proteins, you can use the PAM or Blosum matrices. These matrices with the size of 20 * 20 (or 24 * 24, when some special cases still be observed) included Reviews (so-called log -odds ) that an amino acid is substituted for another. The log -odds based on probabilities. These scoring matrices are also calculated from sequence alignments.

The evaluation function used above is used to determine the similarity between two sequences. In order to determine the distance you can simply change the evaluation function, ie inequality can return a positive value, which can be interpreted as a punishment and equality at 0 or a negative value. However, it should be noted that the recursion at a distance-based weighting function is not the maximum, the minimum, but must be determined.

An example of a distance- based scoring function:

Complexity

The term of the Needleman-Wunsch algorithm is. It must be calculated elements of the matrix, and for each element must be maximized over the elements. This can be easily deduced from the matrix recurrences. The memory requirement is.

Demarcation

In the Needleman-Wunsch 1970 Paper no matrix recurrences are included, the algorithm described there informal; the matrix recurrences arising from this description.

In a paper by Waterman, Smith and Beyer 1976 the Needleman - Wunsch algorithm is specified explicitly in matrix recurrences. Therefore, some authors also refer to the Needleman - Wunsch algorithm as Waterman -Smith - Beyer algorithm.

In some literature the standard DP algorithm for computing the global alignment is incorrectly referred to under uniform gap costs as Needleman - Wunsch algorithm. However, this is a specialization of the Needleman-Wunsch algorithm, since the use of uniform gap costs for the edit - operations, only the three neighboring cells must be considered.

Because of the cubic term of the Needleman - Wunsch algorithm is not used in practice. If restricted to uniform cost can be calculated with the above-described optimization, the optimal alignment. With the Gotoh algorithm, the optimal alignment can be calculated with affine gap costs in quadratic running time. The Hirschberg algorithm computes a global alignment with uniform or affine gap costs on a linear space in time and the Smith -Waterman algorithm calculates the optimal local alignment of two sequences.

Memory Assessment

Because of the quadratic memory requirements of the algorithm for the Alignieren longer sequences is rather inappropriate. For example, if are stored in the matrix of integers, each of which is 4 bytes in size, then the calculation of alignment scores of a sequence of 10,000 characters requires a matrix with entries. So be 381 megabytes occupied by the matrix bytes. Alignment of whole genomes can thus not perform efficiently. For example, a bacterial genome average about 1-4 million base pairs and the human genome of approximately 3,000,000,000 base pairs.

Apart from this, a global alignment of whole genomes is not necessarily a high biological insights.

Variants

Uniform Gap Costs

When using uniform gap costs a specialization of the matrix recurrences of the Needleman - Wunsch algorithm is possible, thus increasing the duration of on reduced. Sellers have these recurrences explicitly specified in a paper of 1974.

A uniform gap cost function is defined by, ie every part of an Gaps costs the same. Using uniform gap costs is in the contemplation of an optimal alignment of prefixes and ending in an insertion gap of length, directly visible, that the optimal alignment of prefixes and ends in an insertional Gap. Therefore, it suffices to count for the calculation of the cost of an optimal insertion Gap ( any length ) in. The calculation of the deletion -gap cost is the same.

This implies the following specialized recurrences:

Free -shift alignment

The calculation of the optimal free- shift alignment (or End Gap Free Alignment) is a variant of the Needleman - Wunsch algorithm in which a sequence of insertions or deletions at the beginning or end of the alignment are not included in the score calculation. For this, the algorithm for computing the global alignment is changed so that the first row or first column and the last column and last row are initialized with.

596640
de