Sequence alignment

An alignment (English for: balance, arrangement, alignment), fachsprachlich often called Alignierung or Alinierung, is used to compare two or more strings (technical term for string, sequence), most often used in bioinformatics and molecular phylogeny to the functional or evolutionary relationship ( homology) of nucleotide sequences or amino acid sequences to investigate. Sequence alignment is a branch of pattern matching.

  • 2.1 Global Alignment 2.1.1 example
  • 6.1 Online Interface

The principle

There are automated alignment methods, but you can also alinieren smaller data sets manually. The manual method allows for greater care and exclusion of highly variable and thus not alignierbaren positions that would interfere with subsequent analyzes. When alignment is assigned to the elements of a subject to those of the string / strings of the other so that the sequence is maintained, and each element is associated with another element or a gap (space, gap ) in each string. A mismatch in the alignment corresponds to a mutation. The gaps, however, suggest a deletion or an insertion ( Indel ). The associated ( alignierten ) elements should be identical or as similar as possible, because to point many of the same or similar elements in the same order on an evolutionary or functional relationship. The similarity of the elements is usually predetermined and depends on the characteristics of the data or used scoring matrices from. Thus a useful alignment is possible, and since the sequences are often of different lengths, gaps may be inserted into the sequences.

The alignment of two sequences is called a pairwise alignment that. Several as multiple alignment Pairwise alignment, we further differentiate between global, local and semiglobalem alignment.

Cost function for automated alignment

The task of a alignment algorithm to find an alignment that is optimal under a cost function (alignment score). The cost function in bioinformatics is usually in a so-called " similarity matrix ", or " exchange matrix " ( engl: Similarity matrix or mutation data matrix ) is predetermined. These indicate for each pair to be compared sequence elements, how likely it is that this pairing is caused by evolution. Identical elements are highly valued, "similar" elements less high, very different elements put the score down. The cost function has the consequence that the calculated best alignment is one which would be expected if the two sequences are homologous. A problem in this case represent gaps; for the occurrence of insertions or deletions in biological sequences there has been no precise mathematical models. We therefore used empirically motivated, so called affine gap scores peeling a constant contribution to the introduction of gaps from each result and a further contribution, which grows linearly with the length of the gap.

Example

- AAACGG AAAACCG The above- illustrated alignment of two short sequences of DNA is in the first position ( A ) so that a gap can be inserted to compensate for differences in length. The gap has been inserted at the beginning of the upper sequence and not in the center, because it is likely from the perspective of biology, that a sequence mutated at the ends than in the middle.

At the penultimate position C and G were aligns as well mutations in the DNA are possible, in which instead of a C, a G is inserted at random, or vice versa. It would also have been possible to alignieren G and C respectively to a gap in the other sequence. This decision depends on the cost function used.

The protein sequence alignment of the amino acid sequences correspond to strings. The cost functions for the similarity of the individual amino acids to one another are slightly more complex than in the case of DNA.

Pairwise alignment

An alignment of two symbol sequences, a sequence of editing operations, which describes a transform according. Here, a symbol with another symbol replaced ( mismatch), a symbol is deleted ( deletion ) or an icon can be inserted (insert ). Typically, the symbols involved in an edit operation are written over each other, where the symbol - denotes a formed by an inserted or deleted in the other sequence symbol gap.

Each edit operation can be assigned by an evaluation function value. The sum of the values ​​of all the edit operations is referred to as an alignment alignment score.

An optimal alignment is an alignment whose score is optimal under a marking scheme.

This distinction is sometimes made between the terms score and edit distance. The term score or distance is then used in a scheme that evaluates sequence similarity or sequence differences by positive values ​​. That according to this distinction is an optimal score or an optimum distance maximum or minimum.

An example of an evaluation scheme are the unit cost (unit costs). The evaluation function is defined as:

Global alignment

In a global alignment between two sequences, all symbols are considered. Global alignments are mainly used when the to be examined sequences are of similar length and strong sequence homologies are expected.

The calculation of the optimal alignment score and the optimal alignment is an optimization problem that is efficient (running time ) can be solved in the pairwise alignment using the method of dynamic programming ( Needleman - Wunsch algorithm ).

Example

  • Given: Two sequences S and T, unit costs as an evaluation scheme
  • Assume S and T have a common ancestor ( homologous ).
  • Question: What positions in S and T are homologous?

For S = T = GAC and GC are possible solutions:

GAC GC 0 1 1 = 2 2 GAC - --- GC 1 1 1 1 1 = 5 3 GAC G -C 0 0 1 = 1 Free -shift alignment

The free- shift alignment (also known as semi- global alignment or End Gap Free Alignment designated ) of two sequences is a variant of the global alignments, in which a sequence of insertions or deletions at the beginning or at the end of Aligments in the calculation of scores not be considered. The calculation of the optimal free- shift alignment may be more meaningful than the computation of the optimal global alignment in certain applications, for example, if a sequence is significantly longer than the other and a protruding suffix or prefix has no relevance.

Local alignment

A local alignment of two sequences is a global alignment of a subsequence ( substring ) of and a partial sequence of. That to calculate the optimal local alignment of two sequences, the two partial sequences must be found, their optimal alignment score is maximum. Application examples for the calculation of local alignments are searching for the same sequence motifs or domains in proteins. The classic algorithm for calculation of optimal local alignment, the Smith -Waterman algorithm.

Multiple sequence alignment

While the optimal alignment of two sequences by using a computer rather quickly (ie in polynomial time ) can be calculated exactly ( running time O n, and m are the lengths of the sequences ), this is the multiple sequence alignment not ( engl. multiple sequence alignment) longer possible, since the term of the algorithm for the precise calculation of multiple alignment with the number of sequences increases exponentially ( where k is the number of sequences, and the longest of the n sequences to be compared is ).

However, to calculate a biologically or evolutionarily meaningful alignment from which can actually be derived similarities and differences in sequence, structure and function, it takes many long sequences. Therefore, heuristics are used, for example, so-called progressive strategies (also called Hierarchical Methods ). This process, all optimal pairwise alignments of sequences to be analyzed are calculated and (for example by using a Neighbour - Joining algorithm) derived by cluster analysis, a phylogenetic tree ( the so-called guide tree). Along this tree will eventually gradually ( progressively, according to the principle of a greedy algorithm ) provides a multiple alignment, which by this heuristic approach, the optimal solution is not guaranteed.

Alignment algorithms

  • Needleman - Wunsch algorithm ( global alignment )
  • Hirschberg algorithm ( global alignment in linear space)
  • Smith -Waterman algorithm ( local alignment )
  • Gotoh algorithm ( global alignment with affine Gapkosten )

Heuristic algorithms for pairwise alignment:

Heuristic algorithms for multiple alignment:

  • Popular fragment -based method DIALIGN -TX
  • Hierarchical methods (for example, Feng - Doolittle )

Related Topics

  • Methods, such as Jukes & Cantor, Kimura -2- parameter Felsenstein 81, HKY 85 ( Hasegawa, Kushino & Yano, 1985) or general time reversable ( GTR or REV) correct distance data from sequence alignments.

Software

Frequently used programs for general sequence alignments are ClustalW, MAFFT and TCoffee and BLAST for the database search.

A comprehensive list of available software categorized by algorithm and type of alignments can be found here: en: sequence alignment software.

Online interface

The STRAP program integrates almost all freely available programs for the calculation of sequence alignments. These are automatically installed and are then invoked with a comfortable graphical user interface. Thus, the user is spared the particular installation and learning the command line syntax of each program. Since the calculation of large alignments can take a long time to complete, results of lengthy calculations are stored in the cache. If 3D structures are available for at least two of the proteins, the combined use of sequence alignment and 3D structure overlay is recommended.

723241
de