Neighbor joining

The neighbor-joining algorithm is a mathematical method to compare data sets and hierarchical bifurcal ( bifurcated ) to order. This procedure was introduced in 1987 by Saitou and Nei 1988 and further developed by Studier and Keppler and simplified.

Application

In bioinformatics, the neighbor-joining method refers to a phänetische bottom-up clustering method, which is used to generate phylogenetic trees. With this, the probability of parentage or kinship relationship is to be calculated in a family tree -like representation based on varying characteristics in the data matrix.

Typically these trees from DNA or protein sequence data or classical morphological data sets are created. The algorithm requires knowledge of the distance between two pairs of taxa (thus, for example, species or sequences) in a tree.

Algorithm

Neighbor -joining is usually based on the " minimum evolution " criterion for phylogenetic trees: Starting from an initially star-shaped "tree", in which all taxa are connected to a " center ", in pairs, the DNA or protein sequences with the lowest genetic distance selected and combined to form a branch of the tree. The genetic distances of the sequences are re-calculated and re- assembled the most closely related to a branch with two taxa. This continues until all taxa have been added to the tree and the star structure of the tree was completely dissolved. In contrast to the neighbor-joining UPGMA considered that the rate of evolution is not constant: If a taxon from all other taxa is far away, so this is not due to a remote degree of relationship, but for accelerated evolution.

Example

The following is a typical table of distances between taxa is specified, the values ​​are purely hypothetical but realistic:

Since the table is triangular symmetrical, the lower half need not necessarily be stored. The values ​​in this table are as named.

Step 1: It must be the average distances of each taxon are calculated at each other. This is done using the following formula for the net divergence ri:

Where N is the number of taxa.

Interpretation: The rose has in our example, the largest net divergence, has therefore experienced a greater rate of evolution compared with the other taxa.

Step 2: We compute an intermediate matrix M.

For example, as between human and mouse:

Step 3: In this newly computed distance matrix M was to be the smallest value, ie, the smallest distance between two taxa, searched, and found the two taxa are combined to form a new subtree u = (i, j). In this example, so the two ways the human and mouse, or rose and tulip arise together into a subtree. We opt first for humans and mice.

The edge length of the node of the branch is calculated as follows:

So man is equal to MeMa

Step 4: In the original distance matrix, the new entry u = MeMa is added:

To calculate the distances of the new entry u = (i, j) = (1,2) = ( human, mouse ) = MeMa to the remaining taxa, the following formula is used:

Where the entries i and j is u joined to form a new entry, and the distance to the entry k is calculated. The distance between Rose and the new subtree is:

The "old", merged entries are deleted from the distance matrices.

Thereafter, again and calculated re-assembled and started again from scratch. This is repeated until only two taxa are left, which are then connected to the end.

The result of our example can be represented as additive tree:

People Rose     \ /      \ 2 3 /       \ 9 /        -----------------       / \      / 1 1 \     / \   mouse tulip or the output of the Phylip program

---- Mouse  !  ! ------------- Rose   ----------------------------------------- 1 2  ! ---- Tulip  !   -------- Man classification

Neighbor -joining part of the explicit methods. This means that for the calculation of genetic distances varying evolution models, ie, different probabilities of point mutations may be adopted. The correctness of this family trees based on the assumption that the variation of the observed characteristics shall contain no unknown intermediate steps. It is assumed therefore simplifies that " evolution is no detours " ( "minimum evolution" ).

The neighbor-joining algorithm calculates the family tree gradually and therefore is not necessarily the optimal tree topology with the smallest branch length. This is due to its design principle, as Greedy algorithm. In contrast to other algorithms calculates this not describe all possible trees and chooses the end of the optimum, but rejects already during the process some computation paths. Although the algorithm is suboptimal, it has been extensively tested and usually finds a tree that comes relatively close to the optimum.

Benefits

The biggest advantage of this method is its speed. You can apply it to huge amounts of data, even where other methods of phylogenetic analysis as maximum parsimony and maximum likelihood are no longer feasible. Unlike the UPGMA algorithm ( Unweighted Pair Group Method with Arithmetic mean) for phylogenetic tree reconstruction neighbor-joining does not assume that the evolution of lineages at the same rate (see also Molecular clock ) takes place and therefore consequently generates an unbalanced tree.

597043
de