Protein structure prediction

Protein structure prediction comprises all methods that computationally determine the three-dimensional structure of the folded molecule from the amino acid sequence of a protein. It is one of the important goals of bioinformatics and theoretical chemistry. The need for it arises from the practical difficulty of measuring the atomic structure of a natural protein with physical methods. Exact atomic positions within the tertiary structure are in particularly high demand; they form the basis for drug design and other methods of biotechnology.

The methods developed so far start from knowledge of the primary structure and postulate the secondary structure and/or the tertiary structure. A further sub-problem is determining the quaternary structure from available tertiary structure data. Implementations of the algorithms are mostly available as source code or as web servers. Because of the enormous importance of a definitive solution to the problem, the CASP competition, held every two years, was established in 1994 to compare the best methods.

Motivation

Determining the natural protein structure by physical methods is possible for many, but not all, proteins, and it involves high costs and a great deal of time. By 2012, the structures of about 50,000 different proteins had been determined by NMR and X-ray structure analysis (this number drops to about 30,000 when only proteins differing by more than 10 percent in sequence are counted). This compares with an estimated more than 30 million protein sequences. There is therefore a great need for a reliable, purely computational method for determining protein structure from the amino acid sequence. The foreseeable acceleration of the sequencing of whole genomes, and even of entire environmental metagenomes, further widens the gap between known primary and tertiary structures and thus increases the urgency of solving the problem.

Secondary structure considerations

Secondary structure prediction is a collection of bioinformatic techniques that aim to predict the secondary structure of proteins and RNA from their primary structure (amino acids or nucleotides). For proteins, the only case considered below, prediction consists of labelling segments of the amino acid sequence as a probable α-helix, β-sheet, β-turn, or as unstructured. Success is measured by comparing the prediction with the result of the DSSP algorithm applied to the actual structure. Beyond these general structural motifs, there are also algorithms for detecting specific, well-defined structural motifs such as transmembrane helices or coiled coils.

The best modern methods of secondary structure prediction reach about 80 percent accuracy, which allows their use in fold recognition, ab initio structure prediction, and sequence alignment. Progress in the accuracy of secondary structure prediction methods is documented by weekly benchmarks such as LiveBench and EVA.
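The comparison against DSSP described above is commonly reported as a per-residue accuracy over three states (often called Q3). The following is a minimal sketch of that measure, not a reference implementation. The function names are hypothetical, and the reduction of the eight DSSP codes to three states (H, G, I to helix; E, B to strand; everything else to coil) is one common convention that is assumed here; other reductions are in use.

```python
# Assumed reduction of DSSP's eight states to three (one common convention).
DSSP_TO_3 = {"H": "H", "G": "H", "I": "H", "E": "E", "B": "E"}


def reduce_dssp(dssp: str) -> str:
    """Map a DSSP string to the three states H (helix), E (strand), C (coil)."""
    return "".join(DSSP_TO_3.get(code, "C") for code in dssp)


def q3_accuracy(predicted: str, observed: str) -> float:
    """Fraction of residues whose predicted state equals the observed state."""
    if len(predicted) != len(observed):
        raise ValueError("predicted and observed strings must have equal length")
    if not predicted:
        raise ValueError("empty input")
    matches = sum(p == o for p, o in zip(predicted, observed))
    return matches / len(predicted)


# Hypothetical example strings, not taken from a real protein.
predicted = "CCHHHHHHCCEEEECC"
observed = reduce_dssp("--HHHHHTS-EEEEE-")
print(q3_accuracy(predicted, observed))
```

An accuracy of about 0.8 on this scale corresponds to the 80 percent figure quoted for the best methods.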

Tertiary structure considerations

Since a complete calculation of the protein structure from first principles (ab initio), by purely physical, energetic, and quantum chemical methods, is too costly even for small proteins, the algorithms that have prevailed for structure prediction rely either on a classification of individual parts of the amino acid sequence or on predicted contact maps, and calculate the final atomic positions only in a second step.

Structural classes / domains

Various statistical methods have emerged for classifying unknown proteins. The most successful use hidden Markov models, which have also proved successful in speech recognition. The corresponding assignments can be downloaded from structural biology databases such as Pfam and InterPro. If the structure of one protein within a class is already known, the structures of the other members can be calculated by comparative prediction. Otherwise, predicting the contact map of the structural class offers a newer method that no longer depends on physical structure determination.

Prediction from evolutionary information

With the availability of large amounts of genomic sequence data, it is possible to examine the co-evolution of amino acids in protein families. One can assume that the three-dimensional structure of proteins within a structurally conserved protein family does not change substantially in the course of evolution. The fold of the protein results from the interactions between the individual amino acids. If a mutation changes one amino acid in the protein, the stability of the protein may be reduced and must be restored by compensatory (correlated) mutations.

Several statistical methods exist for identifying evolutionarily coupled positions within a structurally classified protein family; they take the multiple sequence alignment of the family as input. Early methods used local statistical models that consider only two amino acid positions of the sequence at a time, which yields insufficient predictive accuracy because of transitive effects. Examples are McLachlan-based substitution correlation (McBASC), observed versus expected frequencies of residue pairs (OMES), statistical coupling analysis (SCA), and methods based on mutual information (MI).
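As an illustration of such a local, pairwise measure, mutual information between two alignment columns can be sketched as follows. This is a minimal sketch using plain frequency counts without pseudocounts, sequence weighting, or any correction; the function names and the toy alignment are hypothetical.

```python
import math
from collections import Counter


def column(msa, k):
    """Return column k of a multiple sequence alignment given as equal-length strings."""
    return [seq[k] for seq in msa]


def mutual_information(col_i, col_j):
    """Mutual information (in bits) between two alignment columns."""
    n = len(col_i)
    freq_i = Counter(col_i)
    freq_j = Counter(col_j)
    freq_ij = Counter(zip(col_i, col_j))
    mi = 0.0
    for (a, b), count in freq_ij.items():
        p_ab = count / n
        p_a = freq_i[a] / n
        p_b = freq_j[b] / n
        mi += p_ab * math.log2(p_ab / (p_a * p_b))
    return mi


# Toy alignment: columns 0 and 1 always change together, column 2 varies independently.
msa = ["AKLE", "AKIE", "GRLD", "GRID"]
print(mutual_information(column(msa, 0), column(msa, 1)))  # coupled columns
print(mutual_information(column(msa, 0), column(msa, 2)))  # independent columns
```

Because the score looks at only two columns at a time, a chain of couplings A-B and B-C also raises the score for A-C, which is the transitive effect mentioned above.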

Only global statistical approaches such as the maximum entropy method (inverse Potts model) or partial correlations made it possible to distinguish direct co-evolution between amino acids from indirect, transitive effects. Besides the superiority of global models for contact prediction, it was shown for the first time in 2011 that the predicted amino acid contacts can be used to predict 3D protein structures from sequence information alone. No related structures or fragments are used, and the calculations can be performed within a few hours on an ordinary PC, even for proteins with several hundred amino acids. Subsequent publications showed that transmembrane proteins can also be predicted with considerable accuracy.

Ab initio prediction

Any naive protein structure prediction method (one that uses no prior knowledge) must be able to traverse the astronomically large space of possible structures; the Levinthal paradox illustrates this. Ab initio (or de novo) methods apply only physical principles (quantum chemistry) to the known primary structure in order to simulate the folding process. Other methods start from the set of possible structures and try to optimize a suitable scoring function, which usually includes calculating the Gibbs free energy (Anfinsen's dogma). Such calculations still require a supercomputer and can be performed only for the smallest proteins. The idea of providing computing power for ab initio prediction through distributed computing led to the projects Folding@home, the Human Proteome Folding Project, and Rosetta@home. Despite the computational power required, ab initio prediction is an active area of research.
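The size of the search space behind the Levinthal paradox can be made concrete with a back-of-the-envelope calculation. The numbers below (three conformations per residue, a chain of 100 residues, 10^13 conformations sampled per second) are illustrative assumptions, not values from this article, and the function name is hypothetical.

```python
SECONDS_PER_YEAR = 365.25 * 24 * 3600


def exhaustive_search_estimate(n_residues, states_per_residue=3, samples_per_second=1e13):
    """Return (number of conformations, years needed to enumerate them all).

    All default parameters are illustrative assumptions.
    """
    conformations = states_per_residue ** n_residues
    years = conformations / samples_per_second / SECONDS_PER_YEAR
    return conformations, years


conformations, years = exhaustive_search_estimate(100)
print(f"{conformations:.2e} conformations, about {years:.1e} years")
```

Under these assumptions the enumeration would take on the order of 10^27 years, which is why no practical method searches the conformational space exhaustively.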

Comparative prediction

Comparative protein modeling uses known (physically measured) structures as a starting point or template. This works when a homologous protein of known structure exists. Since protein structures did not evolve arbitrarily but are always tied to a biological function, proteins can be grouped into families that are both structurally homologous and functionally coherent; membership in such a group is easy to detect by machine learning (HMMs; see above). Structural biologists, in turn, aim to measure physically at least one representative protein for each of these groups, so that ideally all remaining protein structures could be predicted by comparison.

Homology Modeling

Within comparative prediction, homology modeling has become the established approach: the amino acid sequence under study is transferred onto a known protein structure (template), peptide bond by peptide bond, and the resulting space filling is examined. From this it can be inferred which structure the examined sequence adopts, as a function of the template structure.

This requires that the template and the target sequence are compatible with a common structural fold and can be aligned with each other, because sequence alignment is the main problem in comparative modeling. Without doubt, the best results are obtained with very similar sequences.
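How similar a target is to its template is often summarised as percent sequence identity over an existing pairwise alignment. The sketch below assumes one particular definition (identical residues divided by the number of columns in which neither sequence has a gap); other denominators are also in use. The function name and the aligned sequences are hypothetical.

```python
def percent_identity(aligned_a, aligned_b, gap="-"):
    """Percent identity over alignment columns where neither sequence has a gap."""
    if len(aligned_a) != len(aligned_b):
        raise ValueError("aligned sequences must have equal length")
    compared = 0
    identical = 0
    for a, b in zip(aligned_a, aligned_b):
        if a == gap or b == gap:
            continue  # gapped columns are ignored under the assumed definition
        compared += 1
        if a == b:
            identical += 1
    if compared == 0:
        return 0.0
    return 100.0 * identical / compared


# Hypothetical template and target rows of a pairwise alignment.
template = "MKT-AYIAK"
target = "MKTLAYLAR"
print(percent_identity(template, target))
```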

Prediction of Contact Maps

Classifying proteins into structural groups made it possible to predict a contact map for a group by calculating coupled positions in the alignment (see above). Structural biologists, for their part, also obtain a contact map as a first result when measuring protein structure physically by NMR. Algorithms for deriving the protein tertiary structure from a contact map were therefore developed early on. In principle, it is now possible to predict the protein structure reliably for any sequence, as long as enough sequences of proteins from the same group are available to determine coupled positions and thus a contact map. With the increasing pace of sequencing, enough bacterial genomes (almost 10,000) are already available to apply the method to them successfully, for example to model membrane proteins. The number of eukaryotic sequences is also sufficient in some cases, and the situation is visibly easing in this respect.
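To make the notion of a contact map concrete, the sketch below derives one from known coordinates: a symmetric Boolean matrix marking residue pairs that lie within a distance cutoff. The 8 Å cutoff, the use of one point per residue (for example the Cα atom), the minimum sequence separation of three residues, and the toy coordinates are all assumptions made for illustration. The code needs Python 3.8 or later for `math.dist`.

```python
import math


def contact_map(coords, cutoff=8.0, min_separation=3):
    """Boolean contact matrix from one 3D point per residue.

    Pairs closer in sequence than min_separation are never marked,
    since chain neighbours are trivially close in space.
    """
    n = len(coords)
    cmap = [[False] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + min_separation, n):
            if math.dist(coords[i], coords[j]) <= cutoff:
                cmap[i][j] = cmap[j][i] = True
    return cmap


def contacts(cmap):
    """List the contacting pairs (i, j) with i < j."""
    n = len(cmap)
    return [(i, j) for i in range(n) for j in range(i + 1, n) if cmap[i][j]]


# Toy hairpin: two antiparallel strands 5 Å apart, 3.8 Å between neighbours.
coords = [
    (0.0, 0.0, 0.0), (3.8, 0.0, 0.0), (7.6, 0.0, 0.0), (11.4, 0.0, 0.0),
    (11.4, 5.0, 0.0), (7.6, 5.0, 0.0), (3.8, 5.0, 0.0), (0.0, 5.0, 0.0),
]
print(contacts(contact_map(coords)))
```

Contact prediction works in the opposite direction: the matrix is estimated from coupled alignment positions, and the coordinates are then reconstructed from it.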

Prediction of side-chain geometry

The exact placement of the amino acid side chains is a separate problem within protein structure prediction. Here the protein backbone is assumed to be rigid, and the possible conformations (rotamers) of the individual side chains are varied so that the total energy is minimized. Methods designed specifically for side-chain prediction include dead-end elimination (DEE) and self-consistent mean field (SCMF). Both use rotamer libraries, which list conformations known from experience to be favorable, together with detailed data. These libraries can be indexed as backbone-independent, secondary-structure-dependent, or backbone-dependent.

Side-chain prediction is particularly useful for determining the hydrophobic protein core, where the side chains are packed most tightly; it is less suitable for the flexible surface regions, where the number of possible rotamers increases significantly.
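The idea behind dead-end elimination can be sketched on a toy problem. The code below applies the original elimination criterion: a rotamer r at position i is discarded if some alternative rotamer t at the same position has a worst-case energy that is still lower than the best-case energy of r. Energies are split into a self term per rotamer and a pairwise term per rotamer pair. All names and all energy values are hypothetical (arbitrary units), and refinements such as the Goldstein criterion are left out.

```python
from itertools import product


def pair_energy(pair, i, r, j, s):
    """Pairwise energy between rotamer r at position i and rotamer s at position j."""
    return pair[(i, j)][r][s] if i < j else pair[(j, i)][s][r]


def total_energy(self_e, pair, assignment):
    """Total energy of one rotamer per position."""
    n = len(assignment)
    energy = sum(self_e[i][assignment[i]] for i in range(n))
    for i in range(n):
        for j in range(i + 1, n):
            energy += pair[(i, j)][assignment[i]][assignment[j]]
    return energy


def dee_pass(self_e, pair, alive):
    """One sweep of the original DEE criterion over all positions."""
    n = len(self_e)
    new_alive = [set(a) for a in alive]

    def bound(i, r, pick):
        # pick=min gives the best case for rotamer r, pick=max the worst case.
        return self_e[i][r] + sum(
            pick(pair_energy(pair, i, r, j, s) for s in new_alive[j])
            for j in range(n) if j != i
        )

    for i in range(n):
        for r in sorted(new_alive[i]):
            best_r = bound(i, r, min)
            if any(best_r > bound(i, t, max) for t in new_alive[i] if t != r):
                new_alive[i].discard(r)  # r cannot be part of the global minimum
    return new_alive


def dead_end_elimination(self_e, pair):
    """Repeat DEE sweeps until no further rotamer can be eliminated."""
    alive = [set(range(len(e))) for e in self_e]
    while True:
        reduced = dee_pass(self_e, pair, alive)
        if reduced == alive:
            return alive
        alive = reduced


def best_assignment(self_e, pair, alive):
    """Brute-force search over the surviving rotamers."""
    candidates = product(*[sorted(a) for a in alive])
    return min(candidates, key=lambda a: total_energy(self_e, pair, a))


# Toy system: position 0 has three rotamers, position 1 has two.
self_e = [[0.0, 1.0, 5.0], [0.0, 2.0]]
pair = {(0, 1): [[0.0, 1.0], [-2.0, 0.5], [0.5, 0.0]]}
alive = dead_end_elimination(self_e, pair)
print(alive, best_assignment(self_e, pair, alive))
```

In this toy case elimination alone leaves a single rotamer per position; in general a search over the survivors is still needed, but over a much smaller space.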

Quaternary structure considerations

When laboratory results show that a protein forms a complex with another or an identical protein, and the tertiary structure(s) are available, docking software can be used to find out how the proteins in the complex are oriented relative to each other (quaternary structure). In addition, genomic data provide contact maps that allow inferences about contact positions, since these are functionally coupled. This also applies to protein-protein interactions, in which case contact positions between gene pairs of the same species are considered. First applications to toxin-antitoxin systems and other signaling networks in bacteria have already been presented.
