Bioinformatics (also computational biology) is an interdisciplinary science that solves problems in the life sciences with theoretical, computer-based methods. It has contributed to fundamental insights in modern biology and medicine. Bioinformatics came to wider media attention primarily in 2001, through its significant contribution to the sequencing of the human genome.

Bioinformatics is a broad field of research, both in the problems it addresses and in the methods it uses. Its major areas are the management and integration of biological data, sequence analysis, structural bioinformatics, and the analysis of data from high-throughput methods (~omics). Because bioinformatics is indispensable for analyzing data on a large scale, it forms an essential pillar of systems biology.

In English usage, bioinformatics is often contrasted with computational biology, which covers a wider area than classical bioinformatics; usually, however, the two terms are used interchangeably.

Bioinformatics has become an established, independent science. It is one of the basic sciences of biology and medicine and can be studied as such at many locations in Germany (see also: Science in Bioinformatics).

Data management

The rapidly growing amount of biological data places special demands on how these data are handled. This applies in particular to DNA and protein sequences and their annotation, 3D protein structures, interactions of biological molecules, and high-throughput data, for example from microarrays. An important task of bioinformatics is therefore to process this information and store it in suitably indexed and cross-linked biological databases. The advantages lie in a uniform structure, easy searchability, and the automation of analyses by software.

One of the oldest biological databases is the Protein Data Bank (PDB), which holds data on the 3D structures of biological macromolecules, mostly proteins. In the 1980s, databases were set up for managing nucleotide sequences (EMBL Data Library, GenBank) and amino acid sequences (Protein Information Resource, Swiss-Prot). The nucleotide sequence databases united in the International Nucleotide Sequence Database Collaboration are primary databases, that is, archives of original data submitted by the researchers themselves. UniProt, by contrast, the merger of PIR and Swiss-Prot, offers high-quality, expert-curated and annotated protein sequence entries with extensive information about each protein. These are supplemented by protein sequences translated automatically from the EMBL database without further annotation.

Other databases contain recurring motifs in protein sequences (Pfam), information on enzymes and biochemical components (BRENDA, KEGG LIGAND and ENZYME), on protein-protein or protein-DNA interactions (TRANSFAC), on metabolic and regulatory networks (KEGG, Reactome), and much more.

The size of the individual databases is growing, in some cases exponentially. The number of relevant databases also continues to grow (over 350 worldwide). Bioinformatic meta search engines (Bioinformatics Harvester, Entrez, EBI SRS) are therefore often used when searching for relevant information.

The variety of globally available databases often leads to redundant and thus error-prone data storage; DNA sequences in particular exist partly as fragments and partly as fully assembled genomes. Ideally, the storage of genomic and proteomic data would allow the regulatory circuits of a whole organism to be reconstructed. Intensive work is under way on the mappings this requires: from identified proteins to the genes that encode them and vice versa, on the links among them that represent their interactions, and on the assignment of proteins to metabolic and regulatory pathways.

Another task in data integration is the creation of controlled vocabularies and ontologies, which allow function names to be assigned consistently across all levels. The Gene Ontology (GO) Consortium is currently trying to create a consistent nomenclature for the molecular function, biological process, and cellular localization of gene products.

Sequence analysis

The first pure bioinformatics applications were developed for DNA sequence analysis and sequence comparison. Sequence analysis is primarily concerned with quickly finding patterns in protein or DNA sequences. Sequence comparison (sequence alignment) asks whether two genes or proteins are related to each other ("homologous"). To answer this, the sequences are placed one above the other and aligned so that the best possible match is achieved. If the agreement is significantly better than would be expected from chance similarity, relatedness can be inferred; for genes and proteins, relatedness always implies similar structure and usually similar function. The central importance of sequence comparison for bioinformatics therefore lies in its use for predicting the structure and function of unknown genes.

Both dynamic programming algorithms and heuristic algorithms are used for this purpose. Dynamic programming yields optimal solutions but, because of the computational resources it requires, is in practice not applicable to very long sequences or very large databases. Heuristic algorithms are suited to searching the large, globally available databases that archive all known sequences. Although they do not guarantee optimal results, they perform so well that the daily work of bioinformaticians and molecular biologists would not be possible without, for example, the BLAST algorithm. Other commonly used algorithms, which serve different purposes depending on the application, are FASTA, Needleman-Wunsch, and Smith-Waterman.
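The dynamic programming approach can be illustrated with a minimal sketch of global alignment in the style of Needleman-Wunsch. The function name and the scoring values (match +1, mismatch -1, linear gap penalty -1) are assumptions chosen for illustration; real tools use substitution matrices and more refined gap models.

```python
def needleman_wunsch(a, b, match=1, mismatch=-1, gap=-1):
    """Return (score, aligned_a, aligned_b) for one optimal global alignment.

    Scoring values are illustrative assumptions, not a standard.
    """
    n, m = len(a), len(b)

    def sub(i, j):
        # Score for aligning a[i-1] with b[j-1].
        return match if a[i - 1] == b[j - 1] else mismatch

    # score[i][j] = best score for aligning a[:i] with b[:j].
    score = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        score[i][0] = i * gap
    for j in range(1, m + 1):
        score[0][j] = j * gap
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            score[i][j] = max(
                score[i - 1][j - 1] + sub(i, j),  # align two residues
                score[i - 1][j] + gap,            # gap in b
                score[i][j - 1] + gap,            # gap in a
            )

    # Trace back from the bottom-right corner to recover one alignment.
    out_a, out_b = [], []
    i, j = n, m
    while i > 0 or j > 0:
        if i > 0 and j > 0 and score[i][j] == score[i - 1][j - 1] + sub(i, j):
            out_a.append(a[i - 1])
            out_b.append(b[j - 1])
            i, j = i - 1, j - 1
        elif i > 0 and score[i][j] == score[i - 1][j] + gap:
            out_a.append(a[i - 1])
            out_b.append("-")
            i -= 1
        else:
            out_a.append("-")
            out_b.append(b[j - 1])
            j -= 1
    return score[n][m], "".join(reversed(out_a)), "".join(reversed(out_b))
```

The table has (n+1) x (m+1) cells, so time and memory grow with the product of the two sequence lengths. This is the cost that makes the method impractical for searching very large databases.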

Less often, biological questions require searching for exact matches of short sequence segments, typically for the cleavage sites of restriction enzymes in DNA sequences, and possibly also for sequence patterns in proteins from the PROSITE database.
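An exact search of this kind needs no alignment at all. The following sketch, with a hypothetical helper name `find_sites`, reports every position at which a recognition sequence occurs. The example uses GAATTC, the recognition sequence of the restriction enzyme EcoRI; the DNA string itself is made up.

```python
def find_sites(dna, site):
    """Return the 0-based start positions of every exact occurrence of site."""
    dna, site = dna.upper(), site.upper()
    if not site:
        return []
    positions = []
    start = dna.find(site)
    while start != -1:
        positions.append(start)
        # Advance by one position so that overlapping hits are also found.
        start = dna.find(site, start + 1)
    return positions


ECORI_SITE = "GAATTC"  # recognition sequence of EcoRI
hits = find_sites("AAGAATTCGGGAATTCTT", ECORI_SITE)  # made-up example sequence
```

For a non-palindromic site, the reverse complement would have to be searched as well; that step is omitted here.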

Bioinformatics plays a major role in genome analysis. The DNA fragments, which are sequenced in small units, are joined into a complete sequence by bioinformatic methods.
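One simple way to picture this joining step is a greedy merge of the fragments with the longest suffix-prefix overlap. The sketch below is a toy illustration under strong assumptions (error-free reads, a single strand, no repeats, no fragment fully contained in another); the function names and the minimum overlap of 3 are hypothetical, and production assemblers use considerably more elaborate methods.

```python
def overlap(a, b, min_len=3):
    """Length of the longest suffix of a that equals a prefix of b.

    Returns 0 if no overlap of at least min_len characters exists.
    """
    for k in range(min(len(a), len(b)), min_len - 1, -1):
        if a.endswith(b[:k]):
            return k
    return 0


def greedy_assemble(fragments, min_len=3):
    """Repeatedly merge the pair of fragments with the largest overlap."""
    frags = list(dict.fromkeys(fragments))  # drop exact duplicates, keep order
    while len(frags) > 1:
        best_k, best_i, best_j = 0, None, None
        for i, a in enumerate(frags):
            for j, b in enumerate(frags):
                if i != j:
                    k = overlap(a, b, min_len)
                    if k > best_k:
                        best_k, best_i, best_j = k, i, j
        if best_k == 0:
            break  # nothing left to merge; the remaining contigs are returned
        merged = frags[best_i] + frags[best_j][best_k:]
        frags = [f for idx, f in enumerate(frags) if idx not in (best_i, best_j)]
        frags.append(merged)
    return frags


contigs = greedy_assemble(["ATGGCGT", "GCGTACG", "TACGTTA"])
```

If some fragments share no sufficiently long overlap, the function returns several contigs rather than a single sequence.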

Furthermore, methods have been developed for finding genes in unknown DNA sequences (gene prediction, also called gene finding). This problem is addressed with various computational methods and algorithms, including statistical sequence analysis, Markov chains, and artificial neural networks for pattern recognition.
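As a rough sketch of how a Markov chain can be used here, the code below estimates first-order transition probabilities from two sets of training sequences and scores a new sequence by its log-odds under the two models. The function names, the pseudocount of 1, and the idea of training one model on known genes and one on background sequence are assumptions for illustration; actual gene finders use higher-order and hidden Markov models.

```python
import math

ALPHABET = "ACGT"


def train_markov(sequences, pseudocount=1.0):
    """Estimate first-order transition probabilities P(next | current).

    Sequences must contain only A, C, G and T; the pseudocount keeps every
    probability above zero.
    """
    counts = {x: {y: pseudocount for y in ALPHABET} for x in ALPHABET}
    for seq in sequences:
        for cur, nxt in zip(seq, seq[1:]):
            counts[cur][nxt] += 1
    probs = {}
    for x in ALPHABET:
        total = sum(counts[x].values())
        probs[x] = {y: counts[x][y] / total for y in ALPHABET}
    return probs


def log_odds(seq, model, background):
    """Sum of log2 ratios of the transition probabilities under two models.

    A positive value means seq looks more like the model's training data
    than like the background's.
    """
    return sum(
        math.log2(model[cur][nxt] / background[cur][nxt])
        for cur, nxt in zip(seq, seq[1:])
    )
```

In use, `model` would be trained on sequences known to be genes and `background` on sequences known not to be; a window sliding along an unknown sequence could then be scored with `log_odds`.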

Phylogenetic trees can be constructed from both DNA and amino acid sequences. They represent the evolutionary development of today's organisms from mostly unknown and therefore hypothetical ancestors.
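One family of tree-building methods starts from a matrix of pairwise distances between the sequences. The sketch below covers only that first step, assuming the sequences are already aligned to equal length and using the proportion of differing positions as the distance. The function names and example sequences are hypothetical, and building the tree itself from the matrix is not shown.

```python
def p_distance(a, b):
    """Proportion of positions at which two aligned sequences differ."""
    if len(a) != len(b):
        raise ValueError("sequences must be aligned to the same length")
    if not a:
        return 0.0
    return sum(x != y for x, y in zip(a, b)) / len(a)


def distance_matrix(aligned):
    """Map each pair of names to the p-distance of their sequences.

    aligned is a dict from a sequence name to an aligned sequence.
    """
    names = list(aligned)
    return {
        s: {t: p_distance(aligned[s], aligned[t]) for t in names}
        for s in names
    }


# Made-up example: three short aligned sequences.
dist = distance_matrix({"a": "ACGT", "b": "ACGA", "c": "TCGA"})
```

Here `dist["a"]["c"]` is larger than `dist["a"]["b"]`, so a distance-based method would group a and b before joining either of them with c directly.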


With the elucidation and extensive functional analysis of the complete genomes of various organisms, the focus of bioinformatics work shifted to questions of proteomics, such as the problem of protein folding and structure prediction, that is, the question of the secondary or tertiary structure for a given amino acid sequence. The interaction of proteins with various ligands (nucleic acids, other proteins, or smaller molecules) is also studied. Besides results for basic research, it can yield important information for medicine and pharmacy, for example about how a protein altered by a mutation affects bodily functions, or about which drugs act on which proteins and in what way.