Protein family

A protein family is a group of structurally similar proteins that are evolutionarily related and are encoded by the corresponding gene families. The terms gene family and protein family are often used interchangeably, depending on whether homology is considered at the level of the genome and DNA (genes) or at the level of gene expression, biosynthesis and biological function (proteins).

Classifying proteins into families on the basis of their amino acid sequence and the architecture of the protein domains within that sequence aids the theoretical understanding of the evolutionary origin of these families and has practical applications in biotechnology and diagnostics.

Basics

Evolution of protein families

A protein family can expand, or a new family can emerge, in various ways; the different mechanisms are not mutually exclusive:

Formation of homologous genes: Two populations of the same species are separated, for example geographically, and evolve independently. Mutations arise in the genomes of the progeny and lead to changes in the expressed proteins (that is, modifications of the primary structure, which in turn affect the stability and function of the protein). Depending on the different living conditions, these mutations are subject to natural selection. Over time, a gene coding for a protein with slightly different properties thereby becomes established in the subpopulation. Such selection, together with genetic drift, gives rise in one of the two separated populations to a homologous variant of the protein within the family or, after further and longer-lasting change, to an orthologous protein family whose amino acid sequence is mostly still similar.

Formation of paralogous genes: Another possibility is a change in a gene through complete or partial gene duplication (or multiplication). This produces a copy of the gene and results in a gene cluster with paralogous sequences. Because one of the genes can still fulfill the original function, the other is free to diverge. Further mutations can then give rise to new functions in the resulting proteins.

Some gene and protein families have been "extended" in the course of evolution by gene or genome duplication (e.g. the opsin gene duplication on the X chromosome in Old World monkeys).

Use of terms

The term protein family is not used uniformly in the literature; its meaning depends on the context. A protein family may comprise several very large groups of proteins that share only the lowest mathematically detectable level of sequence homology (and thus have very different biological functions), or it may refer to very narrow groups of proteins that, compared with one another, have almost identical sequences, three-dimensional structures and functions.

When Margaret Oakley Dayhoff introduced the systematics of the protein superfamily in the mid-1970s, only 493 protein sequences were known. These were mostly small proteins with only one protein domain, such as myoglobin, hemoglobin and cytochrome c, which Dayhoff and colleagues divided into 116 superfamilies. The designations superfamily > family > subfamily allowed a gradation, and numerical definitions were specified for each level.

In parallel, other terms such as protein class, protein group and protein subfamily have been coined and used over the years. These terms are used ambiguously, depending on the context.

Importance of understanding protein families

The total number of proteins from living organisms and viruses that have been sequenced, either directly or indirectly via their genes, is constantly increasing and calls for a meaningful structuring and classification based on the biological facts. Some researchers put the number of protein families at 60,000 or more.

On the one hand, there is a theoretical interest in better understanding how different genes, and the functions of the proteins they encode, have changed and developed over the course of evolution. On the other hand, there are very specific applications in which knowledge of the relationships between protein families and of domain architecture plays an important role. Examples are enzymatic synthesis in industrial biotechnology, the development of new vaccines from "made to measure" recombinant proteins, and the field of medical analysis (proteomics).

Sequence comparisons by phylogenetic and cluster analysis allow proteins to be assigned to families, and these families in turn to parent superfamilies. From these assignments, theoretical inferences can be drawn about the potential secondary and tertiary structure of newly discovered proteins, and they open up possible approaches for elucidating unknown functions.
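The idea of assigning proteins to families by sequence comparison and cluster analysis can be illustrated with a minimal sketch. It assumes that the sequences are already aligned to equal length, uses plain percent identity as the similarity measure and groups sequences by single-linkage clustering. The sequences, the names and the threshold of 0.6 are invented for illustration; real classification pipelines use proper alignment algorithms and statistically founded scores.

```python
from itertools import combinations


def percent_identity(a: str, b: str) -> float:
    """Fraction of identical positions in two pre-aligned, equal-length sequences."""
    if len(a) != len(b):
        raise ValueError("sequences must be pre-aligned to equal length")
    return sum(x == y for x, y in zip(a, b)) / len(a)


def cluster_families(seqs: dict[str, str], threshold: float) -> list[set[str]]:
    """Single-linkage clustering: two sequences end up in the same group
    if they are connected by a chain of pairs with identity >= threshold."""
    parent = {name: name for name in seqs}

    def find(x: str) -> str:
        # Follow parent links to the representative of x's group.
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    for a, b in combinations(seqs, 2):
        if percent_identity(seqs[a], seqs[b]) >= threshold:
            parent[find(a)] = find(b)  # merge the two groups

    groups: dict[str, set[str]] = {}
    for name in seqs:
        groups.setdefault(find(name), set()).add(name)
    return list(groups.values())


# Hypothetical toy sequences (not real protein data).
toy = {
    "globin_a": "MVLSPADKTN",
    "globin_b": "MVLSGEDKSN",
    "globin_c": "MVHSAEDKSN",
    "kinase_a": "GQGTFGKVIL",
    "kinase_b": "GQGSFGKVYL",
}
families = cluster_families(toy, threshold=0.6)
for members in families:
    print(sorted(members))
```

With these toy data the three "globin" sequences form one group and the two "kinase" sequences another. Lowering the threshold merges groups into broader, superfamily-like clusters, which mirrors the gradation from family to superfamily.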

Classification systems

There are several systems for the classification of protein families, which differ in their approach and scheme. One such system is described in detail here.

PIRSF classification

The Universal Protein Resource (UniProt) database, which arose in 2002 from the merger of the databases TrEMBL of the European Bioinformatics Institute (EBI), Swiss-Prot of the Swiss Institute of Bioinformatics (SIB) and the Protein Information Resource (PIR) of the Georgetown University Medical Center (GUMC), uses the PIR superfamily classification system (PIRSF).

Terminology

Initially, the PIR classification into superfamily, family and subfamily, which was based on Dayhoff's work, had a linear hierarchical structure: a protein could be assigned to only a single protein family, and that family to only a single superfamily. This system had to be revised as more and more primary structures became known (through direct sequencing of purified proteins, but above all through reading the protein sequences encoded in sequenced genes). It was recognized that some proteins were structurally rather simple, while others had very complex structures:

  • Homeomorphic proteins are proteins that are "topologically equivalent", meaning that they are homologous from the N-terminus to the C-terminus and have the same type, number and arrangement of domains (also called domain structure or domain architecture), but may differ in sequence length.
  • Domain proteins have a complex construction resulting from gene fusions, deletions and/or insertions, and contain different domains (or domains in a different arrangement) that are otherwise found only in very different homeomorphic proteins.

From 1993 onward, PIR therefore distinguished between homeomorphic superfamilies and domain superfamilies.
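The distinction between homeomorphic proteins and proteins that merely share individual domains can be sketched as a comparison of domain architectures. The sketch below is an illustration only: the `Protein` class, the protein names and the domain labels "A" to "D" are hypothetical, and the check covers only the domain architecture, not the additional requirement of homology over the full length of the sequence.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Protein:
    name: str
    length: int                # sequence length in residues
    domains: tuple[str, ...]   # domain labels in N- to C-terminal order


def same_architecture(a: Protein, b: Protein) -> bool:
    """True if both proteins have the same type, number and order of domains.
    Sequence length is deliberately ignored."""
    return a.domains == b.domains


def shared_domains(a: Protein, b: Protein) -> set[str]:
    """Domain types that occur in both proteins, regardless of arrangement."""
    return set(a.domains) & set(b.domains)


# Hypothetical proteins with abstract domain labels.
p1 = Protein("protein_1", 310, ("A", "B"))
p2 = Protein("protein_2", 355, ("A", "B"))
p3 = Protein("protein_3", 720, ("C", "A", "D"))

print(same_architecture(p1, p2))  # candidates for one homeomorphic group
print(same_architecture(p1, p3))  # different architecture
print(shared_domains(p1, p3))     # common domain, basis for a domain superfamily
```

Here `protein_1` and `protein_2` differ in length but share the architecture A-B, whereas `protein_3` shares only domain A with them and would at most be related at the level of a domain superfamily.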

Rules

The PIRSF system is based on the following rules:

  • A new protein is not assigned to a superfamily, family or subfamily automatically but manually; the results of automatic sequence alignment and cluster analysis are consulted in the process.
  • Each entry is annotated in as much detail as possible, and other classification schemes as well as entries from other, similar databases are cited.
  • So that both the biochemical and the biological functions of a protein can be described clearly, and so that proteins with less well defined (or undefined) domains can also be classified, the PIRSF system is based on the classification of entire proteins and not on the classification of individual or isolated domains.
  • A hierarchical structure cannot represent the domain shuffling that occurred in the course of evolution. The PIRSF system is therefore "a network type classification system based on the evolutionary relationship of entire proteins." The primary network nodes (primary nodes, parent nodes) are homeomorphic protein families. These contain proteins that are both homologous (orthologous or paralogous, i.e. derived from a common ancestral protein) and homeomorphic, i.e. that show similarity over the entire length of the primary structure and a similar arrangement of the domain(s). Defined parameters are used in the mathematical algorithms that determine "similarity" by sequence alignment.
  • Above the nodes of the homeomorphic protein families are the nodes of the (domain) superfamilies. These evolutionarily more distant superfamilies (which may also include individual proteins not yet assigned to any family) are based on domains that are common to the families beneath them. A lower-lying homeomorphic protein family can, but need not, be associated with several higher-lying domain superfamilies. The superfamilies arranged above can be homeomorphic protein superfamilies, but they are more likely to be domain superfamilies when the protein regions that make up the shared domains do not extend over the entire length of the protein.
  • Below the homeomorphic protein families are the nodes of the subfamilies (child "subfamily" nodes): homologous and homeomorphic groups (clusters) of proteins with functional specialization and/or a variation of the domain architecture within the protein family. Each subfamily has only one parent network node (parent node).
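The network structure described by these rules can be sketched as a small data structure in which a family may have several superfamily parents, while a subfamily has exactly one parent family. This is a simplified illustration and not the actual PIRSF data model; the class `ClassificationNetwork` and all node names are hypothetical.

```python
from collections import defaultdict


class ClassificationNetwork:
    """Toy model of a network-type classification with three node levels."""

    LEVELS = ("superfamily", "family", "subfamily")
    # Allowed (child level, parent level) combinations.
    ALLOWED_LINKS = {("family", "superfamily"), ("subfamily", "family")}

    def __init__(self) -> None:
        self.level: dict[str, str] = {}
        self.parents: defaultdict[str, set[str]] = defaultdict(set)

    def add_node(self, name: str, level: str) -> None:
        if level not in self.LEVELS:
            raise ValueError(f"unknown level: {level}")
        self.level[name] = level

    def link(self, child: str, parent: str) -> None:
        pair = (self.level[child], self.level[parent])
        if pair not in self.ALLOWED_LINKS:
            raise ValueError(f"cannot link {pair[0]} to {pair[1]}")
        existing = self.parents[child]
        # Rule: a subfamily has only one parent node.
        if self.level[child] == "subfamily" and existing and parent not in existing:
            raise ValueError("a subfamily has exactly one parent node")
        existing.add(parent)

    def ancestors(self, name: str) -> set[str]:
        """All nodes reachable by following parent links upward."""
        seen: set[str] = set()
        stack = [name]
        while stack:
            for p in self.parents.get(stack.pop(), ()):
                if p not in seen:
                    seen.add(p)
                    stack.append(p)
        return seen


net = ClassificationNetwork()
net.add_node("SF_domain_A", "superfamily")
net.add_node("SF_domain_B", "superfamily")
net.add_node("FAM_1", "family")
net.add_node("FAM_2", "family")
net.add_node("SUB_1a", "subfamily")

net.link("FAM_1", "SF_domain_A")   # a family may belong to ...
net.link("FAM_1", "SF_domain_B")   # ... several domain superfamilies
net.link("SUB_1a", "FAM_1")        # a subfamily has a single parent family

print(sorted(net.ancestors("SUB_1a")))
```

Because a family can point to more than one superfamily, the result is a network rather than a strict tree, which is what allows domain shuffling to be represented.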

Examples of protein (super)families

The following is a non-exhaustive list of protein families and superfamilies.

  • Immunoglobulin
  • Histone H1F
  • Histone H1H1
  • Histone H2AF
  • Histone H2A1
  • Histone H2A2
  • Histone H2BF
  • Histone H2B1
  • Histone H2B2
  • Histone H3A1
  • Histone H3A2
  • Histone H3A3
  • Histone H41
  • Histone H44
  • Dynein
  • Kinesin
  • Myosin
  • Histidine kinase
  • Receptor tyrosine kinase