CpG island

CpG islands are regions in the genome of eukaryotes with a statistically elevated density of CpG dinucleotides. This density is measured relative to the single-nucleotide and dinucleotide frequencies across the genome segment under consideration. CpG stands for cytosine-phosphate-guanine. The "p" is included to distinguish the CG dinucleotide meant here, which lies within a single strand of DNA, from the C-G base pair of a DNA double helix; it stands for the phosphodiester bond between the two nucleosides carrying cytosine and guanine. CpG islands are defined as DNA segments between 0.5 kb and 2 kb in length, located within eukaryotic promoters, with a GC content above 60%. By comparison, the GC content of the whole genome is 41%. CpG islands arise from mechanisms connected with the use of the genetic material as an information carrier. As a result, they are important markers in genetics, medicine, and bioinformatics, among other fields.

Origin and Meaning

In mammals, between 2% and 7% of a cell's cytosines are methylated, depending on the species. These are mostly the cytosines of 5'-CpG-3' dinucleotides, which carry a methyl group on both complementary DNA strands and so form a palindromic methylation pattern. When both cytosines in this configuration are methylated, together they alter the three-dimensional structure of the major groove of the double-stranded DNA. A large proportion (about 60-90%) of all CpGs in mammalian genomes is methylated; unmethylated CpG dinucleotides are found mainly within CpG islands.

The average GC content in humans is 41%, from which a frequency of about 4% would be expected for the CpG dinucleotide in the genome. In fact, CpG dinucleotides are strongly under-represented at 0.8%, which is explained mainly by the relatively frequent spontaneous deamination of 5-methylcytosine to thymine (see the explanation and scheme below). The CpG dinucleotide density in CpG islands is thus 10-20 times higher than the average in the rest of the vertebrate genome.
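The arithmetic behind the expected 4% can be reproduced with a short sketch. It assumes that C and G are equally frequent (20.5% each at 41% GC content) and that neighbouring bases are independent; the helper names `gc_content` and `cpg_obs_exp` are illustrative and do not come from any particular library.

```python
def gc_content(seq: str) -> float:
    """Fraction of bases in seq that are G or C."""
    seq = seq.upper()
    return (seq.count("G") + seq.count("C")) / len(seq)


def cpg_obs_exp(seq: str) -> float:
    """Observed/expected CpG ratio of seq.

    Expected count assumes independent neighbouring bases:
    (#C * #G) / length.
    """
    seq = seq.upper()
    expected = seq.count("C") * seq.count("G") / len(seq)
    observed = seq.count("CG")
    return observed / expected if expected else 0.0


# Genome-wide figures quoted in the text.
gc_fraction = 0.41
p_c = p_g = gc_fraction / 2          # assumption: C and G equally frequent
expected_cpg = p_c * p_g             # about 0.042, i.e. roughly 4%
observed_cpg = 0.008                 # 0.8% as stated in the text
obs_exp_genome = observed_cpg / expected_cpg

print(f"expected CpG frequency: {expected_cpg:.3f}")
print(f"observed/expected:      {obs_exp_genome:.2f}")
```

Under these assumptions the genome-wide observed/expected ratio comes out at roughly one fifth, which is the depletion described above. Applied to a candidate window, `gc_content` and `cpg_obs_exp` give the two quantities usually inspected when screening for CpG islands.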

CpG islands are also used to regulate the expression (reading) of genes and thus constitute a mechanism of epigenetic gene regulation. Methylation of the CpG islands of a gene means that the gene is not read (gene repression). Approximately 40-45% of all human genes have CpG islands in their promoter regions.

Methylation of CpG islands plays a role both in the development of cancer (as a mechanism for switching off tumor suppressor genes) and in genomic imprinting.

The two cytosines in a CpG dinucleotide in the human genome are mostly methylated (DNA methylation). In some regions, however, methylation is permanently suppressed. These regions are often CpG islands, and genes (more precisely, their promoter regions) are often located there. The CpGs in these regions are usually unmethylated and thus escape a mutational pressure, which is described below:

Cytosines are chemically unstable. In the cell they can undergo spontaneous deamination (the -NH2 group is replaced by =O). Methylated cytosine thereby becomes thymine, whereas unmethylated cytosine (for example, in CpG islands) becomes uracil. Thymine is a "normal" nucleobase of DNA, while uracil is not. Uracil, actually an RNA base, is recognized very reliably and replaced by cytosine; the DNA repair mechanisms of the cell use the guanine on the opposite DNA strand as the template for error correction. In methylated CpG dinucleotides, by contrast, deamination produces thymine. This "error" is tolerated much more often than uracil and leads to a permanent mutation.

The following scheme shows the possible mutations caused by deamination and the outcome of either DNA repair or permanent fixation of the mutation.

                  1                  2                     3
Methylated:
      m                                                 |      m
a)   -CpG-   deamination   -TpG-   often        -CpG-   |  →  -CpG-
     -GpC-                 -GpC-                -GpC-   |     -GpC-
        m                     m                    m    |        m

b)                                 rare         -TpG-   |  →  -TpG-
                                                -ApC-   |     -ApC-
                                                   m    |

Unmethylated:
c)   -CpG-   deamination   -UpG-   very often   -CpG-   |
     -GpC-                 -GpC-                -GpC-   |

d)                                 very rare    -UpG-   |  →  -TpG-
                                                -ApC-   |     -ApC-

Legend of the scheme: Shown are two CpG dinucleotides, one located in a methylated region [a) and b)] and the other in an unmethylated region, for example a CpG island [c) and d)]. The letter m marks a methyl group on the cytosine of the respective strand.

1 Deamination produces a new dinucleotide in which complementary base pairing is disrupted.

2 Two routes are available for the subsequent restoration of complementary base pairing, and they occur with different probabilities. The difference between a) and b) (often versus rare) arises because the CpG on the opposite strand is methylated. The DNA repair system therefore treats that strand as the "older", conserved strand in this region. The large difference between c) and d) (very often versus very rare) is due to the fact that uracil is not a DNA base.

3 Following the mutational events, incorrect methylation marks or nucleobases are replaced where necessary.

Bioinformatic Analysis

Locating CpG islands with the help of Markov chains

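Let $c^+_{st}$ denote the number of st pairs (base s immediately followed by base t) within CpG islands, and $c^-_{st}$ the number elsewhere (outside CpG islands). The transition probabilities are obtained by maximum likelihood:

$$a^+_{st} = \frac{c^+_{st}}{\sum_{t'} c^+_{st'}} \quad\text{and}\quad a^-_{st} = \frac{c^-_{st}}{\sum_{t'} c^-_{st'}}$$

They are estimated from sequence segments that are known to be, or known not to be, CpG islands. Now consider an unknown sequence $X = x_1 \dots x_L$ and ask: "Is it a CpG island?" Notation:

  • P(+ | X): probability that X is a CpG island
  • P(- | X): probability that X is not a CpG island

In addition, a score function is defined:

$$\mathrm{Score}(X) = \log\frac{P(+ \mid X)}{P(- \mid X)} = \log\frac{P(X \mid +)\,P(+)}{P(X \mid -)\,P(-)}, \qquad P(X \mid \pm) = P(x_1 \mid \pm)\prod_{i=2}^{L} a^{\pm}_{x_{i-1}x_i}$$

The prior P(+) is taken to be the total length of all CpG islands relative to the total length of the genome.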
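A minimal sketch of this Markov-chain classifier follows. Several things in it are assumptions made for illustration rather than part of the method as described: the function names, the tiny made-up training sequences, the pseudocount of 1 (added so that unseen pairs do not produce zero probabilities, which departs from the pure maximum-likelihood estimate), the neutral prior of 0.5, and the omission of the first-base term, which is treated as equal under both models.

```python
import math

BASES = "ACGT"


def train_transitions(seqs, pseudocount=1.0):
    """Estimate transition probabilities a[s][t] from dinucleotide counts.

    pseudocount is an illustrative smoothing assumption; with
    pseudocount=0 this is the plain maximum-likelihood estimate.
    """
    counts = {s: {t: pseudocount for t in BASES} for s in BASES}
    for seq in seqs:
        for s, t in zip(seq, seq[1:]):
            counts[s][t] += 1
    return {
        s: {t: counts[s][t] / sum(counts[s].values()) for t in BASES}
        for s in BASES
    }


def log_odds_score(x, a_plus, a_minus, prior_plus):
    """Log-odds of 'CpG island' versus 'not CpG island' for sequence x.

    The first-base term is omitted (assumed equal under both models).
    """
    score = math.log(prior_plus / (1.0 - prior_plus))
    for s, t in zip(x, x[1:]):
        score += math.log(a_plus[s][t] / a_minus[s][t])
    return score


# Toy training data, invented for this example only.
island_seqs = ["CGCGCGGCGCCGCG", "GCGCGCGCCGGCGC"]
background_seqs = ["ATATTAAATGCATTTA", "TTAATATACATGAATT"]

a_plus = train_transitions(island_seqs)
a_minus = train_transitions(background_seqs)

# 0.5 is a neutral toy prior; the text uses island length / genome length.
score_cg = log_odds_score("CGCGCG", a_plus, a_minus, prior_plus=0.5)
score_at = log_odds_score("ATATAT", a_plus, a_minus, prior_plus=0.5)
print(score_cg, score_at)
```

A positive score favours the CpG-island model and a negative score the background model. With a realistic prior, which is much smaller than 0.5, the decision threshold shifts accordingly.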

Locating CpG islands using a hidden Markov model

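Here the visible states are the bases (G, C, A, T) at the respective positions of the DNA sequence $X = x_1 \dots x_L$. The hidden state $z_i$ indicates whether the base at position i is part of a CpG island or not (+, -). There are four possible transition probabilities:

$$a_{++},\quad a_{+-},\quad a_{-+},\quad a_{--}$$

Each hidden state s produces a visible state b (a base) with an emission probability:

$$e_s(b) = P(x_i = b \mid z_i = s)$$

The probability that the visible sequence X is emitted by a hidden state sequence $Z = z_1 \dots z_L$ accordingly follows as:

$$P(X \mid Z) = \prod_{i=1}^{L} e_{z_i}(x_i)$$

With (see Markov chain):

$$P(Z) = P(z_1)\prod_{i=2}^{L} a_{z_{i-1}z_i}$$

This results in:

$$P(Z \mid X) = \frac{P(X \mid Z)\,P(Z)}{P(X)}$$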

Because the effort required to maximize P(Z | X) by enumerating all hidden state sequences grows exponentially with the length of the sequence, the recursive Viterbi algorithm is well suited to solving the problem.
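The Viterbi recursion for the two-state model can be sketched as follows. The start, transition, and emission probabilities are hypothetical values chosen only to make the example readable; they are not estimated from real data. The computation is done in log space to avoid numerical underflow on long sequences, and a non-empty input sequence is assumed.

```python
import math

STATES = ("+", "-")  # "+" = inside a CpG island, "-" = outside


def viterbi(x, start, trans, emit):
    """Return the most probable hidden state path for base sequence x."""
    # v[s]: log-probability of the best path that ends in state s.
    v = {s: math.log(start[s]) + math.log(emit[s][x[0]]) for s in STATES}
    backpointers = []
    for base in x[1:]:
        new_v, ptr = {}, {}
        for s in STATES:
            # Best predecessor state for reaching s at this position.
            best = max(STATES, key=lambda r: v[r] + math.log(trans[r][s]))
            ptr[s] = best
            new_v[s] = (
                v[best] + math.log(trans[best][s]) + math.log(emit[s][base])
            )
        backpointers.append(ptr)
        v = new_v
    # Trace back from the best final state.
    path = [max(STATES, key=lambda s: v[s])]
    for ptr in reversed(backpointers):
        path.append(ptr[path[-1]])
    return "".join(reversed(path))


# Hypothetical parameters for illustration only.
start = {"+": 0.5, "-": 0.5}
trans = {"+": {"+": 0.9, "-": 0.1}, "-": {"+": 0.1, "-": 0.9}}
emit = {
    "+": {"A": 0.1, "C": 0.4, "G": 0.4, "T": 0.1},
    "-": {"A": 0.4, "C": 0.1, "G": 0.1, "T": 0.4},
}

path = viterbi("ATATATCGCGCGCGATATAT", start, trans, emit)
print(path)
```

The running time is linear in the sequence length (times the square of the number of states), in contrast to the exponential cost of trying every state sequence.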
