German Reference Corpus

The German Reference Corpus (DEREKO for short) is an electronic archive of German-language corpora of written text. It has existed since 1964 and is maintained and continuously expanded by the Institute for German Language (IDS) in Mannheim. With more than 5 billion running words of text (as of February 2012), DEREKO is the world's largest collection of electronic corpora of the German language intended for scientific purposes. DEREKO is publicly accessible through the web application COSMAS II.

Alternative names

The German Reference Corpus is often referred to by other names, including Mannheim corpora, IDS corpora, COSMAS corpora, and archive of the corpora of written contemporary language of the IDS. The term German Reference Corpus (DEREKO) originally referred only to one part of today's archive, which was built between 1999 and 2002 in a project of the same name involving several institutions. Since 2004 it has been the official name of the entire corpus archive.

Design and composition

The German Reference Corpus contains literary, scientific, and popular-science texts, a large number of newspaper articles, and various other text types. The texts cover the period from the mid-20th century to the present.

Unlike some other well-known corpora and corpus archives (such as the DWDS core corpus or the British National Corpus), the German Reference Corpus is explicitly not designed as a balanced corpus: its texts are neither distributed across text types in predetermined proportions nor spread evenly over the period covered.

This design follows from the fact that whether a corpus is a balanced or even representative sample can, in principle, only be assessed relative to a fixed language domain (that is, a fixed population). Different linguistic questions, however, may concern very different language domains. The German Reference Corpus is therefore conceived as a kind of primordial sample (Ur-sample) of written German usage, from which a balanced sample can be drawn in a targeted way, depending on the research question and the population it implies. A corpus compiled in this way from the texts of a corpus archive is also called a virtual corpus.
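The idea of drawing a balanced virtual corpus from an unbalanced archive can be illustrated with a minimal Python sketch of stratified sampling. It does not show how COSMAS II works internally; the function name build_virtual_corpus, the record fields, the toy archive, and the target shares are all hypothetical and chosen only for illustration.

```python
import random
from collections import Counter, defaultdict


def build_virtual_corpus(archive, target_shares, size, seed=0):
    """Draw `size` texts so that text types follow `target_shares`.

    `archive` is a list of dicts with a "text_type" key (hypothetical schema).
    `target_shares` maps each text type to its desired share of the sample.
    """
    rng = random.Random(seed)

    # Group the archive texts by text type (the strata).
    by_type = defaultdict(list)
    for text in archive:
        by_type[text["text_type"]].append(text)

    sample = []
    for text_type, share in target_shares.items():
        quota = round(size * share)
        pool = by_type[text_type]
        if quota > len(pool):
            raise ValueError(f"archive holds too few {text_type!r} texts")
        # Sample without replacement inside each stratum.
        sample.extend(rng.sample(pool, quota))
    return sample


# Toy archive that is deliberately unbalanced (mostly newspaper texts).
archive = (
    [{"id": f"news-{i}", "text_type": "newspaper"} for i in range(900)]
    + [{"id": f"lit-{i}", "text_type": "literary"} for i in range(60)]
    + [{"id": f"sci-{i}", "text_type": "scientific"} for i in range(40)]
)

# Assumed target population: half newspaper, a quarter each of the others.
shares = {"newspaper": 0.5, "literary": 0.25, "scientific": 0.25}
virtual = build_virtual_corpus(archive, shares, size=80)
counts = Counter(text["text_type"] for text in virtual)
print(counts)
```

A different research question would simply use different target shares, or a different stratification variable such as publication year, against the same archive.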

Access

Because of copyright and licensing conditions, the DEREKO archive may not be copied and, in particular, may not be offered for download. It can, however, be searched and analyzed through the COSMAS II interface; end users must register and commit to purely scientific, non-commercial use. Among other things, COSMAS II lets users selectively compile a virtual corpus from the German Reference Corpus that suits their research question and then work with it.

Currently, more than 25,000 users worldwide are registered for COSMAS II and can carry out scientific research and analyses in DEREKO.
