Text corpus

The term text corpus (grammatically neuter in German; plural corpora; also simply corpus; from Latin corpus, 'body') generally refers to a collection of written texts, or of oral utterances recorded in writing, in a particular language. Text corpora are used in various scholarly disciplines, chiefly linguistics, literary studies and historically oriented disciplines, but they also matter in fields such as jurisprudence. They are the means by which, for example, a particular language can be described or the works of an author studied; corpora also serve as sources for research on particular (e.g. historical) topics and questions.

Corpora, at least those of living languages, are compiled according to specific scholarly criteria and comprise a defined type and number of texts. Such compilations gained immense importance with the advent of machine processing, particularly in several branches of linguistics, and were decisive for the quite recent establishment of corpus linguistics.

Today a text corpus typically exists in digital form. To describe individual languages, large corpora of many millions to billions of words have been created for many national languages. Some of them are designed to represent the individual text types of the language in defined proportions. In addition, there are many specialized corpora, such as child-language corpora, dialect corpora, and corpora consisting of the complete edition of a body of literary works, among others. Increasingly, purpose-built text corpora are also created for individual linguistic studies.

Types of corpora

Text corpora can be categorized in various ways according to formal and content-related criteria.

Text corpora in linguistics

Text corpora make it possible to examine the system of a language and its use in various ways, on the basis of language data that were actually produced. The term "corpus", in the sense of a body of language data sampled in order to derive general statements, has been used in various branches of linguistics for decades.

This empirical orientation stands in contrast to the rationalist orientation of generative grammar, which currently represents a dominant paradigm in theoretical linguistics. Accordingly, representatives of that school view the use and benefit of text corpora critically, especially for questions of grammar. However, corpora are increasingly being used in this area as well, to test hypotheses.

The linguistic subfields in which corpora are currently used most heavily are corpus linguistics and computational linguistics. Here, corpora that are as large as possible are evaluated in order to make general statements about a language. Examples of corpus use in corpus linguistics include determining the meaning of words from concordances (i.e. from attested passages in specific texts), eliciting collocations (determining the words that commonly co-occur with a given word), and answering questions about the syntax of a language. In computational and mathematical linguistics, the interest lies in word frequencies and word distributions in texts, collocations, sentence and word lengths, and the like. In discourse analysis, a branch of linguistics, text corpora of various sizes, mainly from public language use (politics, media), are used to draw conclusions about the latent attitudes of a social group toward particular matters and circumstances, or about its understanding of particular terms.
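The measures named here (concordances, collocations, word frequencies) can be illustrated with a minimal Python sketch. It assumes a crude regular-expression tokenizer and a fixed co-occurrence window. The function names `tokenize`, `concordance`, `collocates` and `word_frequencies`, the window sizes, and the sample text are illustrative assumptions and do not reproduce any particular corpus tool.

```python
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and split it into word tokens (a crude assumption)."""
    return re.findall(r"\w+", text.lower())


def concordance(tokens, keyword, width=3):
    """Return keyword-in-context lines: up to `width` tokens on each side of a hit."""
    lines = []
    for i, tok in enumerate(tokens):
        if tok == keyword:
            left = " ".join(tokens[max(0, i - width):i])
            right = " ".join(tokens[i + 1:i + 1 + width])
            lines.append(f"{left} [{tok}] {right}".strip())
    return lines


def collocates(tokens, keyword, window=2):
    """Count the words that occur within `window` tokens of the keyword."""
    counts = Counter()
    for i, tok in enumerate(tokens):
        if tok == keyword:
            neighbours = tokens[max(0, i - window):i] + tokens[i + 1:i + 1 + window]
            counts.update(neighbours)
    return counts


def word_frequencies(tokens):
    """Count how often each token occurs in the corpus."""
    return Counter(tokens)


# Hypothetical three-sentence "corpus" used only for illustration.
sample = "The old bank by the river. The bank raised its rates. A river bank floods."
tokens = tokenize(sample)

for line in concordance(tokens, "bank", width=2):
    print(line)
print(collocates(tokens, "bank", window=2).most_common(2))
print(word_frequencies(tokens).most_common(3))
```

Raw co-occurrence counts like these are dominated by frequent function words. Real collocation analysis usually applies an association measure on top of the counts, which this sketch leaves out.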

The World Wide Web is indeed a collection of language in actual use, but by scholarly standards it is not regarded as a text corpus in the strict sense. Nevertheless, it is used with care, under certain restrictions, for certain research questions. For example, regional websites were used alongside various printed texts in preparing the Variantenwörterbuch des Deutschen (Dictionary of German Variants).

Reference corpora of individual languages

Extensive text corpora are created to describe national languages or language varieties, and today they are very often also available online. In those cases, the software needed for analysis is already implemented on the World Wide Web and can be used without installing such a program on one's own PC.

The first text corpus of a national language variety was the Brown Corpus, created in the 1960s and fully annotated according to 80 defined word classes, which was intended to represent contemporary American English. (The name derives from Brown University in Providence, Rhode Island, where the corpus was created.) It comprises 1 million words and consists of 500 text excerpts of 2,000 words each, drawn from 15 different text types (various newspaper and literary text types, religious texts, literature, etc.). The view that a text sample of 2,000 words adequately represents its text type, both for a text corpus and with respect to the statistical population, still holds today. The Brown Corpus was the basis for the American Heritage Dictionary, the first dictionary created solely on the basis of such a corpus. The Brown Corpus was followed in the 1980s by, among others, the likewise fully annotated Lancaster-Oslo-Bergen Corpus (LOB Corpus for short), which is modeled on the Brown Corpus and consists of texts in British English.
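The arithmetic of this design is straightforward: 500 excerpts of 2,000 words each give 500 × 2,000 = 1,000,000 words. A minimal Python sketch of fixed-length excerpt sampling follows. It assumes one contiguous excerpt per source text, taken from a random start position, which is an illustrative simplification and not a description of how the Brown Corpus was actually sampled. The names `draw_excerpt` and `build_corpus` and the toy data are hypothetical.

```python
import random


def draw_excerpt(words, size, rng):
    """Return a contiguous excerpt of `size` words from a random start position."""
    if len(words) < size:
        raise ValueError("source text is shorter than the excerpt size")
    start = rng.randrange(len(words) - size + 1)
    return words[start:start + size]


def build_corpus(texts_by_type, size, rng):
    """Draw one fixed-length excerpt per source text, keeping its text type."""
    corpus = []
    for text_type, texts in texts_by_type.items():
        for words in texts:
            corpus.append((text_type, draw_excerpt(words, size, rng)))
    return corpus


# Toy data: two hypothetical text types, excerpt size 5 instead of 2,000.
rng = random.Random(0)
texts_by_type = {
    "press": [[f"p{i}" for i in range(20)], [f"q{i}" for i in range(12)]],
    "fiction": [[f"f{i}" for i in range(8)]],
}
corpus = build_corpus(texts_by_type, size=5, rng=rng)
total_words = sum(len(excerpt) for _, excerpt in corpus)

# Corpus size implied by the design described in the text.
brown_total = 500 * 2000
```

Because every excerpt has the same length, the share of each text type in the corpus is fixed by the number of excerpts drawn from it, which is what makes the proportions of a balanced corpus easy to control.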

Today, important corpora of English include, among others, the British National Corpus, the American National Corpus and the International Corpus of English (with texts from various English-speaking countries).

The most comprehensive corpus of German is considered to be the German Reference Corpus, compiled at the Institute for the German Language in Mannheim. It consists of more than 4 billion words of written language (as of early 2011) and is generally open for anyone to use.

Within the research project "Digital Dictionary of the German Language of the 20th Century", the largest balanced text corpus of 20th-century German was made available. In addition, there are further corpora, such as the complete online archives of the weekly "Die Zeit" (from 1996), the "Tagesspiegel" (from 1996) and the "Potsdamer Neueste Nachrichten", as well as a large corpus of Jewish periodicals (Germania Judaica). The corpora are linked to a large monolingual German dictionary, the Dictionary of the German Language. When a keyword is queried, the system returns not only the concordance but also information on synonyms, hyponyms, hypernyms and collocations.

The Natural Language Processing department of the University of Leipzig also works on and with large corpora and maintains, among others, a corpus of around 1.5 billion words (approximately 100 million sentences). Online, however, apart from some statistical data, only a smaller corpus can be queried.

Furthermore, a Swiss text corpus has been accessible online since 2010.

Large corpora now also exist for many other national languages. This applies not only to the Indo-European languages, but also to other languages with many speakers, especially in Asia. Smaller languages of Asia and Africa, too, are documented in the form of text archives or less extensively annotated corpora.

Special text corpora

In addition to the large reference corpora, there is a growing number of text collections that go not only by the name "corpus" but also by "(text) archive" or "database". These include, for example, dialect corpora and corpora of spoken language, such as those held in the Bavarian Archive for Speech Signals. Another type of specialized corpus is the complete edition, such as the Austrian Academy Corpus created at the Austrian Academy of Sciences, which contains the complete runs of the journals "Die Fackel" and "Der Brenner".

The database "CHILDES" is of particular importance for psycholinguistics and clinical linguistics in the study of normal and impaired language acquisition in children. It holds an extensive collection of transcripts of children's spoken language.

In the course of large-scale projects to digitize old book holdings, more and more encyclopedias, dictionaries, lexicons and literary works are being captured and made available online. Among them is the "German Text Archive", which aims to provide a comprehensive range of historical texts from several centuries. In the best case, such text collections offer a free online full-text search across the entire holdings. However, it is often not possible to use these texts for linguistic purposes as conveniently as purpose-built corpora, because the search software is not designed for it.
