Corpus linguistics

Corpus linguistics is a field of linguistics that is currently establishing itself. It seeks new knowledge about language in general or about particular individual languages, or tests existing hypotheses, using as its basis quantitative or qualitative data obtained from the analysis of dedicated text corpora or (more rarely) corpora of spoken language. Corpus linguistics became widespread in the German-speaking world in the second half of the 1990s. Epistemologically, it is regarded as opposed to the currently prevailing paradigm of generativism. Whether corpus linguistics is a method or a new branch of linguistics in its own right is still disputed.

Data and research subject

The subject of corpus linguistics is language in its various manifestations. Corpus linguistics is characterized by the use of authentic language data documented in large corpora. Such text corpora are collections of linguistic utterances compiled according to certain criteria and with a specific research objective. The findings of corpus linguistics are thus based on natural utterances of a language, that is, on language as it is actually used. These utterances may be written, or they may be spontaneous or elicited spoken language. Most corpora are now available in digital form and can be searched for linguistic research using dedicated software.

The aim of corpus linguistics is either to test existing linguistic hypotheses against these data (confirming or refuting them) or to arrive at new hypotheses and theories about the subject through exploratory data analysis. The first case is called "corpus-based" linguistic analysis, the second "corpus-driven" linguistic analysis.

Corpus-linguistic questions concern both the linguistic system itself ("langue" in the terms of Ferdinand de Saussure, "competence" in those of Noam Chomsky) and the use of language ("parole" for de Saussure, "performance" for Chomsky). Corpus linguistics is thus on the way to overcoming the dichotomous view of language that dominates linguistics.

A typical question concerning the language system is, for example:

  • Can the prefield (the position before the finite verb) of a German sentence be occupied more than once? If so, by which constituents? Are there rules that describe the possibilities for multiple prefield occupation?

Typical questions concerning the use of language are:

  • Do spelling mistakes occur more often in e-mails than in traditional letters? Which types of errors are characteristic of e-mails?
  • Which mistakes do learners of German (with different source languages) make particularly often at a given level, and which words or grammatical constructions do these learners avoid?

For many research questions that corpus linguistics tries to answer, however, it cannot be clearly decided which of the two domains, langue or parole, a phenomenon belongs to, as in the following questions:

  • Which adjectives typically co-occur with the noun "hair"?
  • Are modal particles more frequent or less frequent in spoken language than in written language, or are they used differently?

On the one hand, the distribution of adjectives with "hair" and the distribution of modal particles can count as a phenomenon of a particular language or, after comparison with other languages, as a feature of language in general; on the other hand, they can be regarded as the result of a specific use of language.
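A question such as the one about adjectives with "hair" can be approached by counting co-occurrences in a corpus annotated with parts of speech. The following is only a minimal sketch in Python: the toy corpus, the tag names ("ADJ", "NOUN" and so on) and the function name `adjectives_before` are illustrative assumptions, and real corpora come with their own tagsets and query tools.

```python
from collections import Counter

# Hypothetical toy corpus: each sentence is a list of (word, part-of-speech tag) pairs.
# The tag names are illustrative assumptions, not those of a specific real tagset.
TOY_CORPUS = [
    [("she", "PRON"), ("has", "VERB"), ("long", "ADJ"), ("hair", "NOUN")],
    [("his", "PRON"), ("blond", "ADJ"), ("hair", "NOUN"), ("shines", "VERB")],
    [("long", "ADJ"), ("hair", "NOUN"), ("needs", "VERB"), ("care", "NOUN")],
    [("a", "DET"), ("long", "ADJ"), ("road", "NOUN")],
]


def adjectives_before(noun, corpus):
    """Count adjectives that directly precede the given noun."""
    counts = Counter()
    for sentence in corpus:
        # Walk over adjacent token pairs within one sentence.
        for (word, tag), (next_word, _next_tag) in zip(sentence, sentence[1:]):
            if tag == "ADJ" and next_word.lower() == noun:
                counts[word.lower()] += 1
    return counts


hair_adjectives = adjectives_before("hair", TOY_CORPUS)
print(hair_adjectives.most_common())
```

The sketch only looks at the word immediately before the noun. Whether such a count says something about the language system or about a specific use of language depends, as discussed above, on how the corpus was compiled.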

(The works of Lemnitzer/Zinsmeister (2010) for German and McEnery/Xiao/Tono (2006) for English, for example, offer an insight into the facets of corpus-linguistic research.)

Methodological problems

A major methodological problem of corpus linguistics is the relationship between the data basis, that is, the corpus, and the object under investigation. In theory, the data basis cannot cover the object completely if the language in question is still in use today. Nor can a corpus be regarded as a valid sample in the sense of inferential statistics, because the object to which the sample refers, namely a particular language or a particular use of language, cannot in practice be captured as a whole. The current workaround is no longer to call a corpus "representative" of the investigated object in the statistical sense (as was originally demanded) and to treat findings obtained from corpora as only provisionally plausible. Large corpora should therefore be compiled so as to be "balanced", that is, to consist of different text types in a certain ratio.
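What "balanced" means in practice can be illustrated by comparing the actual share of each text type with an intended ratio. The following Python sketch is purely illustrative: the text types, token counts and target ratio are hypothetical assumptions, since the text does not prescribe any particular ratio.

```python
def text_type_shares(token_counts):
    """Return each text type's share of the total number of tokens."""
    total = sum(token_counts.values())
    return {text_type: n / total for text_type, n in token_counts.items()}


def largest_deviation(shares, targets):
    """Return the text type that deviates most from its target share, and by how much."""
    text_type = max(targets, key=lambda t: abs(shares.get(t, 0.0) - targets[t]))
    return text_type, shares.get(text_type, 0.0) - targets[text_type]


# Hypothetical corpus composition (tokens per text type) and hypothetical target ratio.
TOKENS = {"newspaper": 700_000, "fiction": 200_000, "academic": 100_000}
TARGETS = {"newspaper": 0.50, "fiction": 0.25, "academic": 0.25}

shares = text_type_shares(TOKENS)
worst_type, deviation = largest_deviation(shares, TARGETS)
print(shares)
print(worst_type, round(deviation, 2))
```

A positive deviation means the text type is over-represented relative to the chosen target. A check of this kind says nothing about representativeness in the statistical sense; it only makes the composition of the corpus explicit.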

The basic assumption of corpus linguistics, that insights can be gained or tested on the basis of real linguistic utterances, brings with it two further methodological problems or objections:

In the first case, one can try to support results obtained by corpus analysis with a parallel survey of speakers. In the second case, the only remedy is to examine further data or, as a last resort, again to survey speakers.

Corpus linguistics vs. generative grammar

Corpus linguistics builds on the use of natural languages. It is an inductive, empirical method of gaining knowledge about language: observing as many concrete individual examples as possible leads to the formulation of a general statement about the subject. This approach ("from the particular to the general") is associated with empiricism, which assumes that all knowledge rests on experience. The opposite is the deductive method, which derives from the philosophical tradition of rationalism: starting from a consideration of how a certain linguistic phenomenon is constituted, one tries to find evidence in languages that confirms it ("from the general to the particular").

In this, corpus linguistics differs fundamentally from the generative transformational grammar founded by Noam Chomsky and from its successors, whose declared goal is to study the language faculty of the competent speaker as a cognitive achievement. Chomsky himself has repeatedly and clearly denied the value of authentic language data for linguistic insight. He argued that authentic performance data of the kind found in text corpora are unsuitable for such study, since errors always occur when language is produced. No valid conclusions about the linguistic system could therefore be drawn from data obtained in this way. Methodologically, Chomsky therefore concentrated on introspection and on judgments elicited from competent native speakers under laboratory conditions (more information: The Linguistics Wars - Lakoff against Chomsky). Corpus linguistics, by contrast, dispenses with the distinction between linguistic competence and performance that Chomsky regards as essential. Recently, however, a rapprochement between the two positions can be observed. Both camps now view their own data basis critically and are prepared to use the data preferred by the other side at least as a means of checking their own findings.

History and applications

The wide distribution and importance of the English language, together with a generally high affinity for empirical research in linguistics, are two reasons why computer-assisted data analysis of the kind practiced in corpus linguistics first developed in the Anglo-American world.

Modern corpus linguistics there was founded in 1967 by Henry Kucera and Nelson Francis with their work "Computational Analysis of Present-Day American English". Their results were obtained from the "Brown Corpus" (in full: "Brown University Standard Corpus of Present-Day American English"). This corpus originally consisted of around 1 million words. Other English-language corpora followed, such as, in the 1980s, the "Lancaster-Oslo-Bergen Corpus" (LOB) of the same size. A new milestone was the creation of a text corpus far exceeding these in size as part of the lexicographic work at the English publisher Collins. Its result was the first edition of the "Collins Cobuild Dictionary of English". This was followed, on a new order of magnitude, by the non-commercial creation of the balanced "British National Corpus", comprising 100 million running words, which is still used today as a reference corpus for linguistic studies of British English. The "American National Corpus" now stands alongside it. Other regional varieties of English are covered by the "International Corpus of English" (ICE).

The pioneers of German corpus linguistics were the Institute for Communication Science and Phonetics (IKP) at the University of Bonn and the Institute for the German Language in Mannheim. Among German-language corpora today, the following in particular should be mentioned:

  • The "German Reference Corpus" (DeReKo) at the Institute for the German Language in Mannheim, a collection of billions of text words
  • The core corpus of the "Digital Dictionary of the German Language" (DWDS) at the Berlin-Brandenburg Academy of Sciences
  • The corpus of the "German Vocabulary" project at the University of Leipzig (mainly texts from online media)
  • The "Swiss Text Corpus" at the University of Basel (currently in trial operation and being expanded)

In addition to these publicly accessible corpora with guaranteed long-term maintenance, there are a large number of special corpora for many language stages and varieties of German. (For an overview, see Lemnitzer/Zinsmeister (2010).)

As the examples of the Collins Cobuild project and the American Heritage Dictionary (1969) show, corpora are used by a lexicography that aims to offer its users not only prescriptive descriptions (how a word may be used) but also descriptive ones (how a word is actually used). Quantitative word-frequency statistics guide and objectify the selection of lemmas for many types of dictionaries. Today the use of corpora is also established at German dictionary publishers. Some types of lexical information can be obtained only by analyzing large text corpora (e.g. graded frequency information), while others can be better substantiated by corpora than by the linguistic competence of individual lexicographers.
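How word-frequency statistics can inform the selection of lemmas may be illustrated with a small Python sketch. It rests on several assumptions: the sample text, the threshold and the function name `lemma_candidates` are hypothetical, and the sketch counts word forms rather than true lemmas, since lemmatization would require additional linguistic resources.

```python
import re
from collections import Counter


def lemma_candidates(text, min_count):
    """Return (word form, frequency) pairs that occur at least min_count times."""
    # Simplifying assumption: a token is a run of letters (including German umlauts and ß).
    tokens = re.findall(r"[a-zäöüß]+", text.lower())
    counts = Counter(tokens)
    # most_common() sorts by descending frequency.
    return [(word, n) for word, n in counts.most_common() if n >= min_count]


# Hypothetical miniature "corpus"; a real application would read millions of words.
SAMPLE = "The corpus shows how the word is used and how often the word occurs."

candidates = lemma_candidates(SAMPLE, min_count=2)
print(candidates)
```

In a dictionary project the threshold would be one criterion among several; the sketch merely shows how a frequency count turns an editorial decision into something that can be checked against data.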

Corpora are now also increasingly used as a research basis in language teaching. Teaching materials are designed on the basis of findings about how a language is actually used, and so-called learner corpora show which errors predominate in language production at which stages of learning.

For specific linguistic questions, further special corpora are increasingly being developed; these are understandably much smaller in scope than reference corpora, which are meant to capture a language as a whole. Such corpora exist, for example, for the study of language use in politics and in the media.

Corpus linguistics - method or discipline?

The question of whether corpus linguistics is a method of general or applied linguistics or a separate linguistic discipline has not yet been answered conclusively.

In favor of regarding it as a method is the fact that many branches of linguistics, from theoretical to forensic linguistics, use empirical, corpus-based analysis techniques in a methodologically reflected manner, although usually not exclusively. A genuine object of study of its own, however, is not discernible for corpus linguistics. Such an object would be necessary if it were to be granted the status of an independent scientific discipline.

In favor of regarding corpus linguistics as an independent discipline is the fact that it decidedly takes the use of language as its object of knowledge and thereby sets itself apart from schools of linguistics whose object is the human language faculty or the general structures of language as a semiotic system.

Regardless of this fundamental question, corpus linguistics has established itself as a branch of scholarship in academic life. This is indicated by the existence of several specialized journals, the publication of a comprehensive two-volume handbook (Lüdeling/Kytö 2008, 2009), and two dedicated departments at the University of Birmingham and at the Humboldt University in Berlin.
