British National Corpus

The British National Corpus (BNC) is a 100-million-word collection of written and spoken language. It draws on a wide variety of sources in order to present a representative cross-section of the British English of the late 20th century.

About 90 percent of the language data in the BNC consists of written language, such as extracts from regional and national newspapers, trade journals, magazines covering many different areas of interest, academic books, popular fiction (novels, etc.), official and private letters, school and university essays, and many other types of text.

The remaining ten percent consists of spoken language data, primarily informal conversation recorded by volunteers of different ages, backgrounds and social classes in order to achieve a demographic balance. The recorded conversations come from a variety of contexts, ranging from formal business and government meetings to radio shows and phone conversations.

Work on the BNC began in 1991 and lasted until 1994. No new texts have been added since the project was completed, although the corpus was slightly revised before the release of its second edition, named "BNC World". Two sub-corpora containing extracts from the BNC have been published: the BNC Sampler (a collection of one million words of written and spoken language) and BNC Baby (four million words from four different genres).

The BNC has four main characteristics:

  • It is monolingual. The BNC covers modern British English, not the other languages used in the British Isles. Nevertheless, words of non-British origin do occur in the BNC.
  • It is synchronic. The BNC covers only the British English of the late 20th century and allows no insight into the historical developments that have produced it.
  • It is general. The BNC contains many different styles and varieties and is not limited to a specific subject area, genre or register. In particular, it includes examples of both written and spoken language.
  • It consists of text excerpts called "samples". For the written sources, samples of 45,000 words were taken from different parts of single-author texts. Shorter texts of up to 45,000 words, as well as multi-author texts such as magazines and newspaper articles, were included in full. Using excerpts allows a wider range of different texts to be represented within the 100-million-word limit and thus avoids over-representing idiosyncratic texts.