N-gram

N-grams are the result of decomposing a text into fragments: the text is broken up, and N consecutive fragments at a time are grouped together as an N-gram. The fragments can be letters, phonemes, words, and the like. N-grams are used in cryptology and linguistics, especially in computational linguistics, computer forensics and quantitative linguistics. For analysis or statistical evaluation, individual words, sentences or whole texts are split into n-grams.

Types of N-grams

Important N-grams are the monogram, the bigram (sometimes also called a digram) and the trigram. A monogram consists of one character, for example a single letter; a bigram consists of two and a trigram of three characters. One can also speak generally of multigrams when referring to a group of "many" characters.

The prefixes of the names are usually formed with Greek numerals. Examples are mono for "alone" or "single", tri for "three", tetra for "four", penta for "five", hexa for "six", hepta for "seven", okto for "eight", and so on. The prefixes bi and multi are of Latin origin and stand for "two" and "many" respectively.

The following table gives an overview of the names of the n-grams, sorted by the number of characters, together with an example in which letters of the alphabet were taken as the characters:

  N   Name of the n-gram   Example
  1   monogram             a
  2   bigram (digram)      ab
  3   trigram              abc
  4   tetragram            abcd
  5   pentagram            abcde
  6   hexagram             abcdef
  7   heptagram            abcdefg
  8   octogram             abcdefgh
  N   multigram            …

Formal definition

Let Σ be a finite alphabet and let n be a positive integer. An n-gram is then a word w of length n over the alphabet Σ, that is, w ∈ Σⁿ.
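As a minimal illustration of this definition (the function name and sample string are illustrative, not from the article), the n-grams of a string can be enumerated as follows:

```python
def ngrams(word, n):
    """Return all n-grams of `word`, i.e. all substrings of length n.

    Each n-gram is a word of length n over the alphabet of `word`.
    A word of length L yields L - n + 1 n-grams.
    """
    return [word[i:i + n] for i in range(len(word) - n + 1)]

print(ngrams("abcde", 2))  # ['ab', 'bc', 'cd', 'de']
print(ngrams("abcde", 3))  # ['abc', 'bcd', 'cde']
```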

Analysis

N-gram analysis is used to answer the question of how likely a particular letter or word is to follow a given sequence of letters or words; consider, for example, the English character sequence "for ex…". The conditional probabilities for the next letter of the alphabet might then be, in descending order: a = 0.4, b = 0.00001, c = 0, …, with a total of 1. Based on the n-gram frequencies, a continuation of the fragment with "a", i.e. "for exa(mple)", therefore appears significantly more likely than the alternatives.
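A minimal sketch of this idea, assuming a toy corpus (the corpus string and function name are mine, not from the article): the conditional probabilities of the next character are estimated from n-gram counts.

```python
from collections import Counter

def next_char_probabilities(corpus, context):
    """Estimate P(next character | context) from (len(context) + 1)-gram counts."""
    n = len(context) + 1
    counts = Counter(corpus[i:i + n] for i in range(len(corpus) - n + 1))
    # Keep only the n-grams that begin with the given context.
    continuations = {g[-1]: c for g, c in counts.items() if g[:-1] == context}
    total = sum(continuations.values())
    return {ch: c / total for ch, c in continuations.items()}

corpus = "for example, for example, for excellence"
print(next_char_probabilities(corpus, "for ex"))
# {'a': 0.666..., 'c': 0.333...} -- "a" is the most likely continuation
```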

The particular language is not important for the analysis, only its statistics: n-gram analysis works in every language and with every alphabet. The method has therefore proven itself in the field of language technology: many approaches to machine translation are based on data obtained with it.

N-gram analysis becomes particularly important when large amounts of data, such as e-mails, have to be screened for a certain subject area. The closer the word frequencies in an e-mail are to those of a reference document, the more likely it is that its content revolves around that document's topic. Through similarity to a reference document, such as a technical report on atomic bombs or polonium, clusters can form that, in this example, could be terrorism-relevant, even if no keywords that clearly point to terrorism actually appear.

Commercially available programs that exploit this fault-tolerant and extremely fast method include spell checkers and forensics tools.
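The article does not name a concrete similarity measure; as one plausible sketch, a document's word-frequency profile can be compared with that of a reference document using cosine similarity (the function names and sample texts here are illustrative):

```python
import math
from collections import Counter

def frequency_profile(text):
    """Relative word frequencies of a text (simple whitespace tokenization)."""
    words = text.lower().split()
    total = len(words)
    return {w: c / total for w, c in Counter(words).items()}

def cosine_similarity(p, q):
    """Cosine similarity between two word-frequency profiles (0 = disjoint, 1 = identical)."""
    dot = sum(p[w] * q[w] for w in set(p) & set(q))
    norm = (math.sqrt(sum(v * v for v in p.values()))
            * math.sqrt(sum(v * v for v in q.values())))
    return dot / norm if norm else 0.0

reference = "polonium decay radiation polonium isotope"
email = "the shipment of polonium and the isotope arrived"
print(cosine_similarity(frequency_profile(reference), frequency_profile(email)))
```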

Google corpus

In 2006, Google released six DVDs of English n-grams of one to five words that arose during the indexing of the web. Here are some examples from the Google corpus for word-level 3-grams and 4-grams (i.e., n counts words), together with the frequencies at which they occur:

3-grams:

  • Ceramics collectables collectibles (55)
  • Ceramics collectables fine (130)
  • Ceramics collected by (52)
  • Ceramics collectible pottery (50)
  • Ceramics collectibles cooking (45)

4-grams:

  • Serve as the incoming (92)
  • Serve as the incubator (99)
  • Serve as the independent (794)
  • Serve as the index (223)
  • Serve as the indication (72)
  • Serve as the indicator (120)


A data set from Google Books, dated July 2009, was made available with a web interface and graphical evaluation. By default, it plots, for up to 5-grams, the frequency normalized by the number of books published in the respective year. Operators allow several terms to be combined into one graph (+), a multiplier to be applied so that terms of very different frequency can be displayed together (*), the ratio between two terms to be shown (-, /), or different corpora to be compared (:). The graphics may be used freely ("freely used for any purpose"); attribution of the source and a link are requested. The underlying data are split into individual packages that can be downloaded for one's own evaluations and are available under the Creative Commons Attribution license.

Besides an evaluation option for English in general, there are specific queries for American English and British English (differentiated by place of publication), as well as for English Fiction (based on library classification) and English One Million. For the latter, up to 6,000 books per year, in proportion to the number of books published and scanned, were randomly selected from the years 1500 to 2008. There are also corpora for German, Simplified Chinese, French, Hebrew, Russian and Spanish. Tokenization was done simply on spaces. N-grams were formed across sentence boundaries, but not across page boundaries. Only words that occur in at least 40 books were recorded.

A new corpus, dated July 2012, was made available at the end of that year. Italian was added as a new language; English One Million was not compiled again. Overall, it is based on a larger number of books, improved OCR technology and improved metadata. Tokenization was done here according to a set of handwritten rules, except for Chinese, where a statistical method was used for segmentation. N-gram formation now stops at sentence boundaries but crosses page boundaries. With the sentence boundaries now marked, new features were introduced for the 2012 corpus: for 1-, 2- and 3-grams, the position within the sentence can be evaluated with high reliability, so that, for example, homograph nouns and verbs in English can be distinguished, although this works better for modern language.

Dice coefficient

The Dice coefficient indicates how similar two terms are. To that end, it determines the proportion of n-grams that are present in both terms. The formula for two terms a and b is

  d(a, b) = 2 · |T(a) ∩ T(b)| / (|T(a)| + |T(b)|)

where T(x) is the set of n-grams of the term x. d always lies between 0 and 1.

An example with trigrams, where § marks the word boundaries:

  • Term a = "wirk"
  • Term b = "work"
  • T(a) = {§§w, §wi, wir, irk, rk§, k§§}
  • T(b) = {§§w, §wo, wor, ork, rk§, k§§}
  • T(a) ∩ T(b) = {§§w, rk§, k§§}
  • d(a, b) = 2 · 3 / (6 + 6) = 0.5
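The example can be reproduced with a short sketch (the § padding follows the example above; the function names are illustrative, not a reference implementation):

```python
def trigrams(term, pad="§"):
    """Trigram set of a term, padded so boundary letters also appear in 3-grams."""
    padded = pad * 2 + term + pad * 2
    return {padded[i:i + 3] for i in range(len(padded) - 2)}

def dice(a, b):
    """Dice coefficient d(a, b) = 2|T(a) ∩ T(b)| / (|T(a)| + |T(b)|)."""
    ta, tb = trigrams(a), trigrams(b)
    return 2 * len(ta & tb) / (len(ta) + len(tb))

print(trigrams("wirk"))       # {'§§w', '§wi', 'wir', 'irk', 'rk§', 'k§§'}
print(dice("wirk", "work"))   # 0.5
```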
Typical applications include:

  • Spelling correction (generating correction suggestions)
  • Searching for similar keywords (monitoring, speech recognition)
  • Base-form reduction (stemming) in information retrieval

Statistics

An n-gram statistic is a statistic on the frequency of n-grams, sometimes also of combinations of N words. Special cases are the bigram statistic and the trigram statistic. N-gram statistics are applied in cryptanalysis and in linguistics, there above all in speech recognition systems. During recognition, the system checks the various hypotheses together with their context and can thereby distinguish homophones. Quantitative linguistics is interested, among other things, in the rank order of n-grams by frequency and in the question of which laws it follows. Statistics of digrams (and trigrams) for German, English and Spanish can be found in Meier and in Beutelspacher.

For meaningful statistics, sufficiently large text bases of several million letters or words should be used. As an example, the statistical evaluation of a German text base of around eight million letters yields "ein" as the most frequent trigram, with a relative frequency of 1.15 percent.
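A minimal sketch of such a trigram statistic (the sample text is a toy; a real evaluation needs a text base of millions of letters, as stated above):

```python
from collections import Counter

def trigram_statistics(text, top=10):
    """Rank the trigrams of a text by relative frequency (in percent)."""
    grams = [text[i:i + 3] for i in range(len(text) - 2)]
    total = len(grams)
    return [(g, 100 * c / total) for g, c in Counter(grams).most_common(top)]

sample = "ein kleines beispiel, ein weiteres beispiel"
for gram, pct in trigram_statistics(sample, top=5):
    print(f"{gram!r}: {pct:.2f} %")
```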
