Letter frequency

Letter frequency is a statistical measure of how often a particular letter occurs in a text or a collection of texts (a "corpus"). It can be given as an absolute count or as a percentage of the total number of letters in the text. The frequency distribution of letters depends on the language. Earlier work broadly assumed that the statistical distribution of letters by frequency follows Zipf's law, but quantitative linguistics has shown that a number of other probability distributions have to be considered (Best 2005). Counts of the frequency of letters or sounds in texts or corpora are documented at least since the early 19th century. For some purposes it is also of interest how often a letter occurs at the beginning or at the end of a word.
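As an illustration, the absolute and percentage forms of letter frequency can be computed in a few lines. This is a minimal sketch, not a reference implementation: the function name letter_frequencies and the sample string are assumptions made for this example, and every character that Python's str.isalpha() accepts is treated as a letter.

```python
from collections import Counter

def letter_frequencies(text):
    """Return {letter: (absolute count, percentage of all letters)}.

    Assumption: case is ignored and anything str.isalpha() accepts is a letter.
    """
    letters = [ch for ch in text.lower() if ch.isalpha()]
    total = len(letters)
    if total == 0:
        return {}
    counts = Counter(letters)
    # most_common() orders the result from most to least frequent letter.
    return {ch: (n, 100.0 * n / total) for ch, n in counts.most_common()}

sample = "Effi Briest"  # illustrative input only
for letter, (count, percent) in letter_frequencies(sample).items():
    print(f"{letter}: {count} ({percent:.1f} %)")
```

On so short an input the percentages say little about the language; meaningful values need a long text or a corpus.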

Application

Letter frequency is used in cryptanalysis to break substitution ciphers, as well as in data compression and encoding. With simple encryption methods such as the Caesar cipher, a text can be decrypted by frequency analysis alone. The frequencies of the different characters in the ciphertext are determined and then compared with the character frequencies of a text in the suspected language. The letters of the ciphertext are then replaced by the "normal" letters of equal frequency. The most frequent letter in the ciphertext then corresponds, for example, to the plaintext letter "e". This method is particularly well suited to longer ciphertexts, because the statistical deviation of the observed letter frequencies from the expected frequencies is smaller.
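The procedure described above can be sketched for the Caesar cipher. The sketch assumes a 26-letter ASCII alphabet and that the most frequent ciphertext letter stands for "e"; the names caesar_shift and guess_shift and the sample sentence are invented for this example. On short or unusual texts the guess can easily be wrong.

```python
from collections import Counter
import string

def caesar_shift(text, shift):
    """Shift every ASCII letter by `shift` positions; leave other characters alone."""
    result = []
    for ch in text:
        if ch in string.ascii_letters:
            base = ord("A") if ch.isupper() else ord("a")
            result.append(chr((ord(ch) - base + shift) % 26 + base))
        else:
            result.append(ch)
    return "".join(result)

def guess_shift(ciphertext, expected_most_common="e"):
    """Guess the shift by assuming the most frequent letter encodes `expected_most_common`."""
    letters = [ch for ch in ciphertext.lower() if ch in string.ascii_lowercase]
    if not letters:
        raise ValueError("ciphertext contains no letters")
    most_common = Counter(letters).most_common(1)[0][0]
    return (ord(most_common) - ord(expected_most_common)) % 26

# Illustrative plaintext chosen so that "e" is clearly the most frequent letter.
plaintext = "meet me here between the green trees at seven"
ciphertext = caesar_shift(plaintext, 3)
shift = guess_shift(ciphertext)
print(shift, caesar_shift(ciphertext, -shift))
```

A general substitution cipher needs the full frequency ranking rather than a single letter, and usually some manual correction as well.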

For typing lessons, it is very important that the teacher is well informed about letter frequency in the language and that the teaching content is matched to it. Frequent letters such as E or I need to be practised sufficiently to achieve the highest possible number of keystrokes and good typing accuracy. Letter frequency also plays a major role in the design of ergonomic keyboards. Manufacturers of word games such as Boggle or Scrabble likewise take letter frequency into account in their national variants, both in how often each letter appears and, where present, in the value of the letters.

One of the first applications was Morse code, which uses short codes for frequent characters (for example, E = · ) and longer codes for rarely used characters (for example, Q = - - · - ).

Extensions

Letter frequency extends naturally to the frequency of letter pairs (bigrams) and letter triples (trigrams), and to word frequency. If spoken rather than written language is examined, corresponding counts of sound or phoneme frequency can be carried out.
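Counting letter pairs and triples works the same way as counting single letters, only over overlapping windows. The following is a minimal sketch; the function name ngram_counts, the choice to write word boundaries as "_" and the sample sentence are assumptions made for this example.

```python
from collections import Counter

def ngram_counts(text, n):
    """Count overlapping n-grams of letters; word boundaries are written as '_'."""
    # Keep only letters inside each word, then join the words with a single '_'.
    words = ["".join(ch for ch in w if ch.isalpha()) for w in text.lower().split()]
    cleaned = "_".join(w for w in words if w)
    return Counter(cleaned[i:i + n] for i in range(len(cleaned) - n + 1))

sample = "der Vater und der Bruder"  # illustrative input only
print(ngram_counts(sample, 2).most_common(3))
print(ngram_counts(sample, 3).most_common(3))
```

Because the last word is not followed by a separator, a trigram such as "er_" is only counted where another word follows.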

Letter frequencies in German texts

The umlauts ä, ö and ü were counted as ae, oe and ue; the ligature ß (sz) was counted as a separate character.

For comparison, if all 27 letters were equally frequent, each would have a relative frequency of 3.704 %.
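The counting convention and the uniform baseline can be reproduced as follows. This is only a sketch under the stated convention (umlauts expanded to ae, oe, ue; ß kept as its own symbol); the names ALPHABET, UMLAUTS, normalize and relative_frequencies are assumptions made for this example and do not come from the original count.

```python
from collections import Counter

ALPHABET = "abcdefghijklmnopqrstuvwxyzß"  # 26 letters plus ß = 27 symbols
UMLAUTS = {"ä": "ae", "ö": "oe", "ü": "ue"}

def normalize(text):
    """Lowercase, expand umlauts, and drop everything outside the 27-symbol alphabet."""
    text = text.lower()
    for umlaut, replacement in UMLAUTS.items():
        text = text.replace(umlaut, replacement)
    return "".join(ch for ch in text if ch in ALPHABET)

def relative_frequencies(text):
    """Return the percentage share of each of the 27 symbols (0.0 if absent)."""
    letters = normalize(text)
    total = len(letters)
    if total == 0:
        return {}
    counts = Counter(letters)
    return {ch: 100.0 * counts[ch] / total for ch in ALPHABET}

uniform = 100.0 / len(ALPHABET)  # share of each symbol under an equal distribution
print(f"{uniform:.3f} %")  # prints 3.704 %
```

Comparing each value from relative_frequencies with the uniform share shows which letters are over- or under-represented in a given text.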

First letter

The first-letter frequency indicates how often a letter occurs as the first letter of a word. It depends relatively strongly on the type of text. For running text, the five most common first letters are:

Encyclopedias show a different distribution. The letters "D", "E", "I" and "W" occur much less frequently in initial position than in running text, while "S" is by far the most common:

Final letters

The final-letter frequency indicates how often a letter occurs as the last letter of a word. (The sample text evaluated was the novel Effi Briest by Theodor Fontane; "ß" was always counted as "ss". The text comprises all 36 chapters of the work, with a total of 572 849 characters.)
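First-letter and final-letter counts can be obtained by splitting a text into words and looking only at the outermost letters. The sketch below follows the convention stated for the sample text, counting "ß" as "ss"; the function name edge_letter_counts, the word pattern and the sample sentence are assumptions made for this example.

```python
from collections import Counter
import re

def edge_letter_counts(text):
    """Return (first-letter counts, final-letter counts) over all words in `text`."""
    text = text.lower().replace("ß", "ss")  # convention: ß is counted as ss
    # A "word" here is any run of Unicode letters (no digits, no underscore).
    words = re.findall(r"[^\W\d_]+", text)
    first = Counter(word[0] for word in words)
    last = Counter(word[-1] for word in words)
    return first, last

sample = "Der Fluß ist groß und das Haus ist alt"  # illustrative input only
first, last = edge_letter_counts(sample)
print(first.most_common(5))
print(last.most_common(5))
```

Running the same function on running text and on a list of encyclopedia headwords would be one way to see how strongly the type of text influences the first-letter distribution.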

Frequency diagrams

Monogram frequency plot: the letter frequency distribution of a longer German text.

Bigram frequency plot: distribution of the most common bigrams in a German text.

Trigram frequency plot: distribution of the most common trigrams in a German text. The triples ER_ and EN_ are the most common ("_" stands for a space).

Letter frequencies in selected languages
