Index of coincidence

The index of coincidence (abbreviated IC) is obtained by a statistical analysis of the individual characters (usually the individual letters) of one or two texts. With its help, unintelligible or encrypted texts can be examined for linguistic properties. It is used especially in the decipherment of historical written documents and quite generally in cryptanalysis. The method was developed for cryptographic purposes by the American cryptanalyst William F. Friedman and published in 1922 in his seminal work "The Index of Coincidence and Its Applications in Cryptography".

In its basic form, the index of coincidence is determined by counting how often each of the different individual characters occurs in a ciphertext, for example how many times the letter A appears, how often B, and so on. Each of these counts is multiplied by the same count reduced by 1, and the products are added up for all letters (for example from A to Z). The sum is finally divided by the total number N of letters in the text (that is, the text length) and by the text length reduced by 1; the result is Friedman's index of coincidence IC. Writing f_i for the count of the i-th letter:

IC = \frac{1}{N(N-1)} \sum_{i} f_i (f_i - 1).
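
As an illustration, here is a minimal sketch of this calculation in Python (the function name and the restriction to the letters A to Z are choices made for this example, not part of the original description):

from collections import Counter

def index_of_coincidence(text, alphabet="ABCDEFGHIJKLMNOPQRSTUVWXYZ"):
    """Friedman's IC: sum of f_i * (f_i - 1) over all letters, divided by N * (N - 1)."""
    letters = [c for c in text.upper() if c in alphabet]
    n = len(letters)
    if n < 2:
        return 0.0
    counts = Counter(letters)
    return sum(f * (f - 1) for f in counts.values()) / (n * (n - 1))

# A natural-language sample scores well above 1/26 (about 0.0385) ...
print(index_of_coincidence("TO BE OR NOT TO BE THAT IS THE QUESTION"))
# ... while a string in which every letter occurs exactly once scores 0.
print(index_of_coincidence("ABCDEFGHIJKLMNOPQRSTUVWXYZ"))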

Definitions

In its general form, four functions are grouped under the term index of coincidence, which are usually denoted by the Greek letters κ, χ, ψ and φ (Kappa, Chi, Psi and Phi). It is usually φ that is referred to as the index of coincidence, although from a historical point of view κ probably has more of a right to the name.

Let two texts T = t_1 t_2 … t_N and T' = t'_1 t'_2 … t'_N of equal length over the same alphabet be given. The Kappa of the two texts is then

\kappa(T, T') = \frac{1}{N} \sum_{j=1}^{N} \delta(t_j, t'_j),

where δ denotes the Kronecker delta (that is, δ(x, y) = 1 if x = y, and δ(x, y) = 0 otherwise).

Thus κ(T, T') is a number between 0 and 1, and the value 1 occurs if and only if both texts are equal. If the characters are chosen at random and uniformly distributed from an alphabet with n letters, then the expected value of κ equals 1/n, since each summand δ(t_j, t'_j) is equal to 1 with probability 1/n (and equal to 0 otherwise).
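
A minimal sketch of κ in Python (the function name is a choice made for this example):

import random
import string

def kappa(text1, text2):
    """Fraction of positions at which two equal-length texts carry the same character."""
    if len(text1) != len(text2):
        raise ValueError("kappa is only defined for texts of equal length")
    return sum(1 for a, b in zip(text1, text2) if a == b) / len(text1)

# Identical texts give exactly 1.
print(kappa("INDEXOFCOINCIDENCE", "INDEXOFCOINCIDENCE"))
# Two long uniformly random texts over 26 letters give roughly 1/26 (about 0.0385).
r1 = "".join(random.choice(string.ascii_uppercase) for _ in range(10000))
r2 = "".join(random.choice(string.ascii_uppercase) for _ in range(10000))
print(kappa(r1, r2))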

Since in cryptanalysis one often wants to squeeze a great deal of information out of short texts, the function χ, which like the following functions goes back to Friedman's colleague Solomon Kullback (1935), is sometimes more meaningful:

\chi(T, T') = \sum_{i=1}^{n} \frac{f_i(T)}{N} \cdot \frac{f_i(T')}{N},

where f_i(T) and f_i(T') indicate how often the i-th character of the alphabet occurs in the text T and T', respectively. The function χ therefore depends solely on the letter frequencies of the two texts. One now sets

\psi(T) = \chi(T, T) = \sum_{i=1}^{n} \left( \frac{f_i(T)}{N} \right)^2 = \frac{1}{N^2} \sum_{j,k} \delta(t_j, t_k).

While χ applied to two texts of random, uniformly distributed characters has, like κ, the expected value 1/n, this is no longer the case for ψ, since each of the N terms δ(t_j, t_j) is always equal to 1. Therefore one usefully excludes the characters at the same position from the summation and defines

\varphi(T) = \frac{1}{N(N-1)} \sum_{i=1}^{n} f_i(T) \, (f_i(T) - 1).

Like ψ, φ can be calculated solely from the letter frequencies of the text, but in return, for a text of random characters, the expected value of φ equals 1/n, while for ψ it is larger (namely 1/n + (n-1)/(nN)). The difference is striking especially for short texts.
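
The three frequency-based functions can be sketched in Python as follows (the names and the restriction to the alphabet A to Z are again choices made for this example):

from collections import Counter

ALPHABET = "ABCDEFGHIJKLMNOPQRSTUVWXYZ"

def _counts(text):
    """Letter counts f_i, restricted to the alphabet A to Z."""
    return Counter(c for c in text.upper() if c in ALPHABET)

def chi(text1, text2):
    """chi(T, T'): sum over the alphabet of (f_i(T)/N) * (f_i(T')/N')."""
    c1, c2 = _counts(text1), _counts(text2)
    n1, n2 = sum(c1.values()), sum(c2.values())
    return sum(c1[s] * c2[s] for s in ALPHABET) / (n1 * n2)

def psi(text):
    """psi(T) = chi(T, T); the N same-position pairs are still included."""
    return chi(text, text)

def phi(text):
    """phi(T): same-position pairs excluded, i.e. sum of f_i * (f_i - 1) / (N * (N - 1))."""
    c = _counts(text)
    n = sum(c.values())
    return sum(f * (f - 1) for f in c.values()) / (n * (n - 1))

For a random 20-letter text, for example, the expected value of psi is 1/26 + 25/(26 · 20) ≈ 0.086, more than twice 1/26 ≈ 0.0385, while the expected value of phi stays at 1/26; this is the difference described above.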

Importance

If one passes from texts of uniformly distributed random characters to texts written in a particular natural language, the expected value changes significantly. A rule of thumb is that it becomes about twice as large.

Take, for example, the 26 characters of the familiar Latin alphabet (with umlauts replaced by ae, oe, ue, ß replaced by double s or sz, and spaces and punctuation ignored). Then the value for German-language texts mostly lies at 0.0762 (7.62 %), while for English-language texts the expected value is about 6.61 %. In both cases (and, incidentally, for all other languages as well) this is significantly higher than in the case of uniformly distributed letters, as is the case for "randomly" generated strings of letters ("random text") or "strongly encrypted" ciphertexts; there the expected value is 1/26, that is, about 3.85 %. The much higher value for German compared with English primarily reflects the much greater frequency of the dominant letter E in German (about 17.5 %) compared with English (about 12.7 %). This is because the expected value for a language can be calculated from its letter frequencies according to the formula

\varphi_{\text{language}} = \sum_{i=1}^{n} p_i^2,

where p_i denotes the probability of the i-th character of the alphabet in texts of the corresponding language.
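
A short sketch of this formula in Python; the skewed distribution in the second example is purely illustrative and not the actual letter statistics of any language:

def expected_ic(probabilities):
    """Expected index of coincidence of a language: the sum of the squared letter probabilities."""
    return sum(p * p for p in probabilities)

# Uniformly distributed letters over 26 symbols: 1/26, i.e. about 3.85 %.
print(expected_ic([1 / 26] * 26))
# An invented distribution with one dominant letter at 17.5 % and the rest uniform:
# the value rises well above 1/26, which is the effect the dominant E has in German.
print(expected_ic([0.175] + [0.825 / 25] * 25))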

In related languages the expected values are often similar, so that for unknown texts, conjectures about the associated language family can be made on the basis of the index of coincidence.

Application to cryptanalysis

The essential property here is that neither κ nor χ changes under a simple monoalphabetic substitution cipher, provided the texts involved are encrypted in the same way. For sufficiently long texts, a correlation is thus possible on a statistical basis.
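
This invariance is easy to check directly. The following sketch applies one and the same randomly chosen substitution alphabet to two invented plaintexts and confirms that κ does not change (the texts and names are choices made for this example):

import random
import string

def kappa(text1, text2):
    """kappa as in the sketch above: the fraction of positions with matching characters."""
    return sum(1 for a, b in zip(text1, text2) if a == b) / len(text1)

# One random monoalphabetic substitution, applied identically to both texts.
alphabet = string.ascii_uppercase
table = str.maketrans(alphabet, "".join(random.sample(alphabet, len(alphabet))))

t1 = "MEETMEATTHEUSUALPLACEATNOON"
t2 = "MEETMEATTHESAMEPLACEATNIGHT"
# Positions match after encryption exactly where they matched before.
assert kappa(t1, t2) == kappa(t1.translate(table), t2.translate(table))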

With a periodic polyalphabetic substitution cipher, the index of coincidence is even more valuable, because the key length h of the encryption can be estimated as follows (Friedman test). With κ_p denoting the coincidence value of the plaintext language (0.0762 for German) and κ_r = 1/26 ≈ 0.0385 the value for uniformly random letters, the formula for the key length reads

h \approx \frac{(\kappa_p - \kappa_r) \, N}{(N-1) \, \varphi(T) - \kappa_r \, N + \kappa_p}.

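A small Python helper for this estimate (the defaults use the German figures quoted above; the 500-letter example is invented for illustration):

def friedman_key_length(phi_observed, n, kappa_language=0.0762, kappa_random=1 / 26):
    """Friedman's key-length estimate from the observed phi of a ciphertext of length n."""
    return (kappa_language - kappa_random) * n / (
        (n - 1) * phi_observed - kappa_random * n + kappa_language
    )

# For instance, a 500-letter ciphertext with an observed phi of 0.045 yields an
# estimate of roughly 5.7, suggesting a key length of about 6.
print(friedman_key_length(0.045, 500))
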
The formula above can be derived theoretically under the assumption that all key characters are different. The value is therefore informative above all for short texts encrypted with short keys, especially in combination with the Kasiski test. If longer texts encrypted with longer key words are available, the following procedure is more precise.

If one removes from the text first the first g characters and then the last g characters, one obtains two texts for which κ can be determined. If g is a multiple of the key length, then κ should be close to the language value (about 0.0762 for German), since the individual characters being compared are encrypted with the same key characters. If g is not a multiple of the key length, a much lower value of κ is to be expected, because the character sequences being compared are only rarely encrypted in the same way. Repeated sequences in the keyword, with which one can outsmart the Kasiski test and the Friedman test, affect the results here only partially and should normally be detected.
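
A sketch of this shift comparison in Python (the function names are choices made for this example); shifts with a conspicuously high κ point to multiples of the key length:

def kappa_shift(text, g):
    """kappa between the text without its first g characters and the text without
    its last g characters, i.e. the text compared with itself shifted by g."""
    return sum(1 for a, b in zip(text[g:], text[:-g]) if a == b) / (len(text) - g)

def rank_shifts(ciphertext, max_shift=20):
    """Shift amounts ordered by kappa; the top entries suggest (multiples of) the key length."""
    shifts = range(1, max_shift + 1)
    return sorted(shifts, key=lambda g: kappa_shift(ciphertext, g), reverse=True)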

These methods can also be used profitably with non-periodic polyalphabetic ciphers. In particular, two texts encrypted with the same one-time pad can be recognized immediately by computing κ; this cryptographic sin can then be exploited, for example by applying the method of the probable word to one of the texts in order to produce plaintext sequences in the other text.
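
A minimal sketch of why this detection works, assuming a one-time pad that adds random shifts modulo 26 to the letters A to Z (the texts, names and the additive scheme are choices made for this example):

import random

ALPHABET = "ABCDEFGHIJKLMNOPQRSTUVWXYZ"

def encrypt_additive(plaintext, pad):
    """Letter-wise addition modulo 26 with a running key (a one-time pad over A to Z)."""
    return "".join(ALPHABET[(ALPHABET.index(p) + k) % 26] for p, k in zip(plaintext, pad))

def kappa(text1, text2):
    return sum(1 for a, b in zip(text1, text2) if a == b) / len(text1)

p1 = "THEQUICKBROWNFOXJUMPSOVERTHELAZYDOG"
p2 = "PACKMYBOXWITHFIVEDOZENLIQUORJUGSNOW"
pad = [random.randrange(26) for _ in p1]   # the same pad reused for both messages

c1, c2 = encrypt_additive(p1, pad), encrypt_additive(p2, pad)
# Reusing the pad preserves coincidences: the ciphertexts agree exactly where the
# plaintexts do, so kappa stays at the plaintext (language) level instead of 1/26.
assert kappa(c1, c2) == kappa(p1, p2)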

The index of coincidence is thus well suited for so-called ciphertext-only attacks (in which nothing needs to be known about the content of the encrypted text) on substitution ciphers, which is why these ciphers (with the exception, of course, of a correctly applied one-time pad) must be regarded as highly insecure.
