How it works:


The index of coincidence is defined as the probability that two letters, picked from two different texts, will be the same.

The probability of an evcent occuring is given by the total number of ways that event could occur divided by the total number of possible outcomes.

If we first assume that we have two completely random texts, then the chance of picking any given letter from either text is:

Nx/N

Where Nx is the total number of occurances of the letter x in the text N is the total number of letters in the text

The chance that the next letter we pick is the same as the first is thus:

(Nx/N) * (Nx/N)

Now, the alphabet has 26 letters in it. So there are 26 different letters that we could pick in each case.

If the distribution of letters is completely random, then the chance that both letters we pick are the same specified letter is:

(1/26)*(1/26) = 0.0015

But there are 26 possible letters. So the cahances of them both being the same, where we do not care what the letter is, is 26 times this:

26*(1/26)*(1/26) ~ 0.038

If however the letters are not randomly distributed, but instead follow the statistical distribution of english text, then the situation is different.

The following list shows the probability (as a percentage) of any letter, selected at random, being a given letter from an english plaintext:

Distribution of leters
in English plaintext (%)
a8.2
b1.5
c2.8
d4.3
e12.7
f2.2
g2.0
h6.1
i7.0
j0.2
k0.8
l4.0
m2.4
n6.7
o7.5
p1.9
q0.1
r6.0
s6.3
t9.1
u2.8
v1.0
w2.4
x0.2
y2.0
z0.1

Thus in this case the index of coincidence is equal to:

(0.082*0.082)+(0.015*0.015)+..... = 0.066

So, if we pick two letter from english plaintext, we expect the index of coincidence to be approximately 0.07

However, if we pick two letters from random text, we expect the index to be around 0.4

Similarly, if we pick two letters from a cyphertext encrypted using the ceaser shift, we still expect the index to be equalt to 0.07 (whatever the shift). This is because the ceasar shift works by shifting every letter in the plaintext by n places to the left or right in the alphbabet.

So while the frequency distribution will change shift to the left or right, the overall value of the index will not be affected.

The Vigenere cipher works by placing a repeating keyword underneath the plaintext. Every letter of the keyword correspond to a different shift in the plaintext. So if, for instance, the letter "e" is shifted by 5 places in one section of the text, it may be shifted by 3 places in another.

This flattens out the frequency distribution, making decryption harder.

However, if the keyword is N characters long, then every Nth character will have been shifted by the same amount. Thus, if we were to extract every Nth character we would have a text encrypyted with a basic caesar shift, and the index of coincidence calculated would be as for english text - approximately 0.07

If we arange two cypher texts, each on a single line, above the other, with the lower text shifted by N from the one above, then by comparing each character to the one below, and counting how often the characters match, we can acheive a value fror the index with this shift.

This is because the index is a measure of the probability of two letters selected from text(s) being the same. This value is unaffected by a simple ceasar shift, and the two letters we are comparing in each column will have been shifted by the same amount (even though this ammount will vary from column to column). So if we count the number of matching letters and divide by the ciphertext length, we have our value.

If our guess at N was correct, then we should expect a value similar to that for plaintext - approx 0.4. If however our guess was incorrect, then each letter was shifted by a different amount from its companion, and we would exp[ect something similar to that for random text, or about 0.07.

If we repeate this for every possible key length, we can then pick those that seem most likely.

Having done this, we can then split the text up into a number of sub texts, each of length N, and each of which has been shifted by the same amount. These can then be individually decoded by simple frequency anaylsis, and then the plaintext (and Key) reconstructed.