Skip to main content

Section 3.4 Vigenére Cryptanalysis

Subsection 3.4.1 Index of Coincidence

Now that we have a good understanding of calculating probabilities, we can start computing the Index of Coincidence for ciphertexts and using that to determine if the ciphertext is monoalphabetic or polyalphabetic.

Definition 3.4.1.

The index of coincidence for a ciphertext is the probability that two letters picked at random from the ciphertext are identical.

We'll start with an example with a small ciphertext.

Example 3.4.2.

Let's compute the index of coincidence for a ciphertext assuming there are only 5 different letters. If our ciphertext is \(ACDBEADEBAAECBA\) what is the probability of picking two identical letters at random? The order of letters does not matter since we do not care what the letters spell, only if they are the same or not.

We first compute the total number of ways to pick two letters from this set. There are a total of 15 letters so that is \(C(15,2)=\frac{15!}{{2!}{13!}}=\frac{15 \times 14}{2}=105.\) We next need to compute the total number of ways to pick two identical letters. To do this we compute the probability of picking AA, BB, CC, DD or EE. There are five As, so the number of ways to pick two of them is \(C(5,2)=10\text{.}\) Similarly there are three Bs, two Cs, 2 Ds and 3 Es. Thus, the index of coincidence for the given ciphertext is

\begin{equation*} \frac{C(5,2)+C(3,2)+C(2,2)+C(2,2)+C(3,2)}{C(15,2)}=\frac{18}{105}=17.1\% \end{equation*}

To calculate the index of coincidence for a ciphertext with more letters we make a similar calculation for each letter in our language.

From Investigation: Index of Coincidence we saw that the index of coincidence for an English text where all letters are evenly distributes tends towards \(\frac{1}{26}\) as the size of the text goes to infinity. Here's another way to see that. If we have a very large amount of text we could assume that the probability of picking an A is \(\frac{1}{26}\text{.}\) If the text is very large, then the probability of picking a second A is also very nearly \(\frac{1}{26}\text{.}\) (Note, this will not be true for a text with relatively few letters.) So the probability of picking two As is approximately \(\frac{1}{26} \times \frac{1}{26}\) and the probability of picking two identical letters is

\begin{equation*} IC_{even} \approx \ \left(\frac{1}{26}\right)\left(\frac{1}{26}\right)+ \left(\frac{1}{26}\right)\left(\frac{1}{26}\right)+ \cdots + \left(\frac{1}{26}\right)\left(\frac{1}{26}\right)=\frac{1}{26}=3.85 \% \end{equation*}

How should this compare to the index of coincidence for a standard text in English? We again assume that we have a very large amount of text so that if the percentage of As in the English language is \(8.17\%\text{,}\) then the probability of picking an A the first time is \(8.17\%\) and the probability of picking an A the second time is very nearly \(8.17\%\text{.}\) So we can approximate the probability of picking two As from a very large text as \((8.17\%)^2\text{.}\) We repeat this for every letter of the alphabet to obtain the index of coincidence for standard English based on the percentages in the table in  Table 2.10.4.

\begin{equation*} IC_{eng} \approx (.0817)^2 + (.0149)^2+ (.0278)^2 + \cdots (.0007)^2=6.5\% \end{equation*}

Note that if two letters are picked at random from a standard English text then, it is much more likely to get two repeated letters than for a text with a uniformly distributed alphabet.

Question 3.4.4.
What should the index of coincidence be for a monoalphabetic cipher? What should it be for a polyalphabetic cipher? Answer
The index of coincidence for a monolaphabetic cipher should be the same as for the English language. The letters may have been replaced with different symbols, but their basic frequency and distribution has not been altered. The goal of a polyalphabetic cipher is to distribute the letters more evenly so that frequency analysis cannot be used. Thus, the index of coincidence should be closer to the index of coincidence of evenly distributed text. Thus, we can compute the index of coincidence for a ciphertext and determine if it is a monoalphabetic or polyalphabetic cipher by whether its index of coincidence is closer to \(0.0385\) or \(0.065\text{.}\) In fact, the how close the index of coincidence of a polyalphabetic cipher is to \(0.0385\) gives us important information about the length of the key which we will see in Subsection 3.4.3.

Subsection 3.4.2 Kasiski Test

From Investigation: Monoalphabetic versus Polyalphabetic you should have also noticed important patterns in the ciphertext which we can use to get information about the length of the keyword. For example, in the ciphertext

GWGET QGWGE TQWWU OIZGB DRCNO MRLZQ ECDQI ZGKMX TPUWZ ECNET QIQXO MFRNM IMZQG EQIWH QZWPQ FIGOL DRVNR QXDVP EIWHM KHWRQ MIWIZ GBAHW RIZAJ EARTA IJMWI ZGBAQ OGHWR GWSDL PHZEI WDNIZ GJXLV PZWDP AEWQZ JTUCI DGAXH OMQLA ZTQWA ILVSI WDDKT DZYRN BREQU NGOBD RCNOM XLSQD PQOTN UWFKJ ALTMQ LNXJN OMORW XLBIL BTDJM EWAQA NOWAG BTHVF KMOKI DPQEI QDPIZ GOARL PRCNO MPRCN OMFRQ XDVPW ZAXJX HNUUM NXZZD VPFIG OLDRV NXJNO M

there are some strong repeated patterns. For example, the same six letters GWGETQ are repeated twice at the beginning. How could this happen? It's possible that it is a just a random coincidence. However, it is also possible that the same letters in the plaintext are repeated and are in a position in the ciphertext where they are encrypted with the same letters of the keyword. Let's look for all such patterns in the above ciphertext and see what we can learn about the length of the keyword.

Example 3.4.5. Applying the Kasiski Test.

Run the code below which will apply the Kasiski test to the above ciphertext to compute the distances between repeated three- and four-letters patterns. What happens? What does that tell us about the length of the key?

Sage Computation 3.4.6. Applying the Kasiski Test.

Let's start with the strongest patterns, the four-letter patterns. (If there aren't enough four-letter patterns, then we would use three-letter patterns.) We see that there are many of them at the following distances apart 6, 21, 87, 99, 108, 189, 219, 246, 270, 276. While some of these might just be coincidence, it is likely that at least some of these repeated patterns correspond to the same underlying plaintext letters. What does this tell us about the length of the keyword? The length of the keyword would need to divide the distance between these repeated patterns. So we are looking for a number that divides many of the above distances. In this case, 3 divides many of these distances so we guess the keyword has three letters. We can also use the index of coincidence to give us information about the length of the keyword.

Subsection 3.4.3 Friedman Test

In order to use the index of coincidence to give us information about the length of the keyword we compute the index of coincidence in two different ways. First, using the exact number of each letter in the ciphertext as in Theorem 3.4.3. Second, we assume that \(n\) is the number of letters in the ciphertext and \(k\) is the length of the keyword. We divide the text into \(k\) columns and \(\frac{n}{k}\) rows as in Figure 3.4.7. Note that we are assuming \(k\) divides \(n\) evenly which is a small approximation. We then compute the probability of picking two identical letters as the probability of picking two identical letters from the same column or the probability of picking two identical letters from different columns.

Figure 3.4.7. Dividing a ciphertext into \(k\) columns.
Question 3.4.8.

What is the probability of picking two identical letters from the same column given that we are picking two letters from the entire text? To answer this question, we need to know:

  1. How many ways are there to pick two letters from the text of \(n\) letters?
  2. How many ways are there to pick two letters from the same column?
  3. How many ways are there to pick two identical letters from the same column?

For the first question, there are \(n\) letters in the ciphertext, so there are \(C(n,2)\) ways to pick two of them.

For the second question, we must first determine the number of ways to choose a column. There are \(k\) columns, so there are \(C(k,1)\) number of ways to choose one column. Once we have chosen a column, we must pick two letters from that column, so we next need to know how many ways there are to pick two letters from a column. Each column has \(\frac{n}{k}\) letters in it, so there are \(C\left(\frac{n}{k},2\right)\) ways to pick two letters from each column. Thus, the number of ways to pick two letters from the same column is the (number of ways to pick a column)\(\times\) (number of ways to pick two letters in a single column)

\begin{equation*} =C(k,1) C\left(\frac{n}{k},2\right)\text{.} \end{equation*}

For the third question, after we have picked two letters from the same column, what percentage of them will be identical? In this case, it is important to remember how each column will be encrypted in the Vigenère cipher. Each column is just a shift since every letter is encrypted with the same letter of the keyword! Thus the distribution of letters will be similar to normal English. Thus, the percentage of identical letters in this case should be close to 6.5%. Thus, the number of ways to pick two identical letters from the same column is the \(0.065\times\) (number of ways to pick a column)\(\times\) (number of ways to pick two letters in a single column)

\begin{equation*} =0.065 \, C(k,1) C\left(\frac{n}{k},2\right)\text{.} \end{equation*}

Combining all of that information gives:

Question 3.4.10.

What is the probability of picking two identical letters from different columns given that we are picking two letters from the entire text? To answer this question, we need to know:

  1. How many ways are there to pick two letters from the text of \(n\) letters?
  2. How many ways are there to pick two letters from the different columns?
  3. How many ways are there to pick two identical letters from different column?

The first question we already answered. For the second question, we first need to compute the number of ways to pick two different columns. Since there are \(k\) columns there are \(C(k,2)\) ways to choose two of them. Once we have chosen two columns, we must pick one letter from each column. Each column has \(\frac{n}{k}\) letters in it, so there are \(C\left(\frac{n}{k},1\right)\) ways to pick one letter from one column and \(C\left(\frac{n}{k},1\right)\) ways to pick one letter from the other column. Thus, the number of ways to pick two letters from different columns is the (number of ways to pick two columns)\(\times\) (number of ways to pick a letter from column~1)\(\times\) (number of ways to pick a letter from column 2)

\begin{equation*} =C(k,2)\left(\frac{n}{k}\right)\left(\frac{n}{k}\right). \end{equation*}

For the third question, after we have picked two letters from the same column, what percentage of them will be identical? In this case, it is important to remember different columns will be encrypted in the Vigenère cipher. They will be encrypted with different letters of the keyword so the distribution of letters should be more similar to an evenly distributed language rather than normal English. Thus, the percentage of identical letters in this case should be close to 3.85%. Thus, the number of ways to pick two identical letters from different columns is the 0.0385\(\times\)(number of ways to pick two columns)\(\times\) (number of ways to pick a letter from column 1)\(\times\) (number of ways to pick a letter from column 2)

\begin{equation*} =0.0385\,C(k,2)\left(\frac{n}{k}\right)\left(\frac{n}{k}\right). \end{equation*}

Combining all of that information gives:

We can now combine Proposition 3.4.9 and Proposition 3.4.11 to get an alternate way to calculate the index of coincidence in a Vigenère cipher.

The Friedman test (named after the William Friedman discussed in Chapter 1) sets these two different ways of calculating the index of coincidence equal to each other to solve for \(k\text{.}\) Let \(IC_{CT}\) be the index of coincidence calculated from Theorem 3.4.3. Then we must have

\begin{equation*} IC_{CT}=\frac{0.065 \, C(k,1)C\left(\frac{n}{k},2\right)+0.0385\,C(k,2)\left(\frac{n}{k}\right)\left(\frac{n}{k}\right)}{C(n,2)}. \end{equation*}

Solving for \(k\) in a lovely exercise in algebra (indeed, you will do this in the exercises) and yields

\begin{equation*} k =\frac{ 0.0265n}{(n-1)IC_{CT} +0.065 - 0.0385n} \end{equation*}

This gives us one estimate for the keylength. There were a fair number of approximations that we used in this formula so it will not be exact. However, if we combine both the Friedman and Kasiski tests together, we can get a pretty good idea for what the length of the keyword is.

Example 3.4.13. Applying the Kasiski and Friedman Tests.

Run the code below which will apply the Kasiski and Friedman tests. Use the ciphertext

GWGET QGWGE TQWWU OIZGB DRCNO MRLZQ ECDQI ZGKMX TPUWZ ECNET QIQXO MFRNM IMZQG EQIWH QZWPQ FIGOL DRVNR QXDVP EIWHM KHWRQ MIWIZ GBAHW RIZAJ EARTA IJMWI ZGBAQ OGHWR GWSDL PHZEI WDNIZ GJXLV PZWDP AEWQZ JTUCI DGAXH OMQLA ZTQWA ILVSI WDDKT DZYRN BREQU NGOBD RCNOM XLSQD PQOTN UWFKJ ALTMQ LNXJN OMORW XLBIL BTDJM EWAQA NOWAG BTHVF KMOKI DPQEI QDPIZ GOARL PRCNO MPRCN OMFRQ XDVPW ZAXJX HNUUM NXZZD VPFIG OLDRV NXJNO M

and determine your best estimate for the length of the keyword.

Sage Computation 3.4.14. Applying the Kasiski and Friedman Tests.

Remember that the Friedman test will not be exact. From the Friedman test we see that the keyword length is around 4. From the Kasiski test we saw that 3 divided most of the distances. Since 3 is close to 4, this is consistent, and we feel pretty confident that the length of the keyword is 3. But how does this help? It turns out once we know the length of the keyword it is pretty easy to find the actual keyword.

Subsection 3.4.4 Finding the Keyword

Once we know the length of the keyword we need to find the keyword itself. The important idea in this step is to remember that each letter of the keyword produces a shift cipher and we already know a lot about decrypting shift ciphers! We want to examine all the letters of the ciphertext that are encrypted by the same letter of the keyword. Then we can use frequency information to determine which shift is most likely.

For example, for the ciphertext

GWGET QGWGE TQWWU OIZGB DRCNO MRLZQ ECDQI ZGKMX TPUWZ ECNET QIQXO MFRNM IMZQG EQIWH QZWPQ FIGOL DRVNR QXDVP EIWHM KHWRQ MIWIZ GBAHW RIZAJ EARTA IJMWI ZGBAQ OGHWR GWSDL PHZEI WDNIZ GJXLV PZWDP AEWQZ JTUCI DGAXH OMQLA ZTQWA ILVSI WDDKT DZYRN BREQU NGOBD RCNOM XLSQD PQOTN UWFKJ ALTMQ LNXJN OMORW XLBIL BTDJM EWAQA NOWAG BTHVF KMOKI DPQEI QDPIZ GOARL PRCNO MPRCN OMFRQ XDVPW ZAXJX HNUUM NXZZD VPFIG OLDRV NXJNO M

we have already discovered the the most likely length for the keyword is 3. We arrange this text into three columns in Table 3.4.15 and need to examine the frequency information for each column separately to identify the shift used in each column.

Table 3.4.15. Arranging the ciphertext into three columns
G W G
E T Q
G W G
E T Q
W W U
O I Z
\(\vdots\)

The first column consists of every third letter in the ciphertext starting with the first letter G. All of the ciphertext letters in this column were encrypted with the first letter of the keyword. We want to compare the frequency distribution of all letters in this column and find a shift that will match it up to the frequency distribution of English. That is, we want to identify a shift that will match Figure 3.4.16.(a) with Figure 3.4.16.(b).

(a) First Column of Ciphertext
(b) Standard English
Figure 3.4.16. Comparing Frequency Distributions

There are a number of strong patterns in the English language to help us match up distributions. Since E is the most frequent letter in the English language we might try to match the most frequent letter in the ciphertext with E. Even better, we can use that both A and E in English should be common, so we can look for two spikes in the ciphertext with three much lower frequencies in between. We can look for the R,S,T spike, where three letters in a row should be common. Another good option is to look for the X,Y,Z low values. There should be three relatively low values in a row, with especially low values at X and Z. We will use Sage to help us visualize shifting the frequency distribution around until we find the best match with English.

Sage Computation 3.4.17. Finding the first letter of the keyword.

From Sage Computation 3.4.17 we can see that a shift of 0 is a very poor match for the English language. The cream bar graph and the red bar graph do not match up well at all in Figure 3.4.18.(a). However, in Figure 3.4.18.(b) we can see that the distributions do match up well (albeit imperfectly) for a shift of 3.

(a) Shift=0
(b) Shift=3
Figure 3.4.18. Applying Different Shifts with Sage Code

Thus we guess that the first letter of the keyword corresponds to a 3-shift which gives us the first letter of the keyword as D. (Remember that A corresponds to a 0-shift.) We proceed similarly to determine each letter of the keyword using Sage Computation 3.4.19.

Sage Computation 3.4.19. Finding each letter of the keyword.

Once you have found the keyword, we are all set to decrypt the message! Try it out in Sage Computation 3.4.20.

Sage Computation 3.4.20. Decrypting Vigenére with keyword.

Subsection 3.4.5 Cracking Vigenére Ciphers: A Summary

In this section we briefly summarize all of the steps needed to perform cryptanalysis on a Vigenére cipher.

  1. First Step: Use the the Kasiski and Friedman tests to determine the most likely length of the keyword.
  2. Second Step: Match frequency distributions to identify the shift for each letter of the keyword.
  3. Third Step: Decrypt the message using the keyword.

The Sage cell in Sage Computation 3.4.21 combines all of the code for decrypting a Vigenére cipher in one place.

Sage Computation 3.4.21. Combined Code for Vigenére Cryptanalysis.

Subsection 3.4.6 Perfect Security

We have seen how to attack Vigenére if the keylength is short compared to the ciphertext, but what if the keylength is long compared to the ciphertext? In this case we could not identify enough letters that were encrypted with each letter of the keyword to apply frequency distribution to identify the shift. Even worse, what if keylength were equal to the length of the text and consisted of random letters?

Example 3.4.22.

Suppose we intercept the ciphertext RCDMESVDJI and all we know is that the keyword consists of ten randomly chosen letters. What could the message say?

  1. If the keyword is KYSBQWHMYF we get the plaintext HELLOWORLD.
  2. If keyword is FYZTEZIPVV we get the plaintext MEETATNOON.
  3. If keyword is OOQTEZCDHY we get the plaintext DONTATTACK.

In fact, if the ciphertext is ten letters and the keyword is consists of ten randomly chosen letters than the plaintext could be any combination of ten letters! In this case the ciphertext us no information about the plaintext except the number of letters it has. Thus, if we use a random key which is as long as the message, and use a new key for each message than this method of encryption is perfectly secure because there is no way to get any information about the plaintext from the ciphertext. This is called a one time pad and is used in situations where you need an absolute guarantee that the ciphertext cannot be broken. Of course, you need a secure way to store many, long keywords since each keyword must be random and different for each encryption. This is not a very easy system to use.