Frequency Analysis

Section 2.10 Frequency Analysis

While keyword monoalphabetic ciphers are much easier to remember, they also make it easier for someone else to guess the plaintext-ciphertext pairing. A random monoalphabetic cipher, where each letter of the plaintext is associated with a random letter or symbol, is the hardest case to crack. Let's have a look at one and see if we can crack it! The following ciphertext comes from a monoalphabetic cipher in which the correspondence between plaintext and ciphertext has been chosen randomly. Word breaks are not preserved, and the letters are arranged in blocks of five (a tradition that dates back to the era of transmitting messages via telegraph). What techniques can we use to break this one?

LXSBQ ZOYZD OXAAB EWQAX JJAQS QZVEI QIJEJ JZELJ XYICY ZYIQO BYQTS QLJRZ QRHAJ REJYI LQOXJ BYHJA EPYZC YZJBQ ZQXRE WERJE VYHIJ YCSHZ QAKZY HJXIQ AEPYZ XIJBQ SZQSE ZEJXY IYCCZ QNHQI LKJEP AQRJB QZQEZ ZEIFQ VQIJY CLXSB QZCYZ QTEVX IEJXY IEIUJ BQJZX EAEIU CXJJX IFYCA QJJQZ JYAQJ JQZPQ CYZQJ BQVQR REFQP QFXIR JYESS QEZ

Since word breaks are not preserved, we lose many of our previous techniques. What about a brute force attack in this case? How many random monoalphabetic ciphers are there? The letter A could be enciphered with any of 26 possible symbols. For B, one symbol is already used, so it could be enciphered with any of the 25 remaining symbols. Similarly, C could be enciphered with any of the 24 remaining symbols. Continuing in this way down to Z, which must take the one symbol that is left, the total number of random monoalphabetic ciphers is

\begin{equation*} 26 \times 25 \times 24 \times \cdots \times 3 \times 2 \times 1 = 26!\text{.} \end{equation*}

In general, for a set of \(n\) objects, the number of different ways to arrange all \(n\) objects in the set is \(n! = n \times (n-1) \times (n-2) \times \cdots \times 2 \times 1\text{.}\)

If we calculate 26! we get a very large number,

\begin{equation*} 403291461126605635584000000 \approx 4.0329 \times 10^{26}\text{.} \end{equation*}

This is way too many options to try all of them!
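The count above can be checked directly. The following Python sketch computes 26! and then estimates how long an exhaustive search would take. The rate of one billion keys per second is an assumption chosen here for illustration, not a figure from the text.

```python
import math

def count_arrangements(n):
    """Number of ways to arrange n distinct objects: n * (n-1) * ... * 2 * 1."""
    return math.prod(range(1, n + 1))

keys = count_arrangements(26)
print(keys)  # 403291461126605635584000000

# Assumption for illustration only: one billion trial decryptions per second.
KEYS_PER_SECOND = 10**9
SECONDS_PER_YEAR = 60 * 60 * 24 * 365

years = keys / KEYS_PER_SECOND / SECONDS_PER_YEAR
print(f"about {years:.1e} years to try every key")
```

Even at that assumed rate, trying every key would take on the order of ten billion years.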

A better strategy than brute force for breaking monoalphabetic ciphers is to try to match the statistics of the ciphertext to the statistics of the underlying language. This technique, called frequency analysis, dates back to the Arab scholar Al-Kindi (c. 801–873 AD). In his manuscript, A Manuscript on Deciphering Cryptographic Messages, he writes:

One way to solve an encrypted message, if we know its [original] language, is to find a [different clear] text of the same language long enough to fill one sheet or so and then we count [the occurrences of] each letter of it. We call the most frequently occurring letter the “first”, the next most occurring the “second”, the following most occurring the “third” and so on, until we finish all different letters in the cleartext [sample]. Then we look at the cryptogram we want to solve and we also classify its symbols. We find the most occurring symbol and change it to the form of the “first” letter [of the cleartext sample], the next most common symbol is changed to the form of the “second” letter, and the following most common symbol is changed to the form of the “third” letter and so on, until we account for all symbols of the cryptogram we want to solve.

Quote taken from Ibrahim A. Al-Kadi (1992) “Origins of Cryptology: The Arab Contributions”, Cryptologia, 16:2, 97-126, DOI: 10.1080/0161-119291866801
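Al-Kindi's procedure can be read as a rank-matching algorithm: rank the letters of a sample text, rank the symbols of the cryptogram, and pair them off in order. The Python sketch below follows that reading. The function names `rank_by_frequency` and `rank_match` are made up for this illustration, and ties are broken alphabetically, which is a choice made here rather than part of the quoted method.

```python
from collections import Counter

def rank_by_frequency(text):
    """Letters of text from most to least frequent (ties broken alphabetically)."""
    counts = Counter(ch for ch in text.upper() if ch.isalpha())
    return [ch for ch, _ in sorted(counts.items(), key=lambda kv: (-kv[1], kv[0]))]

def rank_match(sample, cryptogram):
    """Pair the k-th most common cryptogram symbol with the k-th most common sample letter."""
    sample_rank = rank_by_frequency(sample)
    crypt_rank = rank_by_frequency(cryptogram)
    # zip stops at the shorter ranking; unmatched symbols are left as they are.
    key = dict(zip(crypt_rank, (p.lower() for p in sample_rank)))
    return "".join(key.get(ch, ch) for ch in cryptogram.upper())

# Toy example: X is the "first" symbol, Y the "second", Z the "third".
print(rank_match("aaabbc", "XXXYYZ"))  # aaabbc
```

On a message of only a few hundred letters, strict rank matching rarely recovers every letter, because the observed counts drift from the long-run averages. It is best treated as a source of first guesses to be refined.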

Figure 2.10.1. A depiction of Al-Kindi on a Syrian postage stamp.

Quirky Question: What do the words algebra, algorithm, zero and cipher have in common?

Answer: They all originate from words in Arabic. Algebra comes from al-jabr, which means 'the reunion of broken parts', and especially from the title of the book ‘ilm al-jabr wa'l-muqābala' written by a ninth-century mathematician known as al-Ḵwārizmī. His name is the origin of the word algorithm. Interestingly, zero and cipher come from the same word, sifr, which is derived from the word for empty or nothing. The concept of zero was initially a troubling and confusing one, so this may have led to the usage of the term for a hidden message.

To apply frequency analysis, then, we need a thorough understanding of the distribution of letters in the appropriate language. We introduce frequency analysis for the English language with a humorous presentation about alphabet characteristics. This was written by another pioneer in American cryptology, Elizebeth Smith Friedman. Elizebeth was a codemaker and codebreaker for the U.S. Navy and the U.S. Treasury Department's Bureau of Prohibition and Bureau of Customs. She was also married to the William Friedman discussed earlier. She wrote a manuscript on how to crack codes and ciphers which unfortunately was never published.

If I were appointed to act as final arbitrator at the peace conference between the warring letters what I would do would be to send my intelligence agents out and tell them to find out which ones deserved to be placed at the head of the class on the basis of the amount of work each one did. My agents would snoop around on gum shoes, spend a lot of my money and when they got tired would sit down and make an actual count of letters in a large amount of ordinary telegrams, and then like all that species called efficiency engineers, they’d probably turn in a report like this:

[parts omitted]

On the basis of our investigation if we were to arrange the letters in the order of their importance, judged by the work they do, they would appear as follows:

ETOANIRSHDLUCMPFYWGBVKJXZQ

You will also note from Figure 1 that ten letters, E, T, O, A, N, I, R, S, H, and D, do almost 75% of the work; the four vowels, A, E, I, and O doing about 35% of it; the six consonants, D, H, N, R, S, and T doing a bit more [than] 40%. This leaves 16 letters to do the rest, that is 25% of the work. The hardest workers are E, T, O, A, and N, and the worst slackers are J, K, Q, X, and Z.
Attached hereto is our bill for services rendered.

Very truly yours,

F. FISH ENSEE
―Elizebeth Smith Friedman, unpublished manuscript, George C. Marshall Foundation Friedman Collection

Figure 1 was not completed in her unfinished work but it would have looked something like the frequency distribution of the English alphabet in Figure 2.10.2.

Figure 2.10.2. A frequency distribution for the English alphabet based on a text of approximately 100 letters.

As a government employee Elizebeth Friedman was subject to regular efficiency reviews. One suspects she found some aspects of them comical.

Figure 2.10.3. A sketch of Elizebeth Smith Friedman from the Washington Evening Star, June 6, 1936.

As we can see, the first important component in frequency analysis is understanding the distribution of individual letters. This information is summarized in the chart below.

Table 2.10.4. Frequency Distribution of standard English language

Letter	A	B	C	D	E	F	G	H	I	J	K	L	M
Frequency (%)	8.17	1.49	2.78	4.25	12.70	2.23	2.02	6.09	6.97	0.15	0.77	4.03	2.40

Letter	N	O	P	Q	R	S	T	U	V	W	X	Y	Z
Frequency (%)	6.75	7.51	1.93	0.10	5.99	6.33	9.06	2.76	0.98	2.36	0.15	1.97	0.07

We can visualize this distribution to see how unevenly letters are used in the English language.

Figure 2.10.5. Frequency Distribution of standard English language

From this we can see the most frequent letters in the English language. While these percentages vary a little based on different texts in English, this provides important information relative to breaking monoalphabetic ciphers.

Table 2.10.6. Top Ten Most frequently occurring letters in English

Letter	E	T	A	O	I	N	S	H	R	D
Frequency (%)	12.70	9.06	8.17	7.51	6.97	6.75	6.33	6.09	5.99	4.25
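The top-ten table can be derived mechanically from the full distribution. The Python sketch below copies the percentages from Table 2.10.4 into a dictionary (the name `ENGLISH_FREQ` is chosen here for illustration), sorts the letters by frequency, and adds up the share of text covered by the ten most common letters.

```python
# Percentages copied from Table 2.10.4.
ENGLISH_FREQ = {
    "A": 8.17, "B": 1.49, "C": 2.78, "D": 4.25, "E": 12.70, "F": 2.23,
    "G": 2.02, "H": 6.09, "I": 6.97, "J": 0.15, "K": 0.77, "L": 4.03,
    "M": 2.40, "N": 6.75, "O": 7.51, "P": 1.93, "Q": 0.10, "R": 5.99,
    "S": 6.33, "T": 9.06, "U": 2.76, "V": 0.98, "W": 2.36, "X": 0.15,
    "Y": 1.97, "Z": 0.07,
}

# Letters from most to least frequent.
ranked = sorted(ENGLISH_FREQ, key=ENGLISH_FREQ.get, reverse=True)

top_ten = "".join(ranked[:10])
top_ten_share = sum(ENGLISH_FREQ[letter] for letter in ranked[:10])

print(top_ten)                  # ETAOINSHRD
print(round(top_ten_share, 2))  # 73.82
```

With these figures the ten most common letters cover just under three quarters of a typical text, which is in line with the "almost 75%" in the Friedman passage, even though her ordering of the letters differs slightly.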

The English language also has strong patterns with how letters combine with other letters. So the next important piece of information is digraph and trigraph frequency statistics. Digraphs are pairs of letters and trigraphs are groups of three letters. Again, these percentages will vary a little depending on the type of English text used. We reference an older collection from Parker Hitt in Figure 2.10.7, but it is fairly consistent with modern language.

Figure 2.10.7. Frequency Distribution of digraphs and trigraphs in English

Let's return to our ciphertext and examine its frequency information to try to determine the plaintext. We will first use some Sage code to calculate the frequency information for single letters, digraphs and trigraphs.

Sage Computation 2.10.8. Frequency Analysis for Monoalphabetic Cipher.
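As a plain-Python alternative to a Sage cell (a sketch, not the book's own code), the counts of single letters, digraphs and trigraphs in the ciphertext can be gathered with `collections.Counter`. The helper name `ngram_counts` is invented for this example. The blocks of five are arbitrary, so spaces are removed before counting and n-grams are allowed to run across block boundaries.

```python
from collections import Counter

CIPHERTEXT = (
    "LXSBQ ZOYZD OXAAB EWQAX JJAQS QZVEI QIJEJ JZELJ XYICY ZYIQO "
    "BYQTS QLJRZ QRHAJ REJYI LQOXJ BYHJA EPYZC YZJBQ ZQXRE WERJE "
    "VYHIJ YCSHZ QAKZY HJXIQ AEPYZ XIJBQ SZQSE ZEJXY IYCCZ QNHQI "
    "LKJEP AQRJB QZQEZ ZEIFQ VQIJY CLXSB QZCYZ QTEVX IEJXY IEIUJ "
    "BQJZX EAEIU CXJJX IFYCA QJJQZ JYAQJ JQZPQ CYZQJ BQVQR REFQP "
    "QFXIR JYESS QEZ"
)

def ngram_counts(text, n):
    """Count every run of n consecutive letters, ignoring spaces and punctuation."""
    letters = "".join(ch for ch in text.upper() if ch.isalpha())
    return Counter(letters[i:i + n] for i in range(len(letters) - n + 1))

letter_counts = ngram_counts(CIPHERTEXT, 1)
digraph_counts = ngram_counts(CIPHERTEXT, 2)
trigraph_counts = ngram_counts(CIPHERTEXT, 3)

for label, counts in [("letters", letter_counts),
                      ("digraphs", digraph_counts),
                      ("trigraphs", trigraph_counts)]:
    print(label, counts.most_common(8))
```

The same function works for any n, so longer repeated patterns can be examined by calling, for example, `ngram_counts(CIPHERTEXT, 4)`.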

Example 2.10.9.

From the Sage code, we see that the most frequent trigraph is JBQ. Since Q and J are the most frequent letters in the ciphertext, B is a medium frequency letter in the ciphertext, and JB is a common digraph, we have a strong guess that JBQ corresponds to THE in the plaintext. Replacing the letters J,B,Q with t,h,e respectively gives us:

LXShe ZOYZD OXAAh EWeAX ttAeS eZVEI eItEt tZELt XYICY ZYIeO hYeTS eLtRZ eRHAt REtYI LeOXt hYHtA EPYZC YZthe ZeXRE WERtE VYHIt YCSHZ eAKZY HtXIe AEPYZ XIthe SZeSE ZEtXY IYCCZ eNHeI LKtEP AeRth eZeEZ ZEIFe VeItY CLXSh eZCYZ eTEVX IEtXY IEIUt hetZX EAEIU CXttX IFYCA etteZ tYAet teZPe CYZet heVeR REFeP eFXIR tYESS eEZ
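The partially deciphered text follows a simple convention: ciphertext letters stay in upper case, and each guessed plaintext letter is written in lower case. A short Python sketch of that substitution, using a hypothetical helper `apply_key` and only the first five blocks of the message, might look like this:

```python
def apply_key(ciphertext, key):
    """Replace each ciphertext letter that has a guess; leave everything else unchanged."""
    return "".join(key.get(ch, ch) for ch in ciphertext)

sample = "LXSBQ ZOYZD OXAAB EWQAX JJAQS"

# Guesses so far: ciphertext letter -> plaintext letter.
key = {"J": "t", "B": "h", "Q": "e"}

partial = apply_key(sample, key)
print(partial)  # LXShe ZOYZD OXAAh EWeAX ttAeS
```

Because guesses are lower case and unknowns are upper case, it is always clear which letters are still open.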

We can now make a guess for ciphertext Z. The letter Z is the third most frequently occurring letter in the ciphertext, ZQ and QZ are frequently occurring digraphs, and BQZ is a frequently occurring trigraph. So we are looking for a common letter in English that pairs in both directions with Q=e: the digraphs _e and e_ must both be common, and the trigraph he_ must also be common. We guess that Z is most likely n or r. Since re and er are more common than ne and en, we guess Z=r. Making this replacement yields:

LXShe rOYrD OXAAh EWeAX ttAeS erVEI eItEt trELt XYICY rYIeO hYeTS eLtRr eRHAt REtYI LeOXt hYHtA EPYrC Yrthe reXRE WERtE VYHIt YCSHr eAKrY HtXIe AEPYr XIthe SreSE rEtXY IYCCr eNHeI LKtEP AeRth ereEr rEIFe VeItY CLXSh erCYr eTEVX IEtXY IEIUt hetrX EAEIU CXttX IFYCA etter tYAet terPe CYret heVeR REFeP eFXIR tYESS eEr

Note the word "there" appears in the plaintext, which is encouraging. We also see "IFYCA etter tYAet terPe" occurring, so we can start to guess words with "etter". We might guess "better", but A is common in the ciphertext and b is not that common in English, so perhaps "letter" is a better guess. This gives A=l.

LXShe rOYrD OXllh EWelX ttleS erVEI eItEt trELt XYICY rYIeO hYeTS eLtRr eRHlt REtYI LeOXt hYHtl EPYrC Yrthe reXRE WERtE VYHIt YCSHr elKrY HtXIe lEPYr XIthe SreSE rEtXY IYCCr eNHeI LKtEP leRth ereEr rEIFe VeItY CLXSh erCYr eTEVX IEtXY IEIUt hetrX ElEIU CXttX IFYCl etter tYlet terPe CYret heVeR REFeP eFXIR tYESS eEr

We might now consider E and Y in the ciphertext. These are very common in the ciphertext and likely correspond to a or o in the plaintext. Common digraphs include YZ and JY; with Z=r and J=t these correspond to _r and t_ in the plaintext. Both a and o fit these patterns (ar and ta, or and to), so the digraph counts alone do not settle the choice. As a first attempt, we guess Y=a and E=o.

LXShe rOarD OXllh oWelX ttleS erVoI eItot troLt XaICa raIeO haeTS eLtRr eRHlt RotaI LeOXt haHtl oParC arthe reXRo WoRto VaHIt aCSHr elKra HtXIe loPar XIthe SreSo rotXa IaCCr eNHeI LKtoP leRth ereor roIFe VeIta CLXSh erCar eToVX IotXa IoIUt hetrX oloIU CXttX IFaCl etter talet terPe Caret heVeR RoFeP eFXIR taoSS eor

OOPS! We now have the plaintext phrase "l etter talet ter". This doesn't make sense. However, "letter to letter" would make sense. Let's go back and swap the guesses for E and Y.

LXShe rOorD OXllh aWelX ttleS erVaI eItat traLt XoICo roIeO hoeTS eLtRr eRHlt RatoI LeOXt hoHtl aPorC orthe reXRa WaRta VoHIt oCSHr elKro HtXIe laPor XIthe SreSa ratXo IoCCr eNHeI LKtaP leRth erear raIFe VeIto CLXSh erCor eTaVX IatXo IaIUt hetrX alaIU CXttX IFoCl etter tolet terPe Coret heVeR RaFeP eFXIR toaSS ear
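Keeping the guesses in a dictionary makes a correction like this a one-line change. The Python sketch below (with a hypothetical `apply_key` helper) applies the first attempt and then the swapped guesses to the four blocks that exposed the problem.

```python
def apply_key(ciphertext, key):
    """Replace each ciphertext letter that has a guess; leave everything else unchanged."""
    return "".join(key.get(ch, ch) for ch in ciphertext)

fragment = "IFYCA QJJQZ JYAQJ JQZPQ"

# First attempt: Y=a and E=o.
key = {"J": "t", "B": "h", "Q": "e", "Z": "r", "A": "l", "Y": "a", "E": "o"}
first_try = apply_key(fragment, key)
print(first_try)   # IFaCl etter talet terPe

# Swap the two guesses so that Y=o and E=a.
key["E"], key["Y"] = key["Y"], key["E"]
second_try = apply_key(fragment, key)
print(second_try)  # IFoCl etter tolet terPe
```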

This looks better! We should be able to guess some of the words where we have a majority of the letters. From "lX ttle" we guess X=i. From "h aWe" we guess W=v. From "at traLt" we guess L=c.

ciShe rOorD Oillh aveli ttleS erVaI eItat tract ioICo roIeO hoeTS ectRr eRHlt RatoI ceOit hoHtl aPorC orthe reiRa vaRta VoHIt oCSHr elKro HtiIe laPor iIthe SreSa ratio IoCCr eNHeI cKtaP leRth erear raIFe VeIto CciSh erCor eTaVi Iatio IaIUt hetri alaIU Citti IFoCl etter tolet terPe Coret heVeR RaFeP eFiIR toaSS ear
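At this stage it helps to keep track of what is still open. The Python sketch below records the ten guesses made so far, checks that no plaintext letter has been assigned twice, and lists the plaintext letters that remain available. It uses only the first five blocks of the message, and the variable names are chosen for this illustration.

```python
import string

# Guesses so far: ciphertext letter -> plaintext letter.
key = {"J": "t", "B": "h", "Q": "e", "Z": "r", "A": "l",
       "E": "a", "Y": "o", "X": "i", "W": "v", "L": "c"}

# A monoalphabetic key is one-to-one, so no plaintext letter may repeat.
assert len(set(key.values())) == len(key)

sample = "LXSBQ ZOYZD OXAAB EWQAX JJAQS"
decoded = sample.translate(str.maketrans(key))
print(decoded)  # ciShe rOorD Oillh aveli ttleS

# Ciphertext letters in the sample that still have no guess.
unresolved = sorted(set(sample) - set(key) - {" "})
print(unresolved)  # ['D', 'O', 'S']

# Plaintext letters that have not been used yet.
unused_plain = "".join(c for c in string.ascii_lowercase if c not in key.values())
print(unused_plain)  # bdfgjkmnpqsuwxyz
```

Each new guess must pair an unresolved ciphertext letter with one of the unused plaintext letters, which narrows the choices as the key fills in.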

You should now be able to continue this process of identifying likely words in the plaintext to determine the remaining letter substitutions. Give it a try! If you get stuck, you should find a quote from Parker Hitt which we saw in the Introduction.

Answer: Cipher work will have little permanent attraction for one who expects results at once, without labor, for there is a vast amount of purely routine labor in the preparation of frequency tables, the rearrangement of cipher for examination, and the trial and fitting of letter to letter before the message begins to appear.