
Section 3 Substitution Ciphers

A substitution, or monoalphabetic, cipher is a letter-for-letter substitution (or symbol-for-letter substitution) where each plaintext letter is replaced by a different ciphertext letter (or symbol) and it is replaced by the same ciphertext symbol wherever it appears in the plaintext. Substitution ciphers date back to Julius Caesar in 50 B.C. and have been used extensively throughout history. They were first broken by Arab scholars in the 800s, but were used well into the 1800s.

Subsection 3.1 Shift Ciphers

Suetonius, a Roman historian born in the first century, wrote The Lives of the Twelve Caesars. In The Life of Julius Caesar, he describes one of the techniques Julius Caesar used for encrypting messages in approximately 50 B.C.

There are also letters of his to Cicero, as well as to his intimates on private affairs, and in the latter, if he had anything confidential to say, he wrote it in cipher, that is, by so changing the order of the letters of the alphabet, that not a word could be made out. If anyone wishes to decipher these, and get at their meaning, he must substitute the fourth letter of the alphabet, namely D, for A, and so with the others.

―Suetonius, The Life of Julius Caesar

In honor of Julius Caesar, this cipher is often called the Caesar cipher. While Suetonius describes the decryption process, encryption is the reverse. Each letter of the plaintext is shifted down by three letters in the alphabet.

As an example of how to encrypt or decrypt with this cipher, we encrypt the following famous quotation from Shakespeare's Julius Caesar.

To encrypt the plaintext “Beware the ides of March” we shift each letter down three letters in the alphabet.

Table 3.2. Caesar shift of a plaintext
B E W A R E \(\;\;\) T H E \(\;\;\) I D E S \(\;\;\) O F \(\;\;\) M A R C H
C F X B S F \(\;\;\) U I F \(\;\;\) J E F T \(\;\;\) P G \(\;\;\) N B S D I
D G Y C T G \(\;\;\) V J G \(\;\;\) K F G U \(\;\;\) Q H \(\;\;\) O C T E J
E H Z D U H \(\;\;\) W K H \(\;\;\) L G H V \(\;\;\) R I \(\;\;\) P D U F K

Thus, our cipher text is EHZDUH WKH LGHV RI PDUFK.

We can think of the encryption process for this cipher for every letter of the alphabet in Table 3.3 where each plaintext letter (PT) will be encrypted by the letter below it in the table. Decryption corresponds to the reverse process. Note that at the end of the alphabet we need to wrap around to the beginning of the alphabet.

Table 3.3. The Caesar shift encryption
PT A B C D E F G H I J K L M N O P Q R S T U V W X Y Z
CT D E F G H I J K L M N O P Q R S T U V W X Y Z A B C
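The wrap-around in Table 3.3 amounts to arithmetic modulo 26. A minimal Python sketch follows, with the shift amount left as a parameter; the function names `shift_encrypt` and `shift_decrypt` are chosen here for illustration and are not part of the text's Sage code.

```python
import string

ALPHABET = string.ascii_uppercase  # "ABC...XYZ"

def shift_encrypt(plaintext, shift):
    """Shift each letter forward by `shift` places, wrapping from Z back to A."""
    result = []
    for ch in plaintext.upper():
        if ch in ALPHABET:
            # Position in the alphabet (A=0, ..., Z=25), shifted modulo 26.
            result.append(ALPHABET[(ALPHABET.index(ch) + shift) % 26])
        else:
            # Leave spaces and punctuation unchanged.
            result.append(ch)
    return "".join(result)

def shift_decrypt(ciphertext, shift):
    """Decryption is encryption with the shift reversed."""
    return shift_encrypt(ciphertext, -shift)

print(shift_encrypt("Beware the ides of March", 3))  # EHZDUH WKH LGHV RI PDUFK
```

Because Python's `%` operator always returns a non-negative result for a positive modulus, the same function handles the negative shift used for decryption.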

While the historical reference involves shifting the alphabet by three places, we could just as easily perform similar ciphers with a shift of any amount between 1 and 25. These are called shift ciphers or sometimes Caesar ciphers. Curiously, Suetonius also writes that Augustus Caesar, the great-nephew of Julius Caesar, used a shift cipher.

Whenever he wrote in cipher, he wrote B for A, C for B, and the rest of the letters on the same principle, using AA for Z.

―Suetonius, The Life of Augustus Caesar

It's unclear why Augustus didn't like the idea of wrapping around in the alphabet to just use A for Z.

It's especially convenient to encrypt a shift cipher with a cipher disk. Some interesting historical cipher disks are pictured in Figures 3.4 and 3.5.

Figure 3.4. Captain Midnight Secret Squadron Decoder Badge from 1946
Figure 3.5. A replica of a Civil War era cipher disk.

To use a cipher disk to encrypt or decrypt a shift cipher, rotate the disk to the appropriate shift. For example, if A on the outer wheel is aligned with D on the inner wheel, then a shift by three for all letters of the alphabet could be found around the wheel with plaintext letters on the outer wheel and ciphertext letters on the inner wheel. You can examine a cipher wheel in action in the Sage interactive code in Computation 3.6. Just click Evaluate (Sage) and use the slider to select the desired shift. In this case, Sage is implementing the given encryption shift with the plaintext letter on the outer wheel and the ciphertext letter on the inner wheel.

Computation 3.6. Cipher Wheel.

Encrypt the message “LOVELY” with a 10-shift. Answer

VYFOVI

Decrypt the message RYGNI that was encrypted with a 10-shift. Answer

HOWDY

Subsection 3.2 Pigpen Ciphers

A pigpen cipher is a geometric substitution cipher, where symbols are used to represent letters based on how the letters are drawn in grids or “pens”. They have been used by the Society of Freemasons and are also sometimes called the masonic or Freemason's cipher. This cipher is believed to be quite ancient, and variations have been used by the Knights Templar and Rosicrucians. Its use was documented in George Washington's army, and it was used by Union prisoners in Confederate prisons. It was also used in the 17th century in civil wars in England. One way that letters can be assigned to a grid is given in Figure 3.9.

Figure 3.9. A grid arrangement for a pigpen cipher

Each letter is represented by the portion of the grid surrounding the letter. Since each grid shape is used twice, the second version also includes a dot to distinguish the letter.

Figure 3.10. Sample letter conversions for a pigpen cipher

Variations of the pigpen cipher involve other arrangements of the letters, or other types of grids or pens to hold the letters.

Encrypt the message “LOVELY” with a standard pigpen cipher. Answer

Decrypt the message given below with a standard pigpen cipher.

Answer
HELLO

Subsection 3.3 Polybius Cipher

The Polybius cipher, also known as the Polybius square or Polybius checkerboard, dates back to the ancient Greeks and is described in Polybius's Histories. This cipher also uses a grid, in this case a \(5 \times 5\) or a \(6 \times 6\) grid. Letters are represented by two numbers, the first indicating the row and the second indicating the column of the grid. The Greek alphabet had only 24 characters, so a \(5 \times 5\) grid was sufficient. For the standard English alphabet we will have to combine I and J as a single letter to make this work.

1 2 3 4 5
1 A B C D E
2 F G H I K
3 L M N O P
4 Q R S T U
5 V W X Y Z

Encrypt the message LOVELY. Answer

31 34 51 15 31 54

Decrypt the message 22 34 34 14 12 54 15 Answer

GOODBYE

Note the disadvantage that the ciphertext is twice as long as the plaintext, since each letter is represented by two numbers. However, this was actually part of the design, as it allows messages to be sent over long distances by fire/light or flags. For example, the letter L could be signaled by 3 torches in the right hand and 1 torch in the left hand. One could also signal with knocks, flashing lamps, blasts of sound, drums or smoke signals.

We can make this more difficult to break by using a keyword, or keywords, such as WONDERFUL AWARD, to rearrange the alphabet in the square. We can only enter each letter once in our grid, so we only keep a letter the first time it appears in the keyword. Thus, the keyword will be entered in our grid as WONDERFULA, since we must eliminate the second W, A, R, and D from the keyword. The remaining letters of the alphabet then fill the rest of the grid in alphabetical order, with I and J again sharing a cell.

1 2 3 4 5
1 W O N D E
2 R F U L A
3 B C G H I
4 K M P Q S
5 T V X Y Z

Encrypt the message LOVELY. Answer

24 12 52 15 24 54

Decrypt the message 11 25 32 41 54 Answer

WACKY
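The square construction and the row-column lookup can be sketched in Python as follows. This is a minimal illustration for the \(5 \times 5\) case only; the function names are chosen here for convenience, and the sketch assumes J is always merged into I.

```python
import string

def build_square(keyword=""):
    """Return the 25 letters of a 5x5 Polybius square, row by row.

    Keyword letters come first (first occurrence only), followed by the
    unused letters in alphabetical order. J is merged into I.
    """
    letters = []
    for ch in keyword.upper() + string.ascii_uppercase:
        if ch == "J":
            ch = "I"
        if ch in string.ascii_uppercase and ch not in letters:
            letters.append(ch)
    return letters

def polybius_encrypt(plaintext, keyword=""):
    """Replace each letter by its row digit followed by its column digit."""
    square = build_square(keyword)
    codes = []
    for ch in plaintext.upper().replace("J", "I"):
        if ch in square:
            row, col = divmod(square.index(ch), 5)
            codes.append(f"{row + 1}{col + 1}")
    return " ".join(codes)

def polybius_decrypt(ciphertext, keyword=""):
    """Look up each two-digit code (row, column) in the square."""
    square = build_square(keyword)
    return "".join(
        square[(int(code[0]) - 1) * 5 + (int(code[1]) - 1)]
        for code in ciphertext.split()
    )

print(polybius_encrypt("LOVELY"))                     # 31 34 51 15 31 54
print(polybius_encrypt("LOVELY", "WONDERFUL AWARD"))  # 24 12 52 15 24 54
print(polybius_decrypt("11 25 32 41 54", "WONDERFUL AWARD"))  # WACKY
```

With an empty keyword the function reproduces the standard square, so the same code covers both sets of exercises above.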

The grid can be expanded to a \(6 \times 6\) to add the numbers 0-9 (and not have to double up any letter). For example, we will create a grid with the keyword “MOUNTAINS”. A method must also be agreed upon for how to place the digits 0-9 into the grid.

1 2 3 4 5 6
1 M O U 1 N T
2 A 2 I S B 3
3 C D E 4 F G
4 H 5 J K L 6
5 P Q 7 R V 8
6 W X 9 Y Z 0

Encryption and decryption work just as before with the additional row and column.

Subsection 3.4 Frequency Analysis

We have seen several different types of substitution ciphers, and there are many more ways to create such a cipher. A random substitution cipher, where each plaintext letter is assigned a randomly chosen letter or symbol, is the hardest case to crack. Let's have a look at one and see if we can crack it! The following ciphertext is a monoalphabetic cipher, but the correspondence between plaintext and ciphertext has been chosen randomly. Word breaks are not preserved and the letters are arranged in blocks of five (a tradition that dates back to the era of transmitting messages via telegraph). What techniques can we use to break this one?

LXSBQ ZOYZD OXAAB EWQAX JJAQS QZVEI QIJEJ JZELJ XYICY ZYIQO BYQTS QLJRZ QRHAJ REJYI LQOXJ BYHJA EPYZC YZJBQ ZQXRE WERJE VYHIJ YCSHZ QAKZY HJXIQ AEPYZ XIJBQ SZQSE ZEJXY IYCCZ QNHQI LKJEP AQRJB QZQEZ ZEIFQ VQIJY CLXSB QZCYZ QTEVX IEJXY IEIUJ BQJZX EAEIU CXJJX IFYCA QJJQZ JYAQJ JQZPQ CYZQJ BQVQR REFQP QFXIR JYESS QEZ

Since word breaks are not preserved, we must focus on characteristics of the given alphabet. What about a brute force attack in this case? How many random monoalphabetic ciphers are there? For the letter A it could be enciphered with any of 26 possible symbols. For B, one symbol is already used so it could be enciphered with any of the 25 remaining symbols. Similarly, C could be enciphered with any of the 24 remaining symbols. Thus, the total number of random monoalphabetic ciphers is

\begin{equation*} 26 \times 25 \times 24 \times \cdots \times 3 \times 2 \times 1 = 26!\text{.} \end{equation*}

In general, for a set of \(n\) objects, the number of different ways to arrange all \(n\) objects in the set is \(n!=n \times (n-1) \times (n-2) \times \cdots \times 2 \times 1\text{.}\)

If we calculate 26! we get a very large number,
\begin{equation*} 403291461126605635584000000=4.0329 \times 10^{26}\text{.} \end{equation*}
This is way too many options to try all of them!
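To get a feel for the size of this number, the short Python sketch below computes 26! exactly and estimates how long an exhaustive search would take. The rate of one billion keys per second is an assumption made purely for illustration.

```python
import math

keys = math.factorial(26)
print(keys)           # 403291461126605635584000000
print(f"{keys:.4e}")  # 4.0329e+26

# Hypothetical attacker speed (an assumption for illustration only).
keys_per_second = 10**9
seconds_per_year = 60 * 60 * 24 * 365
years = keys / (keys_per_second * seconds_per_year)
print(f"{years:.2e} years to try every key")  # on the order of 10^10 years
```

Even at that assumed rate, trying every key would take on the order of ten billion years.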

A better strategy than brute force for breaking monoalphabetic ciphers is to try to match the statistics of the ciphertext to the statistics of the underlying language. This technique, called frequency analysis, dates back to the Arab scholar Al-Kindi (c. 801–873 AD). In his manuscript, A Manuscript on Deciphering Cryptographic Messages, he writes

One way to solve an encrypted message, if we know its [original] language, is to find a [different clear] text of the same language long enough to fill one sheet or so and then we count [the occurrences of] each letter of it. We call the most frequently occurring letter the “first”, the next most occurring the “second”, the following most occurring the “third” and so on, until we finish all different letters in the cleartext [sample]. Then we look at the cryptogram we want to solve and we also classify its symbols. We find the most occurring symbol and change it to the form of the “first” letter [of the cleartext sample], the next most common symbol is changed to the form of the “second” letter, and the following most common symbol is changed to the form of the “third” letter and so on, until we account for all symbols of the cryptogram we want to solve.
Quote taken from Ibrahim A. Al-Kadi (1992) “Origins of Cryptology: The Arab Contributions”, Cryptologia, 16:2, 97-126, DOI: 10.1080/0161-119291866801

Figure 3.17. A depiction of Al-Kindi on a Syrian postage stamp.

Quirky Question: What do the words algebra, algorithm, zero and cipher have in common? Answer

They all originate from words in Arabic. Algebra comes from al-jabr, which means ‘the reunion of broken parts’, and especially from the title of the book ‘ilm al-jabr wa'l-muqābala’ written by a ninth-century mathematician known as al-Ḵwārizmī. His name is the origin of the word algorithm. Interestingly, zero and cipher come from the same word, sifr, which derives from the word for empty or nothing. The concept of zero was initially a troubling and confusing one, so this may have led to the usage of the term for a hidden message.

To apply frequency analysis, then, we need a thorough understanding of the distribution of letters in the appropriate language. We introduce frequency analysis for the English language with a humorous presentation about alphabet characteristics. This was written by another pioneer in American cryptology, Elizebeth Smith Friedman. Elizebeth was a codemaker and codebreaker for the U.S. Navy and the U.S. Treasury Department's Bureau of Prohibition and Bureau of Customs. She was also married to the William Friedman discussed earlier. She wrote a manuscript on how to crack codes and ciphers which unfortunately was never published.

If I were appointed to act as final arbitrator at the peace conference between the warring letters what I would do would be to send my intelligence agents out and tell them to find out which ones deserved to be placed at the head of the class on the basis of the amount of work each one did. My agents would snoop around on gum shoes, spend a lot of my money and when they got tired would sit down and make an actual count of letters in a large amount of ordinary telegrams, and then like all that species called efficiency engineers, they’d probably turn in a report like this:

[parts omitted]

On the basis of our investigation if we were to arrange the letters in the order of their importance, judged by the work they do, they would appear as follows:

ETOANIRSHDLUCMPFYWGBVKJXZQ

You will also note from Figure 1 that ten letters, E, T, O, A, N, I, R, S, H, and D, do almost 75% of the work; the four vowels, A, E, I, and O doing about 35% of it; the six consonants, D, H, N, R, S, and T doing a bit more 40%. This leaves 16 letters to do the rest, that is 25% of the work. The hardest workers are E, T, O, A, and N, and the worst slackers are J, K, Q, X, and Z.

Attached hereto is our bill for services rendered.

Very truly yours,

F. FISH ENSEE

―Elizebeth Smith Friedman, unpublished manuscript, George C. Marshall Foundation Friedman Collection

Figure 1 was not completed in her unfinished work but it would have looked something like the frequency distribution of the English alphabet in Figure 3.18.

Figure 3.18. A frequency distribution for the English alphabet based on a text of approximately 100 letters.

As a government employee Elizebeth Friedman was subject to regular efficiency reviews. One suspects she found some aspects of them comical.

Figure 3.19. A sketch of Elizebeth Smith Friedman from the Washington Evening Star, June 6, 1936.

As we can see, the first important component in frequency analysis is understanding the distribution of individual letters. This information is summarized in the chart below.

Table 3.20. Frequency Distribution of standard English language
Letter A B C D E F G H I J K L M
Frequency (%) 8.17 1.49 2.78 4.25 12.70 2.23 2.02 6.09 6.97 0.15 0.77 4.03 2.40
\(\;\;\) \(\;\;\) \(\;\;\) \(\;\;\) \(\;\;\) \(\;\;\) \(\;\;\) \(\;\;\) \(\;\;\) \(\;\;\) \(\;\;\) \(\;\;\) \(\;\;\) \(\;\;\)
Letter N O P Q R S T U V W X Y Z
Frequency (%) 6.75 7.51 1.93 0.10 5.99 6.33 9.06 2.76 0.98 2.36 0.15 1.97 0.07

We can visualize this distribution to see how unevenly letters are used in the English language.

Figure 3.21. Frequency Distribution of standard English language

From this we can see the most frequent letters in the English language. While these percentages vary a little based on different texts in English, this provides important information relative to breaking monoalphabetic ciphers.

Table 3.22. Top Ten Most frequently occurring letters in English
Letter E T A O I N S H R D
Frequency (%) 12.70 9.06 8.17 7.51 6.97 6.75 6.33 6.09 5.99 4.25

The English language also has strong patterns with how letters combine with other letters. So the next important piece of information is digraph and trigraph frequency statistics. Digraphs are pairs of letters and trigraphs are groups of three letters. Again, these percentages will vary a little depending on the type of English text used. We reference an older collection from Parker Hitt in Figure 3.23, but it is fairly consistent with modern language.

Parker Hitt, "Manual for the Solution of Military Ciphers", George C. Marshall Foundation Friedman Collection
Figure 3.23. Frequency Distribution of digraphs and trigraphs in English

Let's return to our ciphertext and examine its frequency information to try to determine the plaintext. We will first use some Sage code to calculate the frequency information for single letters, digraphs and trigraphs.

Computation 3.24. Frequency Analysis for Monoalphabetic Cipher.
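One way such counts could be produced in plain Python is sketched below. It is not the Sage cell itself; `ngram_counts` is a name chosen for illustration. Spaces are stripped before counting because the blocks of five do not correspond to word breaks.

```python
from collections import Counter

def ngram_counts(text, n):
    """Count overlapping n-letter sequences, ignoring spaces and non-letters."""
    letters = "".join(ch for ch in text.upper() if ch.isalpha())
    return Counter(letters[i:i + n] for i in range(len(letters) - n + 1))

ciphertext = (
    "LXSBQ ZOYZD OXAAB EWQAX JJAQS QZVEI QIJEJ JZELJ XYICY ZYIQO BYQTS "
    "QLJRZ QRHAJ REJYI LQOXJ BYHJA EPYZC YZJBQ ZQXRE WERJE VYHIJ YCSHZ "
    "QAKZY HJXIQ AEPYZ XIJBQ SZQSE ZEJXY IYCCZ QNHQI LKJEP AQRJB QZQEZ "
    "ZEIFQ VQIJY CLXSB QZCYZ QTEVX IEJXY IEIUJ BQJZX EAEIU CXJJX IFYCA "
    "QJJQZ JYAQJ JQZPQ CYZQJ BQVQR REFQP QFXIR JYESS QEZ"
)

# Show the five most common single letters, digraphs, and trigraphs.
for n, label in [(1, "letters"), (2, "digraphs"), (3, "trigraphs")]:
    print(label, ngram_counts(ciphertext, n).most_common(5))
```

`Counter.most_common` returns the entries sorted from most to least frequent, which is the ranking Al-Kindi's method calls for.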

From the Sage code, we see that the most frequent trigraph is JBQ. Since Q and J are the most frequent letters in the ciphertext, B is a medium frequency letter in the ciphertext, and JB is a common digraph, we have a strong guess that JBQ corresponds to THE in the plaintext. Replacing the letters J,B,Q with t,h,e respectively gives us:

LXShe ZOYZD OXAAh EWeAX ttAeS eZVEI eItEt tZELt XYICY ZYIeO hYeTS eLtRZ eRHAt REtYI LeOXt hYHtA EPYZC YZthe ZeXRE WERtE VYHIt YCSHZ eAKZY HtXIe AEPYZ XIthe SZeSE ZEtXY IYCCZ eNHeI LKtEP AeRth eZeEZ ZEIFe VeItY CLXSh eZCYZ eTEVX IEtXY IEIUt hetZX EAEIU CXttX IFYCA etteZ tYAet teZPe CYZet heVeR REFeP eFXIR tYESS eEZ
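This kind of partial substitution is easy to automate. The sketch below (the function name `apply_partial_key` is illustrative) follows the convention used above: guessed plaintext letters appear in lowercase, while ciphertext letters that have not yet been guessed stay in uppercase.

```python
def apply_partial_key(ciphertext, key):
    """Replace guessed ciphertext letters with lowercase plaintext letters.

    Letters without a guess stay uppercase, so solved and unsolved
    positions are easy to tell apart.
    """
    return "".join(key.get(ch, ch) for ch in ciphertext)

guesses = {"J": "t", "B": "h", "Q": "e"}
print(apply_partial_key("LXSBQ ZOYZD OXAAB EWQAX JJAQS", guesses))
# LXShe ZOYZD OXAAh EWeAX ttAeS
```

Each new guess is one more entry in the dictionary, so a wrong guess can be undone simply by deleting its entry and rerunning the substitution.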

We can now make a guess for ciphertext Z. The letter Z is the third most frequently occurring letter in the ciphertext, ZQ and QZ are frequently occurring digraphs, and BQZ is a frequently occurring trigraph. So Z must be a common English letter that pairs in both directions with Q=e: the digraphs _E and E_ must both be common, and the trigraph HE_ must also be common. We guess that Z is most likely N or R. Since RE and ER are more common than NE and EN, we guess Z=r. You should now be able to continue this process of identifying likely words in the plaintext to determine the remaining letter substitutions. Give it a try if you like! Answer

Cipher work will have little permanent attraction for one who expects results at once, without labor, for there is a vast amount of purely routine labor in the preparation of frequency tables, the rearrangement of cipher for examination, and the trial and fitting of letter to letter before the message begins to appear.

While we can use the technique of frequency analysis for any substitution cipher, we will first examine some common puzzles called cryptograms. While originally the word cryptogram meant any sort of encrypted message, currently it most often refers to a specific kind of cipher often found in newspapers and puzzle books. A cryptogram is a letter-for-letter substitution where each plaintext letter is replaced by a different ciphertext letter (or symbol) and it is replaced by the same ciphertext symbol wherever it appears in the plaintext. Word breaks are preserved and no letter is enciphered as itself. We'll start with an example.

Let's decrypt our first secret message! In this case we know the process for encrypting the message is the cryptogram process given above, but we don't know the exact letter for letter substitution that we should use. We will perform cryptanalysis to determine the letter for letter substitution and decrypt the message.

DKUUNDD AD MQ BUUAZNML. AL AD PBOZ FQOS, RNODNCNOBMUN,

TNBOMAMW, DLKZJAMW, DBUOAEAUN BMZ GQDL QE BTT, TQCN

QE FPBL JQK BON ZQAMW QO TNBOMAMW LQ ZQ. –RNTN

What patterns do you see that can help identify underlying plaintext? Remember that word breaks are preserved.

Hint

There are many strong letter patterns in the cipher text. One of the easiest places to start is two letter words.

  • For example, consider the pair of two-letter words AL AD. Relatively few pairs of two-letter words start with the same letter. The options are IN, IS, IT, IF; AS, AT, AM, AN; and OR, OF, ON. Since this pair of two-letter words starts a sentence, the most likely possibilities are IT IS or AS AN.

  • We also have the pattern AL and later LQ. The repeated L, once as a first letter and once as a second letter is a strong pattern. LQ is most likely TO or SO. In either case, Q must be O.

After two letter words, we might try three letter words.

  • Another strong pattern is BTT. Not all letters in the English alphabet make sense as double letters, and there are few three-letter words with double letters. The most common are ALL and TOO. Since we have already guessed that Q is O, we guess that BTT is ALL.

Another pattern that we might consider is common word endings.

  • Four of the words end in -AMW. One of the most common word endings is ING. So we guess AMW corresponds to ING.

In addition to word patterns, we can examine single letter patterns.

  • We might also consider the cipher word RNODNCNOBMUN. This single word has four Ns. And N occurs quite frequently in the cipher text. Thus we guess this letter corresponds to one of the most frequent letters in English, E, T, A, O, I, or N.

Use the guesses we have made for letters above and decrypt the rest of the message. After trying it out, you can check your answer. But try it on your own first!

Answer

The plain text for the message is: Success is no accident. It is hard work, perseverance, learning, studying, sacrifice and most of all, love of what you are doing or learning to do. –Pele

As we saw in the previous example, there are lots of properties of the English language that help us to solve cryptograms. Some starting points for solving any cryptogram are short words, double letters, and frequency of letters. It is especially useful to know the following:

  • The most common one letter words are: A, I.
  • Common two letter words are: am, an, as, at, if, in, is, it, of, on, or, do, go, no, so, to, be, he, me, we, by, my, up, us.
  • Common three letter words with repeated letters are: all, see, did, too.
  • Other common three letter words are: the, and, for, are, but, not, you, any, can, had, her, was, one, our, out, day, get, has, him, his, how.
  • Common four letter words with repeated letters are: that, will, been, good.
  • Other common four letter words are: with, have, this, your, from, they, know, want, much, some, time.
  • Common double letters include: LL,EE,SS,TT,OO,MM,FF,PP,RR,NN,CC,DD.
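Matching a cipher word against lists like these can be done mechanically by comparing letter-repetition patterns. The Python sketch below is one possible approach; the helper `word_pattern` and the short candidate list are illustrative assumptions, not part of the text.

```python
def word_pattern(word):
    """Map a word to a tuple recording where letters repeat.

    The first distinct letter becomes 0, the next new letter 1, and so on,
    so BTT, ALL, and TOO all give (0, 1, 1).
    """
    seen = {}
    return tuple(seen.setdefault(ch, len(seen)) for ch in word.upper())

# A small sample of common three-letter words from the lists above.
candidates = ["all", "see", "did", "too", "the", "and"]
matches = [w for w in candidates if word_pattern(w) == word_pattern("BTT")]
print(matches)  # ['all', 'see', 'too']
```

The pattern only narrows the field; letters already guessed elsewhere in the cryptogram decide among the remaining candidates.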
Activity 3.1. Cryptograms.
Solve the following cryptograms. These cryptograms are taken from the Redlands Daily Facts newspaper. (It is likely that they are terrible puns.) Each cryptogram uses a different substitution cipher.
  1. UCF JQGMNV VKX'U CKSF OFUKNMFO NXPAWIKUNAX KGAQU UCKU AMO US JAAVC. NU'T MKTTNF-PNFO IKUFWNKM.

  2. WY PZBV KYNZBYC VR CJBFZBB KRBYEZBWYB AJVW DY SOGDRKY. JV ASB VRR DZFW RN S VWRKOG BZEPYFV.

  3. UYDYAKHIQ JMYYB WMHW ZYTHNY PHNEVJ PEI VJLAD WMY YAWIQ TENNHAK "EBYA JYJHNY!": HUL ZHH-ZHH

Hint
Hints: 1) S=V 2) K=R 3) P=F,B=P