Substitution Ciphers

(Last Mod: 27 November 2010 21:38:01 )

Overview

As the name implies, a substitution cipher is one that only involves symbol substitution using some defined set of rules that both parties know. The rules can be broken into two sets: the first set is common to every application of the cipher and forms the basic algorithm while the second set varies from one application to the next and forms the key. In general, it should be assumed that an adversary has access to the basic algorithm and that the only thing that can be kept secret from them is the key. The strength or weakness of the cipher is then evaluated based on this assumption, generally known as Kerckhoffs' Principle, also known as Shannon's Maxim. For example, in the simple shift cipher we will look at shortly, the general rule that applies to every use of the cipher is that each letter in the plaintext is replaced by the letter that is N places further along in the alphabet, while the key rule that applies to a particular use of the cipher is the value of N. Since it is assumed that the attacker has full knowledge of all the general rules, the strength of the cipher has an upper bound established by the size of the key space. For the simple shift cipher, the value of N is limited by the number of characters in the alphabet; in the case of English, that limits N to a value of 26 (of which only 25 are non-trivial).

Alphabetic Ciphers

While we normally think of the "alphabet" as being the set of characters that make up the plaintext language, it is really more general than that and is more properly thought of as the set of symbols from which the plaintext is constructed. For instance, our normal interpretation would mean that, if our plaintext consisted of nothing more than uppercase English letters, that the size of our alphabet would be 26 and that the length of our message is simply the number of characters in it. However, there is nothing to prevent us from defining the symbols of our alphabet to be pairs of adjacent letters, in which case our message would only be half as long, but our alphabet would consist of 26*26=576 symbols (and we have to decide how to deal with messages that start out with an odd number of characters and, most importantly, to do so in a way that does not provide an avenue for attack). In a similar manner, we could define each symbol to be a block of four bytes in the data, making the alphabet consist of over four billion symbols. The point being that, while our example ciphers will use the term "alphabet" in the intuitive sense, the principles involve translate directly to more generalized, and arbitrarily large, alphabets.

The rules of a substitution cipher involve mappings of the symbols in the plaintext language onto the symbols in the ciphertext language. The two languages do not have to consist of the same set of symbols, though they generally do. In fact, it is not required that the mappings be one-to-one, although they almost always are. If they aren't, then the plaintext is, in general, not exactly recoverable from the ciphertext. However, some historical ciphers accepted this shortcoming in favor of more easily remembered and used ciphers, relying on the recipient's ability to detect and correct the resulting errors. Unless otherwise noted, however, we will limit the discussion to ciphers in which the mappings are one-to-one and in which the plaintext and ciphertext alphabets are identical.

Each mapping from the plaintext language to the ciphertext language can be represented in many forms. Perhaps the most general is to provide an explicit list of which plaintext symbols map to which ciphertext symbols, such as (A=>D), (B=>E), (C=>F), and so forth up through (Y=>B) and (Z=>C). This representation requires the no assumptions about either alphabet, in particular, it doesn't matter whether the alphabet has any accepted ordering or not. If the alphabet does have an accepted ordering, then a simpler representation can be used in which the ciphertext symbols are simply listed in the order to which they map from the plaintext symbols, such as {D,E,F,...,B,C}. As long as the plaintext alphabet ordering is well understood, that representation still permits any possible one-to-one mapping to be described. However, if we constrain the allowed mappings to a sufficiently structured subset of possible mappings, then it becomes possible to associate each mapping with a much simpler nomenclature that is easier to remember and transfer. For instance, notice that the mapping we have been using so far is nothing more than the alphabet rotated three places to the left. Thus, as long as all of our mappings consist of simple rotations of the alphabet to the left by N places, then we can convey the entire map by conveying just the value of N. In doing so, however, we have established a general rule (remember, a general rule is one that applies to every application of the given cipher) which says that the map is a simple shift of the plaintext alphabet to the left by N places. Our key (remember, the key is the information that can vary from one application of the cipher to the next) is the value of N. In this case, the values that N can assume are limited to the 26 values from 0 to 25 (inclusive).

Mono-Alphabetic Ciphers

A Mono-Alphabetic Cipher is one in which only a single mapping is used and that same mapping is applied to each symbol in the plaintext to generate the ciphertext.

The Caesar Cipher

Perhaps the earliest such cipher to be used was the Caesar Cipher, invented by Julius Caesar and used by the Romans during the Gaelic Wars. It involved nothing more than replacing each letter in the Latin alphabet with the corresponding character from the Greek alphabet. Notice that there was no key involved at all since the entire mapping was part of the general rule describing the cipher. Hence, as soon as the enemy learned the general rule, they could decipher any intercepted messages immediately.

The Caesar Shift Cipher

Another cipher credited to Julius Caesar is the Caesar Shift Cipher, in which each letter in the plaintext was replaced by the letter three places further down the alphabet. You might already recognize this as nothing more than the simple shift mapping described previously with a fixed value of N=3. Like the Caesar Cipher, there was no key (the value of N was fixed, and hence applied to all uses of the cipher)l thus, like the Caesar Cipher, as soon as the enemy learned the general rule, the cipher was broken.

The Generic Shift Cipher

By removing the constraint of a fixed N from the Caesar Shift Cipher and permitting it to take on any permissible value that can change from message to message, we have added a key to it (the value of N) which, in theory, keeps the attacker from deciphering the message even if they know all of the general rules. In practice, however, little additional security is obtained because the key space (the set of possible keys) is so small that an attacker can simply try all of them in turn until a meaning message is recovered. This is known as a brute force attack.

The Affine Cipher

One way to increase the key space while keeping the key easy to express (and therefore memorize and avoid ever having to write it down) is to apply mathematical relationships and use the specific parameters as the key. For instance, we can easily assign a number to every letter of the alphabet, starting with 0 for 'A' and 25 for 'Z'. We can then describe the mapping between plaintext and ciphertext using a simple equation such as:

C = (aP + b) mod 26

Where P and C are the plaintext and ciphertext characters, respectively, and a and b are the slope and intercept, respectively. If a is equal to 1, then we simply have the Generic Shift Cipher. Although not obvious, we can't use just any value we want to for a. Unless we use a value that is relatively prime to 26, we will end up with multiple plaintext symbols mapping to the same ciphertext symbol and, conversely, many ciphertext symbols going unused. Thus, the only viable choices for a are: {1,3,5,7,9,11,15,17,19,21,23,25}. The size of the key space is therefore 12*26=312.

Before we proceed, consider what would happen if we randomly added three additional symbols to our set of plaintext characters and then sprinkled them more-or-less randomly amongst the plaintext. That would give us 29 symbols, which is a prime number. Hence we could use all non-zero values of a that are less than 29, yielding a key space that has a size of 28*29=812. In the days before computers, this probably would have been rather attractive. In fact, consider another alternative: The general rule (which we assume the enemy finds out sooner or later) is that the key consists of three numbers, {a,b,N}, where a and b are as already described and N is the modulus. Now, instead of actually using N different symbols, the 26 values of C that map to the normal members of P are computed and the symbols of the ciphertext language are assigned to them in the increasing order of the values for C. The result is that the ciphertext will always consist of just the normal 26 symbols, which means that it will give no hint to the fact that a modulus other than 26 is being used. We now can use arbitrarily large values for the parameters, although there are some choices that are worse than others. Even if we keep ourselves restricted to quite modest values of N, say values less than 100, the key space explodes to the order of a million keys.

The Random Substitution Cipher

The most powerful mono-alphabetic substitution cipher is one in which the mapping from plaintext symbol to ciphertext symbol is random. Keeping the discussion limited to a one-to-one mapping using uppercase English, there are 26!=4e26 possible keys. To give an idea of how large this number is, consider that the mass of the Earth is about 1.3e25 pounds, which means that 4e26 ounces would be more than twice the mass of the entire planet. The difficulty of attacking such a large key space by brute force can be appreciated by noting that if every person on the planet was employed in the attack, each using a computer capable of examining one billion potential keys each second, that the key space would still require nearly two years to complete.

Cryptanalysis of Mono-Alphabetic Ciphers

Despite the fact that a random substitution cipher can have a truly enormous key space, making any kind of brute for attack impractical, mono-alphabetic ciphers are considered extremely weak and fairly easy to break. In fact, it was breaking such a cipher that cost Mary, Queen of Scots, her head in 1587. The reason is that mono-alphabetic ciphers fail to alter, let along suppress, the symbol distribution of the plaintext message. As we will see, suppressing structure and non-randomness in the ciphertext will be the major goal of the cryptologist and devising ways to exploit what structure and non-randomness remain will be the major goal of the cryptanalyst.

So let's examine the structure of the plaintext message and see if there are hints that will survive the encryption process and remain visible in the ciphertext. If all of the letters in the alphabet were used with equal frequency, then in any sufficiently long piece of plaintext we would expect to see each letter appear about 1/26th (3.85%) of the time. However, as most English speakers are aware, some letters in the alphabet are used much more frequently than others, and this is true of all languages, even the binary codes used by computers. The most common letter in normal, written English is the letter 'E', followed by other common letters such as 'T', and 'A'. In fact, just five letters, {E,T,A,O,I}, account for nearly half of the letters in the average piece of suitably long plaintext while the top ten account for nearly 75%. On the other side of the spectrum, letters such as 'Z', 'Q', and 'X' are seldom seen; in fact, the five least frequently occurring letters, {Z,Q,X,J,K}, account for barely 1% of the letters, and the bottom ten don't even account for 10%. Most other languages have similarly skewed distributions. Thus, the frequency with which the symbols of the language (the characters of the alphabet, in this case) are used are far from random. The frequency distribution for typical, written English is shown below (and note that different distributions from different sources will all vary slightly from one another).

12.70

9.06

8.17

7.51

6.97

6.75

6.33

6.09

5.99

4.25

4.03

2.78

2.76

2.41

2.36

2.23

2.02

1.97

1.93

1.49

0.98

0.77

0.15

0.10

0.07

For a mono-alphabetic cipher, the fact that each symbol in the plaintext always maps to the same symbol in the ciphertext means that all of the 'E's, for instance will all map to, say, 'H'. More importantly, if the plaintext consists of 13% 'E's, then the ciphertext will consist of 13% 'H's. Simply put, if we expect 'E' to be the most common letter in the plaintext, then whichever symbol is the most common one in the ciphertext is most likely the symbol that 'E' maps to. As with anything statistical in nature, the degree to which the frequency of the symbols in a given ciphertext correlates to the statistics of the plaintext language depends on a number of things, the most important two of which are the length of the message and how representative the message is of the dialect of the language whose statistics are being used. The first factor simply reflects the well known fact that the more data is available, the most closely it will tend to match the expected distribution while, conversely, very short messages tend to deviate strongly from the average. The second is due to the fact that different dialects of a language can have significantly different statistics that those that apply to the overall language. Here, dialect means much more than simply the variant of a language spoken in different regions of a country or the world; it also refers to the fact that different communities and different types of writing, such as technical, literary, formal, medical, and legal, tend to have significantly different statistics as well. In fact, an analysis of the frequency distribution in a suitably long piece of plaintext (and, since a mono-alphabetic encryption does not alter the frequencies, in the ciphertext) can be used to make a good guess not only about what the basic plaintext language is, but also what region of the country or world it represents and what the general nature of the content (e.g., technical, formal, legal, etc.) might be.

Now that we have a basic appreciation for what frequency analysis is and what kind of incite it can give regarding a piece of ciphertext, let's see how it can be used to break a mono-alphabetic cipher with a fairly high degree of success. First, let's consider the case of a simple shift cipher; to break such a cipher we only need to correctly determine the mapping between a single plaintext symbol and its ciphertext counterpart since the mapping for all of the remaining symbols have the same mapping relationship. Assuming we are reasonably confident that the plaintext language is English, we simply determine which symbol in the ciphertext occurs most often and guess that this symbol probably represents the letter 'E'. From that we can determine the most likely shift and proceed to decipher the message. If the decipherment is unsuccessful, then it probably just means that, in this particular piece of plaintext, the letter 'E' just didn't happen to be the most frequent. But even if that is the case, it is extremely likely that the letter 'E' will still be one of the most common plaintext characters and therefore we can proceed down the list of most frequency ciphertext symbols and use each as the next most likely guess in turn.

In the case of an Affine Cipher, we can leverage the mathematics in order to aid us in finding the enciphering parameters (i.e., the key). By using frequency analysis as described above, we can probably arrive at reasonable guess for the mappings of the few most frequency occurring characters (and perhaps a few of the least occurring characters as well). Picking any two of them, we can solve for the parameters as follows:

C₁ = (aP₁ + b) mod N

C₂ = (aP₂ + b) mod N

(C₂-C₁) = a(P₂-P₁) (mod N)

a = (C₂-C₁) * (P₂-P₁)^-1 (mod N)

b = (C₁ - aP₁) mod N

Once we have our most likely values for a and b, we then attempt to decipher the message. If that fails, then we simply move on to the next most likely pair. We will likely discover the correct parameters after a relatively few number of tries. However, note that this is assuming that we know the value of N (the modulus in use). If we don't, then we could still use this method and simply try all of the reasonable values of N. Note that, in theory, we could use three suspected mappings and simply treat it as a problem in three equations and three unknowns. However, having an unknown modulus makes this a very non-linear problem. None-the-less, using trial and error to solve such a system of equations might prove a more fruitful route than using trial and error decipherments