Main Page | See live article | Alphabetical index

Frequency analysis

In the field of cryptanalysis, frequency analysis is a methodology for "breaking" simple substitution ciphers, like the Caesar cipher. These cyphers replace one letter of the plaintext with another to produce the cyphertext, and any particular letter in the plaintext will always, in the simplest and most easily breakable of these cyphers, turn into the same letter in the cypher. For instance, all E's will turn into X's.

Frequency analysis is based on the fact that certain letters, and combinations of letters, appear with characteristic frequency in essentially all texts in a particular language. For instance, in the English language E is very common, while X is not. Likewise, ST, NG, TH, and QU are common combinations, while XT, NZ, and QJ are exceedingly uncommon, or "impossible". Given our example of all E's turning into X's, a cyphertext message containing lots of X's already seems to suggest one pair in the substitution mapping.

In practice the use of frequency analysis consists of first counting the frequency of cyphertextletters and then assigning "guessed" plaintext letters to them. Many letters will occur with roughly the same frequency, so a cypher with lots of X's may indeed map X onto E, but could also map X onto A or S which are also very common in English. Thus the cryptanlyst must generally try several combinations of mappings in order to find the correct mapping for these common letters. Once the common letters are 'solved', the technique typically moves on to pairs and other patterns. These often have the advantage of linking less commonly used letters in many cases, filling in the gaps in the candidate mapping table being built. Frequency analysis is extremely effective against the simpler substitution cyphers and will break astonishingly short cyphertexts with ease. This fact which was the basis of Edgar Allan Poe's claim, in his famous newspaper demonstrations in the middle 1800's, that no cypher devised by man could defeat him. Poe was overconfident in his proclamation, however, for polyalphabetic substitution cyphers (invented by Alberti around 1467) do defeat simple frequency analysis attacks, and the electro-mechanical cypher machines of the first half of the 20th century (eg, the Hebern machine, the Enigma, the Japanese Purple machine, the SIGABA, the Typex, ...) were, if properly used, essentially immune to straightforward frequency analysis attack. They were broken using other attacks.

Frequency analysis was first discovered in the Arab world, and is known to have been in use by about 1000 CE. It is thought that close study of the Koran first brought to light that Arabic has a characteristic letter frequency. Its use spread, and was so widely used by European states by the Renaissance that several schemes were invented by cryptographers to defeat it. These included use of several alternatives to the most common letters in otherwise monoalphabetic substitution cyphers (ie, for English, both X and Y might mean E), use of several alphabets -- chosen in assorted, more or less, devious ways (Leon Alberti seems to have been the first to propose this), culminating in such schemes as using only pairs or triplets of plaintext letters as the 'mapping index' to cyphertext letters (eg, the Playfair cipher invented by Charles Wheatstone in the mid 1800s). The disadvantage of all these attempts to defeat frequency counting attacks is that it increases complication of both encyphering and decyphering, leading to mistakes. Famously, a British Foreign Secretary is said to have rejected the Playfair cipher because, even if school boys could learn it as Wheatstone and Playfair had shown, 'our attaches could never learn it!'.

Frequency analysis requires a basic understanding of the language of the plaintext, as well as tenacity and some problem solving skills. During WWII, both the British and Americans recruited codebreakers by placing crossword puzzles in major newspapers and running contests for who could solve them the fastest. Several of the cyphers used by the Axis were breakable using frequency analysis (eg, the 'consular' cyphers used by the Japanese). Mechanical methods of letter statistical analysis (generally IBM card machinery) were first used in WWII. Today, the hard work of letter counting and analysis has been replaced by the tireless speed of the computer, which can carry out this analysis in seconds.

However, modern ciphers are much more complex that the ciphers of WWII, and are immune to simple frequency analysis, and even to advanced statistical methods. The best of them must be attacked using fundamental mathematical methods not based on the peculiarities of the underlying plaintext language. See differential cryptanalysis or linear cryptanalysis as examples of such techniques.

Further reading

Both Poe's The Gold Bug, and Arthur Conan Doyle's Sherlock Holmes tale The Dancing Men, are examples of literary use of frequency analysis to attack simple substitution cyphers. The cypher in the Poe story is encrusted with several deceptive measures, but this is more a literary device than anything significantly cryptographic.

Many introductory books on cryptanalysis, especially those for children, give a reasonable account of frequency analysis, though without much history of the technique and too often with Poe'sque statements of universal applicability. See David Kahn's The Codebreakers for a readable thorough history, or Simon Singh's The Code Book for a briefer, more anecdotal account. Both are reliable (a concern in books about cryptography) and cover more than frequency analysis.