Statistical Machine Translation Code-Buster
November 21, 2011 - In: Machine Translation - 14 comments

The thing with 18th-century ophthalmological societies in the Holy Roman Empire, or at least all the secret ones, is that they wrote everything in secret code. So secret that generations of cryptographers have been unable to decrypt the writings of these eye-surgery-obsessed secret-code-writers for the last several centuries. Because their code was uncrackable, their writings were mute testimony to an unknown mystery.

The Copiale Cipher, a hand-lettered 105-page manuscript from the late 18th century, was “discovered in an academic archive in the former East Germany after the Cold War. The elaborately bound volume of gold and green brocade paper holds 75,000 characters, a perplexing mix of mysterious symbols and Roman letters. The name comes from one of only two non-coded inscriptions in the document.”

Kevin Knight, a computer scientist at USC, worked with Beata Megyesi and Christiane Schaefer of Uppsala University in Sweden to finally crack the code using statistics-based translation techniques.

They ran into lots of dead ends trying to find their way through a maze of misleading script. At first they assumed that all the meaning was in the Roman script and disregarded the crazy made-up characters. Turns out they were 180 degrees off. The familiar Roman letters were a blind, only place markers to show the spaces between words. All the meaning resided in the crazy characters the coders had compiled.

The first task of the researchers was to digitize the handwritten script, with its mix of familiar Roman letters and made-up symbols unique to the code. Each symbol was then transcribed as a short name in ABCs, like so:

pi oh j v hd tri arr eh three c. ah ni arr lam uh b lip uu r o.. zs

This code consisted of 90 different characters, including 26 plain old letters. Once the count was tabulated, the researchers looked at character frequencies to guess a symbol’s meaning in German. They totaled the number of occurrences of each character in the text. Then they looked at the relationships between the characters as they occurred. Patterns began to emerge.
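To make the counting step concrete, here is a minimal sketch in Python. It assumes the transcription is a string of space-separated symbol names, like the sample line above; the function name `symbol_frequencies` is made up for illustration and is not taken from the researchers' paper.

```python
from collections import Counter

# The sample transcription line quoted in this post.
# Assumption: each space-separated token names exactly one cipher symbol.
sample = "pi oh j v hd tri arr eh three c. ah ni arr lam uh b lip uu r o.. zs"

def symbol_frequencies(transcription):
    """Return (symbol, count, share) tuples, most frequent symbol first."""
    tokens = transcription.split()
    counts = Counter(tokens)
    total = len(tokens)
    return [(sym, n, n / total) for sym, n in counts.most_common()]

freqs = symbol_frequencies(sample)

# Show the three most frequent symbols with their counts and shares.
for sym, n, share in freqs[:3]:
    print(f"{sym:>6} {n:3d} {share:.3f}")
```

One line is far too little text to guess anything from. The full manuscript has roughly 75,000 characters, which is what makes the frequency table meaningful.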

The next step relied on the relationships between letters in German. In German, C is almost always followed by H, similar to the QU pattern in English but with much greater frequency. Once the researchers spotted cipher symbols that behaved the same way, they had CH. Then on to CHT, and then the code characters began to fall like German dominoes.
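The "C is almost always followed by H" idea can be sketched as a successor count: for each symbol, find what most often comes next and how dominant that successor is. The Python below is an illustration only, not the researchers' actual procedure. The function name `dominant_successors`, the `min_count` threshold, and the German sample phrase are all made up for this example.

```python
from collections import Counter, defaultdict

def dominant_successors(symbols, min_count=3):
    """Map each symbol to (most common next symbol, share of the time it follows).

    Symbols seen as a predecessor fewer than min_count times are skipped,
    since a share computed from one or two occurrences means little.
    """
    following = defaultdict(Counter)
    for current, nxt in zip(symbols, symbols[1:]):
        following[current][nxt] += 1

    result = {}
    for current, nexts in following.items():
        total = sum(nexts.values())
        if total >= min_count:
            nxt, n = nexts.most_common(1)[0]
            result[current] = (nxt, n / total)
    return result

# Made-up German phrase, used only to show the C-then-H pattern in plaintext.
text = "ich mache noch nichts nach acht"
letters = [ch for ch in text if ch.isalpha()]

stats = dominant_successors(letters)
print(stats["c"])
```

In this toy phrase every C is followed by H, so the share for `c` comes out as 1.0. Running the same function over a list of cipher symbol names would flag candidate pairs that behave like CH, which could then be checked by hand.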

Check out the paper for the details. Very similar to some of the automation work we do in translation, part of the same discipline, really. Experts think Knight and company have a good method that can be applied to other secret codes, and Knight is good to go.

“There are these books and ancient languages of real historical value that contain historical information that we just can’t get out yet, and that’s of interest to a lot of people,” Knight says.

Knight is now interested in the Voynich manuscript but has been stymied so far. I did a post on the Voynich a few months ago here.

Here is the USC PR video interview with Knight. Nice piece.
