Statistical Machine Translation Code-Buster

by Translation Guy on November 21, 2011
14 comments

The thing with 18th-century ophthalmological societies in the Holy Roman Empire, or at least all the secret ones, is that they wrote everything in secret code. So secret that generations of cryptographers have been unable to decrypt the writings of these eye-surgery-obsessed secret-code-writers for the last several centuries. Because their code was uncrackable, their writings were mute testimony to an unknown mystery.

The Copiale Cipher, a hand-lettered 105-page manuscript from the late 18th century, was “discovered in an academic archive in the former East Germany after the Cold War. The elaborately bound volume of gold and green brocade paper holds 75,000 characters, a perplexing mix of mysterious symbols and Roman letters. The name comes from one of only two non-coded inscriptions in the document.”

Kevin Knight, a computer scientist at USC, worked with Beata Megyesi and Christiane Schaefer of Uppsala University in Sweden to finally crack the code using statistics-based translation techniques.

They ran into lots of dead ends trying to find their way through a maze of misleading script. At first they assumed that all the meaning was in the Roman script and disregarded the crazy made-up characters. Turns out they were 180 degrees off. The familiar Roman letters were a blind, only place markers to show the spaces between words. All the meaning resided in the crazy characters the coders had compiled.

The first task of the researchers was to digitize the handwritten script, with its mix of familiar Roman letters and made-up symbols unique to the code. Each symbol was then transcribed in ABCs, like so:

pi oh j v hd tri arr eh three c. ah ni arr lam uh b lip uu r o.. zs

This code consisted of 90 different characters, including 26 plain old letters. Once the transcription was done, the researchers totaled the number of occurrences of each character in the text and used those character frequencies to guess a symbol’s meaning in German. Then they looked at the relationships between the characters as they occurred. Patterns were beginning to emerge.
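The counting step is easy to picture in code. Here is a minimal sketch in Python, assuming each whitespace-separated token in the transcription stands for one cipher symbol. The function name `symbol_frequencies` is made up for illustration, and the only data is the single sample line quoted above, so this shows the idea rather than the researchers' actual tooling.

```python
from collections import Counter

# The one transcribed line quoted in the post. Assumption: every
# whitespace-separated token is a single cipher symbol.
SAMPLE = "pi oh j v hd tri arr eh three c. ah ni arr lam uh b lip uu r o.. zs"


def symbol_frequencies(transcription: str) -> list[tuple[str, int]]:
    """Return (symbol, count) pairs, most frequent symbol first."""
    tokens = transcription.split()
    return Counter(tokens).most_common()


freqs = symbol_frequencies(SAMPLE)

# Print the three most common symbols in the sample line.
for symbol, count in freqs[:3]:
    print(symbol, count)
```

On one line the counts say very little. Run over all 75,000 characters, the same tally gives a frequency profile that can be compared against the known letter frequencies of German.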

The break came from looking at how letters follow one another in German. In German, C is almost always followed by H, similar to the QU pattern in English, but with much greater frequency. Finding a pair of code characters that behaved the same way gave the researchers C and H. Then on to CHT, and then the code characters began to fall like German dominos.
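The "C is almost always followed by H" observation is a statement about which symbol tends to come next. Below is a rough Python sketch of that kind of count. The helper names `follower_counts` and `top_follower` and the toy German phrase are my own assumptions for illustration, not anything taken from the paper.

```python
from collections import Counter, defaultdict


def follower_counts(words):
    """For each symbol, count which symbols come right after it within a word."""
    followers = defaultdict(Counter)
    for word in words:
        for current, nxt in zip(word, word[1:]):
            followers[current][nxt] += 1
    return followers


def top_follower(followers, symbol):
    """Return (most common next symbol, its share), or None if never followed."""
    counts = followers.get(symbol, Counter())
    total = sum(counts.values())
    if total == 0:
        return None
    nxt, n = counts.most_common(1)[0]
    return nxt, n / total


# Toy German phrase, made up for this example (hypothetical data).
words = "ich mache doch nicht nach acht".split()
followers = follower_counts(words)

print(top_follower(followers, "c"))
```

In this toy phrase every C is followed by H. Pass in lists of transcribed symbol tokens instead of plain strings and the same count flags cipher symbols that behave like C, which is the sort of foothold described above.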

Check out the paper for the details. Very similar to some of the automation work we do in translation, part of the same discipline, really. Experts think Knight and company have a good method that can be applied to other secret codes, and Knight is good to go.

“There are these books and ancient languages of real historical value that contain historical information that we just can’t get out yet, and that’s of interest to a lot of people,” Knight says.

Knight is now interested in the Voynich manuscript but has been stymied so far. Did a post on the Voynich a few months ago here.

Here is the USC PR video interview with Knight. Nice piece.

14 Comments

  1. garrymoore says:

    hi to all this is my first post and thought i would say hi –
    regards speak again soon
    garry

  2. Digitizing the text and letting computers help must have saved so much time. Maybe others before couldn’t figure it out because they didn’t have the aid of modern translation programs and techniques. If I did it all by hand and mind I probably would have lost interest and given up too.

  3. Kim Sellars says:

    I sometimes wonder what it’s all for. Is it the actual information that is discovered or is it the thrill of the puzzle that gets people to do this kind of translation from an unknown language? I can’t imagine that anything learned that is centuries old could actually have any impact on society today.

  4. Ian Cain says:

    Amazing how ancient things can be deciphered these days. Modern translation techniques are pretty cool.

  5. Shag Dawg says:

    There were a lot of pages in this document. I think once a part of the pattern was deciphered then it didn’t take the computers long to figure out the rest. Still, a remarkable job nonetheless.

  6. These are the kinds of things that get you a job with the CIA or the secret service. Being able to crack the code, so to speak, is a pretty amazing talent when it works.

  7. leaffan1967 says:

    I think I would have looked for patterns as well, but my brain would have stuck to the familiar Roman script. I’m not one for looking outside the box.

  8. I found the most info at wikipedia about the copiale cipher. Interesting note about the secret society initiation ceremony. – http://en.wikipedia.org/wiki/Copiale_cipher

  9. I looked into this a little more and found this link from the LA Times that gave a tiny bit more info, but not much. http://articles.latimes.com/2011/oct/26/local/la-me-usc-code-breaker-20111026

  10. Leo Lassiter says:

    I think it’s interesting that in this day and age, where people occupy most of the planet, that things like this are still discovered (found). I always think of the DaVinci Code when I hear of things like this.

  11. My question is, why would a group of people who do surgery on the eye need to hide anything? Unless that was taboo.

  12. I look at those old texts and they are so beautiful. Of course, they’re just gibberish to me, but they look nice.

  13. Millie says:

    Quite the feat. I believe others tried before him to translate this document (more like 100 pages) and were unsuccessful.

  14. May Flowers says:

    An interesting way to look at translation by Warren Weaver. Thinking all things are written in English, only in code.
