Crypto-Machine Translation

by Translation Guy on July 22, 2011

Quantity has a quality all its own when it comes to machine translation. The more parallel language data you can load into a machine translation engine, the better the translation will be, statistically speaking. That’s why Google’s statistical translation tools are so good—and bad. But what to do when you don’t have that data? Time to call in the cryptographers to crack some code.

USC scientists Sujith Ravi and Kevin Knight have become linguist spooks to see if they can crack the code for languages like Pashto or Farsi that don’t have datasets on the scale of more widely used commercial languages such as Spanish or Japanese.

Jacob Aron reports on this in New Scientist. Take caution with this article, since he doesn’t have a clue about MT, but there’s a useful patch that he must have cut and pasted from a press release on these guys, so I’ll do the same.

“Ravi and his colleague Kevin Knight treat translation as a cryptographic problem, as if the foreign text were simply English written in an advanced cipher. Their software cracks the code by estimating the probability that a foreign word matches an English word based on the number of times it appears in the text—a frequently occurring word is more likely to mean ‘the’ or ‘a’ than ‘antidisestablishmentarianism’.

“To ensure the translation makes sense, the pair use another piece of software to evaluate the quality of English that comes out. This in turn tweaks the probabilities used in the translation software. They tested the system on a collection of short phrases such as ‘last year’ and ‘the fourth quarter’, attempting to translate the Spanish equivalents back into English, along with a number of movie subtitles that existed in both languages.

“The resulting translations—known, confusingly, as ‘monolingual’ translations—rated highly compared with standard computer translation techniques.”
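For the curious, here is a toy sketch of the two ideas in that excerpt: guess word matches from frequency alone, then check whether the output looks like English. This is not Ravi and Knight's software, which the article does not show. The function names, the tiny Spanish and English samples, and the bigram check standing in for the "quality of English" evaluator are all assumptions made up for illustration.

```python
from collections import Counter


def rank_by_frequency(tokens):
    """Distinct words, most frequent first (ties broken alphabetically)."""
    counts = Counter(tokens)
    return sorted(counts, key=lambda w: (-counts[w], w))


def frequency_rank_lexicon(foreign_tokens, english_tokens):
    """Pair the k-th most frequent foreign word with the k-th most
    frequent English word. No parallel text is used at all."""
    return dict(zip(rank_by_frequency(foreign_tokens),
                    rank_by_frequency(english_tokens)))


def translate(sentence, lexicon):
    """Word-for-word substitution; unknown words pass through unchanged."""
    return [lexicon.get(word, word) for word in sentence]


def bigram_score(sentence, english_tokens):
    """Fraction of adjacent word pairs in `sentence` that also occur in the
    English sample. A crude stand-in for a real language model."""
    seen = set(zip(english_tokens, english_tokens[1:]))
    pairs = list(zip(sentence, sentence[1:]))
    if not pairs:
        return 0.0
    return sum(pair in seen for pair in pairs) / len(pairs)


# Hypothetical monolingual samples; note they are never aligned with each other.
spanish_sample = "el gato ve el perro y el gato ve el pez".split()
english_sample = "the cat sees the dog and the cat sees the fish".split()

lexicon = frequency_rank_lexicon(spanish_sample, english_sample)
guess = translate("el gato ve".split(), lexicon)
score = bigram_score(guess, english_sample)
```

On this toy data the most frequent word, "el", pairs with "the", as the excerpt suggests it should. The words that appear only once are tied on frequency and get paired arbitrarily, which is presumably why a second step that scores the English output and feeds back into the probabilities is needed. A real system would adjust those probabilities iteratively rather than fix them in one pass as this sketch does.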

Of course, machine translation started with a different kind of “monolingual” translation. Ol’ timey, rules-based translation can still crank out some pretty good translations in languages and domains that statistical translation can’t touch, and it does just fine in many of the more common English language pairs too, dag gummit.

Critics opine that this crypto-cracking technique of Ravi and Knight is only a baby step, since these guys are working on texts about the length of a movie subtitle. Having squeezed subtitles onto movie screens for many years, it’s hard for me to imagine a worse sample set for testing translation tools, but then I don’t have a PhD in computational linguistics.

That’s why I was equally mystified by Oxford MT expert Phil Blunsom’s comment, “It’s not something you’re going to see popping up in commercial systems any time soon,” since here at 1-800-Translate, we’ve been incorporating similar frequency analysis into our own machine translation and translation memory efforts since the first quarter of this year. And that’s all you’ll get out of me.

Now that I’ve reread what I’ve written (I actually do, you know, sometimes), I’m afraid I’ve posted a piker. I’m counting on comments from those more knowledgeable in the field to see what I have missed. Kirti!


  1. Kirti says:

    Given how difficult it is to get really good systems even when you do have data, I am skeptical of the possibilities of this approach on its own. Scientists can sometimes get stuck inside their theories, but it is useful to push forward as many hypotheses as possible as each may contribute in some way to a larger “final” solution.

    SMT is already a cryptographic approach to some extent so I am not sure what is so new here. However, I would not be surprised that the approach does yield *some* patterns with surprising accuracy.

    To better solve the problem I would expect that a multi-dimensional approach would be better, i.e. data-based linguistics, the monolingual data analysis described here, and probably some good old-fashioned linguistic review and rule-based guidelines. I also think that there are other approaches that will provide better results in the short term, e.g. Laotian may have many patterns in common with Thai, and thus using Thai as a foundation may lead to better clues and better systems. Data is important, and where data is not available linguistics becomes critical. I would expect that combining both will in the long term produce the best results.

    • Ken says:

      Thanks Kirti, we’ve seen similar cross-language benefits between Korean and Japanese MT.
