Quantity has a quality all its own when it comes to machine translation. The more parallel language data you can load into a machine translation engine, the better the translation will be, statistically speaking. That’s why Google’s statistical translation tools are so good—and bad. But what to do when you don’t have that data? Time to call in the cryptographers to crack some code.
USC scientists Sujith Ravi and Kevin Knight have become linguist spooks to see if they can crack the code for languages like Pashto or Farsi, which don’t have parallel datasets on the scale of more widely used commercial languages such as Spanish or Japanese.
Jacob Aron reports on this in New Scientist. Be cautious with this article, since he doesn’t have a clue about MT, but there’s a useful patch that he must have cut and pasted from a press release on these guys, so I’ll do the same.
“Ravi and his colleague Kevin Knight treat translation as a cryptographic problem, as if the foreign text were simply English written in an advanced cipher. Their software cracks the code by estimating the probability that a foreign word matches an English word based on the number of times it appears in the text—a frequently occurring word is more likely to mean ‘the’ or ‘a’ than ‘antidisestablishmentarianism’.
“To ensure the translation makes sense, the pair use another piece of software to evaluate the quality of English that comes out. This in turn tweaks the probabilities used in the translation software. They tested the system on a collection of short phrases such as ‘last year’ and ‘the fourth quarter’, attempting to translate the Spanish equivalents back into English, along with a number of movie subtitles that existed in both languages.
“The resulting translations—known, confusingly, as ‘monolingual’ translations—rated highly compared with standard computer translation techniques.”
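For the code-minded, here is a toy sketch of the two ideas in that quote: guess word matches from how often words appear, then let a second scorer judge how English-like the output is. To be clear, this is not Ravi and Knight's software. The function names (`relative_freqs`, `match_probs`, `bigram_logprob`), the frequency-ratio weighting, and the add-one bigram scorer are all my own assumptions, picked to keep the example short.

```python
from collections import Counter
import math


def relative_freqs(tokens):
    """Share of the text taken up by each word."""
    counts = Counter(tokens)
    total = sum(counts.values())
    return {word: count / total for word, count in counts.items()}


def match_probs(foreign_tokens, english_tokens):
    """For each foreign word, a probability distribution over English words.

    Assumption for this sketch: two words are a likelier match when their
    relative frequencies are close, so the weight is the ratio of the smaller
    frequency to the larger one, normalized to sum to 1.
    """
    foreign = relative_freqs(foreign_tokens)
    english = relative_freqs(english_tokens)
    table = {}
    for f_word, f_freq in foreign.items():
        weights = {
            e_word: min(f_freq, e_freq) / max(f_freq, e_freq)
            for e_word, e_freq in english.items()
        }
        total = sum(weights.values())
        table[f_word] = {e_word: w / total for e_word, w in weights.items()}
    return table


def bigram_logprob(sentence, english_tokens):
    """Score how English-like a word sequence is, using an add-one-smoothed
    bigram model built from the English text. Higher is better."""
    unigrams = Counter(english_tokens)
    bigrams = Counter(zip(english_tokens, english_tokens[1:]))
    vocab_size = len(unigrams)
    score = 0.0
    for prev, word in zip(sentence, sentence[1:]):
        score += math.log(
            (bigrams[(prev, word)] + 1) / (unigrams[prev] + vocab_size)
        )
    return score


if __name__ == "__main__":
    # Two unrelated-looking texts; no sentence-by-sentence alignment is used.
    spanish = "el gato vio el perro y el pajaro".split()
    english = "the cat saw the dog and the bird".split()

    probs = match_probs(spanish, english)
    print(max(probs["el"], key=probs["el"].get))
    print(bigram_logprob(["the", "cat"], english))
    print(bigram_logprob(["cat", "the"], english))
```

In a real system the second score would feed back into the first, nudging the match probabilities toward output that reads like English. This sketch only computes the two pieces separately.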
Of course, machine translation started with a different kind of “monolingual” translation. Ol’ timey, rules-based translation can still crank out some pretty good translations in languages and domains that statistical translation can’t touch, and it does just fine in many of the more common English language pairs too, dag gummit.
Critics opine that Ravi and Knight’s crypto-cracking technique is only a baby step, since these guys are working on texts about the length of a movie subtitle. Having squeezed subtitles onto movie screens for many years, I find it hard to imagine a worse sample set for testing translation tools, but then I don’t have a PhD in computational linguistics.
That’s why I was equally mystified by Oxford MT expert Phil Blunsom’s comment, “It’s not something you’re going to see popping up in commercial systems any time soon,” since here at 1-800-Translate, we’ve been incorporating similar frequency analysis into our own machine translation and translation memory efforts since the first quarter of this year. And that’s all you’ll get out of me.
Now that I’ve reread what I’ve written (I actually do, you know, sometimes), I’m afraid I’ve posted a piker. I’m counting on comments from those more knowledgeable in the field to see what I have missed. Kirti!