The word-witch-doctors over at Bab.la have cooked up a big machine translation showdown to test which is the best — Prompt, Systran, Google or Bing.
The challenge: 500 sentences across 10 language pairs, English to and from French, German, Spanish, Italian and Portuguese. Not back-translated: different sentences were translated in each direction.
Results were scored from 0 to 3: 0 for incomprehensible, 1 for educated guess, 2 for good gist but bad grammar, and 3 for good ‘nuff. (I renamed these categories, as bab.la’s were pretty clunky, and I’m thinking of using this renamed system myself.)
Each translation batch had 5 sentences from each of 10 domains or subject areas: advertising, business, finance, food, law, literature, medicine, religion, slang and, of course, Tweets.
And the winner is… Google! Google wins, Bing places, Systran shows, and Prompt brings up the rear. Well, like they say, “the guy with the biggest servers wins.”
Look at the bar graph. Those bars make it look all sliced and diced, scientific-like. But we aren’t there yet. First off, why is the Spanish machine translation (MT) half as good as the other Romance languages? That seems like a pretty big deviation among four linguistic variants of what is basically Latin. Does a small sample size mean that a single reviewer attached at the hip to the Real Academia Española dictionary can skew the odds that much? What it does confirm is that the categories used by bab.la are subjective. Nothing wrong with that, but the criteria testers used in the bab.la study are probably quite different from the criteria actually applied by users of MT surfing online.
Next question: why FIGS? Things start to get interesting in MT once you leave Western Europe behind. East Asian machine translation quality is a critical problem, and results can be opaque, since bilingualism between English and those languages is much lower than among European languages, so translation problems are harder for users to detect. The common Western European languages are transparent to far more users, so bab.la’s evaluation of other language pairs would be more enlightening. Maybe. But I’m not sure that bab.la’s testing is really all that relevant to how machine translation is actually used on the Web.
That’s because I’m not sure either how MT is actually used on the Web. Even after selling machine translation, and giving it away, for over a decade, I still have not figured it out, which I guess says more about me than about the quality of bab.la’s testing.
Great to see bab.la’s work on this, because right now the question is front-burner for me. It just so happens that we are bringing a free translation service back to 1-800-Translate.com.
Years ago, we offered a free translation feature on the website, and we still have a lot of incoming traffic from people looking for that old page via legacy links that are still out there. So, to respond to that demand for machine translation, we’ve looked at a lot of different systems and also worked up a few versions of our own for user testing. It’s been very interesting, because we’ve found some clues that suggest everything you know about machine translation is wrong.
Whoops. Sorry, I misspoke. I mean everything we (at 1-800-Translate) know about MT is wrong. For one thing, we think monoglots are very unlikely to use online MT. Successful translation is like crack: if it’s good, once you start, you won’t stop. That’s because translation via MT, like any communication, is a tennis volley. If the MT tool keeps dropping the ball on your message, there will be no answer back, and so no volley. Only successful users are repeat users.
We think the most active users are bilinguals using MT to speed up or improve their bilingual communication efforts, because they have the linguistic resources available to handle the high rate of error. Also, people are using these tools for the oddest reasons and in the oddest ways, but that’s another post.
The most interesting thing about machine translation is the nature of translation error. Even the best MT is often wrong, which means that some sentences come out better in one tool than in another, depending not on the tool but on the sentence under translation. So results from the engines vary widely, and even an engine that usually produces superior results will fail to do so a good part of the time. To us it looks as if the problems with machine translation accuracy are not deficiencies in the actual software, but are caused by a user interface that stops users from getting to the best translation.
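That per-sentence variability suggests a simple strategy: query several engines and pick the best output for each individual sentence, instead of committing to one engine for the whole text. Here's a minimal sketch of the idea; the engine functions and the scoring function are stand-ins I made up for illustration, not real MT APIs.

```python
# Sketch: per-sentence selection across MT engines.
# Both "engines" and the scorer are hypothetical stand-ins.

def engine_a(sentence):
    # Stand-in engine: imagine one that handles short sentences well.
    return f"[A] {sentence}"

def engine_b(sentence):
    # Stand-in engine: imagine one that handles long sentences well.
    return f"[B] {sentence}"

def score(sentence, translation):
    # Toy stand-in for a quality judgment on bab.la's 0-3 scale:
    # here, engine A "wins" on short inputs and engine B on long ones.
    if translation.startswith("[A]"):
        return 3 if len(sentence) < 20 else 1
    return 3 if len(sentence) >= 20 else 1

def best_translation(sentence, engines):
    """Return the highest-scoring candidate for this one sentence."""
    candidates = [e(sentence) for e in engines]
    return max(candidates, key=lambda t: score(sentence, t))

sentences = ["Hello there.", "The quarterly report exceeded expectations."]
for s in sentences:
    print(best_translation(s, [engine_a, engine_b]))
```

In a real tool the scorer is the hard part, of course; the point of the sketch is only that the selection happens sentence by sentence, which is exactly what a one-engine-at-a-time interface prevents.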
So, in order to learn more about these issues, we are going to go live soon with a new iteration of our free translation page, this time called the Free Translation Challenge, which will allow users to share their MT experience with other users. It’s an attempt to look at machine translation quality from an ISO 9001 perspective, which requires that quality be defined by the customer, and no one else. No panels of experts need apply. Just us chickens, or should I say, just you chickens. Cluck, cluck.
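One concrete way to read "quality defined by the customer" is a running tally of which engine's output users vote best, sentence by sentence. A hypothetical sketch with made-up vote data, just to show the shape of it:

```python
from collections import Counter

# Hypothetical user votes: (engine the user preferred, sentence voted on).
# The data is invented for illustration.
votes = [
    ("Google", "sentence 1"),
    ("Bing", "sentence 1"),
    ("Google", "sentence 2"),
    ("Google", "sentence 2"),
]

# Tally preferences per engine: the customers, not an expert panel,
# define which engine counts as "good".
tally = Counter(engine for engine, _ in votes)
print(tally.most_common(1))  # engine with the most user votes
```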