Real-Time Chinese Machine Interpretation

by Translation Guy on November 16, 2012

Rick Rashid, Chief Research Officer at Microsoft Research, demonstrated the latest breakthrough in speech recognition and machine translation a few weeks ago by giving a speech in Tianjin, China, using real-time computer-generated Chinese audio translation.

Rashid kicked off in English, providing a great summary of the history of machine translation and voice recognition. It was a good overview of the 60-year effort to build computer systems that can understand what a person says when they talk, and to translate what they said.

Way back when, voice recognition started off with simple pattern matching of voice prints. Because each speaker’s voice was so different, it was hard to recognize speech that deviated even slightly from the pattern. Later, scientists built statistical speech models from the recorded voices of many speakers. The technique used to integrate these voices is known as hidden Markov modeling, and it was the breakthrough needed to get the ball rolling.

In the last 10 years, better software and faster computers have led to more practical uses. Now it seems as if machines do most of the talking on the phone, but their capabilities are still quite limited, as we all have frustratingly experienced. Even the most robust systems are still reporting error rates of around 25% when handling general speech, according to Rashid. Machines do a lot better when they’ve been trained for an individual voice. A few posts ago I blogged about my own experience writing this blog by dictation. Untrained tools remain error-prone.

Researchers at Microsoft Research and the University of Toronto have applied a new technique, called Deep-Neural-Network Speech Recognition, which is patterned after human brain behavior. Word error rates fell by about 30% compared with previous methods. According to Rashid, “This means that rather than having one word in four or five incorrect, now the error rate is one word in seven or eight. While still far from perfect, this is the most dramatic change in accuracy since the introduction of hidden Markov modeling in 1979 and, as we add more data to the training, we believe that we will get even better results.” Note that this improvement comes without the speaker adaptation required to improve earlier systems such as the one I rely on.
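For readers who want to check the numbers, here is a minimal sketch of the arithmetic. It assumes the roughly 30% improvement is a relative reduction in word error rate, which is my reading of Rashid's quote rather than something he spells out. The function names are mine, made up for illustration.

```python
def apply_relative_reduction(error_rate, relative_reduction):
    """Return the error rate left after a relative reduction.

    Assumption: "30% better" means the error rate shrinks by 30% of
    its own value, not by 30 percentage points.
    """
    return error_rate * (1.0 - relative_reduction)


def one_in_n(error_rate):
    """Express an error rate as 'one word in N' and return N."""
    return 1.0 / error_rate


# Rashid's starting point: one word in four or five incorrect.
for baseline_n in (4, 5):
    baseline = 1.0 / baseline_n
    improved = apply_relative_reduction(baseline, 0.30)
    print(
        f"one in {baseline_n} ({baseline:.1%}) becomes "
        f"one in {one_in_n(improved):.1f} ({improved:.1%})"
    )
```

Starting from one word in five, a 30% relative reduction works out to about one word in seven. Starting from one in four, it comes to a little under one in six, so the quoted "one in seven or eight" implies a reduction somewhat larger than 30%.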

At 6:41 into the video, Rashid begins to use the tool, which has been modified to match his voice. The tool transcribes his speech, translates it into Chinese, and then reads the translation aloud in a synthesized voice modeled on Rashid’s own. The effect is uncanny, and the Chinese-speaking audience greeted the apparently successful translation of each simple line, delivered slowly and consecutively, with enthusiastic applause. It looked really impressive.

Rashid blogs, “Of course, there are still likely to be errors in both the English text and the translation into Chinese, and the results can sometimes be humorous. Still, the technology has developed to be quite useful.

“Most significantly, we have attained an important goal by enabling an English speaker like me to present in Chinese in his or her own voice, which is what I demonstrated in China.”

I have no way of telling if the translation is any good, so I encourage our Chinese-speaking readers to listen in and report.

But quality might not be that important. This kind of tool doesn’t have to be good, just good enough.


  1. Just good enough for communication purposes. I would even say: great! But in the case of legal arrangements, no machine will replace a translator/interpreter.

  2. Ivana Aghov says:

    I really wish my Chinese was better so I could judge the quality of the technology.

  3. Deb says:

    I think from the video that, for the first evolution of this technology, that is incredibly impressive. I can’t help but wonder where we will be in five years.

  4. Toman Petrov says:

    Has Microsoft done this for more languages than just Chinese? Or is it more narrowly focused? If so, they may achieve a high level of quality over time by limiting themselves to one language.

  5. Ruza Bironov says:

    I certainly think that it’s a wonderful innovation on the way to an important communication milestone, but your job and my job seem pretty safe for the immediate future.

  6. Rocky says:

    The question I have is how long those few moments of Chinese took to produce, as it is said in the video that they spent some time calibrating it to work with his voice. How practical is this in the real world, beyond just a carefully constructed presentation?

    • Ken says:

      A stage trick.

  7. David Romero says:

    1 in 7, that’s what? Roughly 15%? That is still an incredibly impractical threshold for error. Cute toy, but not a game changer.

  8. See, I was going to invest in some voice recognition software, but a 25% error rate for even the very best just makes me think it’s a waste of money.

  9. Jean says:

    Chinese isn’t a particularly strong suit of mine, but I’m proficient, and as far as I can tell, the translation is pretty good for a machine, at least better than I’ve ever seen.
