Rick Rashid, Chief Research Officer at Microsoft Research, demonstrated the latest breakthrough in speech recognition and machine translation a few weeks ago by giving a speech in Taipei using real-time computer-generated Chinese audio translation.
Rashid kicked off in English, providing a great summary of the history of machine translation and voice recognition. It was a good overview of the 60-year effort to build computer systems that can understand what a person says when they talk, and to translate what they said.
Way back when, voice recognition started off with simple pattern matching of voice prints. Because each speaker’s voice was so different, it was hard to recognize speech that deviated even slightly from the pattern. Later, scientists built statistical speech models from the recorded voices of many speakers. The technique used to combine these voices is known as hidden Markov modeling, and it was the breakthrough needed to get the ball rolling.
In the last 10 years, better software and faster computers have led to more practical uses. Now it seems as if machines do most of the talking on the phone, but their capabilities are still quite limited, as we all have frustratingly experienced. Even the most robust systems are still reporting error rates of around 25% when handling general speech, according to Rashid. Machines do a lot better when they’ve been trained for an individual voice. A few posts ago I blogged about my own experience writing this blog by dictation. Untrained tools remain error-prone.
Researchers at Microsoft Research and the University of Toronto have applied a new technique, called Deep-Neural-Network Speech Recognition, which is patterned after human brain behavior. The error rate dropped by about 30%. According to Rashid, “This means that rather than having one word in four or five incorrect, now the error rate is one word in seven or eight. While still far from perfect, this is the most dramatic change in accuracy since the introduction of hidden Markov modeling in 1979 and, as we add more data to the training, we believe that we will get even better results.” Note that this improvement comes without the speaker adaptation required to improve earlier systems such as the one I rely on.
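For readers who want to check the arithmetic, here is a rough sketch of how "one word in four or five" and "one word in seven or eight" relate to the figure of about 30%. The function names are mine, purely for illustration, and the inputs are the rounded numbers from the quote, so treat the output as a sanity check rather than a measured result.

```python
def one_in(n: int) -> float:
    """Convert 'one word in n is wrong' into an error rate."""
    return 1.0 / n


def relative_reduction(old_rate: float, new_rate: float) -> float:
    """Fraction by which the error rate fell, relative to the old rate."""
    return (old_rate - new_rate) / old_rate


# Rounded figures from Rashid's quote (assumed to be word error rates).
old_rates = {4: one_in(4), 5: one_in(5)}   # 25% and 20%
new_rates = {7: one_in(7), 8: one_in(8)}   # about 14.3% and 12.5%

# Compare every "before" figure with every "after" figure.
reductions = {
    (old_n, new_n): relative_reduction(old, new)
    for old_n, old in old_rates.items()
    for new_n, new in new_rates.items()
}

for (old_n, new_n), r in sorted(reductions.items()):
    print(f"one in {old_n} -> one in {new_n}: {r:.1%} fewer errors")
```

The most conservative pairing (one in five down to one in seven) works out to a bit under 30% fewer errors, and the most generous (one in four down to one in eight) to 50%, so the rounded figures in the quote are consistent with a drop of roughly 30% or more.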
At 6:41 into the video, Rashid begins to use the tool, which has been modified to match his voice. The tool transcribes his speech, translates it to Chinese, and then reads it out loud in Chinese, in a voice made to match Rashid’s own. The effect is uncanny, and the Chinese-speaking audience greeted each simple line, translated slowly and consecutively, with enthusiastic applause at the apparently successful result. It looked really impressive.
Rashid blogs, “Of course, there are still likely to be errors in both the English text and the translation into Chinese, and the results can sometimes be humorous. Still, the technology has developed to be quite useful.
“Most significantly, we have attained an important goal by enabling an English speaker like me to present in Chinese in his or her own voice, which is what I demonstrated in China.”
I have no way of telling if the translation is any good, so I encourage our Chinese-speaking readers to listen in and report.
But quality might not be that important. This kind of tool doesn’t have to be good, just good enough.