The Holy Grail: automatic speech recognition for low-resource languages
Most people use automatic speech recognition on electronic devices without giving a thought to the complex programming behind the convenience. Computer programs like Siri enable people to quickly and easily get directions, search the web, or find out the name of the song that’s playing on the radio.
For those who speak English, or another language that is prevalent in First World nations, Siri and other voice recognition programs do a good job of providing the information requested. However, for people who speak a “low-resource” language, a category that covers more than 99 percent of the world’s languages, automatic speech recognition (ASR) programs aren’t much help. Preethi Jyothi, a Beckman Postdoctoral Fellow, is working to create technology that can help with the development of ASR software for any language spoken anywhere in the world.
“One problem with automatic speech recognition today is that it is available for only a small subset of languages in the world,” said Jyothi. “Something that we’ve been really interested in is how we can port these technologies to all languages. That would be the Holy Grail.”
Low-resource languages are languages or dialects that lack the resources needed to build the technologies that enable ASR, explained Jyothi. Most of the world’s languages, including Malayalam, Jyothi’s native south Indian language, do not have good ASR software today. Part of the reason is that developers do not have access to large amounts of transcribed speech, a key ingredient for building ASR software.
Jyothi and Mark Allan Hasegawa-Johnson, a full-time faculty member in Beckman’s Artificial Intelligence Group and professor of electrical and computer engineering, have come up with a novel way to transcribe low-resource languages. Using crowdsourcing, they hired people, primarily native speakers of English or Mandarin Chinese, who don’t speak the low-resource language in question. These workers listen carefully to the speech and then write down the English or Chinese nonsense syllables that correspond most closely to what they think the speaker is saying.
“When Mark first suggested this idea, it sounded very interesting but extremely challenging,” recalled Jyothi. “At that point, we didn’t know what kind of data we would get from crowd workers.
“So, I designed a pilot experiment to collect data. Then it occurred to me that if we had several people transcribe the same sound clip, even though they are all highly error-ridden, the errors can be made to cancel out with each other, and you will be left with systematic biases. These systematic biases can be removed because they can be learnt from the data using standard machine learning techniques.”
Systematic biases, said Jyothi, refer to the differences in phoneme inventory between languages. For example, most Indian languages distinguish between a breathy “b” and a non-breathy “b”; English-speaking listeners write down both sounds using the same symbol, “b.”
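The two steps Jyothi describes, cancelling random errors across several transcribers and then learning the systematic bias that remains, can be illustrated with a toy sketch in Python. The article does not say which machine learning techniques the team used, so the majority vote and the count-based estimate below are assumptions chosen for simplicity, and all of the data is hypothetical.

```python
from collections import Counter

def majority_vote(transcripts):
    """Pick the most common symbol at each position across equal-length transcripts."""
    return [Counter(column).most_common(1)[0][0] for column in zip(*transcripts)]

def estimate_channel(pairs):
    """Estimate P(true phoneme | written letter) from (true, written) pairs."""
    counts = {}
    for true, written in pairs:
        counts.setdefault(written, Counter())[true] += 1
    return {
        written: {true: n / sum(c.values()) for true, n in c.items()}
        for written, c in counts.items()
    }

# Hypothetical data: three crowd workers label the same four-segment clip.
# The spoken segments are "bh a b a", but the workers' language has no "bh".
crowd = [
    ["b", "a", "b", "a"],
    ["b", "a", "p", "a"],  # random error on the third segment
    ["b", "e", "b", "a"],  # random error on the second segment
]
consensus = majority_vote(crowd)
print(consensus)  # random errors are outvoted; "bh" is still written as "b"

# Hypothetical (true, written) pairs from a small labeled sample.
pairs = [("bh", "b"), ("b", "b"), ("b", "b"), ("bh", "b"), ("a", "a"), ("a", "a")]
channel = estimate_channel(pairs)
print(channel["b"])  # the letter "b" is ambiguous between "b" and "bh"
```

In this toy case the vote removes the one-off mistakes, while the estimated table records that a written “b” could stand for either the breathy or the non-breathy sound.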
When Jyothi presented their first results at the Association for the Advancement of Artificial Intelligence conference in January 2015, she included a video clip from YouTube in her talk, in which a Russian song was subtitled using English words that sounded similar. Hasegawa-Johnson used a similar video in his presentation at the conference in which Tamil lyrics had been transcribed using English words. “If you don’t know the language, it sounds almost exactly like what’s written,” said Jyothi.
The videos were not only funny, but they also provided some clues for the researchers. “We don’t want to read too much into this one example, but it really represents the patterns we see in the task overall,” said Hasegawa-Johnson. “For example, the title of one spoof video is ‘Fine, Benny Lava,’ which is how the transcriber mis-heard the Tamil words ‘kaayndha nilaavo.’ Why did the transcriber hear ‘Benny Lava’ at the end of the song instead of ‘Danny Lava’? I think it’s because the ‘dh’ sound in Tamil is so different from any English consonant, so the transcriber couldn’t really figure out what it was. Instead of hearing it as ‘d,’ which would make sense, he just tried to make the best guess that he could. In our experiments, we find that segments that don’t exist in English suffer a whole lot more variability than segments that do exist in English, and we have to model that using a probability distribution.”
Their project was selected for the 2015 Jelinek Summer Workshop on Speech and Language Technology held at the University of Washington. The team’s goal was to build on the initial successes Jyothi and Hasegawa-Johnson had with recovering information from non-native speakers and develop techniques for using such information in building an ASR system.
“The transcripts we obtain from this process are probabilistic in nature, and cannot be used in the same way the deterministic transcripts from a professional transcriber can be used to train an ASR system,” explained Jyothi. “We still needed to figure out the best way in which they can be used. That was the focus of the summer workshop. Before the results from the workshop, it was not entirely clear if we would be able to get significant gains compared to alternatives that do not use the information from the non-native transcribers.”
The team developed the paradigm of “probabilistic transcription” as a framework within which to place their new crowdsourcing methodology. According to the Jelinek workshop team, a “deterministic transcript” is a sequence of words or phonemes representing the content of a speech signal—exactly the meaning with which the word “transcript” is used in courtroom reporting, or in television broadcasting. A “probabilistic transcript,” on the other hand, is a probability distribution over possible phoneme sequences.
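The difference between the two kinds of transcript can be made concrete with a small Python sketch. It assumes, purely for illustration, that each segment has its own independent distribution over phonemes; the article does not describe how the workshop team actually represented or computed probabilistic transcripts, and the numbers are hypothetical.

```python
from itertools import product

def sequence_distribution(segment_dists):
    """Expand per-segment distributions into a distribution over whole
    phoneme sequences, assuming the segments are independent."""
    dist = {}
    for combo in product(*(d.items() for d in segment_dists)):
        sequence = tuple(symbol for symbol, _ in combo)
        probability = 1.0
        for _, p in combo:
            probability *= p
        dist[sequence] = probability
    return dist

# Hypothetical two-segment clip: the first sound is probably "k" but may be
# the aspirated "kh"; the second sound is certainly "a".
segments = [{"k": 0.75, "kh": 0.25}, {"a": 1.0}]

# Probabilistic transcript: every candidate sequence keeps its probability.
probabilistic = sequence_distribution(segments)
print(probabilistic)

# Deterministic transcript: a single sequence, here the most probable one.
deterministic = max(probabilistic, key=probabilistic.get)
print(deterministic)
```

Collapsing to the single best sequence discards the 25 percent chance that the first sound was aspirated, which is the information a probabilistic transcript is meant to preserve.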
“Experts in machine learning have always maintained this convenient fiction that human labelers provide a ‘gold standard,’ a label that is guaranteed to be true,” said Hasegawa-Johnson. “We have always known that human labelers make mistakes, but we never really had a systematic mathematical framework with which to characterize the mistakes made by human labelers. Probabilistic transcription provides that framework.”
The project focused on seven languages: Arabic, Cantonese, Dutch, Hungarian, Mandarin, Swahili, and Urdu. “We were able to get significant improvements on a speech recognition system for the languages,” said Jyothi. “The system started with data that was not for that particular language and then we adapted those systems using these transcriptions.”
The performance improvements obtained were particularly marked for Swahili. “The pronunciation of Swahili words can be figured out by looking at how it’s written, so the correspondence between the letter and the sound is much more systematic than for a language such as English,” said Jyothi.
The current project used only native English speakers, but the researchers hope to expand their transcriber base to include native speakers of other languages whose characteristics better match those of the language being recognized.
“If you want to recognize Cantonese or a dialect of Cantonese, which is only spoken in some remote regions in China, we could expect native speakers of Mandarin to provide more useful information than native speakers of English,” said Jyothi. “So, can we choose languages which are closer somehow to the language which we are trying to recognize? Then a research question is what does ‘closer’ mean?” Jyothi pointed out that the relevant characteristic is the nature of the sounds in the languages.
“For instance, in Hindi, you have ‘cuh’ and you also have the sound ‘khuh’ (aspirated ‘k’). A native English speaker may not differentiate between these sounds because such a distinction is not common in English. But there are many other languages in which changing an aspirated sound to its unaspirated version changes the word. So, the language background of the transcribers would have a significant effect on how well different sounds are detected.”
In ongoing work supported by a National Science Foundation EAGER grant, Hasegawa-Johnson, Jyothi, and Lav Varshney, an affiliate in Beckman’s Image Formation and Processing Group and assistant professor of electrical and computer engineering, are investigating how to carefully select a set of transcribers with different native languages, in order to ensure adequate coverage of the sounds in the target language.
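One simple way to think about this selection problem is as a coverage problem: pick transcriber languages until every sound in the target language is distinguished by at least one of them. The Python sketch below uses a greedy heuristic for that purpose. It is not the researchers’ method, which the article does not describe, and the language names and sound inventories are hypothetical placeholders rather than real linguistic data.

```python
def greedy_select(target_sounds, inventories):
    """Greedily choose transcriber languages that cover the target sounds.

    Returns the chosen languages in order and any sounds left uncovered.
    """
    uncovered = set(target_sounds)
    chosen = []
    while uncovered:
        # Sorting the names makes tie-breaking deterministic.
        best = max(sorted(inventories), key=lambda lang: len(inventories[lang] & uncovered))
        gain = inventories[best] & uncovered
        if not gain:
            break  # no remaining language distinguishes any uncovered sound
        chosen.append(best)
        uncovered -= gain
    return chosen, uncovered

# Hypothetical target inventory and transcriber-language inventories.
target = {"k", "kh", "b", "bh", "t"}
inventories = {
    "language_A": {"k", "b"},
    "language_B": {"k", "kh", "t"},
    "language_C": {"b", "bh"},
}

chosen, uncovered = greedy_select(target, inventories)
print(chosen, uncovered)
```

With these placeholder inventories, two transcriber languages are enough to cover all five target sounds, and the third adds nothing new.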
ASR for low-resource languages has many applications, said Jyothi. When a disaster occurs in a remote part of the world, cell phone users can easily report on the crisis, but such reports are far more useful if they are made available as text that can be automatically searched and collated. Further, such a system would continue to empower the society even after the emergency is over, as thousands of citizen journalists would be able to share their reports in a useful form.
ASR can also improve business practices in emerging economies like India, which is home to a large number of low-resource languages.
“Already, in India, there are experimental projects which let farmers find prices for their agricultural commodities using automated telephonic services,” Jyothi explained. “Such systems will be much more powerful if they covered the vast populations in remote parts of the country who speak languages and dialects which are truly low-resourced.”