Microtasks speed speech recognition research

1/18/2017 Doug Peterson

ECE ILLINOIS researchers are finding ways to make sense out of nonsense, and they have given a dramatic boost to the creation of speech recognition systems for languages around the world.

“Everyone knows that when a person listens to a foreign language, the sounds make no sense,” says Professor Mark Hasegawa-Johnson.

However, Hasegawa-Johnson’s team has found a way to correlate the nonsense syllables that people hear when listening to a foreign language with the actual sounds, or phonemes, of that language. Using the data that result, they are able to develop speech recognition systems—the kind of system behind Siri, the famed voice on Apple phones.

Before their project, most experts dismissed the idea of using non-native speakers to transcribe sounds as “absurd,” he says. But his team showed it could work.

According to Hasegawa-Johnson, there are close to 7,000 spoken languages in the world today, but automatic speech recognition systems exist for only about 40 of them. Speech recognition on cell phones is offered for fewer still: under 20 languages.

With this new approach, Illinois researchers are able to create automatic speech recognizers faster and cheaper. “What I would like to see happen is that we reduce the cost enough that Google, Apple, or Microsoft will offer a speech recognizer for more languages,” he explains.

Mark Hasegawa-Johnson

Hasegawa-Johnson says his team aims to create low-cost speech recognition systems for 200 languages over the next two years. So far, they have developed software for roughly 20 “low-resource languages”: languages for which no good speech recognizer yet exists.

The research project arose near the end of 2013 when Hasegawa-Johnson’s graduate student at the time, Preethi Jyothi, tried to find available data to create an automatic speech recognition system for her native language of Hindi. This data, based on transcribed audio, is used to train a system to recognize a particular language.         

After considerable searching, Jyothi eventually found some non-transcribed audio that could be used to create a speech recognizer for Hindi, a language spoken by hundreds of millions of people worldwide. To transcribe the audio, one option was to find a single well-trained expert to meticulously do the transcriptions, but this would be time-consuming and expensive.

So Hasegawa-Johnson says, “We did the opposite.” They turned for help to hundreds of English speakers who couldn’t speak a word of Hindi.

Hasegawa-Johnson’s team hired English speakers on the crowdsourcing site Mechanical Turk to write down the Hindi audio as the nonsense syllables they heard. Suddenly, Jyothi and Hasegawa-Johnson had access to hundreds of people who could perform these microtasks: transcribing one-second audio clips of a foreign language.

After the non-expert listeners wrote down what they heard, the researchers mapped out connections between the various nonsense syllables and the actual phonemes, or sounds, of the target foreign language. Then they built probability mass functions, which give the probability that a particular nonsense syllable corresponds to each sound in the foreign language.
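
To make the idea concrete, here is a minimal sketch in Python of how such a probability mass function could be estimated from counts. The syllable-phoneme pairs below are invented for illustration, and the assumption that each clip is already matched with its true phoneme is a simplification; the team’s actual pipeline is more involved than this toy example.

    from collections import Counter, defaultdict

    # Hypothetical data: each pair matches a nonsense syllable written by a
    # non-native listener with the Hindi phoneme actually spoken in that
    # one-second clip. These pairs are invented for illustration.
    annotations = [
        ("kah", "k"), ("kah", "kh"), ("gah", "g"),
        ("kah", "k"), ("tah", "t"), ("dah", "d"),
        ("gah", "k"), ("tah", "th"),
    ]

    # Count how often each heard syllable co-occurs with each phoneme.
    counts = defaultdict(Counter)
    for heard, phoneme in annotations:
        counts[heard][phoneme] += 1

    # Normalize the counts into a probability mass function per heard
    # syllable: P(phoneme | heard syllable).
    pmf = {
        heard: {ph: n / sum(c.values()) for ph, n in c.items()}
        for heard, c in counts.items()
    }

    print(pmf["kah"])   # {'k': 0.666..., 'kh': 0.333...}

Roughly speaking, a recognizer trained this way treats each nonsense transcription as evidence about a distribution over possible phonemes rather than as a hard label.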

Hasegawa-Johnson says they can create a speech recognizer using roughly 10 hours of transcribed audio. But if they can get that number up to about 100 hours of speech data for a particular language, the recognizer may have an error rate small enough to be commercially useful. 

As work progressed and they focused on more and more languages, the research team didn’t rely only on English-speaking transcribers. They also went to another crowdsourcing site, Upwork, which is more expensive than Mechanical Turk but has transcribers who speak a more diverse set of languages, including Hindi and Mandarin. Finding transcribers who spoke Mandarin was useful for transcribing Vietnamese because the two languages have similar sound systems.

In addition to Hindi and Vietnamese, other low-resource languages they have worked on include Arabic, Hungarian, Swahili, Zulu, Dinka, and Urdu. The reason speech recognition systems for low-resource languages would be so helpful, Hasegawa-Johnson explains, is that cell phones abound in Third World countries.

“People don’t always have landline phones, but many have cell phones,” he says.

Speech recognition systems for phones in less-developed countries would open up economic possibilities, he adds. For instance, people who make fantastic woven products in the mountains of Zaire could use a speech recognition system to create websites on their phones and sell their products worldwide. In cases of natural disasters or riots, troops or aid workers could use a speech recognition system to monitor radio broadcasts and determine where help is needed.

Hasegawa-Johnson says that once they develop systems for the first 200 languages, it may be difficult to find audio for the remaining languages that could be used for transcriptions. But they will push on.

As he puts it, “The Holy Grail is to have speech recognition in every language. ‘Every’ is a big word, but I think within the next 10 years it will be available in hundreds of languages.”

This story first appeared in the fall/winter 2016 issue of Resonance, ECE ILLINOIS' semi-annual magazine.


This story was published January 18, 2017.