2/17/2014 | Written by Meg Dickinson, ECE ILLINOIS
It sounds like a tricky riddle. If you have an eight-hour recording, and you want to know how many unusual noises are on it, how long will it take you to find out?
Eight hours, right?
Not exactly. With a new computer program developed with the help of two ECE ILLINOIS professors, humans can now easily sort through long recordings while quickly picking out sounds that don’t belong.
It’s the result of research ECE Professor Mark Hasegawa-Johnson and several others recently published in two articles related to noises that grab human attention.
The research started when Dirk Bernhardt-Walther, a former postdoctoral fellow at the Beckman Institute, worked with ECE Professor Thomas S. Huang to study which visual elements elicit human attention.
Then, the research group started discussing ways that research could be applied, and came up with two possibilities, Hasegawa-Johnson said.
If you can predict what visual elements can grab human attention, can you do the same thing with audio? And later, they asked, is there a way to help humans easily filter through large amounts of audio?
The first publication, in Pattern Recognition Letters, explains what Hasegawa-Johnson called an “unexpectedly simple” answer to the first question. Noises that get humans’ attention start loudly. It doesn’t matter how loud they are when they end, or if they’re periodic. Even pitch doesn’t matter.
However, coming to this conclusion took some work. Research from the 1990s at Caltech used eye-tracking to see what grabs human attention visually, and Bernhardt-Walther has expanded on this work.
But there’s no such thing as ear tracking. The only way you can tell what a human is listening to is to ask.
So, the paper’s authors gave listeners 20 hours of audio recordings, including simulated business meetings, and asked them to label the beginning and end of any sound that drew their attention by clicking a button.
“As it turned out, of many things we analyzed, the only thing that correlated is that the subjects clicked the button when the sound turned on,” Hasegawa-Johnson said.
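That single cue, a sudden jump in loudness, is simple enough to sketch in code. Here is a minimal illustration in Python, assuming a mono recording in a NumPy array; the function name, frame length, and decibel threshold are hypothetical choices for illustration, not parameters taken from the study.

```python
import numpy as np

def loud_onsets(signal, sr, frame_ms=50, jump_db=12.0):
    """Flag times where short-term loudness jumps sharply.

    A toy version of the finding that attention-grabbing sounds
    begin loudly; frame_ms and jump_db are illustrative values.
    """
    frame = int(sr * frame_ms / 1000)
    n = len(signal) // frame
    # Root-mean-square energy per frame, converted to decibels.
    rms = np.sqrt(np.mean(signal[:n * frame].reshape(n, frame) ** 2, axis=1))
    db = 20 * np.log10(rms + 1e-10)
    # An "onset" is a frame much louder than the one before it.
    jumps = np.where(np.diff(db) > jump_db)[0] + 1
    return jumps * frame / sr  # onset times in seconds
```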
That led to the research that discovered a way to sort through long recordings quickly. It’s being published in ACM Transactions on Applied Perception.
The idea started when the National Science Foundation put out a call for proposals for ways humans could “find patterns in large quantities of data using visual analytics tools,” Hasegawa-Johnson said.
“We came up with the idea of being able to do that on audio data,” he said, with the goal of allowing someone like a security guard monitoring many recordings to be able to tell when something interesting is happening.
To create this tool, Hasegawa-Johnson and his fellow researchers worked with Camille Goudeseune, a computer systems analyst at the Beckman Institute’s Illinois Simulator Laboratory.
They built software around the spectrogram, a visual display of sound that has been around longer than computers. Spectrograms show both time and frequency, but traditionally they cover only a few seconds of audio, which limits how much information a single display can convey.
“Creating a spectrogram from a long sound requires throwing away information, so therefore, most spectrograms are very short,” Hasegawa-Johnson said. “Speech scientists prefer to look at spectrograms of no more than 2 or 3 seconds per computer screen, for fear of losing information.”
To make one capable of monitoring hours of recordings, Goudeseune built a program that shows 12 hours of audio on a single computer screen yet lets the user zoom in to a single second's worth of detail. It's called the timeliner.
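One way to shrink hours of audio into a single screen without losing brief events is to max-pool the spectrogram's columns rather than average them, so a short spike survives the reduction. The sketch below, in Python with SciPy, illustrates the idea; the pooling strategy and parameters are assumptions for illustration, not a description of how the timeliner itself is implemented.

```python
import numpy as np
from scipy.signal import spectrogram

def overview_spectrogram(signal, sr, screen_cols=1920):
    """Reduce a long recording's spectrogram to one screen width.

    Max-pooling over time keeps brief loud events visible, where
    averaging would smear them away. The pooling choice and the
    1920-column default are illustrative assumptions.
    """
    f, t, sxx = spectrogram(signal, fs=sr, nperseg=1024)
    cols = sxx.shape[1]
    pool = max(1, cols // screen_cols)          # columns merged per pixel
    trimmed = sxx[:, : (cols // pool) * pool]   # drop the ragged tail
    pooled = trimmed.reshape(sxx.shape[0], -1, pool).max(axis=2)
    return f, t[::pool][: pooled.shape[1]], pooled
```

Zooming in would then amount to re-running the same computation over a narrower slice of the signal with a smaller pool factor.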
To test it, the researchers started with orchestra recordings, which look fairly uniform on a spectrogram. Then they added short recordings of birds chirping, cows mooing, motorcycles revving their engines, and even Pac-Man dying.
The timeliner makes those unexpected sounds more visually prominent by using research about what grabs human attention visually.
“What’s grabbing your eye is a bright pattern,” Goudeseune said, referring to spikes that show a robin chirping over the orchestra. “It’s kind of angular and spiky. It looks different than what’s around it.”
They demonstrated the timeliner at the 2011 Beckman Institute Open House, and found the average visitor used it to find animal noises in a recording three times faster than if they were just listening. Even 6-, 7-, and 8-year-olds found the unusual sounds with ease.
Then they added the results of Bernhardt-Walther and Huang’s research on visual salience, the signal features that draw human attention.
They modified the spectrogram display so that acoustically unusual sounds are more visually salient, while all other sounds are more or less grayed out. Huang’s student, Kai-Hsiang Lin, named this display the “Saliency-Enhanced Spectrogram.”
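The effect can be approximated with a crude stand-in for saliency. In the sketch below, spectral flux, the positive frame-to-frame change in the spectrum, plays the role of the saliency model, and each spectrogram column is dimmed in proportion to how unremarkable it is; the actual Saliency-Enhanced Spectrogram rests on a more principled attention model than this.

```python
import numpy as np

def saliency_enhance(sxx, floor=0.15):
    """Dim spectrogram columns in proportion to a crude saliency proxy.

    Spectral flux stands in for the study's saliency model, and
    `floor` keeps dimmed regions faintly visible rather than black.
    Both choices are illustrative assumptions.
    """
    log_s = np.log1p(sxx)                    # compress dynamic range
    # Positive spectral change between consecutive frames.
    flux = np.maximum(np.diff(log_s, axis=1), 0).sum(axis=0)
    flux = np.concatenate([[0.0], flux])     # pad the first frame
    sal = flux / (flux.max() + 1e-10)        # normalize to [0, 1]
    weights = floor + (1 - floor) * sal      # routine audio fades out
    return log_s * weights[np.newaxis, :]
```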
Study participants were given eight minutes to find video-game sound effects embedded within an 80-minute recorded conversation. Those using the saliency-enhanced spectrogram found 70 percent of the embedded sounds, while those using a regular spectrogram found only 35 percent.
Hasegawa-Johnson said that humans might still be able to detect unusual sounds in a recording sped up to two or three times its normal rate.
“But after that, you start to miss things,” he said.
Goudeseune’s next step will be to create a way to use his program to visualize up to 100 recordings at a time. He’s also found a way to visualize geotagged audio recordings to show unusual sounds recorded with thousands of microphones over a specific area.
Hasegawa-Johnson said he wants to use this research to improve speech recognition algorithms, with the ultimate goal of letting humans pick out acoustically notable changes in audio for quick transcription.
He said his group’s research on attention-getting audio combines the strengths of humans, who are accurate but not very fast, and computers, which can process hours of audio data quickly.
“Some tasks are best done by a machine, some are best done by a human,” he said, “but a human and a machine working together is usually the best solution.”