Software canaries to detect failures in computer processors
Elise King, Coordinated Science Lab
- ECE Assistant Professor Rakesh Kumar and his former student John Sartori have received a grant to study the use of software canaries in detecting hardware failures.
- A canary is a part of a system that fails before an entire program fails.
- Unlike a hardware canary, a software canary runs on the same processor as the actual program, and could save power and enhance performance.
ECE Assistant Professor Rakesh Kumar and ECE alumnus and University of Minnesota Assistant Professor John Sartori (MSEE ’10, PhD ’12) have received a $300,000, 3-year grant from the National Science Foundation and the Semiconductor Research Corporation to research the use of software canaries in detecting hardware failures.
Kumar, a researcher in the Coordinated Science Lab, said the idea of a software canary can be explained by the analogy of canaries in mines. Miners would bring birds into mines to detect methane gas. When the canaries stopped singing, the miners knew the birds had died and evacuated before they themselves were harmed.
"The idea of a canary, in the context of processors, is that, if you can build something inside a processor that will fail before your program fails, then it’s a good detector," said Kumar. "It means that maybe you can dial down the speed, or you can dial up the voltage so that your programs can work correctly."
Traditionally, the industry has used hardware canaries to check for problems. However, “when you’re building a hardware warning system to check a different piece of hardware, you know it’s a question of who checks the checker,” Kumar said. The hardware canary will suffer from the same kinds of issues that the actual hardware would, so the system is very conservative.
A software canary, by contrast, can run on the same processor that the actual program runs on, as opposed to running alongside it, and is therefore less conservative and can potentially save power and enhance performance, Kumar said.
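The control loop Kumar describes can be illustrated with a small simulation. This is only a sketch under stated assumptions, not the researchers' implementation: the names `OperatingPoint`, `simulated_canary`, and `settle`, the frequency values, and the fault model (the canary starts returning wrong answers above a fixed clock speed, a little before the real program would) are all hypothetical and chosen purely for illustration.

```python
from dataclasses import dataclass

# Known-good answer for the canary computation: sum of 0..99.
EXPECTED = 4950


@dataclass
class OperatingPoint:
    freq_mhz: int
    voltage_mv: int


def simulated_canary(point, fails_above_mhz=2000):
    """Hypothetical fault model: the canary computes a value with a known
    answer, and the answer is corrupted when the clock is too fast."""
    result = sum(range(100))
    if point.freq_mhz > fails_above_mhz:
        result ^= 1  # simulate a single flipped bit
    return result


def settle(point, fails_above_mhz=2000, freq_step=100, min_freq=800, max_rounds=50):
    """Dial the speed down until the canary produces the right answer again."""
    for _ in range(max_rounds):
        if simulated_canary(point, fails_above_mhz) == EXPECTED:
            return point  # canary is healthy, keep this operating point
        new_freq = max(min_freq, point.freq_mhz - freq_step)
        point = OperatingPoint(new_freq, point.voltage_mv)
    return point


if __name__ == "__main__":
    start = OperatingPoint(freq_mhz=2400, voltage_mv=900)
    print(settle(start))
```

In this toy model the loop only lowers the clock speed; raising the voltage, the other option Kumar mentions, would follow the same pattern. The sketch also assumes the canary always fails before the program does, which is the property a real canary would have to be designed to provide.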
This grant is part of the NSF/SRC Joint Initiative in Failure Resistant Systems. “It’s a big program,” Kumar said. “They encourage impactful research in failure-resistant systems . . . and essentially they are looking for cross-cutting solutions.” Recently, Oracle Corp. has also taken an interest in this specific research and is offering its machines for testing the software canaries.
Kumar and Sartori worked on projects together when Sartori was a graduate student in Kumar's group. Kumar said he likes that this project gives them the opportunity to keep working together. “I’m also very happy with the fact that [Sartori] has something to support his research as soon as he has started,” he said.