Using high-performance computing to unlock the mysteries of genetics


Bill Bell, Engineering at Illinois

When scientists first mapped the human genome in 2000, the promise of personalized medicine went from a remote dream to a breakthrough within grasp. Thirteen years later, however, momentum has slowed because the ability to sequence DNA has begun to outpace computing—in particular, the capability of storing, transmitting and, most critically, analyzing the data. 

With a $2.6 million grant from the National Science Foundation, researchers from across the University of Illinois are collaborating to build an instrument that will enable faster, more accurate DNA sequencing and the processing of massive data sets. The potential payoff: a better understanding of the basic processes of life, insight into how evolution works, and custom treatments for disease, among other breakthroughs.

“With Illinois’s expertise in genomic research and high-performance computing, we believe that we can have an enormous impact in this field,” said ECE Professor Steven Sam Lumetta, the principal investigator on the project, which is known as the CompGen Initiative. “The machine will be built with genomic applications in mind, with the idea that eventually we’d like to see this technology migrate to a cloud environment.”

To develop and build the instrument, Lumetta and ECE Professor Ravishankar K. Iyer have partnered with researchers from Illinois’s Institute for Genomic Biology (IGB), including Saurabh Sinha and Victor Jongeneel. Lumetta and Iyer are also professors at the Coordinated Science Laboratory.

“We’re on the cusp of a second genomic revolution, but we need big data to make it happen,” said IGB Director Gene Robinson. “The purpose of CompGen is to open the floodgates for genomic information and transform computing for genomics.”

The team building this unique instrument includes a consortium of about 15 companies, universities, and research institutions. They’ll design the instrument’s hardware and software simultaneously, creating a single integrated platform to drive new breakthroughs in genomics. The consortium will also allow for new applied research projects that require the instrument. Already teams are collaborating on improved error correction, genome assembly, and variant calling.

Currently, the world’s most powerful sequencers map an individual’s DNA by chopping up the genome’s 3 billion nucleotides, which together encode a person’s genetic instructions, into very short strings that machines can effectively process. Researchers must then take these short strings and order them correctly, much like putting together a million-piece puzzle.
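To make the puzzle analogy concrete, here is a minimal Python sketch of reassembling short fragments by repeatedly merging the two pieces with the longest overlap. It is a toy illustration only, not the CompGen team's method: the function names (`overlap`, `greedy_assemble`), the `min_len` parameter, and the example sequence are all hypothetical. Real assemblers must also cope with sequencing errors, repeated regions, and billions of fragments.

```python
from itertools import permutations


def overlap(a: str, b: str, min_len: int = 3) -> int:
    """Length of the longest suffix of `a` that is also a prefix of `b`.

    Overlaps shorter than `min_len` are ignored and reported as 0.
    """
    for k in range(min(len(a), len(b)), min_len - 1, -1):
        if a.endswith(b[:k]):
            return k
    return 0


def greedy_assemble(reads, min_len: int = 3):
    """Greedily merge the pair of fragments with the longest overlap.

    Assumes error-free fragments and no repeats (a strong simplification).
    Returns the list of sequences left when no further merge is possible.
    """
    reads = list(dict.fromkeys(reads))  # drop exact duplicates, keep order
    while len(reads) > 1:
        best_k, best_a, best_b = 0, None, None
        for a, b in permutations(reads, 2):
            k = overlap(a, b, min_len)
            if k > best_k:
                best_k, best_a, best_b = k, a, b
        if best_k == 0:
            break  # nothing overlaps any more
        reads.remove(best_a)
        reads.remove(best_b)
        reads.append(best_a + best_b[best_k:])  # merge, keeping the overlap once
    return reads


# Three shuffled fragments of the made-up sequence "ATGGCGTACGTTAGC"
fragments = ["ACGTTAGC", "ATGGCGTA", "CGTACGTT"]
print(greedy_assemble(fragments))  # ['ATGGCGTACGTTAGC']
```

Every round of this sketch compares every fragment against every other, which is workable for three pieces but hopeless for the hundreds of millions produced by a real sequencer. That gap is one reason the analysis step leans so heavily on computing power.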

“I believe that genomic data is indeed the most complex of all big data problems and has the potential to be transformational to computer science and engineering in all of its aspects,” Iyer said. 

The goal of the CompGen instrument is to accelerate genomic science, using new computational technologies and techniques to take advantage of the growing availability of genomic data. It will do this by incorporating technologies, such as non-volatile memory and die-stacked memory, that are only beginning to make their way into commercial products.

The instrument will combine custom, state-of-the-art technologies that enable the processing, retrieval and storage of massive data sets. With CompGen’s scaling capabilities, scientists hope to be able to compare large genome collections and explore such complex issues as the impact of climate change on gene expression and ecosystems, as well as social aspects of genomics.

“We’re seeing exponential growth in gene sequencing today. The amount of genome data available for exploration doubles every five months,” said CSL Interim Director Klara Nahrstedt. “A project like CompGen is perfect for Illinois, requiring an organization that has a deep understanding of both Big Data and of biology and genomics.”
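As a rough illustration of what a five-month doubling time implies, the short Python sketch below computes how much the volume of data would grow over a given period. It assumes the doubling rate stays constant, which the article does not claim; the function name `growth_factor` is hypothetical.

```python
def growth_factor(months: float, doubling_months: float = 5.0) -> float:
    """Factor by which data volume grows after `months`,
    assuming it doubles every `doubling_months` (constant rate assumed)."""
    return 2.0 ** (months / doubling_months)


print(round(growth_factor(12), 2))  # one year: about 5.28 times as much data
print(growth_factor(60))            # five years: 4096.0 times as much data
```

Under that assumption, a storage system sized for today's data would face more than five times as much within a year.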

Members of the CompGen project include: 

  • University of Illinois at Urbana-Champaign
  • Abbott Nutrition
  • Agilent Technologies
  • Baylor College of Medicine Human Genome Sequencing Center
  • Beijing Genomics Institute
  • IBM
  • Infosys
  • Intel
  • Mayo Clinic
  • Microsoft
  • Monsanto
  • MultiCoreWare
  • Strand Life Sciences
  • Tata Institute of Fundamental Research
  • Tezzaron Semiconductor
  • Washington University’s Genome Institute