Milenkovic's $1.3M BD2K grant tackling Big Data challenges in genomics
For some companies, email has taken the place of face-to-face communication.
But for scientists and researchers who focus on genomic, functional genomic, and proteomic research (proteomics is the large-scale study of proteins), emailing work materials and datasets is an idealistic dream. Genomic datasets are far too large to send as attachments.
Associate Professor Olgica Milenkovic is hoping to change this with the help of one of the inaugural Big Data to Knowledge (BD2K) Awards for software development from the National Institutes of Health (NIH). The award is a $1.3 million grant.
She and her team will be working on special-purpose compression algorithms to make a wide range of data pertaining to human, animal, and microbial genomes of a more usable size for researchers.
“The NIH has recognized the immense storage and communication challenges that will arise with the ever-increasing volumes of genomic data, and the need to reduce the cost of maintaining large biological data repositories,” Milenkovic said. “Hospitals and research labs are required by law to keep patient data for many years. Research labs will literally become warehouses full of patient data.”
For example, a hospital may spend half a million dollars a year paying for storage facilities and technology. But if one can decrease the file size by 10 or 20 times, the savings would be immense.
“Those resources could be used to support fundamental research,” she said. “Therefore, we see a strong need to pursue new ways to compress DNA data for storage on classic disk systems.”
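As a rough illustration of the savings described above, the arithmetic can be sketched in a few lines of Python. This is a minimal sketch that assumes storage cost scales linearly with the volume of stored data, which the article does not state; the function name is hypothetical, and the $500,000 figure is the article's own example.

```python
def annual_storage_cost(baseline_cost, compression_factor):
    """Yearly cost after compression.

    Assumption (not from the article): cost is proportional to stored bytes,
    so files that are N times smaller cost N times less to keep.
    """
    return baseline_cost / compression_factor


baseline = 500_000  # dollars per year, the article's example figure
for factor in (10, 20):
    cost = annual_storage_cost(baseline, factor)
    saved = baseline - cost
    print(f"{factor}x smaller files: ${cost:,.0f} per year, saving ${saved:,.0f}")
```

Under that assumption, a 10-fold reduction would leave a $50,000 yearly bill and a 20-fold reduction a $25,000 one.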
Milenkovic's team is among a handful working to develop new algorithmic solutions for metagenomic and functional genomic compression and compressed computing. Metagenomics is the study of genetic material recovered directly from environmental samples.
“We have developed algorithms compatible with current data standards that outperform existing methods five or six times with respect to achievable compression rates,” Milenkovic said.
Compression happens only once, but decompression needs to be fast because it happens many times, she said. Search and alignment algorithms should also work directly on the compressed data, to avoid repeated rounds of compression and decompression.
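To make the idea concrete, here is a toy Python sketch of a fixed 2-bit encoding for DNA bases. It is not the team's algorithm, which the article does not describe; it is a textbook-style illustration that assumes sequences contain only A, C, G, and T, and the function names are hypothetical. It shows two of the properties mentioned above: decoding is a cheap lookup, and a single base can be read straight from the packed bytes without unpacking everything.

```python
CODES = {"A": 0, "C": 1, "G": 2, "T": 3}  # 2 bits per base
BASES = "ACGT"


def pack(seq):
    """Pack a DNA string into bytes, four bases per byte."""
    out = bytearray((len(seq) + 3) // 4)
    for i, base in enumerate(seq):
        out[i // 4] |= CODES[base] << (2 * (i % 4))
    return bytes(out)


def base_at(packed, i):
    """Read base i directly from the packed form, without unpacking the rest."""
    return BASES[(packed[i // 4] >> (2 * (i % 4))) & 3]


def unpack(packed, length):
    """Recover the first `length` bases from packed bytes."""
    return "".join(base_at(packed, i) for i in range(length))


seq = "GATTACA"
packed = pack(seq)
print(len(seq), "bases stored in", len(packed), "bytes")
print(unpack(packed, len(seq)) == seq)
```

Compared with storing one byte per base, this fixed encoding gives at most a 4-fold reduction, well short of the 10- to 20-fold figures discussed in the article, which is why special-purpose methods are needed.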
“If anyone told you a few years back you would be using coding theory and information theory to deal with DNA data, most people would say ‘you’re joking,’” she said. “But now people realize we absolutely need to do new work in source coding.”
Her collaborators on the compression projects at Illinois include Professor Venu Veeravalli, ECE graduate student Minji Kim, and Bioengineering Assistant Professor Jian Ma.