Shomorony earns prestigious NSF CAREER Award to better understand genomic data problems

11/12/2021 Laura Schmitt

His framework will combine information theory and novel algorithms to accurately analyze genomic data from plants, bacteria, viruses, and the human microbiome.


An expert in applying information theory to computational biology, ECE Assistant Professor Ilan Shomorony is developing new algorithms to analyze genomic data while ensuring their accuracy. Many of the techniques he develops assemble the genomes of species that haven’t been sequenced before, including plants, bacteria, viruses, and the human gut microbiome.

Ilan Shomorony

In June, he received a $500,000 NSF CAREER award for young faculty to develop a framework that establishes the information limits of genomic data science problems.

Human, viral, and bacterial genomic data is poised to revolutionize healthcare and biology by giving researchers knowledge that can lead to better diagnoses and treatments and a better understanding of the mechanisms of infectious disease. However, acquiring, processing, and analyzing vast amounts of this data poses challenges.

“For example, we need to know how much genomic data is needed to carry out specific computational tasks, as this is related to the cost of acquiring the data,” said Shomorony. “Understanding the information limits allows us to develop reliable tools for the analysis of genomic information.”

Shomorony will examine a range of issues, including how much sequencing data is needed to learn the genome of a species reliably, how much genomic sequencing data can be compressed while maintaining its usefulness, and how sequencing errors impact the ability to perform biologically valid inferences.

The computationally efficient algorithms he develops will focus on three areas: aligning pairs of sequences, reconstructing sequences from noisy fragments, and clustering sequences based on appropriate metrics.

“Since the pairwise alignment of a large number of noisy sequences is often a bottleneck in genomic data science, the first [research aim] will be to study how low-dimensional representations of these sequences, or sketches, can be optimally used for alignment computation,” he said.

Shomorony will leverage a source-coding framework to study the tradeoffs between sketch size and the incurred distortion in alignment computation.
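
To give a concrete feel for what a sequence sketch is, here is a minimal illustration using MinHash, one widely used sketching technique. It is not drawn from Shomorony's project, which may rely on different sketches; the function names, the k-mer length, and the sketch sizes are all assumptions chosen for illustration. Each sequence is reduced to a short list of hash values, and the overlap between two such lists approximates how similar the full sequences are. A smaller sketch is cheaper to store and compare but gives a noisier estimate, which is the kind of size-versus-distortion tradeoff described above.

```python
import hashlib


def kmers(seq, k):
    """Return the set of all length-k substrings (k-mers) of a sequence."""
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}


def hash_kmer(kmer):
    """Map a k-mer to a 64-bit integer with a stable hash."""
    digest = hashlib.blake2b(kmer.encode(), digest_size=8).digest()
    return int.from_bytes(digest, "big")


def sketch(seq, k=8, size=64):
    """Bottom-s MinHash sketch: the `size` smallest k-mer hash values.

    The names and defaults here are illustrative assumptions.
    """
    return sorted(hash_kmer(km) for km in kmers(seq, k))[:size]


def estimate_jaccard(sketch_a, sketch_b, size=64):
    """Estimate k-mer Jaccard similarity from two bottom-s sketches."""
    merged = sorted(set(sketch_a) | set(sketch_b))[:size]
    if not merged:
        return 0.0
    shared = set(sketch_a) & set(sketch_b)
    return sum(1 for h in merged if h in shared) / len(merged)


def exact_jaccard(seq_a, seq_b, k=8):
    """Exact k-mer Jaccard similarity, for comparison with the estimate."""
    ka, kb = kmers(seq_a, k), kmers(seq_b, k)
    union = ka | kb
    return len(ka & kb) / len(union) if union else 0.0


if __name__ == "__main__":
    import random

    rng = random.Random(0)
    original = "".join(rng.choice("ACGT") for _ in range(2000))
    # Simulate a noisy copy by substituting roughly 2% of the bases.
    noisy = "".join(
        rng.choice("ACGT") if rng.random() < 0.02 else base
        for base in original
    )
    print("exact:", exact_jaccard(original, noisy))
    for size in (16, 64, 256):
        est = estimate_jaccard(sketch(original, size=size),
                               sketch(noisy, size=size), size=size)
        print("sketch size", size, "estimate:", est)
```

Running the script prints the exact similarity next to estimates from sketches of three sizes, so the effect of shrinking the sketch can be seen directly.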

The second research aim is to tackle computational complexity obstacles such as NP-hardness, which often do not appropriately capture the complexity of real-world problems. Shomorony will introduce a notion of instance-based informational hardness to develop efficient algorithms with instance-specific theoretical guarantees.

His third aim is to study the problem of clustering sequences in the context of metagenomic sequencing, where the goal is to determine which sequences come from the same microbial genome. He will incorporate information-theoretic clustering metrics into algorithms that seek to resolve microbial communities at the maximum resolution allowed by the data.

In addition to the research, Shomorony will introduce a new graduate-level course in the spring 2022 semester. The course, ECE 598 Fundamental Limits in Data Science, will introduce students to applying tools and ideas from information theory to genomic data science problems.

The NSF CAREER Award is the agency’s most prestigious award in support of early-career faculty who have the potential to serve as academic role models in both research and education and can advance the mission of their respective department or organization. 


This story was published November 12, 2021.