Nature Methods details new tool that powers gene expression studies

Rob Patro, assistant professor of computer science, leads team that develops lightweight method that provides fast and bias-aware quantification from RNA-seq reads

For a team of computational biology researchers from the Department of Computer Science (CS) at Stony Brook University, University of North Carolina – Chapel Hill, Harvard School of Public Health, Carnegie Mellon School of Computer Science, and private industry, creating a tool that is ultra-fast and improves the accuracy of abundance estimates proved to be a challenge that spanned several years and several funding sources.

The resulting software tool, Salmon, published in the March 6th edition of Nature Methods, combines a dual-phase parallel inference algorithm with feature-rich models that can correct for fragment GC content and other technical biases in RNA-seq data, thereby improving the accuracy of transcript abundance estimates and the sensitivity of downstream expression analyses.

In genomics, transcript abundance estimates are used, for example, to classify diseases and their subtypes, understand how gene expression changes correlate with phenotype, and track the progression of cancer. The accuracy of abundance estimates derived from RNA-seq data is especially important given the wide range of biases that affect the RNA-seq fragmentation and sequencing processes, and the use of expression data in studying disease and, eventually, in medical diagnosis and personalized treatments.

Created by researchers Rob Patro, Geet Duggal, Michael Love, Rafael A. Irizarry, and Carl Kingsford, Salmon synthesizes, into one tool, many algorithmic and methodological advances that will power gene expression studies, both small and large-scale.

According to Patro, the hallmarks of the method are its speed, accuracy and robustness. Salmon runs at a similar speed to existing fast algorithms for quantifying gene expression, yet it incorporates a rich and expressive model of the underlying experiment, including many technical biases, and uses a new statistical inference procedure to estimate gene expression quickly and accurately.

Salmon was developed with the help of funding from the Gordon and Betty Moore Foundation’s Data-Driven Discovery Initiative, the National Science Foundation, and the National Institutes of Health.

“This research represents a perfect storm for computer science. We have a group of knowledge-driven collaborators from across the United States, funded by multiple sources, and striving for advancing genomic research by developing an innovative tool. I congratulate them on this discovery,” said Dr. Arie Kaufman, chair of the computer science department.

Looking forward, Patro remarks, “The methodological underpinnings of Salmon provide a framework upon which we can continue to build accurate models and efficient inference algorithms. We are working on understanding and modeling an even larger array of potential technical biases that arise in RNA-seq-based gene expression studies. We are also particularly interested in how quantification algorithms can be made more accurate and robust in single-cell RNA-sequencing (scRNA-seq) experiments, which present unique algorithmic and statistical challenges.”

About the CS Researcher

Rob Patro joined the Department of Computer Science at Stony Brook in 2014 as an assistant professor. He earned his PhD from the University of Maryland at College Park in 2012 and served as a postdoctoral researcher at Carnegie Mellon University. His research focuses on scalable algorithms for high-throughput genomic analysis, algorithms for inferring and comparing biological networks, and network evolution and systems.