Faculty and doctoral students from the Department of Computer Science at Stony Brook University recently developed Mantis, a space-efficient system that uses new data structures to index large collections of raw sequencing data.
Photo credit: DNA with binary codes ktsdesign/Shutterstock
Searching public sequencing archives for evidence of a particular sequence is potentially a very powerful capability. If, for example, a scientist discovers a new gene or variant they believe to be associated with some condition (e.g., a particular disease), they may want to query the entire archive to find which samples contain this gene or variant.
Searching such immense public databases is a true challenge, and Mantis significantly advances the state of the art, bringing this goal closer to reality.
Developed by computer science professors Robert Patro, Michael Bender, Michael Ferdman, and Rob Johnson along with PhD students Prashant Pandey and Fatemeh Almodaresi, Mantis is capable of sequence-level searches on large collections of RNA sequencing experiments, such as a large subset of experiments from the NCBI SRA. It enables users to ask which experiments are likely to contain a particular query sequence (e.g., a particular transcript), and provides them with a measure of the quality of the discovered match.
With Mantis, the improvement in search speed can be up to 100X compared to previous state-of-the-art approaches, leading to the potential for near-interactive sequence searches. Since Mantis is exact, query results contain zero false positives or negatives.
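To make the idea of an exact sequence search concrete, here is a minimal sketch of one way such a query could work. It is not Mantis's implementation, and the article does not describe its internals: the k-mer decomposition, the function names (`build_index`, `query`), the `min_fraction` threshold, and the toy data are all assumptions made for illustration. The sketch maps every length-k substring (k-mer) to the set of experiments that contain it, then scores each experiment by the fraction of the query's k-mers it holds, which stands in for the "measure of quality" of a match. Because the lookup table is exact, a k-mer is reported for an experiment only if it truly occurs there.

```python
from collections import defaultdict

def kmers(seq, k):
    """Yield every length-k substring (k-mer) of seq."""
    for i in range(len(seq) - k + 1):
        yield seq[i:i + k]

def build_index(experiments, k):
    """Map each k-mer to the set of experiment IDs whose reads contain it."""
    index = defaultdict(set)
    for exp_id, reads in experiments.items():
        for read in reads:
            for kmer in kmers(read, k):
                index[kmer].add(exp_id)
    return index

def query(index, sequence, k, min_fraction=0.8):
    """Return {experiment ID: fraction of query k-mers present},
    keeping only experiments at or above min_fraction (a hypothetical cutoff)."""
    query_kmers = set(kmers(sequence, k))
    if not query_kmers:
        return {}
    counts = defaultdict(int)
    for kmer in query_kmers:
        # .get avoids inserting missing k-mers into the defaultdict
        for exp_id in index.get(kmer, ()):
            counts[exp_id] += 1
    total = len(query_kmers)
    return {e: c / total for e, c in counts.items() if c / total >= min_fraction}

# Toy data: two made-up "experiments", each a list of short reads.
experiments = {
    "exp_A": ["ACGTACGTGA", "TTGACCA"],
    "exp_B": ["ACGTAC", "GGGTTT"],
}
index = build_index(experiments, k=4)
hits = query(index, "ACGTACGT", k=4, min_fraction=0.8)
print(hits)
```

In this toy example only `exp_A` contains all four distinct k-mers of the query; `exp_B` contains three of the four and is filtered out unless the threshold is lowered. A plain dictionary like this would not scale to thousands of sequencing experiments, which is where space-efficient data structures of the kind the article describes come in.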
To give you a sense of the power of Mantis, it is able to complete a search for 200,400 known human transcripts over an index of 2,652 RNA sequencing experiments in just 82 minutes, where the previous fastest approach (SSBT) took close to four days.
In its initial stage, Mantis was constructed on a set of ~2,600 human RNA-sequencing (RNA-seq) samples drawn from blood, breast, and brain tissue. Yet the SRA currently contains on the order of 100,000 human RNA-seq experiments. Moving forward, the team will focus on scaling these methods to index all publicly available RNA-seq experiments, and potentially on extending the techniques to other types of sequencing experiments.
About the Researchers
Prashant Pandey came to Stony Brook in 2013 and his doctoral research focuses on the intersection of systems and algorithms. Pandey works on designing and building theoretically well-founded data structures for large data issues in computational biology, databases and file systems. Currently, Pandey is focused on “building efficient approximate membership query data structures, specifically, counting filters and their applications.” He is also working on “finding compact methods to represent large DNA sequencing and transcriptome datasets for large-scale sequence-search and de Bruijn graph traversal and assembly process.”
Fatemeh Almodaresi started her PhD in computer science at Stony Brook in 2015. Her research focus is in computational biology, where she mainly works on developing algorithms and data structures for indexing and processing high-throughput sequencing data. Most recently, her research has focused on designing and developing space- and time-efficient indices that enable querying a large database of genome or transcriptome reference sequences or raw sequencing data. Specifically, she has focused on representing and indexing this data via various types of de Bruijn graphs that enable efficient searching.
Michael A. Bender is an associate professor of computer science at Stony Brook and Chief Scientist at Tokutek, Inc. His research interests include analysis of algorithms, databases, parallel computing, scheduling, data structures, and I/O-efficient computing on large data sets. Bender co-founded Tokutek in 2006 and has held visiting scientist positions at both MIT and King's College London. He was a member of the Sandia team that won the CPA R&D 100 Award for scheduling in parallel computers. He has also won four awards for graduate and undergraduate teaching. Bender obtained a D.E.A. in Computer Science from the Ecole Normale Superieure de Lyon, France and completed a PhD on scheduling algorithms at Harvard University.
Michael Ferdman is an assistant professor of computer science at Stony Brook University, where he co-directs the Computer Architecture Stony Brook (COMPAS) Lab. His research interests are in the areas of computer architecture and systems, with particular emphasis on building high performance and efficient servers. Ferdman received a BS in Computer Science, and BS, MS, and PhD in Electrical and Computer Engineering from Carnegie Mellon University.
Robert Patro has been an assistant professor of computer science in the College of Engineering and Applied Sciences at Stony Brook University since 2014. He earned a PhD and BS in computer science from the University of Maryland-College Park. Patro’s main academic interests are in the design of algorithms and data structures for processing, organizing, indexing and querying high-throughput genomics data. He is also interested in the intersection between efficient algorithms and statistical inference. Patro is a 2018 NSF CAREER awardee, and he works with students to develop, maintain and contribute to a number of open-source bioinformatics software tools.
Rob Johnson is a Senior Researcher in the VMware Research group. He does theoretical work with an impact on the real world. Johnson developed BetrFS, a file system that uses recent advances in data structures to improve performance on some operations by over an order of magnitude. He invented the quotient filter, a high-performance alternative to the Bloom filter for Big Data applications. Johnson also co-authored CQual, a static analysis tool that has found dozens of bugs in the Linux kernel. Before joining VMware, he was a research assistant professor at Stony Brook University. He earned his PhD at UC Berkeley.