Abstract
Comprehensive collections approaching millions of sequenced genomes have become central information sources in the life sciences. However, the rapid growth of these collections has made it effectively impossible to search these data using tools such as the Basic Local Alignment Search Tool (BLAST) and its successors. Here, we present a technique called phylogenetic compression, which uses evolutionary history to guide compression and efficiently search large collections of microbial genomes using existing algorithms and data structures. We show that, when applied to modern diverse collections approaching millions of genomes, lossless phylogenetic compression improves the compression ratios of assemblies, de Bruijn graphs and k-mer indexes by one to two orders of magnitude. Additionally, we develop a pipeline for a BLAST-like search over these phylogeny-compressed reference data, and demonstrate it can align genes, plasmids or entire sequencing experiments against all sequenced bacteria until 2019 on ordinary desktop computers within a few hours. Phylogenetic compression has broad applications in computational biology and may provide a fundamental design principle for future genomics infrastructure.
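The core intuition in the abstract, that placing evolutionarily related genomes next to each other helps a general-purpose compressor, can be illustrated with a toy sketch. This is not the authors' pipeline: the genomes are synthetic, the "phylogeny" is simply a chain of substitutions within each clade, and the helper names (`mutate`, `make_clades`, `xz_size`) are hypothetical. The XZ dictionary is deliberately shrunk to 64 KiB, an assumption made so that the small test data exceeds the compressor's window, as a real collection would exceed a realistic window.

```python
import lzma
import random


def mutate(seq, rate, rng):
    """Return a copy of seq in which each position is resampled with probability `rate`."""
    out = list(seq)
    for i in range(len(out)):
        if rng.random() < rate:
            out[i] = rng.choice("ACGT")
    return "".join(out)


def make_clades(n_clades=4, per_clade=8, length=20_000, rate=0.01, seed=0):
    """Build synthetic clades; each genome is derived from the previous one in its clade."""
    rng = random.Random(seed)
    clades = []
    for _ in range(n_clades):
        genome = "".join(rng.choice("ACGT") for _ in range(length))
        members = [genome]
        for _ in range(per_clade - 1):
            genome = mutate(genome, rate, rng)
            members.append(genome)
        clades.append(members)
    return clades


def xz_size(genomes, dict_size=1 << 16):
    """Compressed size in bytes of the concatenated genomes, using XZ with a small dictionary."""
    filters = [{"id": lzma.FILTER_LZMA2, "preset": 6, "dict_size": dict_size}]
    data = "\n".join(genomes).encode("ascii")
    return len(lzma.compress(data, format=lzma.FORMAT_XZ, filters=filters))


clades = make_clades()

# Related genomes adjacent: a stand-in for a phylogeny-guided order.
grouped = [g for clade in clades for g in clade]

# Same genomes, round-robin across clades: relatives fall outside the 64 KiB window.
interleaved = [clade[i] for i in range(len(clades[0])) for clade in clades]

print("grouped by clade:", xz_size(grouped), "bytes")
print("interleaved:     ", xz_size(interleaved), "bytes")
```

Both orderings contain exactly the same sequences, so any difference in compressed size comes from ordering alone. In the interleaved order the nearest relative of each genome lies about 80 kB back, beyond the 64 KiB dictionary, so the compressor cannot reuse it.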
Original language | English |
---|---|
Pages (from-to) | 692-697 |
Number of pages | 6 |
Journal | Nature Methods |
Volume | 22 |
Issue number | 4 |
Early online date | 9 Apr 2025 |
Publication status | Published - 30 Apr 2025 |
Data Availability Statement
The Zenodo depositions for the five phylogenetically compressed test collections are provided in the following table.
Collection | Data (format) | Location |
---|---|---|
GISP | Assemblies (XZ) | https://doi.org/10.5281/zenodo.10070404 |
SC2 | Assemblies (XZ) | Available upon request (GISAID license) |
NCTC3k | Assemblies (XZ) | https://doi.org/10.5281/zenodo.5533354 |
BIGSIdata | De Bruijn graphs (simplitigs after k-mer propagation; XZ) | https://doi.org/10.5281/zenodo.5555253 |
661k | Assemblies (XZ) | https://doi.org/10.5281/zenodo.4602622 |
661k | Assemblies (MBGC) | https://doi.org/10.5281/zenodo.6347064 |
661k | k-mer index (COBS; XZ) | https://doi.org/10.5281/zenodo.7313926, https://doi.org/10.5281/zenodo.7313942, https://doi.org/10.5281/zenodo.7315499 |
661k-HQ | k-mer index (COBS; XZ) | https://doi.org/10.5281/zenodo.6845083, https://doi.org/10.5281/zenodo.68496 |
Acknowledgements
Portions of this research were conducted on the O2 high-performance compute cluster, supported by the Research Computing Group at Harvard Medical School, and on the GenOuest bioinformatics core facility (https://www.genouest.org/).
Funding
This work was supported by the NIGMS of the National Institutes of Health (R35GM133700 to M.B.), the David and Lucile Packard Foundation (to M.B.), the Pew Charitable Trusts (to M.B.), the Alfred P. Sloan Foundation (to M.B.), the European Union’s Horizon 2020 research and innovation programme (grant agreement nos. 872539, 956229 and 101047160 to R.C.) and the ANR Transipedia, SeqDigger, Inception and PRAIRIE grants (ANR-18-CE45-0020, ANR-19-CE45-0008, PIA/ANR16-CONV-0005 and ANR-19-P3IA-0001, respectively; to R.C.).
ASJC Scopus subject areas
- Biotechnology
- Biochemistry
- Molecular Biology
- Cell Biology