Efficient and robust search of microbial genomes via phylogenetic compression

Karel Břinda, Leandro Lima, Simone Pignotti, Natalia Quinones-Olvera, Kamil Salikhov, Rayan Chikhi, Gregory Kucherov, Zamin Iqbal, Michael Baym

Research output: Contribution to journal › Article › peer-review

Abstract

Comprehensive collections approaching millions of sequenced genomes have become central information sources in the life sciences. However, the rapid growth of these collections has made it effectively impossible to search these data using tools such as the Basic Local Alignment Search Tool (BLAST) and its successors. Here, we present a technique called phylogenetic compression, which uses evolutionary history to guide compression and efficiently search large collections of microbial genomes using existing algorithms and data structures. We show that, when applied to modern diverse collections approaching millions of genomes, lossless phylogenetic compression improves the compression ratios of assemblies, de Bruijn graphs and k-mer indexes by one to two orders of magnitude. Additionally, we develop a pipeline for a BLAST-like search over these phylogeny-compressed reference data, and demonstrate it can align genes, plasmids or entire sequencing experiments against all sequenced bacteria until 2019 on ordinary desktop computers within a few hours. Phylogenetic compression has broad applications in computational biology and may provide a fundamental design principle for future genomics infrastructure.
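The core idea in the abstract, that placing evolutionarily related genomes next to each other helps a general-purpose compressor, can be illustrated with a toy experiment. The sketch below is not the authors' pipeline: it assumes synthetic random "genomes" in two hypothetical clades, and it swaps in zlib (whose 32 KB window makes the ordering effect visible on tiny inputs) for the XZ compressor used on the real collections. All function names and parameters are illustrative assumptions.

```python
import random
import zlib


def mutate(seq, rate, rng):
    """Return a copy of seq where each base is redrawn at random with probability `rate`."""
    return "".join(rng.choice("ACGT") if rng.random() < rate else base for base in seq)


def toy_collection(n_per_clade=4, length=20_000, rate=0.01, seed=0):
    """Build two synthetic clades; genomes within a clade derive from one random ancestor."""
    rng = random.Random(seed)
    clades = []
    for _ in range(2):
        ancestor = "".join(rng.choice("ACGT") for _ in range(length))
        clades.append([mutate(ancestor, rate, rng) for _ in range(n_per_clade)])
    return clades


def compressed_size(genomes):
    """Size in bytes of the concatenated genomes after zlib compression (stand-in for XZ)."""
    return len(zlib.compress("".join(genomes).encode("ascii"), 9))


clade_a, clade_b = toy_collection()

# Related genomes adjacent (a crude proxy for ordering by a phylogeny).
grouped = clade_a + clade_b
# Same genomes, but alternating between clades so relatives are far apart.
interleaved = [genome for pair in zip(clade_a, clade_b) for genome in pair]

grouped_size = compressed_size(grouped)
interleaved_size = compressed_size(interleaved)
print(f"grouped: {grouped_size} bytes, interleaved: {interleaved_size} bytes")
```

With each genome at 20,000 bases, a relative placed directly before a genome falls inside zlib's 32 KB window, while one placed two genomes back does not, so the grouped order should compress to a smaller size. The real method works at a far larger scale and with compressors that have much bigger dictionaries, so this only conveys the intuition.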

Original language: English
Pages (from-to): 692-697
Number of pages: 6
Journal: Nature Methods
Volume: 22
Issue number: 4
Early online date: 9 Apr 2025
Publication status: Published - 30 Apr 2025

Data Availability Statement


The Zenodo depositions for the five phylogenetically compressed test collections are listed below.

GISP
  • Assemblies (XZ): https://doi.org/10.5281/zenodo.10070404

SC2
  • Assemblies (XZ): Available upon request (GISAID license)

NCTC3k
  • Assemblies (XZ): https://doi.org/10.5281/zenodo.5533354

BIGSIdata
  • De Bruijn graphs (simplitigs after k-mer propagation; XZ): https://doi.org/10.5281/zenodo.5555253

661k
  • Assemblies (XZ): https://doi.org/10.5281/zenodo.4602622
  • Assemblies (MBGC): https://doi.org/10.5281/zenodo.6347064
  • k-mer index (COBS; XZ): https://doi.org/10.5281/zenodo.7313926, https://doi.org/10.5281/zenodo.7313942, https://doi.org/10.5281/zenodo.7315499

661k-HQ
  • k-mer index (COBS; XZ): https://doi.org/10.5281/zenodo.6845083, https://doi.org/10.5281/zenodo.68496

Acknowledgements

Portions of this research were conducted on the O2 high-performance compute cluster, supported by the Research Computing Group at Harvard Medical School, and on the GenOuest bioinformatics core facility (https://www.genouest.org/).

Funding

This work was supported by the NIGMS of the National Institutes of Health (R35GM133700 to M.B.), the David and Lucile Packard Foundation (to M.B.), the Pew Charitable Trusts (to M.B.), the Alfred P. Sloan Foundation (to M.B.), the European Union’s Horizon 2020 research and innovation programme (grant agreement nos. 872539, 956229 and 101047160 to R.C.) and the ANR Transipedia, SeqDigger, Inception and PRAIRIE grants (ANR-18-CE45-0020, ANR-19-CE45-0008, PIA/ANR16-CONV-0005 and ANR-19-P3IA-0001, respectively; to R.C.).

ASJC Scopus subject areas

  • Biotechnology
  • Biochemistry
  • Molecular Biology
  • Cell Biology
