Predicting clinical outcome of Escherichia coli O157:H7 infections using explainable machine learning

Julian A. Paganini, Suniya Khatun, Sean McAteer, Lauren Cowley, David R. Greig, David L. Gally, Claire Jenkins, Timothy J. Dallman

Research output: Contribution to journal, Article, peer-review

Abstract

Shiga toxin-producing Escherichia coli (STEC) O157:H7 is a globally dispersed zoonotic pathogen capable of causing severe disease outcomes, including bloody diarrhoea and haemolytic uraemic syndrome. While variations in Shiga toxin subtype are well-recognized drivers of disease severity, many unexplained differences remain among strains carrying the same toxin profile. We applied two explainable machine learning (ML) approaches, Random Forest and Extreme Gradient Boosting, to whole-genome sequencing data from 1,030 STEC O157:H7 isolates to predict patient clinical outcomes, using data collected over 2 years of routine surveillance in England. A phylogeny-informed cross-validation strategy was implemented to account for population structure and avoid data leakage, ensuring robust model generalizability. Extreme Gradient Boosting outperformed Random Forest in predicting minority classes and correctly predicted high-risk isolates in traditionally low-risk lineages, illustrating its utility for capturing complex genomic signatures beyond known virulence genes. Feature importance analyses highlighted phage-encoded elements, including potentially novel intergenic regulators, alongside established virulence factors. Moreover, key genomic regions linked to small RNAs and stress-response pathways were enriched in isolates causing severe disease. These findings underscore the capacity of explainable ML to refine risk assessments, offering a valuable tool for early detection of high-risk STEC O157:H7 and guiding targeted public health interventions.
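
The abstract does not describe how the phylogeny-informed folds were constructed. As a rough illustration of the general idea only, the sketch below keeps all isolates from one lineage on the same side of every train/test split, so that closely related genomes cannot leak between training and evaluation. It assumes each isolate already carries a lineage or cluster label; the function name `grouped_folds`, the greedy size-balancing rule, and the example labels are all hypothetical and are not taken from the paper.

```python
from collections import defaultdict


def grouped_folds(groups, n_folds):
    """Yield (train_idx, test_idx) pairs in which no group appears on both sides.

    groups: one label per isolate (here, a hypothetical lineage or cluster id).
    Groups are assigned greedily, largest first, to the currently smallest fold.
    """
    members = defaultdict(list)
    for idx, label in enumerate(groups):
        members[label].append(idx)
    if n_folds > len(members):
        raise ValueError("n_folds cannot exceed the number of distinct groups")

    folds = [[] for _ in range(n_folds)]
    # Sort by size (descending), then by label, so the assignment is deterministic.
    for label in sorted(members, key=lambda g: (-len(members[g]), str(g))):
        smallest = min(range(n_folds), key=lambda f: len(folds[f]))
        folds[smallest].extend(members[label])

    for f in range(n_folds):
        test_idx = sorted(folds[f])
        train_idx = sorted(i for g in range(n_folds) if g != f for i in folds[g])
        yield train_idx, test_idx


# Made-up lineage labels for ten isolates, for illustration only.
lineages = ["Ia", "Ia", "Ia", "Ic", "Ic", "IIa", "IIa", "IIb", "I/II", "I/II"]
for train_idx, test_idx in grouped_folds(lineages, n_folds=3):
    train_groups = {lineages[i] for i in train_idx}
    test_groups = {lineages[i] for i in test_idx}
    assert not (train_groups & test_groups)  # no lineage on both sides
    print(sorted(test_groups), len(train_idx), len(test_idx))
```

In practice the index pairs would be passed to whichever classifier is being evaluated. A split like this tends to give lower but more honest performance estimates than a purely random split when the population is strongly clonal.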

Original language: English
Article number: 001591
Journal: Microbial Genomics
Volume: 11
Issue number: 12
Early online date: 17 Dec 2025
Publication status: Published - 31 Dec 2025

Acknowledgements

We would like to thank all the staff at the Gastrointestinal Bacteria Reference Unit and Health Protection Research Unit at Public Health of England for their support and guidance during this project.

Funding

This work was supported by the BBSRC London Interdisciplinary Doctoral Programme and Public Health of England. J.A.P. and T.J.D. were funded by the HealthHolland TKI-LSI grant, project number: LSHM23021. D.L.G. and S.M. were supported by funding from BBSRC: BBS/E/RL/230002C.

Funder: Biotechnology and Biological Sciences Research Council

Keywords

  • clinical outcome prediction
  • machine learning
  • Next Generation Sequencing (NGS)
  • public health
  • Shiga toxin
  • Shiga toxin-producing Escherichia coli (STEC)

ASJC Scopus subject areas

  • Epidemiology
  • Microbiology
  • Molecular Biology
  • Genetics
