Rapid geographical source attribution of Salmonella enterica serovar Enteritidis genomes using hierarchical machine learning

Sion C. Bayliss, Rebecca K. Locke, Claire Jenkins, Marie Anne Chattaway, Timothy J. Dallman, Lauren A. Cowley

Research output: Contribution to journal › Article › peer-review

6 Citations (SciVal)


Salmonella enterica serovar Enteritidis is one of the most frequent causes of salmonellosis globally and is commonly transmitted from animals to humans through the consumption of contaminated foodstuffs. In the UK and many other countries in the Global North, a significant proportion of cases are caused by the consumption of imported food products or are contracted during foreign travel, making the rapid identification of the geographical source of new infections a requirement for robust public health outbreak investigations. Herein, we detail the development and application of a hierarchical machine learning model to rapidly identify and trace the geographical source of S. Enteritidis infections from whole genome sequencing data. A total of 2313 S. Enteritidis genomes, collected by the UKHSA between 2014 and 2019, were used to train a ‘local classifier per node’ hierarchical classifier to attribute isolates to four continents, 11 sub-regions, and 38 countries (53 classes). The highest classification accuracy was achieved at the continental level, followed by the sub-regional and country levels (macro F1: 0.954, 0.718, 0.661, respectively). A number of countries commonly visited by UK travelers were predicted with high accuracy (hF1: >0.9). Longitudinal analysis and validation with publicly accessible international samples indicated that predictions were robust to prospective external datasets. The hierarchical machine learning framework provided granular geographical source prediction directly from sequencing reads in <4 min per sample, facilitating rapid outbreak resolution and real-time genomic epidemiology. The results suggest that additional application to a broader range of pathogens and other geographically structured problems, such as antimicrobial resistance prediction, is warranted.
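The abstract reports a hierarchical F1 (hF1) score without defining it. As a rough illustration only, the sketch below uses the commonly cited set-based definition, in which each true and predicted label is expanded to include its ancestors in the hierarchy, and precision and recall are computed on the overlap of those sets. This is an assumption about the metric, not a statement of how the authors computed it. The toy hierarchy, the `PARENT` mapping, and the function names are hypothetical.

```python
# Toy continent -> sub-region -> country hierarchy (hypothetical labels).
# Each node maps to its parent; top-level nodes map to None.
PARENT = {
    "Europe": None,
    "Asia": None,
    "Southern Europe": "Europe",
    "Eastern Europe": "Europe",
    "Western Asia": "Asia",
    "Spain": "Southern Europe",
    "Italy": "Southern Europe",
    "Poland": "Eastern Europe",
    "Turkey": "Western Asia",
}


def path_to_root(label):
    """Return the set holding a label and all of its ancestors."""
    nodes = set()
    while label is not None:
        nodes.add(label)
        label = PARENT[label]
    return nodes


def hierarchical_f1(y_true, y_pred):
    """Set-based hierarchical F1 over paired true/predicted labels (assumed definition)."""
    overlap = pred_total = true_total = 0
    for true_label, pred_label in zip(y_true, y_pred):
        true_nodes = path_to_root(true_label)
        pred_nodes = path_to_root(pred_label)
        overlap += len(true_nodes & pred_nodes)
        pred_total += len(pred_nodes)
        true_total += len(true_nodes)
    precision = overlap / pred_total
    recall = overlap / true_total
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# A prediction in the right sub-region but wrong country earns partial credit.
score = hierarchical_f1(["Spain", "Poland", "Turkey"], ["Italy", "Poland", "Spain"])
```

Under this definition, predicting Italy for a Spanish isolate still matches at the continental and sub-regional levels, whereas predicting Spain for a Turkish isolate matches nowhere, so the score reflects how far up the hierarchy an error occurs.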

Original language: English
Article number: e84167
Publication status: Published - 12 Apr 2023

Bibliographical note

Funding Information:
We would like to acknowledge Dr. Harry Thorpe and Dr. Nicola Coyle, both of whom previously contributed to the development of scripts that underlie the unitig processing pipeline. This work was funded by an Academy of Medical Sciences Springboard grant (SBF005\1089). CJ, TD, and MAC are affiliated to the National Institute for Health Research Health Protection Research Unit (NIHR HPRU) in Gastrointestinal Infections and Genomics and Enabling Data at the University of Liverpool and University of Warwick, respectively, in partnership with the UK Health Security Agency (UKHSA). CJ and MAC are based at UKHSA. The views expressed are those of the author(s) and not necessarily those of the NIHR, the Department of Health and Social Care, or the UK Health Security Agency.

Data availability
The final optimised hierarchical model, as well as a pipeline for pre-processing raw read data to unitigs/patterns for input and the paper data, is available from https://github.com/SionBayliss/HierarchicalML (copy archived at Bayliss and Cowley, 2023) with a short description and tutorial for ease of use. This end-to-end process, from FASTQ to prediction, is open access and available to users under the GNU GPLv2 licence. This repository also includes the preprocessed unitig datasets and resulting predictions. Short read sequencing data are available from the Sequence Read Archive (Bioproject: PRJNA248792). Please note that the sequence data were previously deposited/published in the Sequence Read Archive by PHE/UKHSA and were not generated for this project.

ASJC Scopus subject areas

  • General Neuroscience
  • General Biochemistry, Genetics and Molecular Biology
  • General Immunology and Microbiology

