Machine learning to predict the source of campylobacteriosis using whole genome data

Nicolas Arning, Samuel K. Sheppard, Sion Bayliss, David A. Clifton, Daniel J. Wilson

Research output: Contribution to journal › Article › peer-review

23 Citations (SciVal)

Abstract

Campylobacteriosis is among the world’s most common foodborne illnesses, caused predominantly by the bacterium Campylobacter jejuni. Effective interventions require determination of the infection source, which is challenging as transmission occurs via multiple sources such as contaminated meat, poultry, and drinking water. Strain variation has allowed source tracking based upon allelic variation in multi-locus sequence typing (MLST) genes, allowing isolates from infected individuals to be attributed to specific animal or environmental reservoirs. However, the accuracy of probabilistic attribution models has been limited by the ability to differentiate isolates based upon just 7 MLST genes. Here, we broaden the input data spectrum to include core genome MLST (cgMLST) and whole genome sequences (WGS), and implement multiple machine learning algorithms, allowing more accurate source attribution. We increase attribution accuracy from 64% using the standard iSource population genetic approach to 71% for MLST, 85% for cgMLST and 78% for kmerized WGS data using the classifier we named aiSource. To gain insight beyond the source model prediction, we use Bayesian inference to analyse the relative affinity of C. jejuni strains to infect humans and identify potential differences in source-human transmission ability among clonally related isolates in the most common disease-causing lineage (ST-21 clonal complex). By offering generalizable, computationally efficient methods based upon machine learning and population genetics, we provide a scalable approach to global disease surveillance that can continuously incorporate novel samples for source attribution and identify fine-scale variation in transmission potential.

Original language: English
Article number: e1009436
Journal: PLOS Genetics
Volume: 17
Issue number: 10
Publication status: Published - 18 Oct 2021

Bibliographical note

Funding Information:
N. A. is a recipient of a BBSRC scholarship and thus supported by funding from the Biotechnology and Biological Sciences Research Council (BBSRC) (grant number BB/M011224/1). SKS was supported by Wellcome Trust (088786/C/09/Z) and Medical Research Council (MR/M501608/1 and MR/L015080/1) grants. D. J. W. is supported by a Sir Henry Dale Fellowship jointly funded by the Wellcome Trust and the Royal Society (grant number: 101237/Z/13/B) and by the Robertson Foundation. DAC's research is supported by the National Institute for Health Research (NIHR) Oxford Biomedical Research Centre (BRC). The views expressed are those of the author(s) and not necessarily those of the NHS, the NIHR or the Department of Health. The funders had no role in study design, data collection and analysis, decision to publish, or preparation of the manuscript. N.A. would like to thank David Eyre, Christophe Fraser and Alexandra Casey for insightful comments. Computation used the Oxford Biomedical Research Computing (BMRC) facility, a joint development between the Wellcome Centre for Human Genetics and the Big Data Institute supported by Health Data Research UK and the NIHR Oxford Biomedical Research Centre.

ASJC Scopus subject areas

  • Ecology, Evolution, Behavior and Systematics
  • Molecular Biology
  • Genetics
  • Genetics (clinical)
  • Cancer Research
