Classification of molecular sequence data using Bayesian phylogenetic mixture models

E. Loza Reyes, M. A. Hurn, A. Robinson

Research output: Contribution to journal › Article

Abstract

Rate variation among the sites of a molecular sequence is commonly found in applications of phylogenetic inference. Several approaches exist to account for this feature but they do not usually enable the investigator to pinpoint the sites that evolve under one or another rate of evolution in a straightforward manner. The focus is on Bayesian phylogenetic mixture models, augmented with allocation variables, as tools for site classification and quantification of classification uncertainty. The method does not rely on prior knowledge of site membership to classes or even the number of classes. Furthermore, it does not require correlated sites to be next to one another in the sequence alignment, unlike some phylogenetic hidden Markov or change-point models. In the approach presented, model selection on the number and type of mixture components is conducted ahead of both model estimation and site classification; the steppingstone sampler (SS) is used to select amongst competing mixture models. Example applications to simulated data and mitochondrial DNA of primates illustrate site classification via 'augmented' Bayesian phylogenetic mixtures. In both examples, all mixtures outperform commonly used models of among-site rate variation and models that do not account for rate heterogeneity. The examples further demonstrate how site classification is readily available from the analysis output. The method is directly relevant to the choice of partitions in Bayesian phylogenetics, and its application may lead to the discovery of structure not otherwise recognised in a molecular sequence alignment. Computational aspects of Bayesian phylogenetic model estimation are discussed, including the use of simple Markov chain Monte Carlo (MCMC) moves that mix efficiently without tempering the chains.
The contribution to the field of Bayesian phylogenetics is in (1) the use of mixture models augmented with allocation variables as tools for site classification and quantification of classification uncertainty, (2) the successful application of SS for selection of phylogenetic mixtures, and (3) the development of novel MCMC aspects of relevance to Bayesian phylogenetic models, whether mixtures or not. The MCMC methods discussed in this paper have been coded in a C program; source files are available upon request. Supplementary material is available online (see Appendix A).
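
As a rough illustration of how site classification can be read off the output of an MCMC run with allocation variables, the sketch below estimates each site's posterior class-membership probabilities as the fraction of draws in which the site was allocated to each class. It is not the authors' C program. The function name `classify_sites`, the input layout (one list of class labels per MCMC draw) and the toy data are assumptions made for this example, and the draws are assumed to have consistent class labels across iterations (no label switching).

```python
def classify_sites(alloc_draws, n_classes):
    """Summarise MCMC draws of allocation variables (hypothetical helper).

    alloc_draws: list of draws; each draw is a list giving, for every site,
                 the index (0..n_classes-1) of the mixture component the
                 site was allocated to in that iteration.
    Returns (probs, labels, uncertainty), one entry per site.
    """
    n_draws = len(alloc_draws)
    n_sites = len(alloc_draws[0])

    # Count how often each site is allocated to each class.
    counts = [[0] * n_classes for _ in range(n_sites)]
    for draw in alloc_draws:
        for site, k in enumerate(draw):
            counts[site][k] += 1

    # Relative frequencies estimate posterior membership probabilities.
    probs = [[c / n_draws for c in site_counts] for site_counts in counts]

    # Classify each site to its most probable class; report the
    # remaining probability mass as the classification uncertainty.
    labels = [max(range(n_classes), key=lambda k: p[k]) for p in probs]
    uncertainty = [1.0 - max(p) for p in probs]
    return probs, labels, uncertainty


# Toy data (made up): 4 MCMC draws for an alignment of 4 sites, 2 classes.
draws = [
    [0, 1, 1, 0],
    [0, 1, 0, 0],
    [0, 1, 1, 0],
    [0, 1, 1, 1],
]
probs, labels, uncertainty = classify_sites(draws, n_classes=2)
for site, (lab, unc) in enumerate(zip(labels, uncertainty)):
    print(f"site {site}: class {lab}, uncertainty {unc:.2f}")
```

Because the summary is computed site by site, sites placed in the same class need not be adjacent in the alignment, which matches the property described in the abstract.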
Original language: English
Pages (from-to): 81-95
Number of pages: 15
Journal: Computational Statistics & Data Analysis
Volume: 75
Early online date: 29 Jan 2014
DOI: 10.1016/j.csda.2014.01.008
Publication status: Published - 1 Jul 2014

Keywords

  • among-site rate variation
  • Bayesian mixture model
  • classification
  • Markov chain Monte Carlo
  • model selection
  • phylogeny

Cite this

Classification of molecular sequence data using Bayesian phylogenetic mixture models. / Loza Reyes, E.; Hurn, M. A.; Robinson, A.

In: Computational Statistics & Data Analysis, Vol. 75, 01.07.2014, p. 81-95.

@article{15d1f294316347f8bcdeac7d084cf1b4,
title = "Classification of molecular sequence data using Bayesian phylogenetic mixture models",
keywords = "among-site rate variation, Bayesian mixture model, classification, Markov chain Monte Carlo, model selection, phylogeny",
author = "{Loza Reyes}, E. and Hurn, {M. A.} and A. Robinson",
year = "2014",
month = "7",
day = "1",
doi = "10.1016/j.csda.2014.01.008",
language = "English",
volume = "75",
pages = "81--95",
journal = "Computational Statistics \& Data Analysis",
issn = "0167-9473",
publisher = "Elsevier",

}

TY - JOUR

T1 - Classification of molecular sequence data using Bayesian phylogenetic mixture models

AU - Loza Reyes, E.

AU - Hurn, M. A.

AU - Robinson, A.

PY - 2014/7/1

Y1 - 2014/7/1

KW - among-site rate variation

KW - Bayesian mixture model

KW - classification

KW - Markov chain Monte Carlo

KW - model selection

KW - phylogeny

U2 - 10.1016/j.csda.2014.01.008

DO - 10.1016/j.csda.2014.01.008

M3 - Article

VL - 75

SP - 81

EP - 95

JO - Computational Statistics & Data Analysis

JF - Computational Statistics & Data Analysis

SN - 0167-9473

ER -