TY - JOUR
T1 - Predicting haplogroups using a versatile machine learning program (PredYMaLe) on a new mutationally balanced 32 Y-STR multiplex (CombYplex)
T2 - Unlocking the full potential of the human STR mutation rate spectrum to estimate forensic parameters
AU - Bouakaze, Caroline
AU - Delehelle, Franklin
AU - Saenz-Oyhéréguy, Nancy
AU - Moreira, Andreia
AU - Schiavinato, Stéphanie
AU - Croze, Myriam
AU - Delon, Solène
AU - Fortes-Lima, Cesar
AU - Gibert, Morgane
AU - Bujan, Louis
AU - Huyghe, Eric
AU - Bellis, Gil
AU - Calderon, Rosario
AU - Hernández, Candela Lucia
AU - Avendaño-Tamayo, Efren
AU - Bedoya, Gabriel
AU - Salas, Antonio
AU - Mazières, Stéphane
AU - Charioni, Jacques
AU - Migot-Nabias, Florence
AU - Ruiz-Linares, Andres
AU - Dugoujon, Jean-Michel
AU - Thèves, Catherine
AU - Mollereau-Manaute, Catherine
AU - Noûs, Camille
AU - Poulet, Nicolas
AU - King, Turi
AU - D'Amato, Maria Eugenia
AU - Balaresque, Patricia
N1 - Copyright © 2020 Elsevier B.V. All rights reserved.
PY - 2020/9/30
Y1 - 2020/9/30
N2 - We developed a new mutationally well-balanced 32 Y-STR multiplex (CombYplex) together with a machine learning (ML) program PredYMaLe to assess the impact of STR mutability on haplogroup prediction, while respecting forensic community criteria (high DC/HD). We designed CombYplex around two sub-panels M1 and M2 characterized by average and high-mutation STR panels. Using these two sub-panels, we tested how our program PredYMaLe reacts to mutability when considering basal branches and, moving down, terminal branches. We first tested the discrimination capacity of CombYplex on 996 human samples using various forensic and statistical parameters and showed that its resolution is sufficient to separate haplogroup classes. In parallel, PredYMaLe was designed and used to test whether an ML approach can predict haplogroup classes from Y-STR profiles. Applied to our kit, SVM and Random Forest classifiers perform very well (average 97 %), better than Neural Network (average 91 %) and Bayesian methods (< 90 %). We observe heterogeneity in haplogroup assignation accuracy among classes, with most haplogroups having high prediction scores (99-100 %) and two (E1b1b and G) having lower scores (67 %). The small sample sizes of these classes explain the high tendency to misclassify the Y-profiles of these haplogroups; results were measurably improved as soon as more training data were added. We provide evidence that our ML approach is a robust method to accurately predict haplogroups when it is combined with a sufficient number of markers, well-balanced mutation rate Y-STR panels, and large ML training sets. Further research on confounding factors (such as CNV-STR or gene conversion) and ideal STR panels in regard to the branches analysed can be developed to help classifiers further optimize prediction scores.
AB - We developed a new mutationally well-balanced 32 Y-STR multiplex (CombYplex) together with a machine learning (ML) program PredYMaLe to assess the impact of STR mutability on haplogroup prediction, while respecting forensic community criteria (high DC/HD). We designed CombYplex around two sub-panels M1 and M2 characterized by average and high-mutation STR panels. Using these two sub-panels, we tested how our program PredYMaLe reacts to mutability when considering basal branches and, moving down, terminal branches. We first tested the discrimination capacity of CombYplex on 996 human samples using various forensic and statistical parameters and showed that its resolution is sufficient to separate haplogroup classes. In parallel, PredYMaLe was designed and used to test whether an ML approach can predict haplogroup classes from Y-STR profiles. Applied to our kit, SVM and Random Forest classifiers perform very well (average 97 %), better than Neural Network (average 91 %) and Bayesian methods (< 90 %). We observe heterogeneity in haplogroup assignation accuracy among classes, with most haplogroups having high prediction scores (99-100 %) and two (E1b1b and G) having lower scores (67 %). The small sample sizes of these classes explain the high tendency to misclassify the Y-profiles of these haplogroups; results were measurably improved as soon as more training data were added. We provide evidence that our ML approach is a robust method to accurately predict haplogroups when it is combined with a sufficient number of markers, well-balanced mutation rate Y-STR panels, and large ML training sets. Further research on confounding factors (such as CNV-STR or gene conversion) and ideal STR panels in regard to the branches analysed can be developed to help classifiers further optimize prediction scores.
KW - Chromosomes, Human, Y
KW - DNA Fingerprinting
KW - Forensic Genetics/methods
KW - Haplotypes
KW - Humans
KW - Machine Learning
KW - Male
KW - Microsatellite Repeats
KW - Multiplex Polymerase Chain Reaction
KW - Mutation Rate
KW - Polymorphism, Single Nucleotide
U2 - 10.1016/j.fsigen.2020.102342
DO - 10.1016/j.fsigen.2020.102342
M3 - Article
C2 - 32818722
SN - 1872-4973
VL - 48
SP - 102342
JO - Forensic Science International. Genetics
JF - Forensic Science International. Genetics
ER -