Application of information theoretic feature selection and machine learning methods for the development of genetic risk prediction models

BADBIR Study Group

Research output: Contribution to journal › Article › peer-review

19 Citations (SciVal)

Abstract

In view of the growth of clinical risk prediction models using genetic data, there is an increasing need for studies that use appropriate methods to select the optimum number of features from a large number of genetic variants with a high degree of redundancy between features due to linkage disequilibrium (LD). Filter feature selection methods based on information theoretic criteria are well suited to this challenge and will identify a subset of the original variables that should result in more accurate prediction. However, data collected from cohort studies are often high-dimensional genetic data with potential confounders, which present challenges to feature selection and to machine learning risk prediction models. Patients with psoriasis are at high risk of developing a chronic arthritis known as psoriatic arthritis (PsA). The prevalence of PsA in this patient group can be up to 30%, and the identification of high-risk patients is an important clinical research goal that would allow early intervention and a reduction of disability. This also provides an ideal scenario for the development of clinical risk prediction models and an opportunity to explore the application of information theoretic criteria. In this study, we developed feature selection and PsA risk prediction models and applied them to a cross-sectional genetic dataset of 1462 PsA cases and 1132 cutaneous-only psoriasis (PsC) cases using 2-digit HLA alleles imputed using the SNP2HLA algorithm. We also developed a stratification method to mitigate the impact of potential confounder features and illustrate that confounding features affect the feature selection. The mitigated dataset was used to train seven supervised machine learning methods: 80% of the data were randomly selected for training using stratified nested cross-validation, and 20% were randomly selected as a holdout set for internal validation.
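As an illustration of what a filter criterion based on information theory does, the following is a minimal sketch of the simplest such criterion, ranking each feature by its empirical mutual information with the case label. It is not the study's pipeline: the feature names and toy data are hypothetical, and criteria such as ICAP additionally account for redundancy between features, which this sketch ignores.

```python
import math
from collections import Counter


def mutual_information(xs, ys):
    """Empirical mutual information I(X;Y), in bits, of two discrete sequences."""
    n = len(xs)
    joint = Counter(zip(xs, ys))
    px = Counter(xs)
    py = Counter(ys)
    mi = 0.0
    for (x, y), count in joint.items():
        # p(x,y) * log2( p(x,y) / (p(x) * p(y)) ), written with raw counts
        mi += (count / n) * math.log2(count * n / (px[x] * py[y]))
    return mi


def rank_features(columns, labels):
    """Return (feature name, score) pairs sorted from most to least informative."""
    scores = {name: mutual_information(values, labels)
              for name, values in columns.items()}
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)


# Hypothetical toy data: 1 = allele carried / case, 0 = not carried / control.
labels = [1, 1, 1, 1, 0, 0, 0, 0]
columns = {
    "allele_a": [1, 1, 1, 1, 0, 0, 0, 0],  # tracks the label exactly
    "allele_b": [1, 1, 1, 0, 1, 0, 0, 0],  # partially informative
    "allele_c": [1, 0, 1, 0, 1, 0, 1, 0],  # independent of the label
}

ranking = rank_features(columns, labels)
for name, score in ranking:
    print(f"{name}: {score:.3f} bits")
```

A univariate ranking like this would score two alleles in strong LD almost identically, which is why redundancy-aware criteria are attractive for HLA data.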
The risk prediction models were then further validated in a UK Biobank dataset containing data on 1187 participants and a set of features overlapping with the training dataset. Performance of these methods was evaluated using the area under the curve (AUC), accuracy, precision, recall, F1 score and decision curve analysis (net benefit). The best model was selected based on three criteria: the smallest feature subset, the maximal average AUC over the nested cross-validation, and good generalisability to the UK Biobank dataset. In the original dataset, with over 100 different bootstraps and seven feature selection (FS) methods, HLA_C_*06 was selected as the most informative genetic variant. When the dataset was mitigated, the single most important genetic feature based on rank was identified as HLA_B_*27 by the seven different feature selection methods, consistent with previous analyses of these data using regression-based methods. However, the predictive accuracy of this single feature after mitigation was found to be moderate (AUC = 0.54 (internal cross-validation), AUC = 0.53 (internal holdout set), AUC = 0.55 (external dataset)). Sequentially adding further HLA features based on rank improved the performance of the Random Forest classification model, where 20 2-digit features selected by Interaction Capping (ICAP) achieved AUC = 0.61 (internal cross-validation), AUC = 0.57 (internal holdout set) and AUC = 0.58 (external dataset). The stratification method for mitigating confounding features and filter information theoretic feature selection can be applied to high-dimensional datasets with potential confounders.
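For readers unfamiliar with two of the metrics named above, the sketch below computes AUC as the probability that a randomly chosen case is scored above a randomly chosen control, and net benefit at a single risk threshold using the standard decision curve formula TP/n - (FP/n) * pt/(1 - pt). The scores and labels are made-up illustrative values, not study data, and the function names are hypothetical.

```python
def auc(scores, labels):
    """AUC as P(score of a positive > score of a negative); ties count as 0.5."""
    positives = [s for s, y in zip(scores, labels) if y == 1]
    negatives = [s for s, y in zip(scores, labels) if y == 0]
    wins = 0.0
    for p in positives:
        for q in negatives:
            if p > q:
                wins += 1.0
            elif p == q:
                wins += 0.5
    return wins / (len(positives) * len(negatives))


def net_benefit(scores, labels, threshold):
    """Net benefit of treating everyone whose predicted risk is >= threshold."""
    n = len(labels)
    tp = sum(1 for s, y in zip(scores, labels) if s >= threshold and y == 1)
    fp = sum(1 for s, y in zip(scores, labels) if s >= threshold and y == 0)
    # False positives are weighted by the odds of the chosen risk threshold.
    return tp / n - (fp / n) * (threshold / (1.0 - threshold))


# Made-up predicted risks for three cases (1) and three controls (0).
labels = [1, 1, 1, 0, 0, 0]
scores = [0.9, 0.8, 0.4, 0.7, 0.3, 0.2]

model_auc = auc(scores, labels)
model_nb = net_benefit(scores, labels, threshold=0.5)
print(f"AUC = {model_auc:.3f}, net benefit at 0.5 = {model_nb:.3f}")
```

An AUC of 0.5 corresponds to chance-level ranking, which gives context for the values between 0.53 and 0.61 reported in the abstract.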

Original language: English
Article number: 23335
Journal: Scientific Reports
Volume: 11
Issue number: 1
Publication status: Published - 2 Dec 2021

Funding

This work was supported by Versus Arthritis (grant number 21173, grant number 21754 and grant number 21755). FJ is supported by an MRC/University of Manchester Skills Development Fellowship (grant number MR/R016615). RBW is supported by the Manchester NIHR Biomedical Research Centre. H.M-O is supported by the National Institute for Health Research (NIHR) Leeds Biomedical Research Centre (LBRC). This research has been conducted using the UK Biobank Resource (approved research ID 7996, Principal Investigator: Dr Suzanne Verstappen). SV is supported by Versus Arthritis (grant numbers 20385, 20380) and the NIHR Manchester Biomedical Research Centre. The authors would like to acknowledge the assistance given by IT Services and the use of the Computational Shared Facility at The University of Manchester. This work was part-funded by the NIHR Manchester BRC. The views expressed are those of the author(s) and not necessarily those of the NHS, the NIHR or the Department of Health. The authors acknowledge the substantial contribution of the BADBIR team to the administration of the project. BADBIR acknowledges the support of the National Institute for Health Research (NIHR) through the clinical research networks and its contribution in facilitating recruitment into the registry. This research was funded/supported by the National Institute for Health Research (NIHR) Biomedical Research Centre based at Guy’s and St Thomas’ NHS Foundation Trust and King’s College London. The views and opinions expressed therein are those of the authors and do not necessarily reflect those of the BADBIR, NIHR, NHS or the Department of Health. 
The authors are grateful to the members of the Data Monitoring Committee (DMC): Dr Robert Chalmers, Dr Carsten Flohr (Chair), Dr Karen Watson and David Prieto-Merino and the BADBIR Steering Committee (in alphabetical order): Oras Alabas, Prof Jonathan Barker, Gabrielle Becher, Anthony Bewley, David Burden, Simon Morrison (CEO of BAD), Prof Phil Laws (Chair), Mr Ian Evans, Prof Christopher Griffiths, Shehnaz Ahmed, Dr Brian Kirby, Elise Kleyn, Ms Linda Lawson, Teena Mackenzie, Tess McPherson, Dr Kathleen McElhone, Dr Ruth Murphy, Prof Anthony Ormerod, Dr Caroline Owen, Prof Nick Reynolds, Amir Rashid, Prof Catherine Smith and Dr Richard Warren. The research was funded/supported by the National Institute for Health Research (NIHR) Biomedical Research Centre based at Guy’s and St Thomas’ NHS Foundation Trust and King’s College London. The views expressed are those of the author(s) and not necessarily those of the NHS, the NIHR or the Department of Health. The authors thank all the patient participants and acknowledge the enthusiastic collaboration of all clinicians and research teams in the United Kingdom and the Republic of Ireland who recruited for this study. This study is supported by the Psoriasis Association and the National Institute of Health and Research Biomedical Research Centre at King’s College London/Guy’s and St Thomas’ National Health Service Foundation Trust. The authors are grateful to the members of the BSTOP Steering Committee (Prof David Burden (Chair), Prof Catherine Smith, Prof Stefan Siebert, Prof Sara Brown, Helen McAteer, Dr Julia Schofield and Dr Nick Dand) for their valuable role in oversight of study delivery. 
This work was supported by Psoriasis Stratification to Optimise Relevant Therapy (PSORT), which is in turn funded by a Medical Research Council Stratified Medicine award (MR/L011808/1), the Psoriasis Association (RG2/10), the National Institute of Health and Research Biomedical Research Centre at King’s College London/Guy’s and St Thomas’ National Health Service Foundation Trust, the National Institute of Health and Research Manchester Biomedical Research Centre, and the National Institute of Health and Research Newcastle Biomedical Research Centre. TT is supported by an MRC Clinical Research Training Fellowship (MR/R001839/1). ND is supported by Health Data Research UK (MR/S003126/1). The British Association of Dermatologists Biologics and Immunomodulators Register is coordinated by the University of Manchester and funded by the British Association of Dermatologists. Finally, we acknowledge the enthusiastic collaboration of all of the dermatologists and specialist nurses in the U.K. and the Republic of Ireland who provide the BADBIR data. The principal investigators at the participating sites are listed at the following website: http://www.badbir.org/Clinicians/.
