Abstract

Machine learning (ML) can deliver rapid and accurate reaction barrier predictions for use in rational reactivity design. However, model training requires large data sets of typically thousands or tens of thousands of barriers that are very expensive to obtain computationally or experimentally. Furthermore, bespoke data sets are required for each region of interest in reaction space as models typically struggle to generalize. We have therefore reformulated the ML barrier prediction problem toward a much more data-efficient process: finding a reaction from a prespecified set with a desired target value. Our reformulation enables the rapid selection of reactions with purpose-specific activation barriers, for example, in the design of reactivity and selectivity in synthesis, catalyst design, toxicology, and covalent drug discovery, requiring just tens of accurately measured barriers. Importantly, our reformulation does not require generalization beyond the domain of the data set at hand, and we show excellent results for the highly toxicologically and synthetically relevant data sets of aza-Michael addition and transition-metal-catalyzed dihydrogen activation, typically requiring fewer than 20 accurately measured density functional theory (DFT) barriers. Even for incomplete data sets of E2 and SN2 reactions, with high numbers of missing barriers (74% and 56%, respectively), our chosen ML search method still requires significantly fewer data points than the hundreds or thousands needed for more conventional uses of ML to predict activation barriers. Finally, we include a case study in which we use our process to guide the optimization of the dihydrogen activation catalyst. Our approach identified a reaction within 1 kcal mol⁻¹ of the target barrier after running only 12 DFT reaction barrier calculations, which illustrates the usage and real-world applicability of this reformulation for systems of high synthetic importance.
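
The reformulation described in the abstract (search a fixed candidate set for a reaction whose barrier is close to a target, while measuring as few barriers as possible) can be illustrated with a minimal sketch. The abstract does not name the ML search method, so everything here is an assumption made for illustration: the surrogate is a plain ridge regression on hypothetical per-reaction descriptors, and the names `search_for_target`, `measure_barrier`, `fit_ridge`, and `predict` are invented for this example rather than taken from the authors' code.

```python
import numpy as np


def fit_ridge(X, y, lam=1e-6):
    """Fit a ridge regression with a bias term (assumed surrogate, not the paper's model)."""
    A = np.hstack([X, np.ones((X.shape[0], 1))])
    return np.linalg.solve(A.T @ A + lam * np.eye(A.shape[1]), A.T @ y)


def predict(X, w):
    """Predict barriers for descriptor rows X using ridge weights w."""
    return np.hstack([X, np.ones((X.shape[0], 1))]) @ w


def search_for_target(features, measure_barrier, target, tol=1.0,
                      n_init=3, budget=20, seed=0):
    """Search a fixed candidate set for a barrier within `tol` of `target`.

    features        : (n, d) array of cheap descriptors, one row per candidate reaction
    measure_barrier : callable index -> barrier; stands in for an expensive DFT run
    Returns (best index, its measured barrier, number of barriers measured).
    """
    rng = np.random.default_rng(seed)
    n = features.shape[0]
    budget = min(budget, n)

    # Start from a few randomly chosen candidates.
    measured = {}
    for i in rng.choice(n, size=min(n_init, budget), replace=False):
        measured[int(i)] = measure_barrier(int(i))

    while True:
        best = min(measured, key=lambda i: abs(measured[i] - target))
        if abs(measured[best] - target) <= tol or len(measured) >= budget:
            return best, measured[best], len(measured)

        # Refit the surrogate on everything measured so far.
        idx = list(measured)
        w = fit_ridge(features[idx], np.array([measured[i] for i in idx]))

        # Measure the unmeasured candidate predicted to lie closest to the target.
        pool = [i for i in range(n) if i not in measured]
        preds = predict(features[pool], w)
        nxt = pool[int(np.argmin(np.abs(preds - target)))]
        measured[nxt] = measure_barrier(nxt)
```

In this sketch the surrogate only has to rank candidates inside the prespecified set, which mirrors the abstract's point that no generalization beyond the data set at hand is required. A real application would replace `measure_barrier` with a barrier calculation and would likely use a more expressive surrogate than a linear model.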
Original language: English
Pages (from-to): 13506–13515
Number of pages: 10
Journal: ACS Catalysis
Volume: 13
Issue number: 20
Early online date: 6 Oct 2023
Publication status: Published - 20 Oct 2023

Bibliographical note

Funding: This work was supported by U.K. Research and Innovation (UKRI) [grant number EP/S023437/1]; and the Engineering and Physical Sciences Research Council [grant number EP/W003724/1]. This work made use of the Balena, Anatra and Nimbus high-performance computing services at the University of Bath with support from the University of Bath’s Research Computing Group (doi.org/10.15125/b6cd-s854). This work was supported by the ART-AI CDT and the University of Bath.

Data availability: Gaussian output files for the aza-Michael addition reactants and transition states and LDA single-point energies for the dihydrogen activation data set are available from the University of Bath Research Data Archive (10.15125/BATH-01240). (56) Code and other data are available from https://github.com/the-grayson-group/finding_barriers.

Funding

Funders and funder numbers:

  • UK Research and Innovation: EP/S023437/1
  • Engineering and Physical Sciences Research Council: EP/W003724/1
  • University of Bath

Keywords

  • activation barriers
  • catalyst design
  • data efficiency
  • machine learning
  • organic synthesis

ASJC Scopus subject areas

  • General Chemistry
  • Catalysis

Title

Reformulating Reactivity Design for Data-Efficient Machine Learning
