Abstract
Machine-assisted approaches for free-text analysis are rising in popularity, owing to a growing need to rapidly analyze large volumes of qualitative data. In both research and policy settings, these approaches have promise in providing timely insights into public perceptions and enabling policymakers to understand their community’s needs. However, current approaches still require expert human interpretation, posing a financial and practical barrier for those outside of academia. For the first time, we propose and validate the Deep Computational Text Analyser (DECOTA), a novel machine learning methodology that automatically analyzes large free-text data sets and outputs concise themes. Building on structural topic modeling approaches, we used two fine-tuned large language models and sentence transformers to automatically derive “codes” and their corresponding “themes”, as in inductive thematic analysis. To fully automate the process, we designed and validated a novel algorithm to choose the optimal number of “topics” for the structural topic modeling. DECOTA outputs key codes and themes, their prevalence, and how prevalence varies across covariates such as age and gender. Each code is accompanied by three representative quotes. Four data sets previously analyzed using thematic analysis were triangulated with DECOTA’s codes and themes. We found that DECOTA is approximately 378 times faster and 1,920 times cheaper than human coding and consistently yields codes in agreement with or complementary to human coding (averaging 91.6% for codes and 90% for themes). The implications for evidence-based policy development, public engagement with policymaking, and psychometric measure development are discussed.
Computational approaches are increasingly being used to quickly process large volumes of free-text data. These approaches hold promise in helping academics study public perceptions, and policymakers understand their community’s needs. However, current methods still require expert human interpretation, which can be costly and impractical. In this article, we developed the Deep Computational Text Analyser (DECOTA), a novel machine learning methodology designed to automatically analyze large free-text data sets to produce concise “themes” within the data. DECOTA uses several custom-trained models to detect themes and subthemes within the data, as a human may do when categorizing free-text responses. Our approach gives information about how common each subtheme and theme is, how common they are among different demographic groups, and offers example quotes. We compared how similar DECOTA’s analysis was to human coders, using four example free-text data sets. DECOTA’s outputs were highly consistent with human analyses, detecting 91.6% of all human subthemes and 90% of the humans’ themes. We noted that DECOTA was approximately 378 times faster and 1,920 times cheaper than human analysis. The potential uses of this methodology for policymakers and academics are discussed.
Field | Value |
---|---|
Original language | English |
Journal | Psychological Methods |
Early online date | 7 Apr 2025 |
DOIs | |
Publication status | E-pub ahead of print - 7 Apr 2025 |
Funding
Lois Player and Ryan Hughes are supported by a scholarship from the Engineering and Physical Sciences Research Council (EPSRC) Centre for Doctoral Training in Advanced Automotive Propulsion Systems (AAPS), under the project EP/S023364/1. We would like to thank Lauren Towler from the University of Southampton for her time discussing the practicalities of thematic analysis. The views expressed in this publication are those of the authors and do not reflect the official position of the European Commission. All code and data associated with this article have been shared on the Open Science Framework (OSF; https://osf.io/5jste/), and a preprint is available on the OSF’s repository PsyArXiv (https://osf.io/preprints/psyarxiv/t5gbv_v1). The methodology and some results were presented at select conferences and research groups (a University of Bath internal conference, the British Environmental Psychology Conference 2024, two research groups at Otto-von-Guericke-University Magdeburg, and a research group at the Technical University of Berlin).
Funders | Funder number |
---|---|
Otto von Guericke University Magdeburg | |
European Commission | |
University of Southampton | |
Engineering and Physical Sciences Research Council | EP/S023364/1 |
Keywords
- free-text analysis
- large language models
- machine learning
- natural language processing
- topic modeling
ASJC Scopus subject areas
- Psychology (miscellaneous)