Abstract
Background: Difficulty finding and understanding information in clinical guidelines contributes to medication errors. Large language models (LLMs) can simplify complex text to aid understanding, but this approach to improving the quality of guidelines has not been investigated. However, LLMs are also known to hallucinate: to generate outputs that may not align with reality.
Objective: To develop and evaluate an LLM pipeline to improve the readability of clinical guidelines while ensuring the preservation of critical content.
Methods: To align LLM revisions with research evidence and enable comparison with manual editing, the National Health Service Injectable Medicines Guide (IMG) was used as a case study, to which a GPT-4-based pipeline was applied with prompts based on user-testing-derived recommendations for IMG authors. This enabled readability comparisons between IMG guideline versions: original, manually revised or GPT-4-revised using the user-testing-derived recommendations, and fully user tested. Readability was evaluated using readability metrics and three expert pharmacists’ ratings. Content similarity before and after LLM revision was assessed using BERT (Bidirectional Encoder Representations from Transformers) scores and expert pharmacist review.
Results: Across 20 IMG guidelines used in practice, BERT scores indicated high semantic similarity between the original and LLM-revised guidelines (0.88 to 0.96). However, across the 153 guideline sub-sections, at least one pharmacist identified an omission in 30 (20%), an addition in 7 (5%), and a change in meaning in 18 (12%). The SMOG (Simple Measure of Gobbledygook) grade showed a small but significant improvement in readability for the LLM-revised guidelines (mean difference 0.32, 95% confidence interval [CI] 0.10-0.55; P=.02) and the manually revised versions (mean difference 0.46, 95% CI 0.13-0.79; P=.03), with no significant difference between the LLM-revised and manually revised versions (P>.99). There were no significant differences between Flesch-Kincaid reading grades (P=.91). Expert ratings favoured the LLM-revised versions for understandability. For two IMG guidelines from previous research, user testing produced a greater improvement in readability than LLM revision.
Conclusions: Authors should not use current LLMs to modify clinical guidelines without carefully checking the revised text for unintended omissions, additions or changes of meaning. Further work should investigate the potential of LLMs to augment manual user testing and reduce the barriers to the wider use of this approach to improve the safety of clinical guidelines.
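As an illustrative aside (not part of the published pipeline), the SMOG grade used in the evaluation is a standard formula: 1.0430 × √(polysyllabic words × 30 / sentences) + 3.1291. A minimal sketch in Python, assuming a naive vowel-group syllable counter:

```python
import math
import re

def count_syllables(word: str) -> int:
    """Approximate syllable count as runs of vowels (y treated as a vowel)."""
    return max(1, len(re.findall(r"[aeiouy]+", word.lower())))

def smog_grade(text: str) -> float:
    """SMOG grade = 1.0430 * sqrt(polysyllables * 30 / sentences) + 3.1291.

    A word counts as polysyllabic if it has three or more syllables.
    """
    sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
    words = re.findall(r"[A-Za-z']+", text)
    polysyllables = sum(1 for w in words if count_syllables(w) >= 3)
    return 1.0430 * math.sqrt(polysyllables * 30 / len(sentences)) + 3.1291
```

Note that SMOG was originally validated on samples of 30 sentences, and production readability tools use dictionary-based syllable counts; this sketch only conveys the shape of the calculation.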
| Original language | English |
|---|---|
| Article number | e81915 |
| Number of pages | 11 |
| Journal | Journal of Medical Internet Research |
| Volume | 28 |
| Early online date | 23 Feb 2026 |
| DOIs | |
| Publication status | E-pub ahead of print - 23 Feb 2026 |
Data Availability Statement
Data and code are available in a public, open access repository. Data created during this research that are not presented in full in this paper are openly available from [30].
Acknowledgements
The authors are grateful to members of the Injectable Medicines Guide editorial team, the pharmacist reviewers, Holly Wilson (Onaya Science), and Joseph Marvin Imperial (University of Bath) for their support of this study. Microsoft Copilot (GPT-5 model) was used during manuscript revision to edit the Introduction and Methods sections for brevity.
Funding
This project was funded by a Pump Priming Award from the Department of Life Sciences, University of Bath. Additional funding was received from the Academic Secondment Scheme of Research and Innovation Services at the University of Bath. These funders had no role in study design; collection, analysis, and interpretation of data; writing of the paper; or the decision to submit for publication.
UN SDGs
This output contributes to the following UN Sustainable Development Goals (SDGs)
- SDG 3 Good Health and Well-being
Keywords
- Guidelines as topic
- Guidelines
- Large Language Models (LLMs)
- Artificial intelligence
- Usability
- Readability
- Content analysis
- Information design
- GPT-4
ASJC Scopus subject areas
- Artificial Intelligence
- Pharmacy
- Fundamentals and skills
Fingerprint
Dive into the research topics of 'Improving the understandability of clinical guidelines: development and evaluation of a GPT-4–based pipeline'. Together they form a unique fingerprint.