Abstract
Retrieval-augmented generation (RAG) systems show promise in specialized knowledge domains, but the tobacco research field lacks standardized assessment frameworks for comparing different large language models (LLMs). This gap affects public health decisions that require accurate, domain-specific information retrieval from complex tobacco industry documentation. We therefore developed and validated a tobacco domain-specific evaluation framework for assessing LLMs in RAG systems, combining automated metrics with expert validation. Using a Goal-Question-Metric paradigm, we evaluated two distinct LLM architectures in RAG configurations: Mixtral 8 × 7B and Llama 3.1 70B. The framework incorporated automated assessment via GPT-4o alongside validation by three tobacco research specialists. A domain-specific dataset of 20 curated queries assessed model performance across nine metrics, including accuracy, domain specificity, completeness, and clarity. The framework successfully differentiated performance between models, with Mixtral 8 × 7B significantly outperforming Llama 3.1 70B in accuracy (8.8/10 vs. 7.55/10, p < 0.05) and domain specificity (8.65/10 vs. 7.6/10, p < 0.05). Case analysis revealed Mixtral's superior handling of industry-specific terminology and contextual relationships. Hyperparameter optimization further improved Mixtral's completeness score from 7.1/10 to 7.9/10, demonstrating the framework's utility for model refinement. This study establishes a robust framework for evaluating LLMs in tobacco research RAG systems, with demonstrated potential for extension to other specialized domains. The significant performance differences between models highlight the importance of domain-specific evaluation for public health applications. Future research should extend this framework to broader document corpora and additional LLMs, including commercial models.
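The abstract describes an LLM-as-judge pipeline in which GPT-4o scores each RAG answer on a 0–10 scale per metric, and per-metric scores are then compared between the two models. Below is a minimal sketch of such a loop, assuming the standard OpenAI Python client; the judge prompt wording, the metric subset, and the paired t-test are illustrative assumptions, since the abstract does not specify the exact prompts or statistical test used.

```python
# Minimal LLM-as-judge sketch, assuming the OpenAI Python client for GPT-4o.
# Prompt wording, metric subset, and the paired t-test are illustrative
# assumptions, not the paper's confirmed implementation.
from openai import OpenAI
from scipy import stats

client = OpenAI()

# Four of the nine metrics named in the abstract.
METRICS = ["accuracy", "domain specificity", "completeness", "clarity"]

def judge_score(query: str, answer: str, metric: str) -> float:
    """Ask GPT-4o to rate one RAG answer on one metric, 0-10."""
    prompt = (
        f"Rate the following answer to a tobacco-research query on "
        f"{metric}, from 0 (worst) to 10 (best). Reply with a number only.\n\n"
        f"Query: {query}\nAnswer: {answer}"
    )
    resp = client.chat.completions.create(
        model="gpt-4o",
        messages=[{"role": "user", "content": prompt}],
        temperature=0,  # deterministic judging
    )
    return float(resp.choices[0].message.content.strip())

def compare_models(queries, answers_a, answers_b, metric):
    """Paired comparison of two RAG systems on one metric over all queries."""
    scores_a = [judge_score(q, a, metric) for q, a in zip(queries, answers_a)]
    scores_b = [judge_score(q, b, metric) for q, b in zip(queries, answers_b)]
    t, p = stats.ttest_rel(scores_a, scores_b)  # paired test over the 20 queries
    mean_a = sum(scores_a) / len(scores_a)
    mean_b = sum(scores_b) / len(scores_b)
    return mean_a, mean_b, p
```

With the paper's setup this would be run once per metric over the 20 curated queries, yielding per-model means (e.g. 8.8 vs. 7.55 for accuracy) and a significance value, with the three domain specialists validating the automated scores.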
| Field | Value |
|---|---|
| Original language | English |
| Article number | 22760 |
| Journal | Scientific Reports |
| Volume | 15 |
| Issue number | 1 |
| Early online date | 2 Jul 2025 |
| DOIs | |
| Publication status | E-pub ahead of print - 2 Jul 2025 |
Bibliographical note
Publisher Copyright: © The Author(s) 2025.
Data Availability Statement
The datasets generated and analysed during the current study are available in the Harvard Dataverse repository, https://doi.org/10.7910/DVN/GVGVMP, and the complete codebase is hosted at our GitHub repository [43].
Keywords
- AI evaluation
- Domain-specific information retrieval
- Expert validation
- Goal-Question-Metric framework
- Large language models
- Retrieval-augmented generation
ASJC Scopus subject areas
- General