Scalable evaluation framework for retrieval augmented generation in tobacco research using large language models

Research output: Contribution to journal › Article › peer-review

1 Citation (SciVal)

Abstract

Retrieval-augmented generation (RAG) systems show promise in specialized knowledge domains, but the tobacco research field lacks standardized assessment frameworks for comparing different large language models (LLMs). This gap impacts public health decisions that require accurate, domain-specific information retrieval from complex tobacco industry documentation. We aimed to develop and validate a tobacco domain-specific evaluation framework for assessing various LLMs in RAG systems that combines automated metrics with expert validation. Using a Goal-Question-Metric paradigm, we evaluated two distinct LLM architectures in RAG configurations: Mixtral 8 × 7B and Llama 3.1 70B. The framework incorporated automated assessments via GPT-4o alongside validation by three tobacco research specialists. A domain-specific dataset of 20 curated queries assessed model performance across nine metrics, including accuracy, domain specificity, completeness, and clarity. Our framework successfully differentiated performance between models, with Mixtral 8 × 7B significantly outperforming Llama 3.1 70B in accuracy (8.8/10 vs. 7.55/10, p < 0.05) and domain specificity (8.65/10 vs. 7.6/10, p < 0.05). Case analysis revealed Mixtral's superior handling of industry-specific terminology and contextual relationships. Hyperparameter optimization further improved Mixtral's completeness from 7.1/10 to 7.9/10, demonstrating the framework's utility for model refinement. This study establishes a robust framework specifically for evaluating LLMs in tobacco research RAG systems, with demonstrated potential for extension to other specialized domains. The significant performance differences between models highlight the importance of domain-specific evaluation for public health applications. Future research should extend this framework to broader document corpora and additional LLMs, including commercial models.
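As a rough illustration of the paired model comparison the abstract describes (two RAG configurations scored 0–10 on the same 20 queries, significance at p < 0.05), the sketch below computes a paired t statistic over per-query scores and checks it against the two-tailed critical value for 19 degrees of freedom (≈2.093 at α = 0.05). All scores here are invented for illustration, and `paired_t_statistic` is a hypothetical helper, not part of the study's codebase:

```python
import math
from statistics import mean, stdev

def paired_t_statistic(scores_a, scores_b):
    """Paired t statistic for two models scored on the same query set."""
    diffs = [a - b for a, b in zip(scores_a, scores_b)]
    n = len(diffs)
    # t = mean difference / standard error of the differences
    return mean(diffs) / (stdev(diffs) / math.sqrt(n))

# Invented per-query accuracy scores (0-10) for 20 queries, one per model.
mixtral = [9, 8, 9, 9, 8, 9, 10, 8, 9, 9, 8, 9, 9, 8, 10, 9, 8, 9, 9, 9]
llama   = [8, 7, 8, 7, 8, 7, 8, 8, 7, 8, 7, 8, 8, 7, 8, 8, 7, 8, 7, 8]

t = paired_t_statistic(mixtral, llama)
# Two-tailed critical value for df = 19 at alpha = 0.05 is approximately 2.093.
significant = abs(t) > 2.093
print(f"means: {mean(mixtral):.2f} vs {mean(llama):.2f}, t = {t:.2f}, "
      f"significant: {significant}")
```

A paired test is appropriate here because both models answer the identical query set, so per-query differences remove query-difficulty variance from the comparison.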

Original language: English
Article number: 22760
Journal: Scientific Reports
Volume: 15
Issue number: 1
Early online date: 2 Jul 2025
DOIs
Publication status: E-pub ahead of print - 2 Jul 2025

Bibliographical note

Publisher Copyright:
© The Author(s) 2025.

Data Availability Statement

The datasets generated and analysed during the current study are available in the Harvard Dataverse repository, https://doi.org/10.7910/DVN/GVGVMP, and the complete codebase is hosted at our GitHub repository [43].

Keywords

  • AI evaluation
  • Domain-specific information retrieval
  • Expert validation
  • Goal-Question-Metric framework
  • Large language models
  • Retrieval-augmented generation

ASJC Scopus subject areas

  • General

