Enhancing Sentiment and Intent Analysis in Public Health via Fine-Tuned Large Language Models on Tobacco and E-cigarette-Related Tweets

Research output: Contribution to journal, Article, peer-reviewed

Abstract

Background: Accurate sentiment analysis and intent categorization of tobacco and e-cigarette-related social media content are critical for public health research, yet they necessitate specialized natural language processing approaches.

Objective: To compare pre-trained and fine-tuned Flan-T5 models for intent classification and sentiment analysis of tobacco and e-cigarette tweets, demonstrating the effectiveness of fine-tuning a lightweight large language model for domain-specific tasks.

Methods: Three Flan-T5 classification models were developed: (1) tobacco intent, (2) e-cigarette intent, and (3) sentiment analysis. Domain-specific datasets of tobacco and e-cigarette tweets were generated using GPT-4 and validated by tobacco control specialists, who applied a standardized rubric and a consensus mechanism to ensure dataset quality. The Flan-T5 models were fine-tuned using Low-Rank Adaptation (LoRA) and evaluated against pre-trained baselines on these datasets using accuracy. To further assess generalizability and robustness, the fine-tuned models were also evaluated on real-world tweets collected around the COP9 event.
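
As a rough illustration of the Low-Rank Adaptation idea named in the Methods, the sketch below shows how a frozen weight matrix W is combined with two small trainable factors B and A, giving an effective weight of W + (alpha / r) * B A. This is the general LoRA formulation, not the authors' training code: the abstract does not state the rank, scaling factor, or which layers were adapted, so every size and value here is a hypothetical toy choice.

```python
def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def lora_effective_weight(w, a, b, alpha):
    """Return W + (alpha / r) * B @ A, the weight a LoRA-adapted layer applies."""
    r = len(a)                 # rank = number of rows in A (shape r x k)
    scale = alpha / r
    delta = matmul(b, a)       # (d x r) @ (r x k) -> d x k update
    return [
        [w_ij + scale * d_ij for w_ij, d_ij in zip(w_row, d_row)]
        for w_row, d_row in zip(w, delta)
    ]


def trainable_params(d, k, r):
    """Parameters in the two low-rank factors B (d x r) and A (r x k)."""
    return r * (d + k)


# Hypothetical toy values (not taken from the paper).
w = [[1.0, 0.0, 0.0],
     [0.0, 1.0, 0.0]]          # frozen base weight, 2 x 3
a = [[0.5, -0.5, 1.0]]         # factor A, rank r = 1
b_zero = [[0.0], [0.0]]        # B starts at zero, so the update starts at zero
b_trained = [[2.0], [-1.0]]    # B after some hypothetical training

unchanged = lora_effective_weight(w, a, b_zero, alpha=2)
adapted = lora_effective_weight(w, a, b_trained, alpha=2)

print(unchanged)   # same as w: a zero B leaves the base model untouched
print(adapted)     # [[3.0, -2.0, 4.0], [-1.0, 2.0, -2.0]]
print(trainable_params(1024, 1024, 8), "trainable vs", 1024 * 1024, "frozen")
```

Only B and A are updated during fine-tuning, which is why the approach suits a lightweight model: for a 1024 x 1024 layer at rank 8, the factors hold 16,384 parameters against 1,048,576 in the frozen matrix.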

Results: Fine-tuned models substantially outperformed pre-trained models on every task. For tobacco intent classification, the fine-tuned model achieved an overall accuracy of 0.91, compared with 0.33 for the pre-trained model. For e-cigarette intent classification, accuracy was 0.93 for the fine-tuned model versus 0.36 for the pre-trained model. For sentiment analysis, accuracy was 0.94 for the fine-tuned model versus 0.65 for the pre-trained model.
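
To put the reported accuracies side by side, the short sketch below computes the absolute gain for each task and, as an additional derived figure that the abstract does not report, the share of the pre-trained model's errors that fine-tuning removed. The accuracy pairs are taken from the Results above. The function name and the error-reduction measure are illustrative assumptions.

```python
# (pre-trained accuracy, fine-tuned accuracy) as reported in the Results.
reported = {
    "tobacco intent": (0.33, 0.91),
    "e-cigarette intent": (0.36, 0.93),
    "sentiment": (0.65, 0.94),
}


def gains(pre, fine):
    """Return (absolute accuracy gain, fraction of baseline errors removed)."""
    absolute = round(fine - pre, 2)
    error_reduction = round((fine - pre) / (1 - pre), 2)
    return absolute, error_reduction


summary = {task: gains(pre, fine) for task, (pre, fine) in reported.items()}

for task, (absolute, error_reduction) in summary.items():
    print(f"{task}: +{absolute:.2f} accuracy, {error_reduction:.0%} of baseline errors removed")
# tobacco intent: +0.58 accuracy, 87% of baseline errors removed
# e-cigarette intent: +0.57 accuracy, 89% of baseline errors removed
# sentiment: +0.29 accuracy, 83% of baseline errors removed
```

The absolute gain is smallest for sentiment analysis because the pre-trained baseline was already strongest there, yet the proportion of baseline errors removed is similar across all three tasks.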

Conclusion: Domain-specific fine-tuning substantially improves the effectiveness of lightweight Flan-T5 models in analyzing tobacco- and e-cigarette-related tweets, providing highly accurate instruments for tracking public conversation about tobacco and e-cigarettes. The involvement of domain specialists in dataset validation ensured that the generated content accurately represented real-world discussions, thereby enhancing the quality and reliability of the results. These findings could inform tobacco control research and the formulation of public policy.
Original language: English
Article number: 1501154
Journal: Frontiers in Big Data
Volume: 7
Early online date: 28 Nov 2024
Publication status: Published - 31 Dec 2024

Data Availability Statement

The datasets presented in this study can be found in online repositories. The names of the repository/repositories and accession number(s) can be found at: https://doi.org/10.7910/DVN/WQQW8S, Harvard Dataverse, V1.

Acknowledgements

We acknowledge the use of GPT-4 (OpenAI, version GPT-4-turbo) to generate synthetic tweets that were used in this study. These tweets, representing diverse sentiments and intents related to tobacco and e-cigarette topics, were subsequently validated by tobacco control specialists to ensure relevance and accuracy for our analyses.

Funding

All authors are funded by Bloomberg Philanthropies as part of the Bloomberg Initiative to Reduce Tobacco Use (www.bloomberg.org). The funders had no role in the study design, data collection and analysis, decision to publish, or preparation of the manuscript.

Funders: Bloomberg Philanthropies

    Keywords

    • Large Language Models (LLMs)
    • domain adaptation
    • e-cigarette
    • intent classification
    • public health
    • sentiment analysis (SA)
    • social media analysis
    • tobacco

    ASJC Scopus subject areas

    • Computer Science (miscellaneous)
    • Information Systems
    • Artificial Intelligence
