Abstract

Domain experts across engineering, healthcare, and education follow strict standards for producing quality content such as technical manuals, medication instructions, and children's reading materials. However, current work in controllable text generation has yet to explore using these standards as references for control. Towards this end, we introduce Standardize, a retrieval-style in-context learning-based framework to guide large language models to align with expert-defined standards. Focusing on English language standards in the education domain as a use case, we consider the Common European Framework of Reference for Languages (CEFR) and Common Core Standards (CCS) for the task of open-ended content generation. Our findings show that models can achieve a 40% to 100% increase in precise accuracy for Llama2 and GPT-4, respectively, demonstrating that extracting knowledge artifacts from standards and integrating them into the generation process can effectively guide models to produce better standard-aligned content.
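The abstract's core idea — retrieving expert-defined level descriptors as "knowledge artifacts" and conditioning generation on them — can be illustrated with a minimal sketch. The descriptor strings, function name, and prompt format below are hypothetical placeholders for illustration only, not the paper's actual implementation:

```python
# Illustrative sketch: retrieve a descriptor for the target standard
# level, then embed it in the generation prompt as an explicit
# constraint. Descriptor texts below are paraphrased placeholders.
CEFR_DESCRIPTORS = {
    "A2": ("Uses basic sentence patterns and memorised phrases; "
           "connects ideas with simple linkers like 'and' and 'but'."),
    "B1": ("Produces connected text on familiar topics; "
           "uses frequent patterns and routines with reasonable accuracy."),
}

def build_standard_aligned_prompt(task: str, level: str) -> str:
    """Compose a prompt that conditions the model on the retrieved
    level descriptor for the requested CEFR level."""
    descriptor = CEFR_DESCRIPTORS[level]
    return (
        f"Target CEFR level: {level}\n"
        f"Level descriptor: {descriptor}\n"
        f"Task: {task}\n"
        "Write the text so that it matches the descriptor above."
    )

print(build_standard_aligned_prompt(
    "Write a short story about a school trip.", "A2"))
```

The prompt produced this way would then be passed to the model of choice (e.g. Llama2 or GPT-4), with the descriptor acting as the standard-derived control signal.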
Original language: English
Title of host publication: Proceedings of the 2024 Conference on Empirical Methods in Natural Language Processing
Editors: Yaser Al-Onaizan, Mohit Bansal, Yun-Nung Chen
Place of Publication: Florida, U.S.A.
Publisher: Association for Computational Linguistics
Pages: 1573–1594
Number of pages: 22
ISBN (Electronic): 9798891761643
Publication status: Published - 30 Nov 2024
Event: 2024 Conference on Empirical Methods in Natural Language Processing, EMNLP 2024 - Hybrid, Miami, United States
Duration: 12 Nov 2024 - 16 Nov 2024

Publication series

Name: EMNLP 2024 - 2024 Conference on Empirical Methods in Natural Language Processing, Proceedings of the Conference

Acknowledgements

We are grateful to the anonymous reviewers and Action Editors in ARR for their feedback on improving this paper, and to Dr. Brian North for the insightful discussions on capturing language standards, including CEFR, as part of the theoretical component of this work. We also thank Dr. Samantha Curle and Dr. Reka Jablonkai from the Department of Education at the University of Bath for helping with the evaluation of model-generated texts. This work made use of the Hex GPU cloud of the Department of Computer Science at the University of Bath. We attribute the black icons used in Figure 1 to the collections of Design Circle and Victor Zukeran from the Noun Project, and the colored teacher icon to Flaticon.

Funding

JMI is supported by the National University Philippines and the UKRI Centre for Doctoral Training in Accountable, Responsible, and Transparent AI [EP/S023437/1] of the University of Bath.

Funders and funder numbers:
• National University, Philippines
• UKRI Centre for Doctoral Training in Accountable, Responsible and Transparent Artificial Intelligence: EP/S023437/1

Keywords

• cs.CL

ASJC Scopus subject areas

• Computational Theory and Mathematics
• Computer Science Applications
• Information Systems
• Linguistics and Language

Fingerprint

Research topics: 'Standardize: Aligning Language Models with Expert-Defined Standards for Content Generation'.
