Abstract

Human-annotated datasets are central to the development and evaluation of sentiment analysis and other natural language processing systems. However, many existing datasets suffer from low annotator agreement and errors, raising concerns about the quality of data used to train and evaluate computational systems. Improving annotation reliability demands close examination of how datasets are created and how annotators interpret and approach the task. To this end, we create AmbiSent, a new sentiment dataset designed to capture cases of interpretive complexity that commonly challenge both annotators and computational models. Using a mixed-methods approach, we investigate how annotation instructions influence annotator experience and both inter-annotator agreement (Krippendorff’s alpha) and intra-annotator agreement (percent agreement). Two groups of 53 crowdworkers annotated 252 sentences under either detailed or minimal instructions, allowing inter- and intra-annotator agreement to be compared using a permutation test and an independent-samples t-test, respectively. Contrary to our hypotheses, our findings reveal that detailed instructions alone do not ensure more consistent annotations, either across or within individuals. A reflexive thematic analysis of open-ended survey responses further contextualised these findings, offering insights into the cognitive effort involved and the practical challenges annotators faced. Drawing on both quantitative and qualitative findings, we conclude that the effectiveness of instructions appears contingent on participants’ level of task engagement and the extent to which the instructions align with intuitive annotation strategies. Despite the detailed guidance, participants often resorted to reductive annotation approaches. However, we also observed sentence types where detailed instructions may improve annotator agreement (e.g., sentences with perspective-dependent sentiment and rhetorical questions). Together, these results inform recommendations for enhancing task engagement and instruction adherence, offering practical insights for future dataset development. Finally, to support diverse use cases, we release three versions of the AmbiSent dataset, each accompanied by detailed annotator information and label distributions to better accommodate different user needs.
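The agreement analyses named in the abstract can be illustrated with a short sketch. The snippet below is not the authors' analysis code: it uses simulated labels, an assumed three-way label set, the third-party krippendorff package for nominal Krippendorff's alpha, a label-shuffling permutation test on the between-group alpha difference, and SciPy's independent-samples t-test on per-annotator percent agreement, mirroring the tests mentioned above. The real AmbiSent annotations and study protocol are available via the OSF links in the Data Availability Statement.

```python
"""Illustrative sketch only (not the authors' code) of the agreement analyses
described in the abstract: per-group Krippendorff's alpha, a permutation test
on the alpha difference between instruction groups, and an independent-samples
t-test on per-annotator percent agreement. All data below are simulated.
Requires numpy, scipy, and the third-party `krippendorff` package."""

import numpy as np
from scipy import stats
import krippendorff

rng = np.random.default_rng(0)

# Simulated nominal labels: rows = annotators, columns = sentences.
# 0 = negative, 1 = neutral, 2 = positive (illustrative label set only).
n_annotators, n_sentences = 53, 252
detailed = rng.integers(0, 3, size=(n_annotators, n_sentences)).astype(float)
minimal = rng.integers(0, 3, size=(n_annotators, n_sentences)).astype(float)

def group_alpha(labels):
    """Inter-annotator agreement for one instruction group (nominal alpha)."""
    return krippendorff.alpha(reliability_data=labels,
                              level_of_measurement="nominal")

observed_diff = group_alpha(detailed) - group_alpha(minimal)

# Permutation test: repeatedly reassign annotators to the two groups and
# recompute the alpha difference to build a null distribution.
pooled = np.vstack([detailed, minimal])
n_perm = 1000
null_diffs = np.empty(n_perm)
for i in range(n_perm):
    idx = rng.permutation(pooled.shape[0])
    null_diffs[i] = (group_alpha(pooled[idx[:n_annotators]])
                     - group_alpha(pooled[idx[n_annotators:]]))
p_perm = np.mean(np.abs(null_diffs) >= abs(observed_diff))

# Intra-annotator agreement: percent agreement on repeated items, compared
# between groups with an independent-samples t-test (values simulated here).
intra_detailed = rng.uniform(0.6, 0.95, size=n_annotators)
intra_minimal = rng.uniform(0.6, 0.95, size=n_annotators)
t_stat, p_ttest = stats.ttest_ind(intra_detailed, intra_minimal)

print(f"alpha difference = {observed_diff:.3f}, permutation p = {p_perm:.3f}")
print(f"t = {t_stat:.2f}, p = {p_ttest:.3f}")
```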
Original language: English
Article number: e0336269
Journal: PLoS ONE
Volume: 20
Issue number: 12
DOIs
Publication status: Published - 1 Dec 2025

Data Availability Statement

All relevant data are within the manuscript, its Supporting Information files and the following links on the Open Science Framework. This includes:
- study resource: https://osf.io/a9vyb/
- detailed description and stimulus preparation: https://osf.io/edg6r/files/osfstorage
- other supplementary materials: https://osf.io/r84qj/files/osfstorage
- study protocol: https://osf.io/af2h9/files/osfstorage
- AmbiSent dataset: https://osf.io/x687m/files/osfstorage

Funding

This work was supported by UK Government funding awarded to BID & JH. The funders had no role in study design, data collection and analysis, decision to publish, or preparation of the manuscript.
