TY - GEN
T1 - Adjudicating LLMs as PropBank Annotators
AU - Bonn, Julia
AU - Madabushi, Harish Tayyar
AU - Hwang, Jena D.
AU - Bonial, Claire
PY - 2024/5/21
Y1 - 2024/5/21
N2 - We evaluate the ability of large language models (LLMs) to provide PropBank semantic role label annotations across different realizations of the same verbs in transitive, intransitive, and middle voice constructions. In order to assess the meta-linguistic capabilities of LLMs as well as their ability to glean such capabilities through in-context learning, we evaluate the models in a zero-shot setting, in a setting where the model is given three examples of another verb used in transitive, intransitive, and middle voice constructions, and finally in a setting where the model is given those examples as well as the correct sense and roleset information. We find that zero-shot knowledge of PropBank annotation is almost nonexistent. The largest model evaluated, GPT-4, achieves the best performance in the setting where it is given both examples and the correct roleset in the prompt, demonstrating that larger models can ascertain some meta-linguistic capabilities through in-context learning. However, even in this setting, which is simpler than the task of a human PropBank annotator, the model achieves only 48% accuracy in marking numbered arguments correctly. To ensure transparency and reproducibility, we publicly release our dataset and model responses.
KW - LLM Evaluation
KW - PropBank
KW - Semantic Role Labeling
UR - http://www.scopus.com/inward/record.url?scp=85195124732&partnerID=8YFLogxK
M3 - Chapter in a published conference proceeding
AN - SCOPUS:85195124732
T3 - 5th International Workshop on Designing Meaning Representation, DMR 2024 at LREC-COLING 2024 - Workshop Proceedings
SP - 112
EP - 123
BT - 5th International Workshop on Designing Meaning Representation, DMR 2024 at LREC-COLING 2024 - Workshop Proceedings
A2 - Bonial, Claire
A2 - Bonn, Julia
A2 - Hwang, Jena D.
PB - European Language Resources Association (ELRA)
T2 - 5th International Workshop on Designing Meaning Representation, DMR 2024
Y2 - 21 May 2024
ER -