Word Boundary Information Isn’t Useful for Encoder Language Models

Edward Gow-Smith, Dylan Phelps, Harish Tayyar Madabushi, Carolina Scarton, Aline Villavicencio

Research output: Chapter in a published conference proceeding

Abstract

Existing transformer-based approaches to NLP that use subword tokenisation algorithms encode whitespace (word boundary information) through special space symbols (such as ## or _) that form part of tokens. These symbols have been shown to a) reduce the morphological validity of tokenisations, and b) introduce substantial vocabulary redundancy. Accordingly, removing these symbols has been shown to benefit the processing of morphologically complex words by transformer encoders in the pretrain-finetune paradigm. In this work, we explore whether word boundary information is useful to such models at all. In particular, we train transformer encoders at four different training scales and investigate several alternative approaches to including word boundary information, evaluating on two languages (English and Finnish) with a range of tasks across different domains and problem set-ups: sentence classification datasets, NER (for token-level classification), and two classification datasets involving complex words (Superbizarre and FLOTA). Overall, through an extensive experimental setup that includes the pretraining of 35 models, we find no substantial improvements from our alternative approaches, suggesting that modifying tokenisers to remove word boundary information does not lead to a loss of useful information.
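
To make the vocabulary-redundancy point concrete, here is a minimal Python sketch using a hypothetical toy vocabulary (not the paper's actual tokenisers): WordPiece-style continuation markers such as ## create separate vocabulary entries for word-initial and word-internal occurrences of the same string, and stripping the marker collapses them into one.

    # Minimal sketch of the vocabulary redundancy introduced by boundary
    # markers. The toy vocabulary is hypothetical; real WordPiece vocabularies
    # (e.g. BERT's) exhibit the same duplication at a much larger scale.
    vocab_with_markers = {"less", "##less", "ness", "##ness", "able", "##able"}

    # Removing the '##' continuation marker merges word-initial and
    # word-internal variants of the same string into a single entry.
    vocab_without_markers = {token.lstrip("#") for token in vocab_with_markers}

    print(len(vocab_with_markers))     # 6 entries with boundary information
    print(len(vocab_without_markers))  # 3 entries once markers are removed

In BERT-style vocabularies, many suffix strings do appear both with and without the ## prefix, which is the redundancy the abstract refers to.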

Original language: English
Title of host publication: ACL 2024 - 9th Workshop on Representation Learning for NLP, RepL4NLP 2024 - Proceedings of the Workshop
Editors: Chen Zhao, Marius Mosbach, Pepa Atanasova, Seraphina Goldfarb-Tarrant, Peter Hase, Arian Hosseini, Maha Elbayad, Sandro Pezzelle, Maximilian Mozes
Place of Publication: Texas, U.S.A.
Publisher: Association for Computational Linguistics (ACL)
Pages: 118–135
Number of pages: 18
ISBN (Electronic): 9798891761575
Publication status: Published - 15 Aug 2024
Event: 9th Workshop on Representation Learning for NLP, RepL4NLP 2024 at ACL 2024 - Bangkok, Thailand
Duration: 15 Aug 2024 → …

Publication series

Name: ACL 2024 - 9th Workshop on Representation Learning for NLP, RepL4NLP 2024 - Proceedings of the Workshop

Conference

Conference: 9th Workshop on Representation Learning for NLP, RepL4NLP 2024 at ACL 2024
Country/Territory: Thailand
City: Bangkok
Period: 15/08/24 → …

ASJC Scopus subject areas

  • Language and Linguistics
  • Software
  • Linguistics and Language
