TY - GEN
T1 - Word Boundary Information Isn’t Useful for Encoder Language Models
AU - Gow-Smith, Edward
AU - Phelps, Dylan
AU - Madabushi, Harish Tayyar
AU - Scarton, Carolina
AU - Villavicencio, Aline
PY - 2024/8/15
Y1 - 2024/8/15
N2 - All existing transformer-based approaches to NLP using subword tokenisation algorithms encode whitespace (word boundary information) through the use of special space symbols (such as ## or _) forming part of tokens. These symbols have been shown to a) lead to reduced morphological validity of tokenisations, and b) give substantial vocabulary redundancy. As such, removing these symbols has been shown to have a beneficial effect on the processing of morphologically complex words for transformer encoders in the pretrain-finetune paradigm. In this work, we explore whether word boundary information is at all useful to such models. In particular, we train transformer encoders across four different training scales, and investigate several alternative approaches to including word boundary information, evaluating on two languages (English and Finnish) with a range of tasks across different domains and problem set-ups: sentence classification datasets, NER (for token-level classification), and two classification datasets involving complex words (Superbizarre and FLOTA). Overall, through an extensive experimental setup that includes the pretraining of 35 models, we find no substantial improvements from our alternative approaches, suggesting that modifying tokenisers to remove word boundary information isn’t leading to a loss of useful information.
UR - http://www.scopus.com/inward/record.url?scp=85204916212&partnerID=8YFLogxK
M3 - Chapter in a published conference proceeding
AN - SCOPUS:85204916212
T3 - ACL 2024 - 9th Workshop on Representation Learning for NLP, RepL4NLP 2024 - Proceedings of the Workshop
SP - 118
EP - 135
BT - ACL 2024 - 9th Workshop on Representation Learning for NLP, RepL4NLP 2024 - Proceedings of the Workshop
A2 - Zhao, Chen
A2 - Mosbach, Marius
A2 - Atanasova, Pepa
A2 - Goldfarb-Tarrant, Seraphina
A2 - Hase, Peter
A2 - Hosseini, Arian
A2 - Elbayad, Maha
A2 - Pezzelle, Sandro
A2 - Mozes, Maximilian
PB - Association for Computational Linguistics (ACL)
CY - Texas, U. S. A.
T2 - 9th Workshop on Representation Learning for NLP, RepL4NLP 2024 at ACL 2024
Y2 - 15 August 2024
ER -