An Extensible Massively Multilingual Lexical Simplification Pipeline Dataset using the MultiLS Framework

Matthew Shardlow, Fernando Alva-Manchego, Riza Batista-Navarro, Stefan Bott, Saul Calderon Ramirez, Rémi Cardon, Thomas François, Akio Hayakawa, Andrea Horbach, Anna Hülsing, Yusuke Ide, Joseph Marvin Imperial, Adam Nohejl, Kai North, Laura Occhipinti, Nelson Peréz Rojas, Nishat Raihan, Tharindu Ranasinghe, Martin Solis Salazar, Marcos Zampieri, Horacio Saggion

Research output: Chapter or section in a book/report/conference proceeding › Chapter in a published conference proceeding

Abstract

We present preliminary findings on the MultiLS dataset, developed in support of the 2024 Multilingual Lexical Simplification Pipeline (MLSP) Shared Task. This dataset currently comprises 300 instances of lexical complexity prediction and lexical simplification across 10 languages. In this paper, we (1) describe the annotation protocol in support of the contribution of future datasets and (2) present summary statistics on the existing data that we have gathered. Multilingual lexical simplification can be used to support low-ability readers to engage with otherwise difficult texts in their native, often low-resourced, languages.

Original language: English
Title of host publication: 3rd Workshop on Tools and Resources for People with REAding DIfficulties, READI 2024 at LREC-COLING 2024 - Workshop Proceedings
Editors: Rodrigo Wilkens, Remi Cardon, Amalia Todirascu, Nuria Gala
Publisher: European Language Resources Association (ELRA)
Pages: 38-46
Number of pages: 9
ISBN (Electronic): 9782493814340
Publication status: Published - 20 May 2024
Event: 3rd Workshop on Tools and Resources for People with REAding DIfficulties, READI 2024 - Torino, Italy
Duration: 20 May 2024 → …

Publication series

Name: 3rd Workshop on Tools and Resources for People with REAding DIfficulties, READI 2024 at LREC-COLING 2024 - Workshop Proceedings

Conference

Conference: 3rd Workshop on Tools and Resources for People with REAding DIfficulties, READI 2024
Country/Territory: Italy
City: Torino
Period: 20/05/24 → …

Funding

Andrea Horbach is part of the research conducted at CATALPA – Center for Advanced Technology-Assisted Learning and Predictive Analytics of the FernUniversität in Hagen, Germany. Anna Hülsing is supported by the German Federal Ministry of Education and Research (grant no. FKZ 01JA23S03C). Joseph Imperial is supported by the National University Philippines (Project ID: 2023I-1T-05-MLA-CCIT-Computer Science) and the UKRI Centre for Doctoral Training in Accountable, Responsible and Transparent AI [EP/S023437/1] of the University of Bath. Horacio Saggion, Stefan Bott and Akio Hayakawa acknowledge funding from the European Union's Horizon Europe research and innovation program under the Grant Agreement No. 101132431 (iDEM Project); views and opinions expressed are however those of the author(s) only and do not necessarily reflect those of the European Union. Horacio Saggion, Stefan Bott and Akio Hayakawa also acknowledge support from the Departament de Recerca i Universitats de la Generalitat de Catalunya (ajuts SGR-Cat 2021).

Funders (funder number, where given)
National University, Philippines
University of Bath
Departament de Recerca i Universitats de la Generalitat de Catalunya
FernUniversität Hagen
Bundesministerium für Bildung und Forschung: FKZ 01JA23S03C
UK Research and Innovation: EP/S023437/1
European Union's Horizon Europe research and innovation program: 101132431

    Keywords

    • lexical complexity prediction
    • lexical simplification
    • MultiLS

    ASJC Scopus subject areas

    • Language and Linguistics
    • Education
    • Library and Information Sciences
    • Linguistics and Language
