A Baseline Readability Model for Cebuano

Lloyd Lois Antonie Reyes, Michael Antonio Ibañez, Ranz Sapinit, Mohammed Hussien, Joseph Marvin Imperial

Research output: Working paper / Preprint


Abstract

In this study, we developed the first baseline readability model for the Cebuano language. Cebuano is the second most widely used native language in the Philippines, with about 27.5 million speakers. As baseline features, we extracted traditional or surface-based features, syllable patterns based on Cebuano's documented orthography, and neural embeddings from the multilingual BERT model. Results show that using the first two sets of handcrafted linguistic features achieved the best performance when trained with an optimized Random Forest model, reaching approximately 87% across all metrics. The feature sets and algorithm are also similar to those in previous work on readability assessment for the Filipino language, showing the potential for cross-lingual application. To encourage more work on readability assessment in Philippine languages such as Cebuano, we open-sourced both code and data.
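The pipeline the abstract describes — handcrafted surface-level features fed to a Random Forest classifier — can be sketched as below. This is a minimal illustration using scikit-learn with synthetic data and made-up feature names; it is not the authors' code, dataset, or tuned hyperparameters.

```python
# Hypothetical sketch of the described setup: surface-based linguistic
# features classified into readability levels with a Random Forest.
# The features and labels here are synthetic placeholders.
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
# Toy surface features per text, e.g. [avg word length,
# avg sentence length, syllable-pattern count] (illustrative only)
X = rng.random((120, 3))
# Three synthetic readability grade levels (0, 1, 2)
y = rng.integers(0, 3, size=120)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0
)
clf = RandomForestClassifier(n_estimators=100, random_state=0)
clf.fit(X_train, y_train)
preds = clf.predict(X_test)
```

In practice, the model would be tuned (e.g. via grid search over tree depth and estimator count) and evaluated with accuracy, precision, recall, and F1, matching the "across all metrics" phrasing in the abstract.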
Original language: English
Place of Publication: 2022
Publisher: Association for Computational Linguistics (ACL)
Volume: Proceedings of the 17th Workshop on Innovative Use of NLP for Building Educational Applications (BEA 2022)
Publication status: Published - 31 Mar 2022

Bibliographical note

Accepted to BEA Workshop at NAACL 2022

Keywords

  • cs.CL
