Explainable Publication Year Prediction of Eighteenth Century Texts with the BERT Model

Iiro Rastas, Yann Ryan, Iiro Tiihonen, Mohammadreza Qaraei, Liina Repo, Rohit Babbar, Eetu Mäkelä, Mikko Tolonen, Filip Ginter

Research output: Chapter or section in a book/report/conference proceeding > Chapter in a published conference proceeding

14 Citations (SciVal)

Abstract

In this paper, we describe a BERT model trained on the Eighteenth Century Collections Online (ECCO) dataset of digitized documents. The ECCO dataset poses unique modelling challenges due to the presence of Optical Character Recognition (OCR) artifacts. We establish the performance of the BERT model on a publication year prediction task against linear baseline models and human judgement, finding the BERT model to be superior to both and able to date the works, on average, with less than 7 years absolute error. We also explore how language change over time affects the model by analyzing the features the model uses for publication year predictions as given by the Integrated Gradients model explanation method.
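The two quantities named in the abstract, mean absolute error in years and Integrated Gradients attributions, can be illustrated with a minimal sketch. This is not the authors' code: the toy model, its gradient function, and the sample years below are hypothetical stand-ins chosen so the example runs without a trained BERT model. The attribution routine uses a midpoint Riemann sum, which is one common way to approximate the path integral in Integrated Gradients.

```python
from typing import Callable, List, Sequence


def mean_absolute_error(true_years: Sequence[float],
                        predicted_years: Sequence[float]) -> float:
    """Average of |true - predicted| over all documents, in years."""
    if len(true_years) != len(predicted_years) or not true_years:
        raise ValueError("inputs must be non-empty and of equal length")
    total = sum(abs(t - p) for t, p in zip(true_years, predicted_years))
    return total / len(true_years)


def integrated_gradients(grad_fn: Callable[[List[float]], List[float]],
                         inputs: Sequence[float],
                         baseline: Sequence[float],
                         steps: int = 50) -> List[float]:
    """Approximate Integrated Gradients with a midpoint Riemann sum.

    For each feature i the attribution is
    (x_i - baseline_i) * average of dF/dx_i along the straight path
    from the baseline to the input.
    """
    totals = [0.0] * len(inputs)
    for k in range(steps):
        alpha = (k + 0.5) / steps  # midpoint of the k-th sub-interval
        point = [b + alpha * (x - b) for x, b in zip(inputs, baseline)]
        for i, g in enumerate(grad_fn(point)):
            totals[i] += g
    return [(x - b) * t / steps for x, b, t in zip(inputs, baseline, totals)]


# Hypothetical toy model standing in for a year regressor:
# F(x) = 3*x0 + x1**2, with its gradient written out by hand.
def toy_model(x: Sequence[float]) -> float:
    return 3.0 * x[0] + x[1] ** 2


def toy_grad(x: Sequence[float]) -> List[float]:
    return [3.0, 2.0 * x[1]]


if __name__ == "__main__":
    # Made-up publication years, for illustration only.
    print(mean_absolute_error([1750, 1760, 1770], [1745, 1762, 1779]))

    attributions = integrated_gradients(toy_grad, [1.0, 2.0], [0.0, 0.0])
    print(attributions)
    # Completeness check: attributions should sum to F(x) - F(baseline),
    # up to floating-point error.
    print(sum(attributions), toy_model([1.0, 2.0]) - toy_model([0.0, 0.0]))
```

In a real setting the gradient would come from automatic differentiation over the model's input embeddings rather than a hand-written function, and the baseline would typically be a neutral input such as padding tokens; both choices are assumptions here rather than details taken from the paper.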

Original language: English
Title of host publication: LChange 2022 - 3rd International Workshop on Computational Approaches to Historical Language Change 2022, Proceedings of the Workshop
Editors: Nina Tahmasebi, Syrielle Montariol, Andrey Kutuzov, Simon Hengchen, Haim Dubossarsky, Lars Borin
Publisher: Association for Computational Linguistics (ACL)
Pages: 68-77
Number of pages: 10
ISBN (Electronic): 9781955917421
Publication status: Published - 25 May 2022
Event: 3rd International Workshop on Computational Approaches to Historical Language Change, LChange 2022 - Dublin, Ireland
Duration: 26 May 2022 to 27 May 2022

Publication series

Name: LChange 2022 - 3rd International Workshop on Computational Approaches to Historical Language Change 2022, Proceedings of the Workshop

Conference

Conference: 3rd International Workshop on Computational Approaches to Historical Language Change, LChange 2022
Country/Territory: Ireland
City: Dublin
Period: 26/05/22 to 27/05/22

Bibliographical note

Funding Information:
The research was supported by the Academy of Finland under the project High Performance Computing for the Detection and Analysis of Historical Discourses. Computational resources were provided by CSC – IT Center for Science.

ASJC Scopus subject areas

  • Language and Linguistics
  • Computational Theory and Mathematics
  • Computer Science Applications
  • Software
