Revisiting Low Resource Status of Indian Languages in Machine Translation

Jerin Philip, Shashank Siripragada, Vinay P. Namboodiri, C. V. Jawahar

Research output: Chapter or section in a book/report/conference proceeding > Chapter in a published conference proceeding

17 Citations (SciVal)

Abstract

Indian language machine translation performance is hampered by the lack of large-scale, multilingual, sentence-aligned corpora and robust benchmarks. In this paper, we provide and analyse an automated framework for obtaining such a corpus for Indian language neural machine translation (NMT) systems. Our pipeline consists of a baseline NMT system, a retrieval module, and an alignment module, and it is used to work with publicly available websites, such as government press releases. Our main contribution is an incremental method that uses this pipeline to iteratively increase the size of the corpus and to improve each component of the system. We also evaluate design choices such as the choice of pivoting language and the effect of iteratively increasing the corpus size. In addition to providing an automated framework, our work yields a corpus that is larger than the existing corpora available for Indian languages. This corpus helps us obtain substantially improved results on the publicly available WAT evaluation benchmark and other standard evaluation benchmarks.
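The iterative loop described in the abstract (translate, retrieve, align, grow the corpus, improve the system) can be illustrated with a minimal Python sketch. Everything in it is an assumption made for illustration and is not the authors' implementation: the names `translate`, `retrieve`, `align`, `retrain`, and `mine_corpus` are hypothetical, a word lexicon stands in for the NMT system, Jaccard word overlap stands in for the retrieval and alignment scoring, English is assumed as the pivot language, and the toy documents use made-up placeholder tokens rather than real text.

```python
from typing import Dict, List, Tuple

Pair = Tuple[str, str]   # (source sentence, English sentence)
Doc = List[str]          # a document is a list of sentences


def jaccard(a: str, b: str) -> float:
    """Word-overlap similarity between two sentences (toy scoring function)."""
    sa, sb = set(a.lower().split()), set(b.lower().split())
    return len(sa & sb) / len(sa | sb) if (sa | sb) else 0.0


def translate(sentence: str, lexicon: Dict[str, str]) -> str:
    """Toy stand-in for the NMT system: word-by-word lexicon lookup."""
    return " ".join(lexicon.get(w, w) for w in sentence.split())


def retrieve(src_doc: Doc, en_docs: List[Doc], lexicon: Dict[str, str]) -> Doc:
    """Retrieval step: the English document closest to the translated source."""
    translated = " ".join(translate(s, lexicon) for s in src_doc)
    return max(en_docs, key=lambda d: jaccard(translated, " ".join(d)))


def align(src_doc: Doc, en_doc: Doc, lexicon: Dict[str, str],
          threshold: float) -> List[Pair]:
    """Alignment step: keep the best English match if it is similar enough."""
    pairs = []
    for s in src_doc:
        hyp = translate(s, lexicon)
        best = max(en_doc, key=lambda e: jaccard(hyp, e))
        if jaccard(hyp, best) >= threshold:
            pairs.append((s, best))
    return pairs


def retrain(lexicon: Dict[str, str], pairs: List[Pair]) -> Dict[str, str]:
    """Toy stand-in for retraining: learn unseen words positionally from
    aligned pairs of equal length. A real system would retrain the model."""
    updated = dict(lexicon)
    for src, en in pairs:
        sw, ew = src.split(), en.split()
        if len(sw) == len(ew):
            for a, b in zip(sw, ew):
                updated.setdefault(a, b)
    return updated


def mine_corpus(src_docs: List[Doc], en_docs: List[Doc],
                lexicon: Dict[str, str], rounds: int = 2,
                threshold: float = 0.55) -> Tuple[List[Pair], Dict[str, str]]:
    """Run the translate/retrieve/align loop, retraining after each round."""
    corpus: List[Pair] = []
    for _ in range(rounds):
        for doc in src_docs:
            match = retrieve(doc, en_docs, lexicon)
            for pair in align(doc, match, lexicon, threshold):
                if pair not in corpus:
                    corpus.append(pair)
        lexicon = retrain(lexicon, corpus)
    return corpus, lexicon


# Made-up placeholder tokens ("aa", "bb", ...) stand in for source-language words.
src_docs = [["aa bb cc dd", "ff cc dd"]]
en_docs = [["the cat sleeps here", "dog sleeps here"], ["markets rose today"]]
seed_lexicon = {"aa": "the", "bb": "cat", "dd": "here", "ff": "dog"}

corpus_one, _ = mine_corpus(src_docs, en_docs, seed_lexicon, rounds=1)
corpus_two, lexicon_two = mine_corpus(src_docs, en_docs, seed_lexicon, rounds=2)
```

On this toy data, the second round recovers a sentence pair that the first round misses, because the word learned from the first aligned pair lifts the second sentence above the similarity threshold. This mirrors, in a very simplified form, the idea that a larger corpus improves the components that in turn mine more data.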

Original language: English
Title of host publication: CODS-COMAD 2021 - Proceedings of the 3rd ACM India Joint International Conference on Data Science and Management of Data, 8th ACM IKDD CODS and 26th COMAD
Editors: Jayant Haritsa, Shourya Roy, Manish Gupta, Sharad Mehrotra, Balaji Vasan Srinivasan, Yogesh Simmhan
Publisher: Association for Computing Machinery
Pages: 178-187
Number of pages: 10
ISBN (Electronic): 9781450388177
DOIs
Publication status: Published - 2 Jan 2021
Event: 3rd ACM India Joint International Conference on Data Science and Management of Data, CODS-COMAD 2021 - Virtual, Online, India
Duration: 2 Jan 2021 to 4 Jan 2021

Publication series

Name: ACM International Conference Proceeding Series

Conference

Conference: 3rd ACM India Joint International Conference on Data Science and Management of Data, CODS-COMAD 2021
Country/Territory: India
City: Virtual, Online
Period: 2/01/21 to 4/01/21

Keywords

  • information retrieval
  • machine translation
  • parallel corpus

ASJC Scopus subject areas

  • Software
  • Human-Computer Interaction
  • Computer Vision and Pattern Recognition
  • Computer Networks and Communications
