A multilingual parallel corpora collection effort for Indian languages

Shashank Siripragada, Jerin Philip, Vinay P. Namboodiri, C. V. Jawahar

Research output: Chapter or section in a book/report/conference proceeding > Chapter in a published conference proceeding

37 Citations (SciVal)

Abstract

We present sentence-aligned parallel corpora across 10 Indian languages (Hindi, Telugu, Tamil, Malayalam, Gujarati, Urdu, Bengali, Oriya, Marathi, Punjabi) and English, many of which are categorized as low resource. The corpora are compiled from online sources whose content is shared across languages. The corpora presented significantly extend existing resources, which are either not large enough or restricted to a specific domain (such as health). We also provide a separate test corpus, compiled from an independent online source, that can be used independently to validate performance in the 10 Indian languages. In addition, we report on the methods used to construct such corpora, which rely on tools enabled by recent advances in machine translation and cross-lingual retrieval based on deep neural networks.

Original language: English
Title of host publication: LREC 2020 - 12th International Conference on Language Resources and Evaluation, Conference Proceedings
Editors: Nicoletta Calzolari, Frederic Bechet, Philippe Blache, Khalid Choukri, Christopher Cieri, Thierry Declerck, Sara Goggi, Hitoshi Isahara, Bente Maegaard, Joseph Mariani, Helene Mazo, Asuncion Moreno, Jan Odijk, Stelios Piperidis
Publisher: European Language Resources Association (ELRA)
Pages: 3743-3751
Number of pages: 9
ISBN (Electronic): 9791095546344
Publication status: Published - 31 May 2020
Event: 12th International Conference on Language Resources and Evaluation, LREC 2020 - Marseille, France
Duration: 11 May 2020 to 16 May 2020

Publication series

Name: LREC 2020 - 12th International Conference on Language Resources and Evaluation, Conference Proceedings

Conference

Conference: 12th International Conference on Language Resources and Evaluation, LREC 2020
Country/Territory: France
City: Marseille
Period: 11/05/20 to 16/05/20

Bibliographical note

Funding Information:
We gratefully acknowledge the online corpora provided by ILCI, WAT-ILMPC, the online publicly available data that we have used in this work and the online open source tools that have facilitated this work.

Publisher Copyright:
© European Language Resources Association (ELRA), licensed under CC-BY-NC

Copyright:
Copyright 2020 Elsevier B.V., All rights reserved.

Keywords

  • Indian languages
  • Machine Translation
  • Parallel Corpus

ASJC Scopus subject areas

  • Language and Linguistics
  • Education
  • Library and Information Sciences
  • Linguistics and Language
