Abstract
We present sentence-aligned parallel corpora across 10 Indian languages (Hindi, Telugu, Tamil, Malayalam, Gujarati, Urdu, Bengali, Oriya, Marathi, Punjabi) and English, many of which are categorized as low-resource. The corpora are compiled from online sources that share content across languages. The corpora presented significantly extend existing resources, which are either not large enough or restricted to a specific domain (such as health). We also provide a separate test corpus, compiled from an independent online source, that can be used independently to validate performance in the 10 Indian languages. Alongside, we report on the methods used to construct such corpora, which rely on tools enabled by recent advances in machine translation and cross-lingual retrieval based on deep neural networks.
Original language | English |
---|---|
Title of host publication | LREC 2020 - 12th International Conference on Language Resources and Evaluation, Conference Proceedings |
Editors | Nicoletta Calzolari, Frederic Bechet, Philippe Blache, Khalid Choukri, Christopher Cieri, Thierry Declerck, Sara Goggi, Hitoshi Isahara, Bente Maegaard, Joseph Mariani, Helene Mazo, Asuncion Moreno, Jan Odijk, Stelios Piperidis |
Publisher | European Language Resources Association (ELRA) |
Pages | 3743-3751 |
Number of pages | 9 |
ISBN (Electronic) | 9791095546344 |
Publication status | Published - 31 May 2020 |
Event | 12th International Conference on Language Resources and Evaluation, LREC 2020 - Marseille, France. Duration: 11 May 2020 → 16 May 2020 |
Publication series
Name | LREC 2020 - 12th International Conference on Language Resources and Evaluation, Conference Proceedings |
---|---|
Conference
Conference | 12th International Conference on Language Resources and Evaluation, LREC 2020 |
---|---|
Country/Territory | France |
City | Marseille |
Period | 11/05/20 → 16/05/20 |
Bibliographical note
Funding Information: We gratefully acknowledge the online corpora provided by ILCI and WAT-ILMPC, the publicly available online data used in this work, and the open-source online tools that facilitated it.
Publisher Copyright:
© European Language Resources Association (ELRA), licensed under CC-BY-NC
Copyright:
Copyright 2020 Elsevier B.V., All rights reserved.
Keywords
- Indian languages
- Machine Translation
- Parallel Corpus
ASJC Scopus subject areas
- Language and Linguistics
- Education
- Library and Information Sciences
- Linguistics and Language