TY - CONF
T1 - Towards automatic face-to-face translation
AU - Prajwal, K. R.
AU - Mukhopadhyay, Rudrabha
AU - Philip, Jerin
AU - Jha, Abhishek
AU - Namboodiri, Vinay
AU - Jawahar, C. V.
PY - 2019/10/15
N2 - In light of recent breakthroughs in automatic machine translation systems, we propose a novel approach that we term "Face-to-Face Translation". As today's digital communication becomes increasingly visual, we argue that there is a need for systems that can automatically translate a video of a person speaking in language A into a target language B with realistic lip synchronization. In this work, we create an automatic pipeline for this problem and demonstrate its impact in multiple real-world applications. First, we build a working speech-to-speech translation system by bringing together multiple existing modules from speech and language processing. We then move towards "Face-to-Face Translation" by incorporating a novel visual module, LipGAN, which generates realistic talking faces from the translated audio. Quantitative evaluation of LipGAN on the standard LRW test set shows that it significantly outperforms existing approaches across all standard metrics. We also subject our Face-to-Face Translation pipeline to multiple human evaluations and show that it can significantly improve the overall user experience for consuming and interacting with multimodal content across languages. Code, models, and a demo video are made publicly available.
KW - Cross-language talking face generation
KW - Lip Synthesis
KW - Neural Machine Translation
KW - Speech to Speech Translation
KW - Translation systems
KW - Voice Transfer
UR - http://www.scopus.com/inward/record.url?scp=85074841957&partnerID=8YFLogxK
DO - 10.1145/3343031.3351066
M3 - Conference paper in published proceedings
AN - SCOPUS:85074841957
SP - 1428
EP - 1436
BT - MM 2019 - Proceedings of the 27th ACM International Conference on Multimedia
PB - Association for Computing Machinery
T2 - 27th ACM International Conference on Multimedia, MM 2019
Y2 - 21 October 2019 through 25 October 2019
ER -