LEARNING TO PREDICT SPEECH IN SILENT VIDEOS VIA AUDIOVISUAL ANALOGY

Ravindra Yadav, Ashish Sardana, Vinay P. Namboodiri, Rajesh M. Hegde

Research output: Chapter in a published conference proceeding

1 Citation (SciVal)

Abstract

Lipreading is a difficult task, even for humans, and synthesizing the original speech waveform from lip movements makes the problem more challenging still. Towards this end, we present a deep learning framework that can be trained end-to-end to learn the mapping between the auditory and visual signals. In particular, our goal in this paper is to design a model that can efficiently predict the speech signal in a given silent talking-face video. The proposed framework generates a speech signal by mapping the video frames into a sequence of feature vectors. However, unlike some recent methods that adopt a sequence-to-sequence approach to translate the frame stream into the audio stream, we pose it as an analogy learning problem between the two modalities, in which each frame is mapped to the corresponding speech segment via a deep audio-visual analogy framework. We predict a plausible audio stream by training adversarially against a discriminator network. Our experiments, both qualitative and quantitative, on the publicly available GRID dataset show that the proposed method outperforms prior work on existing evaluation benchmarks. Our user studies confirm that our generated samples are more natural and closely match the ground-truth speech signal.
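The analogy formulation described in the abstract can be illustrated with a minimal sketch. This is not the authors' actual architecture: the encoders and the transfer map below are hypothetical linear placeholders standing in for learned deep networks, shown only to convey the analogy-making idea (a : b :: c : d, i.e., the change between two frame embeddings is transferred into the audio embedding space to predict the next speech-segment embedding).

```python
import numpy as np

rng = np.random.default_rng(0)

FRAME_DIM, AUDIO_DIM, EMB = 64, 32, 16

# Hypothetical linear encoders and a cross-modal transfer map T;
# in a real model these would be learned deep networks.
W_vis = rng.standard_normal((EMB, FRAME_DIM)) * 0.1   # visual encoder
W_aud = rng.standard_normal((EMB, AUDIO_DIM)) * 0.1   # audio encoder
T = rng.standard_normal((EMB, EMB)) * 0.1             # visual-to-audio transfer

def predict_next_audio_emb(frame_prev, frame_next, audio_prev):
    """Transfer the visual change (frame_prev -> frame_next) into audio space:
    d ~ enc_a(c) + T (enc_v(b) - enc_v(a))."""
    delta_v = W_vis @ frame_next - W_vis @ frame_prev  # visual analogy direction
    return W_aud @ audio_prev + T @ delta_v            # predicted audio embedding

frame_a = rng.standard_normal(FRAME_DIM)
frame_b = rng.standard_normal(FRAME_DIM)
audio_a = rng.standard_normal(AUDIO_DIM)

pred = predict_next_audio_emb(frame_a, frame_b, audio_a)
print(pred.shape)  # (16,)
```

In the paper this per-frame prediction is additionally trained adversarially against a discriminator network, which is omitted here for brevity.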

Original language: English
Title of host publication: 2022 IEEE International Conference on Acoustics, Speech, and Signal Processing, ICASSP 2022 - Proceedings
Publisher: IEEE
Pages: 8042-8046
Number of pages: 5
Volume: 2022
ISBN (Electronic): 9781665405409
ISBN (Print): 978-1-6654-0541-6
DOIs
Publication status: Published - 27 Apr 2022
Event: 47th IEEE International Conference on Acoustics, Speech, and Signal Processing, ICASSP 2022 - Virtual, Online, Singapore
Duration: 23 May 2022 - 27 May 2022

Publication series

Name: ICASSP, IEEE International Conference on Acoustics, Speech and Signal Processing - Proceedings
Volume: 2022-May
ISSN (Print): 1520-6149

Conference

Conference: 47th IEEE International Conference on Acoustics, Speech, and Signal Processing, ICASSP 2022
Country/Territory: Singapore
City: Virtual, Online
Period: 23/05/22 - 27/05/22

Keywords

  • AudioVisual analogy
  • self-supervised learning
  • Speech prediction

ASJC Scopus subject areas

  • Software
  • Signal Processing
  • Electrical and Electronic Engineering
