Projects per year
Abstract
Lipreading is a difficult task, even for humans, and synthesizing the original speech waveform from lip movements makes the problem even more challenging. To this end, we present a deep learning framework that can be trained end-to-end to learn the mapping between auditory and visual signals. In particular, our goal in this paper is to design a model that can efficiently predict the speech signal for a given silent talking-face video. The proposed framework generates a speech signal by mapping the video frames into a sequence of feature vectors. However, unlike some recent methods that adopt a sequence-to-sequence approach for translation from the frame stream to the audio stream, we pose it as an analogy learning problem between the two modalities, in which each frame is mapped to the corresponding speech segment via a deep audio-visual analogy framework. We predict a plausible audio stream by training adversarially against a discriminator network. Our experiments, both qualitative and quantitative, on the publicly available GRID dataset show that the proposed method outperforms prior work on existing evaluation benchmarks. Our user studies confirm that our generated samples are more natural and closely match the ground-truth speech signal.
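The frame-to-segment mapping described in the abstract can be made concrete with a small sketch. This is not the paper's code: the frame rate and audio sample rate are assumptions for illustration (a common setup would be 25 fps video with 16 kHz audio, giving 640 audio samples per video frame), and the function name is hypothetical.

```python
# Hypothetical sketch of the per-frame alignment a frame-to-speech model
# relies on: each silent video frame corresponds to one fixed-length
# segment of the predicted waveform.
# Assumed rates (not from the paper): 25 fps video, 16 kHz audio.

def frame_audio_alignment(n_frames, fps=25, sample_rate=16000):
    """Return (start, end) sample indices of the audio segment for each frame."""
    samples_per_frame = sample_rate // fps  # 640 with the assumed rates
    return [(i * samples_per_frame, (i + 1) * samples_per_frame)
            for i in range(n_frames)]

# A 3-second clip at 25 fps has 75 frames; the predicted waveform would
# concatenate one 640-sample segment per frame.
segments = frame_audio_alignment(75)
```

Under these assumed rates, the model's output waveform for a clip is simply the concatenation of the per-frame segments, which is what lets each frame be mapped to its corresponding speech segment independently.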
Original language | English |
---|---|
Title of host publication | 2022 IEEE International Conference on Acoustics, Speech, and Signal Processing, ICASSP 2022 - Proceedings |
Publisher | IEEE |
Pages | 8042-8046 |
Number of pages | 5 |
Volume | 2022 |
ISBN (Electronic) | 9781665405409 |
ISBN (Print) | 978-1-6654-0541-6 |
DOIs | |
Publication status | Published - 27 Apr 2022 |
Event | 47th IEEE International Conference on Acoustics, Speech, and Signal Processing, ICASSP 2022 - Virtual, Online, Singapore Duration: 23 May 2022 → 27 May 2022 |
Publication series
Name | ICASSP, IEEE International Conference on Acoustics, Speech and Signal Processing - Proceedings |
---|---|
Volume | 2022-May |
ISSN (Print) | 1520-6149 |
Conference
Conference | 47th IEEE International Conference on Acoustics, Speech, and Signal Processing, ICASSP 2022 |
---|---|
Country/Territory | Singapore |
City | Virtual, Online |
Period | 23/05/22 → 27/05/22 |
Keywords
- AudioVisual analogy
- self-supervised learning
- Speech prediction
ASJC Scopus subject areas
- Software
- Signal Processing
- Electrical and Electronic Engineering
Fingerprint
Dive into the research topics of 'LEARNING TO PREDICT SPEECH IN SILENT VIDEOS VIA AUDIOVISUAL ANALOGY'. Together they form a unique fingerprint.
Centre for the Analysis of Motion, Entertainment Research and Applications (CAMERA) - 2.0
Cosker, D., Bilzon, J., Campbell, N., Cazzola, D., Colyer, S., Lutteroth, C., McGuigan, P., O'Neill, E., Petrini, K., Proulx, M. & Yang, Y.
Engineering and Physical Sciences Research Council
1/11/20 → 31/10/25
Project: Research council
-
Centre for the Analysis of Motion, Entertainment Research and Applications (CAMERA)
Cosker, D., Bilzon, J., Campbell, N., Cazzola, D., Colyer, S., Fincham Haines, T., Hall, P., Kim, K. I., Lutteroth, C., McGuigan, P., O'Neill, E., Richardt, C., Salo, A., Seminati, E., Tabor, A. & Yang, Y.
Engineering and Physical Sciences Research Council
1/09/15 → 28/02/21
Project: Research council