TY - GEN
T1 - Visual speech enhancement without a real visual stream
AU - Hegde, Sindhu B.
AU - Prajwal, K. R.
AU - Mukhopadhyay, Rudrabha
AU - Namboodiri, Vinay
AU - Jawahar, C. V.
PY - 2021/6/14
Y1 - 2021/6/14
N2 - In this work, we re-think the task of speech enhancement in unconstrained real-world environments. Current state-of-the-art methods use only the audio stream and are limited in their performance in a wide range of real-world noises. Recent works using lip movements as additional cues improve the quality of generated speech over "audio-only" methods. But these methods cannot be used for several applications where the visual stream is unreliable or completely absent. We propose a new paradigm for speech enhancement by exploiting recent breakthroughs in speech-driven lip synthesis. Using one such model as a teacher network, we train a robust student network to produce accurate lip movements that mask away the noise, thus acting as a "visual noise filter". The intelligibility of the speech enhanced by our pseudo-lip approach is comparable (< 3% difference) to the case of using real lips. This implies that we can exploit the advantages of using lip movements even in the absence of a real video stream. We rigorously evaluate our model using quantitative metrics as well as human evaluations. Additional ablation studies and a demo video on our website containing qualitative comparisons and results clearly illustrate the effectiveness of our approach.
AB - In this work, we re-think the task of speech enhancement in unconstrained real-world environments. Current state-of-the-art methods use only the audio stream and are limited in their performance in a wide range of real-world noises. Recent works using lip movements as additional cues improve the quality of generated speech over "audio-only" methods. But these methods cannot be used for several applications where the visual stream is unreliable or completely absent. We propose a new paradigm for speech enhancement by exploiting recent breakthroughs in speech-driven lip synthesis. Using one such model as a teacher network, we train a robust student network to produce accurate lip movements that mask away the noise, thus acting as a "visual noise filter". The intelligibility of the speech enhanced by our pseudo-lip approach is comparable (< 3% difference) to the case of using real lips. This implies that we can exploit the advantages of using lip movements even in the absence of a real video stream. We rigorously evaluate our model using quantitative metrics as well as human evaluations. Additional ablation studies and a demo video on our website containing qualitative comparisons and results clearly illustrate the effectiveness of our approach.
UR - http://www.scopus.com/inward/record.url?scp=85116141220&partnerID=8YFLogxK
U2 - 10.1109/WACV48630.2021.00197
DO - 10.1109/WACV48630.2021.00197
M3 - Chapter in a published conference proceeding
AN - SCOPUS:85116141220
T3 - Proceedings - IEEE Winter Conference on Applications of Computer Vision
SP - 1925
EP - 1934
BT - 2021 IEEE Winter Conference on Applications of Computer Vision (WACV)
PB - IEEE
CY - U.S.A.
T2 - 2021 IEEE Winter Conference on Applications of Computer Vision, WACV 2021
Y2 - 5 January 2021 through 9 January 2021
ER -