TY - GEN
T1 - Towards Generating Ultra-High Resolution Talking-Face Videos with Lip Synchronization
AU - Gupta, Anchit
AU - Mukhopadhyay, Rudrabha
AU - Balachandra, Sindhu
AU - Khan, Faizan Farooq
AU - Namboodiri, Vinay P.
AU - Jawahar, C. V.
PY - 2023/1/7
Y1 - 2023/1/7
N2 - Talking-face video generation methods have achieved state-of-the-art results in synthesizing videos with lip synchronization. However, most previous works deal with low-resolution talking-face videos (up to 256×256 pixels), and generating extremely high-resolution videos remains a challenge. We take a giant leap in this work and propose a novel method to synthesize talking-face videos at resolutions as high as 4K. Our task presents several key challenges: (i) scaling existing methods to such high resolutions is constrained both by compute and by the availability of very high-resolution datasets; (ii) the synthesized videos need to be spatially and temporally coherent. The sheer number of pixels the model must generate while maintaining temporal consistency at the video level makes this task non-trivial, and it has never been attempted before in the literature. To address these issues, we propose, for the first time, to train the lip-sync generator in a compact Vector Quantized (VQ) space. Our core idea of encoding faces in a compact 16×16 representation allows us to model high-resolution videos. In our framework, we learn the lip movements in the quantized space on the newly collected 4K Talking Faces (4KTF) dataset. Our approach is speaker-agnostic and can handle various languages and voices. We benchmark our technique against several competitive works and show that we can generate a remarkable 64× more pixels than the current state of the art. Our supplementary demo video presents additional qualitative results, comparisons, and several real-world applications, such as professional movie editing, enabled by our model.
KW - Algorithms: Vision + language and/or other modalities
KW - Commercial/retail
KW - Education
UR - http://www.scopus.com/inward/record.url?scp=85149013501&partnerID=8YFLogxK
U2 - 10.1109/WACV56688.2023.00518
DO - 10.1109/WACV56688.2023.00518
M3 - Chapter in a published conference proceeding
AN - SCOPUS:85149013501
T3 - Proceedings - 2023 IEEE Winter Conference on Applications of Computer Vision, WACV 2023
SP - 5198
EP - 5207
BT - Proceedings - 2023 IEEE Winter Conference on Applications of Computer Vision, WACV 2023
PB - IEEE
T2 - 23rd IEEE/CVF Winter Conference on Applications of Computer Vision, WACV 2023
Y2 - 3 January 2023 through 7 January 2023
ER -