Neural Style-Preserving Visual Dubbing

Hyeongwoo Kim, Mohamed Elgharib, Michael Zollhöfer, H.-P. Seidel, Thabo Beeler, Christian Richardt, Christian Theobalt

Research output: Contribution to journal › Article

Abstract

Dubbing is a technique for translating video content from one language to another. However, state-of-the-art visual dubbing techniques directly copy facial expressions from source to target actors without considering identity-specific idiosyncrasies such as a unique type of smile. We present a style-preserving visual dubbing approach from single video inputs, which maintains the signature style of target actors when modifying facial expressions, including mouth motions, to match foreign languages. At the heart of our approach is the concept of motion style, in particular for facial expressions, i.e., the person-specific expression change that is yet another essential factor beyond visual accuracy in face editing applications. Our method is based on a recurrent generative adversarial network that captures the spatiotemporal co-activation of facial expressions, and enables generating and modifying the facial expressions of the target actor while preserving their style. We train our model with unsynchronized source and target videos in an unsupervised manner using cycle-consistency and mouth expression losses, and synthesize photorealistic video frames using a layered neural face renderer. Our approach generates temporally coherent results, and handles dynamic backgrounds. Our results show that our dubbing approach maintains the idiosyncratic style of the target actor better than previous approaches, even for widely differing source and target actors.
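For readers who want a concrete picture of the training setup summarized in the abstract, the sketch below outlines in PyTorch how a pair of recurrent generators could be trained on unsynchronized expression-parameter sequences with adversarial, cycle-consistency, and mouth expression terms. This is a minimal illustration under stated assumptions, not the authors' implementation: the module names, layer sizes, the MOUTH coefficient indices, and the loss weights are placeholders, and the photorealistic rendering stage (the layered neural face renderer) is not covered here.

import torch
import torch.nn as nn

class StyleTranslator(nn.Module):
    """Recurrent generator: maps a sequence of source expression
    coefficients (e.g. blendshape weights) to target-style coefficients.
    Layer sizes and architecture are placeholders, not the paper's."""
    def __init__(self, dim=64, hidden=256):
        super().__init__()
        self.rnn = nn.GRU(dim, hidden, batch_first=True)
        self.out = nn.Linear(hidden, dim)

    def forward(self, x):            # x: (batch, frames, dim)
        h, _ = self.rnn(x)
        return self.out(h)           # same shape as x

class SequenceDiscriminator(nn.Module):
    """Scores whether an expression sequence looks like a given actor,
    pushing the generator to reproduce that actor's co-activation style."""
    def __init__(self, dim=64, hidden=256):
        super().__init__()
        self.rnn = nn.GRU(dim, hidden, batch_first=True)
        self.out = nn.Linear(hidden, 1)

    def forward(self, x):
        h, _ = self.rnn(x)
        return self.out(h[:, -1])    # one realism logit per sequence

# Two translation directions (source->target and target->source), trained
# CycleGAN-style because the source and target videos are unsynchronized.
G_st, G_ts = StyleTranslator(), StyleTranslator()
D_t, D_s = SequenceDiscriminator(), SequenceDiscriminator()
opt_g = torch.optim.Adam(list(G_st.parameters()) + list(G_ts.parameters()), lr=2e-4)
adv = nn.BCEWithLogitsLoss()
l1 = nn.L1Loss()

MOUTH = slice(0, 16)  # hypothetical indices of mouth-related coefficients

def generator_step(src_seq, tgt_seq, lam_cyc=10.0, lam_mouth=1.0):
    """One generator update on unsynchronized source/target expression clips.
    The discriminator update (real vs. translated sequences) is omitted."""
    fake_t = G_st(src_seq)           # source performance in the target's style
    fake_s = G_ts(tgt_seq)
    # Adversarial terms: translated sequences should fool both discriminators.
    pred_t, pred_s = D_t(fake_t), D_s(fake_s)
    loss_adv = adv(pred_t, torch.ones_like(pred_t)) + adv(pred_s, torch.ones_like(pred_s))
    # Cycle consistency: translating there and back should recover the input.
    loss_cyc = l1(G_ts(fake_t), src_seq) + l1(G_st(fake_s), tgt_seq)
    # Mouth expression term: keep the spoken mouth motion of the dubbed track.
    loss_mouth = l1(fake_t[..., MOUTH], src_seq[..., MOUTH])
    loss = loss_adv + lam_cyc * loss_cyc + lam_mouth * loss_mouth
    opt_g.zero_grad()
    loss.backward()
    opt_g.step()
    return loss.item()

# Stand-in usage with random data: 4 clips of 120 frames, 64 coefficients each.
src = torch.randn(4, 120, 64)
tgt = torch.randn(4, 120, 64)
print(generator_step(src, tgt))

In this reading of the abstract, the discriminators enforce the target actor's expression style, the cycle-consistency term keeps the translated sequence faithful to the source performance, and the mouth expression term prevents the cycle from discarding the lip motion needed for the dubbed speech.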
Original language: English
Article number: 178
Number of pages: 13
Journal: ACM Transactions on Graphics
Volume: 38
Issue number: 6
Early online date: 6 Sep 2019
DOIs: https://doi.org/10.1145/3355089.3356500
Publication status: Published - 17 Nov 2019
Event: SIGGRAPH Asia 2019 - Brisbane, Australia
Duration: 17 Nov 2019 - 20 Nov 2019
https://sa2019.siggraph.org/

Cite this

Kim, H., Elgharib, M., Zollhöfer, M., Seidel, H-P., Beeler, T., Richardt, C., & Theobalt, C. (2019). Neural Style-Preserving Visual Dubbing. ACM Transactions on Graphics, 38(6), [178]. https://doi.org/10.1145/3355089.3356500

Neural Style-Preserving Visual Dubbing. / Kim, Hyeongwoo; Elgharib, Mohamed; Zollhöfer, Michael; Seidel, H.-P.; Beeler, Thabo; Richardt, Christian; Theobalt, Christian.

In: ACM Transactions on Graphics, Vol. 38, No. 6, 178, 17.11.2019.

Research output: Contribution to journal › Article

Kim, H, Elgharib, M, Zollhöfer, M, Seidel, H-P, Beeler, T, Richardt, C & Theobalt, C 2019, 'Neural Style-Preserving Visual Dubbing', ACM Transactions on Graphics, vol. 38, no. 6, 178. https://doi.org/10.1145/3355089.3356500
Kim H, Elgharib M, Zollhöfer M, Seidel H-P, Beeler T, Richardt C et al. Neural Style-Preserving Visual Dubbing. ACM Transactions on Graphics. 2019 Nov 17;38(6):178. https://doi.org/10.1145/3355089.3356500
Kim, Hyeongwoo ; Elgharib, Mohamed ; Zollhöfer, Michael ; Seidel, H.-P. ; Beeler, Thabo ; Richardt, Christian ; Theobalt, Christian. / Neural Style-Preserving Visual Dubbing. In: ACM Transactions on Graphics. 2019 ; Vol. 38, No. 6.
@article{eb5501e3f62b437abd2a0d291bf689b7,
title = "Neural Style-Preserving Visual Dubbing",
abstract = "Dubbing is a technique for translating video content from one language to another. However, state-of-the-art visual dubbing techniques directly copy facial expressions from source to target actors without considering identity-specific idiosyncrasies such as a unique type of smile. We present a style-preserving visual dubbing approach from single video inputs, which maintains the signature style of target actors when modifying facial expressions, including mouth motions, to match foreign languages. At the heart of our approach is the concept of motion style, in particular for facial expressions, i.e., the person-specific expression change that is yet another essential factor beyond visual accuracy in face editing applications. Our method is based on a recurrent generative adversarial network that captures the spatiotemporal co-activation of facial expressions, and enables generating and modifying the facial expressions of the target actor while preserving their style. We train our model with unsynchronized source and target videos in an unsupervised manner using cycle-consistency and mouth expression losses, and synthesize photorealistic video frames using a layered neural face renderer. Our approach generates temporally coherent results, and handles dynamic backgrounds. Our results show that our dubbing approach maintains the idiosyncratic style of the target actor better than previous approaches, even for widely differing source and target actors.",
author = "Hyeongwoo Kim and Mohamed Elgharib and Michael Zollh{\"o}fer and H.-P. Seidel and Thabo Beeler and Christian Richardt and Christian Theobalt",
year = "2019",
month = "11",
day = "17",
doi = "10.1145/3355089.3356500",
language = "English",
volume = "38",
journal = "ACM Transactions on Graphics",
issn = "0730-0301",
publisher = "Association for Computing Machinery",
number = "6",

}

TY - JOUR

T1 - Neural Style-Preserving Visual Dubbing

AU - Kim, Hyeongwoo

AU - Elgharib, Mohamed

AU - Zollhöfer, Michael

AU - Seidel, H.-P.

AU - Beeler, Thabo

AU - Richardt, Christian

AU - Theobalt, Christian

PY - 2019/11/17

Y1 - 2019/11/17

AB - Dubbing is a technique for translating video content from one language to another. However, state-of-the-art visual dubbing techniques directly copy facial expressions from source to target actors without considering identity-specific idiosyncrasies such as a unique type of smile. We present a style-preserving visual dubbing approach from single video inputs, which maintains the signature style of target actors when modifying facial expressions, including mouth motions, to match foreign languages. At the heart of our approach is the concept of motion style, in particular for facial expressions, i.e., the person-specific expression change that is yet another essential factor beyond visual accuracy in face editing applications. Our method is based on a recurrent generative adversarial network that captures the spatiotemporal co-activation of facial expressions, and enables generating and modifying the facial expressions of the target actor while preserving their style. We train our model with unsynchronized source and target videos in an unsupervised manner using cycle-consistency and mouth expression losses, and synthesize photorealistic video frames using a layered neural face renderer. Our approach generates temporally coherent results, and handles dynamic backgrounds. Our results show that our dubbing approach maintains the idiosyncratic style of the target actor better than previous approaches, even for widely differing source and target actors.

U2 - 10.1145/3355089.3356500

DO - 10.1145/3355089.3356500

M3 - Article

VL - 38

JO - ACM Transactions on Graphics

JF - ACM Transactions on Graphics

SN - 0730-0301

IS - 6

M1 - 178

ER -