06: Facial Animation and Rendering [Q&A Session]

  • Full Access Full Access
  • Onsite Student Access Onsite Student Access
  • Virtual Full Access Virtual Full Access

Date/Time: 06 – 17 December 2021
All presentations are available in the virtual platform on-demand.

FreeStyleGAN: Free-view Editable Portrait Rendering with the Camera Manifold

Abstract: Current Generative Adversarial Networks (GANs) produce photorealistic renderings of portrait images. Embedding real images into the latent space of such models enables high-level image editing. While recent methods provide considerable semantic control over the (re-)generated images, they can only generate a limited set of viewpoints and cannot explicitly control the camera. Such 3D camera control is required for 3D virtual and mixed reality applications. In our solution, we use a few images of a face to perform 3D reconstruction, and we introduce the notion of the GAN camera manifold, the key element allowing us to precisely define the range of images that the GAN can reproduce in a stable manner. We train a small face-specific neural implicit representation network to map a captured face to this manifold and complement it with a warping scheme to obtain free-viewpoint novel-view synthesis. We show how our approach - due to its precise camera control - enables the integration of a pre-trained StyleGAN into standard 3D rendering pipelines, allowing e.g., stereo rendering or consistent insertion of faces in synthetic 3D environments. Our solution proposes the first truly free-viewpoint rendering of realistic faces at interactive rates, using only a small number of casual photos as input, while simultaneously allowing semantic editing capabilities, such as facial expression or lighting changes.

Thomas Leimkuehler, INRIA, Université Côte d'Azur, France
George Drettakis, INRIA, Unversité Côte d'Azur, France

Live Speech Portraits: Real-Time Photorealistic Talking-Head Animation

Abstract: To the best of our knowledge, we first present a live system that generates personalized photorealistic talking-head animation only driven by audio signals at over 30 fps. Our system contains three stages. The first stage is a deep neural network that extracts deep audio features along with a manifold projection to project the features to the target person's speech space. In the second stage, we learn facial dynamics and motions from the projected audio features. The predicted motions include head poses and upper body motions, where the former is generated by an autoregressive probabilistic model which models the head pose distribution of the target person. Upper body motions are deduced from the head poses. In the final stage, we generate conditional feature maps from previous predictions and send them with a candidate image set to an image-to-image translation network to synthesize photorealistic renderings. Our method generalizes well to wild audio and successfully synthesizes high-fidelity personalized facial details, e.g., wrinkles, teeth. Our method also allows explicit control of head poses. Extensive qualitative and quantitative evaluations, along with user studies, demonstrate the superiority of our method over state-of-the-art techniques.

Yuanxun Lu, Nanjing University, China
Jinxiang Chai, Xmov, China
Xun Cao, Nanjing University, China

Rendering with Style: Combining Traditional and Neural Approaches for High-Quality Face Rendering

Abstract: For several decades, researchers have been advancing techniques for creating and rendering 3D digital faces, where a lot of the effort has gone into geometry and appearance capture, modeling and rendering techniques. This body of research work has largely focused on facial skin, with much less attention devoted to peripheral components like hair, eyes and the interior of the mouth. As a result, even with the best technology for facial capture and rendering, in most high-end productions a lot of artist time is still spent modeling the missing components and fine-tuning the rendering parameters to combine everything into photo-real digital renders. In this work we propose to combine incomplete, high-quality renderings showing only facial skin with recent methods for neural rendering of faces, in order to automatically and seamlessly create photo-realistic full-head portrait renders from captured data without the need for artist intervention. Our method begins with traditional face rendering, where the skin is rendered with the desired appearance, expression, viewpoint, and illumination. These skin renders are then projected into the latent space of a pre-trained neural network that can generate arbitrary photo-real face images (StyleGAN2). The result is a sequence of realistic face images that match the identity and appearance of the 3D character at the skin level, but is completed naturally with synthesized hair, eyes, inner mouth and surroundings. Notably, we present the first method for {\em multi-frame consistent} projection into this latent space, allowing photo-realistic rendering and preservation of the identity of the digital human over an animated performance sequence, which can depict different expressions, lighting conditions and viewpoints. Our method can be used in new face rendering pipelines and, importantly, in other deep learning applications that require large amounts of realistic training data with ground-truth 3D geometry, appearance maps, lighting, and viewpoint.

Prashanth Chandran, ETH Zürich, DisneyResearch|Studios, Switzerland
Sebastian Winberg, DisneyResearch|Studios, Switzerland
Gaspard Zoss, ETH Zürich, DisneyResearch|Studios, Switzerland
Jérémy Riviere, DisneyResearch|Studios, Switzerland
Markus Gross, ETH Zürich, DisneyResearch|Studios, Switzerland
Paulo Gotardo, DisneyResearch|Studios, Switzerland
Derek Bradley, DisneyResearch|Studios, Switzerland

Semi-supervised video-driven facial animation transfer for production

Abstract: We propose a simple algorithm for automatic transfer of facial expressions, from videos to a 3D character, as well as between distinct 3D characters through their rendered animations. Our method begins by learning a common, semantically-consistent latent representation for the different input image domains using an unsupervised image-to-image translation model. It subsequently learns, in a supervised manner, a linear mapping from the character images' encoded representation to the animation coefficients. At inference time, given the source domain (i.e., actor footage), it regresses the corresponding animation coefficients for the target character. Expressions are automatically remapped between the source and target identities despite differences in physiognomy. We show how our technique can be used in the context of markerless motion capture with controlled lighting conditions, for one actor and for multiple actors. Additionally, we show how it can be used to automatically transfer facial animation between distinct characters without consistent mesh parameterization and without engineered geometric priors. We compare our method with standard approaches used in production and with recent state-of-the-art models on single camera face tracking.

Lucio Moser, Digital Domain, Canada
Chinyu Chien, Digital Domain, Taiwan
Mark Williams, Digital Domain, Canada
Jose Serra, Digital Domain, Canada
Darren Hendler, Digital Domain, United States of America
Doug Roble, Digital Domain, United States of America