X-Avatar: Expressive Human Avatars

ETH Zürich, Microsoft
CVPR 2023

X-Avatar is an animatable implicit human avatar model capable of capturing human body pose, hand pose, facial expressions, and appearance. X-Avatar can be created from 3D scans or RGB-D images.

Abstract

We present X-Avatar, a novel avatar model that captures the full expressiveness of digital humans to bring about life-like experiences in telepresence, AR/VR and beyond. Our method models bodies, hands, facial expressions and appearance in a holistic fashion and can be learned from either full 3D scans or RGB-D data.

To achieve this, we propose a part-aware learned forward skinning module that can be driven by the parameter space of SMPL-X, allowing for expressive animation of X-Avatars. To efficiently learn the neural shape and deformation fields, we propose novel part-aware sampling and initialization strategies. This leads to higher-fidelity results, especially for smaller body parts, while maintaining efficient training despite the increased number of articulated bones. To capture the appearance of the avatar with high-frequency details, we extend the geometry and deformation fields with a texture network that is conditioned on pose, facial expression, geometry and the normals of the deformed surface. We show experimentally that our method outperforms strong baselines on the animation task in both data domains, quantitatively and qualitatively.
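As a rough illustration of the two ideas named above, the sketch below shows generic forward linear blend skinning (warping canonical points to posed space) and one possible part-balanced sampling scheme. It is not the X-Avatar implementation: in the paper the skinning weights come from a learned field, whereas here they are passed in as a plain array, and the function names `forward_lbs` and `balanced_part_sample`, the array shapes, and the equal-count-per-part rule are all assumptions made for this example.

```python
import numpy as np


def forward_lbs(points_c, weights, bone_transforms):
    """Warp canonical points to posed space with linear blend skinning.

    points_c:        (N, 3) points in canonical space
    weights:         (N, B) skinning weights, each row summing to 1
                     (assumed given here; learned in the actual method)
    bone_transforms: (B, 4, 4) homogeneous transform per bone
    """
    n = points_c.shape[0]
    homo = np.concatenate([points_c, np.ones((n, 1))], axis=1)  # (N, 4)
    # Blend the bone transforms per point, giving one (4, 4) matrix each.
    blended = np.einsum("nb,bij->nij", weights, bone_transforms)
    posed = np.einsum("nij,nj->ni", blended, homo)
    return posed[:, :3]


def balanced_part_sample(part_labels, per_part, rng):
    """Draw the same number of point indices from every part.

    Hypothetical scheme: small parts (e.g. hands, face) receive as many
    samples as large ones, instead of a share proportional to their size.
    Parts with fewer than `per_part` points are sampled with replacement.
    """
    picks = []
    for part in np.unique(part_labels):
        idx = np.flatnonzero(part_labels == part)
        picks.append(rng.choice(idx, size=per_part, replace=len(idx) < per_part))
    return np.concatenate(picks)


if __name__ == "__main__":
    # Two bones: identity and a translation of 2 units along x.
    transforms = np.stack([np.eye(4), np.eye(4)])
    transforms[1, 0, 3] = 2.0
    pts = np.zeros((1, 3))
    w = np.array([[0.5, 0.5]])
    print(forward_lbs(pts, w, transforms))

    # 100 "body" points and 5 "hand" points, 10 samples from each part.
    labels = np.array([0] * 100 + [1] * 5)
    sample = balanced_part_sample(labels, 10, np.random.default_rng(0))
    print(np.bincount(labels[sample]))
```

With equal weights on the two bones, the point at the origin is moved by the average of the two transforms. The sampler returns the same number of indices for the small part as for the large one, which is the intuition behind giving hands and faces more attention during training.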

To facilitate future research on expressive avatars, we contribute a new dataset, called X-Humans, containing 233 sequences of high-quality textured scans from 20 participants, totalling 35,500 data frames.

Method

Dataset

  • 20 subjects, 233 sequences, 35,427 frames
  • High-quality textured scans, SMPL[-X] registrations
  • Body pose + hand gesture + facial expression
  • Various clothing types, hair styles, genders and ages

Results

    Comparison

    Our method recovers better hand articulation and facial expression than the baselines on the animation task.

    Animation

    X-Avatars can be learned from multiple modalities.

    Scan Version

    Upper: 3D scans used for training. Lower: Avatars driven by testing poses.


    RGB-D Version

    Upper: RGB-D data used for training. Lower: Avatars driven by testing poses.


    More Qualitative Results

    X-Avatars animated by motions extracted by PyMAF-X from monocular RGB videos.


    BibTeX

    @article{shen2023xavatar,
      author    = {Shen, Kaiyue and Guo, Chen and Kaufmann, Manuel and Zarate, Juan and Valentin, Julien and Song, Jie and Hilliges, Otmar},
      title     = {X-Avatar: Expressive Human Avatars},
      journal   = {Computer Vision and Pattern Recognition (CVPR)},
      year      = {2023},
    }