HDHumans: A Hybrid Approach for High-fidelity Digital
Humans
MARC HABERMANN, Max Planck Institute for Informatics, Germany
LINGJIE LIU, Max Planck Institute for Informatics, Germany
WEIPENG XU, Meta Reality Labs, United States
GERARD PONS-MOLL, University of Tuebingen, Germany
MICHAEL ZOLLHOEFER, Meta Reality Labs, United States
CHRISTIAN THEOBALT, Max Planck Institute for Informatics, Germany
Fig. 1. We propose a method for photo-realistic human synthesis given an arbitrary camera pose and a
potentially unseen skeletal motion. Our method also handles loose types of clothing such as skirts, since we
jointly learn the dense and space-time coherent deforming geometry of the human surface (including the
dynamic clothing) along with a neural radiance field.
Photo-real digital human avatars are of enormous importance in graphics, as they enable immersive communication over the globe, improve gaming and entertainment experiences, and can be particularly beneficial for AR and VR settings. However, current avatar generation approaches either fall short in high-fidelity novel view synthesis, generalization to novel motions, reproduction of loose clothing, or they cannot render characters at the high resolution offered by modern displays. To this end, we propose HDHumans, which is the first method for HD human character synthesis that jointly produces an accurate and temporally coherent 3D deforming surface and highly photo-realistic images of arbitrary novel views and of motions not seen at training time. At the technical core, our method tightly integrates a classical deforming character template with neural radiance fields (NeRF). Our method is carefully designed to achieve a synergy between classical surface deformation and a NeRF. First, the template guides the NeRF, which allows synthesizing novel views of a highly dynamic and articulated character and even enables the synthesis of novel motions. Second, we also leverage the dense point clouds resulting from the NeRF to further improve the deforming surface via 3D-to-3D supervision. We outperform the state of the art quantitatively and qualitatively in terms of synthesis quality and resolution, as well as the quality of 3D surface reconstruction.
Authors’ addresses: Marc Habermann, Max Planck Institute for Informatics, Germany, mhaberma@mpi-inf.mpg.de; Lingjie
Liu, Max Planck Institute for Informatics, Germany, lliu@mpi-inf.mpg.de; Weipeng Xu, Meta Reality Labs, United States,
xuweipeng@meta.com; Gerard Pons-Moll, University of Tuebingen, Germany, gerard.pons-moll@uni-tuebingen.de; Michael
Zollhoefer, Meta Reality Labs, United States, zollhoefer@meta.com; Christian Theobalt, Max Planck Institute for Informatics,
Germany, theobalt@mpi-inf.mpg.de.
Permission to make digital or hard copies of part or all of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for third-party components of this work must be honored. For all other uses, contact the owner/author(s).
©2023 Copyright held by the owner/author(s).
2577-6193/2023/8-ART
https://doi.org/10.1145/3606927
CCS Concepts: • Computing methodologies → Computer vision; Rendering.
Additional Key Words and Phrases: human synthesis, neural synthesis, human modeling, human performance
capture
ACM Reference Format:
Marc Habermann, Lingjie Liu, Weipeng Xu, Gerard Pons-Moll, Michael Zollhoefer, and Christian Theobalt.
2023. HDHumans: A Hybrid Approach for High-fidelity Digital Humans. Proc. ACM Comput. Graph. Interact.
Tech. 6, 2 (August 2023), 24 pages. https://doi.org/10.1145/3606927
1 INTRODUCTION
Photo-realistic synthesis of digital humans is a very important research topic in graphics and
computer vision. Especially with the recent development of VR and AR headsets, it has become even more important, since photo-real human avatars can be used to populate virtual scenes or augment real ones. The classical approach to achieve this goal would be the manual creation of human
avatars by means of 3D modeling including meshing, texturing, designing material properties, and
many more manual steps. However, this process is not only tedious and time-consuming, but it
also requires expert knowledge, preventing these techniques from being adopted by non-expert
users. A promising alternative is to create such digital human avatars from video captures of real
humans. The goal of our approach is to create controllable and highly photo-realistic characters at
high resolution solely from multi-view video.
This idea has already been the subject of previous research, which can be broadly categorized based on the employed representation. Some approaches explicitly model the human's surface as a mesh and employ texture retrieval techniques [Casas et al. 2014; Xu et al. 2011] or deep learning [Habermann et al. 2021] to generate realistic appearance effects. However, the synthesis quality is still limited and the recovered surface deformations are of insufficient quality because they are driven purely by image-based supervision. Other works solely synthesize humans in image space [Chan et al. 2019; Liu et al. 2020b, 2019b]. These approaches, however, suffer from 3D inconsistency when changing the viewpoint. Recently, first attempts have also been proposed to combine a neural radiance field with a human body model [Chen et al. 2021; Liu et al. 2021; Peng et al. 2021a,b; Xu et al. 2021]. These works have demonstrated that a classical mesh-based surface can guide a neural radiance field (NeRF) [Mildenhall et al. 2020] for image synthesis. However, since they rely on a human body model or skeleton representation, they do not model the underlying deforming surface well. In consequence, they only work for subjects wearing tight clothing. In stark contrast, we for the first time demonstrate how a NeRF can be conditioned on a densely deforming template, and we even show that improvements can be achieved in the other direction as well, where the NeRF guides the mesh deformation.
In contrast to prior work, we propose a tightly coupled hybrid representation consisting of a classical deforming surface mesh and a neural radiance field defined in a thin shell around the surface. On the one hand, the surface mesh guides the learning of the neural radiance field, enables the method to handle large motions and loose clothing, and leads to a more efficient sampling strategy along the camera rays. On the other hand, the radiance field achieves a higher synthesis quality than pure surface-based approaches, produces explicit 3D constraints for better supervision of explicit surface deformation networks, and helps in overcoming local minima due to the local nature of color gradients in image space. This tight coupling between explicit surface deformation and neural radiance fields creates a two-way synergy between both representations. We are able to jointly capture the detailed underlying deforming surface of the clothed human and also employ this surface to drive a neural radiance field, which captures high-frequency detail and texture. More precisely, our method takes skeletal motion as input and predicts a motion-dependent deforming
surface as well as a motion- and view-dependent neural radiance field that is parameterized in a thin shell around the surface. In this way, the deforming surface acts as an initializer for the sampling and the feature accumulation of the neural radiance field, making it significantly (6×) more efficient and thus enabling training on 4K multi-view videos. The deforming surface mesh and the neural radiance field are tightly coupled during training such that the mesh drives the neural radiance field, making it efficient and robust to dynamic changes. Furthermore, not only is the neural radiance field improved based on the tracked surface mesh, but it can also be used to refine the surface mesh, since the neural radiance field drives the mesh towards reconstructing finer-scale detail, such as cloth wrinkles, which is difficult to capture with image-based supervision alone. Thus, a two-way synergy between the employed classical and neural scene representations is created that leads to significantly improved fidelity. Compared to previous work, our approach not only reconstructs deforming surface geometry of higher quality, but also renders human images at much higher fidelity (see Figure 1). In summary, our technical contributions are:
• A novel approach for high-fidelity character synthesis that enables novel view and motion synthesis at a very high resolution, which cannot be achieved by previous work.
• A synergistic integration of a classical mesh-based and a neural scene representation for virtual humans that produces higher-quality geometry, motion, and appearance than any of the two components in isolation.
• To the best of our knowledge, this is the first approach that tightly couples a deforming explicit mesh and a NeRF, enabling photo-realistic rendering of neural humans wearing loose clothing.
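To make the mesh-guided sampling described above concrete, the following minimal sketch restricts NeRF ray samples to a thin shell around the posed template, which is what yields the reported sampling efficiency. It is an illustration under our own assumptions, not the authors' implementation: the shell half-width delta, the use of the trimesh library for ray casting, and all names here are ours.

```python
# Minimal sketch of mesh-guided thin-shell sampling for a NeRF.
# Illustrative only; shell width and library choice are assumptions.
import numpy as np
import trimesh

def thin_shell_samples(mesh: trimesh.Trimesh,
                       origin: np.ndarray,     # (3,) camera center
                       direction: np.ndarray,  # (3,) unit ray direction
                       n_samples: int = 16,
                       delta: float = 0.05):   # shell half-width (assumed)
    """Sample points only inside a thin shell around the surface hit.

    Instead of sampling the full ray as in a vanilla NeRF, the ray is
    first intersected with the deforming template mesh and samples are
    drawn in [t_hit - delta, t_hit + delta].
    """
    locations, _, _ = mesh.ray.intersects_location(
        origin[None], direction[None])
    if len(locations) == 0:
        return None  # ray misses the character; skip it entirely
    # Distance along the (unit) ray to the closest intersection.
    t_hit = np.min(np.linalg.norm(locations - origin, axis=1))
    t_vals = np.linspace(t_hit - delta, t_hit + delta, n_samples)
    return origin + t_vals[:, None] * direction  # (n_samples, 3)
```

In a full pipeline, the returned points would be fed, together with motion-dependent features attached to the template, to the radiance field network.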
2 RELATED WORK
Mesh-based synthesis. Photo-realistic image synthesis of controllable characters is challenging due to the difficulty of capturing or predicting high-quality pose-dependent geometry deformation and appearance. Some works [Carranza et al. 2003; Collet et al. 2015; Hilsmann et al. 2020; Li et al. 2014; Zitnick et al. 2004] focus on free-viewpoint replay of the captured human performance sequence. Other works [Casas et al. 2014; Volino et al. 2014; Xu et al. 2011] aim at the more challenging task of photo-realistic free-viewpoint synthesis for new body poses. However, these methods need several seconds to generate a single frame. Casas et al. [2014] and Volino et al. [2014] accelerate the image synthesis process with a temporally coherent layered representation of appearance in texture space. These classical methods struggle to produce high-quality results due to the coarse geometric proxy, and have limited generalizability to new poses and viewpoints that are very different from those in the database. To improve the synthesis quality and generalizability, Habermann et al. [2021] propose a method for learning a 3D virtual character model with pose-dependent geometry deformations and pose- and view-dependent textures in a weakly supervised way from multi-view videos. While great improvements have been made, some fine-scale details are missing in the results because of the difficulty of optimizing deforming polygon meshes with only images as supervision. In this work, we observed that deforming implicit fields is more flexible (e.g., no regularization terms are needed to preserve the mesh topology), thus leading to more stable and efficient training. However, the rendering of implicit fields is time-consuming, and editing implicit representations is much more difficult than editing explicit representations, e.g., meshes. Hence, our method unifies implicit fields and explicit polygon meshes, joining the advantages of both worlds.
Image-based synthesis. GANs have achieved great progress in image synthesis in recent years. To close the gap between the rendering of a coarse geometric proxy and realistic renderings, many works formulate the mapping from the coarse rendering to a photo-realistic rendering as an image-to-image translation problem. These works take the renderings of a skeleton [Chan et al. 2019; Kappel et al. 2020; Li et al. 2019; Pumarola et al. 2018; Shysheya et al. 2019; Zhu et al. 2019], a dense mesh [Grigor’ev et al. 2019; Liu et al. 2020b, 2019b,a; Neverova et al. 2018; Prokudin et al. 2021; Raj et al. 2021; Sarkar et al. 2020; Wang et al. 2018], or a joint position heatmap [Aberman et al. 2019; Ma et al. 2017, 2018] as the input to image-to-image translation and output realistic renderings. While these methods can produce high-quality images from a single view, they are not able to synthesize view-consistent videos when changing camera viewpoints. In contrast, our method directly optimizes the geometry deformations and appearance in 3D space, so it is able to produce temporally and view-consistent photo-realistic animations of characters.
Volume-based and hybrid approaches. Recently, some methods have demonstrated impressive results on novel view synthesis of static scenes by using neural implicit fields [Mildenhall et al. 2020; Niemeyer et al. 2020; Oechsle et al. 2021; Sitzmann et al. 2019; Wang et al. 2021; Yariv et al. 2021, 2020] or hybrid representations [DeVries et al. 2021; Hedman et al. 2021; Liu et al. 2020a; Reiser et al. 2021; Yu et al. 2021] as scene representations. Great efforts have been made to extend neural representations to dynamic scenes. Neural Volumes [Lombardi et al. 2019] and its follow-up work [Wang et al. 2020] use an encoder-decoder network to learn a mapping from reference images to 3D volumes for each frame of the scene, followed by a volume rendering technique to render the scene. Several works extend NeRF [Mildenhall et al. 2020] to dynamic scene modeling with a dedicated deformation network [Park et al. 2020, 2021; Pumarola et al. 2020; Tretschk et al. 2021], scene flow fields [Li et al. 2020], or space-time neural irradiance fields [Xian et al. 2020]. Many works focus on human character modeling. Peng et al. [2021b] and Kwon et al. [2021] assign latent features to the vertices of the SMPL model and use them as anchors to link different frames. Lombardi et al. [2021] introduce a mixture of volume primitives for the efficient rendering of human actors. These methods can only play back a dynamic scene from novel views but are not able to generate images for novel poses. To address this issue, several methods propose articulated implicit representations for human characters. A-NeRF [Su et al. 2021] proposes an articulated NeRF representation based on a human skeleton for human pose refinement. Recent works [Anonymous 2022; Chen et al. 2021; Jiakai et al. 2021; Li et al. 2022; Liu et al. 2021; Noguchi et al. 2021; Peng et al. 2021a; Wang et al. 2022; Xu et al. 2021] present deformable NeRF representations, which unwarp different poses to a shared canonical space with inverse kinematic transformations and residual deformations. Moreover, HumanNeRF [Weng et al. 2022] has shown view synthesis for human characters given only a monocular RGB video for training. Most of these works cannot synthesize pose-dependent dynamic appearance, are not applicable to large-scale datasets that include severe pose variations, and have limited generalizability to new poses. The work most related to our proposed method is Neural Actor [Liu et al. 2021], which uses a texture map as a structure-aware local pose representation to infer dynamic deformation and appearance. In contrast to our method, they only use a human body model as a mesh proxy and thus cannot model characters in loose clothes. Furthermore, they only employ the mesh proxy to guide the warping of the NeRF but do not optimize the mesh. In consequence, this method cannot extract high-quality surface geometry. Further, since the mesh proxy is not very close to the actual surface, it still needs to sample many points around the surface, which prevents training at 4K resolution. Instead, we infer the dense deformation of a template that is assumed to be given, which is more efficient and enables the tracking of loose clothing. More importantly, our recovered NeRF even further refines the template deformations.
3 METHOD
The goal of our approach is to learn a unified representation of a dynamic human from multi-view video, which on the one hand allows synthesizing motion-dependent deforming geometry and on the other hand also enables photo-real synthesis of images displaying the human under novel viewpoints and novel motions. To this end, we propose an end-to-end approach, which solely takes a skeletal motion and a camera pose as input and outputs a posed and deformed mesh as well as the respective photo-real rendering of the human. Figure 2 shows an overview of the proposed method.

Fig. 2. Overview of the proposed approach. Our method takes as input a skeletal motion of the actor and predicts high-quality appearance as well as space-time coherent and deforming geometry.

In the following, some fundamentals are provided (Section 3.1). Then, we introduce our mesh-guided neural radiance field, which allows synthesizing a dynamic performance of the actor from novel views and for unseen motions (Section 3.2). This proposed mesh guidance assumes a highly detailed, accurately tracked, and space-time coherent surface of the human actor. However, we found that previous weakly supervised performance capture approaches [Habermann et al. 2021, 2020] struggle with capturing high-fidelity geometry. At the same time, volume-based surface representations [Mildenhall et al. 2020] seem to recover such geometric details when visualizing their view-dependent point clouds, but they lack space-time coherence, which is essential for the proposed mesh guidance. To overcome this limitation, we propose a NeRF-guided point cloud loss, which further improves the motion-dependent and deformable human mesh model (Section 3.3).
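To illustrate the direction of supervision in such a NeRF-guided point cloud loss, a one-directional Chamfer-style term is sketched below: points extracted from the radiance field pull the deformed template towards fine-scale geometry. This simplified formulation and all names are our own; the paper's exact loss (Section 3.3) may differ.

```python
# Rough sketch of a NeRF-guided 3D-to-3D supervision term; the
# one-directional Chamfer formulation is our own simplification.
import torch

def nerf_to_mesh_loss(nerf_points: torch.Tensor,   # (N, 3) points from the NeRF
                      mesh_vertices: torch.Tensor  # (V, 3) deformed template vertices
                      ) -> torch.Tensor:
    # For every NeRF point, find the closest template vertex and
    # penalize the squared distance, so gradients deform the mesh
    # towards the NeRF geometry (e.g., cloth wrinkles).
    d2 = torch.cdist(nerf_points, mesh_vertices) ** 2  # (N, V)
    return d2.min(dim=1).values.mean()
```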
Data assumptions. For each actor, we employ C calibrated and synchronized cameras to collect a segmented multi-view video of the person performing various types of motions. The skeletal motion that is input to our method is recovered using a markerless motion capture software [TheCaptury 2020]. Finally, we acquire a static textured mesh template of the person using a scanner [Treedys 2020] that is manually rigged to the skeleton. Note that our approach does not assume any 4D geometry in terms of per-frame scans or point clouds as input.
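For concreteness, the per-actor inputs listed above could be gathered as follows; the field names and array shapes are our own guesses for illustration and not part of the method.

```python
# Illustrative summary of the capture data the method assumes per actor;
# all field names and shapes are guesses, not the authors' code.
from dataclasses import dataclass
import numpy as np

@dataclass
class ActorCapture:
    frames: np.ndarray           # (F, C, H, W, 4) segmented multi-view video
    intrinsics: np.ndarray       # (C, 3, 3) per-camera calibration matrices
    extrinsics: np.ndarray       # (C, 4, 4) world-to-camera transforms
    skeletal_motion: np.ndarray  # (F, 63) per-frame pose from markerless mocap
    template_mesh: object        # static, textured, manually rigged scan
```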
3.1 Human Model and Neural Radiance Fields
3.1.1 Deformable Human Mesh Model. We aim at having a deformation model of the human body and clothing, which only depends on the skeletal motion $\mathcal{M} = \{(\boldsymbol{\theta}_{t-T}, \boldsymbol{\alpha}_{t-T}, \mathbf{z}_{t-T}), \ldots, (\boldsymbol{\theta}_t, \boldsymbol{\alpha}_t, \mathbf{z}_t)\}$ and deforms the person-specific template such that motion-dependent clothing and body deformations can be modeled, e.g., the swinging of a skirt induced by the motion of the hips. Here, $\boldsymbol{\theta}_t \in \mathbb{R}^{57}$, $\boldsymbol{\alpha}_t \in \mathbb{R}^3$, and $\mathbf{z}_t \in \mathbb{R}^3$ refer to the skeletal joint angles, the root rotation, and the root translation, respectively.
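As a small illustration of how this motion input could be assembled from the quantities just defined (the function name and flat vector layout are our own):

```python
# Minimal sketch of building the motion window M for frame t from the
# per-frame quantities defined above; naming and layout are ours.
import numpy as np

def motion_window(theta: np.ndarray,  # (F, 57) joint angles
                  alpha: np.ndarray,  # (F, 3) root rotations
                  z: np.ndarray,      # (F, 3) root translations
                  t: int,
                  T: int = 2) -> np.ndarray:
    # Stack (theta_i, alpha_i, z_i) for i = t-T, ..., t into one vector;
    # the paper uses a window of 3 frames, i.e. T = 2.
    window = [np.concatenate([theta[i], alpha[i], z[i]])
              for i in range(t - T, t + 1)]
    return np.concatenate(window)  # ((T+1) * 63,)
```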
$(\cdot)_t$ refers to the $t$-th frame of the video. In practice, the time window is set to 3 ($T = 2$) and for the