HDHumans: A Hybrid Approach for High-fidelity Digital
Humans
MARC HABERMANN, Max Planck Institute for Informatics, Germany
LINGJIE LIU, Max Planck Institute for Informatics, Germany
WEIPENG XU, Meta Reality Labs, United States
GERARD PONS-MOLL, University of Tuebingen, Germany
MICHAEL ZOLLHOEFER, Meta Reality Labs, United States
CHRISTIAN THEOBALT, Max Planck Institute for Informatics, Germany
Fig. 1. We propose a method for photo-realistic human synthesis given an arbitrary camera pose and a
potentially unseen skeletal motion. Our method also handles loose types of clothing such as skirts, since we
jointly learn the dense and space-time coherent deforming geometry of the human surface (including the
dynamic clothing) along with a neural radiance field.
Photo-real digital human avatars are of enormous importance in graphics, as they enable immersive communication over the globe, improve gaming and entertainment experiences, and can be particularly beneficial for AR and VR settings. However, current avatar generation approaches either fall short in high-fidelity novel view synthesis, generalization to novel motions, reproduction of loose clothing, or they cannot render characters at the high resolution offered by modern displays. To this end, we propose HDHumans, which is the first method for HD human character synthesis that jointly produces an accurate and temporally coherent 3D deforming surface and highly photo-realistic images of arbitrary novel views and of motions not seen at training time. At the technical core, our method tightly integrates a classical deforming character template with neural radiance fields (NeRF). Our method is carefully designed to achieve a synergy between classical surface deformation and a NeRF. First, the template guides the NeRF, which allows synthesizing novel views of a highly dynamic and articulated character and even enables the synthesis of novel motions. Second, we also leverage the dense point clouds resulting from the NeRF to further improve the deforming surface via 3D-to-3D supervision. We outperform the state of the art quantitatively and qualitatively in terms of synthesis quality and resolution, as well as the quality of 3D surface reconstruction.
Authors’ addresses: Marc Habermann, Max Planck Institute for Informatics, Germany, mhaberma@mpi-inf.mpg.de; Lingjie
Liu, Max Planck Institute for Informatics, Germany, lliu@mpi-inf.mpg.de; Weipeng Xu, Meta Reality Labs, United States,
xuweipeng@meta.com; Gerard Pons-Moll, University of Tuebingen, Germany, gerard.pons-moll@uni-tuebingen.de; Michael
Zollhoefer, Meta Reality Labs, United States, zollhoefer@meta.com; Christian Theobalt, Max Planck Institute for Informatics,
Germany, theobalt@mpi-inf.mpg.de.
Permission to make digital or hard copies of part or all of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for third-party components of this work must be honored. For all other uses, contact the owner/author(s).
©2023 Copyright held by the owner/author(s).
2577-6193/2023/8-ART
https://doi.org/10.1145/3606927
CCS Concepts: • Computing methodologies → Computer vision; Rendering.
Additional Key Words and Phrases: human synthesis, neural synthesis, human modeling, human performance
capture
ACM Reference Format:
Marc Habermann, Lingjie Liu, Weipeng Xu, Gerard Pons-Moll, Michael Zollhoefer, and Christian Theobalt.
2023. HDHumans: A Hybrid Approach for High-fidelity Digital Humans. Proc. ACM Comput. Graph. Interact.
Tech. 6, 2 (August 2023), 24 pages. https://doi.org/10.1145/3606927
1 INTRODUCTION
Photo-realistic synthesis of digital humans is a very important research topic in graphics and
computer vision. Especially with the recent development of VR and AR headsets, it has become even more important, since photo-real human avatars can be used to populate virtual scenes or augment real ones. The classical approach to achieve this goal would be the manual creation of human
avatars by means of 3D modeling including meshing, texturing, designing material properties, and
many more manual steps. However, this process is not only tedious and time-consuming, but it
also requires expert knowledge, preventing these techniques from being adopted by non-expert
users. A promising alternative is to create such digital human avatars from video captures of real
humans. The goal of our approach is to create controllable and highly photo-realistic characters at
high resolution solely from multi-view video.
This idea has already been the subject of previous research, which can be broadly categorized based on the employed representation. Some approaches explicitly model the human's surface as a mesh and employ texture retrieval techniques [Casas et al. 2014; Xu et al. 2011] or deep learning [Habermann et al. 2021] to generate realistic appearance effects. However, the synthesis quality is still limited and the recovered surface deformations are of insufficient quality because they are driven purely by image-based supervision. Other works solely synthesize humans in image space [Chan et al. 2019; Liu et al. 2020b, 2019b]. These approaches, however, suffer from 3D inconsistency when changing the viewpoint. Recently, first attempts have also been proposed to combine a neural radiance field with a human body model [Chen et al. 2021; Liu et al. 2021; Peng et al. 2021a,b; Xu et al. 2021]. These works have demonstrated that a classical mesh-based surface can guide a neural radiance field (NeRF) [Mildenhall et al. 2020] for image synthesis. However, since they rely on a human body model or skeleton representation, they do not model the underlying deforming surface well. In consequence, they only work for subjects wearing tight clothing. In stark contrast, we for the first time demonstrate how a NeRF can be conditioned on a densely deforming template, and we even show that improvements can be achieved in the other direction as well, where the NeRF guides the mesh deformation.
In contrast to prior work, we propose a tightly coupled hybrid representation consisting of a classical deforming surface mesh and a neural radiance field defined in a thin shell around the surface. On the one hand, the surface mesh guides the learning of the neural radiance field, enables the method to handle large motions and loose clothing, and leads to a more efficient sampling strategy along the camera rays. On the other hand, the radiance field achieves a higher synthesis quality than pure surface-based approaches, produces explicit 3D constraints for better supervision of explicit surface deformation networks, and helps in overcoming local minima due to the local nature of color gradients in image space. This tight coupling between explicit surface deformation and neural radiance fields creates a two-way synergy between both representations. We are able to jointly capture the detailed underlying deforming surface of the clothed human and also employ this surface to drive a neural radiance field, which captures high-frequency detail and texture. More precisely, our method takes skeletal motion as input and predicts a motion-dependent deforming
surface as well as a motion- and view-dependent neural radiance field that is parameterized in a thin shell around the surface. In this way, the deforming surface acts as an initializer for the sampling and the feature accumulation of the neural radiance field, making it significantly (6×) more efficient and thus enabling training on 4K multi-view videos. The deforming surface mesh and the neural radiance field are tightly coupled during training such that the mesh drives the neural radiance field, making it efficient and robust to dynamic changes. Furthermore, not only is the neural radiance field improved based on the tracked surface mesh, but it can also be used to refine the surface mesh, since the neural radiance field drives the mesh towards reconstructing finer-scale detail, such as cloth wrinkles, which is difficult to capture with image-based supervision alone. Thus, a two-way synergy between the employed classical and neural scene representations is created that leads to significantly improved fidelity. Compared to previous work, our approach not only reconstructs deforming surface geometry of higher quality, but also renders human images at much higher fidelity (see Figure 1). In summary, our technical contributions are:
• A novel approach for high-fidelity character synthesis that enables novel view and motion synthesis at a very high resolution, which cannot be achieved by previous work.
• A synergistic integration of a classical mesh-based and a neural scene representation for virtual humans that produces higher-quality geometry, motion, and appearance than any of the two components in isolation.
• To the best of our knowledge, this is the first approach that tightly couples a deforming explicit mesh and a NeRF, enabling photo-realistic rendering of neural humans wearing loose clothing.
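To make the mesh-guided sampling described above concrete, the following minimal sketch restricts NeRF ray samples to a thin shell around the posed template, which is what yields the reported sampling efficiency. It is an illustration under our own assumptions, not the authors' implementation: the shell half-width delta, the use of the trimesh library for ray casting, and all names here are ours.

```python
# Minimal sketch of mesh-guided thin-shell sampling for a NeRF.
# Illustrative only; shell width and library choice are assumptions.
import numpy as np
import trimesh

def thin_shell_samples(mesh: trimesh.Trimesh,
                       origin: np.ndarray,     # (3,) camera center
                       direction: np.ndarray,  # (3,) unit ray direction
                       n_samples: int = 16,
                       delta: float = 0.05):   # shell half-width (assumed)
    """Sample points only inside a thin shell around the surface hit.

    Instead of sampling the full ray as in a vanilla NeRF, the ray is
    first intersected with the deforming template mesh and samples are
    drawn in [t_hit - delta, t_hit + delta].
    """
    locations, _, _ = mesh.ray.intersects_location(
        origin[None], direction[None])
    if len(locations) == 0:
        return None  # ray misses the character; skip it entirely
    # Distance along the (unit) ray to the closest intersection.
    t_hit = np.min(np.linalg.norm(locations - origin, axis=1))
    t_vals = np.linspace(t_hit - delta, t_hit + delta, n_samples)
    return origin + t_vals[:, None] * direction  # (n_samples, 3)
```

In a full pipeline, the returned points would be fed, together with motion-dependent features attached to the template, to the radiance field network.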
2 RELATED WORK
Mesh-based synthesis. Photo-realistic image synthesis of controllable characters is challenging due to the difficulty of capturing or predicting high-quality pose-dependent geometry deformation and appearance. Some works [Carranza et al. 2003; Collet et al. 2015; Hilsmann et al. 2020; Li et al. 2014; Zitnick et al. 2004] focus on free-viewpoint replay of the captured human performance sequence. Other works [Casas et al. 2014; Volino et al. 2014; Xu et al. 2011] aim at the more challenging task of photo-realistic free-viewpoint synthesis for new body poses. However, these methods need several seconds to generate a single frame. Casas et al. [2014] and Volino et al. [2014] accelerate the image synthesis process with a temporally coherent layered representation of appearance in texture space. These classical methods struggle to produce high-quality results due to the coarse geometric proxy, and have limited generalizability to new poses and viewpoints that are very different from those in the database. To improve the synthesis quality and generalizability, Habermann et al. [2021] propose a method for learning a 3D virtual character model with pose-dependent geometry deformations and pose- and view-dependent textures in a weakly supervised way from multi-view videos. While great improvements have been made, some fine-scale details are missing in the results because of the difficulty of optimizing deforming polygon meshes with only images as supervision. In this work, we observed that deforming implicit fields is more flexible (e.g., no regularization terms are needed to preserve the mesh topology), thus leading to more stable and efficient training. However, the rendering of implicit fields is time-consuming, and editing implicit representations is much more difficult than editing explicit representations, e.g., meshes. Hence, our method unifies implicit fields and explicit polygon meshes, joining the advantages of both worlds.
Image-based synthesis. GANs have achieved great progress in image synthesis in recent years. To close the gap between the rendering of a coarse geometric proxy and realistic renderings, many works formulate the mapping from the coarse rendering to a photo-realistic rendering as an image-to-image translation problem. These works take the renderings of a skeleton [Chan et al. 2019; Kappel et al. 2020; Li et al. 2019; Pumarola et al. 2018; Shysheya et al. 2019; Zhu et al. 2019], a dense mesh [Grigor’ev et al. 2019; Liu et al. 2020b, 2019b,a; Neverova et al. 2018; Prokudin et al. 2021; Raj et al. 2021; Sarkar et al. 2020; Wang et al. 2018], or a joint position heatmap [Aberman et al. 2019; Ma et al. 2017, 2018] as the input to image-to-image translation and output realistic renderings. While these methods can produce high-quality images from a single view, they are not able to synthesize view-consistent videos when changing camera viewpoints. In contrast, our method directly optimizes the geometry deformations and appearance in 3D space, so it is able to produce temporally and view-consistent photo-realistic animations of characters.
Volume-based and hybrid approaches. Recently, some methods have demonstrated impressive results on novel view synthesis of static scenes by using neural implicit fields [Mildenhall et al. 2020; Niemeyer et al. 2020; Oechsle et al. 2021; Sitzmann et al. 2019; Wang et al. 2021; Yariv et al. 2021, 2020] or hybrid representations [DeVries et al. 2021; Hedman et al. 2021; Liu et al. 2020a; Reiser et al. 2021; Yu et al. 2021] as scene representations. Great efforts have been made to extend neural representations to dynamic scenes. Neural Volumes [Lombardi et al. 2019] and its follow-up work [Wang et al. 2020] use an encoder-decoder network to learn a mapping from reference images to 3D volumes for each frame of the scene, followed by a volume rendering technique to render the scene. Several works extend NeRF [Mildenhall et al. 2020] to dynamic scene modeling with a dedicated deformation network [Park et al. 2020, 2021; Pumarola et al. 2020; Tretschk et al. 2021], scene flow fields [Li et al. 2020], or space-time neural irradiance fields [Xian et al. 2020]. Many works focus on human character modeling. Peng et al. [2021b] and Kwon et al. [2021] assign latent features to the vertices of the SMPL model and use them as anchors to link different frames. Lombardi et al. [2021] introduce a mixture of volume primitives for the efficient rendering of human actors. These methods can only play back a dynamic scene from novel views but are not able to generate images for novel poses. To address this issue, several methods propose articulated implicit representations for human characters. A-NeRF [Su et al. 2021] proposes an articulated NeRF representation based on a human skeleton for human pose refinement. Recent works [Anonymous 2022; Chen et al. 2021; Jiakai et al. 2021; Li et al. 2022; Liu et al. 2021; Noguchi et al. 2021; Peng et al. 2021a; Wang et al. 2022; Xu et al. 2021] present deformable NeRF representations, which unwarp different poses to a shared canonical space with inverse kinematic transformations and residual deformations. Moreover, HumanNeRF [Weng et al. 2022] has shown view synthesis for human characters given only a monocular RGB video for training. Most of these works cannot synthesize pose-dependent dynamic appearance, are not applicable to large-scale datasets that include severe pose variations, and have limited generalizability to new poses. The work most related to our proposed method is Neural Actor [Liu et al. 2021], which uses a texture map as a structure-aware local pose representation to infer dynamic deformation and appearance. In contrast to our method, they only use a human body model as a mesh proxy and thus cannot model characters in loose clothes. Furthermore, they only employ the mesh proxy to guide the warping of the NeRF but do not optimize the mesh. In consequence, this method cannot extract high-quality surface geometry. Further, since the mesh proxy is not very close to the actual surface, it still needs to sample many points around the surface, which prevents training at 4K resolution. Instead, we infer the dense deformation of a template that is assumed to be given, which is more efficient and enables the tracking of loose clothing. More importantly, our recovered NeRF even further refines the template deformations.
3 METHOD
The goal of our approach is to learn a unified representation of a dynamic human from multi-view video, which on the one hand allows synthesizing motion-dependent deforming geometry and on the other hand also enables photo-real synthesis of images displaying the human under novel viewpoints and novel motions. To this end, we propose an end-to-end approach, which solely takes a skeletal motion and a camera pose as input and outputs a posed and deformed mesh as well as the respective photo-real rendering of the human. Figure 2 shows an overview of the proposed method.

Fig. 2. Overview of the proposed approach. Our method takes as input a skeletal motion of the actor and predicts high-quality appearance as well as space-time coherent and deforming geometry.

In the following, some fundamentals are provided (Section 3.1). Then, we introduce our mesh-guided neural radiance field, which allows synthesizing a dynamic performance of the actor from novel views and for unseen motions (Section 3.2). This proposed mesh guidance assumes a highly detailed, accurately tracked, and space-time coherent surface of the human actor. However, we found that previous weakly supervised performance capture approaches [Habermann et al. 2021, 2020] struggle with capturing high-fidelity geometry. At the same time, volume-based surface representations [Mildenhall et al. 2020] seem to recover such geometric details when visualizing their view-dependent point clouds, but they lack space-time coherence, which is essential for the proposed mesh guidance. To overcome this limitation, we propose a NeRF-guided point cloud loss, which further improves the motion-dependent and deformable human mesh model (Section 3.3).
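To illustrate the direction of supervision in such a NeRF-guided point cloud loss, a one-directional Chamfer-style term is sketched below: points extracted from the radiance field pull the deformed template towards fine-scale geometry. This simplified formulation and all names are our own; the paper's exact loss (Section 3.3) may differ.

```python
# Rough sketch of a NeRF-guided 3D-to-3D supervision term; the
# one-directional Chamfer formulation is our own simplification.
import torch

def nerf_to_mesh_loss(nerf_points: torch.Tensor,   # (N, 3) points from the NeRF
                      mesh_vertices: torch.Tensor  # (V, 3) deformed template vertices
                      ) -> torch.Tensor:
    # For every NeRF point, find the closest template vertex and
    # penalize the squared distance, so gradients deform the mesh
    # towards the NeRF geometry (e.g., cloth wrinkles).
    d2 = torch.cdist(nerf_points, mesh_vertices) ** 2  # (N, V)
    return d2.min(dim=1).values.mean()
```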
Data assumptions. For each actor, we employ C calibrated and synchronized cameras to collect a segmented multi-view video of the person performing various types of motions. The skeletal motion that is input to our method is recovered using a markerless motion capture software [TheCaptury 2020]. Finally, we acquire a static textured mesh template of the person using a scanner [Treedys 2020] that is manually rigged to the skeleton. Note that our approach does not assume any 4D geometry in terms of per-frame scans or point clouds as input.
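For concreteness, the per-actor inputs listed above could be gathered as follows; the field names and array shapes are our own guesses for illustration and not part of the method.

```python
# Illustrative summary of the capture data the method assumes per actor;
# all field names and shapes are guesses, not the authors' code.
from dataclasses import dataclass
import numpy as np

@dataclass
class ActorCapture:
    frames: np.ndarray           # (F, C, H, W, 4) segmented multi-view video
    intrinsics: np.ndarray       # (C, 3, 3) per-camera calibration matrices
    extrinsics: np.ndarray       # (C, 4, 4) world-to-camera transforms
    skeletal_motion: np.ndarray  # (F, 63) per-frame pose from markerless mocap
    template_mesh: object        # static, textured, manually rigged scan
```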
3.1 Human Model and Neural Radiance Fields
3.1.1 Deformable Human Mesh Model. We aim at having a deformation model of the human body and clothing, which only depends on the skeletal motion $\mathcal{M} = \{(\boldsymbol{\theta}_{t-T}, \boldsymbol{\alpha}_{t-T}, \mathbf{z}_{t-T}), \ldots, (\boldsymbol{\theta}_t, \boldsymbol{\alpha}_t, \mathbf{z}_t)\}$ and deforms the person-specific template such that motion-dependent clothing and body deformations can be modeled, e.g., the swinging of a skirt induced by the motion of the hips. Here, $\boldsymbol{\theta}_t \in \mathbb{R}^{57}$, $\boldsymbol{\alpha}_t \in \mathbb{R}^3$, and $\mathbf{z}_t \in \mathbb{R}^3$ refer to the skeletal joint angles, the root rotation, and the root translation, respectively.
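As a small illustration of how this motion input could be assembled from the quantities just defined (the function name and flat vector layout are our own):

```python
# Minimal sketch of building the motion window M for frame t from the
# per-frame quantities defined above; naming and layout are ours.
import numpy as np

def motion_window(theta: np.ndarray,  # (F, 57) joint angles
                  alpha: np.ndarray,  # (F, 3) root rotations
                  z: np.ndarray,      # (F, 3) root translations
                  t: int,
                  T: int = 2) -> np.ndarray:
    # Stack (theta_i, alpha_i, z_i) for i = t-T, ..., t into one vector;
    # the paper uses a window of 3 frames, i.e. T = 2.
    window = [np.concatenate([theta[i], alpha[i], z[i]])
              for i in range(t - T, t + 1)]
    return np.concatenate(window)  # ((T+1) * 63,)
```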
$(\cdot)_t$ refers to the $t$-th frame of the video. In practice, the time window is set to 3 ($T = 2$) and for the