ARAH: Animatable Volume Rendering of
Articulated Human SDFs
Shaofei Wang1, Katja Schwarz2,3, Andreas Geiger2,3, and Siyu Tang1
1ETH Zürich
2Max Planck Institute for Intelligent Systems, Tübingen
3University of Tübingen
Abstract. Combining human body models with differentiable rendering has recently enabled animatable avatars of clothed humans from sparse sets of multi-view RGB videos. While state-of-the-art approaches achieve a realistic appearance with neural radiance fields (NeRF), the inferred geometry often lacks detail due to missing geometric constraints. Further, animating avatars in out-of-distribution poses is not yet possible because the mapping from observation space to canonical space does not generalize faithfully to unseen poses. In this work, we address these shortcomings and propose a model to create animatable clothed human avatars with detailed geometry that generalize well to out-of-distribution poses. To achieve detailed geometry, we combine an articulated implicit surface representation with volume rendering. For generalization, we propose a novel joint root-finding algorithm for simultaneous ray-surface intersection search and correspondence search. Our algorithm enables efficient point sampling and accurate point canonicalization while generalizing well to unseen poses. We demonstrate that our proposed pipeline can generate clothed avatars with high-quality pose-dependent geometry and appearance from a sparse set of multi-view RGB videos. Our method achieves state-of-the-art performance on geometry and appearance reconstruction while creating animatable avatars that generalize well to out-of-distribution poses beyond the small number of training poses.
Keywords: 3D Computer Vision, Clothed Human Modeling, Cloth Modeling, Neural Rendering, Neural Implicit Functions
1 Introduction
Reconstruction and animation of clothed human avatars is a rising topic in computer vision research. It is of particular interest for various applications in AR/VR and the future metaverse. Various sensors can be used to create clothed human avatars, ranging from 4D scanners and depth sensors to simple RGB cameras. Among these data sources, RGB videos are by far the most accessible and user-friendly choice. However, they also provide the least supervision, making this setup the most challenging for the reconstruction and animation of clothed humans.
[Figure 1: panels show the inputs (sparse multi-view videos, observation space), the output (animatable avatar, canonical space), our results on out-of-distribution poses, and results of existing works (Neural Body, Ani-NeRF).]
Fig. 1: Detailed Geometry and Generalization to Extreme Poses. Given sparse multi-view videos with SMPL fittings and foreground masks, our approach synthesizes animatable clothed avatars with realistic pose-dependent geometry and appearance. While existing works, e.g. Neural Body [60] and Ani-NeRF [58], struggle with generalizing to unseen poses, our approach enables avatars that can be animated in extreme out-of-distribution poses.
Traditional works in clothed human modeling use explicit mesh [1,2,6,7,18,
19,31,35,56,69,75,85,90] or truncated signed distance fields (TSDFs) of fixed
grid resolution [36,37,73,83,88] to represent the geometry of humans. Textures
are often represented by vertex colors or UV-maps. With the recent success
of neural implicit representations, significant progress has been made towards
modeling articulated clothed humans. PIFu [65] and PIFuHD [66] are among the
first works that propose to model clothed humans as continuous neural implicit
functions. ARCH [25] extends this idea and develops animatable clothed human
avatars from monocular images. However, this line of work does not handle
dynamic pose-dependent cloth deformations. Further, they require ground-truth
geometry for training. Such ground-truth data is expensive to acquire, limiting
the generalization of these methods.
Another line of work removes the need for ground-truth geometry by utilizing differentiable neural rendering. These methods aim to reconstruct humans from a sparse set of multi-view videos with only image supervision. Many of them use NeRF [49] as the underlying representation and achieve impressive visual fidelity on novel view synthesis tasks. However, these existing approaches have two fundamental drawbacks: (1) the NeRF-based representation lacks proper geometric regularization, leading to inaccurate geometry. This is particularly detrimental in a sparse multi-view setup and often results in artifacts in the form of erroneous color blobs under novel views or poses. (2) Existing approaches condition their NeRF networks [60] or canonicalization networks [58] on inputs in observation space. Thus, they cannot generalize to unseen out-of-distribution poses.
In this work, we address these two major drawbacks of existing approaches. (1) We improve geometry by building an articulated signed-distance-field (SDF) representation for clothed human bodies to better capture the geometry of clothed humans and improve the rendering quality. (2) In order to render the SDF, we develop an efficient joint root-finding algorithm for the conversion from observation space to canonical space. Specifically, we represent clothed human avatars as a combination of a forward linear blend skinning (LBS) network, an implicit SDF network, and a color network, all defined in canonical space and none conditioned on inputs in observation space. Given these networks and camera rays in observation space, we apply our novel joint root-finding algorithm, which efficiently finds the iso-surface points in observation space and their correspondences in canonical space. This enables us to perform efficient sampling on camera rays around the iso-surface. All network modules can be trained with a photometric loss in image space and regularization losses in canonical space.
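To make the joint root-finding idea concrete, the sketch below poses ray-surface intersection and canonical correspondence search as a single system of four equations in four unknowns (the canonical point and the ray depth) and solves it with plain Newton iterations. This is a minimal, hypothetical sketch rather than the paper's implementation: `forward_lbs` and `canonical_sdf` are placeholder networks, and the initialization, batching, and solver details of the actual algorithm are given in Section 3.3.

```python
import torch

def joint_root_find(cam_origin, ray_dir, forward_lbs, canonical_sdf,
                    t_init, x_can_init, num_iters=20, tol=1e-5):
    """Jointly solve for the ray depth t and canonical surface point x_can with
         forward_lbs(x_can) = cam_origin + t * ray_dir   (ray constraint, 3 eqs)
         canonical_sdf(x_can) = 0                        (surface constraint, 1 eq)
    i.e. 4 equations in the 4 unknowns (x_can, t). `forward_lbs` and `canonical_sdf`
    are placeholder callables; gradient flow through the solver is omitted."""

    def residual(x_can, t):
        r_ray = forward_lbs(x_can) - (cam_origin + t * ray_dir)   # (3,)
        r_sdf = canonical_sdf(x_can).reshape(1)                   # (1,)
        return torch.cat([r_ray, r_sdf])                          # (4,)

    x_can, t = x_can_init.clone(), t_init.clone()
    for _ in range(num_iters):
        res = residual(x_can, t)
        if res.norm() < tol:
            break
        # 4x4 Jacobian of the residual w.r.t. (x_can, t), assembled via autograd.
        J_x, J_t = torch.autograd.functional.jacobian(residual, (x_can, t))
        J = torch.cat([J_x.reshape(4, 3), J_t.reshape(4, 1)], dim=1)
        # Newton update on the stacked unknowns [x_can; t].
        delta = torch.linalg.solve(J, -res.detach().reshape(4, 1)).reshape(-1)
        x_can = x_can + delta[:3]
        t = t + delta[3:]
    return t, x_can
```

In practice one would batch this over rays and choose a sensible initialization for (x_can, t), e.g. from the SMPL fitting the method takes as input; those details are deliberately left out of the sketch.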
We validate our approach on the ZJU-MoCap [60] and H36M [26] datasets. Our approach generalizes well to unseen poses, enabling robust animation of clothed avatars even under out-of-distribution poses where existing works fail, as shown in Fig. 1. We achieve significant improvements over the state of the art for novel pose synthesis and geometry reconstruction, while also outperforming the state of the art on novel view synthesis of training poses. Code and data are available at https://neuralbodies.github.io/arah/.
2 Related Works
Clothed Human Modeling with Explicit Representations: Many explicit mesh-based approaches represent cloth deformations as deformation layers [1,2,68] added to minimally clothed parametric human body models [5,21,28,39,54,57,82]. Such approaches enjoy compatibility with parametric human body models but have difficulties in modeling large garment deformations. Other mesh-based approaches model garments as separate meshes [18,19,31,35,56,69,75,85,90] in order to represent more detailed and physically plausible cloth deformations. However, such methods often require accurate 3D surface registration, synthetic 3D data, or dense multi-view images for training, and the garment meshes need to be pre-defined for each cloth type. More recently, point-cloud-based explicit methods [40,42,89] have also shown promising results in modeling clothed humans. However, they still require explicit 3D or depth supervision for training, while our goal is to train using sparse multi-view RGB supervision alone.
Clothed Humans as Implicit Functions: Neural implicit functions [13,44,45,55,61] have been used to model clothed humans from various sensor inputs including monocular images [22,23,25,33,64-66,72,80,93], multi-view videos [30,38,52,58,60,81], sparse point clouds [6,14,16,77,78,94], or 3D meshes [11,12,15,47,48,67,74]. Among the image-based methods, [4,23,25] obtain animatable reconstructions of clothed humans from a single image. However, they do not model pose-dependent cloth deformations and require ground-truth geometry for training. [30] learns generalizable NeRF models for human performance capture and only requires multi-view images as supervision, but it needs images as inputs for synthesizing novel poses. [38,52,58,60,81] take multi-view videos as inputs and do not need ground-truth geometry during training. These methods generate personalized per-subject avatars and only need 2D supervision. Our approach follows this line of work and also learns a personalized avatar for each subject.
Neural Rendering of Animatable Clothed Humans: Differentiable neural rendering has been extended to model animatable human bodies by a number of recent works [52,58,60,63,72,81]. Neural Body [60] proposes to diffuse latent per-vertex codes associated with SMPL meshes in observation space and to condition NeRF [49] on these latent codes. However, the conditional inputs of Neural Body lie in observation space, so it does not generalize well to out-of-distribution poses. Several recent works [52,58,72] propose to model the radiance field in canonical space and use a pre-defined or learned backward mapping to map query points from observation space to this canonical space. A-NeRF [72] uses a deterministic backward mapping defined by piecewise rigid bone transformations. This mapping is very coarse, and the model has to use a complicated bone-relative embedding to compensate for it. Ani-NeRF [58] trains a backward LBS network that does not generalize well to out-of-distribution poses, even when it is fine-tuned with a cycle consistency loss for each test pose. Further, all aforementioned methods utilize a volumetric radiance representation and hence suffer from noisy geometry [53,76,86,87]. In contrast to these works, we improve geometry by combining an implicit surface representation with volume rendering and improve pose generalization via iterative root-finding. H-NeRF [81] achieves large improvements in geometric reconstruction by co-training SDF and NeRF networks. However, code and models of H-NeRF are not publicly available. Furthermore, H-NeRF's canonicalization process relies on imGHUM [3] to predict accurate signed distances in observation space. Therefore, imGHUM needs to be trained on a large corpus of posed human scans, and it is unclear whether the learned signed distance fields generalize to out-of-distribution poses beyond the training set. In contrast, our approach does not need to be trained on any posed scans and can generalize to extreme out-of-distribution poses.
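To make the contrast between backward mappings and a forward mapping concrete, here is a minimal sketch of forward linear blend skinning with a learned canonical skinning-weight field. The name `skinning_weight_net` and the tensor shapes are illustrative assumptions, not the paper's exact interface; the point is that the map is defined entirely in canonical space, so canonicalizing an observed point requires inverting it, e.g. by root-finding, rather than evaluating a learned backward network conditioned on the observed pose.

```python
import torch

def forward_lbs(x_can, bone_transforms, skinning_weight_net):
    """Map canonical points to observation space via forward LBS.

    x_can:               (N, 3)    points in canonical space
    bone_transforms:     (B, 4, 4) per-bone rigid transforms for the current pose
    skinning_weight_net: maps (N, 3) canonical points to (N, B) weights
                         (softmax-normalized); illustrative placeholder.
    """
    w = skinning_weight_net(x_can)                       # (N, B), rows sum to 1
    # Blend the bone transforms per point: T_n = sum_b w_nb * G_b
    T = torch.einsum('nb,bij->nij', w, bone_transforms)  # (N, 4, 4)
    x_h = torch.cat([x_can, torch.ones_like(x_can[:, :1])], dim=-1)  # homogeneous
    x_obs = torch.einsum('nij,nj->ni', T, x_h)[:, :3]    # (N, 3) observation space
    return x_obs
```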
Concurrent Works: Several concurrent works extend NeRF-based articulated models to improve novel view synthesis, geometry reconstruction, or animation quality [10,24,27,32,46,59,71,79,84,92]. [92] proposes to jointly learn forward blending weights, a canonical occupancy network, and a canonical color network using differentiable surface rendering for head avatars. In contrast to human heads, human bodies show much more articulation. Abrupt changes in depth also occur more frequently when rendering human bodies, which is difficult to capture with surface rendering [76]. Furthermore, [92] uses the secant method to find surface points, which requires solving a root-finding problem from scratch for each secant step. Instead, we use volume rendering of SDFs and formulate the surface-finding task of articulated SDFs as a joint root-finding problem that only needs to be solved once per ray. We remark that [27] proposes to formulate surface finding and correspondence search as a joint root-finding problem to tackle geometry reconstruction from photometric and mask losses. However, they use pre-defined skinning fields and surface rendering. They also require estimated normals from PIFuHD [66], while our approach achieves detailed geometry reconstructions without such supervision.
Fig. 2: Overview of Our Pipeline. (a) Given a ray (c, v) with camera center c and ray direction v in observation space, we jointly search for its intersection with the SDF iso-surface and the correspondence of the intersection point via a novel joint root-finding algorithm (Section 3.3). We then sample near/far surface points {x̄}. (b) The sampled points are mapped into canonical space as {x̂} via root-finding. (c) In canonical space, we run SDF-based volume rendering with the canonicalized points {x̂}, local body poses and shape (θ, β), an SDF network feature z, surface normals n, and a per-frame latent code Z to predict the corresponding pixel value of the input ray (Section 3.4). (d) All network modules, including the forward LBS network LBS_σω, the canonical SDF network f_σf, and the canonical color network f_σc, are trained end-to-end with a photometric loss in image space and regularization losses in canonical space (Section 3.5).
3 Method
Our pipeline is illustrated in Fig. 2. Our model consists of a forward linear blend skinning (LBS) network (Section 3.1), a canonical SDF network, and a canonical color network (Section 3.2). When rendering a specific pixel of the image in observation space, we first find the intersection of the corresponding camera ray and the observation-space SDF iso-surface. Since we model a canonical SDF and a forward LBS, we propose a novel joint root-finding algorithm that can simultaneously search for the ray-surface intersection and the canonical correspondence of the intersection point (Section 3.3). This formulation does not condition the networks on inputs in observation space. Consequently, it can generalize to unseen poses. Once the ray-surface intersection is found, we sample near/far surface points on the camera ray and find their canonical correspondences via forward LBS root-finding. The canonicalized points are used for volume rendering to compose the final RGB value at the pixel (Section 3.4). The predicted pixel color is then compared to the observation using a photometric loss (Section 3.5). The model is trained end-to-end using the photometric loss and regularization losses. The learned networks represent a personalized animatable avatar that can robustly synthesize new geometries and appearances under novel poses (Section 4.1).
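The compositing step is left abstract above. The following sketch shows one common way to volume-render an SDF, assuming a VolSDF-style Laplace-CDF transform from signed distance to density followed by standard alpha compositing along the ray; the paper's exact density transform, conditioning inputs, and sampling scheme are specified in Section 3.4, and `canonical_sdf` / `canonical_color` are placeholder handles.

```python
import torch

def render_ray(x_can, deltas, canonical_sdf, canonical_color, beta=0.01):
    """Composite a pixel color from canonicalized samples along one ray.

    x_can:  (S, 3) canonical-space sample points, ordered by ray depth
    deltas: (S,)   distances between consecutive samples along the ray
    beta:   scale of the (assumed) Laplace-CDF density transform
    """
    sdf = canonical_sdf(x_can).reshape(-1)      # (S,) signed distances
    rgb = canonical_color(x_can)                # (S, 3) radiance

    # VolSDF-style transform: density is large inside the surface (sdf < 0)
    # and decays outside, with sharpness controlled by beta.
    density = (1.0 / beta) * torch.where(
        sdf <= 0,
        1.0 - 0.5 * torch.exp(sdf / beta),
        0.5 * torch.exp(-sdf / beta),
    )

    # Standard volume-rendering quadrature (alpha compositing).
    alpha = 1.0 - torch.exp(-density * deltas)                          # (S,)
    trans = torch.cumprod(
        torch.cat([torch.ones_like(alpha[:1]), 1.0 - alpha + 1e-10])[:-1], dim=0)
    weights = alpha * trans                                             # (S,)
    pixel_rgb = (weights[:, None] * rgb).sum(dim=0)                     # (3,)
    return pixel_rgb, weights
```

In the full model, the color network is additionally conditioned on body pose and shape (θ, β), surface normals, an SDF feature, and a per-frame latent code (Fig. 2); these inputs are omitted here for brevity, and the predicted pixel is compared to the ground-truth pixel with the photometric loss of Section 3.5.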