ARAH: Animatable Volume Rendering of
Articulated Human SDFs
Shaofei Wang1, Katja Schwarz2,3, Andreas Geiger2,3, and Siyu Tang1
1ETH Zürich
2Max Planck Institute for Intelligent Systems, Tübingen
3University of Tübingen
Abstract. Combining human body models with differentiable rendering has recently enabled animatable avatars of clothed humans from sparse sets of multi-view RGB videos. While state-of-the-art approaches achieve a realistic appearance with neural radiance fields (NeRF), the inferred geometry often lacks detail due to missing geometric constraints. Further, animating avatars in out-of-distribution poses is not yet possible because the mapping from observation space to canonical space does not generalize faithfully to unseen poses. In this work, we address these shortcomings and propose a model to create animatable clothed human avatars with detailed geometry that generalize well to out-of-distribution poses. To achieve detailed geometry, we combine an articulated implicit surface representation with volume rendering. For generalization, we propose a novel joint root-finding algorithm for simultaneous ray-surface intersection search and correspondence search. Our algorithm enables efficient point sampling and accurate point canonicalization while generalizing well to unseen poses. We demonstrate that our proposed pipeline can generate clothed avatars with high-quality pose-dependent geometry and appearance from a sparse set of multi-view RGB videos. Our method achieves state-of-the-art performance on geometry and appearance reconstruction while creating animatable avatars that generalize well to out-of-distribution poses beyond the small number of training poses.
Keywords: 3D Computer Vision, Clothed Human Modeling, Cloth Modeling, Neural Rendering, Neural Implicit Functions
1 Introduction
Reconstruction and animation of clothed human avatars is a rising topic in computer vision research. It is of particular interest for various applications in AR/VR and the future metaverse. Various sensors can be used to create clothed human avatars, ranging from 4D scanners and depth sensors to simple RGB cameras. Among these data sources, RGB videos are by far the most accessible and user-friendly choice. However, they also provide the least supervision, making this setup the most challenging for the reconstruction and animation of clothed humans.
[Figure 1: panels show the inputs (sparse multi-view videos, observation space), the output (animatable avatar, canonical space), our results on out-of-distribution poses, and results of existing works (Neural Body, Ani-NeRF).]
Fig. 1: Detailed Geometry and Generalization to Extreme Poses. Given sparse multi-view videos with SMPL fittings and foreground masks, our approach synthesizes animatable clothed avatars with realistic pose-dependent geometry and appearance. While existing works, e.g. Neural Body [60] and Ani-NeRF [58], struggle with generalizing to unseen poses, our approach enables avatars that can be animated in extreme out-of-distribution poses.
Traditional works in clothed human modeling use explicit mesh [1,2,6,7,18,
19,31,35,56,69,75,85,90] or truncated signed distance fields (TSDFs) of fixed
grid resolution [36,37,73,83,88] to represent the geometry of humans. Textures
are often represented by vertex colors or UV-maps. With the recent success
of neural implicit representations, significant progress has been made towards
modeling articulated clothed humans. PIFu [65] and PIFuHD [66] are among the
first works that propose to model clothed humans as continuous neural implicit
functions. ARCH [25] extends this idea and develops animatable clothed human
avatars from monocular images. However, this line of work does not handle
dynamic pose-dependent cloth deformations. Further, they require ground-truth
geometry for training. Such ground-truth data is expensive to acquire, limiting
the generalization of these methods.
Another line of work removes the need for ground-truth geometry by utilizing differentiable neural rendering. These methods aim to reconstruct humans from a sparse set of multi-view videos with only image supervision. Many of them use NeRF [49] as the underlying representation and achieve impressive visual fidelity on novel view synthesis tasks. However, these existing approaches have two fundamental drawbacks: (1) the NeRF-based representation lacks proper geometric regularization, leading to inaccurate geometry. This is particularly detrimental in a sparse multi-view setup and often results in artifacts in the form of erroneous color blobs under novel views or poses. (2) Existing approaches condition their NeRF networks [60] or canonicalization networks [58] on inputs in observation space. Thus, they cannot generalize to unseen out-of-distribution poses.
In this work, we address these two major drawbacks of existing approaches. (1) We improve geometry by building an articulated signed-distance-field (SDF) representation for clothed human bodies to better capture the geometry of clothed humans and improve the rendering quality. (2) In order to render the SDF, we develop an efficient joint root-finding algorithm for the conversion from observation space to canonical space. Specifically, we represent clothed human avatars as a combination of a forward linear blend skinning (LBS) network, an implicit SDF network, and a color network, all defined in canonical space and none conditioned on inputs in observation space. Given these networks and camera rays in observation space, we apply our novel joint root-finding algorithm, which efficiently finds the iso-surface points in observation space and their correspondences in canonical space. This enables us to perform efficient sampling on camera rays around the iso-surface. All network modules can be trained with a photometric loss in image space and regularization losses in canonical space.
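To make the joint root-finding idea concrete, the sketch below poses ray-surface intersection and canonical correspondence search as a single system of four equations in four unknowns (the canonical point and the ray depth) and solves it with plain Newton iterations. This is a minimal, hypothetical sketch rather than the paper's implementation: `forward_lbs` and `canonical_sdf` are placeholder networks, and the initialization, batching, and solver details of the actual algorithm are given in Section 3.3.

```python
import torch

def joint_root_find(cam_origin, ray_dir, forward_lbs, canonical_sdf,
                    t_init, x_can_init, num_iters=20, tol=1e-5):
    """Jointly solve for the ray depth t and canonical surface point x_can with
         forward_lbs(x_can) = cam_origin + t * ray_dir   (ray constraint, 3 eqs)
         canonical_sdf(x_can) = 0                        (surface constraint, 1 eq)
    i.e. 4 equations in the 4 unknowns (x_can, t). `forward_lbs` and `canonical_sdf`
    are placeholder callables; gradient flow through the solver is omitted."""

    def residual(x_can, t):
        r_ray = forward_lbs(x_can) - (cam_origin + t * ray_dir)   # (3,)
        r_sdf = canonical_sdf(x_can).reshape(1)                   # (1,)
        return torch.cat([r_ray, r_sdf])                          # (4,)

    x_can, t = x_can_init.clone(), t_init.clone()
    for _ in range(num_iters):
        res = residual(x_can, t)
        if res.norm() < tol:
            break
        # 4x4 Jacobian of the residual w.r.t. (x_can, t), assembled via autograd.
        J_x, J_t = torch.autograd.functional.jacobian(residual, (x_can, t))
        J = torch.cat([J_x.reshape(4, 3), J_t.reshape(4, 1)], dim=1)
        # Newton update on the stacked unknowns [x_can; t].
        delta = torch.linalg.solve(J, -res.detach().reshape(4, 1)).reshape(-1)
        x_can = x_can + delta[:3]
        t = t + delta[3:]
    return t, x_can
```

In practice one would batch this over rays and choose a sensible initialization for (x_can, t), e.g. from the SMPL fitting the method takes as input; those details are deliberately left out of the sketch.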
We validate our approach on the ZJU-MoCap [60] and H36M [26] datasets. Our approach generalizes well to unseen poses, enabling robust animation of clothed avatars even under out-of-distribution poses where existing works fail, as shown in Fig. 1. We achieve significant improvements over the state of the art for novel pose synthesis and geometry reconstruction, while also outperforming the state of the art on novel view synthesis of training poses. Code and data are available at https://neuralbodies.github.io/arah/.
2 Related Works
Clothed Human Modeling with Explicit Representations: Many explicit mesh-based approaches represent cloth deformations as deformation layers [1,2,68] added to minimally clothed parametric human body models [5,21,28,39,54,57,82]. Such approaches enjoy compatibility with parametric human body models but have difficulties in modeling large garment deformations. Other mesh-based approaches model garments as separate meshes [18,19,31,35,56,69,75,85,90] in order to represent more detailed and physically plausible cloth deformations. However, such methods often require accurate 3D surface registration, synthetic 3D data, or dense multi-view images for training, and the garment meshes need to be pre-defined for each cloth type. More recently, point-cloud-based explicit methods [40,42,89] have also shown promising results in modeling clothed humans. However, they still require explicit 3D or depth supervision for training, while our goal is to train using sparse multi-view RGB supervision alone.
Clothed Humans as Implicit Functions: Neural implicit functions [13,44,45,55,61] have been used to model clothed humans from various sensor inputs including monocular images [22,23,25,33,64-66,72,80,93], multi-view videos [30,38,52,58,60,81], sparse point clouds [6,14,16,77,78,94], or 3D meshes [11,12,15,47,48,67,74]. Among the image-based methods, [4,23,25] obtain animatable reconstructions of clothed humans from a single image. However, they do not model pose-dependent cloth deformations and require ground-truth geometry for training. [30] learns generalizable NeRF models for human performance capture and only requires multi-view images as supervision, but it needs images as inputs for synthesizing novel poses. [38,52,58,60,81] take multi-view videos as inputs and do not need ground-truth geometry during training. These methods generate personalized per-subject avatars and only need 2D supervision. Our approach follows this line of work and also learns a personalized avatar for each subject.
Neural Rendering of Animatable Clothed Humans: Differentiable neural rendering has been extended to model animatable human bodies by a number of recent works [52,58,60,63,72,81]. Neural Body [60] proposes to diffuse latent per-vertex codes associated with SMPL meshes in observation space and to condition NeRF [49] on these latent codes. However, the conditional inputs of Neural Body lie in observation space, so it does not generalize well to out-of-distribution poses. Several recent works [52,58,72] propose to model the radiance field in canonical space and use a pre-defined or learned backward mapping to map query points from observation space to this canonical space. A-NeRF [72] uses a deterministic backward mapping defined by piecewise rigid bone transformations. This mapping is very coarse, and the model has to use a complicated bone-relative embedding to compensate for it. Ani-NeRF [58] trains a backward LBS network that does not generalize well to out-of-distribution poses, even when it is fine-tuned with a cycle consistency loss for each test pose. Further, all aforementioned methods utilize a volumetric radiance representation and hence suffer from noisy geometry [53,76,86,87]. In contrast to these works, we improve geometry by combining an implicit surface representation with volume rendering and improve pose generalization via iterative root-finding. H-NeRF [81] achieves large improvements in geometric reconstruction by co-training SDF and NeRF networks. However, code and models of H-NeRF are not publicly available. Furthermore, H-NeRF's canonicalization process relies on imGHUM [3] to predict accurate signed distances in observation space. Therefore, imGHUM needs to be trained on a large corpus of posed human scans, and it is unclear whether the learned signed distance fields generalize to out-of-distribution poses beyond the training set. In contrast, our approach does not need to be trained on any posed scans and can generalize to extreme out-of-distribution poses.
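To make the contrast between backward mappings and a forward mapping concrete, here is a minimal sketch of forward linear blend skinning with a learned canonical skinning-weight field. The name `skinning_weight_net` and the tensor shapes are illustrative assumptions, not the paper's exact interface; the point is that the map is defined entirely in canonical space, so canonicalizing an observed point requires inverting it, e.g. by root-finding, rather than evaluating a learned backward network conditioned on the observed pose.

```python
import torch

def forward_lbs(x_can, bone_transforms, skinning_weight_net):
    """Map canonical points to observation space via forward LBS.

    x_can:               (N, 3)    points in canonical space
    bone_transforms:     (B, 4, 4) per-bone rigid transforms for the current pose
    skinning_weight_net: maps (N, 3) canonical points to (N, B) weights
                         (softmax-normalized); illustrative placeholder.
    """
    w = skinning_weight_net(x_can)                       # (N, B), rows sum to 1
    # Blend the bone transforms per point: T_n = sum_b w_nb * G_b
    T = torch.einsum('nb,bij->nij', w, bone_transforms)  # (N, 4, 4)
    x_h = torch.cat([x_can, torch.ones_like(x_can[:, :1])], dim=-1)  # homogeneous
    x_obs = torch.einsum('nij,nj->ni', T, x_h)[:, :3]    # (N, 3) observation space
    return x_obs
```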
Concurrent Works: Several concurrent works extend NeRF-based articulated models to improve novel view synthesis, geometry reconstruction, or animation quality [10,24,27,32,46,59,71,79,84,92]. [92] proposes to jointly learn forward blending weights, a canonical occupancy network, and a canonical color network using differentiable surface rendering for head avatars. In contrast to human heads, human bodies show much more articulation. Abrupt changes in depth also occur more frequently when rendering human bodies, which is difficult to capture with surface rendering [76]. Furthermore, [92] uses the secant method to find surface points, which requires solving a root-finding problem from scratch for each secant step. Instead, we use volume rendering of SDFs and formulate the surface-finding task of articulated SDFs as a joint root-finding problem that only needs to be solved once per ray. We remark that [27] proposes to formulate surface finding and correspondence search as a joint root-finding problem to tackle geometry reconstruction from photometric and mask losses. However, they use pre-defined skinning fields and surface rendering. They also require estimated normals from PIFuHD [66], while our approach achieves detailed geometry reconstructions without such supervision.
Fig. 2: Overview of Our Pipeline. (a) Given a ray (c, v) with camera center c and ray direction v in observation space, we jointly search for its intersection with the SDF iso-surface and the correspondence of the intersection point via a novel joint root-finding algorithm (Section 3.3). We then sample near/far surface points {x̄}. (b) The sampled points are mapped into canonical space as {x̂} via root-finding. (c) In canonical space, we run SDF-based volume rendering with the canonicalized points {x̂}, local body poses and shape (θ, β), an SDF network feature z, surface normals n, and a per-frame latent code Z to predict the corresponding pixel value of the input ray (Section 3.4). (d) All network modules, including the forward LBS network LBS_σω, the canonical SDF network f_σf, and the canonical color network f_σc, are trained end-to-end with a photometric loss in image space and regularization losses in canonical space (Section 3.5).
3 Method
Our pipeline is illustrated in Fig. 2. Our model consists of a forward linear blend skinning (LBS) network (Section 3.1), a canonical SDF network, and a canonical color network (Section 3.2). When rendering a specific pixel of the image in observation space, we first find the intersection of the corresponding camera ray and the observation-space SDF iso-surface. Since we model a canonical SDF and a forward LBS, we propose a novel joint root-finding algorithm that can simultaneously search for the ray-surface intersection and the canonical correspondence of the intersection point (Section 3.3). This formulation does not condition the networks on inputs in observation space. Consequently, it can generalize to unseen poses. Once the ray-surface intersection is found, we sample near/far surface points on the camera ray and find their canonical correspondences via forward LBS root-finding. The canonicalized points are used for volume rendering to compose the final RGB value at the pixel (Section 3.4). The predicted pixel color is then compared to the observation using a photometric loss (Section 3.5). The model is trained end-to-end using the photometric loss and regularization losses. The learned networks represent a personalized animatable avatar that can robustly synthesize new geometries and appearances under novel poses (Section 4.1).
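The compositing step is left abstract above. The following sketch shows one common way to volume-render an SDF, assuming a VolSDF-style Laplace-CDF transform from signed distance to density followed by standard alpha compositing along the ray; the paper's exact density transform, conditioning inputs, and sampling scheme are specified in Section 3.4, and `canonical_sdf` / `canonical_color` are placeholder handles.

```python
import torch

def render_ray(x_can, deltas, canonical_sdf, canonical_color, beta=0.01):
    """Composite a pixel color from canonicalized samples along one ray.

    x_can:  (S, 3) canonical-space sample points, ordered by ray depth
    deltas: (S,)   distances between consecutive samples along the ray
    beta:   scale of the (assumed) Laplace-CDF density transform
    """
    sdf = canonical_sdf(x_can).reshape(-1)      # (S,) signed distances
    rgb = canonical_color(x_can)                # (S, 3) radiance

    # VolSDF-style transform: density is large inside the surface (sdf < 0)
    # and decays outside, with sharpness controlled by beta.
    density = (1.0 / beta) * torch.where(
        sdf <= 0,
        1.0 - 0.5 * torch.exp(sdf / beta),
        0.5 * torch.exp(-sdf / beta),
    )

    # Standard volume-rendering quadrature (alpha compositing).
    alpha = 1.0 - torch.exp(-density * deltas)                          # (S,)
    trans = torch.cumprod(
        torch.cat([torch.ones_like(alpha[:1]), 1.0 - alpha + 1e-10])[:-1], dim=0)
    weights = alpha * trans                                             # (S,)
    pixel_rgb = (weights[:, None] * rgb).sum(dim=0)                     # (3,)
    return pixel_rgb, weights
```

In the full model, the color network is additionally conditioned on body pose and shape (θ, β), surface normals, an SDF feature, and a per-frame latent code (Fig. 2); these inputs are omitted here for brevity, and the predicted pixel is compared to the ground-truth pixel with the photometric loss of Section 3.5.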