Capturing and Animation of Body and Clothing from Monocular Video
YAO FENG, Max Planck Institute for Intelligent Systems, Germany and ETH Zürich, Switzerland
JINLONG YANG, Max Planck Institute for Intelligent Systems, Germany
MARC POLLEFEYS, ETH Zürich, Switzerland
MICHAEL J. BLACK, Max Planck Institute for Intelligent Systems, Germany
TIMO BOLKART, Max Planck Institute for Intelligent Systems, Germany
[Figure 1 panels: (a) Input monocular video; (b) Disentangled captures from SCARF; (c) Animatable face and hands; (d) Clothing transfer]
Fig. 1. Given a monocular video (a), our method (SCARF) builds an avatar where the body and clothing are disentangled (b). The body is represented by a
traditional mesh, while the clothing is captured by an implicit neural representation. SCARF enables animation with detailed control over the face and hands
(c) as well as clothing transfer between subjects (d).
While recent work has shown progress on extracting clothed 3D human avatars from a single image, video, or a set of 3D scans, several limitations remain. Most methods use a holistic representation to jointly model the body and clothing, which means that the clothing and body cannot be separated for applications like virtual try-on. Other methods separately model the body and clothing, but they require training from a large set of 3D clothed human meshes obtained from 3D/4D scanners or physics simulations. Our insight is that the body and clothing have different modeling requirements. While the body is well represented by a mesh-based parametric 3D model, implicit representations and neural radiance fields are better suited to capturing the large variety in shape and appearance present in clothing. Building on this insight, we propose SCARF (Segmented Clothed Avatar Radiance Field), a hybrid model combining a mesh-based body with a neural radiance field. Integrating the mesh into the volumetric rendering in combination with a differentiable rasterizer enables us to optimize SCARF directly from monocular videos, without any 3D supervision. The hybrid modeling enables SCARF to (i) animate the clothed body avatar by changing body poses (including hand articulation and facial expressions), (ii) synthesize novel views of the avatar, and (iii) transfer clothing between avatars in virtual try-on applications. We demonstrate that SCARF reconstructs clothing with higher visual quality than existing methods, that the clothing deforms with changing body pose and body shape, and that clothing can be successfully transferred between avatars of different subjects. The code and models are available at https://github.com/YadiraF/SCARF.
CCS Concepts: • Computing methodologies → Shape modeling.
Permission to make digital or hard copies of part or all of this work for personal or
classroom use is granted without fee provided that copies are not made or distributed
for profit or commercial advantage and that copies bear this notice and the full citation
on the first page. Copyrights for third-party components of this work must be honored.
For all other uses, contact the owner/author(s).
SA ’22 Conference Papers, December 6–9, 2022, Daegu, Republic of Korea
©2022 Copyright held by the owner/author(s).
ACM ISBN 978-1-4503-9470-3/22/12.
https://doi.org/10.1145/3550469.3555423
ACM Reference Format:
Yao Feng, Jinlong Yang, Marc Pollefeys, Michael J. Black, and Timo Bolkart.
2022. Capturing and Animation of Body and Clothing from Monocular
Video. In SIGGRAPH Asia 2022 Conference Papers (SA ’22 Conference Papers),
December 6–9, 2022, Daegu, Republic of Korea. ACM, New York, NY, USA,
11 pages. https://doi.org/10.1145/3550469.3555423
1 INTRODUCTION
Realistic avatar creation is one of the key enablers of the metaverse,
and it supports many applications in virtual presence, fitness, digital
fashion, and entertainment. Traditional ways to build avatars require
either complex capture systems or manual design by artists, both
of which are time-consuming and inefficient for large-scale avatar
creation. To address this, previous work explores more practical
ways to create avatars directly from single RGB images or monocular
videos, which are more accessible to consumers.
The majority of work (e.g., [Choutas et al. 2020; Feng et al. 2021a; Kanazawa et al. 2018; Kolotouros et al. 2019; Pavlakos et al. 2019; Rong et al. 2021; Zanfir et al. 2021]) creates 3D human body avatars from images by estimating parameters of statistical 3D mesh models such as SCAPE [Anguelov et al. 2005], Adam [Joo et al. 2018], SMPL/SMPL-X [Loper et al. 2015; Pavlakos et al. 2019], GHUM [Xu et al. 2020], or STAR [Osman et al. 2020], or implicit surface models like imGHUM [Alldieck et al. 2021] and LEAP [Mihajlovic et al. 2021]. As these models are trained from minimally clothed body scans, they are unable to capture clothing shape and appearance variations, which require a more flexible representation.
Methods that recover clothed bodies from images are instead trained with a large set of 3D clothed human scans [Saito et al. 2019, 2020; Xiu et al. 2022], or optimize the clothed avatar directly from multi-view images or videos [Chen et al. 2021b; Jiang et al. 2022; Liu et al. 2021b; Peng et al. 2021a, 2022, 2021b; Xu et al. 2021]. To handle the complex topology of different clothing types, these methods model the body and clothing with a holistic implicit representation. Hence, hands and faces are typically poorly reconstructed and are not articulated. Additionally, holistic models of the body and clothing do not permit virtual try-on applications, which require the body and clothing to be represented separately. While neural radiance fields (NeRF) can model the head well (e.g., [Hong et al. 2022]), it remains unclear how to effectively combine such a part-based model with a clothed body representation.
Some methods treat the body and clothing separately with a layered representation, where clothing is modeled as a layer on top of the body [Corona et al. 2021; Jiang et al. 2020; Xiang et al. 2021; Zhu et al. 2020]. These methods require large datasets of 3D clothing scans for training, but still lack generalization to diverse clothing types. Furthermore, given an RGB image, they recover only the geometry of the clothed body without appearance information [Corona et al. 2021; Jiang et al. 2020; Zhu et al. 2020]. Similarly, Xiang et al. [2021] require multi-view video data and accurately registered 3D clothing meshes to build a subject-specific avatar; their method is not applicable to loose clothing like skirts or dresses.
Our goal is to go beyond existing work to capture realistic avatars from monocular videos that have detailed and animatable hands and faces as well as clothing that can be easily transferred between avatars. We observe that the body and clothing have different modeling requirements. Human bodies have similar shapes that can be modeled well by a statistical mesh model. In contrast, clothing shape and appearance are much more varied, and thus require a more flexible 3D representation that can handle changing topology and transparent materials. With these observations, we propose SCARF (Segmented Clothed Avatar Radiance Field), a hybrid representation combining a mesh with a NeRF, to capture disentangled clothed human avatars from monocular videos. Specifically, we use SMPL-X to represent the human body and a NeRF on top of the body mesh to capture clothing of varied topology. There are four main challenges in building such a model from monocular video. First, SCARF must accurately capture human motion in monocular video and relate the body motion to the clothing. The NeRF is modeled in canonical space, and we use the skinning transformation from the SMPL-X body model to deform points in observation space to the canonical space (see the first sketch after this paragraph). This requires accurate estimates of body shape and pose for every video frame. We estimate body pose and shape parameters with PIXIE [Feng et al. 2021a]. However, these estimates are not accurate enough, resulting in blurry reconstructions. Thus, we refine the body pose and shape during optimization. Second, the cloth deformations are not fully explained by the SMPL-X skinning, particularly in the presence of loose clothing. To overcome this, we learn a non-rigid deformation field to correct clothing deviations from the body. Third, SCARF's hybrid representation, combining a NeRF and a mesh, requires customized volumetric rendering. Specifically, rendering the clothed body must account for the occlusions between the body mesh and the clothing layer. To integrate a mesh into volume rendering, we sample a ray from the camera's optical center until it intersects the body mesh, and accumulate the colors along the ray up to the intersection point with the colored mesh surface (see the second sketch after this paragraph). Fourth, to disentangle the body and clothing, we must prevent the NeRF from capturing all image information including the body. To that end, we use clothing segmentation masks to penalize the NeRF outside of clothed regions.
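To make the first of these steps concrete, the following PyTorch-style sketch shows one way to warp observation-space sample points back to canonical space by inverting the skinning transform of the nearest posed body vertex, followed by a learned non-rigid correction. This is a minimal illustration under our own simplifications, not the authors' implementation: the function and tensor names are hypothetical, and the actual method may blend transforms from several neighboring vertices and condition the deformation field on pose.

```python
import torch

def warp_to_canonical(x_obs, verts_obs, vert_transforms, deform_mlp):
    """Hedged sketch: map observation-space points back to canonical space.

    x_obs:           (N, 3) sample points along camera rays in observation space
    verts_obs:       (V, 3) posed SMPL-X vertices for the current frame
    vert_transforms: (V, 4, 4) per-vertex skinning transforms (canonical -> observed)
    deform_mlp:      small MLP predicting a non-rigid clothing correction
    """
    # 1. Nearest posed body vertex for every sample point; a simple stand-in for
    #    diffusing skinning transforms into the space around the body.
    nn_idx = torch.cdist(x_obs, verts_obs).argmin(dim=1)            # (N,)

    # 2. Invert the nearest vertex's skinning transform and apply it, which
    #    undoes the body pose and moves the point into canonical space.
    T_inv = torch.linalg.inv(vert_transforms[nn_idx])               # (N, 4, 4)
    x_h = torch.cat([x_obs, torch.ones_like(x_obs[:, :1])], dim=-1) # homogeneous
    x_can = torch.einsum("nrc,nc->nr", T_inv, x_h)[:, :3]

    # 3. Learned non-rigid correction for clothing motion that skinning alone
    #    cannot explain (e.g., swinging skirts); optimized jointly with the NeRF.
    return x_can + deform_mlp(x_can)
```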
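Similarly, the hybrid rendering described in the third step can be sketched as standard NeRF alpha compositing that is truncated at the ray's intersection with the body mesh; whatever transmittance remains after the clothing samples falls on the opaque mesh color. Again, this is a hedged sketch with hypothetical tensor names, assuming the ray/mesh intersection test and the rasterized mesh color are computed elsewhere (e.g., by a differentiable rasterizer).

```python
import torch

def composite_ray(sigmas, colors, deltas, behind_mesh, mesh_color):
    """Hedged sketch: volume rendering of the clothing NeRF truncated at the body mesh.

    sigmas:      (S,) clothing densities at the samples along one ray
    colors:      (S, 3) clothing radiance at those samples
    deltas:      (S,) distances between consecutive samples
    behind_mesh: (S,) boolean, True for samples beyond the ray/mesh intersection
    mesh_color:  (3,) rasterized body-mesh color for this pixel (background if no hit)
    """
    # The opaque body surface occludes everything behind it, so samples past
    # the intersection point contribute nothing.
    sigmas = torch.where(behind_mesh, torch.zeros_like(sigmas), sigmas)

    # Standard NeRF-style alpha compositing along the ray.
    alpha = 1.0 - torch.exp(-sigmas * deltas)                                # (S,)
    transmittance = torch.cumprod(
        torch.cat([torch.ones_like(alpha[:1]), 1.0 - alpha + 1e-10]), dim=0)[:-1]
    weights = alpha * transmittance                                           # (S,)
    rgb = (weights[:, None] * colors).sum(dim=0)

    # Light not absorbed by the clothing reaches the body surface behind it.
    return rgb + (1.0 - weights.sum()) * mesh_color
```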
In summary, SCARF automatically creates a 3D clothed human avatar from monocular video (Fig. 1) with disentangled clothing on top of the human body. SCARF offers the best of two worlds by combining different representations: a 3D parametric model for the body and a NeRF for the clothing. Based on SMPL-X, the reconstructed avatar offers animator control over body shape, pose, hand articulation, and facial expression. Since SCARF factors clothing from the body, the clothing can be extracted and transferred between avatars, enabling applications such as virtual try-on.
2 RELATED WORK
3D Bodies from images. The 3D surface of a human body is typically represented by a learned statistical 3D model [Alldieck et al. 2021; Anguelov et al. 2005; Joo et al. 2018; Loper et al. 2015; Osman et al. 2020; Pavlakos et al. 2019; Xu et al. 2020]. Numerous optimization and regression methods have been proposed to compute 3D shape and pose parameters from images, videos, and scans; see [Liu et al. 2021a; Tian et al. 2022] for recent surveys. We focus on methods that capture full-body pose and shape, including the hands and facial expressions [Choutas et al. 2020; Feng et al. 2021a; Pavlakos et al. 2019; Rong et al. 2021; Xiang et al. 2019; Xu et al. 2020; Zhou et al. 2021]. Such methods, however, do not capture hair, clothing, or anything else that deviates from the body. They also rarely recover texture information, due to the large geometric discrepancy between the clothed human in the image and the captured minimally clothed body mesh. Unlike these prior works, we consider clothing an important component and capture both the parametric body and non-parametric clothing from monocular videos.
Capturing clothed humans from images. Clothing is more complex than the body in terms of geometry, non-rigid deformation, and appearance, making the capture of clothing from images challenging. Mesh-based methods to capture clothing often use additional vertex offsets relative to the body mesh [Alldieck et al. 2019a, 2018a,b, 2019b; Jin et al. 2020; Lazova et al. 2019; Ma et al. 2020a,b]. While such an approach works well for clothing that is similar to the body, it does not capture clothing of varied topology like skirts and dresses. To handle clothing shape variations, recent methods exploit non-parametric models. For example, [He et al. 2021; Huang et al. 2020; Saito et al. 2019, 2020; Xiu et al. 2022; Zheng et al. 2021] extract pixel-aligned spatial features from images and map them to an implicit shape representation. To animate the captured non-parametric clothed humans, Yang et al. [2021] predict skeleton and skinning weights from images to drive the representation. Although such non-parametric models capture various clothing styles much better than mesh-based approaches, faces and hands are usually poorly recovered due to the lack of a strong prior on the structure of the human body. In addition, such approaches typically require a large set of manually cleaned 3D scans as training data. Recently, various methods recover 3D clothed humans directly from multi-view or monocular RGB videos [Chen et al. 2021b; Jiang et al. 2022; Liu et al. 2021b; Peng et al. 2021a, 2022, 2021b; Su et al. 2021; Weng et al. 2022]. They optimize avatars from image information using implicit shape rendering [Liu et al. 2020; Niemeyer et al. 2020; Yariv et al. 2021, 2020] or volume rendering [Mildenhall et al. 2020], so no 3D scans are needed. Although these approaches demonstrate impressive performance, hand gestures and facial expressions are difficult to capture and animate due to the lack of model expressivity and controllability. Unlike previous work, we capture clothing as a separate component on top of the body. With such a formulation, we use models tailored specifically to bodies and clothing, enabling applications such as virtual try-on and clothing transfer.

Fig. 2. SCARF takes monocular RGB video and clothing segmentation masks as input, and outputs a human avatar with separate body and clothing layers. Blue letters indicate optimizable modules or parameters.
Capturing both clothing and body. Several methods model clothing as a separate layer on top of the human body. They use training data produced by physics-based simulations [Bertiche et al. 2020; Patel et al. 2020; Santesteban et al. 2019; Vidaurre et al. 2020] or require template meshes fit to 3D scans [Chen et al. 2021a; Pons-Moll et al. 2017; Tiwari et al. 2020; Xiang et al. 2021]. It is a much harder problem to recover the body and clothing from images alone, where 3D data is not provided. Jiang et al. [2020] and Zhu et al. [2020] train a multi-clothing model on 3D datasets with various clothing styles; during inference, the trained network produces the 3D clothing as a separate layer by recognizing and predicting the clothing style from an image. Zhu et al. [2022] fit template meshes to non-parametric 3D reconstructions. While these methods recover the clothing and body from images, they are limited in visual fidelity, as they do not capture clothing appearance. Additionally, methods with such predefined clothing style templates cannot easily handle real clothing variations, limiting their applications. In contrast, Corona et al. [2021] represent clothing layers with deep unsigned distance functions [Chibane et al. 2020], and learn the clothing style and clothing cut space with an auto-decoder. Once trained, the clothing latent code can be optimized to match image observations, but it produces overly smooth results without detailed wrinkles. Instead, SCARF models the clothing layer with a neural radiance field, and optimizes the body and clothing layer from scratch rather than within the latent space of a learned model. Therefore, SCARF produces avatars with higher visual fidelity (see Section 4).
3 METHOD
SCARF extracts a clothed 3D avatar from a monocular video. SCARF enables us to synthesize novel views of the reconstructed avatar, and to animate the avatar with SMPL-X identity shape and pose control. The disentanglement of body and clothing further enables us to transfer clothing between subjects for virtual try-on applications.
Key idea. SCARF is grounded in the observation that statistical mesh models can represent human bodies well, but are ill-suited for clothing due to the large variation in clothing shape and topology (e.g., open and closed jackets, shirts, trousers, and skirts cannot be modeled with meshes of the same topology). Instead, NeRF [Mildenhall et al. 2020] offers more flexibility for modeling clothing, but is less appropriate for bodies, where good models already exist. In particular, body NeRFs often lack facial details, poorly reconstruct hands, and lack fine-grained control of hand articulation and facial expression [Chen et al. 2021b; Peng et al. 2022, 2021b; Su et al. 2021]. Motivated by the strengths and weaknesses of the different representations, we use a hybrid representation that combines the strengths of body mesh models (specifically SMPL-X) with the flexibility of NeRFs; see Figure 2 for an overview.
3.1 Hybrid Representation
We define the clothed body model in a canonical space, where body and clothing are represented separately.
Body representation. We represent the body with the expressive body model SMPL-X [Pavlakos et al. 2019], which captures whole-body shape and pose variations, including finger articulation and facial expressions. Given parameters for identity body shape $\boldsymbol{\beta} \in \mathbb{R}^{|\boldsymbol{\beta}|}$, pose $\boldsymbol{\theta} \in \mathbb{R}^{3 n_k + 3}$, and facial expression $\boldsymbol{\psi} \in \mathbb{R}^{|\boldsymbol{\psi}|}$, SMPL-X is defined as a differentiable function $M(\boldsymbol{\beta}, \boldsymbol{\theta}, \boldsymbol{\psi}) \rightarrow (V, F)$ that outputs a 3D human body mesh with $n_v$ vertices $V \in \mathbb{R}^{n_v \times 3}$ and $n_t$ faces $F \in \mathbb{R}^{n_t \times 3}$. To increase the flexibility of the model, we add an additional set of vertex offsets $O \in \mathbb{R}^{n_v \times 3}$ to capture localized geometric details, and define the model as

$$M(\boldsymbol{\beta}, \boldsymbol{\theta}, \boldsymbol{\psi}, O) = \mathrm{LBS}\big(T_P(\boldsymbol{\beta}, \boldsymbol{\theta}, \boldsymbol{\psi}, O),\, J(\boldsymbol{\beta}),\, \boldsymbol{\theta},\, W\big). \quad (1)$$
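As a concrete reading of Eq. (1), the sketch below applies the blended per-vertex skinning transforms to the offset-corrected canonical vertices. It is a hedged illustration with tensor names of our own choosing; computing $T_P$ (shape, expression, and pose-corrective blendshapes) and the per-joint rigid transforms from the kinematic chain at pose $\boldsymbol{\theta}$ is assumed to happen elsewhere, as in standard SMPL-X implementations.

```python
import torch

def lbs_with_offsets(v_template, offsets, lbs_weights, joint_transforms):
    """Minimal sketch of Eq. (1): pose the offset-corrected canonical body mesh.

    v_template:       (n_v, 3) canonical vertices T_P(beta, theta, psi) with shape,
                      expression, and pose-corrective blendshapes already applied
    offsets:          (n_v, 3) free per-vertex offsets O optimized by SCARF
    lbs_weights:      (n_v, n_j) linear blend-skinning weights W
    joint_transforms: (n_j, 4, 4) rigid transforms of the joints J(beta) at pose theta
    """
    # Blend the per-joint rigid transforms into one 4x4 transform per vertex.
    per_vertex_T = torch.einsum("vj,jrc->vrc", lbs_weights, joint_transforms)

    # Apply the blended transform to each offset-corrected canonical vertex.
    v = v_template + offsets
    v_h = torch.cat([v, torch.ones_like(v[:, :1])], dim=-1)    # homogeneous coordinates
    v_posed = torch.einsum("vrc,vc->vr", per_vertex_T, v_h)[:, :3]
    return v_posed                                              # (n_v, 3) posed vertices
```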