et al. 2021b; Peng et al. 2021a, 2022, 2021b; Xu et al. 2021]. To handle the complex topology of different clothing types, these methods model the body and clothing with a holistic implicit representation. Hence, hands and faces are typically poorly reconstructed and are not articulated. Additionally, holistic models of the body and clothing do not permit virtual try-on applications, which require the body and clothing to be represented separately. While neural radiance fields (NeRF) can model the head well (e.g., [Hong et al. 2022]), it remains unclear how to effectively combine such a part-based model with a clothed body representation.
Some methods treat the body and clothing separately with a layered representation, where clothing is modeled as a layer on top of the body [Corona et al. 2021; Jiang et al. 2020; Xiang et al. 2021; Zhu et al. 2020]. These methods require large datasets of 3D clothing scans for training, but still lack generalization to diverse clothing types. Furthermore, given an RGB image, they recover only the geometry of the clothed body without appearance information [Corona et al. 2021; Jiang et al. 2020; Zhu et al. 2020]. Similarly, Xiang et al. [2021] require multi-view video data and accurately registered 3D clothing meshes to build a subject-specific avatar; their method is not applicable to loose clothing like skirts or dresses.
Our goal is to go beyond existing work to capture realistic avatars from monocular videos, with detailed and animatable hands and faces, and with clothing that can be easily transferred between avatars. We observe that the body and clothing have different modeling requirements. Human bodies have similar shapes that can be modeled well by a statistical mesh model. In contrast, clothing shape and appearance are much more varied, and thus require more flexible 3D representations that can handle changing topologies and transparent materials. With these observations, we propose SCARF (Segmented Clothed Avatar Radiance Field), a hybrid representation combining a mesh with a NeRF, to capture disentangled clothed human avatars from monocular videos. Specifically, we use SMPL-X to represent the human body and a NeRF on top of the body mesh to capture clothing of varied topology.
There are four main challenges in building such a model from monocular video. First, SCARF must accurately capture human motion in monocular video and relate the body motion to the clothing. The NeRF is modeled in canonical space, and we use the skinning transformation from the SMPL-X body model to deform points from observation space to canonical space. This requires accurate estimates of body shape and pose for every video frame. We estimate body pose and shape parameters with PIXIE [Feng et al. 2021a]. However, these estimates are not accurate enough, resulting in blurry reconstructions. Thus, we refine the body pose and shape during optimization.
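To make this canonicalization concrete, the following is a minimal sketch of inverse linear blend skinning in PyTorch. It is an illustration rather than SCARF's actual implementation, and it assumes that each query point has already been assigned blend weights (e.g., from its nearest SMPL-X vertices):

    import torch

    def observation_to_canonical(x_obs, bone_transforms, blend_weights):
        # x_obs:           (N, 3) points sampled in observation (posed) space
        # bone_transforms: (J, 4, 4) canonical-to-posed transforms per joint
        # blend_weights:   (N, J) per-point skinning weights (rows sum to 1)
        # Blend the bone transforms per point, then invert the blended
        # transform to pull each point back to canonical space.
        T = torch.einsum('nj,jab->nab', blend_weights, bone_transforms)
        x_h = torch.cat([x_obs, x_obs.new_ones(len(x_obs), 1)], dim=-1)
        x_can = torch.einsum('nab,nb->na', torch.inverse(T), x_h)
        return x_can[:, :3]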
Second, the cloth deformations are not fully explained by the SMPL-X skinning, particularly in the presence of loose clothing. To overcome this, we learn a non-rigid deformation field to correct clothing deviations from the body.
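A deformation field of this kind is commonly implemented as a small coordinate MLP. The sketch below is a hypothetical variant; the per-frame latent conditioning and layer sizes are assumptions, not SCARF's exact architecture:

    import torch
    import torch.nn as nn

    class NonRigidDeformation(nn.Module):
        # Hypothetical sketch: predict a corrective 3D offset for each
        # canonical-space point, conditioned on a per-frame latent code.
        def __init__(self, latent_dim=64, hidden=128):
            super().__init__()
            self.mlp = nn.Sequential(
                nn.Linear(3 + latent_dim, hidden), nn.ReLU(),
                nn.Linear(hidden, hidden), nn.ReLU(),
                nn.Linear(hidden, 3))

        def forward(self, x, frame_code):
            # x: (N, 3) points; frame_code: (latent_dim,) code for this frame
            code = frame_code.expand(len(x), -1)
            return x + self.mlp(torch.cat([x, code], dim=-1))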
Third, SCARF's hybrid representation, combining a NeRF and a mesh, requires customized volumetric rendering. Specifically, rendering the clothed body must account for the occlusions between the body mesh and the clothing layer. To integrate a mesh into volume rendering, we sample a ray from the camera's optical center until it intersects the body mesh, and accumulate the colors along the ray up to the intersection point with the colored mesh surface.
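The sketch below illustrates this mesh-aware volume rendering for a single ray, assuming the samples have already been clipped at the ray-mesh intersection; the interface is illustrative rather than SCARF's code:

    import torch

    def render_hybrid_ray(density, color, delta, mesh_color, hits_mesh):
        # density: (S,) NeRF densities at samples in front of the mesh
        # color:   (S, 3) NeRF radiance at those samples
        # delta:   (S,) distances between consecutive samples
        # mesh_color: (3,) surface color at the ray-mesh intersection
        alpha = 1.0 - torch.exp(-density * delta)  # per-sample opacity
        # Transmittance before each sample (and after the last one).
        trans = torch.cumprod(torch.cat([torch.ones(1), 1.0 - alpha]), dim=0)
        rgb = ((alpha * trans[:-1])[:, None] * color).sum(dim=0)
        if hits_mesh:
            # Remaining transmittance terminates on the opaque body surface.
            rgb = rgb + trans[-1] * mesh_color
        return rgb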
Fourth, to disentangle the body and clothing, we must prevent the NeRF from capturing all image information, including the body. To that end, we use clothing segmentation masks to penalize the NeRF outside of clothed regions.
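For illustration, such a penalty can be as simple as discouraging accumulated NeRF opacity outside the clothing mask; the function below is an assumed, simplified form of this idea:

    import torch

    def clothing_mask_penalty(nerf_opacity, clothing_mask):
        # nerf_opacity:  (H, W) opacity accumulated from the clothing NeRF
        # clothing_mask: (H, W) binary mask, 1 inside clothed regions
        # Opacity outside the mask means the NeRF is explaining body or
        # background pixels, which is penalized.
        return ((1.0 - clothing_mask) * nerf_opacity).mean()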
In summary, SCARF automatically creates a 3D clothed human avatar from monocular video (Fig. 1) with disentangled clothing on top of the human body. SCARF offers the best of two worlds by combining different representations – a 3D parametric model for the body and a NeRF for the clothing. Based on SMPL-X, the reconstructed avatar offers animator control over body shape, pose, hand articulation, and facial expression. Since SCARF factors clothing from the body, the clothing can be extracted and transferred between avatars, enabling applications such as virtual try-on.
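Conceptually, clothing transfer then reduces to recombining the two components of different avatars; a toy sketch with a hypothetical container class:

    from dataclasses import dataclass

    @dataclass
    class ScarfAvatar:
        body_mesh: object      # SMPL-X body (shape, pose, expression)
        clothing_nerf: object  # clothing radiance field in canonical space

    def virtual_try_on(target, source):
        # Dress the target's body in the source's clothing; because the
        # representation is disentangled, transfer is pure recombination.
        return ScarfAvatar(body_mesh=target.body_mesh,
                           clothing_nerf=source.clothing_nerf)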
2 RELATED WORK
3D Bodies from images. The 3D surface of a human body is typically represented by a learned statistical 3D model [Alldieck et al. 2021; Anguelov et al. 2005; Joo et al. 2018; Loper et al. 2015; Osman et al. 2020; Pavlakos et al. 2019; Xu et al. 2020]. Numerous optimization and regression methods have been proposed to compute 3D shape and pose parameters from images, videos, and scans. See [Liu et al. 2021a; Tian et al. 2022] for recent surveys.
We focus on methods that capture full-body pose and shape, including the hands and facial expressions [Choutas et al. 2020; Feng et al. 2021a; Pavlakos et al. 2019; Rong et al. 2021; Xiang et al. 2019; Xu et al. 2020; Zhou et al. 2021]. Such methods, however, do not capture hair, clothing, or anything else that deviates from the body. They also rarely recover texture information, due to the large geometric discrepancy between the clothed human in the image and the captured minimally clothed body mesh. Unlike these prior works, we consider clothing an important component and capture both the parametric body and non-parametric clothing from monocular videos.
Capturing clothed humans from images. Clothing is more complex than the body in terms of geometry, non-rigid deformation, and appearance, making the capture of clothing from images challenging. Mesh-based methods to capture clothing often use additional vertex offsets relative to the body mesh [Alldieck et al. 2019a, 2018a,b, 2019b; Jin et al. 2020; Lazova et al. 2019; Ma et al. 2020a,b]. While such an approach works well for clothing that is similar to the body, it does not capture clothing of varied topology like skirts and dresses.
To handle clothing shape variations, recent methods exploit non-parametric models. For example, [He et al. 2021; Huang et al. 2020; Saito et al. 2019, 2020; Xiu et al. 2022; Zheng et al. 2021] extract pixel-aligned spatial features from images and map them to an implicit shape representation. To animate the captured non-parametric clothed humans, Yang et al. [2021] predict a skeleton and skinning weights from images to drive the representation. Although such non-parametric models capture varied clothing styles much better than mesh-based approaches, faces and hands are usually poorly recovered due to the lack of a strong prior on what the human body should look like. In addition, such approaches typically require a large set of manually cleaned 3D scans as training data.
Recently, various methods recover 3D clothed humans directly from multi-view or monocular RGB videos [Chen et al. 2021b; Jiang et al. 2022; Liu et al. 2021b; Peng et al. 2021a, 2022, 2021b; Su et al. 2021;