et al. 2021b; Peng et al. 2021a, 2022, 2021b; Xu et al. 2021]. To handle the complex topology of different clothing types, these methods model the body and clothing with a holistic implicit representation. Hence, hands and faces are typically poorly reconstructed and are not articulated. Additionally, holistic models of the body and clothing do not permit virtual try-on applications, which require the body and clothing to be represented separately. While neural radiance fields (NeRF) can model the head well (e.g., [Hong et al. 2022]), it remains unclear how to effectively combine such a part-based model with a clothed body representation.
Some methods treat the body and clothing separately with a layered representation, where clothing is modeled as a layer on top of the body [Corona et al. 2021; Jiang et al. 2020; Xiang et al. 2021; Zhu et al. 2020]. These methods require large datasets of 3D clothing scans for training, but still lack generalization to diverse clothing types. Furthermore, given an RGB image, they recover only the geometry of the clothed body without appearance information [Corona et al. 2021; Jiang et al. 2020; Zhu et al. 2020]. Similarly, Xiang et al. [2021] require multi-view video data and accurately registered 3D clothing meshes to build a subject-specific avatar; their method is not applicable to loose clothing like skirts or dresses.
Our goal is to go beyond existing work to capture realistic avatars from monocular videos, with detailed and animatable hands and faces, and with clothing that can be easily transferred between avatars. We observe that the body and clothing have different modeling requirements. Human bodies have similar shapes that can be modeled well by a statistical mesh model. In contrast, clothing shape and appearance are much more varied, and thus require more flexible 3D representations that can handle changing topologies and transparent materials. With these observations, we propose SCARF (Segmented Clothed Avatar Radiance Field), a hybrid representation combining a mesh with a NeRF, to capture disentangled clothed human avatars from monocular videos. Specifically, we use SMPL-X to represent the human body and a NeRF on top of the body mesh to capture clothing of varied topology.
There are four main challenges in building such a model from monocular video. First, SCARF must accurately capture human motion in monocular video and relate the body motion to the clothing. The NeRF is modeled in canonical space, and we use the skinning transformation from the SMPL-X body model to deform points from observation space to canonical space. This requires accurate estimates of body shape and pose for every video frame. We estimate body pose and shape parameters with PIXIE [Feng et al. 2021a]. However, these estimates are not accurate enough, resulting in blurry reconstructions. Thus, we refine the body pose and shape during optimization.
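To make this canonicalization concrete, the following is a minimal sketch of inverse linear blend skinning in PyTorch. It is an illustration rather than SCARF's actual implementation, and it assumes that each query point has already been assigned blend weights (e.g., from its nearest SMPL-X vertices):

    import torch

    def observation_to_canonical(x_obs, bone_transforms, blend_weights):
        # x_obs:           (N, 3) points sampled in observation (posed) space
        # bone_transforms: (J, 4, 4) canonical-to-posed transforms per joint
        # blend_weights:   (N, J) per-point skinning weights (rows sum to 1)
        # Blend the bone transforms per point, then invert the blended
        # transform to pull each point back to canonical space.
        T = torch.einsum('nj,jab->nab', blend_weights, bone_transforms)
        x_h = torch.cat([x_obs, x_obs.new_ones(len(x_obs), 1)], dim=-1)
        x_can = torch.einsum('nab,nb->na', torch.inverse(T), x_h)
        return x_can[:, :3]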
Second, the cloth deformations are not fully explained by the SMPL-X skinning, particularly in the presence of loose clothing. To overcome this, we learn a non-rigid deformation field to correct clothing deviations from the body.
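A deformation field of this kind is commonly implemented as a small coordinate MLP. The sketch below is a hypothetical variant; the per-frame latent conditioning and layer sizes are assumptions, not SCARF's exact architecture:

    import torch
    import torch.nn as nn

    class NonRigidDeformation(nn.Module):
        # Hypothetical sketch: predict a corrective 3D offset for each
        # canonical-space point, conditioned on a per-frame latent code.
        def __init__(self, latent_dim=64, hidden=128):
            super().__init__()
            self.mlp = nn.Sequential(
                nn.Linear(3 + latent_dim, hidden), nn.ReLU(),
                nn.Linear(hidden, hidden), nn.ReLU(),
                nn.Linear(hidden, 3))

        def forward(self, x, frame_code):
            # x: (N, 3) points; frame_code: (latent_dim,) code for this frame
            code = frame_code.expand(len(x), -1)
            return x + self.mlp(torch.cat([x, code], dim=-1))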
Third, SCARF's hybrid representation, combining a NeRF and a mesh, requires customized volumetric rendering. Specifically, rendering the clothed body must account for the occlusions between the body mesh and the clothing layer. To integrate a mesh into volume rendering, we sample a ray from the camera's optical center until it intersects the body mesh, and accumulate the colors along the ray up to the intersection point with the colored mesh surface.
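The sketch below illustrates this mesh-aware volume rendering for a single ray, assuming the samples have already been clipped at the ray-mesh intersection; the interface is illustrative rather than SCARF's code:

    import torch

    def render_hybrid_ray(density, color, delta, mesh_color, hits_mesh):
        # density: (S,) NeRF densities at samples in front of the mesh
        # color:   (S, 3) NeRF radiance at those samples
        # delta:   (S,) distances between consecutive samples
        # mesh_color: (3,) surface color at the ray-mesh intersection
        alpha = 1.0 - torch.exp(-density * delta)  # per-sample opacity
        # Transmittance before each sample (and after the last one).
        trans = torch.cumprod(torch.cat([torch.ones(1), 1.0 - alpha]), dim=0)
        rgb = ((alpha * trans[:-1])[:, None] * color).sum(dim=0)
        if hits_mesh:
            # Remaining transmittance terminates on the opaque body surface.
            rgb = rgb + trans[-1] * mesh_color
        return rgb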
Fourth, to disentangle the body and clothing, we must prevent the NeRF from capturing all image information, including the body. To that end, we use clothing segmentation masks to penalize the NeRF outside of clothed regions.
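For illustration, such a penalty can be as simple as discouraging accumulated NeRF opacity outside the clothing mask; the function below is an assumed, simplified form of this idea:

    import torch

    def clothing_mask_penalty(nerf_opacity, clothing_mask):
        # nerf_opacity:  (H, W) opacity accumulated from the clothing NeRF
        # clothing_mask: (H, W) binary mask, 1 inside clothed regions
        # Opacity outside the mask means the NeRF is explaining body or
        # background pixels, which is penalized.
        return ((1.0 - clothing_mask) * nerf_opacity).mean()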
In summary, SCARF automatically creates a 3D clothed human avatar from monocular video (Fig. 1) with disentangled clothing on top of the human body. SCARF offers the best of two worlds by combining different representations – a 3D parametric model for the body and a NeRF for the clothing. Based on SMPL-X, the reconstructed avatar offers animator control over body shape, pose, hand articulation, and facial expression. Since SCARF factors clothing from the body, the clothing can be extracted and transferred between avatars, enabling applications such as virtual try-on.
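Conceptually, clothing transfer then reduces to recombining the two components of different avatars; a toy sketch with a hypothetical container class:

    from dataclasses import dataclass

    @dataclass
    class ScarfAvatar:
        body_mesh: object      # SMPL-X body (shape, pose, expression)
        clothing_nerf: object  # clothing radiance field in canonical space

    def virtual_try_on(target, source):
        # Dress the target's body in the source's clothing; because the
        # representation is disentangled, transfer is pure recombination.
        return ScarfAvatar(body_mesh=target.body_mesh,
                           clothing_nerf=source.clothing_nerf)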
2 RELATED WORK
3D Bodies from images. The 3D surface of a human body is typically represented by a learned statistical 3D model [Alldieck et al. 2021; Anguelov et al. 2005; Joo et al. 2018; Loper et al. 2015; Osman et al. 2020; Pavlakos et al. 2019; Xu et al. 2020]. Numerous optimization and regression methods have been proposed to compute 3D shape and pose parameters from images, videos, and scans. See [Liu et al. 2021a; Tian et al. 2022] for recent surveys.
We focus on methods that capture full-body pose and shape, including the hands and facial expressions [Choutas et al. 2020; Feng et al. 2021a; Pavlakos et al. 2019; Rong et al. 2021; Xiang et al. 2019; Xu et al. 2020; Zhou et al. 2021]. Such methods, however, do not capture hair, clothing, or anything else that deviates from the body. They also rarely recover texture information, due to the large geometric discrepancy between the clothed human in the image and the captured minimally clothed body mesh. Unlike these prior works, we consider clothing an important component and capture both the parametric body and non-parametric clothing from monocular videos.
Capturing clothed humans from images. Clothing is more complex than the body in terms of geometry, non-rigid deformation, and appearance, making the capture of clothing from images challenging. Mesh-based methods to capture clothing often use additional vertex offsets relative to the body mesh [Alldieck et al. 2019a, 2018a,b, 2019b; Jin et al. 2020; Lazova et al. 2019; Ma et al. 2020a,b]. While such an approach works well for clothing that is similar to the body, it does not capture clothing of varied topology like skirts and dresses.
To handle clothing shape variations, recent methods exploit non-parametric models. For example, [He et al. 2021; Huang et al. 2020; Saito et al. 2019, 2020; Xiu et al. 2022; Zheng et al. 2021] extract pixel-aligned spatial features from images and map them to an implicit shape representation. To animate the captured non-parametric clothed humans, Yang et al. [2021] predict a skeleton and skinning weights from images to drive the representation. Although such non-parametric models capture varied clothing styles much better than mesh-based approaches, faces and hands are usually poorly recovered due to the lack of a strong prior on what the human body should look like. In addition, such approaches typically require a large set of manually cleaned 3D scans as training data.
Recently, various methods recover 3D clothed humans directly from multi-view or monocular RGB videos [Chen et al. 2021b; Jiang et al. 2022; Liu et al. 2021b; Peng et al. 2021a, 2022, 2021b; Su et al. 2021;