artist-defined hand rig. Because it fuses models learned in isolation, Frank
looks unrealistic. SMPL-X [55] learns an expressive body model that fuses the
pose blendshapes of the MANO hand model [21] with the expression space of the
FLAME head model [13]. However, since MANO and FLAME are learned in isolation from
the body, they do not capture the full degrees of freedom of the head and hands.
Thus, fusing the parameters results in artifacts at the boundaries. In contrast
to the construction of Frank and SMPL-X, for SUPR, we start with a coherent
full-body model, trained on a federated dataset of body, hand, head, and foot
scans, and then separate the model into individual body parts. Xu et al. [49] propose
GHUM & GHUML, which are trained on a federated dataset of 60K head, hand
and body scans and use a fully connected neural network architecture to predict
the pose deformation. The GHUM model cannot be separated into body parts
because its dense, fully connected formulation relates all the vertices
to all the joints in the model's kinematic tree. In contrast, SUPR's factorized
representation of the pose space deformations enables seamless separation of the
body into head, hand, and foot models.
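The difference between a dense and a factorized pose-deformation model can be sketched in a few lines of NumPy. The sizes, the one-dimensional per-joint pose features, and the hard block-sparsity pattern below are illustrative simplifications for exposition, not SUPR's actual learned regressors:

```python
import numpy as np

rng = np.random.default_rng(0)
n_verts, n_joints = 120, 12                # toy sizes, not the real model's

# Hypothetical part layout: the first 30 vertices / 3 joints form the "head".
head_verts, head_joints = slice(0, 30), slice(0, 3)

# Factorized pose correctives: a per-vertex/per-joint sparsity mask restricts
# each vertex's corrective offset to its own part's joints (a simplification
# of learning a sparse vertex-joint association).
mask = np.zeros((n_verts, n_joints))
mask[head_verts, head_joints] = 1.0        # head verts depend on head joints
mask[30:, 3:] = 1.0                        # body verts depend on body joints

B = rng.normal(size=(n_verts, n_joints)) * mask  # block-sparse blendshapes
pose = rng.normal(size=n_joints)                 # toy 1-D pose feature/joint

full_offsets = B @ pose                          # full-body correctives
# Because B is block-sparse, slicing out the head rows *and* the head joint
# columns reproduces the full model's head offsets exactly:
head_offsets = B[head_verts, head_joints] @ pose[head_joints]
assert np.allclose(full_offsets[head_verts], head_offsets)
```

A dense formulation (mask of all ones) fails this check: the sliced-out head rows retain nonzero dependence on the body joints, which is why a fully connected model like GHUM cannot be split into part models by slicing.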
Head Models: There are many models of 3D head shape [57,58,59], shape
and expression [10,11,12,14,15,16,17] or shape, pose and expression [13]. We
focus here on models with a full head template, including a neck. The FLAME
head model [13], like SMPL, uses a dense pose corrective blendshape formulation
that relates all vertices to all joints. Xu et al. [49] also propose GHUM-Head,
where the template is based on the GHUM head, with a retrained pose-dependent
corrector network (PSD). Both GHUM-Head and FLAME are trained in
isolation from the body and do not have sufficient joints to model the full
degrees of freedom of the head. In contrast to the previous methods, SUPR-Head
is trained jointly with the body on a federated dataset of head and body
meshes, and it has more joints than GHUM-Head or FLAME; we show that both are
crucial to model the head's full range of motion.
Hand Models: MANO [21] is widely used and is based on the SMPL formulation,
where the pose-corrective blendshape deformations are regularised to be
local. The kinematic tree of MANO is based on spherical joints, allowing
redundant degrees of freedom for the fingers. Xu et al. [49] introduce the GHUM-Hand
model where they separate the hands from the template mesh of GHUM and
train a hand-specific pose-dependent corrector network (PSD). Both MANO and
GHUM-Hand are trained in isolation from the body and produce implausible
deformations around the wrist. SUPR-Hand is trained jointly with the body
and has a wrist joint, which is critical to model the hands' full range of motion.
Foot Models: Statistical shape models of the feet are less studied than those
of the body, head, and hands. Conard et al. [60] propose a statistical shape model
of the human foot, which is a PCA space learned from static foot scans. However,
the human feet deform with motion, and models learned from static scans
cannot capture the complexity of 3D foot deformations. To address the limitations
of static scans, Boppana et al. [61] propose the DynaMo system to capture scans
of the feet in motion and learn a PCA-based model from the scans. However,