SUPR: A Sparse Unified Part-Based Human
Representation
Ahmed A. A. Osman1, Timo Bolkart1, Dimitrios Tzionas2, and
Michael J. Black1
1Max Planck Institute for Intelligent Systems, T¨ubingen, Germany
2University of Amsterdam
{aosman,tbolkart,black}@tuebingen.mpg.de,d.tzionas@uva.nl
Abstract. Statistical 3D shape models of the head, hands, and full body are widely used in computer vision and graphics. Despite their wide use, we show that existing models of the head and hands fail to capture the full range of motion for these parts. Moreover, existing work largely ignores the feet, which are crucial for modeling human movement and have applications in biomechanics, animation, and the footwear industry. The problem is that previous body part models are trained using 3D scans that are isolated to the individual parts. Such data does not capture the full range of motion for such parts, e.g. the motion of the head relative to the neck. Our observation is that full-body scans provide important information about the motion of the body parts. Consequently, we propose a new learning scheme that jointly trains a full-body model and specific part models using a federated dataset of full-body and body-part scans. Specifically, we train an expressive human body model called SUPR (Sparse Unified Part-Based Representation), where each joint strictly influences a sparse set of model vertices. The factorized representation enables separating SUPR into an entire suite of body part models: an expressive head (SUPR-Head), an articulated hand (SUPR-Hand), and a novel foot (SUPR-Foot). Note that feet have received little attention and existing 3D body models have highly under-actuated feet. Using novel 4D scans of feet, we train a model with an extended kinematic tree that captures the range of motion of the toes. Additionally, feet deform due to ground contact. To model this, we include a novel non-linear deformation function that predicts foot deformation conditioned on the foot pose, shape, and ground contact. We train SUPR on an unprecedented number of scans: 1.2 million body, head, hand, and foot scans. We quantitatively compare SUPR and the separate body parts to existing expressive human body models and body-part models and find that our suite of models generalizes better and captures the body parts' full range of motion. SUPR is publicly available for research purposes at http://supr.is.tue.mpg.de
1 Introduction
Generative 3D models of the human body and its parts play an important role
in understanding human behaviour. Over the past two decades, numerous 3D
arXiv:2210.13861v1 [cs.CV] 25 Oct 2022
Fig. 1: Expressive part-based human body model. SUPR is a factorized
representation of the human body that can be separated into a full suite of body
part models.
models of the body [1,2,3,4,5,6,7,8,9], face [10,11,12,13,14,15,16,17] and
hands [18,19,20,21,22,23] have been proposed. Such models have enabled a myriad of applications, ranging from reconstructing bodies [24,25,26], faces [27,28,29], and hands [30,31] from images and videos, to modeling human interactions [32], generating 3D clothed humans [33,34,35,36,37,38,39], and generating humans in scenes [40,41,42]. They are also used as priors for fitting models to a wide
range of sensory input measurements like motion capture markers [43,44] or
IMUs [45,46,47].
Hand [21,48,22,49], head [12,13,49] and body [6,7] models are typically
built independently. Heads and hands are captured with a 3D scanner in which a
subject remains static, while the face and hands are articulated. This data is unnatural as it does not capture how the body parts move together with the body. As a consequence, the construction of head/hand models implicitly assumes a static body and uses a simple kinematic tree that fails to model the full degrees of freedom of the head and hand. For example, in Fig. 2a we fit the FLAME head model
full degrees of freedom. For example, in Fig. 2a we fit the FLAME head model
[13] to a pose where the subject is looking right and find that FLAME exhibits a
significant error in the neck region. Similarly, we fit the MANO [21] hand model to a hand pose where the wrist is fully bent downwards. MANO fails to capture the wrist deformation that results from the bent wrist. This is a systematic
limitation of existing head/hand models, which cannot be addressed by simply
training on more data.
Another significant limitation of existing body-part models is the lack of
an articulated foot model in the literature. This is surprising given the many
applications of a 3D foot model in the design, sale, and animation of footwear.
Feet are also critical for human locomotion. Any biomechanical or physics-based
model must have realistic feet to be faithful. The feet on existing full body
models like SMPL are overly simplistic, have limited articulation, and do not
deform with contact as shown in Fig. 2b.
[Figure 2: (a) Body part model boundary error (registration, model fit, and error heatmap, 0 to 1 cm) for FLAME and MANO. (b) SMPL ground penetration (registration vs. SMPL fit).]
Fig. 2: Body part models failure cases. Left: Existing body part models such
as the FLAME [13] head model and the MANO [21] hand model fail to capture
the corresponding body part’s shape through the full range of motion. Fitting
FLAME to a subject looking left results in significant error in the neck region.
Similarly, fitting MANO to hands with a bent wrist results in significant error at the wrist region. Right: The foot of SMPL [6] fails to model deformations due to ground contact, hence penetrating the ground. Additionally, it has a limited number of joints to model the toes' articulation.
In contrast to the existing approaches, we propose to jointly train the full
human body and body part models together. We first train a new full-body model
called SUPR, with articulated hands and an expressive head using a federated
dataset of body, hand, head and foot scans. This joint learning captures the
full range of motion of the body parts along with the associated deformation.
Then, given the learned deformations, we separate the body model into body
part models. To enable separating SUPR into compact individual body parts we
learn a sparse factorization of the pose-corrective blend shape function, as shown in Fig. 1. The factorized representation of SUPR enables separating
SUPR into an entire suite of models: SUPR-Head, SUPR-Hand and SUPR-Foot.
A body part model is separated by considering all the joints that influence the set
of vertices defined by the body part template mesh. We show that the learned
kinematic tree structure for the head/hand contains significantly more joints
than commonly used by head/hand models. In contrast to the existing body
part models that are learned in isolation of the body, our training algorithm
unifies many disparate prior efforts and results in a suite of models that can
capture the full range of motion of the head, hands, and feet.
SUPR goes beyond existing statistical body models to include a novel foot
model. To do so, we extend the standard kinematic tree for the foot to allow more
degrees of freedom. To train the model, we capture foot scans using a custom 4D
foot scanner (see Sup. Mat.), where the foot is visible from all views, including
the sole of the foot which is imaged through a glass plate. This uniquely allows
us to capture how the foot is deformed by contact with the ground. We then
model this deformation as a function of body pose and contact.
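As a rough illustration of such a conditioned deformation function, a minimal sketch might look like the following. The architecture, feature sizes, and weights here are invented for illustration only; SUPR's actual deformation network is not specified in this section:

```python
import numpy as np

def foot_deformation(pose, shape, contact, params):
    """Toy one-hidden-layer MLP mapping (pose, shape, contact) features
    to per-vertex offsets, illustrating the conditioning only."""
    x = np.concatenate([pose, shape, contact])
    h = np.tanh(params['W1'] @ x + params['b1'])
    return (params['W2'] @ h + params['b2']).reshape(-1, 3)

# Hypothetical sizes: 9 pose dims, 4 shape dims, 2 contact indicators,
# hidden width 8, and a 5-vertex toy foot mesh (15 output values).
rng = np.random.default_rng(0)
params = {
    'W1': rng.standard_normal((8, 15)) * 0.1, 'b1': np.zeros(8),
    'W2': rng.standard_normal((15, 8)) * 0.1, 'b2': np.zeros(15),
}
offsets = foot_deformation(np.zeros(9), np.zeros(4), np.zeros(2), params)
print(offsets.shape)   # (5, 3)
```

With zero inputs and zero biases the sketch predicts zero offsets; a trained network would instead predict flattening of the sole when the contact indicators are active.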
We train SUPR on 1.2 million hand, head, foot, and body scans, which is
an order of magnitude more data than the largest training dataset reported
in the literature (60K GHUM [49]). The training data contains extreme body
shapes such as anorexia patients and body builders, 14K registrations from the CAESAR [50] and SizeUSA [51] datasets, and 7K foot registrations from the ANSUR II dataset [52]. All subjects gave informed written consent for participation and
the use of their data in statistical models. Capture protocols were reviewed by
the local university ethics board.
We quantitatively compare SUPR and the individual body-part models to
existing models including SMPL-X, GHUM, MANO, and FLAME. We find that
SUPR is more expressive, more accurate, and generalizes better. In summary,
our main contributions are: (1) A unified framework for learning both expressive
body models and a suite of high-fidelity body part models. (2) A novel 3D
articulated foot model that captures compression due to contact. (3) SUPR, a
sparse expressive and compact body model that generalizes better than existing
expressive human body models. (4) An entire suite of body part models for
the head, hand and feet, where the model kinematic tree and pose deformation
are learned instead of being artist-defined. (5) TensorFlow and PyTorch implementations of all the models are publicly available for research purposes.
2 Related Work
Body Models: SCAPE [2] is the first 3D model to factor body shape into
separate pose and a shape spaces. SCAPE is based on triangle deformations
and is not compatible with existing graphics pipelines. In contrast, SMPL [6]
is the first learned statistical body model compatible with game engines. SMPL
is a vertex-based model with linear blend skinning (LBS) and learned pose and
shape corrective blendshapes. A key drawback of SMPL is that it relates the pose corrective blendshapes to the elements of the part rotation matrices of all the model joints in the kinematic tree. Consequently, it learns spurious long-range correlations in the training data. STAR [7] addresses many of the drawbacks of SMPL by using a compact representation of the kinematic tree based on quaternions and learning sparse pose corrective blendshapes where each joint strictly influences a sparse set of the model vertices. The pose corrective blendshape
formulation in SUPR is based on STAR. Also related to our work, the Stitched
Puppet [53] is a part-based model of the human body. The body is segmented
into 16 independent parts with learned pose and shape corrective blendshapes.
A pairwise stitching function fuses the parts, but leaves visible discontinuities.
While SUPR is also a part-based model, we start with a unified model and learn
its segmentation into parts during training from a federated training dataset.
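A minimal sketch of the sparse pose-corrective idea inherited from STAR, in which each joint stores corrective offsets only for its own vertex subset, might look like this (toy sizes and numbers, not the real model's):

```python
import numpy as np

V = 6  # toy vertex count

# Hypothetical per-joint sparse corrective bases: joint k stores offsets
# only for the vertices it influences (illustrative numbers, not SUPR's).
bases = {
    0: (np.array([0, 1]), np.full((2, 3), 0.1)),     # joint 0 -> vertices 0, 1
    1: (np.array([1, 2, 3]), np.full((3, 3), 0.2)),  # joint 1 -> vertices 1-3
}

def sparse_pose_corrective(pose_activations, bases, num_verts):
    """Sum per-joint corrective offsets. Because each joint touches only
    its own vertex subset, the corrective factorizes over body parts."""
    offsets = np.zeros((num_verts, 3))
    for k, (idx, basis) in bases.items():
        offsets[idx] += pose_activations[k] * basis
    return offsets

offsets = sparse_pose_corrective(np.array([1.0, 2.0]), bases, V)
print(offsets[1])   # vertex 1 accumulates 1.0*0.1 + 2.0*0.2 = 0.5 per axis
```

Extracting a part model then amounts to slicing the vertex rows of the part and keeping only the joints whose index sets intersect them, which is what the dense all-joints-to-all-vertices formulation of SMPL or GHUM does not allow.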
Expressive Body Models: The most related to SUPR are expressive body
models such as Frank [54], SMPL-X [55], and GHUM & GHUML [49,56]. Frank
[54] merges the body of SMPL [6] with the FaceWarehouse [12] face model and an
artist-defined hand rig. Due to the fusion of different models learned in isolation,
Frank looks unrealistic. SMPL-X [55] learns an expressive body model and fuses
the MANO hand model [21] pose blendshapes and the FLAME head model [13]
expression space. However, since MANO and FLAME are learned in isolation of
the body, they do not capture the full degrees of freedom of the head and hands.
Thus, fusing the parameters results in artifacts at the boundaries. In contrast
to the construction of Frank and SMPL-X, for SUPR, we start with a coherent
full body model, trained on a federated dataset of body, hand, head and feet
scans, then separate the model into individual body parts. Xu et al. [49] propose
GHUM & GHUML, which are trained on a federated dataset of 60K head, hand, and body scans and use a fully connected neural network architecture to predict the pose deformation. The GHUM model cannot be separated into body parts
as a result of the dense fully connected formulation that relates all the vertices
to all the joints in the model kinematic tree. In contrast, the SUPR factorized
representation of the pose space deformations enables seamless separation of the
body into head/hand and foot models.
Head Models: There are many models of 3D head shape [57,58,59], shape
and expression [10,11,12,14,15,16,17] or shape, pose and expression [13]. We
focus here on models with a full head template, including a neck. The FLAME
head model [13], like SMPL, uses a dense pose corrective blendshape formulation
that relates all vertices to all joints. Xu et al. [49] also propose GHUM-Head,
where the template is based on the GHUM head with a retrained pose-dependent corrector network (PSD). Both GHUM-Head and FLAME are trained in
isolation of the body and do not have sufficient joints to model the full head
degrees of freedom. In contrast to the previous methods, SUPR-Head is trained
jointly with the body on a federated dataset of head and body meshes, which
is critical to model the head's full range of motion. It also has more joints than GHUM-Head or FLAME, which we show is crucial for capturing that range of motion.
Hand Models: MANO [21] is widely used and is based on the SMPL formulation, where the pose corrective blendshape deformations are regularized to be local. The kinematic tree of MANO is based on spherical joints, allowing redundant degrees of freedom for the fingers. Xu et al. [49] introduce the GHUM-Hand
model where they separate the hands from the template mesh of GHUM and
train a hand-specific pose-dependent corrector network (PSD). Both MANO and GHUM-Hand are trained in isolation of the body and result in implausible deformations around the wrist area. SUPR-Hand is trained jointly with the body
and has a wrist joint, which is critical to model the hand's full range of motion.
Foot Models: Statistical shape models of the feet are less studied than those
of the body, head, and hands. Conard et al. [60] propose a statistical shape model
of the human foot, which is a PCA space learned from static foot scans. However,
human feet deform during motion, and models learned from static scans cannot capture the complexity of 3D foot deformations. To address the limitations
of static scans, Boppana et al. [61] propose the DynaMo system to capture scans
of the feet in motion and learn a PCA-based model from the scans. However,