artist-defined hand rig. Because it fuses models learned in isolation, Frank
looks unrealistic. SMPL-X [55] learns an expressive body model that fuses the
pose blendshapes of the MANO hand model [21] with the expression space of the
FLAME head model [13]. However, since MANO and FLAME are learned in isolation from
the body, they do not capture the full degrees of freedom of the head and hands.
Thus, fusing the parameters results in artifacts at the boundaries. In contrast
to the construction of Frank and SMPL-X, for SUPR, we start with a coherent
full-body model, trained on a federated dataset of body, hand, head, and foot
scans, and then separate the model into individual body parts. Xu et al. [49] propose
GHUM & GHUML, which are trained on a federated dataset of 60K head, hand
and body scans and use a fully connected neural network architecture to predict
the pose deformation. The GHUM model cannot be separated into body parts
because its dense, fully connected formulation relates all the vertices
to all the joints in the model's kinematic tree. In contrast, SUPR's factorized
representation of the pose space deformations enables seamless separation of the
body into head, hand, and foot models.
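The difference between a dense and a factorized pose-deformation model can be sketched in a few lines of NumPy. The sizes, the one-dimensional per-joint pose features, and the hard block-sparsity pattern below are illustrative simplifications for exposition, not SUPR's actual learned regressors:

```python
import numpy as np

rng = np.random.default_rng(0)
n_verts, n_joints = 120, 12                # toy sizes, not the real model's

# Hypothetical part layout: the first 30 vertices / 3 joints form the "head".
head_verts, head_joints = slice(0, 30), slice(0, 3)

# Factorized pose correctives: a per-vertex/per-joint sparsity mask restricts
# each vertex's corrective offset to its own part's joints (a simplification
# of learning a sparse vertex-joint association).
mask = np.zeros((n_verts, n_joints))
mask[head_verts, head_joints] = 1.0        # head verts depend on head joints
mask[30:, 3:] = 1.0                        # body verts depend on body joints

B = rng.normal(size=(n_verts, n_joints)) * mask  # block-sparse blendshapes
pose = rng.normal(size=n_joints)                 # toy 1-D pose feature/joint

full_offsets = B @ pose                          # full-body correctives
# Because B is block-sparse, slicing out the head rows *and* the head joint
# columns reproduces the full model's head offsets exactly:
head_offsets = B[head_verts, head_joints] @ pose[head_joints]
assert np.allclose(full_offsets[head_verts], head_offsets)
```

A dense formulation (mask of all ones) fails this check: the sliced-out head rows retain nonzero dependence on the body joints, which is why a fully connected model like GHUM cannot be split into part models by slicing.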
Head Models: There are many models of 3D head shape [57,58,59], shape
and expression [10,11,12,14,15,16,17] or shape, pose and expression [13]. We
focus here on models with a full head template, including a neck. The FLAME
head model [13], like SMPL, uses a dense pose corrective blendshape formulation
that relates all vertices to all joints. Xu et al. [49] also propose GHUM-Head,
where the template is based on the GHUM head, with a retrained pose-dependent
corrector network (PSD). Both GHUM-Head and FLAME are trained in
isolation from the body and do not have sufficient joints to model the full
degrees of freedom of the head. In contrast to the previous methods, SUPR-Head
is trained jointly with the body on a federated dataset of head and body
meshes, and it has more joints than GHUM-Head or FLAME; we show that both are
crucial to model the head's full range of motion.
Hand Models: MANO [21] is widely used and is based on the SMPL formulation,
where the pose-corrective blendshape deformations are regularised to be
local. The kinematic tree of MANO is based on spherical joints, allowing
redundant degrees of freedom for the fingers. Xu et al. [49] introduce the GHUM-Hand
model where they separate the hands from the template mesh of GHUM and
train a hand-specific pose-dependent corrector network (PSD). Both MANO and
GHUM-Hand are trained in isolation from the body and produce implausible
deformations around the wrist. SUPR-Hand is trained jointly with the body
and has a wrist joint, which is critical to model the hands' full range of motion.
Foot Models: Statistical shape models of the feet are less studied than those
of the body, head, and hands. Conard et al. [60] propose a statistical shape model
of the human foot, which is a PCA space learned from static foot scans. However,
the human feet deform with motion, and models learned from static scans
cannot capture the complexity of 3D foot deformations. To address the limitations
of static scans, Boppana et al. [61] propose the DynaMo system to capture scans
of the feet in motion and learn a PCA-based model from the scans. However,