tributions and stochastic perturbations of body shapes from
CAESAR. In contrast to both these methods, our approach
seeks adversarial samples in the low-performance regime
of BMnet, enabling automatic discovery and mitigation of
weaknesses in both the dataset and the network in a principled manner.
Synthesis for training: With advances in simulation
quality and realism, it has become increasingly common
to train deep neural networks using synthetic data [21,24,
40,59,63]. Recently, there have been attempts at learn-
ing to adapt distributions of generated synthetic data to im-
prove model training [2,6–8,25,46,66,74,91]. These
approaches focus on approximating a distribution that is
either similar to the natural test distribution or that mini-
mizes prediction error. Another flavor of approaches probes
the weaknesses of machine learning models using synthetic
data [28,37,38,48,56,67,77]. The works of [1,76,95]
generate robust synthetic training data for object recogni-
tion and visual-question-answering by varying scene pa-
rameters such as pose and lighting, while preserving object
characteristics. Shen et al. [75] tackle vehicle self-driving
by introducing adversarial camera corruptions in training.
In our work, we explore the impact of varying interpretable
parameters that directly control human body shape.
Adversarial techniques: We take inspiration from the
literature on adversarial attacks of neural networks [12,26,
52,81] and draw from ideas for improving network robust-
ness by training on images that have undergone white-box
adversarial attacks [47]. The main difference lies in the
search space: previous works search the image space while
we search the interpretable latent shape space of the body
model. The works by [58,65] find synthetic adversarial
samples for faces using either a GAN or a face simula-
tor. They are successful in finding interpretable attributes
leading to false predictions; however, they do not incorpo-
rate this knowledge in training to improve predictions on
real examples. In our work, we both discover adversarial
samples and use them in training to improve body measure-
ment estimation. Different from previous methods, we find
adversarial bodies by searching the latent space of a body
simulator comprising a pipeline of differentiable submodules:
a 3D body shape model, a body measurement
estimation network, height and weight regressors, and a renderer
based on a soft rasterizer [43].
Datasets: Widely used human body datasets such as
CAESAR [60] contain large volumes of 3D scans and body
measurements; however, these do not include real images,
which must therefore be simulated from the scans
with a virtual camera. Recently, Yan et al. [90] published
the BodyFit dataset comprising over 4K body scans from
which body measurements are computed, and silhouettes
are simulated. They also present a small collection of pho-
tographs and tape measurements of 194 subjects. To resolve
scale, they assume a fixed camera distance. Our BodyM is
the first large-scale dataset comprising body measurements
paired with silhouettes obtained by applying semantic seg-
mentation on real photographs. To resolve scale, we store
height and weight (easy to acquire) rather than assume fixed
camera distance (hard to enforce in practice).
3. Method
We use the SMPL model [45] as our basis for adversarial
body simulation. SMPL characterizes the human form
in terms of a finite number of shape parameters β and pose
parameters θ. Shape is modeled as a linear weighted combination
of basis shapes (with weights β) derived from the
CAESAR dataset, while pose is modeled as local 3D rotation
angles θ on 24 skeleton joints. SMPL learns a regressor
M(β, θ) for generating an articulated body mesh of 6890
vertices from specified shape and pose using blend shapes.
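The shape component above can be sketched as a simple linear blend. The following is a minimal toy illustration, not the real SMPL model: the template and basis shapes here are random placeholders, whereas SMPL's are learned from CAESAR, and the full model additionally applies pose-dependent blend shapes and skinning.

```python
import numpy as np

# Toy sketch of SMPL-style linear shape blending:
#   vertices = T + sum_i beta_i * S_i
# where T is the mean template and S_i are basis shapes.
# Dimensions follow the text (6890 vertices); the data is random.
rng = np.random.default_rng(0)
n_vertices, n_betas = 6890, 10

template = rng.standard_normal((n_vertices, 3))            # mean shape T
shapedirs = rng.standard_normal((n_vertices, 3, n_betas))  # basis shapes S_i
beta = rng.standard_normal(n_betas)                        # shape weights β

# Linear weighted combination of basis shapes added to the template.
shaped_vertices = template + shapedirs @ beta              # (6890, 3)
print(shaped_vertices.shape)  # (6890, 3)
```

Because the blend is linear in β, the mapping from shape parameters to vertices is differentiable, which is what later enables gradient-based search over the latent shape space.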
3.1. Body Measurement Estimation Network
BMnet takes as input either single- or multi-view silhouette
masks. For single-view input, only a frontal segmentation
mask is used. For multi-view input, the model also leverages the
lateral silhouette, which provides crucial cues for accurate
measurement in the chest and waist areas. Additionally, we
use height and weight as input metadata. Height removes
the ambiguity in scale when predicting measurements from
subjects with variable distance to the camera, while weight
provides important cues for body size and shape. Our multi-
view measurement estimation network can be written as:
y = f_ψ(x_f, x_l, ξ, ω),    (1)

where x_f and x_l are respectively the frontal and lateral silhouettes,
(ξ, ω) are the height and weight of the subject,
and ψ represents the network weights.
The network architecture comprises an MNASNet backbone
[82] with a depth multiplier of 1 to extract features
from the silhouettes. Each silhouette is of size 640×480,
and the two views are concatenated spatially to form a
640×960 image. Constant-valued images of the same size
representing height and weight are then concatenated depth-wise
with the silhouettes to produce an input tensor of dimension
3×640×960 for the network. The resulting feature
maps from MNASNet are fed into an MLP comprising a
hidden layer of 128 neurons and 14 outputs corresponding
to body measurements. Unlike previous approaches that at-
tempt the highly ambiguous problem of predicting a high-
dimensional body mesh and then subsequently computing
the measurements from the mesh [19], we directly regress
measurements, thus requiring a simpler architecture and ob-
viating the need for storing 3D body mesh ground truth.
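The input assembly described above can be sketched as follows. This is a minimal illustration of the tensor layout only, assuming binary masks; the silhouettes here are synthetic placeholders rather than real segmentation outputs, and the metadata values are arbitrary examples.

```python
import numpy as np

# Sketch of BMnet's input tensor assembly (shapes taken from the text).
frontal = np.zeros((640, 480), dtype=np.float32)  # frontal silhouette x_f
lateral = np.zeros((640, 480), dtype=np.float32)  # lateral silhouette x_l
height_m, weight_kg = 1.75, 70.0                  # metadata (ξ, ω), example values

# Concatenate the two views spatially into a single 640x960 image.
silhouettes = np.concatenate([frontal, lateral], axis=1)   # (640, 960)

# Constant-valued planes for height and weight, stacked depth-wise
# with the silhouette image to form the 3-channel network input.
height_plane = np.full_like(silhouettes, height_m)
weight_plane = np.full_like(silhouettes, weight_kg)
x = np.stack([silhouettes, height_plane, weight_plane], axis=0)
print(x.shape)  # (3, 640, 960)
```

Encoding the scalar metadata as constant image planes lets a standard convolutional backbone consume height and weight alongside the silhouettes without architectural changes.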
3.2. Adversarial Body Simulator
We present an adversarial body simulator (ABS) that
searches the latent shape space of the SMPL model in order