Decanus to Legatus: Synthetic training for 2D-3D human pose lifting*

Yue Zhu [0000-0002-0914-4815] and David Picard [0000-0002-6296-4222]

LIGM, École des Ponts, Univ Gustave Eiffel, CNRS, Marne-la-Vallée, France
{yue.zhu, david.picard}@enpc.fr
https://github.com/Zhuyue0324/Decanus-to-Legatus

* This work was granted access to the HPC resources of IDRIS under the allocation 2021-AD011012640 made by GENCI, and was supported and funded by Ergonova.
Abstract. 3D human pose estimation is a challenging task because of the difficulty of acquiring ground-truth data outside of controlled environments. A number of further issues hinder progress towards a universal and robust model for this task, including domain gaps between datasets, actions present at test time but unseen during training, varying hardware settings, and the high cost of annotation. In this paper, we propose an algorithm that generates an unlimited number of synthetic 3D human poses (Legatus) from a 3D pose distribution seeded by 10 handcrafted 3D poses (Decanus), on the fly during the training of a 2D-to-3D human pose lifting neural network. Our results show that we can achieve 3D pose estimation performance comparable to methods using real data from specialized datasets, but in a zero-shot setup, demonstrating the generalization potential of our framework.
Keywords: 3D Human pose · Synthetic training · Zero-shot
1 Introduction
3D human pose estimation from single images [1] is a challenging yet very important topic in computer vision because of its numerous applications, from pedestrian movement prediction to sports analysis. Given an RGB image, the system predicts the 3D positions of the key body joints of the human(s) in the image. Recent deep learning methods have shown very promising results on this topic [6,21,26,48,50]. Existing discriminative 3D human pose estimation methods, in which the neural network directly outputs the positions, fall into two categories: one-stage methods, which directly estimate the 3D poses in world or camera space [29,34], and two-stage methods, which first estimate 2D human poses in image space and then lift the estimated 2D skeletons to 3D [18].
However, all these approaches require a massive amount of supervised data to train the neural network. Contrary to 2D annotations, obtaining the 3D annotations needed to train and evaluate these methods is, for technical reasons (motion capture systems, camera calibration, etc.), usually limited to controlled environments. This weakens generalization to in-the-wild images, which can contain unseen scenarios with different human appearances, backgrounds and camera parameters.

Fig. 1. The main idea of our synthetic generation method: use a hierarchical probabilistic tree and its per-joint distributions to generate realistic synthetic 3D human poses.
In comparison, obtaining 2D annotations is much easier, and many diverse in-the-wild 2D datasets already exist [3,22,51]. This makes 2D-to-3D pose lifting very appealing, since lifting methods can benefit from the more diverse 2D data at least for their 2D detection part. Since the lifting part does not require the input image but only the 2D keypoints, we infer that it can be trained without any real ground-truth 3D information. Training a 3D lifter without explicit 3D ground truth has previously been achieved by using multiple views and cross-view consistency to ensure correct 3D reconstructions [45]. However, multiple views can be cumbersome to acquire and are also limited to controlled environments.
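To make the lifting setup concrete, below is a minimal sketch of such a 2D-to-3D lifter written as a residual MLP in PyTorch. The joint count, layer width and depth are illustrative assumptions, not the architecture used in this paper.

import torch
import torch.nn as nn

class Lifter(nn.Module):
    """Minimal residual MLP lifting 2D keypoints to 3D (sketch only)."""

    def __init__(self, n_joints=17, width=1024):  # assumed hyperparameters
        super().__init__()
        self.inp = nn.Linear(n_joints * 2, width)
        self.block = nn.Sequential(
            nn.Linear(width, width), nn.ReLU(),
            nn.Linear(width, width), nn.ReLU(),
        )
        self.out = nn.Linear(width, n_joints * 3)

    def forward(self, kp2d):                      # kp2d: (batch, n_joints, 2)
        x = torch.relu(self.inp(kp2d.flatten(1)))
        x = x + self.block(x)                     # residual connection
        return self.out(x).view(-1, kp2d.shape[1], 3)

Because such a lifter never sees the image, any source of paired 2D/3D keypoints, including purely synthetic ones, can supervise it.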
To tackle this problem, we propose an algorithm that generates an unlimited stream of synthetic 3D human skeletons on the fly during the training of the lifter, starting from just a few handcrafted poses. This generator provides enough data to train a lifter to invert 2D projections of the generated skeletons back to 3D, and can also be used to generate multiple views for cross-view consistency. We introduce a Markov chain with a tree structure (a Markov tree), following a hierarchical parent-child joint order, which allows us to generate skeletons from a distribution that we evolve through time so as to increase the complexity of the generated poses (see Figure 1; a minimal sketch of this sampling scheme follows the list below). We evaluate our approach on the two benchmark datasets Human3.6M and MPI-INF-3DHP and achieve zero-shot results that are competitive with those of weakly supervised methods. To summarize, our contributions are:
– A 3D human pose generation algorithm following a probabilistic hierarchical architecture and a set of distributions, which uses zero real 3D pose data.
– A Markov tree model whose distributions evolve through time, allowing the generation of unseen human poses.
– A semi-automatic way to handcraft a few 3D poses that seed the initial distribution.
– Zero-shot results that are competitive with methods using real data.
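The following minimal sketch illustrates the announced sampling scheme: joints are drawn root-first along a kinematic tree, each joint placed in spherical coordinates relative to its parent. The tree layout, bone lengths and angular distributions below are toy assumptions; in the paper they are seeded from the handcrafted poses and evolve during training.

import numpy as np

# Toy kinematic tree: child -> parent (indices are illustrative, not
# the Human3.6M layout used in the paper).
PARENT = {1: 0, 2: 1, 3: 2, 4: 0, 5: 4}
# Per-bone length and per-joint angular distributions (mean/std of the
# polar and azimuthal angles, in radians) -- toy values.
BONE_LEN = {1: 0.5, 2: 0.45, 3: 0.45, 4: 0.25, 5: 0.25}
ANGLES = {j: dict(mu=(np.pi / 2, 0.0), sigma=(0.3, 0.5)) for j in PARENT}

def sample_pose(rng):
    """Sample one 3D skeleton root-first; each joint depends only on its parent."""
    pose = {0: np.zeros(3)}                      # pelvis/root at the origin
    for j, p in PARENT.items():                  # parents are drawn before children
        mu_t, mu_p = ANGLES[j]["mu"]
        s_t, s_p = ANGLES[j]["sigma"]
        theta = rng.normal(mu_t, s_t)            # polar angle
        phi = rng.normal(mu_p, s_p)              # azimuthal angle
        r = BONE_LEN[j]                          # fixed 3D limb length
        offset = r * np.array([np.sin(theta) * np.cos(phi),
                               np.sin(theta) * np.sin(phi),
                               np.cos(theta)])
        pose[j] = pose[p] + offset               # Markov property along the tree
    return np.stack([pose[j] for j in sorted(pose)])

rng = np.random.default_rng(0)
skeleton = sample_pose(rng)                      # shape (6, 3)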
2 Related work
Monocular 3D human pose estimation. In recent years, monocular 3D human pose estimation has been widely explored in the community. Models can be mainly categorized into generative models [2,4,7,24,33,39,47], which fit 3D parametric models to the image, and discriminative models, which directly learn 3D positions from images [1,38]. Generative models try to fit the shape of the entire body and as such are well suited for augmented reality or animation purposes [35]. However, they tend to be less precise than discriminative models. On the other hand, a difficulty for discriminative models is that depth information is hard to infer from a single image when it is not explicitly modeled; additional bias must thus be learned using 3D supervision [25,26], multiview spatial consistency [13,45,48] or temporal consistency [1,9,23]. Discriminative models can also be categorized into one-stage models, which predict 3D poses directly from images [14,25,29,34], and two-stage methods, which first learn a 2D pose estimator and then lift the obtained 2D poses to 3D [18,28,45,48,49,52]. Lifting 2D poses to 3D is somewhat of an ill-posed problem because of depth ambiguity. But the larger quantity and diversity of 2D datasets [3,22,51], as well as the much better performance already achieved in 2D human pose estimation, provide a strong argument for focusing on lifting 2D human poses to 3D.
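This depth ambiguity is easy to verify numerically: under a pinhole camera, scaling a pose by a factor s, which both enlarges it and moves it s times farther from the camera, leaves the 2D projection unchanged. A small sketch with an arbitrary focal length:

import numpy as np

def project(points3d, f=1000.0):
    """Pinhole projection onto the image plane (focal length f, in pixels)."""
    return f * points3d[:, :2] / points3d[:, 2:3]

rng = np.random.default_rng(0)
pose = rng.normal(size=(17, 3)) + np.array([0.0, 0.0, 5.0])  # person ~5 m away

s = 2.0
scaled = pose * s     # twice as large, twice as far: same rays to the camera

print(np.allclose(project(pose), project(scaled)))  # True: identical 2D pose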
Weak supervision methods. Since obtaining precise 3D annotations of human poses is hard for technical reasons and mostly limited to controlled environments, many works have tackled this problem by designing weak supervision methods that avoid using 3D annotations. For example, Iqbal et al. [18] apply a rigid-aligned multiview consistency 3D loss between multiple 3D poses estimated from different 2D views of the same 3D sample. Mitra et al. [30] learn 3D poses in a canonical form and enforce identical predicted poses across views. Fang et al. [13] propose a virtual mirror, so that the estimated 3D poses, after being symmetrically projected to the other side of the mirror, should still look correct, thus simulating another kind of 'multiview' consistency. Finally, Wandt et al. [45] learn lifted 3D poses in a canonical form together with a camera position, so that every 3D pose lifted from a different view of the same 3D sample retains 2D reprojection consistency. In our case, in addition to the 3D supervision obtained from our synthetic generation, we also use multiview consistency to improve training performance.
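As a concrete illustration of such a consistency term, the sketch below computes a rigid-aligned agreement loss between two 3D poses lifted from different views of the same sample, using an orthogonal Procrustes alignment. It is a generic example of the idea, not the exact loss of [18] or of our method.

import torch

def rigid_align(a, b):
    """Rotate and translate pose `a` onto pose `b` (orthogonal Procrustes)."""
    a0, b0 = a - a.mean(0), b - b.mean(0)
    u, _, vT = torch.linalg.svd(a0.T @ b0)
    s_diag = torch.ones(3)
    s_diag[2] = torch.sign(torch.det(vT.T @ u.T))  # avoid reflections
    r = vT.T @ torch.diag(s_diag) @ u.T
    return (r @ a0.T).T + b.mean(0)

def multiview_consistency_loss(pose_v1, pose_v2):
    """Penalize disagreement between two lifted poses up to a rigid transform."""
    return ((rigid_align(pose_v1, pose_v2) - pose_v2) ** 2).mean()

Since our generator can render multiple views of the same synthetic skeleton, such a term also requires no real data.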
Synthetic human pose training. Since the early days of the Kinect, synthetic training has been a popular option for estimating 3D human body pose [40]. The most common strategy is to perform data augmentation in order to increase the size and diversity of real datasets [16]. Others, like Sminchisescu et al. [43], render synthetically generated poses on natural indoor and outdoor image backgrounds. Okada et al. [32] generate synthetic human poses in a subspace constructed by PCA from the walking sequences of the CMU Mocap dataset [19]. Du et al. [12] create a synthetic height-map dataset to train a dual-stream convolutional network for 2D joint localization. Ghezelghieh et al. [15] use 3D graphics software and the CMU Mocap dataset to synthesize humans with different 3D poses and viewpoints. Pumarola et al. [36] created 3DPeople, a large-scale synthetic dataset of photo-realistic images with a large variety of subjects, activities and human outfits. Both [11] and [25] use pressure maps as input to estimate 3D human pose with synthetic data. In this paper, we are only interested in generating realistic 3D poses as sets of keypoints so as to train a 2D-to-3D lifting neural network. As such, we do not need to render visually realistic humans with meshes, textures and colors for this much simpler task.
Human pose prior. Since the human body is highly constrained, these constraints can be leveraged as an inductive bias in pose estimation. Bregler et al. [8] use a kinematic-chain human pose model that follows the skeletal structure, extended by Sigal et al. [42] with interpenetration constraints. Chow et al. [10] introduced the Chow-Liu tree, the maximum spanning tree over pairwise mutual information, to model pairs of joints that exhibit a high flow of information. Lehrmann et al. [20] use a Chow-Liu tree that maximizes an entropy function depending on nearest-neighbor distances, and learn local conditional distributions from data based on this tree structure. Sidenbladh et al. [41] use cylinders and spheres to model the human body. Akhter et al. [2] learn a joint-angle-limit prior in the local coordinate systems of three human body parts: torso, head, and upper legs. We use a variant of the kinematic model because 3D limb lengths are fixed no matter the view, which facilitates the generation of synthetic skeletons.
Cross-dataset generalization. Due to the diversity of human appearances and viewpoints, cross-dataset generalization has recently been the focus of several works. Wang et al. [46] learn to predict camera views so as to auto-adjust to different datasets. Li et al. [21] and Gong et al. [16] perform data augmentation to cover possible unseen poses in the test dataset. Rapczyński et al. [37] discuss several methods, including normalization and viewpoint estimation, for improving cross-dataset generalization. In our method, since we use purely synthetic data, we are always in a cross-dataset generalization setup.
3 Proposed method
The goal of our method is to create a simple synthetic human pose generation model that allows us to train on purely synthetic data, without any real 3D human pose information during the whole training procedure.
3.1 Synthetic human pose generation model
Local spherical coordinate system. Without loss of generality, we use the Human3.6M skeleton layout shown in Figure 2 (a) throughout the paper. To simplify human pose generation, we set the pelvis joint (joint 0) as the root joint and as the origin of the global Cartesian coordinate system, from which a tree structure is applied to generate joints one by one. We suppose that the position of one joint depends only on the position of its parent joint in the tree.
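Concretely, this assumption means each joint position can be obtained from its parent's by a spherical-to-Cartesian conversion. As a sketch, with the notation below ours (par(j) the parent of joint j, r_j the fixed limb length, (θ_j, φ_j) the sampled local angles) and the local-frame conventions left implicit:

\[
\mathbf{p}_j = \mathbf{p}_{\mathrm{par}(j)} + r_j
\begin{pmatrix}
\sin\theta_j \cos\varphi_j \\
\sin\theta_j \sin\varphi_j \\
\cos\theta_j
\end{pmatrix},
\qquad
\theta_j \sim p_{\theta,j}, \;\; \varphi_j \sim p_{\varphi,j}.
\]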