Decanus to Legatus: Synthetic training for 2D-3D human pose lifting*

Yue Zhu [0000-0002-0914-4815] and David Picard [0000-0002-6296-4222]

LIGM, École des Ponts, Univ Gustave Eiffel, CNRS, Marne-la-Vallée, France
{yue.zhu, david.picard}@enpc.fr
https://github.com/Zhuyue0324/Decanus-to-Legatus

* This work was granted access to the HPC resources of IDRIS under the allocation 2021-AD011012640 made by GENCI, and was supported and funded by Ergonova.
Abstract. 3D human pose estimation is a challenging task because of the difficulty of acquiring ground-truth data outside of controlled environments. A number of further issues hinder progress towards a universal and robust model for this task, including domain gaps between datasets, actions present at test time but unseen during training, varying hardware settings, and the high cost of annotation. In this paper, we propose an algorithm that generates an unlimited number of synthetic 3D human poses (Legatus) from a 3D pose distribution seeded by 10 handcrafted 3D poses (Decanus), on the fly during the training of a 2D-to-3D human pose lifting neural network. Our results show that we can achieve 3D pose estimation performance comparable to methods using real data from specialized datasets, but in a zero-shot setup, demonstrating the generalization potential of our framework.
Keywords: 3D Human pose · Synthetic training · Zero-shot
1 Introduction
3D human pose estimation from single images [1] is a challenging yet very important topic in computer vision because of its numerous applications, from pedestrian movement prediction to sports analysis. Given an RGB image, the system predicts the 3D positions of the key body joints of the human(s) in the image. Recent deep learning methods have shown very promising results on this topic [6,21,26,48,50]. Existing discriminative 3D human pose estimation methods, in which the neural network directly outputs the positions, fall into two categories: one-stage methods, which directly estimate the 3D poses in world or camera space [29,34], and two-stage methods, which first estimate 2D human poses in image space and then lift the estimated 2D skeletons to 3D [18].
However, all these approaches require a massive amount of supervised data to train the neural network. Contrary to 2D annotations, obtaining the 3D annotations needed to train and evaluate these methods is, for technical reasons (motion capture systems, camera calibration, etc.), usually limited to controlled environments. This weakens generalization to in-the-wild images, which can contain unseen scenarios with different human appearances, backgrounds and camera parameters.

Fig. 1. The main idea of our synthetic generation method: use a hierarchical probabilistic tree and its per-joint distributions to generate realistic synthetic 3D human poses.
In comparison, obtaining 2D annotations is much easier, and many diverse in-the-wild 2D datasets already exist [3,22,51]. This makes 2D-to-3D pose lifting very appealing, since lifting methods can benefit from the more diverse 2D data at least for their 2D detection part. Since the lifting part does not require the input image but only the 2D keypoints, we infer that it can be trained without any real ground-truth 3D information. Training a 3D lifter without explicit 3D ground truth has previously been achieved by using multiple views and cross-view consistency to ensure correct 3D reconstructions [45]. However, multiple views can be cumbersome to acquire and are also limited to controlled environments.
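To make the lifting setup concrete, below is a minimal sketch of such a 2D-to-3D lifter written as a residual MLP in PyTorch. The joint count, layer width and depth are illustrative assumptions, not the architecture used in this paper.

import torch
import torch.nn as nn

class Lifter(nn.Module):
    """Minimal residual MLP lifting 2D keypoints to 3D (sketch only)."""

    def __init__(self, n_joints=17, width=1024):  # assumed hyperparameters
        super().__init__()
        self.inp = nn.Linear(n_joints * 2, width)
        self.block = nn.Sequential(
            nn.Linear(width, width), nn.ReLU(),
            nn.Linear(width, width), nn.ReLU(),
        )
        self.out = nn.Linear(width, n_joints * 3)

    def forward(self, kp2d):                      # kp2d: (batch, n_joints, 2)
        x = torch.relu(self.inp(kp2d.flatten(1)))
        x = x + self.block(x)                     # residual connection
        return self.out(x).view(-1, kp2d.shape[1], 3)

Because such a lifter never sees the image, any source of paired 2D/3D keypoints, including purely synthetic ones, can supervise it.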
To tackle this problem, we propose an algorithm that generates an unlimited stream of synthetic 3D human skeletons on the fly during the training of the lifter, starting from just a few handcrafted poses. This generator provides enough data to train a lifter to invert 2D projections of the generated skeletons back to 3D, and can also be used to generate multiple views for cross-view consistency. We introduce a Markov chain with a tree structure (a Markov tree), following a hierarchical parent-child joint order, which allows us to generate skeletons from a distribution that we evolve through time so as to increase the complexity of the generated poses (see Figure 1; a minimal sketch of this sampling scheme follows the list below). We evaluate our approach on the two benchmark datasets Human3.6M and MPI-INF-3DHP and achieve zero-shot results that are competitive with those of weakly supervised methods. To summarize, our contributions are:
– A 3D human pose generation algorithm following a probabilistic hierarchical architecture and a set of distributions, which uses zero real 3D pose data.
– A Markov tree model whose distributions evolve through time, allowing the generation of unseen human poses.
– A semi-automatic way to handcraft a few 3D poses that seed the initial distribution.
– Zero-shot results that are competitive with methods using real data.
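The following minimal sketch illustrates the announced sampling scheme: joints are drawn root-first along a kinematic tree, each joint placed in spherical coordinates relative to its parent. The tree layout, bone lengths and angular distributions below are toy assumptions; in the paper they are seeded from the handcrafted poses and evolve during training.

import numpy as np

# Toy kinematic tree: child -> parent (indices are illustrative, not
# the Human3.6M layout used in the paper).
PARENT = {1: 0, 2: 1, 3: 2, 4: 0, 5: 4}
# Per-bone length and per-joint angular distributions (mean/std of the
# polar and azimuthal angles, in radians) -- toy values.
BONE_LEN = {1: 0.5, 2: 0.45, 3: 0.45, 4: 0.25, 5: 0.25}
ANGLES = {j: dict(mu=(np.pi / 2, 0.0), sigma=(0.3, 0.5)) for j in PARENT}

def sample_pose(rng):
    """Sample one 3D skeleton root-first; each joint depends only on its parent."""
    pose = {0: np.zeros(3)}                      # pelvis/root at the origin
    for j, p in PARENT.items():                  # parents are drawn before children
        mu_t, mu_p = ANGLES[j]["mu"]
        s_t, s_p = ANGLES[j]["sigma"]
        theta = rng.normal(mu_t, s_t)            # polar angle
        phi = rng.normal(mu_p, s_p)              # azimuthal angle
        r = BONE_LEN[j]                          # fixed 3D limb length
        offset = r * np.array([np.sin(theta) * np.cos(phi),
                               np.sin(theta) * np.sin(phi),
                               np.cos(theta)])
        pose[j] = pose[p] + offset               # Markov property along the tree
    return np.stack([pose[j] for j in sorted(pose)])

rng = np.random.default_rng(0)
skeleton = sample_pose(rng)                      # shape (6, 3)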
2 Related work
Monocular 3D human pose estimation. In recent years, monocular 3D human pose estimation has been widely explored in the community. Models can be mainly categorized into generative models [2,4,7,24,33,39,47], which fit 3D parametric models to the image, and discriminative models, which directly learn 3D positions from images [1,38]. Generative models try to fit the shape of the entire body and as such are well suited for augmented reality or animation purposes [35]. However, they tend to be less precise than discriminative models. On the other hand, a difficulty for discriminative models is that depth information is hard to infer from a single image when it is not explicitly modeled; additional bias must thus be learned using 3D supervision [25,26], multiview spatial consistency [13,45,48] or temporal consistency [1,9,23]. Discriminative models can also be categorized into one-stage models, which predict 3D poses directly from images [14,25,29,34], and two-stage methods, which first learn a 2D pose estimator and then lift the obtained 2D poses to 3D [18,28,45,48,49,52]. Lifting 2D poses to 3D is somewhat of an ill-posed problem because of depth ambiguity. But the larger quantity and diversity of 2D datasets [3,22,51], as well as the much better performance already achieved in 2D human pose estimation, provide a strong argument for focusing on lifting 2D human poses to 3D.
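This depth ambiguity is easy to verify numerically: under a pinhole camera, scaling a pose by a factor s, which both enlarges it and moves it s times farther from the camera, leaves the 2D projection unchanged. A small sketch with an arbitrary focal length:

import numpy as np

def project(points3d, f=1000.0):
    """Pinhole projection onto the image plane (focal length f, in pixels)."""
    return f * points3d[:, :2] / points3d[:, 2:3]

rng = np.random.default_rng(0)
pose = rng.normal(size=(17, 3)) + np.array([0.0, 0.0, 5.0])  # person ~5 m away

s = 2.0
scaled = pose * s     # twice as large, twice as far: same rays to the camera

print(np.allclose(project(pose), project(scaled)))  # True: identical 2D pose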
Weak supervision methods. Since obtaining precise 3D annotations of human poses is hard for technical reasons and mostly limited to controlled environments, many works have tackled this problem by designing weak supervision methods that avoid using 3D annotations. For example, Iqbal et al. [18] apply a rigid-aligned multiview consistency 3D loss between multiple 3D poses estimated from different 2D views of the same 3D sample. Mitra et al. [30] learn 3D poses in a canonical form and enforce identical predicted poses across views. Fang et al. [13] propose a virtual mirror, so that the estimated 3D poses, after being symmetrically projected to the other side of the mirror, should still look correct, thus simulating another kind of 'multiview' consistency. Finally, Wandt et al. [45] learn lifted 3D poses in a canonical form together with a camera position, so that every 3D pose lifted from a different view of the same 3D sample retains 2D reprojection consistency. In our case, in addition to the 3D supervision obtained from our synthetic generation, we also use multiview consistency to improve training performance.
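As a concrete illustration of such a consistency term, the sketch below computes a rigid-aligned agreement loss between two 3D poses lifted from different views of the same sample, using an orthogonal Procrustes alignment. It is a generic example of the idea, not the exact loss of [18] or of our method.

import torch

def rigid_align(a, b):
    """Rotate and translate pose `a` onto pose `b` (orthogonal Procrustes)."""
    a0, b0 = a - a.mean(0), b - b.mean(0)
    u, _, vT = torch.linalg.svd(a0.T @ b0)
    s_diag = torch.ones(3)
    s_diag[2] = torch.sign(torch.det(vT.T @ u.T))  # avoid reflections
    r = vT.T @ torch.diag(s_diag) @ u.T
    return (r @ a0.T).T + b.mean(0)

def multiview_consistency_loss(pose_v1, pose_v2):
    """Penalize disagreement between two lifted poses up to a rigid transform."""
    return ((rigid_align(pose_v1, pose_v2) - pose_v2) ** 2).mean()

Since our generator can render multiple views of the same synthetic skeleton, such a term also requires no real data.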
Synthetic human pose training. Since the early days of the Kinect, synthetic training has been a popular option for estimating 3D human body pose [40]. The most common strategy is to perform data augmentation in order to increase the size and diversity of real datasets [16]. Others, like Sminchisescu et al. [43], render synthetically generated poses on natural indoor and outdoor image backgrounds. Okada et al. [32] generate synthetic human poses in a subspace constructed by PCA from the walking sequences of the CMU Mocap dataset [19]. Du et al. [12] create a synthetic height-map dataset to train a dual-stream convolutional network for 2D joint localization. Ghezelghieh et al. [15] use 3D graphics software and the CMU Mocap dataset to synthesize humans with different 3D poses and viewpoints. Pumarola et al. [36] created 3DPeople, a large-scale synthetic dataset of photo-realistic images with a large variety of subjects, activities and human outfits. Both [11] and [25] use pressure maps as input to estimate 3D human pose with synthetic data. In this paper, we are only interested in generating realistic 3D poses as sets of keypoints so as to train a 2D-to-3D lifting neural network. As such, we do not need to render visually realistic humans with meshes, textures and colors for this much simpler task.
Human pose prior. Since the human body is highly constrained, these constraints can be leveraged as an inductive bias in pose estimation. Bregler et al. [8] use a kinematic-chain human pose model that follows the skeletal structure, extended by Sigal et al. [42] with interpenetration constraints. Chow et al. [10] introduced the Chow-Liu tree, the maximum spanning tree over pairwise mutual information, to model pairs of joints that exhibit a high flow of information. Lehrmann et al. [20] use a Chow-Liu tree that maximizes an entropy function depending on nearest-neighbor distances, and learn local conditional distributions from data based on this tree structure. Sidenbladh et al. [41] use cylinders and spheres to model the human body. Akhter et al. [2] learn a joint-angle-limit prior in the local coordinate systems of three human body parts: torso, head, and upper legs. We use a variant of the kinematic model because 3D limb lengths are fixed no matter the view, which facilitates the generation of synthetic skeletons.
Cross-dataset generalization. Due to the diversity of human appearances and viewpoints, cross-dataset generalization has recently been the focus of several works. Wang et al. [46] learn to predict camera views so as to auto-adjust to different datasets. Li et al. [21] and Gong et al. [16] perform data augmentation to cover possible unseen poses in the test dataset. Rapczyński et al. [37] discuss several methods, including normalization and viewpoint estimation, for improving cross-dataset generalization. In our method, since we use purely synthetic data, we are always in a cross-dataset generalization setup.
3 Proposed method
The goal of our method is to create a simple synthetic human pose generation model that allows us to train on purely synthetic data, without any real 3D human pose information during the whole training procedure.
3.1 Synthetic human pose generation model
Local spherical coordinate system. Without loss of generality, we use the Human3.6M skeleton layout shown in Figure 2 (a) throughout the paper. To simplify human pose generation, we set the pelvis joint (joint 0) as the root joint and as the origin of the global Cartesian coordinate system, from which a tree structure is applied to generate joints one by one. We suppose that the position of one joint depends only on the position of its parent joint in the tree.
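Concretely, this assumption means each joint position can be obtained from its parent's by a spherical-to-Cartesian conversion. As a sketch, with the notation below ours (par(j) the parent of joint j, r_j the fixed limb length, (θ_j, φ_j) the sampled local angles) and the local-frame conventions left implicit:

\[
\mathbf{p}_j = \mathbf{p}_{\mathrm{par}(j)} + r_j
\begin{pmatrix}
\sin\theta_j \cos\varphi_j \\
\sin\theta_j \sin\varphi_j \\
\cos\theta_j
\end{pmatrix},
\qquad
\theta_j \sim p_{\theta,j}, \;\; \varphi_j \sim p_{\varphi,j}.
\]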