Multi-Person 3D Pose and Shape Estimation
via Inverse Kinematics and Refinement
Junuk Cha1[0000-0003-2321-2797], Muhammad Saqlain1,2[0000-0001-5877-6432],
GeonU Kim1, Mingyu Shin1,3, and Seungryul Baek1[0000-0002-0856-6880]
1 UNIST, South Korea 2 eSmart Systems, Norway 3 Yeongnam Univ.,
South Korea
Abstract. Estimating 3D poses and shapes in the form of meshes from monocular RGB images is challenging: it is clearly more difficult than estimating 3D poses alone in the form of skeletons or heatmaps. When interacting persons are involved, 3D mesh reconstruction becomes even more challenging due to the ambiguity introduced by person-to-person occlusions. To tackle these challenges, we propose a coarse-to-fine pipeline that benefits from 1) inverse kinematics applied to occlusion-robust 3D skeleton estimates and 2) Transformer-based relation-aware refinement. In our pipeline, we first obtain occlusion-robust 3D skeletons for multiple persons from an RGB image. Then, we apply inverse kinematics to convert the estimated skeletons into deformable 3D mesh parameters. Finally, we apply a Transformer-based mesh refinement that adjusts the obtained mesh parameters considering intra- and inter-person relations of the 3D meshes. Via extensive experiments, we demonstrate the effectiveness of our method, outperforming state-of-the-art methods on the 3DPW, MuPoTS and AGORA datasets.
Keywords: Multi-person, 3D mesh reconstruction, Transformer
1 Introduction
Recovering 3D human body meshes for a single person or multiple persons from a monocular RGB image has made great progress in recent years [3, 10, 12, 17, 23, 27, 28, 30–33, 38, 39, 61, 64, 69, 71, 73]. The technique is essential for understanding people's behaviors, intentions and person-to-person interactions. It has a wide range of real-world applications such as human motion imitation [41], virtual try-on [47], motion capture [45], action recognition [5, 57, 66], etc.
Recently, deep convolutional neural network-based mesh reconstruction methods [6,10,12,17,23,27,28,30–33,38,39,61,64,69,71,73] have shown practical performance on in-the-wild scenes [21,25,44,68]. Most existing 3D human body pose and shape estimation approaches [6, 10, 17, 27, 28, 30–33, 38, 39, 69]
This research was conducted when Dr. Saqlain was a post-doctoral researcher at UNIST, and when Mr. Kim and Mr. Shin were undergraduate interns at UNIST.
arXiv:2210.13529v2 [cs.CV] 30 Oct 2022
2 Cha et al.
Fig. 1: Example outputs from our pipeline: (a) input RGB image, (b) initial
skeleton estimation results obtained from the input image, (c) initial meshes
obtained from the inverse kinematics process, (d) refined meshes obtained from
the refinement Transformer, (e, f) top- and side-views for the refined meshes.
achieved promising results for single-person cases. Generally, they first crop the region containing a person from the input image using a bounding box, and then extract features for each detected person, which are further used for 3D human mesh regression.
Some recent studies [26–28, 30–33, 36, 39, 64, 71] reconstruct each person's 3D mesh individually for multi-person 3D mesh reconstruction, relying on the same bounding-box detectors [4, 18, 55]. Multiple persons can create severe person-to-person or person-to-environment occlusions, erroneous monocular depth and diverse body appearances, which leads to ambiguity in crowded scenes; yet these methods have not established proper modules for handling interacting persons. A few recent methods [23, 73] applied direct regression for multiple persons, which does not require individual person detection. Sun et al. [61] used body center heatmaps as the target representation to index a mesh parameter map. However, without human detection, the pose estimation result is frequently affected by irrelevant pixels and often fails to capture scale variations, which results in inferior performance.
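The body-center-heatmap idea can be illustrated with a minimal sketch: per-person parameters are read from a dense parameter map at the local maxima of a center heatmap. Function names, shapes and the peak rule below are our illustrative assumptions, not ROMP's actual implementation:

```python
import numpy as np

def sample_mesh_params(center_heatmap, param_map, threshold=0.5):
    """Toy center-heatmap readout: collect the parameter vector at every
    confident local maximum of the body-center heatmap.
    center_heatmap: (H, W); param_map: (C, H, W), e.g. pose+shape+camera."""
    H, W = center_heatmap.shape
    people = []
    for y in range(1, H - 1):
        for x in range(1, W - 1):
            v = center_heatmap[y, x]
            # keep confident 3x3 local maxima as person centers
            if v > threshold and v == center_heatmap[y-1:y+2, x-1:x+2].max():
                people.append(param_map[:, y, x])
    return people
```

A real system would add non-maximum suppression and learned thresholds; this sketch only shows why such a representation needs no per-person detector.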
In parallel, there have been efforts to reduce the ambiguity of estimating 3D meshes from an RGB image. However, in terms of pose recovery, 3D body mesh recovery methods [27,30,31,33] still fall behind 3D skeleton or heatmap estimation methods [8,9,22,60]. One drawback of 3D skeleton estimation methods is that they cannot reconstruct the full 3D body mesh. Recently, Li et al. [36] proposed an inverse kinematics method for single-person mesh reconstruction that recovers 3D meshes from 3D skeletons. This approach is promising since it delivers the accurate poses obtained from a 3D skeleton estimator to the 3D mesh reconstruction pipeline.
To tackle the multi-person 3D body mesh reconstruction task, we propose a coarse-to-fine pipeline that first estimates 3D skeletons, then reconstructs 3D meshes from the skeletons via inverse kinematics, and finally refines the initial 3D mesh parameters via relation-aware refinement. Inspired by [59], our 3D skeleton estimator involves metric-scale heatmaps and is trained with both relative and absolute 3D joint positions to be robust to occlusions. By extending the IK process [36] to the multi-person scenario, we obtain initial 3D meshes for multiple persons from their 3D skeletons; however, the accuracy is limited, especially for interacting persons. To compensate for this limitation, we propose a relation-aware Transformer that refines the initial mesh parameters considering intra- and inter-person 3D mesh relationships. Fig. 1 shows example outputs of the intermediate steps. To summarize, our contributions are as follows:
- We propose a coarse-to-fine multi-person 3D body mesh reconstruction pipeline that first estimates 3D skeletons and then lifts them to 3D meshes via inverse kinematics. To make our pipeline robust to interacting persons, we borrow occlusion-robust techniques for 3D skeleton estimation.
- To further boost the performance, we propose a Transformer-based architecture for relation-aware mesh refinement, which refines the initial mesh parameters considering intra- and inter-person relationships.
- Extensive comparisons are conducted on three challenging multi-person 3D body pose benchmarks (i.e. 3DPW, MuPoTS and AGORA), where we demonstrate state-of-the-art performance. Via ablation studies, we show that each component contributes meaningfully.
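The three stages above can be sketched as a small driver with stub models. All function names, shapes and parameter layouts are our assumptions for illustration, not the actual interfaces of the method:

```python
import numpy as np

# Stub stages standing in for the learned components described above.
def estimate_skeletons(image):
    """Occlusion-robust 3D skeleton estimator: one (J, 3) array per person."""
    return [np.zeros((24, 3)), np.zeros((24, 3))]  # dummy output: two persons

def inverse_kinematics(skeleton):
    """Convert a 3D skeleton into initial mesh parameters (SMPL-style
    72-D pose and 10-D shape vectors, as an assumed layout)."""
    return {"pose": np.zeros(72), "shape": np.zeros(10)}

def refine(all_params):
    """Relation-aware refinement over all persons' parameters; the real
    module is a Transformer attending within and across persons."""
    return all_params  # identity stub

def reconstruct(image):
    skeletons = estimate_skeletons(image)               # coarse stage
    initial = [inverse_kinematics(s) for s in skeletons]
    return refine(initial)                              # fine stage

meshes = reconstruct(np.zeros((480, 640, 3)))
```

The point of the structure is that the refinement stage sees every person's parameters at once, so inter-person relations can correct IK errors caused by occlusion.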
2 Related Works
Single-person 3D mesh regression. There is a long history of methods for predicting 3D human body meshes from monocular RGB images or video frames [16]. Recently, this field has advanced quickly thanks to SMPL [42], which provides a low-dimensional parameterization of the 3D human body mesh. Here we focus on 3D body mesh regression that adopts a parametric model such as SMPL, from a monocular RGB image. Bogo et al. [3] presented an optimization-based method called SMPLify that iteratively fits SMPL to detected 2D body joints. However, this optimization-based approach is comparatively time-consuming and suffers from high inference time per input frame.
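The fitting idea can be illustrated with a toy example: minimize the 2D reprojection error of model joints against detected keypoints. Here we fit only a scale and translation with plain gradient descent; the real SMPLify optimizes full SMPL pose and shape with pose and shape priors, so everything below is a simplified sketch:

```python
import numpy as np

def fit_by_reprojection(template_joints_2d, detected_joints_2d, steps=200, lr=0.1):
    """Gradient-descent fit of scale s and translation t so that
    s * template + t matches the detected 2D joints in least squares.
    Mimics SMPLify's iterative loop structure, not its energy terms."""
    s, t = 1.0, np.zeros(2)
    for _ in range(steps):
        pred = s * template_joints_2d + t
        residual = pred - detected_joints_2d                 # (J, 2)
        grad_s = 2.0 * np.sum(residual * template_joints_2d) / len(residual)
        grad_t = 2.0 * residual.mean(axis=0)
        s -= lr * grad_s
        t -= lr * grad_t
    return s, t
```

Because each input frame requires running such a loop to convergence, per-frame inference is slow, which is exactly the drawback noted above.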
Some recent studies [34, 50, 53] use deep neural networks for SMPL parameter regression from images in a two-stage manner, which has proven effective and can generate more accurate mesh reconstruction outputs in the presence of large-scale 3D datasets. They first estimate intermediate representations such as silhouettes and 2D keypoints from input images and then map them to the SMPL parameters. Impressive results have been achieved for in-the-wild images by applying diverse weak supervision signals such as semantic segmentation [69], texture consistency [52], efficient temporal features [30,63,65], 2D pose [11,27,35], motion dynamics [28], etc.
More recently, Li et al. [36] proposed a 3D human body pose and shape estimation method that lets 3D keypoints and body meshes collaborate. The authors introduced an inverse kinematics process that finds the relative bone rotations reaching the estimated body joint locations, using a twist-and-swing decomposition.
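The underlying decomposition splits a rotation into a twist about a given axis (e.g. the bone direction) and a swing that tilts that axis. The quaternion-based routine below is a standard swing-twist decomposition shown for illustration; variable names are ours and this is not the authors' code:

```python
import numpy as np

def quat_mul(p, q):
    """Hamilton product of quaternions in (w, x, y, z) order."""
    w1, x1, y1, z1 = p
    w2, x2, y2, z2 = q
    return np.array([
        w1*w2 - x1*x2 - y1*y2 - z1*z2,
        w1*x2 + x1*w2 + y1*z2 - z1*y2,
        w1*y2 - x1*z2 + y1*w2 + z1*x2,
        w1*z2 + x1*y2 - y1*x2 + z1*w2,
    ])

def swing_twist(q, axis):
    """Decompose unit quaternion q = swing * twist, where the twist
    rotates about the unit vector `axis` and the swing tilts the axis."""
    v = np.asarray(q[1:], dtype=float)
    proj = np.dot(v, axis) * np.asarray(axis, dtype=float)
    twist = np.array([q[0], *proj])
    n = np.linalg.norm(twist)
    if n < 1e-9:                          # 180-degree swing: twist is identity
        return np.asarray(q, dtype=float), np.array([1.0, 0.0, 0.0, 0.0])
    twist /= n
    twist_conj = twist * np.array([1.0, -1.0, -1.0, -1.0])
    swing = quat_mul(q, twist_conj)       # swing = q * twist^{-1}
    return swing, twist
```

In IK, the swing part is fully determined by the target joint direction, so only the one-dimensional twist remains to be estimated, which is what makes the decomposition attractive.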
Multi-person 3D skeleton regression. There have been a variety of methods [13, 46, 48, 56] that tackle multi-person 3D body pose estimation. Zanfir et al. [56] proposed LCR-Net, which consists of localization, classification and regression modules: the localization module detects multiple persons in a single image, the classification module classifies each detected person into several anchor poses, and the regression module refines the anchor poses. Mehta et al. [46] proposed a single-shot method for multi-person 3D pose estimation from a single image. In addition, they introduced the MuCo-3DHP dataset, which contains images of multi-person interactions and occlusions. Moon et al. [48] proposed a top-down method for 3D multi-person pose estimation from a monocular RGB image, consisting of human detection, absolute 3D human root localization and root-relative 3D single-person pose estimation modules. Dong et al. [13] used multi-view images for estimating multi-person 3D poses. They proposed a coarse-to-fine method that lifts 2D joints to 3D joints: 2D joint candidates are obtained from [4], and the initial 3D joints are triangulated from the 2D joint candidates across camera views of the same scene. In addition, the initial 3D joints are refined using prior information from the SMPL [42] model.
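The multi-view lifting step relies on standard linear (DLT) triangulation: each calibrated view contributes two linear constraints on the homogeneous 3D point, and the solution is the null vector of the stacked system. A minimal version, shown only to illustrate the step (not the cited paper's implementation):

```python
import numpy as np

def triangulate(P_list, points_2d):
    """DLT triangulation of one 3D joint.
    P_list: 3x4 camera projection matrices; points_2d: matching (x, y)
    detections, one per view. Returns the Euclidean 3D point."""
    A = []
    for P, (x, y) in zip(P_list, points_2d):
        A.append(x * P[2] - P[0])   # each view adds two linear equations
        A.append(y * P[2] - P[1])
    _, _, Vt = np.linalg.svd(np.asarray(A))
    X = Vt[-1]                      # null vector = homogeneous solution
    return X[:3] / X[3]             # dehomogenize
```

With noisy detections the SVD gives the least-squares solution, which is why such triangulated joints are only "initial" estimates that later get refined with a body prior.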
Recent multi-person 3D pose regression works [7, 54, 59, 72] tackled a variety of issues, such as developing an attention-based mechanism dedicated to the 3D pose estimation problem that considers the 3D-to-2D projection process [72], combining top-down and bottom-up networks [7], and developing tracking-based methods for multi-person estimation [54]. Sárándi et al. [59] recently proposed a metric-scale 3D pose estimation method that is robust to truncations and reasons well about out-of-image joints. This method is also robust to occlusion and bounding-box noise.
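The metric-scale idea can be illustrated with a soft-argmax readout over a 3D heatmap whose axes are defined in meters rather than pixels, so joint coordinates are absolute distances independent of crop scale. Grid size, extent and decoding details below are our assumptions, not the cited method's actual configuration:

```python
import numpy as np

def soft_argmax_3d(heatmap, extent=2.0):
    """Expected 3D coordinate (in meters) under the softmax of a
    (D, H, W) heatmap whose axes span [-extent/2, extent/2] meters.
    Returns (x, y, z); differentiable in a deep-learning setting."""
    D, H, W = heatmap.shape
    p = np.exp(heatmap - heatmap.max())   # numerically stable softmax
    p /= p.sum()
    axes = [np.linspace(-extent / 2, extent / 2, n) for n in (D, H, W)]
    z, y, x = np.meshgrid(*axes, indexing="ij")
    return np.array([(p * x).sum(), (p * y).sum(), (p * z).sum()])
```

Because the expectation can fall anywhere in the metric volume, even outside the image crop, such a readout can place truncated or out-of-image joints, which matches the robustness properties described above.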
Multi-person 3D mesh regression. There have been a few works [12,23,61,62,70,73] concerning multi-person 3D body mesh regression. The approaches can be categorized into two groups: bottom-up and top-down methods.
Bottom-up methods [23, 61, 62, 73] perform multi-person detection and 3D mesh reconstruction simultaneously. Zhang et al. [73] proposed Body Mesh as Points (BMP), which uses a multi-scale, grid-level 2D center map representation that locates each person at the center of a grid cell. Sun et al. [61] proposed ROMP, which creates parameter maps (i.e. a body center heatmap, a camera map and a mesh parameter map) for 2D human body detection, body positioning and 3D body mesh parameter regression, respectively. Jiang et al. [23] presented the Coherent Reconstruction of Multiple Humans (CRMH) model, which utilizes Faster R-CNN based RoI-aligned features of all persons to estimate SMPL parameters. They further modeled the positional relations between multiple persons through a depth ordering-aware loss and an interpenetration loss. Sun et al. [62] further introduced a Bird's-Eye-View (BEV) representation for reasoning about the relative depths of multiple persons.