Multi-Person 3D Pose and Shape Estimation
via Inverse Kinematics and Refinement
Junuk Cha1[0000-0003-2321-2797], Muhammad Saqlain1,2[0000-0001-5877-6432],
GeonU Kim1, Mingyu Shin1,3, and Seungryul Baek1[0000-0002-0856-6880]
1 UNIST, South Korea 2 eSmart Systems, Norway 3 Yeongnam Univ.,
South Korea
Abstract. Estimating 3D poses and shapes in the form of meshes from monocular RGB images is challenging: it is clearly more difficult than estimating 3D poses alone in the form of skeletons or heatmaps. When interacting persons are involved, 3D mesh reconstruction becomes even more challenging due to the ambiguity introduced by person-to-person occlusions. To tackle these challenges, we propose a coarse-to-fine pipeline that benefits from 1) inverse kinematics applied to occlusion-robust 3D skeleton estimates and 2) Transformer-based relation-aware refinement. In our pipeline, we first obtain occlusion-robust 3D skeletons for multiple persons from an RGB image. Then, we apply inverse kinematics to convert the estimated skeletons into deformable 3D mesh parameters. Finally, we apply a Transformer-based mesh refinement that adjusts the obtained mesh parameters considering intra- and inter-person relations of the 3D meshes. Via extensive experiments, we demonstrate the effectiveness of our method, outperforming state-of-the-art methods on the 3DPW, MuPoTS and AGORA datasets.
Keywords: Multi-person, 3D mesh reconstruction, Transformer
1 Introduction
Recovering 3D human body meshes for a single person or multiple persons from a monocular RGB image has made great progress in recent years [3, 10, 12, 17, 23, 27, 28, 30–33, 38, 39, 61, 64, 69, 71, 73]. The technique is essential for understanding people's behaviors, intentions and person-to-person interactions. It has a wide range of real-world applications such as human motion imitation [41], virtual try-on [47], motion capture [45], action recognition [5, 57, 66], etc.
Recently, deep convolutional neural network-based mesh reconstruction methods [6,10,12,17,23,27,28,30–33,38,39,61,64,69,71,73] have shown practical performance on in-the-wild scenes [21,25,44,68]. Most existing 3D human body pose and shape estimation approaches [6, 10, 17, 27, 28, 30–33, 38, 39, 69]
This research was conducted when Dr. Saqlain was a post-doctoral researcher at UNIST, and when Mr. Kim and Mr. Shin were undergraduate interns at UNIST.
arXiv:2210.13529v2 [cs.CV] 30 Oct 2022
2 Cha et al.
Fig. 1: Example outputs from our pipeline: (a) input RGB image, (b) initial
skeleton estimation results obtained from the input image, (c) initial meshes
obtained from the inverse kinematics process, (d) refined meshes obtained from
the refinement Transformer, (e, f) top- and side-views for the refined meshes.
achieved promising results for single-person cases. Generally, they first crop the region containing a person from the input image using a bounding box, and then extract features for each detected person, which are further used for 3D human mesh regression.
Some recent studies [26–28, 30–33, 36, 39, 64, 71] reconstruct each person's 3D mesh individually for multi-person 3D mesh reconstruction, relying on the same bounding-box detectors [4, 18, 55]. Multiple persons can create severe person-to-person or person-to-environment occlusions, erroneous monocular depth and diverse body appearances, which leads to ambiguity in crowded scenes; yet these methods have not established proper modules for handling interacting persons. A few recent methods [23, 73] applied direct regression for multiple persons, which does not require individual person detection. Sun et al. [61] used body center heatmaps as the target representation to index a mesh parameter map. However, without human detection, the pose estimation result is frequently affected by irrelevant pixels and often fails to capture scale variations, which results in inferior performance.
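The body-center-heatmap idea can be illustrated with a minimal sketch: per-person parameters are read from a dense parameter map at the local maxima of a center heatmap. Function names, shapes and the peak rule below are our illustrative assumptions, not ROMP's actual implementation:

```python
import numpy as np

def sample_mesh_params(center_heatmap, param_map, threshold=0.5):
    """Toy center-heatmap readout: collect the parameter vector at every
    confident local maximum of the body-center heatmap.
    center_heatmap: (H, W); param_map: (C, H, W), e.g. pose+shape+camera."""
    H, W = center_heatmap.shape
    people = []
    for y in range(1, H - 1):
        for x in range(1, W - 1):
            v = center_heatmap[y, x]
            # keep confident 3x3 local maxima as person centers
            if v > threshold and v == center_heatmap[y-1:y+2, x-1:x+2].max():
                people.append(param_map[:, y, x])
    return people
```

A real system would add non-maximum suppression and learned thresholds; this sketch only shows why such a representation needs no per-person detector.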
In parallel, there have been efforts to reduce the ambiguity of estimating 3D meshes from an RGB image. However, in terms of pose recovery, 3D body mesh recovery methods [27,30,31,33] still fall behind 3D skeleton or heatmap estimation methods [8,9,22,60]. One drawback of 3D skeleton estimation methods is that they cannot reconstruct the full 3D body mesh. Recently, Li et al. [36] proposed an inverse kinematics method for single-person mesh reconstruction that recovers 3D meshes from 3D skeletons. This approach is promising since it delivers the accurate poses obtained from a 3D skeleton estimator to the 3D mesh reconstruction pipeline.
To tackle the multi-person 3D body mesh reconstruction task, we propose a coarse-to-fine pipeline that first estimates 3D skeletons, then reconstructs 3D meshes from the skeletons via inverse kinematics, and finally refines the initial 3D mesh parameters via relation-aware refinement. Inspired by [59], our 3D skeleton estimator involves metric-scale heatmaps and is trained with both relative and absolute 3D joint positions to be robust to occlusions. By extending the IK process [36] to the multi-person scenario, we obtain initial 3D meshes for multiple persons from their 3D skeletons; however, the accuracy is limited, especially for interacting persons. To compensate for this limitation, we propose a relation-aware Transformer that refines the initial mesh parameters considering intra- and inter-person 3D mesh relationships. Fig. 1 shows example outputs of the intermediate steps. To summarize, our contributions are as follows:
- We propose a coarse-to-fine multi-person 3D body mesh reconstruction pipeline that first estimates 3D skeletons and then lifts them to 3D meshes via inverse kinematics. To make our pipeline robust to interacting persons, we borrow occlusion-robust techniques for 3D skeleton estimation.
- To further boost the performance, we propose a Transformer-based architecture for relation-aware mesh refinement, which refines the initial mesh parameters considering intra- and inter-person relationships.
- Extensive comparisons are conducted on three challenging multi-person 3D body pose benchmarks (i.e. 3DPW, MuPoTS and AGORA), where we demonstrate state-of-the-art performance. Via ablation studies, we show that each component contributes meaningfully.
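The three stages above can be sketched as a small driver with stub models. All function names, shapes and parameter layouts are our assumptions for illustration, not the actual interfaces of the method:

```python
import numpy as np

# Stub stages standing in for the learned components described above.
def estimate_skeletons(image):
    """Occlusion-robust 3D skeleton estimator: one (J, 3) array per person."""
    return [np.zeros((24, 3)), np.zeros((24, 3))]  # dummy output: two persons

def inverse_kinematics(skeleton):
    """Convert a 3D skeleton into initial mesh parameters (SMPL-style
    72-D pose and 10-D shape vectors, as an assumed layout)."""
    return {"pose": np.zeros(72), "shape": np.zeros(10)}

def refine(all_params):
    """Relation-aware refinement over all persons' parameters; the real
    module is a Transformer attending within and across persons."""
    return all_params  # identity stub

def reconstruct(image):
    skeletons = estimate_skeletons(image)               # coarse stage
    initial = [inverse_kinematics(s) for s in skeletons]
    return refine(initial)                              # fine stage

meshes = reconstruct(np.zeros((480, 640, 3)))
```

The point of the structure is that the refinement stage sees every person's parameters at once, so inter-person relations can correct IK errors caused by occlusion.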
2 Related Works
Single-person 3D mesh regression. There is a long history of methods for predicting 3D human body meshes from monocular RGB images or video frames [16]. Recently, this field has advanced quickly thanks to SMPL [42], which provides a low-dimensional parameterization of the 3D human body mesh. Here we focus on 3D body mesh regression that adopts a parametric model such as SMPL, from a monocular RGB image. Bogo et al. [3] presented an optimization-based method called SMPLify that iteratively fits SMPL to detected 2D body joints. However, this optimization-based approach is comparatively time-consuming and suffers from high inference time per input frame.
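The fitting idea can be illustrated with a toy example: minimize the 2D reprojection error of model joints against detected keypoints. Here we fit only a scale and translation with plain gradient descent; the real SMPLify optimizes full SMPL pose and shape with pose and shape priors, so everything below is a simplified sketch:

```python
import numpy as np

def fit_by_reprojection(template_joints_2d, detected_joints_2d, steps=200, lr=0.1):
    """Gradient-descent fit of scale s and translation t so that
    s * template + t matches the detected 2D joints in least squares.
    Mimics SMPLify's iterative loop structure, not its energy terms."""
    s, t = 1.0, np.zeros(2)
    for _ in range(steps):
        pred = s * template_joints_2d + t
        residual = pred - detected_joints_2d                 # (J, 2)
        grad_s = 2.0 * np.sum(residual * template_joints_2d) / len(residual)
        grad_t = 2.0 * residual.mean(axis=0)
        s -= lr * grad_s
        t -= lr * grad_t
    return s, t
```

Because each input frame requires running such a loop to convergence, per-frame inference is slow, which is exactly the drawback noted above.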
Some recent studies [34, 50, 53] use deep neural networks for SMPL parameter regression from images in a two-stage manner, which has proven effective and can generate more accurate mesh reconstruction outputs in the presence of large-scale 3D datasets. They first estimate intermediate representations such as silhouettes and 2D keypoints from input images and then map them to the SMPL parameters. Impressive results have been achieved for in-the-wild images by applying diverse weak supervision signals such as semantic segmentation [69], texture consistency [52], efficient temporal features [30,63,65], 2D pose [11,27,35], motion dynamics [28], etc.
More recently, Li et al. [36] proposed a 3D human body pose and shape estimation method that lets 3D keypoints and body meshes collaborate. The authors introduced an inverse kinematics process that finds the relative bone rotations reaching the estimated body joint locations, using a twist-and-swing decomposition.
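The underlying decomposition splits a rotation into a twist about a given axis (e.g. the bone direction) and a swing that tilts that axis. The quaternion-based routine below is a standard swing-twist decomposition shown for illustration; variable names are ours and this is not the authors' code:

```python
import numpy as np

def quat_mul(p, q):
    """Hamilton product of quaternions in (w, x, y, z) order."""
    w1, x1, y1, z1 = p
    w2, x2, y2, z2 = q
    return np.array([
        w1*w2 - x1*x2 - y1*y2 - z1*z2,
        w1*x2 + x1*w2 + y1*z2 - z1*y2,
        w1*y2 - x1*z2 + y1*w2 + z1*x2,
        w1*z2 + x1*y2 - y1*x2 + z1*w2,
    ])

def swing_twist(q, axis):
    """Decompose unit quaternion q = swing * twist, where the twist
    rotates about the unit vector `axis` and the swing tilts the axis."""
    v = np.asarray(q[1:], dtype=float)
    proj = np.dot(v, axis) * np.asarray(axis, dtype=float)
    twist = np.array([q[0], *proj])
    n = np.linalg.norm(twist)
    if n < 1e-9:                          # 180-degree swing: twist is identity
        return np.asarray(q, dtype=float), np.array([1.0, 0.0, 0.0, 0.0])
    twist /= n
    twist_conj = twist * np.array([1.0, -1.0, -1.0, -1.0])
    swing = quat_mul(q, twist_conj)       # swing = q * twist^{-1}
    return swing, twist
```

In IK, the swing part is fully determined by the target joint direction, so only the one-dimensional twist remains to be estimated, which is what makes the decomposition attractive.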
Multi-person 3D skeleton regression. There have been a variety of methods [13, 46, 48, 56] that tackle multi-person 3D body pose estimation. Zanfir et al. [56] proposed LCR-Net, which consists of localization, classification and regression modules: the localization module detects multiple persons in a single image, the classification module classifies each detected person into several anchor poses, and the regression module refines the anchor poses. Mehta et al. [46] proposed a single-shot method for multi-person 3D pose estimation from a single image. In addition, they introduced the MuCo-3DHP dataset, which contains images of multi-person interactions and occlusions. Moon et al. [48] proposed a top-down method for 3D multi-person pose estimation from a monocular RGB image, consisting of human detection, absolute 3D human root localization and root-relative 3D single-person pose estimation modules. Dong et al. [13] used multi-view images for estimating multi-person 3D poses. They proposed a coarse-to-fine method that lifts 2D joints to 3D joints: 2D joint candidates are obtained from [4], and the initial 3D joints are triangulated from the 2D joint candidates across camera views of the same scene. In addition, the initial 3D joints are refined using prior information from the SMPL [42] model.
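The multi-view lifting step relies on standard linear (DLT) triangulation: each calibrated view contributes two linear constraints on the homogeneous 3D point, and the solution is the null vector of the stacked system. A minimal version, shown only to illustrate the step (not the cited paper's implementation):

```python
import numpy as np

def triangulate(P_list, points_2d):
    """DLT triangulation of one 3D joint.
    P_list: 3x4 camera projection matrices; points_2d: matching (x, y)
    detections, one per view. Returns the Euclidean 3D point."""
    A = []
    for P, (x, y) in zip(P_list, points_2d):
        A.append(x * P[2] - P[0])   # each view adds two linear equations
        A.append(y * P[2] - P[1])
    _, _, Vt = np.linalg.svd(np.asarray(A))
    X = Vt[-1]                      # null vector = homogeneous solution
    return X[:3] / X[3]             # dehomogenize
```

With noisy detections the SVD gives the least-squares solution, which is why such triangulated joints are only "initial" estimates that later get refined with a body prior.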
Recent multi-person 3D pose regression works [7, 54, 59, 72] tackled a variety of issues, such as developing an attention-based mechanism dedicated to the 3D pose estimation problem that considers the 3D-to-2D projection process [72], combining top-down and bottom-up networks [7], and developing tracking-based methods for multi-person estimation [54]. Sárándi et al. [59] recently proposed a metric-scale 3D pose estimation method that is robust to truncations and reasons well about out-of-image joints. This method is also robust to occlusion and bounding-box noise.
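The metric-scale idea can be illustrated with a soft-argmax readout over a 3D heatmap whose axes are defined in meters rather than pixels, so joint coordinates are absolute distances independent of crop scale. Grid size, extent and decoding details below are our assumptions, not the cited method's actual configuration:

```python
import numpy as np

def soft_argmax_3d(heatmap, extent=2.0):
    """Expected 3D coordinate (in meters) under the softmax of a
    (D, H, W) heatmap whose axes span [-extent/2, extent/2] meters.
    Returns (x, y, z); differentiable in a deep-learning setting."""
    D, H, W = heatmap.shape
    p = np.exp(heatmap - heatmap.max())   # numerically stable softmax
    p /= p.sum()
    axes = [np.linspace(-extent / 2, extent / 2, n) for n in (D, H, W)]
    z, y, x = np.meshgrid(*axes, indexing="ij")
    return np.array([(p * x).sum(), (p * y).sum(), (p * z).sum()])
```

Because the expectation can fall anywhere in the metric volume, even outside the image crop, such a readout can place truncated or out-of-image joints, which matches the robustness properties described above.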
Multi-person 3D mesh regression. There have been a few works [12,23,61,62,70,73] concerning multi-person 3D body mesh regression. The approaches can be categorized into two groups: bottom-up and top-down methods.
Bottom-up methods [23, 61, 62, 73] perform multi-person detection and 3D mesh reconstruction simultaneously. Zhang et al. [73] proposed Body Mesh as Points (BMP), which uses a multi-scale, grid-level 2D center map representation that locates each person at the center of a grid cell. Sun et al. [61] proposed ROMP, which creates parameter maps (i.e. a body center heatmap, a camera map and a mesh parameter map) for 2D human body detection, body positioning and 3D body mesh parameter regression, respectively. Jiang et al. [23] presented the Coherent Reconstruction of Multiple Humans (CRMH) model, which utilizes Faster R-CNN based RoI-aligned features of all persons to estimate SMPL parameters. They further modeled the positional relations between multiple persons through a depth ordering-aware loss and an interpenetration loss. Sun et al. [62] further introduced a Bird's-Eye-View (BEV) representation for reasoning about the relative depths of multiple persons.