Multi-view Human Body Mesh Translator
Xiangjian Jiang
School of Computer Science and Engineering
Beihang University
China
Xuecheng Nie
MT Lab
Meitu Inc.
China
Zitian Wang
Institute of Artificial Intelligence
Beihang University
China
Luoqi Liu
MT Lab
Meitu Inc.
China
Si Liu
Institute of Artificial Intelligence
Beihang University
China
Abstract
Existing methods for human mesh recovery mainly focus on single-view frame-
works, but they often fail to produce accurate results due to the ill-posed setup.
Considering the maturity of the multi-view motion capture system, in this paper,
we propose to solve the prior ill-posed problem by leveraging multiple images from
different views, thus significantly enhancing the quality of recovered meshes. In
particular, we present a novel
M
ulti-view human body
M
esh
T
ranslator (MMT)
model for estimating human body mesh with the help of vision transformer. Specif-
ically, MMT takes multi-view images as input and translates them to targeted
meshes in a single-forward manner. MMT fuses features of different views in
both encoding and decoding phases, leading to representations embedded with
global information. Additionally, to ensure the tokens are intensively focused on
the human pose and shape, MMT conducts cross-view alignment at the feature
level by projecting 3D keypoint positions to each view and enforcing their con-
sistency in geometry constraints. Comprehensive experiments demonstrate that
MMT outperforms existing single or multi-view models by a large margin for
human mesh recovery task, notably, 28.8% improvement in MPVE over the current
state-of-the-art method on the challenging HUMBI dataset. Qualitative evaluation
also verifies the effectiveness of MMT in reconstructing high-quality human mesh.
Codes will be made available upon acceptance.
1 Introduction
Human Mesh Recovery (HMR) from RGB images [13; 18; 6] is a fundamental task in Computer
Vision, aiming to estimate 3D vertices and their topology to model the body shape. It has a wide
range of applications in sports motion analysis [48; 54; 35], security surveillance [37; 52] and also
plays a crucial role in building the Metaverse [51; 8].
In literature, existing works mainly follow the path of recovering 3D human mesh from a single monocular RGB image [16; 38; 47; 13; 27] with neural networks, e.g., Convolutional Neural Networks (CNNs) and Vision Transformers (ViT). However, they suffer from a severe ill-posed problem due to the ambiguity of lifting human bodies from 2D images into 3D space. Some pioneering attempts [40; 22; 20; 36; 45] have tried to bridge the gap by post-processing estimations from multiple views
for refining human meshes, as shown in Fig. 1(a), but they fail to exploit features of different
Equal contribution
Corresponding author
Preprint. Under review.
arXiv:2210.01886v1 [cs.CV] 4 Oct 2022
[Figure 1 diagram: (a) Existing methods: output-level fusion strategy; (b) Ours: feature-level fusion strategy.]
Figure 1: Comparison between the proposed MMT model and existing ones. After extracting image-
level features from different views, (a) Existing models conduct output-level fusion, which fails to
capture feature interactions of multiple views; (b) Differently, MMT conducts feature-level fusion,
resulting in a way to sufficiently leverage multi-view priors for deriving more accurate estimations.
views directly, resulting in insufficient usage of multi-view priors and performance uncompetitive with single-view counterparts [49; 23; 24]. Besides, thanks to the growing maturity of motion capture systems, large-scale multi-view datasets for reconstructing volumetric human body representations [11; 19; 46; 33] have become available. Thus, there is an urgent need to develop an
effective solution for recovering high-quality 3D human body mesh from multiple camera views.
Motivated by this, we propose learning to fuse view-wise features to produce accurate human body
meshes adapted to different viewpoints. Specifically, we present a novel non-parametric model,
Multi-view human body Mesh Translator (MMT), for this purpose. MMT translates consecutive
multi-view images to corresponding targeted meshes, mimicking the language translation process
from origin to target. Different from existing methods with output-level fusion, MMT conducts
feature-level fusion, which fuses multi-view features to contextualized embeddings for decoding
mesh vertices of targeted subjects, as shown in Fig. 1(b). This encoding-fusing-decoding scheme can
sufficiently leverage multi-view priors, overcoming drawbacks of existing encoding-decoding-fusing
that neglects considerable feature-level interaction. In addition, MMT performs estimations on human
mesh in a global manner according to multi-view features instead of depending on intermediate and
local representations, leading to coherent results for all views. Moreover, MMT introduces cross-view
alignment via the fusion of features from multi-view positions mapped to the same 3D keypoints.
This feature-level geometry constraint further improves mesh consistency.
In particular, we implement the proposed MMT model with Vision Transformer (ViT). MMT takes
multi-view images as input and outputs the corresponding body meshes for each view. Following the
convention, MMT first utilizes a CNN backbone to extract high-level features from original images.
Then, MMT introduces Multi-view Fusion Transformer to perform feature fusion, which organizes
encoded features as a token sequence and produces context-aware feature embedding based on the
underlying interactions among different views. To align features from different views, MMT conducts
3D human pose estimation as an auxiliary task and projects predicted keypoints into all views to
match the corresponding ground truth. This task guarantees that the fusion tokens are embedded
with cross-view consistent semantic clues for human body mesh reconstruction before providing final
results. Finally, with given contextualized embedding, MMT uses another multi-layer transformer
encoder with progressive dimensionality reduction [23] as a decoder to reconstruct the 3D human
pose and shape progressively.
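The cross-view alignment step described above, which projects predicted 3D keypoints into each view and penalizes their deviation from per-view references, can be sketched as follows. This is a minimal NumPy illustration assuming simple pinhole cameras with intrinsics `K`, rotation `R`, and translation `t`; the paper's actual camera model and loss weighting are not reproduced here.

```python
import numpy as np

def project(K, R, t, X):
    """Project 3D keypoints X of shape (14, 3) into one view with a pinhole camera."""
    Xc = X @ R.T + t                  # world -> camera coordinates
    uvw = Xc @ K.T                    # camera -> homogeneous image coordinates
    return uvw[:, :2] / uvw[:, 2:3]   # (14, 2) pixel coordinates

def alignment_loss(X_pred, cameras, keypoints_2d):
    """Sum of per-view L1 errors between projected and reference 2D keypoints."""
    return sum(np.abs(project(K, R, t, X_pred) - kp).mean()
               for (K, R, t), kp in zip(cameras, keypoints_2d))
```

If the predicted 3D keypoints are consistent with every view's 2D references, this loss vanishes; that vanishing condition is the kind of geometric constraint the alignment module enforces at the feature level.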
Extensive experiments on the Human3.6M [11] and HUMBI [46] benchmarks show the effectiveness of MMT for recovering accurate human body meshes in a multi-view manner. Quantitatively, MMT outperforms the current state-of-the-art [24] by 28.8% in MPVE on HUMBI. In addition, solid ablation
studies are conducted for each component within MMT and alternative model designs to reach a
general conclusion on model performance.
Our contributions are summarized as follows: 1) We propose a novel multi-view model for tackling
the human body mesh recovery task. Our model conducts feature-level fusion to sufficiently leverage
multi-view priors instead of output-level, leading to notably improved performance. 2) We design
a novel cross-view alignment module to fuse the semantic information relevant to human pose and
shape from different views with geometry constraints, helping to produce view-wise consistent results.
3) Our model surpasses previous ones under both single-view [24] and multi-view [20] conditions and sets a new state-of-the-art.
2 Related Works
Human Mesh Recovery
Current methods for human mesh recovery can be classified by the number of input views. Single-view methods estimate the 3D human body shape from a single monocular image [16; 38; 47; 13; 27; 3; 23; 18; 24; 49], and they inevitably fail to make precise predictions given the severe ambiguity in lifting from the 2D plane to 3D space. Therefore, multi-view methods rely on several synchronized camera views to mitigate the ill-posed problem. Existing works
mainly focus on post-processing and combining the results from parallel single-view models [40; 20; 21; 36; 22], and even exploit the temporal dimension [10] to improve model performance
further. However, this pipeline ignores underlying interactions of image features from different views
and makes the model overdependent on the accuracy of single-view predictions. Meanwhile, the above methods are mostly based on parametric human models, such as SMPL [25], SMPL-X [31] and STAR [29]. Consequently, they usually find it hard to reconstruct human bodies with rare shapes or poses due to the limited number of samples used to build the parametric models.
Unlike existing multi-view solutions, our proposed method attends to images captured by synchronized cameras and fuses multi-view features. A non-parametric model then makes final inferences on the 3D human pose and shape, with a transformer processing the vertex-to-vertex relationships.
Vision Transformers
Inspired by the great success of transformers in Natural Language Processing [4; 1; 41], there have been pioneering works trying to extend this framework to 3D human pose
estimation and mesh recovery [50; 23]. To the best of our knowledge, existing works on the HMR task have not used transformers to handle multi-view inputs directly as [50] does in pose regression.
Therefore, we propose a novel method with vision transformer as the core component to process
multi-view features and generate fusion tokens closely related to human pose and shape.
3 Method
We first briefly introduce the problem and solution setup in a formalized manner. A multi-view non-parametric method for human mesh recovery can be defined as a quadruple $(\mathcal{I}, V, J, \mathcal{L})$, where $\mathcal{I}$ denotes the observation space, $(V, J)$ are the 3D coordinates of mesh vertices and joints respectively, and $\mathcal{L}$ is a loss function over predicted vertices and joints for evaluation and optimization. Specifically, $\mathcal{I} = \{I_n\}_{n=1}^{N}$ is a group of images of one person captured from $N$ diverse viewpoints at the same time, and we set $N = 4$ in this paper. For datasets with SMPL annotations, $V = \{V_n\}_{n=1}^{N} \in \mathbb{R}^{N \times 6890 \times 3}$ is a set of 6890 vertex coordinates per view. Following [12], $J = \{J_n\}_{n=1}^{N} \in \mathbb{R}^{N \times 14 \times 3}$ contains the 3D coordinates of 14 keypoints from different viewpoints that model the skeleton of the human body. In summary, the model takes RGB images from $N$ perspectives and estimates the 3D human pose $J$ and shape $V$ under the supervision of the ground truth and loss function $\mathcal{L}$. In the following, we explain the proposed MMT model in detail, as shown in Fig. 2.
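To make the formulation concrete, the tensor shapes above can be sketched as follows. This is a minimal NumPy illustration in which the image resolution and the generic L1 loss are assumptions standing in for the paper's actual $\mathcal{L}$, whose exact terms and weights are not specified in this section.

```python
import numpy as np

N, H, W = 4, 224, 224  # N = 4 synchronized viewpoints, as set in the paper;
                       # H and W are hypothetical image dimensions.

# Observation space I: one RGB image per view.
images = np.zeros((N, H, W, 3), dtype=np.float32)

# Per-view targets: 6890 SMPL mesh vertices V and 14 skeleton keypoints J.
V_gt = np.random.rand(N, 6890, 3).astype(np.float32)
J_gt = np.random.rand(N, 14, 3).astype(np.float32)

def loss(V_pred, V_gt, J_pred, J_gt, w_v=1.0, w_j=1.0):
    """Generic L1 loss over vertices and joints; the weights w_v, w_j are
    placeholders, not the paper's actual loss configuration."""
    return w_v * np.abs(V_pred - V_gt).mean() + w_j * np.abs(J_pred - J_gt).mean()

# A perfect prediction yields zero loss.
assert loss(V_gt, V_gt, J_gt, J_gt) == 0.0
```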
3.1 Model Architecture
3.1.1 Feature Extraction Network
As the first stage of the proposed method, we use a convolutional image encoder, such as ResNet [9] or HRNet [42], to obtain global feature vectors $F$ as illustrated below:

$$F = \mathrm{Concat}(\mathrm{Conv}(I_1), \mathrm{Conv}(I_2), \ldots, \mathrm{Conv}(I_N)) \in \mathbb{R}^{N \times 7 \times 7 \times 2048}. \tag{1}$$
Additionally, the image encoder is pre-trained on the ImageNet classification task [34], which is empirically beneficial for reconstructing human meshes [23]. For each image, we use the feature map from the last layer, with shape $7 \times 7 \times 2048$, and concatenate the per-view features for subsequent multi-view feature fusion.
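The shape bookkeeping of Eq. (1) can be sketched as follows, where `conv_backbone` is a hypothetical placeholder for the pre-trained ResNet/HRNet encoder that simply returns a zero feature map of the correct last-layer shape.

```python
import numpy as np

def conv_backbone(image):
    """Hypothetical stand-in for the pre-trained CNN encoder (ResNet/HRNet):
    maps one (H, W, 3) image to a last-layer 7x7x2048 feature grid."""
    return np.zeros((7, 7, 2048), dtype=np.float32)

N = 4
views = [np.zeros((224, 224, 3), dtype=np.float32) for _ in range(N)]

# Eq. (1): extract per-view features and concatenate them along a view axis,
# giving F of shape (N, 7, 7, 2048) for the multi-view fusion stage.
F = np.stack([conv_backbone(I_n) for I_n in views], axis=0)
assert F.shape == (N, 7, 7, 2048)
```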
3.1.2 Multi-view Fusion Transformer
To construct view-aware features for 3D human pose and shape estimation, it is essential to exploit the complementary information embedded among multiple viewpoints. Different from [2; 55], we