
Figure 1: Comparison between the proposed MMT model and existing ones. After extracting image-level features from different views, (a) existing models conduct output-level fusion, which fails to capture feature interactions across multiple views; (b) in contrast, MMT conducts feature-level fusion, sufficiently leveraging multi-view priors to derive more accurate estimations.
views directly, resulting in insufficient usage of multi-view priors and uncompetitive performance compared to single-view counterparts [49; 23; 24]. Besides, thanks to the growing maturity of motion capture systems, large-scale multi-view datasets for reconstructing volumetric human body representations [11; 19; 46; 33] have become available. Thus, there is an urgent need to develop an effective solution for recovering high-quality 3D human body meshes from multiple camera views.
Motivated by this, we propose learning to fuse view-wise features to produce accurate human body meshes adapted to different viewpoints. Specifically, we present a novel non-parametric model, the Multi-view human body Mesh Translator (MMT), for this purpose. MMT translates consecutive multi-view images into the corresponding target meshes, mimicking the language translation process from a source to a target. Different from existing methods with output-level fusion, MMT conducts feature-level fusion, which fuses multi-view features into contextualized embeddings for decoding the mesh vertices of target subjects, as shown in Fig. 1(b). This encoding-fusing-decoding scheme can sufficiently leverage multi-view priors, overcoming the drawback of the existing encoding-decoding-fusing scheme, which neglects considerable feature-level interaction. In addition, MMT estimates the human mesh in a global manner from multi-view features instead of depending on intermediate, view-local representations, leading to coherent results across all views. Moreover, MMT introduces cross-view alignment by fusing features from multi-view positions that map to the same 3D keypoints. This feature-level geometry constraint further improves mesh consistency.
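The feature-level fusion idea can be sketched in a few lines of PyTorch. This is a minimal illustration, not the authors' implementation: the dimensions (V views, N tokens per view, channel size 256) and the plain `nn.TransformerEncoder` are hypothetical stand-ins for the paper's Multi-view Fusion Transformer.

```python
import torch
import torch.nn as nn

class FeatureLevelFusion(nn.Module):
    """Fuse per-view feature tokens with a shared transformer encoder.

    Tokens from all views attend to one another, so cross-view
    interaction happens before any mesh is decoded (cf. Fig. 1(b)),
    instead of fusing per-view outputs afterwards.
    """

    def __init__(self, dim=256, heads=4, layers=2):
        super().__init__()
        enc_layer = nn.TransformerEncoderLayer(
            d_model=dim, nhead=heads, batch_first=True)
        self.encoder = nn.TransformerEncoder(enc_layer, num_layers=layers)

    def forward(self, view_feats):
        # view_feats: (B, V, N, d) -- B batches, V views, N tokens per view
        b, v, n, d = view_feats.shape
        tokens = view_feats.reshape(b, v * n, d)  # one joint token sequence
        fused = self.encoder(tokens)              # cross-view self-attention
        return fused.reshape(b, v, n, d)          # contextualized embeddings

feats = torch.randn(2, 4, 16, 256)                # e.g. 4 views, 16 tokens each
fused = FeatureLevelFusion()(feats)
print(fused.shape)                                # torch.Size([2, 4, 16, 256])
```

The key design choice is that the sequence is the concatenation of all views' tokens, so attention weights are free to mix information across views at every layer.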
In particular, we implement the proposed MMT model with a Vision Transformer (ViT). MMT takes multi-view images as input and outputs the corresponding body mesh for each view. Following convention, MMT first utilizes a CNN backbone to extract high-level features from the original images. Then, MMT introduces a Multi-view Fusion Transformer to perform feature fusion, which organizes the encoded features as a token sequence and produces context-aware feature embeddings based on the underlying interactions among different views. To align features from different views, MMT conducts 3D human pose estimation as an auxiliary task and projects the predicted keypoints into all views to match the corresponding ground truth. This task guarantees that the fused tokens are embedded with cross-view consistent semantic clues for human body mesh reconstruction before producing the final results. Finally, given the contextualized embeddings, MMT uses another multi-layer transformer encoder with progressive dimensionality reduction [23] as a decoder to reconstruct the 3D human pose and shape progressively.
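The auxiliary alignment task above can be sketched as a reprojection loss. This is an assumed formulation, not the paper's exact one: it presumes pinhole cameras with known 3x4 projection matrices per view, and the keypoint count, homogeneous-coordinate handling, and L1 penalty are illustrative choices.

```python
import torch

def cross_view_alignment_loss(kp3d, proj_mats, kp2d_gt):
    """Project predicted 3D keypoints into every view and penalize
    their distance to the per-view 2D ground truth.

    kp3d:      (B, J, 3)    predicted 3D keypoints (world frame)
    proj_mats: (V, 3, 4)    per-view camera projection matrices
    kp2d_gt:   (B, V, J, 2) ground-truth 2D keypoints in each view
    """
    b, j, _ = kp3d.shape
    ones = torch.ones(b, j, 1, dtype=kp3d.dtype, device=kp3d.device)
    homo = torch.cat([kp3d, ones], dim=-1)            # (B, J, 4) homogeneous
    # Project into all V views at once: (V,3,4) x (B,J,4) -> (B,V,J,3)
    proj = torch.einsum('vij,bkj->bvki', proj_mats, homo)
    # Perspective divide; clamp guards against division by (near-)zero depth
    kp2d = proj[..., :2] / proj[..., 2:].clamp(min=1e-8)
    return (kp2d - kp2d_gt).abs().mean()              # mean L1 reprojection error
```

With an identity-like projection `[I | 0]` and keypoints at depth z = 1, the projected 2D points equal the 3D x, y coordinates, so the loss vanishes when predictions match the ground truth; minimizing it pushes the fused tokens toward cross-view consistent geometry.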
Extensive experiments on the Human3.6M [11] and HUMBI [46] benchmarks show the effectiveness of MMT in recovering accurate human body meshes in a multi-view manner. Quantitatively, MMT outperforms the current state-of-the-art [24] by 28.8% in MPVE on HUMBI. In addition, thorough ablation studies are conducted on each component of MMT and on alternative model designs to reach general conclusions on model performance.
Our contributions are summarized as follows: 1) We propose a novel multi-view model for tackling the human body mesh recovery task. Our model conducts feature-level fusion instead of output-level fusion to sufficiently leverage multi-view priors, leading to notably improved performance. 2) We design a novel cross-view alignment module that fuses semantic information relevant to human pose and shape from different views under geometry constraints, helping to produce view-wise consistent results.