
Figure 1: Comparison between the proposed MMT model and existing ones. After extracting image-level features from different views, (a) existing models conduct output-level fusion, which fails to capture feature interactions across multiple views; (b) in contrast, MMT conducts feature-level fusion, sufficiently leveraging multi-view priors to derive more accurate estimations.
views directly, resulting in insufficient usage of multi-view priors and uncompetitive performance compared to single-view counterparts [49; 23; 24]. Besides, thanks to the growing maturity of motion capture systems, large-scale multi-view datasets for reconstructing volumetric human body representations [11; 19; 46; 33] have become available. Thus, there is an urgent need to develop an effective solution for recovering high-quality 3D human body meshes from multiple camera views.
Motivated by this, we propose learning to fuse view-wise features to produce accurate human body meshes adapted to different viewpoints. Specifically, we present a novel non-parametric model, the Multi-view human body Mesh Translator (MMT), for this purpose. MMT translates consecutive multi-view images into the corresponding target meshes, mimicking the language translation process from a source to a target. Different from existing methods with output-level fusion, MMT conducts feature-level fusion, which fuses multi-view features into contextualized embeddings for decoding the mesh vertices of target subjects, as shown in Fig. 1(b). This encoding-fusing-decoding scheme can sufficiently leverage multi-view priors, overcoming the drawback of the existing encoding-decoding-fusing scheme, which neglects considerable feature-level interaction. In addition, MMT estimates the human mesh in a global manner from multi-view features instead of depending on intermediate, view-local representations, leading to coherent results across all views. Moreover, MMT introduces cross-view alignment by fusing features from multi-view positions that map to the same 3D keypoints. This feature-level geometry constraint further improves mesh consistency.
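The feature-level fusion idea can be sketched in a few lines of PyTorch. This is a minimal illustration, not the authors' implementation: the dimensions (V views, N tokens per view, channel size 256) and the plain `nn.TransformerEncoder` are hypothetical stand-ins for the paper's Multi-view Fusion Transformer.

```python
import torch
import torch.nn as nn

class FeatureLevelFusion(nn.Module):
    """Fuse per-view feature tokens with a shared transformer encoder.

    Tokens from all views attend to one another, so cross-view
    interaction happens before any mesh is decoded (cf. Fig. 1(b)),
    instead of fusing per-view outputs afterwards.
    """

    def __init__(self, dim=256, heads=4, layers=2):
        super().__init__()
        enc_layer = nn.TransformerEncoderLayer(
            d_model=dim, nhead=heads, batch_first=True)
        self.encoder = nn.TransformerEncoder(enc_layer, num_layers=layers)

    def forward(self, view_feats):
        # view_feats: (B, V, N, d) -- B batches, V views, N tokens per view
        b, v, n, d = view_feats.shape
        tokens = view_feats.reshape(b, v * n, d)  # one joint token sequence
        fused = self.encoder(tokens)              # cross-view self-attention
        return fused.reshape(b, v, n, d)          # contextualized embeddings

feats = torch.randn(2, 4, 16, 256)                # e.g. 4 views, 16 tokens each
fused = FeatureLevelFusion()(feats)
print(fused.shape)                                # torch.Size([2, 4, 16, 256])
```

The key design choice is that the sequence is the concatenation of all views' tokens, so attention weights are free to mix information across views at every layer.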
In particular, we implement the proposed MMT model with a Vision Transformer (ViT). MMT takes multi-view images as input and outputs the corresponding body mesh for each view. Following convention, MMT first utilizes a CNN backbone to extract high-level features from the original images. Then, MMT introduces a Multi-view Fusion Transformer to perform feature fusion, which organizes the encoded features as a token sequence and produces context-aware feature embeddings based on the underlying interactions among different views. To align features from different views, MMT conducts 3D human pose estimation as an auxiliary task and projects the predicted keypoints into all views to match the corresponding ground truth. This task guarantees that the fused tokens are embedded with cross-view consistent semantic clues for human body mesh reconstruction before producing the final results. Finally, given the contextualized embeddings, MMT uses another multi-layer transformer encoder with progressive dimensionality reduction [23] as a decoder to reconstruct the 3D human pose and shape progressively.
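The auxiliary alignment task above can be sketched as a reprojection loss. This is an assumed formulation, not the paper's exact one: it presumes pinhole cameras with known 3x4 projection matrices per view, and the keypoint count, homogeneous-coordinate handling, and L1 penalty are illustrative choices.

```python
import torch

def cross_view_alignment_loss(kp3d, proj_mats, kp2d_gt):
    """Project predicted 3D keypoints into every view and penalize
    their distance to the per-view 2D ground truth.

    kp3d:      (B, J, 3)    predicted 3D keypoints (world frame)
    proj_mats: (V, 3, 4)    per-view camera projection matrices
    kp2d_gt:   (B, V, J, 2) ground-truth 2D keypoints in each view
    """
    b, j, _ = kp3d.shape
    ones = torch.ones(b, j, 1, dtype=kp3d.dtype, device=kp3d.device)
    homo = torch.cat([kp3d, ones], dim=-1)            # (B, J, 4) homogeneous
    # Project into all V views at once: (V,3,4) x (B,J,4) -> (B,V,J,3)
    proj = torch.einsum('vij,bkj->bvki', proj_mats, homo)
    # Perspective divide; clamp guards against division by (near-)zero depth
    kp2d = proj[..., :2] / proj[..., 2:].clamp(min=1e-8)
    return (kp2d - kp2d_gt).abs().mean()              # mean L1 reprojection error
```

With an identity-like projection `[I | 0]` and keypoints at depth z = 1, the projected 2D points equal the 3D x, y coordinates, so the loss vanishes when predictions match the ground truth; minimizing it pushes the fused tokens toward cross-view consistent geometry.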
Extensive experiments on the Human3.6M [11] and HUMBI [46] benchmarks show the effectiveness of MMT in recovering accurate human body meshes in a multi-view manner. Quantitatively, MMT outperforms the current state-of-the-art [24] by 28.8% in MPVE on HUMBI. In addition, thorough ablation studies are conducted on each component of MMT and on alternative model designs to reach general conclusions on model performance.
Our contributions are summarized as follows: 1) We propose a novel multi-view model for tackling the human body mesh recovery task. Our model conducts feature-level fusion instead of output-level fusion to sufficiently leverage multi-view priors, leading to notably improved performance. 2) We design a novel cross-view alignment module that fuses semantic information relevant to human pose and shape from different views under geometry constraints, helping to produce view-wise consistent results.