
operations on vertices and edges, such as graph convolutions or attention modules, are designed to
aggregate the geometric features of the meshes. With these operations, vertices and edges are visible to
the network; thus, direct connections between pixels in the 2D image feature space and vertices of
the 3D mesh can be established, and operations in the decoder can aggregate both image features and
geometric features, which cannot be realized in parametric methods. This connection and aggregation,
however, have not been fully explored by previous work.
In this paper, we seek to establish these connections and merge local hand features from the appearance
in the input and the geometry in the output. To this end, we utilize a pixel-aligned mapping module to
establish the connections and propose a simple and compact architecture to exploit them.
We design our network by making the following design choices: 1) For the 2D feature extractor,
we keep feature maps at different scales in the encoder, instead of using only the final global feature,
so that local 2D information can be mapped to 3D (a sketch of this mapping follows below). 2) We decode
the mesh in a coarse-to-fine manner to make the best use of the multi-scale information. 3) The operations
of the mesh decoder aggregate both image features and geometric features, rather than geometric features alone.
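A minimal sketch of the pixel-aligned mapping is given below. It assumes that the current vertex estimates have already been projected to normalized image coordinates in [-1, 1] by some camera model; the function name and tensor shapes are illustrative rather than the exact implementation.

```python
# Minimal sketch: each mesh vertex bilinearly samples every encoder scale.
import torch
import torch.nn.functional as F

def sample_vertex_features(feature_maps, verts_2d):
    """feature_maps: list of (B, C_s, H_s, W_s) encoder feature maps.
    verts_2d: (B, V, 2) projected vertex positions, normalized to [-1, 1].
    Returns: (B, V, sum_s C_s) concatenated multi-scale per-vertex features."""
    grid = verts_2d.unsqueeze(2)                                  # (B, V, 1, 2)
    per_scale = []
    for fmap in feature_maps:
        sampled = F.grid_sample(fmap, grid, align_corners=True)   # (B, C_s, V, 1)
        per_scale.append(sampled.squeeze(-1).transpose(1, 2))     # (B, V, C_s)
    return torch.cat(per_scale, dim=-1)
```

Because the sampling is differentiable, gradients from the mesh decoder flow back into the 2D encoder through these per-vertex features.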
Our design is shown in Figure 1. Multi-scale image features are naturally passed to the 3D mesh
decoder. Our experiments show that this design enables better alignment between the image and the
reconstructed mesh. The feature aggregation not only improves the graph convolutional network
substantially but also outperforms the attention mechanism with global features by a large margin.
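As an illustration of this aggregation, the sketch below shows a generic first-order graph convolution applied to the concatenation of per-vertex image features and geometric features; the class name, feature dimensions, and the row-normalized adjacency matrix adj are assumptions for the example, not the exact layer of our decoder.

```python
# Illustrative graph convolution fusing pixel-aligned image features with
# geometric features before aggregating over the fixed mesh neighborhood.
import torch
import torch.nn as nn

class ImageGeoGraphConv(nn.Module):
    def __init__(self, img_dim, geo_dim, out_dim):
        super().__init__()
        self.linear = nn.Linear(img_dim + geo_dim, out_dim)

    def forward(self, img_feat, geo_feat, adj):
        # img_feat: (B, V, img_dim) pixel-aligned image features
        # geo_feat: (B, V, geo_dim) geometric features, e.g. vertex coordinates
        # adj:      (V, V) row-normalized mesh adjacency with self-loops
        x = torch.cat([img_feat, geo_feat], dim=-1)  # fuse appearance and geometry
        x = self.linear(x)                           # per-vertex transform
        x = torch.matmul(adj, x)                     # neighborhood aggregation
        return torch.relu(x)
```

In a coarse-to-fine decoder, such a layer would be applied at each mesh resolution, with the image features re-sampled from the encoder scale that matches the current mesh resolution.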
To summarize, our key contributions are: 1) operations capable of aggregating both local 2D image
features and 3D geometric features on meshes at different scales; 2) connections between pixels of the
2D image appearance in the encoder and vertices of the 3D meshes in the decoder, established by a
pixel-vertex mapping module; and 3) a novel graph convolution architecture that achieves state-of-the-art
results on the FreiHAND benchmark.
2 RELATED WORK
Mesh Reconstruction.
Previous methods employ pre-trained parametric hand and human body models, namely
MANO (Romero et al., 2022) and SMPL (Loper et al., 2015), and estimate the pose and shape
coefficients of the parametric model. However, it is challenging to regress pose and shape coefficients
directly from input images. Researchers have therefore proposed training networks with human priors,
such as skeletons (Lassner et al., 2017) or segmentation maps. Some works regress SMPL parameters by
relying on human keypoints and contour maps of the body (Pavlakos et al., 2018; Tan et al., 2017),
while Omran et al. (2018) utilize the segmentation map of the human body as a supervision signal.
Kanazawa et al. (2018) regress SMPL parameters in a weakly supervised manner using 2D keypoint
reprojection and adversarial learning, and Tung et al. (2017) propose a self-supervised approach to
regressing human parametric models.
Recently, model-free methods (Choi et al., 2020; Moon & Lee, 2020; Kolotouros et al., 2019), which
directly regress human pose and shape from input images, have received increasing attention because
they can express the nonlinear relationship between the image and the predicted 3D space.
Researchers have explored various ways to represent the human body and hand, including 3D meshes (Lin
et al., 2021b; Kolotouros et al., 2019; Choi et al., 2020; Lin et al., 2021a; Litany et al., 2018; Ranjan
et al., 2018; Verma et al., 2018; Wang et al., 2018; Moon & Lee, 2020), voxel spaces (Varol et al.,
2018), and occupancy fields (Saito et al., 2019; Niemeyer et al., 2020; Xu et al., 2019; Saito et al.,
2020; Peng et al., 2020). Among them, voxel-based methods are completely non-parametric but require
substantial computing resources, and the output voxels must be fitted to a body model to obtain the
final 3D human mesh. Among recent approaches, Graph Convolutional Neural Networks
(GCNs) (Kolotouros et al., 2019; Choi et al., 2020; Lin et al., 2021a; Litany et al.,
2018; Ranjan et al., 2018; Verma et al., 2018; Wang et al., 2018; Moon & Lee, 2020) are among the
most popular, because convolution operations on mesh data are particularly convenient with GCNs.
However, while GCNs represent the local features of the mesh well, they cannot adequately capture the
global, long-range interactions between body vertices and joints. Transformer-based methods (Lin et al.,
2021b) use a self-attention mechanism to take full advantage of the interaction between vertices and
joints and exploit the global information of the human body to reconstruct more accurate vertex
positions. However, neither GCN-based nor attention-based methods consider pixel-level semantic
feature alignment