PIXEL-ALIGNED NON-PARAMETRIC HAND MESH RECONSTRUCTION
Shijian Jiang1, Guwen Han1, Danhang Tang2, Yang Zhou3, Xiang Li3, Jiming Chen1, Qi Ye1
ABSTRACT
Non-parametric mesh reconstruction has recently shown significant progress in
3D hand and body applications. In these methods, mesh vertices and edges are
visible to neural networks, making it possible to establish a direct mapping
between 2D image pixels and 3D mesh vertices. In this paper, we seek to establish
and exploit this mapping with a simple and compact architecture. The network is
designed with these considerations: 1) aggregating both local 2D image features
from the encoder and 3D geometric features captured in the mesh decoder; 2)
decoding coarse-to-fine meshes along the decoding layers to make the best use
of the hierarchical multi-scale information. Specifically, we propose an end-to-
end pipeline for hand mesh recovery tasks which consists of three phases: a
2D feature extractor constructing multi-scale feature maps, a feature mapping
module transforming local 2D image features to 3D vertex features via 3D-to-2D
projection, and a mesh decoder combining graph convolution and self-attention
to reconstruct the mesh. The decoder aggregates both local image features in pixels and
geometric features in vertices. It also regresses the mesh vertices in a coarse-to-fine
manner, leveraging multi-scale information. By exploiting these local
connections and the design of the mesh decoder, our approach achieves state-of-the-art
results for hand mesh reconstruction on the public FreiHAND dataset.
1 INTRODUCTION
Reconstructing 3D hand mesh from a single RGB image has attracted tremendous attention as it
has numerous applications in human-computer interactions (HCI), VR/AR, robotics, etc. Recent
studies have made great efforts toward accurate hand mesh reconstruction and achieved very promising
results (Lin et al., 2021b; Moon & Lee, 2020; Kulon et al., 2020; Ge et al., 2019; Hasson et al., 2019).
Recent state-of-the-art approaches address the problem mainly by deep learning. These learning-
based methods can be roughly divided into two categories according to the representation of the hand
meshes, i.e., the parametric approaches and the non-parametric ones. The parametric approaches use
a parametric model that projects hand meshes into a low-dimensional space (e.g., MANO (Romero et al.,
2022)) and regress the coefficients in that space (e.g., the shape and pose parameters of MANO) to
recover the 3D hand (Hasson et al., 2019). The non-parametric ones instead directly regress the mesh
vertices using graph convolution neural networks (Moon & Lee, 2020; Kulon et al., 2020; Chen et al.,
2021b) or transformers (Lin et al., 2021b).
Non-parametric approaches have shown substantial improvement over the parametric ones in recent
work, because the mapping between the image and the vertices is less non-linear than that between
the image and the coefficients of the hand models (Taheri et al., 2021). Their pipelines (Kulon et al.,
2020; Chen et al., 2021b; Lin et al., 2021b) usually consist of three stages: a 2D encoder extracts
a global image feature, which is mapped to 3D mesh vertices before being fed into a 3D mesh decoder
that operates on the vertices and edges to produce the final mesh.
Despite this success, the potential of non-parametric approaches has not been fully exploited with this
pipeline. In parametric methods, vertices and edges are not visible to the network, and no operation
is carried out on the manifold of the meshes; 2D image features are extracted only to learn a mapping
between the image content and the hand model parameters. Conversely, in non-parametric methods,
1Zhejiang University, 2Google, 3OPPO. Emails: {jsj630, qi.ye}@zju.edu.cn
operations on vertices and edges, such as graph convolutions or attention modules, are designed to
aggregate the geometric features of the meshes. With these operations, vertices and edges are visible to
the networks; thus direct connections between pixels of the 2D image feature space and vertices of
the 3D mesh can be established, and operations in the decoder can aggregate both image features and
geometric features, which cannot be realized in the parametric methods. This connection and this
aggregation, however, have not been fully explored by previous work.
In this paper, we seek to establish these connections and merge local hand features from appearance
in the input and geometry in the output. To this end, we utilize a pixel-aligned mapping module to
establish the connections and propose a simple and compact architecture to exploit them.
We design our network by making the following choices: 1) For the 2D feature extractor,
we keep feature maps of different scales in the encoder, instead of using only the final global feature,
to enable 2D local information to be mapped to 3D. 2) We decode the mesh in a coarse-to-fine manner to
make the best use of the multi-scale information. 3) Both image features and geometric features are
aggregated in the operations of the mesh decoder, rather than only geometric features (see the sketch below).
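As an illustration of choice 3), the following is a minimal PyTorch sketch of one decoder layer that fuses the two feature types before a simple adjacency-based graph convolution; the class name and the concat-then-project fusion are hypothetical stand-ins for our mesh-conv layer, not its exact definition:

```python
import torch
import torch.nn as nn

class AggregatingGraphConv(nn.Module):
    """One decoder layer that fuses pixel-aligned image features with
    geometric vertex features before a simple graph convolution.
    A minimal sketch; names and the exact fusion scheme are assumptions."""

    def __init__(self, img_dim, geo_dim, out_dim):
        super().__init__()
        self.fuse = nn.Linear(img_dim + geo_dim, out_dim)  # concat-then-project fusion
        self.neigh = nn.Linear(out_dim, out_dim)           # transform of the neighbor average

    def forward(self, img_feat, geo_feat, adj):
        # img_feat: (B, N, img_dim) pixel-aligned features per vertex
        # geo_feat: (B, N, geo_dim) geometric features from the previous layer
        # adj:      (N, N) row-normalized mesh adjacency matrix
        x = self.fuse(torch.cat([img_feat, geo_feat], dim=-1))
        x = x + self.neigh(torch.einsum('mn,bnd->bmd', adj, x))  # aggregate 1-ring neighbors
        return torch.relu(x)
```

Concatenation followed by a linear projection is only one plausible fusion; the point is that the layer sees both pixel-aligned and geometric information at every vertex.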
Our design is shown in Figure 1. Multi-scale image features are naturally passed to the 3D mesh
decoder. Our experiments show that this design enables better alignment between the image and the
reconstructed mesh. The aggregation of features not only improves the graph convolution network
substantially but also gains a large advantage over attention mechanisms that use global features.
To summarize, our key contributions are: 1) operations capable of aggregating both local 2D image
features and 3D geometric features on meshes at different scales; 2) connections between pixels of
the 2D image appearance in the encoder and vertices of the 3D meshes in the decoder, established by a
pixel-vertex mapping module; 3) a novel graph convolution architecture that achieves state-of-the-art
results on the FreiHAND benchmark.
2 RELATED WORK
Mesh Reconstruction.
Previous methods employ pre-trained parametric human hand and body models, namely
MANO (Romero et al., 2022) and SMPL (Loper et al., 2015), and estimate the pose and shape
coefficients of the parametric model. However, it is challenging to regress pose
and shape coefficients directly from input images. Researchers therefore propose to train networks
with human priors, such as skeletons (Lassner et al., 2017) or segmentation maps. Some
works regress SMPL parameters by relying on human keypoints and contour maps of the
body (Pavlakos et al., 2018; Tan et al., 2017), while Omran et al. (2018) utilize the segmentation
map of the human body as supervision. A weakly supervised approach (Kanazawa et al., 2018)
regresses SMPL parameters using 2D keypoint reprojection and adversarial learning.
Tung et al. (2017) proposed a self-supervised approach to regressing human parametric models.
Recently, model-free methods (Choi et al., 2020; Moon & Lee, 2020; Kolotouros et al., 2019) that
directly regress human pose and shape from input images have received increasing attention,
because they can better express the nonlinear relationship between the image and the predicted 3D space.
Researchers have explored various ways to represent the human body and hand, using 3D meshes (Lin
et al., 2021b; Kolotouros et al., 2019; Choi et al., 2020; Lin et al., 2021a; Litany et al., 2018; Ranjan
et al., 2018; Verma et al., 2018; Wang et al., 2018; Moon & Lee, 2020), voxel spaces (Varol et al.,
2018), or occupancy fields (Saito et al., 2019; Niemeyer et al., 2020; Xu et al., 2019; Saito et al.,
2020; Peng et al., 2020). Among them, voxel-based methods adopt a completely non-parametric
representation, which requires substantial computing resources, and the output voxels still need to be
fitted with a body model to obtain the final 3D mesh. Among recent methods, Graph Convolution
Neural Networks (GCNs) (Kolotouros et al., 2019; Choi et al., 2020; Lin et al., 2021a; Litany et al.,
2018; Ranjan et al., 2018; Verma et al., 2018; Wang et al., 2018; Moon & Lee, 2020) are among
the most popular, because GCNs are particularly convenient for convolution operations on
mesh data. However, while GCNs are good at representing the local features of a mesh, they cannot
represent well the global features of long-distance interactions between human vertices and joints.
Transformer-based methods (Lin et al., 2021b) use a self-attention mechanism to take full advantage
of the information interaction between vertices and joints, and use the global information of the human
body to reconstruct more accurate vertex positions. However, neither GCN-based nor attention-based
methods consider pixel-level semantic feature alignment. Local pixel-aligned semantic features can
complement the global information that GCN and transformer methods focus on.
Graph Neural Networks.
Graph deep learning generalizes neural networks to non-Euclidean
domains; we hope to apply graph convolution neural networks to learn shape-invariant features
on triangular meshes. For example, spectral graph convolution methods (Bruna
et al., 2013; Defferrard et al., 2016; Kipf & Welling, 2016; Levie et al., 2018) perform convolution
operations in the frequency domain, while local methods based on spatial graph convolutions (Masci
et al., 2015; Boscaini et al., 2016; Monti et al., 2017) make deep learning on manifolds more
convenient.
For mesh reconstruction applications, Ranjan et al. (2018) used fast local spectral filters to learn
nonlinear representations of human faces. Kulon et al. (2019) extended autoencoder networks to 3D
representations of hands. GraphCMR (Kolotouros et al., 2019) regresses 3D mesh vertices using a
GCN. Pose2Mesh (Choi et al., 2020) reconstructs a human mesh from a given human pose
representation with a cascaded GCN. Lim et al. (2018) proposed spiral convolution to handle
meshes in the spatial domain; based on SpiralConv, Kulon et al. (2020) introduced an automatic
method to generate training data from unannotated images for 3D hand reconstruction and pose
estimation. Chen et al. (2021b;a) propose a novel aggregation method to collect effective 2D cues
and exploit high-level semantic relations for root-relative mesh recovery. Graphormer (Lin et al.,
2021a) combines a Transformer and GCN to model the global interaction between joints and mesh
vertices.
Mesh-image alignment.
In the field of 2D image processing, most deep learning methods employ
a "fully convolutional" network framework that maintains spatial alignment between images and
outputs (Kirillov et al., 2020; Long et al., 2015; Tompson et al., 2014). Several methods also
consider alignment in the 3D domain. For example, PIFu (Saito et al., 2019) proposed
an implicit representation that locally aligns the pixels of a 2D image with the global context of their
corresponding 3D objects. PyMAF (Zhang et al., 2021) introduced a mesh alignment feedback loop,
where evidence of mesh alignment is used to correct parameters for better-aligned reconstruction
results. Such alignment can take advantage of more informative, position-sensitive features
to predict the mesh.
Existing mesh recovery works (Tang et al., 2021; Li et al., 2022) suffer from complex
network structures when performing mesh-image alignment. Furthermore, the initial input of their 3D decoders
is a high-resolution mesh, which makes network optimization difficult. This is critical for practical
applications. To address these issues, we propose a compact network framework that maps 2D image
pixel features to 3D mesh vertex locations. We apply a multi-scale structure to the 2D feature
encoder and the 3D mesh decoder respectively, to achieve coarse-to-fine pixel alignment at corresponding
resolutions. Using multi-scale pixel-aligned features achieves better mesh-image alignment than
previous methods.
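To make the mapping concrete, below is a minimal sketch (in PyTorch; the function name and the assumed pinhole camera model are ours, not necessarily the exact formulation used later in the paper) of sampling one scale of the encoder feature map at the 2D projections of the current vertex estimates:

```python
import torch
import torch.nn.functional as F

def sample_pixel_aligned(feat_map, verts, K):
    """Gather a per-vertex feature by projecting 3D vertices to the image
    plane and bilinearly sampling the 2D feature map.
    feat_map: (B, C, H, W) one scale of the 2D encoder output
    verts:    (B, N, 3) current 3D vertex estimates in camera coordinates
    K:        (B, 3, 3) camera intrinsics (an assumed pinhole model)
    returns:  (B, N, C) pixel-aligned vertex features
    """
    B, C, H, W = feat_map.shape
    uvw = torch.einsum('bij,bnj->bni', K, verts)        # perspective projection
    uv = uvw[..., :2] / uvw[..., 2:].clamp(min=1e-6)    # (B, N, 2) pixel coordinates
    # normalize to [-1, 1] for grid_sample (x along width, y along height)
    grid = torch.stack([uv[..., 0] / (W - 1) * 2 - 1,
                        uv[..., 1] / (H - 1) * 2 - 1], dim=-1)
    sampled = F.grid_sample(feat_map, grid.unsqueeze(2),  # (B, C, N, 1)
                            mode='bilinear', align_corners=True)
    return sampled.squeeze(-1).transpose(1, 2)            # (B, N, C)
```

Because the sampling locations depend on the current vertex estimates, gradients flow back both into the feature map and into the vertex positions, which is what ties the 2D encoder and 3D decoder together.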
3 METHODOLOGY
Given a monocular RGB image $I$, our goal is to predict the 3D positions of all the $N$ vertices $V = \{v_i\}_{i=1}^{N}$ of the predefined hand mesh $M$. The overall architecture of our network, as shown in Figure 1, has two major components: a 2D feature extractor, as well as a 3D mesh decoder that consists of feature mapping modules and mesh-conv layers. The 2D feature extractor is an hourglass network that encodes the image content into features at $S$ levels of scale. Correspondingly, the 3D mesh decoder recovers the vertices in a coarse-to-fine manner at $S$ different scales. By design, the mesh decoder at level $s \in \{1, \dots, S\}$ leverages the 2D feature map at level $s$. In the following sections, we will describe the architecture of the 2D feature extractor in 3.1, the pixel-aligned feature mapping module in 3.2, the mesh decoder in 3.3, as well as training details in 3.4.
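Before detailing those subsections, the following structural sketch (PyTorch; the module names, the dense upsampling matrices, and the helpers from the earlier sketches are assumptions based only on the description above, not our exact implementation) outlines how the decoder could interleave pixel-aligned sampling with coarse-to-fine refinement:

```python
import torch
import torch.nn as nn

class CoarseToFineDecoder(nn.Module):
    """Structural sketch of the decoder loop described above: at each of the
    S scales, project the current vertices into the matching 2D feature map,
    fuse the sampled features with geometric features, regress refined
    vertices, and upsample to the next mesh resolution. `sample_fn` and the
    layers play the roles of the hypothetical helpers sketched earlier."""

    def __init__(self, layers, vert_heads, upsamplers, template):
        super().__init__()
        self.layers = nn.ModuleList(layers)          # one fusion/graph-conv block per scale
        self.vert_heads = nn.ModuleList(vert_heads)  # per-scale vertex regressors
        self.upsamplers = upsamplers                 # S-1 dense (N_{s+1} x N_s) matrices
        self.register_buffer('template', template)   # (N_1, 3) coarsest template vertices

    def forward(self, feat_maps, K, adjs, sample_fn):
        # feat_maps: list of S encoder feature maps, ordered coarse to fine
        B = feat_maps[0].shape[0]
        verts = self.template.expand(B, -1, -1)  # start from the template mesh
        geo = verts                              # initial geometric features: 3D coordinates
        for s, (fmap, layer, head, adj) in enumerate(
                zip(feat_maps, self.layers, self.vert_heads, adjs)):
            img = sample_fn(fmap, verts, K)   # pixel-aligned per-vertex features
            geo = layer(img, geo, adj)        # aggregate image + geometric features
            verts = head(geo)                 # regress refined vertex positions
            if s < len(self.upsamplers):      # lift to the next, finer mesh
                U = self.upsamplers[s]
                verts = torch.einsum('mn,bnd->bmd', U, verts)
                geo = torch.einsum('mn,bnd->bmd', U, geo)
        return verts  # (B, N, 3) vertices of the finest mesh
```

The loop makes the coarse-to-fine structure explicit: the decoder stage at scale $s$ consumes the level-$s$ feature map, so finer feature maps refine finer meshes.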