
operations on vertices and edges, such as graph convolutions or attention modules, are designed to
aggregate the geometric features of the meshes. With these operations, vertices and edges are visible to
the network; thus, direct connections between pixels in the 2D image feature space and vertices of
the 3D mesh can be established, and operations in the decoder can aggregate both image features and
geometric features, which cannot be realized in parametric methods. This connection and aggregation,
however, have not been fully explored by previous work.
In this paper, we seek to establish these connections and merge local hand features from the appearance
in the input and the geometry in the output. To this end, we utilize a pixel-aligned mapping module to
establish the connections and propose a simple and compact architecture to exploit them.
We design our network by making the following design choices: 1) For the 2D feature extractor,
we keep feature maps at different scales in the encoder, instead of using only the final global feature,
so that local 2D information can be mapped to 3D (a sketch of this mapping follows below). 2) We decode
the mesh in a coarse-to-fine manner to make the best use of the multi-scale information. 3) The operations
of the mesh decoder aggregate both image features and geometric features, rather than geometric features alone.
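A minimal sketch of the pixel-aligned mapping is given below. It assumes that the current vertex estimates have already been projected to normalized image coordinates in [-1, 1] by some camera model; the function name and tensor shapes are illustrative rather than the exact implementation.

```python
# Minimal sketch: each mesh vertex bilinearly samples every encoder scale.
import torch
import torch.nn.functional as F

def sample_vertex_features(feature_maps, verts_2d):
    """feature_maps: list of (B, C_s, H_s, W_s) encoder feature maps.
    verts_2d: (B, V, 2) projected vertex positions, normalized to [-1, 1].
    Returns: (B, V, sum_s C_s) concatenated multi-scale per-vertex features."""
    grid = verts_2d.unsqueeze(2)                                  # (B, V, 1, 2)
    per_scale = []
    for fmap in feature_maps:
        sampled = F.grid_sample(fmap, grid, align_corners=True)   # (B, C_s, V, 1)
        per_scale.append(sampled.squeeze(-1).transpose(1, 2))     # (B, V, C_s)
    return torch.cat(per_scale, dim=-1)
```

Because the sampling is differentiable, gradients from the mesh decoder flow back into the 2D encoder through these per-vertex features.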
Our design is shown in Figure 1. Multi-scale image features are naturally passed to the 3D mesh
decoder. Our experiments show that this design enables better alignment between the image and the
reconstructed mesh. The feature aggregation not only improves the graph convolutional network
substantially but also outperforms the attention mechanism with global features by a large margin.
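As an illustration of this aggregation, the sketch below shows a generic first-order graph convolution applied to the concatenation of per-vertex image features and geometric features; the class name, feature dimensions, and the row-normalized adjacency matrix adj are assumptions for the example, not the exact layer of our decoder.

```python
# Illustrative graph convolution fusing pixel-aligned image features with
# geometric features before aggregating over the fixed mesh neighborhood.
import torch
import torch.nn as nn

class ImageGeoGraphConv(nn.Module):
    def __init__(self, img_dim, geo_dim, out_dim):
        super().__init__()
        self.linear = nn.Linear(img_dim + geo_dim, out_dim)

    def forward(self, img_feat, geo_feat, adj):
        # img_feat: (B, V, img_dim) pixel-aligned image features
        # geo_feat: (B, V, geo_dim) geometric features, e.g. vertex coordinates
        # adj:      (V, V) row-normalized mesh adjacency with self-loops
        x = torch.cat([img_feat, geo_feat], dim=-1)  # fuse appearance and geometry
        x = self.linear(x)                           # per-vertex transform
        x = torch.matmul(adj, x)                     # neighborhood aggregation
        return torch.relu(x)
```

In a coarse-to-fine decoder, such a layer would be applied at each mesh resolution, with the image features re-sampled from the encoder scale that matches the current mesh resolution.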
To summarize, our key contributions are: 1) operations capable of aggregating both local 2D image
features and 3D geometric features on meshes at different scales; 2) connections between pixels of the
2D image appearance in the encoder and vertices of the 3D meshes in the decoder, established by a
pixel-vertex mapping module; and 3) a novel graph convolution architecture that achieves state-of-the-art
results on the FreiHAND benchmark.
2 RELATED WORK
Mesh Reconstruction.
Previous methods employ pre-trained parametric hand and human body models, namely
MANO (Romero et al., 2022) and SMPL (Loper et al., 2015), and estimate the pose and shape
coefficients of the parametric model. However, it is challenging to regress pose and shape coefficients
directly from input images. Researchers have therefore proposed training networks with human priors,
such as skeletons (Lassner et al., 2017) or segmentation maps. Some works regress SMPL parameters by
relying on human keypoints and contour maps of the body (Pavlakos et al., 2018; Tan et al., 2017),
while Omran et al. (2018) utilize the segmentation map of the human body as a supervision signal.
Kanazawa et al. (2018) regress SMPL parameters in a weakly supervised manner using 2D keypoint
reprojection and adversarial learning, and Tung et al. (2017) propose a self-supervised approach to
regressing human parametric models.
Recently, model-free methods (Choi et al., 2020; Moon & Lee, 2020; Kolotouros et al., 2019), which
directly regress human pose and shape from input images, have received increasing attention because
they can express the nonlinear relationship between the image and the predicted 3D space.
Researchers have explored various ways to represent the human body and hand, including 3D meshes (Lin
et al., 2021b; Kolotouros et al., 2019; Choi et al., 2020; Lin et al., 2021a; Litany et al., 2018; Ranjan
et al., 2018; Verma et al., 2018; Wang et al., 2018; Moon & Lee, 2020), voxel spaces (Varol et al.,
2018), and occupancy fields (Saito et al., 2019; Niemeyer et al., 2020; Xu et al., 2019; Saito et al.,
2020; Peng et al., 2020). Among them, voxel-based methods are completely non-parametric but require
substantial computing resources, and the output voxels must be fitted to a body model to obtain the
final 3D human mesh. Among recent approaches, Graph Convolutional Neural Networks
(GCNs) (Kolotouros et al., 2019; Choi et al., 2020; Lin et al., 2021a; Litany et al.,
2018; Ranjan et al., 2018; Verma et al., 2018; Wang et al., 2018; Moon & Lee, 2020) are among the
most popular, because convolution operations on mesh data are particularly convenient with GCNs.
However, while GCNs represent the local features of the mesh well, they cannot adequately capture the
global, long-range interactions between body vertices and joints. Transformer-based methods (Lin et al.,
2021b) use a self-attention mechanism to take full advantage of the interaction between vertices and
joints and exploit the global information of the human body to reconstruct more accurate vertex
positions. However, neither GCN-based nor attention-based methods consider pixel-level semantic
feature alignment