abilities in many domains such as NLP [8]. Transformers
have also proven to be highly effective in many computer
vision domains [10]. Many researchers have studied the effec-
tiveness of Transformers in hand pose and shape estimation
[20, 30, 43, 14, 40, 23].
In this paper, we propose, to the best of our knowledge,
the first approach combining GCNs, Transformers, and self-
supervision that simultaneously estimates the 3D shape
and the 3D pose of two hands interacting with an object,
together with the texture of each hand vertex, from a
monocular RGB image, as shown in Figure 1.
THOR-Net is based on Keypoint RCNN, which extracts
several 2D features (i.e., heatmaps, bounding boxes, feature
maps, and 2D poses) from the monocular RGB image.
To benefit from the power of GCNs, we model all of this 2D
information as two graphs. One graph is passed through our
novel coarse-to-fine GraFormer shape generator network to
estimate meshes for the hands and the object. This network
gradually increases the number of nodes in the graph, starting
from the pose and ending at the shape, while gradually
decreasing the feature size to only 3 values (x, y, z)
corresponding to the location of each vertex in 3D space. The
other graph is passed through a GraFormer-based 2D-to-3D
pose estimation network to estimate 3D poses for the hands
and the object.
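A minimal PyTorch sketch of this coarse-to-fine idea is given below; the stage sizes, feature dimensions, and module names are hypothetical placeholders rather than the exact THOR-Net configuration, and the GraFormer refinement applied at each stage is omitted.

```python
import torch
import torch.nn as nn

class CoarseToFineSketch(nn.Module):
    """Illustrative sketch of a coarse-to-fine shape generator: the number of
    graph nodes grows from the pose toward the mesh while the per-node feature
    size shrinks until only (x, y, z) remains. Stage sizes are placeholders."""

    def __init__(self, node_stages=(29, 128, 512, 2048),
                 feat_stages=(128, 64, 32, 3)):
        super().__init__()
        self.node_up = nn.ModuleList()    # expands the node dimension
        self.feat_proj = nn.ModuleList()  # shrinks the feature dimension
        for i in range(len(node_stages) - 1):
            self.node_up.append(nn.Linear(node_stages[i], node_stages[i + 1]))
            self.feat_proj.append(nn.Linear(feat_stages[i], feat_stages[i + 1]))

    def forward(self, x):                 # x: (batch, nodes, feats)
        for up, proj in zip(self.node_up, self.feat_proj):
            x = up(x.transpose(1, 2)).transpose(1, 2)  # more nodes
            x = proj(x)                                # fewer features
            # a GraFormer block would typically refine x at each stage here
        return x                          # (batch, final_nodes, 3) vertex coords

# usage with hypothetical sizes: 29 pose nodes with 128-dim features
verts = CoarseToFineSketch()(torch.randn(2, 29, 128))
print(verts.shape)  # torch.Size([2, 2048, 3])
```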
The textures of the hand meshes are regressed directly
using a self-supervised photometric loss. To this end, the
texture of each vertex is learned by orthographically
projecting it onto the input image. In contrast to HTML [31],
which learns a statistical hand texture model from a limited
set of hand texture samples, our photometric-loss approach
allows learning hand textures from the large sets of RGB
images available in any hand dataset.
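The following is a hedged sketch of how such a photometric loss can be computed: vertices are orthographically projected onto the image and their predicted colors are compared against the sampled pixel colors. The function name and tensor layout are assumptions, and visibility/occlusion handling is omitted.

```python
import torch
import torch.nn.functional as F

def photometric_loss_sketch(pred_vertex_rgb, verts_3d, image):
    """Sketch of a self-supervised photometric loss: each vertex is
    orthographically projected onto the input image and its predicted color is
    compared against the pixel it lands on. Assumes the x/y coordinates of
    verts_3d are already normalized to [-1, 1] image coordinates.

    pred_vertex_rgb: (B, V, 3) predicted per-vertex colors
    verts_3d:        (B, V, 3) predicted vertex positions
    image:           (B, 3, H, W) input RGB image
    """
    # orthographic projection: simply drop the depth coordinate
    grid = verts_3d[..., :2].unsqueeze(2)                       # (B, V, 1, 2)
    sampled = F.grid_sample(image, grid, align_corners=False)   # (B, 3, V, 1)
    target_rgb = sampled.squeeze(-1).transpose(1, 2)            # (B, V, 3)
    return F.l1_loss(pred_vertex_rgb, target_rgb)
```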
To summarize, we make the following contributions:
• A novel pipeline to reconstruct realistic 3D shapes of
two hands and an object from RGB images, with the fol-
lowing novelties:
– Utilizing heatmaps and features produced by the
Keypoint RCNN to build graphs that help our
GraFormer-based networks estimate 3D pose
and shape.
– Proposing a coarse-to-fine GraFormer for two-
hand and object reconstruction.
• Applying self-supervision based on a photometric loss
to give the hands a more realistic appearance.
• Our method achieves state-of-the-art results for hand
mesh estimation on HO-3D (v3) and for hand pose esti-
mation on the H2O dataset, as shown in Section 4.
2. Related Work
Although most existing works focus on the reconstruction
of a single interacting hand, our work addresses the more
challenging problem of reconstructing two hands and an
object. Here, we briefly describe the most closely related works.
2.1. GCNs for Pose Estimation
Recently, 3D pose estimation from 2D poses using Graph
Convolutional Networks (GCNs) has shown very promising
results [9, 43]. Estimating the 3D counterpart of a single 2D
keypoint in isolation is an ill-posed problem. However, the
other 2D keypoints and their relations to the target keypoint
provide useful cues for estimating its 3D location. The
authors of HopeNet [9] introduced an adaptive GraphUNet
that pools the 2D pose in five stages and then unpools it to
obtain the 3D pose, with skip connections between the
corresponding pooling and unpooling layers.
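A compact sketch of this pool-then-unpool scheme with skip connections is shown below; the node counts and the use of plain linear layers in place of graph convolutions are simplifications for illustration only.

```python
import torch
import torch.nn as nn

class GraphUNetSketch(nn.Module):
    """Illustrative GraphUNet-style 2D-to-3D lifter: the pose graph is pooled
    to fewer nodes and unpooled back, with skip connections between matching
    pooling/unpooling stages. Node counts here are hypothetical."""

    def __init__(self, node_sizes=(21, 15, 7), feat_dim=64):
        super().__init__()
        self.embed = nn.Linear(2, feat_dim)     # 2D keypoint -> features
        self.pools = nn.ModuleList(
            nn.Linear(node_sizes[i], node_sizes[i + 1])
            for i in range(len(node_sizes) - 1))
        self.unpools = nn.ModuleList(
            nn.Linear(node_sizes[i + 1], node_sizes[i])
            for i in reversed(range(len(node_sizes) - 1)))
        self.head = nn.Linear(feat_dim, 3)      # features -> 3D keypoint

    def forward(self, pose_2d):                 # (B, 21, 2)
        x = self.embed(pose_2d)
        skips = []
        for pool in self.pools:                 # coarsen the pose graph
            skips.append(x)
            x = pool(x.transpose(1, 2)).transpose(1, 2)
        for unpool in self.unpools:             # refine it back with skips
            x = unpool(x.transpose(1, 2)).transpose(1, 2) + skips.pop()
        return self.head(x)                     # (B, 21, 3)
```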
GraFormer [43] also lifts 2D poses to 3D, but it performs
considerably better than HopeNet because it combines graph
convolutional layers with the Transformer [37] attention
mechanism. GraFormer extracts local features from the nodes
using graph convolutional layers and global information
about the entire graph using attention layers.
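The sketch below illustrates this combination of local graph convolution and global self-attention in a single block; it is an illustration of the principle, not the exact layer arrangement of GraFormer [43].

```python
import torch
import torch.nn as nn

class GraFormerBlockSketch(nn.Module):
    """Rough sketch of a GraFormer-style block: a graph convolution captures
    local structure from neighboring joints, while multi-head self-attention
    relates all nodes globally."""

    def __init__(self, feat_dim=64, num_heads=4):
        super().__init__()
        self.attn = nn.MultiheadAttention(feat_dim, num_heads, batch_first=True)
        self.gcn_weight = nn.Linear(feat_dim, feat_dim)  # shared node transform
        self.norm1 = nn.LayerNorm(feat_dim)
        self.norm2 = nn.LayerNorm(feat_dim)

    def forward(self, x, adj):
        # x: (B, N, F) node features, adj: (N, N) normalized adjacency matrix
        attn_out, _ = self.attn(x, x, x)                 # global relations
        x = self.norm1(x + attn_out)
        gcn_out = torch.einsum('ij,bjf->bif', adj, self.gcn_weight(x))  # local
        return self.norm2(x + gcn_out)
```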
Spatio-temporal graphs help resolve the depth ambiguity
and severe occlusion challenges in 3D pose estimation
[4, 41], since temporal continuity in videos imposes additional
constraints [15]. Cai et al. [4] therefore built a spatio-
temporal graph from a few temporally adjacent 2D body
poses by adding edges between the joints and their
counterparts in neighboring frames.
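The toy function below shows one way to build such a spatio-temporal adjacency matrix by replicating the per-frame skeleton and linking each joint to its counterpart in the next frame; it is only meant to illustrate the construction described in [4].

```python
import torch

def spatio_temporal_adjacency(spatial_adj, num_frames):
    """Toy construction of a spatio-temporal adjacency matrix: the spatial
    skeleton is replicated per frame and every joint is additionally connected
    to its counterpart in the neighboring frame.
    spatial_adj: (J, J) 0/1 adjacency matrix for a single frame."""
    J = spatial_adj.shape[0]
    N = J * num_frames
    adj = torch.zeros(N, N)
    for t in range(num_frames):
        s = t * J
        adj[s:s + J, s:s + J] = spatial_adj      # intra-frame skeleton edges
        if t + 1 < num_frames:                   # inter-frame (temporal) edges
            idx = torch.arange(J)
            adj[s + idx, s + J + idx] = 1.0
            adj[s + J + idx, s + idx] = 1.0
    return adj
```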
2.2. Hand-Object Reconstruction
Most existing works focus on hand shape estimation during
interaction with an object, without reconstructing the object
shape. The Keypoint Transformer [14] achieves state-of-the-art
results in hand pose estimation from RGB images by extracting
image features for each keypoint and correlating those features
with self-attention layers. HandOccNet [30] is a recent and
robust Transformer-based model that resolves the ambiguity
caused by occlusions between hands and objects by injecting
features from visible hand regions into regions where the hand
is occluded by the object. ArtiBoost [42] addresses the limited
diversity of 3D hand-object poses in existing hand-object
datasets by generating synthetic images; both synthetic and
real images are then used to train a CNN regression model
that estimates the pose. Liu et al. [24] leveraged spatio-temporal
consistency in RGB videos to generate labels for semi-supervised
3D pose estimation.
Two-hand and object reconstruction has not received
enough attention compared to hand-object pose estimation