THOR-Net: End-to-end Graformer-based Realistic Two Hands and Object
Reconstruction with Self-supervision
Ahmed Tawfik Aboukhadra1,2, Jameel Malik1,3, Ahmed Elhayek4,
Nadia Robertini1, Didier Stricker1,2
1DFKI-AV Kaiserslautern, 2TU Kaiserslautern, 3NUST-SEECS Pakistan, 4UPM Saudi Arabia
Abstract

Realistic reconstruction of two hands interacting with objects is a new and challenging problem that is essential for building personalized Virtual and Augmented Reality environments. Graph Convolutional Networks (GCNs) allow for the preservation of the topologies of hand poses and shapes by modeling them as a graph. In this work, we propose THOR-Net, which combines the power of GCNs, Transformers, and self-supervision to realistically reconstruct two hands and an object from a single RGB image. Our network comprises two stages: a feature extraction stage and a reconstruction stage. In the feature extraction stage, a Keypoint RCNN is used to extract 2D poses, feature maps, heatmaps, and bounding boxes from a monocular RGB image. Thereafter, this 2D information is modeled as two graphs and passed to the two branches of the reconstruction stage. The shape reconstruction branch estimates meshes of two hands and an object using our novel coarse-to-fine GraFormer shape network. The 3D poses of the hands and the object are reconstructed by the other branch using a GraFormer network. Finally, a self-supervised photometric loss is used to directly regress the realistic texture of each vertex in the hands' meshes. Our approach achieves state-of-the-art results in hand shape estimation on the HO-3D dataset (10.0 mm), exceeding ArtiBoost (10.8 mm). It also surpasses other methods in hand pose estimation on the challenging two hands and object (H2O) dataset by 5 mm on the left-hand pose and 1 mm on the right-hand pose. The THOR-Net code will be available at https://github.com/ATAboukhadra/THOR-Net.
1. Introduction
Realistic hand-object shape reconstruction is crucial for many Augmented Reality (AR) and Virtual Reality (VR) applications in order to create a more immersive, personalized experience for users. Moreover, the hand pose is useful for human-computer interaction, action recognition, human behavior analysis, and gesture recognition applications [9, 28, 2, 6, 1, 26]. The recent advancements in hand, body, and object pose estimation [17, 30, 14, 15, 28] are promising. However, little attention has been given to the joint reconstruction of two hands interacting with an object [15, 17, 2, 9, 36]. This is a challenging problem due to varying hand shapes and textures, many degrees of freedom (DOF), the self-similarity of hand parts, two-hand self-occlusions, and hand-object mutual occlusion, especially from a monocular RGB image, which only contains 2D information.

Figure 1. Our GraFormer-based algorithm jointly reconstructs the poses and textured shapes of up to two hands, together with the shape of one object, from a monocular RGB image. Note that the hands' textures of the above shapes were directly regressed for each vertex based on self-supervised training.
By utilizing the recent advances in deep learning (e.g., GCNs, Transformers, and self-supervised learning), several algorithms for simultaneous hand pose and shape estimation have been introduced. Recently, many researchers have used Graph Convolutional Networks (GCNs) [39] to address the challenges of pose estimation [9, 41, 43, 4] and shape reconstruction [2, 28, 38]. GCNs preserve the inherent kinematic and graphical structure of hand pose and shape. This property allows GCNs to handle depth ambiguity and occlusions, as they correlate the visible parts of the hand with the non-visible parts [9]. Transformer networks [37] have also shown great
abilities in many domains such as NLP [8], and have proven to be highly effective in many computer vision domains [10]. Many researchers have studied the effectiveness of Transformers in hand pose and shape estimation [20, 30, 43, 14, 40, 23].
In this paper, we propose, to the best of our knowledge, the first approach combining GCNs, Transformers, and self-supervision that simultaneously estimates the 3D shape and the 3D pose of two hands interacting with an object, together with the texture of each vertex of the hands, given a monocular RGB image, as shown in Figure 1.
THOR-Net is based on a Keypoint RCNN, which extracts several 2D features (i.e., heatmaps, bounding boxes, feature maps, and 2D poses) from the monocular RGB image. To benefit from the power of GCNs, we model all of this 2D information as two graphs. One graph is passed through our novel coarse-to-fine GraFormer shape generator network to estimate meshes for the hands and the object. This network gradually increases the number of nodes in the graph, starting from the pose and ending at the shape, while gradually decreasing the size of the features down to the three values (x, y, z) that correspond to each vertex location in 3D space. The other graph is passed through a 2D-to-3D pose estimation network, based on GraFormer, which estimates the 3D poses of the hands and the object.
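To make the coarse-to-fine idea concrete, below is a minimal PyTorch sketch of such a shape generator. All stage sizes are illustrative placeholders (e.g., 50 pose keypoints, 1556 vertices for two MANO hands), not the exact THOR-Net configuration, and each learned unpooling step is reduced to a single linear map, whereas the actual network interleaves GraFormer blocks between stages:

```python
# A minimal sketch of the coarse-to-fine idea: the node count grows from pose
# keypoints to mesh vertices while the per-node feature size shrinks to 3D
# coordinates. Stage sizes are placeholders, not the THOR-Net configuration.
import torch
import torch.nn as nn


class CoarseToFineShapeSketch(nn.Module):
    def __init__(self, num_keypoints=50, num_vertices=1556, feat_dim=1024):
        super().__init__()
        # Node counts per stage: pose graph -> intermediate graphs -> full mesh.
        nodes = [num_keypoints, 250, 700, num_vertices]
        # Feature sizes per stage: rich features -> (x, y, z) coordinates.
        feats = [feat_dim, 256, 64, 3]
        self.unpool = nn.ModuleList(
            nn.Linear(nodes[i], nodes[i + 1]) for i in range(3))
        self.shrink = nn.ModuleList(
            nn.Linear(feats[i], feats[i + 1]) for i in range(3))

    def forward(self, x):            # x: (batch, num_keypoints, feat_dim)
        for unpool, shrink in zip(self.unpool, self.shrink):
            x = unpool(x.transpose(1, 2)).transpose(1, 2)  # grow node count
            x = shrink(x)                                  # shrink features
        return x                     # (batch, num_vertices, 3) vertex positions
```

The key design choice this illustrates is that pose and shape live on the same graph: the mesh is produced by progressively refining the pose graph rather than by regressing vertices from a flat feature vector.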
The hands' textures are directly regressed on the meshes using a self-supervised photometric loss. To this end, the texture of each vertex is learned by orthographically projecting it onto the input image. In contrast to HTML [31], which learns a statistical hand texture model from a limited set of hand texture samples, our photometric loss allows hand textures to be learned from the large set of RGB images in any hands dataset.
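The following is a hedged sketch of such a photometric loss: each predicted vertex is orthographically projected onto the input image (dropping depth), the RGB value at that pixel is sampled, and the learned per-vertex texture is pulled towards it. Visibility handling and the exact loss weighting used in THOR-Net are omitted, and the function assumes vertex x/y coordinates are already normalized to [-1, 1]:

```python
# Sketch of a self-supervised photometric loss via orthographic projection.
import torch
import torch.nn.functional as F


def photometric_loss(vertices, vertex_rgb, image):
    """vertices:   (B, V, 3) predicted 3D vertices; x/y normalized to [-1, 1]
    vertex_rgb: (B, V, 3) learned per-vertex texture in [0, 1]
    image:      (B, 3, H, W) input RGB image in [0, 1]"""
    grid = vertices[:, :, :2].unsqueeze(2)         # (B, V, 1, 2), drop depth
    sampled = F.grid_sample(image, grid,
                            align_corners=False)   # (B, 3, V, 1)
    sampled = sampled.squeeze(-1).transpose(1, 2)  # (B, V, 3) image colors
    return F.l1_loss(vertex_rgb, sampled)
```

Because the supervision signal is just the input image itself, this loss needs no texture annotations, which is what lets it scale to any RGB hands dataset.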
To summarize, we make the following contributions:

- A novel pipeline to reconstruct realistic 3D shapes for two hands and an object from RGB images, with the following novelties:
  - Utilizing heatmaps and features produced by the Keypoint RCNN to build graphs that help our GraFormer-based networks estimate 3D pose and shape.
  - Proposing a coarse-to-fine GraFormer for two hands and object reconstruction.
- Applying self-supervision based on a photometric loss to produce a more realistic rendering of the hands.
- Achieving state-of-the-art results for hand mesh estimation on HO-3D (v3) and hand pose estimation on the H2O dataset, as shown in Section 4.
2. Related Work
Although most existing works focus on the reconstruction of a single interacting hand, our work addresses the more challenging problem of two-hand and object reconstruction. Here, we briefly describe the most closely related works.
2.1. GCNs for Pose Estimation
Recently, 3D pose estimation from 2D poses using Graph Convolutional Networks (GCNs) has shown very promising results [9, 43]. Estimating the 3D counterpart of a single 2D keypoint in isolation is an ill-posed problem. However, the other 2D keypoints and their relations to the target keypoint provide useful cues for estimating its 3D location. The authors of HopeNet [9] introduced an adaptive GraphUNet that pools the 2D pose in five stages and then unpools it to obtain the 3D pose, with skip connections between the corresponding pooling and unpooling layers.
The GraFormer [43] also transforms 2D poses to 3D; however, it performs much better than HopeNet because it combines graph convolutional layers with the Transformer [37] attention mechanism. The GraFormer extracts local features from the nodes using graph convolutional layers, and global information about the entire graph using the attention layers, as sketched below.
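The following simplified PyTorch block illustrates this combination of global attention and local graph convolution. The layer ordering, normalization, and the form of the graph convolution follow common transformer practice and are not necessarily identical to [43]:

```python
# A simplified GraFormer-style block: multi-head self-attention captures
# global relations among all keypoints; a graph convolution over the skeleton
# adjacency captures local kinematic structure.
import torch
import torch.nn as nn


class GraFormerBlockSketch(nn.Module):
    def __init__(self, dim, adjacency, num_heads=4):
        super().__init__()
        # Normalized adjacency (with self-loops) of the hand/object skeleton.
        self.register_buffer("adj", adjacency)
        self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.gcn = nn.Linear(dim, dim)
        self.norm1 = nn.LayerNorm(dim)
        self.norm2 = nn.LayerNorm(dim)

    def forward(self, x):                     # x: (batch, nodes, dim)
        # Global step: every node attends to every other node.
        a, _ = self.attn(x, x, x)
        x = self.norm1(x + a)
        # Local step: aggregate features over skeleton neighbours.
        x = self.norm2(x + self.gcn(self.adj @ x))
        return x
```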
Spatiotemporal graphs address the depth ambiguity and severe occlusion challenges in 3D pose estimation [4, 41], since temporal continuity in videos imposes temporal constraints [15]. Therefore, Cai et al. [4] created a spatio-temporal graph from a few temporally adjacent 2D body poses by adding edges between each joint and its counterparts in neighboring frames.
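A minimal sketch of this graph construction, in the spirit of [4] but not taken from it, is to replicate the per-frame skeleton adjacency along the diagonal and add temporal edges connecting each joint to itself in the next frame:

```python
# Sketch: build a spatio-temporal adjacency matrix for T stacked 2D poses.
import torch


def spatio_temporal_adjacency(spatial_adj, num_frames):
    """spatial_adj: (J, J) 0/1 skeleton adjacency for one frame.
    Returns a (J * num_frames, J * num_frames) adjacency matrix."""
    J = spatial_adj.shape[0]
    adj = torch.zeros(J * num_frames, J * num_frames)
    for t in range(num_frames):
        s = t * J
        adj[s:s + J, s:s + J] = spatial_adj  # intra-frame skeleton edges
        if t + 1 < num_frames:               # temporal edges: frame t <-> t+1
            idx = torch.arange(J)
            adj[s + idx, s + J + idx] = 1.0
            adj[s + J + idx, s + idx] = 1.0
    return adj
```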
2.2. Hand-Object Reconstruction
Most existing works focus on hand shape estimation under interaction with an object without considering object shape reconstruction. The Keypoint Transformer [14] achieves state-of-the-art results in hand pose estimation from RGB images by extracting features from the image for each keypoint and correlating those features using self-attention layers. HandOccNet [30] is a recent and robust transformer-based model that resolves the ambiguity caused by occlusions between hands and objects by injecting features from visible areas of the hand into areas where the hand is occluded by the object. ArtiBoost [42] addresses the limited diversity of hand-object poses in 3D space in existing hand-object datasets by creating synthetic images; both synthetic and real images are then used to train a CNN regression model that estimates the pose. Liu et al. [24] leveraged spatiotemporal consistency in RGB videos to generate labels for semi-supervised training to estimate 3D pose.

Two-hand and object reconstruction has received little attention compared to hand-object pose estimation.