abilities in many domains such as NLP [8]. Transformers
have also proven to be highly effective in many computer
vision domains [10]. Many researchers have studied the effec-
tiveness of Transformers in hand pose and shape estimation
[20, 30, 43, 14, 40, 23].
In this paper, we propose, to the best of our knowledge,
the first approach combining GCNs, Transformers, and self-
supervision that simultaneously estimates the 3D shape
and the 3D pose of two hands interacting with an object,
together with the texture of each hand vertex, from a
monocular RGB image, as shown in Figure 1.
THOR-Net is based on Keypoint RCNN, which extracts
several 2D features (i.e., heatmaps, bounding boxes, feature
maps, and 2D poses) from the monocular RGB image.
To benefit from the power of GCNs, we model all of this 2D
information as two graphs. One graph is passed through our
novel coarse-to-fine GraFormer shape generator network to
estimate meshes for the hands and the object. This network
gradually increases the number of nodes in the graph, starting
from the pose and ending at the shape, while gradually
decreasing the feature size to only 3 values (x, y, z)
corresponding to the location of each vertex in 3D space. The
other graph is passed through a GraFormer-based 2D-to-3D
pose estimation network to estimate 3D poses for the hands
and the object.
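A minimal PyTorch sketch of this coarse-to-fine idea is given below; the stage sizes, feature dimensions, and module names are hypothetical placeholders rather than the exact THOR-Net configuration, and the GraFormer refinement applied at each stage is omitted.

```python
import torch
import torch.nn as nn

class CoarseToFineSketch(nn.Module):
    """Illustrative sketch of a coarse-to-fine shape generator: the number of
    graph nodes grows from the pose toward the mesh while the per-node feature
    size shrinks until only (x, y, z) remains. Stage sizes are placeholders."""

    def __init__(self, node_stages=(29, 128, 512, 2048),
                 feat_stages=(128, 64, 32, 3)):
        super().__init__()
        self.node_up = nn.ModuleList()    # expands the node dimension
        self.feat_proj = nn.ModuleList()  # shrinks the feature dimension
        for i in range(len(node_stages) - 1):
            self.node_up.append(nn.Linear(node_stages[i], node_stages[i + 1]))
            self.feat_proj.append(nn.Linear(feat_stages[i], feat_stages[i + 1]))

    def forward(self, x):                 # x: (batch, nodes, feats)
        for up, proj in zip(self.node_up, self.feat_proj):
            x = up(x.transpose(1, 2)).transpose(1, 2)  # more nodes
            x = proj(x)                                # fewer features
            # a GraFormer block would typically refine x at each stage here
        return x                          # (batch, final_nodes, 3) vertex coords

# usage with hypothetical sizes: 29 pose nodes with 128-dim features
verts = CoarseToFineSketch()(torch.randn(2, 29, 128))
print(verts.shape)  # torch.Size([2, 2048, 3])
```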
The textures of the hand meshes are regressed directly
using a self-supervised photometric loss. To this end, the
texture of each vertex is learned by orthographically
projecting it onto the input image. In contrast to HTML [31],
which learns a statistical hand texture model from a limited
set of hand texture samples, our photometric-loss approach
allows learning hand textures from the large sets of RGB
images available in any hand dataset.
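The following is a hedged sketch of how such a photometric loss can be computed: vertices are orthographically projected onto the image and their predicted colors are compared against the sampled pixel colors. The function name and tensor layout are assumptions, and visibility/occlusion handling is omitted.

```python
import torch
import torch.nn.functional as F

def photometric_loss_sketch(pred_vertex_rgb, verts_3d, image):
    """Sketch of a self-supervised photometric loss: each vertex is
    orthographically projected onto the input image and its predicted color is
    compared against the pixel it lands on. Assumes the x/y coordinates of
    verts_3d are already normalized to [-1, 1] image coordinates.

    pred_vertex_rgb: (B, V, 3) predicted per-vertex colors
    verts_3d:        (B, V, 3) predicted vertex positions
    image:           (B, 3, H, W) input RGB image
    """
    # orthographic projection: simply drop the depth coordinate
    grid = verts_3d[..., :2].unsqueeze(2)                       # (B, V, 1, 2)
    sampled = F.grid_sample(image, grid, align_corners=False)   # (B, 3, V, 1)
    target_rgb = sampled.squeeze(-1).transpose(1, 2)            # (B, V, 3)
    return F.l1_loss(pred_vertex_rgb, target_rgb)
```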
To summarize, we make the following contributions:
• A novel pipeline to reconstruct realistic 3D shapes of
two hands and an object from RGB images, with the fol-
lowing novelties:
– Utilizing heatmaps and features produced by the
Keypoint RCNN to build graphs that help our
GraFormer-based networks estimate 3D pose
and shape.
– Proposing a coarse-to-fine GraFormer for two-
hand and object reconstruction.
• Applying self-supervision based on a photometric loss
to give the hands a more realistic appearance.
• Our method achieves state-of-the-art results for hand
mesh estimation on HO-3D (v3) and for hand pose esti-
mation on the H2O dataset, as shown in Section 4.
2. Related Work
Although most existing works focus on the reconstruction
of a single interacting hand, our work addresses the more
challenging problem of reconstructing two hands and an
object. Here, we briefly describe the most closely related works.
2.1. GCNs for Pose Estimation
Recently, 3D pose estimation from 2D poses using Graph
Convolutional Networks (GCNs) has shown very promising
results [9, 43]. Estimating the 3D counterpart of a single 2D
keypoint in isolation is an ill-posed problem. However, the
other 2D keypoints and their relations to the target keypoint
provide useful cues for estimating its 3D location. The
authors of HopeNet [9] introduced an adaptive GraphUNet
that pools the 2D pose in five stages and then unpools it to
obtain the 3D pose, with skip connections between the
corresponding pooling and unpooling layers.
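A compact sketch of this pool-then-unpool scheme with skip connections is shown below; the node counts and the use of plain linear layers in place of graph convolutions are simplifications for illustration only.

```python
import torch
import torch.nn as nn

class GraphUNetSketch(nn.Module):
    """Illustrative GraphUNet-style 2D-to-3D lifter: the pose graph is pooled
    to fewer nodes and unpooled back, with skip connections between matching
    pooling/unpooling stages. Node counts here are hypothetical."""

    def __init__(self, node_sizes=(21, 15, 7), feat_dim=64):
        super().__init__()
        self.embed = nn.Linear(2, feat_dim)     # 2D keypoint -> features
        self.pools = nn.ModuleList(
            nn.Linear(node_sizes[i], node_sizes[i + 1])
            for i in range(len(node_sizes) - 1))
        self.unpools = nn.ModuleList(
            nn.Linear(node_sizes[i + 1], node_sizes[i])
            for i in reversed(range(len(node_sizes) - 1)))
        self.head = nn.Linear(feat_dim, 3)      # features -> 3D keypoint

    def forward(self, pose_2d):                 # (B, 21, 2)
        x = self.embed(pose_2d)
        skips = []
        for pool in self.pools:                 # coarsen the pose graph
            skips.append(x)
            x = pool(x.transpose(1, 2)).transpose(1, 2)
        for unpool in self.unpools:             # refine it back with skips
            x = unpool(x.transpose(1, 2)).transpose(1, 2) + skips.pop()
        return self.head(x)                     # (B, 21, 3)
```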
GraFormer [43] also lifts 2D poses to 3D, but it performs
considerably better than HopeNet because it combines graph
convolutional layers with the Transformer [37] attention
mechanism. GraFormer extracts local features from the nodes
using graph convolutional layers and global information
about the entire graph using attention layers.
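The sketch below illustrates this combination of local graph convolution and global self-attention in a single block; it is an illustration of the principle, not the exact layer arrangement of GraFormer [43].

```python
import torch
import torch.nn as nn

class GraFormerBlockSketch(nn.Module):
    """Rough sketch of a GraFormer-style block: a graph convolution captures
    local structure from neighboring joints, while multi-head self-attention
    relates all nodes globally."""

    def __init__(self, feat_dim=64, num_heads=4):
        super().__init__()
        self.attn = nn.MultiheadAttention(feat_dim, num_heads, batch_first=True)
        self.gcn_weight = nn.Linear(feat_dim, feat_dim)  # shared node transform
        self.norm1 = nn.LayerNorm(feat_dim)
        self.norm2 = nn.LayerNorm(feat_dim)

    def forward(self, x, adj):
        # x: (B, N, F) node features, adj: (N, N) normalized adjacency matrix
        attn_out, _ = self.attn(x, x, x)                 # global relations
        x = self.norm1(x + attn_out)
        gcn_out = torch.einsum('ij,bjf->bif', adj, self.gcn_weight(x))  # local
        return self.norm2(x + gcn_out)
```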
Spatio-temporal graphs help resolve the depth ambiguity
and severe occlusion challenges in 3D pose estimation
[4, 41], since temporal continuity in videos imposes additional
constraints [15]. Cai et al. [4] therefore built a spatio-
temporal graph from a few temporally adjacent 2D body
poses by adding edges between the joints and their
counterparts in neighboring frames.
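The toy function below shows one way to build such a spatio-temporal adjacency matrix by replicating the per-frame skeleton and linking each joint to its counterpart in the next frame; it is only meant to illustrate the construction described in [4].

```python
import torch

def spatio_temporal_adjacency(spatial_adj, num_frames):
    """Toy construction of a spatio-temporal adjacency matrix: the spatial
    skeleton is replicated per frame and every joint is additionally connected
    to its counterpart in the neighboring frame.
    spatial_adj: (J, J) 0/1 adjacency matrix for a single frame."""
    J = spatial_adj.shape[0]
    N = J * num_frames
    adj = torch.zeros(N, N)
    for t in range(num_frames):
        s = t * J
        adj[s:s + J, s:s + J] = spatial_adj      # intra-frame skeleton edges
        if t + 1 < num_frames:                   # inter-frame (temporal) edges
            idx = torch.arange(J)
            adj[s + idx, s + J + idx] = 1.0
            adj[s + J + idx, s + idx] = 1.0
    return adj
```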
2.2. Hand-Object Reconstruction
Most existing works focus on hand shape estimation during
interaction with an object, without reconstructing the object
shape. The Keypoint Transformer [14] achieves state-of-the-art
results in hand pose estimation from RGB images by extracting
image features for each keypoint and correlating those features
with self-attention layers. HandOccNet [30] is a recent and
robust Transformer-based model that resolves the ambiguity
caused by occlusions between hands and objects by injecting
features from visible hand regions into regions where the hand
is occluded by the object. ArtiBoost [42] addresses the limited
diversity of 3D hand-object poses in existing hand-object
datasets by generating synthetic images; both synthetic and
real images are then used to train a CNN regression model
that estimates the pose. Liu et al. [24] leveraged spatio-temporal
consistency in RGB videos to generate labels for semi-supervised
3D pose estimation.
Two-hand and object reconstruction has not received
enough attention compared to hand-object pose estimation