DCL-Net: Deep Correspondence Learning
Network for 6D Pose Estimation
Hongyang Li1∗, Jiehong Lin1,2∗, and Kui Jia1,3†
1South China University of Technology
2DexForce Co. Ltd.
3Peng Cheng Laboratory
{eeli.hongyang,lin.jiehong}@mail.scut.edu.cn, kuijia@scut.edu.cn
Abstract. Establishment of point correspondence between camera and
object coordinate systems is a promising way to solve 6D object poses.
However, surrogate objectives of correspondence learning in 3D space are
a step away from the true ones of object pose estimation, making the
learning suboptimal for the end task. In this paper, we address this short-
coming by introducing a new method of Deep Correspondence Learning
Network for direct 6D object pose estimation, shortened as DCL-Net.
Specifically, DCL-Net employs dual newly proposed Feature Disengage-
ment and Alignment (FDA) modules to establish, in the feature space,
partial-to-partial correspondence and a complete-to-complete one for the par-
tial object observation and its complete CAD model, respectively, which
result in aggregated pose and match feature pairs from the two coordinate
systems; these two FDA modules thus bring complementary advantages.
The match feature pairs are used to learn confidence scores for measur-
ing the qualities of deep correspondence, while the pose feature pairs
are weighted by confidence scores for direct object pose regression. A
confidence-based pose refinement network is also proposed to further
improve pose precision in an iterative manner. Extensive experiments
show that DCL-Net outperforms existing methods on three benchmark-
ing datasets, including YCB-Video, LineMOD, and Occlusion-LineMOD;
ablation studies also confirm the efficacy of our novel designs. Our code is
released publicly at https://github.com/Gorilla-Lab-SCUT/DCL-Net.
Keywords: 6D Pose Estimation, Correspondence Learning
1 Introduction
6D object pose estimation is a fundamental task of 3D semantic analysis with
many real-world applications, such as robotic grasping [7,44], augmented reality
[27], and autonomous driving [8,9,21,42]. The non-linearity of the rotation space
SO(3) makes it hard to handle this nontrivial task through direct pose regression
from object observations [6,11,15,18,24-26,39,45,47]. Many of the data-driven
methods [3,14,20,23,28,31,33,34,38,41] thus achieve the estimation by learning
point correspondence between camera and object coordinate systems.
∗Equal contribution
†Corresponding author
arXiv:2210.05232v1 [cs.CV] 11 Oct 2022
Fig. 1. Illustrations of two kinds of point correspondence between the camera coordinate
system (cam) and the object coordinate system (obj): (a) partial-to-partial correspondence,
mapping the partial observation (cam) to a partial prediction (obj); (b) complete-to-complete
correspondence, mapping the complete CAD model (obj) to a complete prediction (cam).
Best viewed in the electronic version.
Given a partial object observation in the camera coordinate system, along with
its CAD model in the object coordinate one, we show in Fig. 1 two possible ways
to build point correspondence: i) inferring the observed points in the object
coordinate system for partial-to-partial correspondence; ii) inferring the sampled
points of the CAD model in the camera coordinate system for complete-to-complete
correspondence. These two kinds of correspondence offer different advantages.
The partial-to-partial correspondence is of higher quality than the complete-to-
complete one, due to the difficulty of shape completion; the latter, however, is more
robust for recovering the poses of objects under severe occlusions, which the former
can hardly handle.
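To make the two constructions concrete, the following is a minimal NumPy sketch of the ground-truth targets that each kind of correspondence would regress, assuming the pose (R, t) maps object coordinates to camera coordinates as p_cam = R p_obj + t; the function and variable names are illustrative, not from the paper.

```python
import numpy as np

def partial_to_partial_targets(obs_cam: np.ndarray, R: np.ndarray, t: np.ndarray) -> np.ndarray:
    """Map observed (partial) points from the camera frame to the object frame.

    obs_cam: (N, 3) points of the partial observation in camera coordinates.
    Returns the (N, 3) targets a partial-to-partial network would regress,
    i.e. p_obj = R^T (p_cam - t) for each point.
    """
    return (obs_cam - t) @ R  # row-vector form of R^T (p - t)

def complete_to_complete_targets(cad_obj: np.ndarray, R: np.ndarray, t: np.ndarray) -> np.ndarray:
    """Map (complete) CAD-model points from the object frame to the camera frame.

    cad_obj: (M, 3) points sampled from the CAD model in object coordinates.
    Returns the (M, 3) targets in camera coordinates, including parts that
    are occluded in the actual observation, i.e. p_cam = R p_obj + t.
    """
    return cad_obj @ R.T + t
```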
While these methods are promising in solving 6D poses from point corre-
spondence (e.g., via a PnP algorithm), their surrogate correspondence objec-
tives are a step away from the true ones of estimating 6D object poses, thus
making their learning suboptimal for the end task [40]. To this end, we present
a novel method that realizes the above two ways of correspondence establishment
in the feature space via dual newly proposed Feature Disengagement and Align-
ment (FDA) modules, and directly estimates object poses from feature pairs of
the two coordinate systems, which are weighted by confidence scores measuring the
qualities of deep correspondence. We term our method Deep Correspondence
Learning Network, shortened as DCL-Net. Fig. 2 gives the illustration.
For the partial object observation and its CAD model, DCL-Net first ex-
tracts their point-wise feature maps in parallel; then dual Feature Disengage-
ment and Alignment (FDA) modules are designed to establish, in feature space,
the partial-to-partial correspondence and the complete-to-complete one between
camera and object coordinate systems. Specifically, each FDA module takes as
inputs two point-wise feature maps, and disengages each feature map into indi-
vidual pose and match ones; the match feature maps of two systems are then
used to learn an attention map for building deep correspondence; finally, both
pose and match feature maps are aligned and paired across systems based on the
attention map, resulting in pose and match feature pairs, respectively. Since the
two sets of correspondence bring complementary advantages, DCL-Net aggregates
them by fusing the respective pose and match feature pairs of the two
FDA modules. The aggregated match feature pairs are used to learn confidence
scores for measuring the qualities of deep correspondence, while the pose ones
are weighted by the scores to directly regress object poses. A confidence-based
pose refinement network is also proposed to further improve the results of
DCL-Net in an iterative manner. Extensive experiments show that DCL-Net outper-
forms existing methods for 6D object pose estimation on three widely used
datasets, including YCB-Video [4], LineMOD [16], and Occlusion-LineMOD [3];
remarkably, on the more challenging Occlusion-LineMOD, our DCL-Net outper-
forms the state-of-the-art method [13] with an improvement of 4.4% on the met-
ric of ADD(S), revealing the strength of DCL-Net in handling occlusion.
Ablation studies also confirm the efficacy of individual components of DCL-Net.
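As a concrete reading of the pipeline described above, the following PyTorch sketch shows one plausible realization of an FDA module; the class name, layer sizes, and the choice of sharing the disengagement heads across the two inputs are our assumptions for illustration, not the authors' implementation.

```python
import torch
import torch.nn as nn

class FDASketch(nn.Module):
    """Simplified Feature Disengagement and Alignment (FDA) sketch.

    Each input feature map is disengaged into a pose part and a match part;
    the match parts build an attention map that softly aligns features
    across the two coordinate systems.
    """

    def __init__(self, dim: int = 128):
        super().__init__()
        self.pose_head = nn.Conv1d(dim, dim, 1)   # disengage: pose features
        self.match_head = nn.Conv1d(dim, dim, 1)  # disengage: match features

    def forward(self, feat_a: torch.Tensor, feat_b: torch.Tensor):
        # feat_a: (B, C, Na) point-wise features from one coordinate system;
        # feat_b: (B, C, Nb) point-wise features from the other.
        pose_a, match_a = self.pose_head(feat_a), self.match_head(feat_a)
        pose_b, match_b = self.pose_head(feat_b), self.match_head(feat_b)

        # Attention over b's points for every point of a, built from the
        # match features only: (B, Na, Nb).
        attn = torch.softmax(torch.bmm(match_a.transpose(1, 2), match_b), dim=-1)

        # Align b's features to a's points, then pair them channel-wise.
        pose_b_aligned = torch.bmm(pose_b, attn.transpose(1, 2))    # (B, C, Na)
        match_b_aligned = torch.bmm(match_b, attn.transpose(1, 2))  # (B, C, Na)
        pose_pair = torch.cat([pose_a, pose_b_aligned], dim=1)      # (B, 2C, Na)
        match_pair = torch.cat([match_a, match_b_aligned], dim=1)   # (B, 2C, Na)
        return pose_pair, match_pair
```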
Our technical contributions are summarized as follows:
- We design a novel Feature Disengagement and Alignment (FDA) module to
establish deep correspondence between two point-wise feature maps from
different coordinate systems; more specifically, FDA module disengages each
feature map into individual pose and match ones, which are then aligned
across systems to generate pose and match feature pairs, respectively, such
that deep correspondence is established within the aligned feature pairs.
- We propose a new method of Deep Correspondence Learning Network for
direct regression of 6D object poses, termed as DCL-Net, which employs
dual FDA modules to establish, in feature space, partial-to-partial corre-
spondence and complete-to-complete one between camera and object coor-
dinate systems, respectively; these two FDA modules bring complementary
advantages.
- Match feature pairs of the dual FDA modules are aggregated and used for
learning confidence scores that measure the qualities of correspondence, while
pose feature pairs are weighted by the scores for estimation of the 6D pose;
a confidence-based pose refinement network is also proposed to iteratively
improve pose precision. A sketch of this weighted regression follows this list.
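To fix ideas, here is a hypothetical sketch of the confidence-weighted regression step; the 6D rotation parameterization and all layer sizes are assumptions for illustration, not taken from the paper.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class ConfidenceWeightedPoseHead(nn.Module):
    """Hypothetical confidence-weighted pose regression sketch.

    Aggregated match feature pairs predict per-correspondence confidence
    scores; aggregated pose feature pairs are pooled with those scores and
    regressed to a rotation and a translation.
    """

    def __init__(self, dim: int = 256):
        super().__init__()
        self.conf_head = nn.Sequential(nn.Conv1d(dim, 1, 1), nn.Sigmoid())
        self.pose_mlp = nn.Sequential(nn.Linear(dim, 256), nn.ReLU(), nn.Linear(256, 9))

    def forward(self, pose_pair: torch.Tensor, match_pair: torch.Tensor):
        # pose_pair, match_pair: (B, C, N) aggregated feature pairs.
        conf = self.conf_head(match_pair)                    # (B, 1, N) confidence scores
        w = conf / (conf.sum(dim=-1, keepdim=True) + 1e-8)   # normalized weights
        pooled = (pose_pair * w).sum(dim=-1)                 # (B, C) weighted pooling
        out = self.pose_mlp(pooled)                          # (B, 9)
        rot6d, t = out[:, :6], out[:, 6:]                    # 6D rotation + translation

        # Gram-Schmidt: turn the 6D output into an orthonormal rotation matrix.
        a1, a2 = rot6d[:, :3], rot6d[:, 3:]
        b1 = F.normalize(a1, dim=-1)
        b2 = F.normalize(a2 - (b1 * a2).sum(-1, keepdim=True) * b1, dim=-1)
        b3 = torch.cross(b1, b2, dim=-1)
        R = torch.stack([b1, b2, b3], dim=-1)                # (B, 3, 3)
        return R, t, conf
```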
2 Related Work
6D Pose Estimation from RGB Data This body of work can be broadly
categorized into three types: i) holistic methods [11,15,18], which directly estimate
object poses; ii) keypoint-based methods [28,33,34], which establish 2D-3D cor-
respondence via 2D keypoint detection, followed by a PnP/RANSAC algorithm
to solve the poses; iii) dense correspondence methods [3,20,23,31], which make
dense pixel-wise predictions and vote for the final results.
Due to the loss of geometry information, these methods are sensitive to lighting
conditions and appearance textures, and are thus inferior to RGB-D methods.
6D Pose Estimation from RGB-D Data Depth maps provide rich geometry
information complementary to the appearance information from RGB images. Traditional
methods [3,16,32,37,43] solve object poses by extracting features from RGB-
D data and performing correspondence grouping and hypothesis verification.
Earlier deep methods, such as PoseCNN [45] and SSD-6D [19], first learn coarse poses
from RGB images, and then refine them on point clouds using ICP [2] or
MCN [22]. Recently, learning deep features of point clouds has become an effective
way to improve pose precision, especially for direct regression methods [39,47],
which strive to enhance pose embeddings with deep geometry features, given
the difficulty of learning rotations in a nonlinear space. Wang et
al. present DenseFusion [39], which fuses local features of RGB images and point
clouds in a point-wise manner, and thus explicitly reasons about appearance
and geometry information to make the learning more discriminative; to cope with
incomplete and noisy shape information, Zhou et al. propose PR-GCN [47],
which polishes point clouds and enhances pose embeddings via a Graph Convolutional
Network. On the other hand, dense correspondence methods show the advantages
of deep networks in building point correspondence in Euclidean space; for
example, He et al. propose PVN3D [14] to regress dense keypoints, and achieve
remarkable results. While promising, these methods are usually trained with
surrogate objectives instead of the true ones of estimating 6D poses, making the
learning suboptimal for the end task.
Our proposed DCL-Net borrows the idea from dense correspondence meth-
ods by learning deep correspondence in feature space, and weights the feature
correspondence based on confidence scores for direct estimation of object poses.
Besides, the learned correspondence is also utilized by an iterative pose refine-
ment network for precision improvement.
3 Deep Correspondence Learning Network
Given the partial object observation X_c in the camera coordinate system, along
with the object CAD model Y_o in the object coordinate one, our goal is to
estimate the 6D pose (R, t) between these two systems, where R ∈ SO(3) stands
for a rotation, and t ∈ R^3 for a translation.
Fig. 2 gives the illustration of our proposed Deep Correspondence Learning
Network (dubbed DCL-Net). DCL-Net first extracts point-wise features of
X_c and Y_o (cf. Sec. 3.1), then establishes correspondence in feature space via
dual Feature Disengagement and Alignment modules (cf. Sec. 3.2), and finally
regresses the object pose (R, t) with confidence scores based on the learned
deep correspondence (cf. Sec. 3.3). The training objectives of DCL-Net are given
in Sec. 3.4. A confidence-based pose refinement network is also introduced to
iteratively improve pose precision (cf. Sec. 3.5).
3.1 Point-wise Feature Extraction
We represent the inputs of the object observation X_c and its CAD model Y_o as
(I_{X_c}, P_{X_c}) and (I_{Y_o}, P_{Y_o}), with N_X and N_Y sampled points, respectively, where
P denotes a point set, and I denotes the RGB values of the points in P.
As shown in Fig. 2, we use two parallel backbones to extract their point-wise
features F_{X_c} and F_{Y_o}, respectively. Following [12], both backbones are built
on 3D Sparse Convolutions [10], whose volumetric features are then
converted to point-level ones; more details of the architectures are given in
the supplementary material. Note that for each object instance, F_{Y_o} can be
pre-computed during inference for efficiency.
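The voxelize-then-devoxelize conversion described above can be sketched as follows; the backbone interface and the voxel size are assumptions for illustration. Since F_{Y_o} depends only on the CAD model, the same routine can be run once per object instance and its output cached for inference.

```python
import torch

def extract_point_features(backbone, points, colors, voxel_size=0.005):
    """Sketch of point-wise feature extraction with a sparse-conv backbone.

    `backbone` stands in for a 3D sparse convolutional network; its
    (voxel_coords, voxel_feats) -> voxel_feats interface is an assumption.
    Volumetric output features are converted back to point-level ones by
    indexing each point's voxel.
    """
    # Voxelize: assign each point the integer index of the voxel it falls into.
    coords = torch.floor(points / voxel_size).long()                        # (N, 3)
    vox_coords, point2vox = torch.unique(coords, dim=0, return_inverse=True)

    # Average the RGB values of the points inside each voxel.
    n_vox = vox_coords.shape[0]
    vox_rgb = torch.zeros(n_vox, 3).index_add_(0, point2vox, colors)
    counts = torch.zeros(n_vox, 1).index_add_(0, point2vox, torch.ones(points.shape[0], 1))
    vox_rgb = vox_rgb / counts

    vox_feats = backbone(vox_coords, vox_rgb)  # (V, C) volumetric features
    return vox_feats[point2vox]                # (N, C) point-wise features
```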