Learning to Estimate 3-D States of Deformable Linear Objects from Single-Frame Occluded Point Clouds Kangchen Lv Mingrui Yu Yifan Pu Xin Jiang Gao Huang and Xiang Li


Learning to Estimate 3-D States of Deformable Linear Objects from

Single-Frame Occluded Point Clouds

Kangchen Lv, Mingrui Yu, Yifan Pu, Xin Jiang, Gao Huang, and Xiang Li

Abstract: Accurately and robustly estimating the state of deformable linear objects (DLOs), such as ropes and wires, is crucial for DLO manipulation and other applications. However, it remains a challenging open problem due to the high dimensionality of the state space, frequent occlusions, and noise. This paper focuses on learning to robustly estimate the states of DLOs from single-frame point clouds in the presence of occlusions using a data-driven method. We propose a novel two-branch network architecture that exploits the global and local information of the input point cloud, respectively, and design a fusion module to effectively leverage the advantages of both branches. Simulation and real-world experimental results demonstrate that our method can generate globally smooth and locally precise DLO state estimates even from heavily occluded point clouds, and that it can be directly applied to real-world robotic manipulation of DLOs in 3-D space.

I. INTRODUCTION

Robotic manipulation of deformable linear objects (DLOs), such as ropes and wires, has a wide variety of applications in the industrial, service, and health-care sectors [1], [2]. An accurate and robust state estimator for DLOs is a prerequisite for subsequent manipulation. Compared with rigid objects, the infinite-dimensional state space of a DLO makes its deformations very challenging to perceive. In addition, occlusions and noise occur frequently in unstructured environments, which raises the requirements on the robustness of DLO state estimation.

Commonly used representations of DLO states include Fourier-based parameterization [3], implicit latent descriptors learned by neural networks [4], [5], and a chain of uniformly distributed nodes [6]–[9]. Among these, representing a DLO as a chain of 3-D nodes (see Fig. 1) is general across various manipulation tasks and is adopted in this work.

A complete processing pipeline for estimating the DLO state can be roughly divided into three stages: segmentation (i.e., segmenting the DLO from the environment), detection (i.e., estimating the DLO state in a single frame), and tracking (i.e., tracking the deformation across several frames). As the sensors are RGB or RGB-D cameras in most cases, segmenting the DLO region in image space is the essential first step for subsequent processing. The works in [10]–[13] focus on obtaining high-quality pixel-level DLO masks using traditional image processing or data-driven methods. Detection aims at estimating the positions of nodes along the DLO in one frame, with the cleaned sensory data as input. For example, [14], [15] use neural networks to encode the DLO into several sequential key-points in the 2-D image space; [16] estimates a skeleton line and 3-D joint positions on it from the point cloud to represent the DLO, but it is not robust against occlusions and different DLO types. As for tracking, various works have been proposed to track the correspondence of point clouds across video frames in the presence of occlusions and self-intersections [17]–[23]. These works model the DLO tracking task as a GMM-based non-rigid point registration problem with some geometric constraints. However, these pure tracking-based methods rely on an accurate initial state, which requires manual setting or specific initial conditions. Moreover, there are few effective ways to rectify accumulated drift errors or to re-initialize after a tracking failure. Therefore, it is necessary to develop an accurate and robust method for 3-D state estimation of DLOs from a single frame, which can be applied independently to estimate the DLO state in each frame or combined with the tracking methods above to utilize temporal information.

Fig. 1. Illustration of our task: 3-D occlusion-robust DLO state estimation from a single-frame point cloud. Red points are the unordered, incomplete point cloud of the occluded rope (input), and blue connected dots represent our estimated ordered node sequence as its current state (output).

K. Lv, M. Yu, Y. Pu, G. Huang, and X. Li are with the Department of Automation, Tsinghua University, China. X. Jiang is with the Beijing Academy of Artificial Intelligence, Beijing, China. This work was supported in part by the National Key R&D Program of China under Grant 2020AAA0105200, in part by the Institute for Guo Qiang, Tsinghua University, and in part by the National Natural Science Foundation of China under Grants U21A20517 and 52075290. Corresponding author: Xiang Li (xiangli@tsinghua.edu.cn)

In this paper, we focus on occlusion-robustly estimating a sequence of ordered and uniformly distributed nodes from a single-frame point cloud to represent the state of a DLO, as shown in Fig. 1. Note that we use only the point cloud as input, without any auxiliary physical simulation or robot configurations. The challenges of this task are as follows: 1) there are few distinguishable features in the point cloud of a DLO; 2) occlusions and noise are common in the environment; 3) the ability to generalize to different DLOs is required. To deal with these challenges, we propose a novel two-branch network architecture that leverages both global geometry information, to guarantee a smooth and occlusion-robust shape, and local geometry information, for precise estimation. To the best of our knowledge, we are the first to realize accurate and robust 3-D state estimation of DLOs from single-frame point cloud input even with heavy occlusions. Specifically, we first exploit a PointNet++ encoder [24] to extract deep features of the input point cloud and then feed the features into two branches: End-to-End Regression and Point-to-Point Voting. We encourage these two branches to focus on global and local geometry information, respectively, and finally fuse their estimates to combine their advantages. The whole framework is trained on a synthetic dataset generated in simulation, without collecting real-world data. Experiments suggest that our method achieves high performance on occlusion-robust state estimation of DLOs and can be directly applied in real-world scenarios.

arXiv:2210.01433v2 [cs.RO] 2 May 2023

[Figure 2 diagram: input point cloud $X \in \mathbb{R}^{N \times 3}$ → PointNet++ encoder → point-wise feature $F(X) \in \mathbb{R}^{N \times C_{out}}$. End-to-End Regression branch: MaxPool → global feature → MLP. Point-to-Point Voting branch: point-wise MLP → point-wise heatmap and point-wise unit offset → voting. Fusion → estimated node sequence $Y \in \mathbb{R}^{M \times 3}$.]

Fig. 2. Overview of the proposed method for occlusion-robustly estimating the 3-D states of DLOs. The input point cloud, which might be fragmented due to occlusions, is first fed into a PointNet++ encoder, and the extracted features are then processed by two parallel branches: End-to-End Regression and Point-to-Point Voting. The estimation results of these two branches are finally fused by a fusion module to obtain the final output node sequence.

II. PROBLEM STATEMENT

The goal of our method is to estimate the 3-D states of DLOs from a point cloud obtained by an RGB-D camera. In this work, we focus on the state estimation problem and assume that the point cloud of the DLO has already been segmented out of the raw full point cloud by RGB image segmentation. We represent the DLO state as a sequence of $M$ uniformly distributed nodes, where $M$ is a pre-defined number of nodes that can sufficiently describe the DLO state. The problem is to estimate the coordinates of the nodes $Y = [y_1, y_2, \cdots, y_M]^T \in \mathbb{R}^{M \times 3}$ from the input point cloud $X = [x_1, x_2, \cdots, x_N]^T \in \mathbb{R}^{N \times 3}$, where $N$ is the number of points in the segmented point cloud. Note that the input point cloud $X$ is unordered, while the order of the estimated nodes in $Y$, from one end to the other, is given by the index $1, 2, \cdots, M$. In addition, the point cloud of the DLO may be fragmentary and noisy because of occlusions, imperfect segmentation, and low-quality depth images.
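To make the node representation concrete, the following is a minimal sketch, not taken from the paper, of how an ordered sequence of $M$ uniformly distributed nodes $Y$ could be obtained from a known DLO centerline, alongside an unordered noisy point cloud $X$ of the same object. The helper name `resample_uniform_nodes`, the helix-shaped example curve, and the noise level are all illustrative assumptions.

```python
import numpy as np

def resample_uniform_nodes(centerline: np.ndarray, num_nodes: int) -> np.ndarray:
    """Resample an ordered 3-D polyline into nodes equally spaced in arc length.

    Hypothetical helper for illustration; assumes consecutive centerline
    points are distinct so that arc length is strictly increasing.
    """
    segment_lengths = np.linalg.norm(np.diff(centerline, axis=0), axis=1)
    arc_length = np.concatenate([[0.0], np.cumsum(segment_lengths)])
    targets = np.linspace(0.0, arc_length[-1], num_nodes)
    # Interpolate each coordinate separately along the arc-length parameter.
    return np.stack(
        [np.interp(targets, arc_length, centerline[:, k]) for k in range(3)],
        axis=1,
    )

# Assumed example shape: half a turn of a helix sampled at 200 ordered points.
t = np.linspace(0.0, np.pi, 200)
centerline = np.stack([np.cos(t), np.sin(t), 0.1 * t], axis=1)

# Y: ordered node sequence of shape (M, 3), indexed from one end to the other.
Y = resample_uniform_nodes(centerline, num_nodes=10)

# X: unordered, noisy point cloud of shape (N, 3); rows are shuffled on purpose.
rng = np.random.default_rng(0)
X = rng.permutation(centerline + rng.normal(scale=0.002, size=centerline.shape))
```

Here `Y` plays the role of the estimation target and `X` the role of the network input; an occluded input could be imitated by dropping a contiguous subset of rows from `X` before shuffling.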

III. METHOD

As shown in Fig. 2, our proposed method contains two branches: an End-to-End Regression branch and a Point-to-Point Voting branch, which focus on the global and the local geometry information, respectively. Then, a deformable registration module is designed to leverage the advantages of both branches and fuse the two predictions to output the final estimated node sequence.

A. End-to-End Regression

The most straightforward approach is to train an end-to-end network with the point cloud $X \in \mathbb{R}^{N \times 3}$ as input and the node sequence $Y \in \mathbb{R}^{M \times 3}$ as output, which we refer to as End-to-End Regression. We exploit a PointNet++ [24] encoder, denoted as $F(\cdot)$, to extract deep latent features $F(X) \in \mathbb{R}^{N \times C_{out}}$ of the input point cloud $X$, which means that each point in the input point cloud has a $C_{out}$-dimensional feature vector. A max pooling layer is then applied to obtain the global feature $\mathrm{MaxPool}(F(X)) \in \mathbb{R}^{C_{out}}$, which is invariant to the order of the input points. Finally, a fully connected layer $FC_1$ predicts the node sequence $Y^{pred}_{reg}$. The whole regression network is defined as

$$Y^{pred}_{reg} = FC_1(\mathrm{MaxPool}(F(X))). \tag{1}$$

With the ground-truth node coordinates $Y^{gt}$, the training loss function for each sample is

$$L_{reg} = \| Y^{pred}_{reg} - Y^{gt} \|_2. \tag{2}$$
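The regression head in Eqs. (1) and (2) can be sketched in a few lines of NumPy. This is an illustrative sketch under stated assumptions, not the authors' implementation: the PointNet++ encoder is replaced by random stand-in features, $FC_1$ is modeled as a single affine map with hypothetical weights `W` and `b`, the sizes `N`, `C_out`, and `M` are arbitrary, and the loss is read as the L2 norm of the stacked coordinate error.

```python
import numpy as np

rng = np.random.default_rng(0)
N, C_out, M = 256, 64, 20  # hypothetical sizes, not values from the paper

def regression_head(point_features: np.ndarray, W: np.ndarray, b: np.ndarray) -> np.ndarray:
    """Eq. (1): max-pool the per-point features, then apply one affine layer."""
    global_feature = point_features.max(axis=0)      # shape (C_out,)
    return (W @ global_feature + b).reshape(-1, 3)   # shape (M, 3)

def regression_loss(Y_pred: np.ndarray, Y_gt: np.ndarray) -> float:
    """Eq. (2), read here as the L2 norm over all node coordinates."""
    return float(np.linalg.norm(Y_pred - Y_gt))

# Random stand-in for the encoder output F(X); a real encoder would go here.
F_X = rng.normal(size=(N, C_out))
W = rng.normal(scale=0.1, size=(3 * M, C_out))  # assumed FC_1 weights
b = np.zeros(3 * M)                             # assumed FC_1 bias
Y_gt = rng.normal(size=(M, 3))                  # placeholder ground truth

Y_pred = regression_head(F_X, W, b)
# Shuffling the per-point features leaves the pooled feature unchanged.
Y_pred_shuffled = regression_head(F_X[rng.permutation(N)], W, b)
loss = regression_loss(Y_pred, Y_gt)
```

Because the maximum over the point axis does not depend on row order, `Y_pred` and `Y_pred_shuffled` coincide, which is the order-invariance property the text attributes to the pooled global feature.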

We find experimentally that such an end-to-end network keeps the estimated DLO shapes smooth and realistic even with heavily occluded point cloud input, which suggests that the network learns the key global characteristics of DLOs well. However, the predictions often deviate slightly from the actual states, so they are not sufficiently accurate for applications (see Fig. 5). We believe this is caused by the feature max pooling operation, which neglects local information that is crucial for precise estimation.

B. Point-to-Point Voting

To make up for the shortcomings of the end-to-end regression method, we design a point-to-point voting framework that utilizes local geometry information, inspired by earlier works [25], [26]. Instead of using max-pooling layers for direct regression, this method generates point-wise predictions $Y^{pred,1}_{vot}, Y^{pred,2}_{vot}, \cdots, Y^{pred,N}_{vot}$ from the input points $x_1, x_2, \cdots, x_N$ and then uses a point-to-point voting scheme to obtain the final estimate. Specifically, we can regress an offset vector $O_{ij}$, which predicts the vector beginning at input point $x_i$ and ending at node $y_j$.
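The offset definition above can be written out directly. The following is a minimal sketch, assuming that the ground-truth offsets are built from known node positions; the function name `offset_targets` and the random example data are hypothetical, and the sketch covers only the offset definition, not the voting scheme itself.

```python
import numpy as np

def offset_targets(X: np.ndarray, Y: np.ndarray) -> np.ndarray:
    """Return O with O[i, j] = y_j - x_i, the vector from point x_i to node y_j.

    X has shape (N, 3) and Y has shape (M, 3); the result has shape (N, M, 3).
    """
    return Y[None, :, :] - X[:, None, :]

rng = np.random.default_rng(0)
X = rng.normal(size=(5, 3))  # N = 5 input points (illustrative data)
Y = rng.normal(size=(4, 3))  # M = 4 nodes (illustrative data)
O = offset_targets(X, Y)
```

Broadcasting over a new axis avoids an explicit double loop over points and nodes, and every input point carries its own set of $M$ offsets, one per node.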

During inference, $y^{pred,i}_j$ can be calculated by adding
