Learning to Estimate 3-D States of Deformable Linear Objects from
Single-Frame Occluded Point Clouds
Kangchen Lv, Mingrui Yu, Yifan Pu, Xin Jiang, Gao Huang, and Xiang Li
Abstract: Accurately and robustly estimating the state of
deformable linear objects (DLOs), such as ropes and wires, is
crucial for DLO manipulation and other applications. However,
it remains a challenging open issue due to the high dimensionality
of the state space, frequent occlusions, and noise.
This paper focuses on learning to robustly estimate the states
of DLOs from single-frame point clouds in the presence of
occlusions using a data-driven method. We propose a novel
two-branch network architecture that exploits the global and local
information of the input point cloud, respectively, and design a fusion
module to effectively leverage the advantages of both branches.
Simulation and real-world experimental results demonstrate
that our method can generate globally smooth and locally
precise DLO state estimation results even with heavily occluded
point clouds, and that these results can be directly applied to
real-world robotic manipulation of DLOs in 3-D space.
I. INTRODUCTION
Robotic manipulation of deformable linear objects
(DLOs), such as ropes and wires, has a wide variety of
applications in the industrial, service, and health-care sectors
[1], [2]. An accurate and robust state estimator for DLOs
is a prerequisite for subsequent manipulation.
Compared to rigid objects, the infinite-dimensional DLO
state space makes it very challenging to perceive deformations.
Besides, occlusions and noise occur frequently in
unstructured environments, which places higher demands on
the robustness of DLO state estimation.
Commonly used representations of DLO states
include Fourier-based parameterization [3], implicit latent
descriptors learned by neural networks [4], [5], and a chain
of uniformly distributed nodes [6]–[9]. Among these
representations, a chain of 3-D nodes (see
Fig. 1) is general across manipulation tasks and is
adopted in this work.

K. Lv, M. Yu, Y. Pu, G. Huang, and X. Li are with the Department of Automation,
Tsinghua University, China. X. Jiang is with the Beijing Academy
of Artificial Intelligence, Beijing, China. This work was supported in part by
the National Key R&D Program of China under Grant 2020AAA0105200,
in part by the Institute for Guo Qiang, Tsinghua University, and in part by
the National Natural Science Foundation of China under Grant U21A20517
and 52075290. Corresponding author: Xiang Li (xiangli@tsinghua.edu.cn)

Fig. 1. Illustration of our task: 3-D occlusion-robust DLO state estimation
from a single-frame point cloud. Red points are the unordered incomplete
point cloud of the occluded rope (input), and blue connected dots represent our
estimated ordered node sequence as its current state (output).

A complete processing stream to estimate the DLO state
can be roughly divided into three procedures: segmentation
(i.e., segmenting the DLO from the environment), detection (i.e.,
estimating the DLO state in a single frame), and tracking
(i.e., tracking the deformation across several frames). As
the sensors are RGB or RGB-D cameras in most cases, segmenting
the DLO region in image space is the essential
first step for subsequent processing. The works in [10]–[13] focus on
how to obtain high-quality pixel-level DLO masks using
traditional image processing or data-driven methods. As for
detection, this step aims to estimate the positions of nodes
along the DLO in one frame, taking the cleaned sensory data
as input. For example, [14], [15] use neural networks to
encode the DLO into several sequential key-points in the 2-D
image space; [16] estimates a skeleton line and 3-D joint
positions on it from a point cloud to represent the DLO, but
is not robust against occlusions or different DLO types. As
for tracking, various works have been proposed to track
the correspondence of point clouds across video frames in
the presence of occlusions and self-intersections [17]–[23].
These works model the DLO tracking task as a Gaussian mixture
model (GMM)-based non-rigid point registration problem with some
geometric constraints. However, these pure tracking-based methods rely
on an accurate initial state, which requires manual setting or
specific initial conditions. Besides, there are few effective
ways to rectify accumulated drift errors or to re-initialize
after a tracking failure. Therefore, it is necessary to develop an
accurate and robust 3-D state estimation method for DLOs
from a single frame, which can be applied independently
to estimate the DLO state in each frame or combined with
the tracking methods above to utilize temporal information.
In this paper, we focus on occlusion-robust estimation of a sequence
of ordered and uniformly distributed nodes from a single-frame point
cloud to represent the state of a DLO, as
shown in Fig. 1. Note that we use only the point cloud as
input, without any auxiliary physical simulation or robot
configurations. The challenges of this task are as follows:
1) there are few distinguishable features in the point cloud
of DLOs; 2) occlusions and noise are common in the
environment; 3) generalization ability for different DLOs
is required. To deal with the challenges above, we propose a
novel two-branch network architecture that leverages both the
global geometry information, which guarantees a smooth and
occlusion-robust shape, and the local geometry information, which
enables precise estimations. To the best of our knowledge, we are
the first to realize accurate and robust 3-D state estimation
of DLOs from single-frame point cloud input even with
heavy occlusions. Specifically, we first exploit a PointNet++
encoder [24] to extract deep features of the input point cloud
and then feed the features into two branches: End-to-End Regression
and Point-to-Point Voting. We encourage these two
branches to focus on global and local geometry information,
respectively, and finally fuse their estimations to combine
their advantages. The whole framework is trained on a synthetic
dataset generated in simulation, without collecting real-world
data. Experiments suggest that our method achieves high
performance on occlusion-robust state estimation of DLOs
and can be directly applied in real-world scenarios.

arXiv:2210.01433v2 [cs.RO] 2 May 2023

[Fig. 2 diagram labels: input point cloud $\mathbf{X} \in \mathbb{R}^{N \times 3}$; PointNet++ encoder; point-wise feature $F(\mathbf{X}) \in \mathbb{R}^{N \times C_{out}}$; End-to-End Regression (MaxPool, global feature, MLP); Point-to-Point Voting (point-wise MLP, point-wise heatmap, point-wise unit offset, voting); Fusion; estimated node sequence $\mathbf{Y} \in \mathbb{R}^{M \times 3}$.]
Fig. 2. Overview of the proposed method for occlusion-robustly estimating the 3-D states of DLOs. The input point cloud, which might be fragmented
due to occlusions, is first fed into a PointNet++ encoder, and the extracted features are then processed by two parallel branches: End-to-End Regression and
Point-to-Point Voting. The estimation results of these two branches are finally fused by a fusion module to obtain the final output node sequence.
II. PROBLEM STATEMENT
The goal of our method is to estimate the 3-D states of
DLOs from a point cloud obtained by an RGB-D camera. In
this work, we focus on the state estimation problem and
assume that the point cloud of the DLO has already been
segmented out of the raw full point cloud by RGB image
segmentation. We represent the DLO state as a sequence of
$M$ uniformly distributed nodes, where $M$ is a pre-defined
number of nodes that can sufficiently describe the DLO state.
The problem is to estimate the coordinates of the nodes
$\mathbf{Y} = [\mathbf{y}_1, \mathbf{y}_2, \cdots, \mathbf{y}_M]^T \in \mathbb{R}^{M \times 3}$
from the input point cloud
$\mathbf{X} = [\mathbf{x}_1, \mathbf{x}_2, \cdots, \mathbf{x}_N]^T \in \mathbb{R}^{N \times 3}$,
where $N$ is the number of
points in the segmented point cloud. Note that the input point
cloud $\mathbf{X}$ is unordered, while the order of the estimated nodes in
$\mathbf{Y}$ from one end to the other is given by the
index $1, 2, \cdots, M$. In addition, the point cloud of the DLO
may be fragmentary and noisy because of occlusions,
imperfect segmentation, and low-quality depth images.
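
To make the target representation concrete, the following is a minimal sketch of how an ordered centerline could be resampled into $M$ nodes that are uniformly spaced in arc length. It is an illustration only: the function name `resample_uniform` and the toy centerline are assumptions made here, and this excerpt does not state how the ground-truth nodes are actually produced.

```python
import math

def resample_uniform(polyline, m):
    """Resample an ordered 3-D polyline into m nodes equally spaced in arc length.

    Hypothetical helper for illustration; not taken from the paper.
    """
    if m < 2 or len(polyline) < 2:
        raise ValueError("need at least two input points and m >= 2")
    # Cumulative arc length at each input vertex.
    cum = [0.0]
    for p, q in zip(polyline, polyline[1:]):
        cum.append(cum[-1] + math.dist(p, q))
    total = cum[-1]
    nodes = []
    seg = 0
    for k in range(m):
        target = total * k / (m - 1)
        # Advance to the segment that contains the target arc length.
        while seg < len(cum) - 2 and cum[seg + 1] < target:
            seg += 1
        span = cum[seg + 1] - cum[seg]
        t = 0.0 if span == 0.0 else (target - cum[seg]) / span
        p, q = polyline[seg], polyline[seg + 1]
        # Linear interpolation inside the segment.
        nodes.append(tuple(a + t * (b - a) for a, b in zip(p, q)))
    return nodes

# Toy L-shaped centerline of total length 3, resampled into M = 4 nodes.
centerline = [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0), (1.0, 2.0, 0.0)]
Y = resample_uniform(centerline, 4)
```

The returned list keeps the end-to-end order of the input, so the list index plays the role of the node index $1, 2, \cdots, M$ in the problem statement.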
III. METHOD
As shown in Fig. 2, our proposed method contains two
branches: an End-to-End Regression branch and a Point-to-Point
Voting branch, which focus on the global and the
local geometry information, respectively. Then, a deformable
registration module is designed to leverage the advantages of
both branches and fuse the two predictions to output the final
estimated node sequence.
A. End-to-End Regression
The most straightforward approach is to train an end-to-end
network with the point cloud $\mathbf{X} \in \mathbb{R}^{N \times 3}$ as input and
the node sequence $\mathbf{Y} \in \mathbb{R}^{M \times 3}$ as output, which we refer to
as End-to-End Regression. We exploit a PointNet++ [24]
encoder, denoted as $F(\cdot)$, to extract deep latent features
$F(\mathbf{X}) \in \mathbb{R}^{N \times C_{out}}$ of the input point cloud $\mathbf{X}$, which means
that each point in the input point cloud has a $C_{out}$-dimensional
feature vector. A max pooling layer is then applied to get the
global feature $\mathrm{MaxPool}(F(\mathbf{X})) \in \mathbb{R}^{C_{out}}$, which is invariant
to the order of the input points. Finally, a fully-connected layer $FC_1$
predicts the node sequence $\mathbf{Y}^{pred}_{reg}$. The whole regression
network is defined as
$$\mathbf{Y}^{pred}_{reg} = FC_1(\mathrm{MaxPool}(F(\mathbf{X}))). \tag{1}$$
With the ground-truth node coordinates $\mathbf{Y}^{gt}$, the training loss
function for each sample is
$$L_{reg} = \| \mathbf{Y}^{pred}_{reg} - \mathbf{Y}^{gt} \|_2. \tag{2}$$
We find experimentally that such an end-to-end network
produces estimated DLO shapes that are smooth and
look like real DLOs even with heavily occluded point cloud
input, which suggests that this network can learn the key
global characteristics of DLOs well. However, the predictions
often deviate slightly from the actual states, so
they are not sufficiently accurate for applications (see Fig.
5). We believe this is caused by the
feature max pooling operation, which neglects local
information that is crucial for precise estimation.
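
The two properties used in this subsection, the order invariance of max pooling and the per-sample regression loss, can be checked on toy data. The sketch below is a plain-Python illustration rather than the authors' network: random numbers stand in for the PointNet++ features $F(\mathbf{X})$, the helper names `max_pool` and `regression_loss` are assumptions, and the trailing 2 in Eq. (2) is read here as an L2-norm subscript.

```python
import math
import random

def max_pool(features):
    """Channel-wise max over N point-wise feature vectors (N x C -> C)."""
    return [max(channel) for channel in zip(*features)]

def regression_loss(y_pred, y_gt):
    """L2 norm of the difference between two M x 3 node sequences.

    Assumption: Eq. (2) is read as an (unsquared) L2 norm.
    """
    return math.sqrt(sum((a - b) ** 2
                         for p, q in zip(y_pred, y_gt)
                         for a, b in zip(p, q)))

random.seed(0)
N, C = 6, 4
# Random stand-in for the point-wise features F(X); not a real encoder output.
feats = [[random.random() for _ in range(C)] for _ in range(N)]

# Reordering the points leaves the pooled global feature unchanged.
shuffled = feats[:]
random.shuffle(shuffled)
global_feat = max_pool(feats)
global_feat_shuffled = max_pool(shuffled)

# Toy node sequences with M = 2: only the second node differs, by (0, 3, 4).
y_gt = [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0)]
y_pred = [(0.0, 0.0, 0.0), (1.0, 3.0, 4.0)]
loss = regression_loss(y_pred, y_gt)
```

Because the pooled vector keeps only one maximum per channel, the same sketch also hints at why local detail can be lost: the $N$ point-wise vectors collapse into a single $C_{out}$-dimensional summary.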
B. Point-to-Point Voting
To make up for the shortcomings of the end-to-end regression
method, we design a point-to-point voting framework
to utilize local geometry information, which is inspired
by earlier works [25], [26]. Instead of using max-pooling
layers for direct regression, this method generates point-wise
predictions $\mathbf{Y}^{pred,1}_{vot}, \mathbf{Y}^{pred,2}_{vot}, \cdots, \mathbf{Y}^{pred,N}_{vot}$, one from each
input point $\mathbf{x}_1, \mathbf{x}_2, \cdots, \mathbf{x}_N$, and then uses a point-to-point
voting scheme to get the final estimation. Specifically, we
can regress an offset vector $\mathbf{O}_{ij}$, which predicts the vector
beginning from input point $\mathbf{x}_i$ and ending at node $\mathbf{y}_j$.
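
As a concrete reading of the offset definition, the sketch below builds the offsets $\mathbf{O}_{ij} = \mathbf{y}_j - \mathbf{x}_i$ for a toy point cloud with known nodes and confirms that adding an offset back to its point recovers the node. The helper `offsets` and the toy coordinates are assumptions for illustration; how the network's predicted offsets are aggregated by voting is not covered here.

```python
def offsets(points, nodes):
    """Return O with O[i][j] = y_j - x_i for N points and M nodes (N x M x 3).

    Hypothetical helper; it uses known node positions, not network outputs.
    """
    return [[tuple(y - x for x, y in zip(xi, yj)) for yj in nodes]
            for xi in points]

# Toy data: N = 2 input points, M = 3 nodes.
X = [(0.0, 0.0, 0.0), (2.0, 0.0, 0.0)]
Y = [(0.0, 1.0, 0.0), (1.0, 1.0, 0.0), (2.0, 1.0, 0.0)]
O = offsets(X, Y)

# Each point reproduces every node when its own offsets are added back.
recovered = [[tuple(x + o for x, o in zip(xi, oij)) for oij in row]
             for xi, row in zip(X, O)]
```

Every input point therefore carries a full set of $M$ candidate node positions, which is what makes a per-point prediction followed by voting possible.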
During inference, $\mathbf{y}^{pred,i}_j$ can be calculated by adding