GMA3D: Local-Global Attention Learning to
Estimate Occluded Motions of Scene Flow
Zhiyang Lu and Ming Cheng
Xiamen University
Abstract. Scene flow represents the motion of each point in a 3D point cloud. It is a vital prerequisite for many downstream tasks, such as motion segmentation and object tracking. However, there are always occluded points between two consecutive point clouds, whether caused by sparse data sampling or by real-world occlusion. In this paper, we address occlusion in scene flow by exploiting the semantic self-similarity and motion consistency of moving objects. We propose GMA3D, a module based on the transformer framework, which utilizes local and global semantic similarity to infer the motion of occluded points from the motion of locally and globally non-occluded points respectively, and then applies an offset aggregator to fuse the two. Our module is the first to apply a transformer-based architecture to the scene flow occlusion problem on point clouds. Experiments show that GMA3D can solve the occlusion problem in scene flow, especially in real scenes. We evaluated the proposed method on occluded versions of point cloud datasets and obtained state-of-the-art results on the real-scene KITTI dataset. To verify that GMA3D remains beneficial for non-occluded scene flow, we also conducted experiments on the non-occluded versions of the datasets and achieved promising performance on FlyingThings3D and KITTI. The code is available at https://anonymous.4open.science/r/GMA3D-E100.
Keywords: Scene flow estimation · Deep learning · Point clouds · Local-global attention
1 Introduction
Capturing object motion information in dynamic scenes is significant. Scene flow [5] computes the motion field between two consecutive frames of a 3D scene, yielding the direction and distance of each object's movement. Scene flow provides low-level motion information that serves many applications, such as robotic path planning, object tracking, and augmented reality. Previous methods estimate scene flow from RGB images [11,19,20,12,22,24,31,33,34], but 2D methods cannot accurately capture the 3D information of real scenes. Meanwhile, with the advances in 3D sensors, point cloud data have become easy to obtain. PointNet [6] and PointNet++ [7] pioneered feature extraction directly from raw point clouds, and deep learning networks for point clouds [27,9,6,7,21,10,16] have continued to emerge since. These works provide the necessary foundations for the scene flow task. Building on these frameworks, many neural network architectures suited to scene flow [4,32,23,1,17] have been proposed, outperforming traditional optimization-based methods. Although these methods achieve good results on non-occluded datasets, they fail to infer the motion of occluded objects, which leads to scene flow deviations in large-scale occlusion scenes, such as large-scale traffic jams.
In the scene flow task, occluded points exist in the first (source) frame of the point cloud pair. We define them as points that have no corresponding points or patches in the second (target) frame. Furthermore, we divide occluded points into two categories: points that still have non-occluded points in their local neighborhoods of the first frame, which we call locally occluded points, and globally occluded points, which have no non-occluded points in their local neighborhoods. Previous methods calculate scene flow through feature matching between the two frames. This works well for non-occluded points in the first frame, because such points have matching patches in the second frame, so their motion can be deduced through cross-correlation between the two point clouds. However, occluded points have no matching patches in the second frame, so their motion cannot be inferred by cross-correlation. In contrast, humans often employ self-correlation when deducing the motion of occluded objects in dynamic scenes. For example, leaving collisions aside, we can infer the motion of a vehicle's occluded front from its visible rear, since both belong to the same rigid object. Therefore, the self-correlation of motion is crucial for solving the occlusion problem in scene flow.
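To make this taxonomy concrete, here is a minimal sketch (our illustration, not code from the paper) that splits occluded source points into the two categories by checking whether any of a point's k nearest neighbors is visible. The occlusion mask is assumed to be given, and k = 16 is an arbitrary choice.

```python
import numpy as np
from scipy.spatial import cKDTree

def classify_occlusions(source_pts, occ_mask, k=16):
    """Split occluded source points into local vs. global occlusions.

    source_pts: (N, 3) xyz coordinates of the first frame.
    occ_mask:   (N,) bool, True where a point has no match in frame two
                (how this mask is obtained is orthogonal to this sketch).
    Returns boolean arrays (local_occ, global_occ) over all N points.
    """
    tree = cKDTree(source_pts)
    # Query k+1 neighbors because the nearest neighbor of a point is itself.
    _, nbr_idx = tree.query(source_pts, k=k + 1)
    nbr_idx = nbr_idx[:, 1:]                       # drop self
    nbr_visible = ~occ_mask[nbr_idx]               # (N, k)
    # Locally occluded: occluded, but at least one neighbor is visible.
    local_occ = occ_mask & nbr_visible.any(axis=1)
    # Globally occluded: occluded, and every neighbor is occluded too.
    global_occ = occ_mask & ~nbr_visible.any(axis=1)
    return local_occ, global_occ
```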
Previously, Ouyang et al. [13] combined scene flow estimation with occlusion detection and jointly optimized the two tasks to infer the motion of occluded points. Such a method can effectively handle small-scale local occlusions, but it still cannot resolve large-scale local occlusions or global occlusions. Jiang et al. [3] designed a transformer-based global motion aggregation (GMA) module to infer the motion of occluded pixels in optical flow. Inspired by this, we propose GMA3D, which integrates the transformer [2] framework into the scene flow task, utilizing the self-similarity of point cloud features to aggregate motion features and obtain the motion of occluded points. Unfortunately, previous works only consider motion features from the global perspective and disregard local motion consistency, which may lead to erroneous motion estimates for locally occluded points.
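To illustrate the idea GMA3D builds on, the following is a minimal single-head sketch of GMA-style aggregation adapted to point clouds, assuming per-point context and motion features. The layer sizes and the learned residual gate are our assumptions after Jiang et al. [3], not the authors' exact design.

```python
import torch
import torch.nn as nn

class GlobalMotionAggregation(nn.Module):
    """GMA-style aggregation: context self-similarity reweights motion.

    A minimal sketch after Jiang et al. [3]; the dimensions and the
    single attention head are illustrative, not the paper's config.
    """
    def __init__(self, ctx_dim=128, motion_dim=128):
        super().__init__()
        self.to_q = nn.Linear(ctx_dim, ctx_dim, bias=False)
        self.to_k = nn.Linear(ctx_dim, ctx_dim, bias=False)
        self.to_v = nn.Linear(motion_dim, motion_dim, bias=False)
        self.alpha = nn.Parameter(torch.zeros(1))  # learned residual gate

    def forward(self, context, motion):
        # context: (N, ctx_dim) per-point context features of frame one
        # motion:  (N, motion_dim) per-point motion features
        q, k = self.to_q(context), self.to_k(context)
        # Attention weights come from context similarity only.
        attn = torch.softmax(q @ k.t() / q.shape[-1] ** 0.5, dim=-1)  # (N, N)
        aggregated = attn @ self.to_v(motion)
        # Residual connection keeps the original motion estimate dominant
        # early in training.
        return motion + self.alpha * aggregated
```

The effect is that occluded points, whose own motion features are unreliable, borrow motion from visible points with similar context features.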
To address these issues, we present a local-global semantic similarity map (LGSM) module to compute the local and global semantic similarity maps, and then employ an offset aggregator (OA) to aggregate motion information based on these self-similarity matrices. For locally occluded points, we deduce their motion from their local non-occluded neighbors based on local motion consistency. For globally occluded points, we apply global semantic features to aggregate motion features from non-occluded points.
[Fig. 1: pipeline diagram. Source and target frames pass through feature encoders and point-voxel neighbors into a truncated correlation matrix; a context encoder and a motion encoder supply context and motion features to GMA3D; the aggregated motion features are concatenated and fed to a GRU and a refine module, yielding coarse and refined scene flow.]
Fig. 1. The overall pipeline of our proposed framework. Our network is based on the successful PV-RAFT [1] architecture. The inputs of the GMA3D module are the context features and motion features of the first-frame point cloud, and the output is the motion features aggregated locally and globally. These aggregated motion features are concatenated with the context features and the original motion features, and the concatenated features are fed into a GRU for residual flow estimation, which is finally refined by the refine module.
We utilize these locally and globally aggregated motion features to augment the successful PV-RAFT [1] framework and achieve state-of-the-art results in occluded scene flow estimation.
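As one plausible reading of how the LGSM module and the offset aggregator fit together (the layer sizes, the KNN neighborhood, and the MLP form of the offset aggregator are all assumptions; consult the released code for the actual design):

```python
import torch
import torch.nn as nn

class GMA3DSketch(nn.Module):
    """Local-global motion aggregation, loosely following the LGSM + OA text.

    Hypothetical layer sizes; `knn_idx` (N, k) is assumed precomputed
    from source-frame coordinates.
    """
    def __init__(self, dim=128):
        super().__init__()
        self.to_q = nn.Linear(dim, dim, bias=False)
        self.to_k = nn.Linear(dim, dim, bias=False)
        self.to_v = nn.Linear(dim, dim, bias=False)
        # Offset aggregator: one plausible instantiation that fuses the
        # original motion with the locally and globally aggregated motion.
        self.offset = nn.Sequential(nn.Linear(3 * dim, dim), nn.ReLU(),
                                    nn.Linear(dim, dim))

    def forward(self, context, motion, knn_idx):
        q, k, v = self.to_q(context), self.to_k(context), self.to_v(motion)
        scale = q.shape[-1] ** 0.5
        # Global branch: attention over all points (semantic self-similarity).
        g_attn = torch.softmax(q @ k.t() / scale, dim=-1)       # (N, N)
        g_motion = g_attn @ v                                   # (N, dim)
        # Local branch: attention restricted to each point's k neighbors,
        # encoding local motion consistency.
        k_nbr, v_nbr = k[knn_idx], v[knn_idx]                   # (N, k, dim)
        l_logits = (k_nbr * q.unsqueeze(1)).sum(-1) / scale     # (N, k)
        l_attn = torch.softmax(l_logits, dim=-1)
        l_motion = (l_attn.unsqueeze(-1) * v_nbr).sum(dim=1)    # (N, dim)
        # The offset aggregator fuses both aggregated features with the
        # original motion features.
        fused = self.offset(torch.cat([motion, l_motion, g_motion], dim=-1))
        return motion + fused
```

In the full pipeline of Fig. 1, the output of such a module would then be concatenated with the context and original motion features and passed to the GRU for residual flow estimation.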
The key contributions of our paper are as follows. We propose GMA3D, a transformer-based framework that addresses motion occlusion in scene flow. Within it, we design the LGSM module to leverage the self-consistency of motion information from both local and global perspectives, and we apply the offset aggregator to propagate the motion features of non-occluded points to occluded points according to their self-similarity. Moreover, we demonstrate that by aggregating local and global motion features, the GMA3D module reduces local motion bias, which also benefits non-occluded points. Experiments show that GMA3D attains exceptional results on scene flow tasks, on both occluded and non-occluded datasets.
2 Related Work
2.1 Motion Occlusion of Scene Flow
Few techniques address the occlusion problem in scene flow. Self-Mono-SF [47] utilizes self-supervised learning with a 3D loss function and occlusion reasoning to infer the motion of occluded points in monocular scene flow. [38] combines occlusion detection, depth, and motion boundary estimation to infer occlusions and scene flow. PWOC-3D [8] constructs a compact CNN architecture to predict scene flow from stereo image sequences and proposes a self-supervised strategy to produce an occlusion map for improving the accuracy