2 Zhiyang Lu and Ming Cheng
the field of point clouds continued to emerge. These works laid the groundwork
for the scene flow task. Building on these frameworks, many neural network
architectures suited to scene flow [4,32,23,1,17] have been proposed, which
outperform traditional optimization-based methods.
Although these methods achieve good results on non-occluded datasets, they
fail to infer the motion of occluded objects, which leads to scene flow
deviations in heavily occluded scenes such as large-scale traffic jams.
In the scene flow task, occluded points exist in the first (source) frame of the
point cloud pair. We define them as points that have no corresponding points or
patches in the second (target) frame. We further divide occluded points into two
categories: locally occluded points, which have non-occluded points in their
local neighborhoods of the source point cloud, and globally occluded points,
which have no non-occluded points in their local neighborhoods. Previous methods
compute scene flow by matching features between the two frames. This works well
for non-occluded points in the source frame: such points have matching patches
in the target frame, so their motion can be deduced from the cross-correlation
between the two point clouds. Occluded points, however, have no matching patches
in the target frame, so cross-correlation cannot recover their motion. In
contrast, humans often employ self-correlation when deducing the motion of
occluded objects in dynamic scenes. For example, setting collisions aside, we
can infer the motion of a vehicle's occluded front from its visible tail.
Therefore, the self-correlation of motion is key to solving the occlusion
problem in scene flow.
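The notion of an occluded point as one lacking a correspondence in the target frame can be made concrete with a simple nearest-neighbor distance test. The following is a hypothetical illustration of the definition, not the detection method used by any of the cited works; the function name and distance threshold are assumptions:

```python
import numpy as np

def occluded_mask(source, target, threshold=0.5):
    """Mark source points as occluded when their nearest neighbor in the
    target point cloud lies farther away than `threshold` (a stand-in for
    'no corresponding point or patch in the second frame')."""
    # Pairwise Euclidean distances between source (N, 3) and target (M, 3).
    dists = np.linalg.norm(source[:, None, :] - target[None, :, :], axis=-1)
    # A point with no sufficiently close target neighbor is flagged occluded.
    return dists.min(axis=1) > threshold
```

In practice, learned occlusion estimators replace such a fixed geometric threshold, but the criterion they approximate is the same absence of a match.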
Previously, Ouyang et al. combined scene flow estimation with occlusion
detection [13], jointly optimizing the two tasks to infer the motion of
occluded points. This approach can effectively handle small-scale local
occlusions, but it still fails on large-scale local occlusion and global
occlusion. Jiang et al. [3] designed a transformer-based global motion
aggregation (GMA) module to infer the motion of occluded pixels in optical
flow. Inspired by this, we propose GMA3D, which integrates the transformer [2]
framework into the scene flow task, exploiting the self-similarity of point
cloud features to aggregate motion features and recover the motion of occluded
points. Unfortunately, previous works consider motion features only from the
global perspective, disregarding the local consistency of motion, which may
lead to erroneous motion estimates for locally occluded points.
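The core mechanism behind GMA-style aggregation is attention computed over semantic (context) features, with motion features as the attended values, so that an occluded point borrows motion from semantically similar non-occluded points. A minimal NumPy sketch follows; the function names, single-head attention, and scaling choice are assumptions for illustration, not the actual GMA or GMA3D module:

```python
import numpy as np

def softmax(x, axis=-1):
    """Numerically stable softmax."""
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def global_motion_aggregation(context, motion):
    """Aggregate motion features by self-similarity of context features.

    context: (N, D) per-point semantic features.
    motion:  (N, C) per-point motion features.
    Returns (N, C) aggregated motion features, where each point's output is
    a similarity-weighted average of all points' motion features.
    """
    d = context.shape[-1]
    # (N, N) attention map from scaled dot-product self-similarity.
    attn = softmax(context @ context.T / np.sqrt(d), axis=-1)
    # Each row of attn sums to 1, so the output is a convex combination
    # of motion features: occluded points inherit motion from similar points.
    return attn @ motion
```

Because the attention weights form a convex combination, aggregated motion always stays within the range spanned by the input motion features, which is what makes the global aggregate a plausible motion estimate for points with no cross-frame match.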
To address these issues, we present a local-global semantic similarity map
(LGSM) module to compute a local-global semantic similarity map, and then
employ an offset aggregator (OA) to aggregate motion information based on the
resulting self-similarity matrices. For locally occluded points, we deduce
their motion from local non-occluded neighbors based on local motion
consistency. For globally occluded points, we apply the global seman-