2 Zhiyang Lu and Ming Cheng
the field of point clouds continued to emerge. These works laid the groundwork
for the scene flow task. Building on these frameworks, many neural network
architectures suited to scene flow [4,32,23,1,17] have been proposed, which
outperform traditional optimization-based methods.
Although these methods achieve good results on non-occluded datasets, they
fail to infer the motion of occluded objects, which leads to scene flow
deviations in heavily occluded scenes such as large-scale traffic jams.
In the scene flow task, occluded points exist in the first (source) frame of the
point cloud pair. We define them as points that have no corresponding points or
patches in the second (target) frame. We further divide occluded points into two
categories: locally occluded points, which have non-occluded points in their
local neighborhoods of the source point cloud, and globally occluded points,
which have no non-occluded points in their local neighborhoods. Previous methods
compute scene flow by matching features between the two frames. This works well
for non-occluded points in the source frame: such points have matching patches
in the target frame, so their motion can be deduced from the cross-correlation
between the two point clouds. Occluded points, however, have no matching patches
in the target frame, so cross-correlation cannot recover their motion. In
contrast, humans often employ self-correlation when deducing the motion of
occluded objects in dynamic scenes. For example, setting collisions aside, we
can infer the motion of a vehicle's occluded front from its visible tail.
Therefore, the self-correlation of motion is key to solving the occlusion
problem in scene flow.
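The notion of an occluded point as one lacking a correspondence in the target frame can be made concrete with a simple nearest-neighbor distance test. The following is a hypothetical illustration of the definition, not the detection method used by any of the cited works; the function name and distance threshold are assumptions:

```python
import numpy as np

def occluded_mask(source, target, threshold=0.5):
    """Mark source points as occluded when their nearest neighbor in the
    target point cloud lies farther away than `threshold` (a stand-in for
    'no corresponding point or patch in the second frame')."""
    # Pairwise Euclidean distances between source (N, 3) and target (M, 3).
    dists = np.linalg.norm(source[:, None, :] - target[None, :, :], axis=-1)
    # A point with no sufficiently close target neighbor is flagged occluded.
    return dists.min(axis=1) > threshold
```

In practice, learned occlusion estimators replace such a fixed geometric threshold, but the criterion they approximate is the same absence of a match.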
Previously, Ouyang et al. combined scene flow estimation with occlusion
detection [13], jointly optimizing the two tasks to infer the motion of
occluded points. This approach can effectively handle small-scale local
occlusions, but it still fails on large-scale local occlusion and global
occlusion. Jiang et al. [3] designed a transformer-based global motion
aggregation (GMA) module to infer the motion of occluded pixels in optical
flow. Inspired by this, we propose GMA3D, which integrates the transformer [2]
framework into the scene flow task, exploiting the self-similarity of point
cloud features to aggregate motion features and recover the motion of occluded
points. Unfortunately, previous works consider motion features only from the
global perspective, disregarding the local consistency of motion, which may
lead to erroneous motion estimates for locally occluded points.
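The core mechanism behind GMA-style aggregation is attention computed over semantic (context) features, with motion features as the attended values, so that an occluded point borrows motion from semantically similar non-occluded points. A minimal NumPy sketch follows; the function names, single-head attention, and scaling choice are assumptions for illustration, not the actual GMA or GMA3D module:

```python
import numpy as np

def softmax(x, axis=-1):
    """Numerically stable softmax."""
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def global_motion_aggregation(context, motion):
    """Aggregate motion features by self-similarity of context features.

    context: (N, D) per-point semantic features.
    motion:  (N, C) per-point motion features.
    Returns (N, C) aggregated motion features, where each point's output is
    a similarity-weighted average of all points' motion features.
    """
    d = context.shape[-1]
    # (N, N) attention map from scaled dot-product self-similarity.
    attn = softmax(context @ context.T / np.sqrt(d), axis=-1)
    # Each row of attn sums to 1, so the output is a convex combination
    # of motion features: occluded points inherit motion from similar points.
    return attn @ motion
```

Because the attention weights form a convex combination, aggregated motion always stays within the range spanned by the input motion features, which is what makes the global aggregate a plausible motion estimate for points with no cross-frame match.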
To address these issues, we present a local-global semantic similarity map
(LGSM) module to compute a local-global semantic similarity map, and then
employ an offset aggregator (OA) to aggregate motion information based on the
resulting self-similarity matrices. For locally occluded points, we deduce
their motion from local non-occluded neighbors based on local motion
consistency. For globally occluded points, we apply the global seman-