Using Detection, Tracking and Prediction in Visual SLAM to Achieve
Real-time Semantic Mapping of Dynamic Scenarios
Xingyu Chen1, Jianru Xue1,†, Jianwu Fang1,2, Yuxin Pan1 and Nanning Zheng1
Abstract— In this paper, we propose a lightweight system,
RDS-SLAM, based on ORB-SLAM2, which can accurately esti-
mate poses and build semantic maps at object level for dynamic
scenarios in real time using only a single, commonly used Intel Core
i7 CPU. In RDS-SLAM, three major improvements, as well as
major architectural modifications, are proposed to overcome
the limitations of ORB-SLAM2. Firstly, it adopts a lightweight
object detection neural network in key frames. Secondly, an
efficient tracking and prediction mechanism is embedded into
the system to remove the feature points belonging to movable
objects in all incoming frames. Thirdly, a semantic octree map is
built by probabilistic fusion of detection and tracking results,
which enables a robot to maintain a semantic description at
object level for potential interactions in dynamic scenarios. We
evaluate RDS-SLAM on the TUM RGB-D dataset, and experimental
results show that RDS-SLAM can run with 30.3 ms per frame
in dynamic scenarios using only an Intel Core i7 CPU, and
achieves accuracy comparable to that of state-of-the-art SLAM systems which heavily rely on both Intel Core i7 CPUs and powerful GPUs.
I. INTRODUCTION
Simultaneous Localization and Mapping (SLAM) [1] is
an important technique of perception and navigation for
intelligent mobile systems, such as robots and autonomous
vehicles. Due to the low cost, high resolution, and rich
color information of cameras, visual SLAM (vSLAM) has become an important research topic in recent years. Some
excellent vSLAM systems have been established, such as
ORB-SLAM2 [2], ElasticFusion [3], and RTAB-Map [4].
However, classical vSLAM systems commonly assume
that scenes are rigid and static, and this assumption leads to
frequent failures of vSLAM systems in dynamic scenarios,
where there are movable objects, such as people and cars.
Even ORB-SLAM2 [2], one of the state-of-the-art vSLAM
systems, may frequently fail in dynamic scenarios, and can
only provide a map with incomplete descriptions. Its local-
ization accuracy is also dramatically degraded. Obviously,
these limitations are caused by movable objects in dynamic
scenarios.
To overcome the effects of movable objects in dynamic scenarios on vSLAM systems, we propose three major im-
provements for ORB-SLAM2, and implement a robust and
real-time vSLAM framework, RDS-SLAM, for mapping dy-
namic scenarios. The proposed RDS-SLAM can effectively
*This work is partially supported by NSFC Projects 61751308 and
U1713217.
1The authors are with the Institute of Artificial Intelligence and Robotics,
Xi’an Jiaotong University. Xi’an, P.R. China.
2The author is with the School of Electronic and Control Engineering,
Chang’an University. Xi’an, P.R. China.
†Corresponding author’s email: jrxue@mail.xjtu.edu.cn
Fig. 1. The framework of RDS-SLAM. The threads filled with gray are the original threads of ORB-SLAM2, and the improvements over ORB-SLAM2 are marked in red. The threads in red are the parallel improvements proposed in this paper. Additionally, a Prediction module is inserted into the Tracking thread of ORB-SLAM2.
remove the feature points belonging to movable objects, and
build a semantic octree map at object level for a complete
description of dynamic scenarios.
More specifically, the proposed improvements, as well as
major architectural modifications, are illustrated in Fig. 1.
Firstly, we adopt a 2D object detection network as a parallel
thread, which is denoted as Detection in Fig. 1, and the
technical details are presented in Sect. IV-A. Instead of
detecting in all frames as other dynamic SLAM systems do,
we run it only in key frames to obtain the 2D movable objects.
Secondly, we propose an efficient prediction mechanism,
which is denoted as Transformation and Prediction in Fig. 1.
We transform the local 2D bounding boxes to global 3D coordinates and extend the classic 2D tracking algorithm SORT [5] to global 3D coordinates to track 3D movable objects in key frames, while a constant velocity model is used to predict the object locations in all other frames (Sect. IV-B). On an Intel Core i7 CPU, the prediction mechanism takes only 5 ms per frame.
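To make the prediction step concrete, the snippet below is a minimal sketch, assuming a constant-velocity Kalman filter over a 3D object centroid in the spirit of extending SORT's motion model to global 3D coordinates; the class name, state layout, and noise values are illustrative assumptions and not the RDS-SLAM implementation.

```python
import numpy as np

class ConstantVelocity3DTrack:
    """Minimal constant-velocity Kalman filter over a 3D object centroid.

    Illustrative sketch only: the state is [x, y, z, vx, vy, vz]; the noise
    values are placeholder assumptions, not the values used in RDS-SLAM.
    """

    def __init__(self, xyz, dt=1.0 / 30.0):
        self.x = np.zeros(6)                # state: position and velocity
        self.x[:3] = xyz
        self.P = np.eye(6)                  # state covariance
        self.F = np.eye(6)                  # constant-velocity transition
        self.F[:3, 3:] = np.eye(3) * dt
        self.H = np.zeros((3, 6))           # only the position is observed
        self.H[:, :3] = np.eye(3)
        self.Q = np.eye(6) * 1e-2           # process noise (assumed)
        self.R = np.eye(3) * 1e-1           # measurement noise (assumed)

    def predict(self):
        """Propagate the track by one frame (used between key frames)."""
        self.x = self.F @ self.x
        self.P = self.F @ self.P @ self.F.T + self.Q
        return self.x[:3]                   # predicted 3D centroid

    def update(self, xyz):
        """Correct with a detection back-projected to global 3D coordinates
        (only available at key frames)."""
        z = np.asarray(xyz, dtype=float)
        y = z - self.H @ self.x             # innovation
        S = self.H @ self.P @ self.H.T + self.R
        K = self.P @ self.H.T @ np.linalg.inv(S)
        self.x = self.x + K @ y
        self.P = (np.eye(6) - K @ self.H) @ self.P
```

At each key frame, detections transformed into the global coordinate frame would be associated with existing tracks and fed to update(); for every other frame only predict() is called, which is consistent with the low per-frame cost reported above.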
Finally, we build Semantic Octree Map Creation as a
parallel thread shown in Fig. 1 for both removing dynamic
objects and creating a complete semantic map at object
level. Instead of raising the occupancy probability threshold of the octree map, as other state-of-the-art systems do in dynamic scenarios, we use semantic information to distinguish whether the point clouds are movable or not, and then insert them into the octree map with different probabilities to remove the movable objects (Sect. IV-C).
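The following is a minimal sketch of this fusion idea, assuming a log-odds occupancy representation such as the one used by OctoMap; the voxel-grid stand-in, probability values, and method names are illustrative assumptions rather than the actual implementation.

```python
import math
from collections import defaultdict

def logit(p):
    return math.log(p / (1.0 - p))

class SemanticOctreeSketch:
    """Toy voxel grid standing in for an octree; all values and thresholds
    below are illustrative assumptions, not RDS-SLAM's settings."""

    def __init__(self, resolution=0.05):
        self.resolution = resolution
        self.log_odds = defaultdict(float)   # voxel key -> occupancy log-odds
        self.p_hit_static = 0.7              # usual hit probability
        self.p_hit_movable = 0.4             # < 0.5, so movable hits lower occupancy
        self.occupied_threshold = logit(0.5)

    def _key(self, point):
        return tuple(int(c // self.resolution) for c in point)

    def insert_point(self, point, is_movable):
        """Fuse one 3D point with a semantic-dependent hit probability."""
        p_hit = self.p_hit_movable if is_movable else self.p_hit_static
        self.log_odds[self._key(point)] += logit(p_hit)

    def is_occupied(self, point):
        return self.log_odds[self._key(point)] > self.occupied_threshold
```

Because movable points are inserted with a hit probability below 0.5, their voxels are pushed toward free space over time, while static points accumulate occupancy as usual.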
The rest of the paper is structured as follows: Section II
discusses the related works. Section III presents an overview
of RDS-SLAM. Three major improvements are detailed in
Section IV, which are followed by experimental results in
Section V. Finally, the paper is concluded with discussions and directions for future work in Section VI.
II. RELATED WORKS
There are many excellent vSLAM systems in the literature for mapping scenarios with RGB-D data [2], [3], [4], and a comprehensive survey can be found in [1]. However, they often fail in dynamic scenarios, which has motivated much research in recent years. In this section, we present a brief survey of these research efforts.
The core idea of improving vSLAM systems is to distin-
guish the dynamic parts of scenarios. For this purpose, it is
straightforward to introduce segmentation [6], [7], [8], [9].
McCormac et al. [6] estimated poses and created a dense
map through ElasticFusion [3], then built a single-frame
map through a convolutional neural network (CNN) and
finally merged the two maps to generate a dense semantic map with higher classification accuracy than a single-frame CNN. However, it cannot handle dynamic scenarios.
StaticFusion [7], Co-Fusion [8], and MaskFusion [9] have been proposed to deal with dynamic scenarios. They focus on using segmentation information to directly build an accurate dense map that distinguishes dynamic objects from the static scenario. However, these works have relatively low localization accuracy and require intensive computation.
Among many vSLAM works, ORB-SLAM2 [2] is widely
accepted as the best open source vSLAM system with high
localization accuracy and map reusability, but it also fails
in dynamic scenarios. The situation has been significantly
improved by DynaSLAM [10] and DS-SLAM [11], which
are two important variants of ORB-SLAM2. To remove the
ORB [12] feature points of dynamic objects, DynaSLAM
serially added Mask R-CNN [13], Low-Cost Tracking, and Multi-view Geometry modules to the front of ORB-SLAM2 before extracting the ORB feature points. However, since these three modules are added serially to the front of ORB-SLAM2, its average time per frame using a CPU and a GPU is about 500 ms. Similar to DynaSLAM, DS-SLAM serially added a Moving Consistency Check module and a Remove Outliers module to the Tracking thread of ORB-SLAM2. Different from DynaSLAM, DS-SLAM added a SegNet [14] thread and a Dense Map Creation thread to ORB-SLAM2 in parallel, and it combines the results of the parallel SegNet thread with the serial Moving Consistency Check module in each frame. Even with such a parallel architecture, its average time for processing a frame using a CPU and a GPU is about 59.4 ms. In summary, neither DynaSLAM [10] nor DS-SLAM [11] can work in real time without a GPU, and thus neither can meet the requirements of lightweight applications.
Motivated by the aforementioned works, we propose the real-time RDS-SLAM, which can build a complete semantic octree map of dynamic scenarios without using a GPU, while achieving accuracy competitive with DynaSLAM [10] and DS-SLAM [11].
III. SYSTEM OVERVIEW
We propose a real-time and lightweight RGB-D vSLAM
system for dynamic scenarios based on ORB-SLAM2 [2]. We
use object detection and object tracking only in key frames,
and use low-cost prediction in other frames to reduce the
computational cost, as shown in Fig. 1.
In addition to Tracking, Local Mapping and Loop Closing, the three parallel threads of the original ORB-SLAM2, we add three parallel threads to the system: Detection, Transformation and Semantic Octree Map Creation. We also insert a new module named Prediction into the Tracking thread.
After the processing of Extract ORB, Pose Prediction or Relocalization, and Track Local Map in the Tracking thread, ORB-SLAM2 realizes a visual odometry that can estimate the pose transformation between frames in a static scenario. In order to build the map and optimize the pose, ORB-SLAM2 uses the New KeyFrame Decision module to select key frames from the visual sequence and put them into the Local Mapping thread. The New KeyFrame Decision mechanism works such that when the scenario changes, a key frame is inserted once a certain time interval has passed, and when the scenario changes quickly, a key frame is inserted immediately regardless of the time interval.
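As a rough illustration of this behaviour (a simplification, not ORB-SLAM2's exact criteria), the decision can be sketched as follows; the thresholds are assumed values.

```python
def need_new_keyframe(frames_since_kf, tracked_ratio,
                      min_interval=20, weak_ratio=0.9, critical_ratio=0.25):
    """Rough sketch of a keyframe decision rule in the spirit of ORB-SLAM2.

    tracked_ratio is the fraction of the reference key frame's map points
    still tracked in the current frame; all thresholds are assumed values.
    """
    # Scenario changing quickly: very few reference points survive,
    # so a key frame is inserted immediately.
    if tracked_ratio < critical_ratio:
        return True
    # Scenario changed moderately: insert a key frame only after a
    # minimum number of frames has passed since the last one.
    if tracked_ratio < weak_ratio and frames_since_kf >= min_interval:
        return True
    return False
```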
In RDS-SLAM, we believe that detecting objects in all frames without selection costs too many computational resources, because the scenario does not always change during the localization and mapping of a robot. In other words, the Detection thread should run only when the scenario changes, and the detection frequency should increase when the scenario changes quickly. We therefore reuse the New KeyFrame Decision mechanism as the trigger for Detection, which realizes adaptive allocation of computational resources to the Detection thread by running Detection only on key frames instead of all frames, as sketched below.
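A minimal sketch of this idea, assuming a simple producer–consumer design: key frames selected by New KeyFrame Decision are pushed into a queue, and a worker thread runs the detector only on them. The class and the detect_fn placeholder are illustrative assumptions, not the actual RDS-SLAM code.

```python
import queue
import threading

class KeyFrameDetectionThread:
    """Sketch of a Detection thread that consumes key frames from a queue;
    detect_fn is a placeholder for the lightweight 2D object detector."""

    def __init__(self, detect_fn):
        self.detect_fn = detect_fn
        self.keyframes = queue.Queue()
        self.results = {}                    # keyframe id -> 2D bounding boxes
        self.worker = threading.Thread(target=self._run, daemon=True)
        self.worker.start()

    def submit_keyframe(self, kf_id, image):
        """Called by the Tracking thread whenever New KeyFrame Decision fires."""
        self.keyframes.put((kf_id, image))

    def _run(self):
        while True:
            kf_id, image = self.keyframes.get()
            # Detection frequency automatically follows keyframe frequency:
            # fast scene changes produce more key frames, hence more detections.
            self.results[kf_id] = self.detect_fn(image)
```

Since fast scene changes produce key frames more often, the detection frequency adapts automatically without any extra scheduling logic.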
After New KeyFrame Decision puts the key frames into the Local Mapping thread, ORB-SLAM2 checks the recently added feature points on the map (map points) with the Recent MapPoints Culling module, as shown in Fig. 1. This module requires that a newly constructed map point be observed by the next three key frames. ORB-SLAM2 effectively eliminates incorrect map points through Recent MapPoints Culling, but it cannot effectively remove the map points on movable objects.
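The culling rule described above can be sketched roughly as follows; the data layout and the strict three-key-frame window are simplifying assumptions.

```python
def cull_recent_map_points(recent_points, current_kf_id, window=3):
    """Simplified sketch of Recent MapPoints Culling.

    Each point records the key frame that created it and the key frames that
    have observed it since. A point created `window` key frames ago that was
    not observed in each of those key frames is dropped.
    """
    kept = []
    for pt in recent_points:
        age = current_kf_id - pt["first_kf_id"]
        if age < window:
            kept.append(pt)                      # still on probation
        elif len(pt["observed_kf_ids"]) >= window:
            kept.append(pt)                      # observed by the following key frames
        # otherwise the point is dropped from the map
    return kept
```

RDS-SLAM exploits exactly this window: map points on movable objects stop being observed once the detection and prediction results are applied, so they are culled, as described next.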
In contrast, in RDS-SLAM, the map points on movable objects of the latest key frame are only temporarily built into the map. After RDS-SLAM detects objects in the latest key frame using the object detection network and propagates the results to each of the following frames, the map points on movable objects built from the latest key frame will no longer be observed in the next key frame. Thus RDS-SLAM