are movable or not, and then insert them into the octree map with different probabilities to remove the movable objects (Sect. IV-C).
The rest of the paper is structured as follows: Section II discusses the related works. Section III presents an overview of RDS-SLAM. Three major improvements are detailed in Section IV, followed by experimental results in Section V. Finally, Section VI concludes the paper with discussions and directions for future work.
II. RELATED WORKS
There are many excellent vSLAM systems in the literature for mapping scenarios [2], [3], [4] using RGB-D data, and a comprehensive survey can be found in [1]. However, they often fail in dynamic scenarios, which has motivated many research works in recent years. In this section, we present a brief survey of these research efforts.
The core idea for improving vSLAM systems in dynamic environments is to distinguish the dynamic parts of the scenario. For this purpose, it is straightforward to introduce segmentation [6], [7], [8], [9].
McCormac et al. [6] estimated poses and created a dense map through ElasticFusion [3], built a single-frame map through a convolutional neural network (CNN), and finally merged the two maps to generate a dense semantic map with higher classification accuracy than the single-frame CNN alone. However, this approach cannot handle dynamic scenarios. StaticFusion [7], Co-Fusion [8], and MaskFusion [9] have been proposed to deal with dynamic scenarios. They focus on using segmentation information to directly build an accurate dense map that can distinguish dynamic objects from the static scenario. However, these works have relatively low localization accuracy and rely heavily on intensive computation.
Among the many vSLAM works, ORB-SLAM2 [2] is widely accepted as the best open-source vSLAM system, with high localization accuracy and map reusability, but it also fails in dynamic scenarios. The situation has been significantly improved by DynaSLAM [10] and DS-SLAM [11], two important variants of ORB-SLAM2. To remove the ORB [12] feature points of dynamic objects, DynaSLAM serially adds Mask R-CNN [13], Low-Cost Tracking, and Multi-view Geometry modules to the front of ORB-SLAM2, before the ORB feature points are extracted. However, because these three modules are added serially to the front of ORB-SLAM2, the average processing time per frame is about 500 ms even with a CPU+GPU. Similar to DynaSLAM, DS-SLAM serially adds a Moving Consistency Check module and a Remove Outliers module to the Tracking thread of ORB-SLAM2. Unlike DynaSLAM, DS-SLAM also adds a parallel SegNet [14] thread and a Dense Map Creation thread to ORB-SLAM2, and combines the results of the parallel SegNet thread and the serial Moving Consistency Check module for each frame. Even with such a parallel architecture, its average processing time per frame is about 59.4 ms with a CPU+GPU. In summary, neither DynaSLAM [10] nor DS-SLAM [11] can work in real time without GPUs, and thus neither can meet the needs of lightweight applications.
Motivated by the aforementioned works, we propose RDS-SLAM, a real-time system that can build a complete semantic octree map of a dynamic scenario without using GPUs, while achieving accuracy competitive with DynaSLAM [10] and DS-SLAM [11].
III. SYSTEM OVERVIEW
We propose a real-time and lightweight RGB-D vSLAM system for dynamic scenarios based on ORB-SLAM2 [2]. We use object detection and object tracking only in key frames, and use low-cost prediction in the other frames to reduce the computational cost, as shown in Fig. 1.
In addition to Tracking, Local Mapping, and Loop Closing, the three parallel threads of the original ORB-SLAM2, we add three parallel threads to the system: Detection, Transformation, and Semantic Octree Map Creation. We also insert a new module named Prediction into the Tracking thread.
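To make the thread layout concrete, the following minimal C++ sketch (C++ being the language of ORB-SLAM2) launches the six parallel threads described above; all loop names are illustrative placeholders, not the actual RDS-SLAM code.

```cpp
// Minimal sketch of the six-thread layout described above.
// The loop functions are empty placeholders, not RDS-SLAM code.
#include <thread>

void TrackingLoop() {}          // original; now hosts the Prediction module
void LocalMappingLoop() {}      // original ORB-SLAM2 thread
void LoopClosingLoop() {}       // original ORB-SLAM2 thread
void DetectionLoop() {}         // added: object detection on key frames
void TransformationLoop() {}    // added by RDS-SLAM
void SemanticOctoMapLoop() {}   // added: semantic octree map creation

int main() {
    // The three original ORB-SLAM2 threads ...
    std::thread tracking(TrackingLoop);
    std::thread localMapping(LocalMappingLoop);
    std::thread loopClosing(LoopClosingLoop);
    // ... plus the three threads added by RDS-SLAM.
    std::thread detection(DetectionLoop);
    std::thread transformation(TransformationLoop);
    std::thread semanticMap(SemanticOctoMapLoop);

    tracking.join(); localMapping.join(); loopClosing.join();
    detection.join(); transformation.join(); semanticMap.join();
    return 0;
}
```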
After Extract ORB, Pose Prediction or Relocalization, and Track Local Map are processed in the Tracking thread, ORB-SLAM2 realizes a visual odometry that can estimate the pose transformation between frames in a static scenario. To build the map and optimize the poses, ORB-SLAM2 uses the New KeyFrame Decision module to select key frames from the visual sequence and pass them to the Local Mapping thread. The mechanism of New KeyFrame Decision is that when the scenario changes, a key frame is inserted once a certain time interval has elapsed, and when the scenario changes quickly, a key frame is inserted immediately regardless of the time interval.
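As a rough illustration of this rule, the sketch below encodes the two cases in a single predicate; the scene-change score and all thresholds are hypothetical stand-ins, since ORB-SLAM2's actual criteria also consider, e.g., the number of tracked points and the state of the Local Mapping thread.

```cpp
// Hypothetical predicate for the New KeyFrame Decision rule described
// above. `sceneChange` and all thresholds are illustrative stand-ins.
#include <cstdio>

bool NeedNewKeyFrame(int framesSinceLastKF, double sceneChange,
                     int minInterval = 30,        // frames between key frames
                     double changeThresh = 0.2,   // "scenario changes"
                     double fastThresh = 0.5) {   // "changes quickly"
    if (sceneChange > fastThresh)    // fast change: insert a key frame
        return true;                 // regardless of the time interval
    return sceneChange > changeThresh &&
           framesSinceLastKF >= minInterval;  // interval must be met
}

int main() {
    std::printf("%d\n", NeedNewKeyFrame(5, 0.6));  // 1: inserted immediately
    std::printf("%d\n", NeedNewKeyFrame(5, 0.3));  // 0: interval not met yet
    return 0;
}
```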
In RDS-SLAM, we believe that detecting objects in all frames without selection would consume too many computational resources, because the scenario does not always change during the localization and mapping of robots. In other words, the Detection thread should run only when the scenario changes, and the detection frequency should increase when the scenario changes quickly. We can therefore reuse the mechanism of New KeyFrame Decision as the trigger for Detection, realizing adaptive allocation of computational resources for the Detection thread by running Detection only on key frames instead of all frames.
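A minimal sketch of how such an adaptive scheme can be wired, assuming a simple producer-consumer queue: New KeyFrame Decision enqueues only key frames, so the Detection thread automatically runs more often when the scene changes quickly. The class and its members are our own illustration, not the authors' implementation.

```cpp
// Illustrative producer-consumer wiring (not the authors' code): the
// Detection thread consumes a queue that is fed only with key frames,
// so its workload adapts to how often key frames are inserted.
#include <condition_variable>
#include <mutex>
#include <queue>
#include <thread>

struct KeyFrame {};  // stand-in for an ORB-SLAM2 key frame

class DetectionThread {
public:
    // Called from New KeyFrame Decision for key frames only.
    void Enqueue(KeyFrame* kf) {
        std::lock_guard<std::mutex> lock(mtx_);
        queue_.push(kf);
        cv_.notify_one();
    }
    // Main loop: the detector runs exactly once per key frame.
    void Run() {
        for (;;) {
            std::unique_lock<std::mutex> lock(mtx_);
            cv_.wait(lock, [this] { return !queue_.empty(); });
            KeyFrame* kf = queue_.front();
            queue_.pop();
            lock.unlock();
            if (kf == nullptr) break;  // sentinel: shut down
            Detect(kf);
        }
    }
private:
    void Detect(KeyFrame*) { /* run the object detector on the CPU */ }
    std::queue<KeyFrame*> queue_;
    std::mutex mtx_;
    std::condition_variable cv_;
};

int main() {
    DetectionThread det;
    std::thread worker(&DetectionThread::Run, &det);
    KeyFrame kf1, kf2;
    det.Enqueue(&kf1);    // only key frames reach the detector
    det.Enqueue(&kf2);
    det.Enqueue(nullptr); // stop the worker
    worker.join();
    return 0;
}
```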
After New KeyFrame Decision puts the key frames into the Local Mapping thread, ORB-SLAM2 checks the recently added feature points on the map (map points) through the Recent MapPoints Culling module, as shown in Fig. 1. This module requires that once a map point is constructed, it must be observed by the next three key frames. ORB-SLAM2 effectively eliminates incorrect map points through Recent MapPoints Culling, but it cannot effectively remove the map points on movable objects.
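The sketch below illustrates the culling rule just described, with illustrative field names of our own choosing; the comment inside the loop points at the gap RDS-SLAM targets: a movable object that stays visible keeps being observed and thus survives this test.

```cpp
// Sketch of the Recent MapPoints Culling rule: a new map point must be
// observed by the next three key frames or it is marked bad. Field
// names are illustrative assumptions.
#include <cstdio>
#include <vector>

struct MapPoint {
    int firstKFId;     // key frame in which the point was created
    int observations;  // key frames that have observed it since then
    bool bad = false;
};

// Called once per new key frame with id `currentKFId`.
void CullRecentMapPoints(std::vector<MapPoint>& recent, int currentKFId) {
    for (MapPoint& mp : recent) {
        int age = currentKFId - mp.firstKFId;
        // Note: a point on a movable object that is still visible keeps
        // collecting observations and therefore passes this test, which
        // is exactly the limitation RDS-SLAM addresses below.
        if (age >= 3 && mp.observations < 3)
            mp.bad = true;
    }
}

int main() {
    std::vector<MapPoint> recent = {{10, 3}, {10, 1}};
    CullRecentMapPoints(recent, 13);
    for (const MapPoint& mp : recent)
        std::printf("observations=%d bad=%d\n", mp.observations, mp.bad);
    return 0;
}
```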
In contrast, in RDS-SLAM, the map points on movable objects in the latest key frame are temporarily built into the map. After RDS-SLAM detects objects in the latest key frame with an object detection network and propagates the results to each of the subsequent frames, the map points on movable objects built from the latest key frame will no longer be observed in the next key frame. Thus RDS-SLAM