are movable or not, and then insert them into the octree map with different probabilities to remove the movable objects (Sect. IV-C).
The rest of the paper is structured as follows: Section II discusses the related works. Section III presents an overview of RDS-SLAM. Three major improvements are detailed in Section IV, followed by experimental results in Section V. Finally, Section VI concludes the paper with discussions and directions for future work.
II. RELATED WORKS
There are many excellent vSLAM systems in the literature for mapping scenarios [2], [3], [4] using RGB-D data, and a comprehensive survey can be found in [1]. However, they often fail in dynamic scenarios, which has motivated many research works in recent years. In this section, we present a brief survey of these research efforts.
The core idea for improving vSLAM systems in dynamic environments is to distinguish the dynamic parts of the scenario. For this purpose, it is straightforward to introduce segmentation [6], [7], [8], [9].
McCormac et al. [6] estimated poses and created a dense map through ElasticFusion [3], built a single-frame map through a convolutional neural network (CNN), and finally merged the two maps to generate a dense semantic map with higher classification accuracy than the single-frame CNN alone. However, this approach cannot handle dynamic scenarios. StaticFusion [7], Co-Fusion [8], and MaskFusion [9] have been proposed to deal with dynamic scenarios. They focus on using segmentation information to directly build an accurate dense map that can distinguish dynamic objects from the static scenario. However, these works have relatively low localization accuracy and rely heavily on intensive computation.
Among the many vSLAM works, ORB-SLAM2 [2] is widely accepted as the best open-source vSLAM system, with high localization accuracy and map reusability, but it also fails in dynamic scenarios. The situation has been significantly improved by DynaSLAM [10] and DS-SLAM [11], two important variants of ORB-SLAM2. To remove the ORB [12] feature points of dynamic objects, DynaSLAM serially adds Mask R-CNN [13], Low-Cost Tracking, and Multi-view Geometry modules to the front of ORB-SLAM2, before the ORB feature points are extracted. However, because these three modules are added serially to the front of ORB-SLAM2, the average processing time per frame is about 500 ms even with a CPU+GPU. Similar to DynaSLAM, DS-SLAM serially adds a Moving Consistency Check module and a Remove Outliers module to the Tracking thread of ORB-SLAM2. Unlike DynaSLAM, DS-SLAM also adds a parallel SegNet [14] thread and a Dense Map Creation thread to ORB-SLAM2, and combines the results of the parallel SegNet thread and the serial Moving Consistency Check module for each frame. Even with such a parallel architecture, its average processing time per frame is about 59.4 ms with a CPU+GPU. In summary, neither DynaSLAM [10] nor DS-SLAM [11] can work in real time without GPUs, and thus neither can meet the needs of lightweight applications.
Motivated by the aforementioned works, we propose RDS-SLAM, a real-time system that can build a complete semantic octree map of a dynamic scenario without using GPUs, while achieving accuracy competitive with DynaSLAM [10] and DS-SLAM [11].
III. SYSTEM OVERVIEW
We propose a real-time and lightweight RGB-D vSLAM system for dynamic scenarios based on ORB-SLAM2 [2]. We use object detection and object tracking only in key frames, and use low-cost prediction in the other frames to reduce the computational cost, as shown in Fig. 1.
In addition to Tracking, Local Mapping, and Loop Closing, the three parallel threads of the original ORB-SLAM2, we add three parallel threads to the system: Detection, Transformation, and Semantic Octree Map Creation. We also insert a new module named Prediction into the Tracking thread.
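To make the thread layout concrete, the following minimal C++ sketch (C++ being the language of ORB-SLAM2) launches the six parallel threads described above; all loop names are illustrative placeholders, not the actual RDS-SLAM code.

```cpp
// Minimal sketch of the six-thread layout described above.
// The loop functions are empty placeholders, not RDS-SLAM code.
#include <thread>

void TrackingLoop() {}          // original; now hosts the Prediction module
void LocalMappingLoop() {}      // original ORB-SLAM2 thread
void LoopClosingLoop() {}       // original ORB-SLAM2 thread
void DetectionLoop() {}         // added: object detection on key frames
void TransformationLoop() {}    // added by RDS-SLAM
void SemanticOctoMapLoop() {}   // added: semantic octree map creation

int main() {
    // The three original ORB-SLAM2 threads ...
    std::thread tracking(TrackingLoop);
    std::thread localMapping(LocalMappingLoop);
    std::thread loopClosing(LoopClosingLoop);
    // ... plus the three threads added by RDS-SLAM.
    std::thread detection(DetectionLoop);
    std::thread transformation(TransformationLoop);
    std::thread semanticMap(SemanticOctoMapLoop);

    tracking.join(); localMapping.join(); loopClosing.join();
    detection.join(); transformation.join(); semanticMap.join();
    return 0;
}
```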
After Extract ORB, Pose Prediction or Relocalization, and Track Local Map are processed in the Tracking thread, ORB-SLAM2 realizes a visual odometry that can estimate the pose transformation between frames in a static scenario. To build the map and optimize the poses, ORB-SLAM2 uses the New KeyFrame Decision module to select key frames from the visual sequence and pass them to the Local Mapping thread. The mechanism of New KeyFrame Decision is that when the scenario changes, a key frame is inserted once a certain time interval has elapsed, and when the scenario changes quickly, a key frame is inserted immediately regardless of the time interval.
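As a rough illustration of this rule, the sketch below encodes the two cases in a single predicate; the scene-change score and all thresholds are hypothetical stand-ins, since ORB-SLAM2's actual criteria also consider, e.g., the number of tracked points and the state of the Local Mapping thread.

```cpp
// Hypothetical predicate for the New KeyFrame Decision rule described
// above. `sceneChange` and all thresholds are illustrative stand-ins.
#include <cstdio>

bool NeedNewKeyFrame(int framesSinceLastKF, double sceneChange,
                     int minInterval = 30,        // frames between key frames
                     double changeThresh = 0.2,   // "scenario changes"
                     double fastThresh = 0.5) {   // "changes quickly"
    if (sceneChange > fastThresh)    // fast change: insert a key frame
        return true;                 // regardless of the time interval
    return sceneChange > changeThresh &&
           framesSinceLastKF >= minInterval;  // interval must be met
}

int main() {
    std::printf("%d\n", NeedNewKeyFrame(5, 0.6));  // 1: inserted immediately
    std::printf("%d\n", NeedNewKeyFrame(5, 0.3));  // 0: interval not met yet
    return 0;
}
```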
In RDS-SLAM, we believe that detecting objects in all frames without selection would consume too many computational resources, because the scenario does not always change during the localization and mapping of robots. In other words, the Detection thread should run only when the scenario changes, and the detection frequency should increase when the scenario changes quickly. We can therefore reuse the mechanism of New KeyFrame Decision as the trigger for Detection, realizing adaptive allocation of computational resources for the Detection thread by running Detection only on key frames instead of all frames.
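A minimal sketch of how such an adaptive scheme can be wired, assuming a simple producer-consumer queue: New KeyFrame Decision enqueues only key frames, so the Detection thread automatically runs more often when the scene changes quickly. The class and its members are our own illustration, not the authors' implementation.

```cpp
// Illustrative producer-consumer wiring (not the authors' code): the
// Detection thread consumes a queue that is fed only with key frames,
// so its workload adapts to how often key frames are inserted.
#include <condition_variable>
#include <mutex>
#include <queue>
#include <thread>

struct KeyFrame {};  // stand-in for an ORB-SLAM2 key frame

class DetectionThread {
public:
    // Called from New KeyFrame Decision for key frames only.
    void Enqueue(KeyFrame* kf) {
        std::lock_guard<std::mutex> lock(mtx_);
        queue_.push(kf);
        cv_.notify_one();
    }
    // Main loop: the detector runs exactly once per key frame.
    void Run() {
        for (;;) {
            std::unique_lock<std::mutex> lock(mtx_);
            cv_.wait(lock, [this] { return !queue_.empty(); });
            KeyFrame* kf = queue_.front();
            queue_.pop();
            lock.unlock();
            if (kf == nullptr) break;  // sentinel: shut down
            Detect(kf);
        }
    }
private:
    void Detect(KeyFrame*) { /* run the object detector on the CPU */ }
    std::queue<KeyFrame*> queue_;
    std::mutex mtx_;
    std::condition_variable cv_;
};

int main() {
    DetectionThread det;
    std::thread worker(&DetectionThread::Run, &det);
    KeyFrame kf1, kf2;
    det.Enqueue(&kf1);    // only key frames reach the detector
    det.Enqueue(&kf2);
    det.Enqueue(nullptr); // stop the worker
    worker.join();
    return 0;
}
```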
After New KeyFrame Decision puts the key frames into the Local Mapping thread, ORB-SLAM2 checks the recently added feature points on the map (map points) through the Recent MapPoints Culling module, as shown in Fig. 1. This module requires that once a map point is constructed, it must be observed by the next three key frames. ORB-SLAM2 effectively eliminates incorrect map points through Recent MapPoints Culling, but it cannot effectively remove the map points on movable objects.
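The sketch below illustrates the culling rule just described, with illustrative field names of our own choosing; the comment inside the loop points at the gap RDS-SLAM targets: a movable object that stays visible keeps being observed and thus survives this test.

```cpp
// Sketch of the Recent MapPoints Culling rule: a new map point must be
// observed by the next three key frames or it is marked bad. Field
// names are illustrative assumptions.
#include <cstdio>
#include <vector>

struct MapPoint {
    int firstKFId;     // key frame in which the point was created
    int observations;  // key frames that have observed it since then
    bool bad = false;
};

// Called once per new key frame with id `currentKFId`.
void CullRecentMapPoints(std::vector<MapPoint>& recent, int currentKFId) {
    for (MapPoint& mp : recent) {
        int age = currentKFId - mp.firstKFId;
        // Note: a point on a movable object that is still visible keeps
        // collecting observations and therefore passes this test, which
        // is exactly the limitation RDS-SLAM addresses below.
        if (age >= 3 && mp.observations < 3)
            mp.bad = true;
    }
}

int main() {
    std::vector<MapPoint> recent = {{10, 3}, {10, 1}};
    CullRecentMapPoints(recent, 13);
    for (const MapPoint& mp : recent)
        std::printf("observations=%d bad=%d\n", mp.observations, mp.bad);
    return 0;
}
```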
In contrast, in RDS-SLAM, the map points on movable objects in the latest key frame are temporarily built into the map. After RDS-SLAM detects objects in the latest key frame with an object detection network and propagates the results to each of the subsequent frames, the map points on movable objects built from the latest key frame will no longer be observed in the next key frame. Thus RDS-SLAM