efficiently combines several deep learning techniques, including 2D/3D object detection, semantic segmentation, and single-view depth estimation. We first apply deep monocular depth estimation to recover the depth of potentially dynamic features. Then, we apply MOT based on 2D detection, which performs the high-level association first and makes the low-level association of feature points simpler and more robust, even between non-consecutive frames. This differs from existing methods, which track objects according to their associated dynamic features, as illustrated in Fig. 1. Furthermore, the objects tracked by MOT with associated 3D detections are initialized robustly with accurate poses and shapes, and can be quickly optimized by our proposed object bundle adjustment; existing methods instead create initial states from associated 3D features, whose performance is easily affected by outliers. Experiments are conducted on the KITTI [15] dataset for both odometry and MOT to show the effectiveness of our method compared to previous ones.
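For dynamic feature points, multi-view triangulation is invalid, so the predicted depth map is what lifts their pixel observations into 3D. The following is a minimal sketch of this back-projection step, assuming a pinhole camera model; the intrinsics are illustrative KITTI-like values, and `backproject` is a hypothetical helper, not our actual implementation.

```python
import numpy as np

# Illustrative KITTI-like pinhole intrinsics (fx, fy, cx, cy).
K = np.array([[718.856, 0.0, 607.1928],
              [0.0, 718.856, 185.2157],
              [0.0, 0.0, 1.0]])

def backproject(pixels_uv, depth_map, K):
    """Lift pixel coordinates (N, 2) to camera-frame 3D points (N, 3)."""
    u = np.round(pixels_uv[:, 0]).astype(int)
    v = np.round(pixels_uv[:, 1]).astype(int)
    z = depth_map[v, u]                      # predicted metric depth per pixel
    x = (pixels_uv[:, 0] - K[0, 2]) * z / K[0, 0]
    y = (pixels_uv[:, 1] - K[1, 2]) * z / K[1, 1]
    return np.stack([x, y, z], axis=1)

# Toy usage: a constant 10 m depth map and two feature locations.
depth = np.full((370, 1226), 10.0)
pts3d = backproject(np.array([[600.0, 180.0], [300.0, 200.0]]), depth, K)
```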
In summary, our MOTSLAM resolves the ambiguity of dynamic 3D structure under the monocular configuration with single-view depth estimation. Meanwhile, MOTSLAM obtains more accurate 6-DoF poses and shapes of surrounding objects than existing non-monocular methods by using MOT with 3D object detection. Our main contributions are as follows.
• The first visual SLAM system that simultaneously tracks surrounding dynamic objects in 6-DoF with only monocular frames as input, without any priors on motion or objects.
• The proposed method performs the high-level association of objects first, assisted by MOT, before the low-level association of features, giving the low-level association better performance and robustness (a minimal sketch follows this list).
• Poses and shapes of objects are accurately initialized from the 3D object detections associated by MOT and are refined with a backend object bundle adjustment.
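As a concrete illustration of the second contribution, the sketch below assigns each feature point to the MOT track whose 2D box contains it and leaves the rest as static background; the function name and data layout are assumptions made for illustration, not the interfaces of our system.

```python
import numpy as np

def associate_features(feat_uv, track_ids, track_boxes):
    """feat_uv: (N, 2) pixel coords; track_boxes: (M, 4) as [x1, y1, x2, y2].
    Returns an (N,) array of track ids, where -1 means static background."""
    assign = np.full(len(feat_uv), -1, dtype=int)
    for tid, (x1, y1, x2, y2) in zip(track_ids, track_boxes):
        # A feature inside a tracked 2D box inherits that object's identity,
        # so no frame-to-frame feature matching is needed for association.
        inside = ((feat_uv[:, 0] >= x1) & (feat_uv[:, 0] <= x2) &
                  (feat_uv[:, 1] >= y1) & (feat_uv[:, 1] <= y2))
        assign[inside] = tid
    return assign

feats = np.array([[50.0, 60.0], [400.0, 200.0]])
boxes = np.array([[380.0, 150.0, 450.0, 260.0]])
ids = associate_features(feats, [7], boxes)   # -> [-1, 7]
```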
In the rest of this paper, we first introduce existing research related to dynamic visual odometry / SLAM and the deep learning techniques it employs in Sec. II. The proposed method is then explained in detail in Sec. III. Comprehensive experiments comparing our method with previous ones are presented in Sec. IV, and we conclude our work in Sec. V.
II. RELATED WORK
A. Object-aware SLAM
Recent SLAM systems have made efforts to recognize the existence of objects in the environment instead of representing it only as sparse point clouds. DS-SLAM [4], DynamicSLAM [5], DynamicDSO [6], and DynaSLAM [7] treat dynamic objects as outliers and remove them using object detection or semantic segmentation. Existing methods that simultaneously track surrounding objects usually take stereo or RGB-D sequences as input to acquire 3D features. SLAM++ [16] uses RGB-D input, detects objects with a 3D object detector, and builds an object graph refined by pose-graph optimization, making use of the mapping results generated by SLAM. Both Maskfusion [11] and Mid-fusion [17] also take RGB-D input, and recognize and track multiple arbitrary dynamic objects with fused semantic segmentation.
DynSLAM [9] is a dynamic stereo visual SLAM system that simultaneously and densely reconstructs the static background, moving objects, and potentially moving but currently stationary objects, each separately. Li et al. [10] propose a stereo system specifically designed for vehicles that adopts a driving kinematics model and produces the poses and velocities of moving cars during visual odometry.
ClusterVO [14] and DynaSLAM II [18] are state-of-the-art stereo-camera-based sparse visual odometry systems that can track dynamic objects. ClusterVO [14] performs a clustering algorithm to aggregate 3D features into objects, while DynaSLAM II [18] classifies features via semantic segmentation, similar to ours. Compared with other methods, they require neither dense mapping of the stereo frame pair nor any assumptions about objects. However, their 3D reconstruction of objects depends heavily on the associated low-level 3D features. Moreover, they initialize object poses with an identity rotation placed at the center of mass of the classified features, which is not always a suitable starting point for bundle adjustment.
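To make this difference concrete, the sketch below contrasts the two initializations: a centroid-based initialization fixes only the translation and leaves the rotation as identity, whereas a 3D detection directly supplies the object center and heading. All names and values are illustrative assumptions, not code from any of the systems discussed.

```python
import numpy as np

def pose_from_centroid(points_3d):
    """Identity rotation with translation at the feature centroid."""
    T = np.eye(4)
    T[:3, 3] = points_3d.mean(axis=0)
    return T

def pose_from_detection(center, yaw):
    """SE(3) pose from a 3D detection: object center plus heading yaw.
    The yaw axis here is a convention chosen for illustration."""
    c, s = np.cos(yaw), np.sin(yaw)
    T = np.eye(4)
    T[:3, :3] = np.array([[c, -s, 0.0],
                          [s,  c, 0.0],
                          [0.0, 0.0, 1.0]])
    T[:3, 3] = center
    return T

pts = np.array([[4.9, 1.1, 12.2], [5.3, 0.8, 12.6], [5.1, 1.0, 12.0]])
T_init_centroid = pose_from_centroid(pts)            # rotation stays unknown
T_init_detect = pose_from_detection(np.array([5.1, 1.0, 12.3]), yaw=0.35)
```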
On the other hand, our method performs the high-level object association first, using MOT with robust 3D object detections as initial states. This loosens the heavy dependency of object states on low-level features: associated low-level features help optimize the object states, while object reconstruction remains robust even when the features are of poor quality or low quantity.
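Concretely, a generic object bundle adjustment of this kind (a sketch of the common formulation, not necessarily the exact cost used in our system) jointly refines the camera poses, object poses, and object points by minimizing the reprojection error of object features:

$$\min_{\{T_{cw}^{i}\},\,\{T_{wo}^{i}\},\,\{p_{j}^{o}\}} \sum_{i,j} \rho\left( \left\| \pi\!\left( T_{cw}^{i} \, T_{wo}^{i} \, p_{j}^{o} \right) - u_{ij} \right\|_{\Sigma}^{2} \right)$$

where $T_{cw}^{i}$ is the camera pose at frame $i$, $T_{wo}^{i}$ the object pose in the world at frame $i$, $p_{j}^{o}$ the $j$-th object point expressed in the object frame, $\pi(\cdot)$ the pinhole projection, $u_{ij}$ the corresponding 2D observation, and $\rho$ a robust kernel.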
CubeSLAM [8] is a monocular SLAM system that estimates the 3D bounding boxes of objects. It utilizes 2D object detection and a vanishing-point-based algorithm to estimate 3D boxes from the 2D boxes detected in each frame, and it tracks those 3D boxes with semantic segmentation. However, it is constrained by the assumption of planar car motion. Gokul et al. [13] propose a system similar to CubeSLAM that aims at vehicle tracking and has more precise prior shape and motion models.
While the above systems have shown significant progress in object-aware visual SLAM with different sensors, we notice that existing work under the monocular setup usually adopts strong priors on motion and objects to deal with dynamic objects properly, due to the ambiguity of depth. To solve this problem, our proposed system tracks dynamic objects robustly and accurately without any priors by maximizing the effectiveness of deep neural networks for monocular depth estimation and MOT.
B. Deep learning modules in SLAM
With the rapid development of deep learning, the efficiency cost of incorporating deep learning into visual SLAM has significantly decreased. Object detection, semantic segmentation, and monocular depth estimation are commonly used tools in modern dynamic SLAM systems.
To extract high-level information about objects, object detection and semantic segmentation are widely adopted in object-aware visual SLAM systems. In particular, Segnet [19]