arXiv:2210.02038v1 [cs.CV] 5 Oct 2022
MOTSLAM: MOT-assisted monocular dynamic SLAM
using single-view depth estimation
Hanwei Zhang1, Hideaki Uchiyama2, Shintaro Ono3,4 and Hiroshi Kawasaki5
Abstract— Visual SLAM systems targeting static scenes have
been developed with satisfactory accuracy and robustness.
Dynamic 3D object tracking has since become a significant
capability in visual SLAM, driven by the requirement of under-
standing dynamic surroundings in various scenarios including
autonomous driving and augmented and virtual reality. However,
performing dynamic SLAM solely with monocular images
remains a challenging problem due to the difficulty of asso-
ciating dynamic features and estimating their positions. In this
paper, we present MOTSLAM, a dynamic visual SLAM system
with the monocular configuration that tracks both poses and
bounding boxes of dynamic objects. MOTSLAM first performs
multiple object tracking (MOT) with associated 2D and
3D bounding box detections to create initial 3D objects. Then,
neural-network-based monocular depth estimation is applied
to fetch the depth of dynamic features. Finally, camera poses,
object poses, and both static, as well as dynamic map points,
are jointly optimized using a novel bundle adjustment. Our
experiments on the KITTI dataset demonstrate that our system
achieves the best performance on both camera ego-motion
estimation and object tracking among monocular dynamic SLAM methods.
I. INTRODUCTION
Simultaneous Localization And Mapping (SLAM) refers to
the problem of estimating the ego-motion of an agent while
simultaneously building a map of an unknown environment.
By using cameras as sensors and leveraging their visual
information, the above task then becomes visual SLAM. In
the past few decades, many visual SLAM methods have
achieved both high robustness and performance [1], [2], [3].
In particular, monocular-camera-based techniques are widely
used in many robotics systems because of their great advan-
tages in simplicity and cost-effectiveness. However, those
frameworks usually assume that scenes are static and do
not deal with dynamically-moving objects. Therefore, they
cannot recognize the existence of dynamic objects. In addi-
tion, their ego-motion estimation fails when dynamic objects
exist in the scene. Theoretically, 3D positions of dynamic
features cannot be computed with monocular-camera-based
motion-stereo triangulation. With increasing applications of
SLAM in various fields, such as Augmented Reality (AR),
Virtual Reality (VR), and autonomous driving, the capability
of understanding surrounding dynamic objects in the scene
has become significantly more essential.
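To make the triangulation limitation above concrete (the notation here is ours, not taken from the paper): for a calibrated camera with intrinsics K, triangulation relies on the observed point being the same 3D point in both views.

```latex
% Static point: one unknown \mathbf{X} shared by both views.
\lambda_1 \mathbf{u}_1 = K (R_1 \mathbf{X} + \mathbf{t}_1), \qquad
\lambda_2 \mathbf{u}_2 = K (R_2 \mathbf{X} + \mathbf{t}_2)
% Two views give four independent constraints on the three
% unknowns of \mathbf{X}, so \mathbf{X} is recoverable.
%
% Dynamic point: \mathbf{X}(t_1) \neq \mathbf{X}(t_2), so each view
% adds two constraints but three new unknowns; without a motion
% prior the system stays underdetermined, and the depth of the
% point cannot be recovered by motion stereo alone.
```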
1Graduate School of Information Science and Electrical Engineering,
Kyushu University, Fukuoka, Japan
2Graduate School of Science and Technology, Nara Institute of Science
and Technology, Nara, Japan
3Faculty of Engineering, Fukuoka University, Fukuoka, Japan
4Institute of Industrial Science, The University of Tokyo, Tokyo, Japan
5Faculty of Information Science and Electrical Engineering, Kyushu
University, Fukuoka, Japan
Fig. 1: Overview of our contributions. We propose a monoc-
ular dynamic SLAM, MOTSLAM, that performs MOT as
a high-level object association to achieve robust low-level
dynamic feature association. We use 3D object detection
to acquire the accurate initialization of 3D object structure,
which is optimized in the backend bundle adjustment.
Recently, combinations of visual SLAM algorithms with
deep learning techniques such as object detection and se-
mantic segmentation have brought new possibilities to deal
with dynamic objects. In the literature, one approach is to
detect possible dynamic objects and remove feature points
on them as outliers during the whole procedure [4], [5], [6],
[7]. Another one is to simultaneously track moving objects
and perform camera tracking as well as mapping [8], [9],
[10], [11], [12], [13], [14], referred to as dynamic SLAM.
Existing methods for the latter purpose mostly adopt stereo or
RGB-D configurations to directly acquire dynamic 3D struc-
ture, which cannot be achieved under the monocular setup.
Furthermore, sparse 3D features cannot provide enough clues
for estimating dynamic objects. In particular, an insufficient
number of features may degrade the accuracy of estimated
poses and shapes of dynamic objects due to the difficulty of
data association.
In this paper, we propose MOTSLAM, a dynamic visual
SLAM system with monocular frames that tracks 6-DoF camera
poses as well as 3D bounding boxes of surrounding objects
without any additional prior on motions and objects. To
tackle the aforementioned problems, we propose to use
monocular depth estimation to solve the 3D ambiguity under
the monocular setup. To obtain an accurate and robust 3D
structure of dynamic objects, we then incorporate multiple
object tracking (MOT) into our framework. Specifically, the
detected 2D and 3D bounding boxes are associated between
frames via a 2D-based MOT technique. In other words, our MOTSLAM
efficiently combines several deep learning techniques, in-
cluding 2D/3D object detection, semantic segmentation, and
single-view depth estimation. We first estimate deep monocular
depth for possibly dynamic features. Then, we apply MOT
based on 2D detection, which provides the high-level association
first and makes the low-level association of feature points
simpler and more robust, even for non-consecutive frames.
This is different from the existing methods that track objects
according to associated dynamic features, as illustrated in
Fig. 1. Furthermore, the objects tracked by MOT with associ-
ated 3D detection are initialized robustly with accurate poses
and shapes. They can be quickly optimized by using our
proposed object bundle adjustment, while existing methods
utilize associated 3D features to create initial states, whose
performance may be easily affected by outliers. Experiments
are conducted with the KITTI [15] dataset on both odometry
and MOT to show the effectiveness of our method compared
to previous methods.
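As a concrete illustration, the high-level association step can be sketched as greedy IoU matching between tracked 2D boxes and new detections. This is a minimal sketch under our own assumptions (box format, threshold, greedy matching), not the exact MOT module used in the system.

```python
# Hypothetical sketch of 2D IoU-based track-to-detection association.
# Boxes are (x1, y1, x2, y2); the 0.3 threshold is an assumption.

def iou(a, b):
    """Intersection-over-union of two axis-aligned boxes."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    return inter / (area_a + area_b - inter) if inter > 0 else 0.0

def associate(tracks, detections, thresh=0.3):
    """Greedily match tracks {id: box} to detections by best IoU.

    Returns {track_id: detection_index}; unmatched detections would
    spawn new tracks in a full MOT system.
    """
    pairs = sorted(
        ((iou(box, det), tid, j)
         for tid, box in tracks.items()
         for j, det in enumerate(detections)),
        reverse=True)
    matches, used_t, used_d = {}, set(), set()
    for score, tid, j in pairs:
        if score < thresh:
            break
        if tid in used_t or j in used_d:
            continue
        matches[tid] = j
        used_t.add(tid)
        used_d.add(j)
    return matches
```

Because objects are matched box-to-box first, feature points only need to be associated within the matched box pair, which is what makes the subsequent low-level association simple and robust.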
In summary, our MOTSLAM solves the ambiguity of
dynamic 3D structure under the monocular configuration
with single-view depth estimation. Meanwhile, MOTSLAM
obtains more accurate 6-DoF poses and shapes of surround-
ing objects compared to existing non-monocular methods by
using MOT with 3D object detection. Our main contributions
are listed as follows.
• The first visual SLAM system that can simultaneously
track surrounding 6-DoF dynamic objects with only
monocular frames as input, without any prior on motions
and objects.
• The proposed method performs the high-level associa-
tion of objects first, assisted by MOT, before the low-
level association of features, giving the low-level associa-
tion better performance and robustness.
• Poses and shapes of objects are accurately initialized
with associated 3D object detections from MOT and are
refined with the backend object bundle adjustment.
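The single-view depth recovery underlying the first point can be sketched as simple back-projection of a feature pixel with its network-predicted depth. The intrinsics and pixel values below are illustrative, not taken from the paper.

```python
import numpy as np

def back_project(u, v, depth, K):
    """Lift pixel (u, v) with predicted depth d to a 3D point in the
    camera frame: p = d * K^{-1} [u, v, 1]^T."""
    uv1 = np.array([u, v, 1.0])
    return depth * (np.linalg.inv(K) @ uv1)

# Illustrative pinhole intrinsics (fx, fy, cx, cy are assumptions).
K = np.array([[718.0,   0.0, 607.0],
              [  0.0, 718.0, 185.0],
              [  0.0,   0.0,   1.0]])

# A dynamic feature whose depth the network predicts as 12 m.
p = back_project(650.0, 200.0, 12.0, K)
```

This replaces the motion-stereo triangulation that is unavailable for dynamic points; the resulting 3D positions serve only as initial values that the bundle adjustment later refines.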
In the rest of this paper, we first introduce existing re-
search related to dynamic visual odometry/SLAM and the deep
techniques used in them in Sec. II. The proposed method
is then explained in detail in Sec. III. The comprehensive
experiments compared to previous methods are presented in
Sec. IV and we conclude our work in Sec. V.
II. RELATED WORK
A. Object-aware SLAM
Recent SLAM systems have made efforts to recognize the
existence of objects in the environment instead of only
sparse point clouds. DS-SLAM [4], DynamicSLAM [5], Dynam-
icDSO [6], and DynaSLAM [7] treat dynamic objects as
outliers and remove them by using object detection or
semantic segmentation. Existing methods that track surrounding ob-
jects simultaneously usually use stereo or RGB-D sequences
as their input to acquire 3D features. SLAM++ [16] uses
RGB-D input, detects objects with a 3D object detector,
and builds an object graph refined by pose-graph opti-
mization, making use of the mapping results generated
by SLAM. Both MaskFusion [11] and MID-Fusion [17] also
take RGB-D inputs, and recognize and track multiple
arbitrary dynamic objects with fused semantic segmentation.
DynSLAM [9] is a dynamic stereo visual SLAM sys-
tem that simultaneously and densely reconstructs the static
background, moving objects, and the potentially moving but
currently stationary objects separately. Li et al.'s work [10]
is a stereo system specifically designed for vehicle detection
by adopting a driving kinematic model, and produces poses and
velocities of moving cars during visual odometry.
ClusterVO [14] and DynaSLAM II [18] are state-of-
the-art stereo-camera-based sparse visual odometry systems
that can track dynamic objects. ClusterVO [14] performs
a clustering algorithm to aggregate 3D features as objects,
while DynaSLAM II [18] classifies features via semantic
segmentation similar to ours. Compared with other methods,
they do not need a dense mapping of the stereo frame pair
and do not make any assumptions about objects. However, their
3D reconstruction of objects depends heavily on the associated
low-level 3D features. Also, they use an identity matrix
with the center of mass of classified features as the initial
object pose, which is not always suitable for the bundle adjustment.
On the other hand, our method performs the high-level
object association first using MOT with robust 3D object
detection as initial states. This loosens the heavy dependency
between low-level features and object states. Associated low-
level features help optimize the states of objects, and object
reconstruction remains robust even when features are of low
quality or quantity.
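The difference in initialization can be illustrated with a small helper that builds an SE(3) initial object pose directly from a detected 3D box. The box fields (center and yaw) are our assumption about a typical 3D detector output, not the exact interface used in the paper.

```python
import numpy as np

def pose_from_3d_box(center, yaw):
    """Initial SE(3) object pose from a 3D detection: rotation about
    the vertical axis by the detected yaw, translation at the box
    center. A centroid-based initialization would instead use R = I
    with the mean of associated 3D features, discarding orientation."""
    c, s = np.cos(yaw), np.sin(yaw)
    T = np.eye(4)
    T[:3, :3] = np.array([[ c, 0.0,  s],
                          [0.0, 1.0, 0.0],
                          [-s, 0.0,  c]])  # yaw about the y-axis
    T[:3, 3] = center
    return T

# A detected car 10 m ahead, rotated 90 degrees (illustrative values).
T = pose_from_3d_box(np.array([2.0, 1.5, 10.0]), np.pi / 2)
```

Starting the bundle adjustment from a pose with a meaningful rotation, rather than the identity, is what lets the optimization converge quickly even when few dynamic features are available.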
CubeSLAM [8] is a monocular SLAM that estimates 3D
boxes of objects. It utilizes 2D object detection, semantic
segmentation, and a vanishing-point-based algorithm to estimate
and track 3D boxes from 2D detected boxes in frames.
Also, it tracks those 3D boxes with semantic segmentation.
However, it is constrained by the assumption of planar car
motion. Gokul et al. [13] propose a similar system to
CubeSLAM, which aims at vehicle tracking and has a more
precise prior shape and motion model.
While the above systems have shown significant progress
in object-aware visual SLAM with different sensors,
we notice that the existing literature under the monocular
setup usually adopts strong priors on motions and objects
to deal with dynamic objects properly due to the ambiguity
of depth. To solve these problems, our proposed system is
capable of tracking dynamic objects robustly and accurately
without any prior by maximizing the effectiveness of deep
neural networks on monocular depth estimation and MOT.
B. Deep learning modules in SLAM
With the rapid development of deep learning, the compu-
tational cost of incorporating deep learning into visual SLAM
has decreased significantly. Object detection, semantic
segmentation, and monocular depth estimation are commonly
used tools in modern dynamic SLAM systems.
To extract the high-level information of objects, object
detection and semantic segmentation are widely adopted in
object-aware visual SLAM systems. In particular, SegNet [19]