efficiently combines several deep learning techniques, including 2D/3D object detection, semantic segmentation, and single-view depth estimation. We first apply deep monocular depth estimation to recover the depth of potentially dynamic features. Then, we apply MOT based on 2D detection, which performs the high-level association first and makes the low-level association of feature points simpler and more robust, even between non-consecutive frames. This differs from existing methods, which track objects according to their associated dynamic features, as illustrated in Fig. 1. Furthermore, the objects tracked by MOT with associated 3D detections are initialized robustly with accurate poses and shapes, and can be quickly optimized by our proposed object bundle adjustment; existing methods instead create initial states from associated 3D features, whose performance is easily affected by outliers. Experiments are conducted on the KITTI [15] dataset for both odometry and MOT to show the effectiveness of our method compared to previous ones.
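For dynamic feature points, multi-view triangulation is invalid, so the predicted depth map is what lifts their pixel observations into 3D. The following is a minimal sketch of this back-projection step, assuming a pinhole camera model; the intrinsics are illustrative KITTI-like values, and `backproject` is a hypothetical helper, not our actual implementation.

```python
import numpy as np

# Illustrative KITTI-like pinhole intrinsics (fx, fy, cx, cy).
K = np.array([[718.856, 0.0, 607.1928],
              [0.0, 718.856, 185.2157],
              [0.0, 0.0, 1.0]])

def backproject(pixels_uv, depth_map, K):
    """Lift pixel coordinates (N, 2) to camera-frame 3D points (N, 3)."""
    u = np.round(pixels_uv[:, 0]).astype(int)
    v = np.round(pixels_uv[:, 1]).astype(int)
    z = depth_map[v, u]                      # predicted metric depth per pixel
    x = (pixels_uv[:, 0] - K[0, 2]) * z / K[0, 0]
    y = (pixels_uv[:, 1] - K[1, 2]) * z / K[1, 1]
    return np.stack([x, y, z], axis=1)

# Toy usage: a constant 10 m depth map and two feature locations.
depth = np.full((370, 1226), 10.0)
pts3d = backproject(np.array([[600.0, 180.0], [300.0, 200.0]]), depth, K)
```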
In summary, our MOTSLAM resolves the ambiguity of dynamic 3D structure under the monocular configuration with single-view depth estimation. Meanwhile, MOTSLAM obtains more accurate 6-DoF poses and shapes of surrounding objects than existing non-monocular methods by using MOT with 3D object detection. Our main contributions are as follows.
• The first visual SLAM system that simultaneously tracks surrounding dynamic objects in 6-DoF with only monocular frames as input, without any priors on motion or objects.
• The proposed method performs the high-level association of objects first, assisted by MOT, before the low-level association of features, giving the low-level association better performance and robustness (a minimal sketch follows this list).
• Poses and shapes of objects are accurately initialized from the 3D object detections associated by MOT and are refined with a backend object bundle adjustment.
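As a concrete illustration of the second contribution, the sketch below assigns each feature point to the MOT track whose 2D box contains it and leaves the rest as static background; the function name and data layout are assumptions made for illustration, not the interfaces of our system.

```python
import numpy as np

def associate_features(feat_uv, track_ids, track_boxes):
    """feat_uv: (N, 2) pixel coords; track_boxes: (M, 4) as [x1, y1, x2, y2].
    Returns an (N,) array of track ids, where -1 means static background."""
    assign = np.full(len(feat_uv), -1, dtype=int)
    for tid, (x1, y1, x2, y2) in zip(track_ids, track_boxes):
        # A feature inside a tracked 2D box inherits that object's identity,
        # so no frame-to-frame feature matching is needed for association.
        inside = ((feat_uv[:, 0] >= x1) & (feat_uv[:, 0] <= x2) &
                  (feat_uv[:, 1] >= y1) & (feat_uv[:, 1] <= y2))
        assign[inside] = tid
    return assign

feats = np.array([[50.0, 60.0], [400.0, 200.0]])
boxes = np.array([[380.0, 150.0, 450.0, 260.0]])
ids = associate_features(feats, [7], boxes)   # -> [-1, 7]
```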
In the rest of this paper, we first introduce existing research related to dynamic visual odometry / SLAM and the deep learning techniques it employs in Sec. II. The proposed method is then explained in detail in Sec. III. Comprehensive experiments comparing our method with previous ones are presented in Sec. IV, and we conclude our work in Sec. V.
II. RELATED WORK
A. Object-aware SLAM
Recent SLAM systems have made efforts to recognize the existence of objects in the environment instead of representing it only as sparse point clouds. DS-SLAM [4], DynamicSLAM [5], DynamicDSO [6], and DynaSLAM [7] treat dynamic objects as outliers and remove them using object detection or semantic segmentation. Existing methods that simultaneously track surrounding objects usually take stereo or RGB-D sequences as input to acquire 3D features. SLAM++ [16] uses RGB-D input, detects objects with a 3D object detector, and builds an object graph refined by pose-graph optimization, making use of the mapping results generated by SLAM. Both Maskfusion [11] and Mid-fusion [17] also take RGB-D input, and recognize and track multiple arbitrary dynamic objects with fused semantic segmentation.
DynSLAM [9] is a dynamic stereo visual SLAM system that simultaneously and densely reconstructs the static background, moving objects, and potentially moving but currently stationary objects, each separately. Li et al. [10] propose a stereo system specifically designed for vehicles that adopts a driving kinematics model and produces the poses and velocities of moving cars during visual odometry.
ClusterVO [14] and DynaSLAM II [18] are state-of-the-art stereo-camera-based sparse visual odometry systems that can track dynamic objects. ClusterVO [14] performs a clustering algorithm to aggregate 3D features into objects, while DynaSLAM II [18] classifies features via semantic segmentation, similar to ours. Compared with other methods, they require neither dense mapping of the stereo frame pair nor any assumptions about objects. However, their 3D reconstruction of objects depends heavily on the associated low-level 3D features. Moreover, they initialize object poses with an identity rotation placed at the center of mass of the classified features, which is not always a suitable starting point for bundle adjustment.
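To make this difference concrete, the sketch below contrasts the two initializations: a centroid-based initialization fixes only the translation and leaves the rotation as identity, whereas a 3D detection directly supplies the object center and heading. All names and values are illustrative assumptions, not code from any of the systems discussed.

```python
import numpy as np

def pose_from_centroid(points_3d):
    """Identity rotation with translation at the feature centroid."""
    T = np.eye(4)
    T[:3, 3] = points_3d.mean(axis=0)
    return T

def pose_from_detection(center, yaw):
    """SE(3) pose from a 3D detection: object center plus heading yaw.
    The yaw axis here is a convention chosen for illustration."""
    c, s = np.cos(yaw), np.sin(yaw)
    T = np.eye(4)
    T[:3, :3] = np.array([[c, -s, 0.0],
                          [s,  c, 0.0],
                          [0.0, 0.0, 1.0]])
    T[:3, 3] = center
    return T

pts = np.array([[4.9, 1.1, 12.2], [5.3, 0.8, 12.6], [5.1, 1.0, 12.0]])
T_init_centroid = pose_from_centroid(pts)            # rotation stays unknown
T_init_detect = pose_from_detection(np.array([5.1, 1.0, 12.3]), yaw=0.35)
```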
On the other hand, our method performs the high-level object association first, using MOT with robust 3D object detections as initial states. This loosens the heavy dependency of object states on low-level features: associated low-level features help optimize the object states, while object reconstruction remains robust even when the features are of poor quality or low quantity.
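Concretely, a generic object bundle adjustment of this kind (a sketch of the common formulation, not necessarily the exact cost used in our system) jointly refines the camera poses, object poses, and object points by minimizing the reprojection error of object features:

$$\min_{\{T_{cw}^{i}\},\,\{T_{wo}^{i}\},\,\{p_{j}^{o}\}} \sum_{i,j} \rho\left( \left\| \pi\!\left( T_{cw}^{i} \, T_{wo}^{i} \, p_{j}^{o} \right) - u_{ij} \right\|_{\Sigma}^{2} \right)$$

where $T_{cw}^{i}$ is the camera pose at frame $i$, $T_{wo}^{i}$ the object pose in the world at frame $i$, $p_{j}^{o}$ the $j$-th object point expressed in the object frame, $\pi(\cdot)$ the pinhole projection, $u_{ij}$ the corresponding 2D observation, and $\rho$ a robust kernel.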
CubeSLAM [8] is a monocular SLAM system that estimates the 3D bounding boxes of objects. It utilizes 2D object detection and a vanishing-point-based algorithm to estimate 3D boxes from the 2D boxes detected in each frame, and it tracks those 3D boxes with semantic segmentation. However, it is constrained by the assumption of planar car motion. Gokul et al. [13] propose a system similar to CubeSLAM that aims at vehicle tracking and has more precise prior shape and motion models.
While the above systems have shown significant progress in object-aware visual SLAM with different sensors, we notice that existing work under the monocular setup usually adopts strong priors on motion and objects to deal with dynamic objects properly, due to the ambiguity of depth. To solve this problem, our proposed system tracks dynamic objects robustly and accurately without any priors by maximizing the effectiveness of deep neural networks for monocular depth estimation and MOT.
B. Deep learning modules in SLAM
With the rapid development of deep learning, the efficiency cost of incorporating deep learning into visual SLAM has significantly decreased. Object detection, semantic segmentation, and monocular depth estimation are commonly used tools in modern dynamic SLAM systems.
To extract high-level information about objects, object detection and semantic segmentation are widely adopted in object-aware visual SLAM systems. In particular, Segnet [19]