in sparse maps with abstract and hierarchical semantic information exists to our knowledge.
The main contribution of this paper is a global localization system for floor plan maps that integrates semantic cues.
We propose to leverage semantic cues to break the symmetry and distinguish between locations that appear similar or identical in such nondescript maps. Semantic information is commonly available in the form of furniture, machinery, and textual cues, and can be used to disambiguate spaces with similar layouts. To avoid the complexity of building a 3D map from scans and to enable easy updates to the semantic information, we present a 2D, high-level semantic map.
Thus, we present a format for abstract semantic maps, together with an editing application, and a sensor model for semantic information that complements LiDAR-based observation models.
Additionally, we provide a way to incorporate hierarchical semantic information. Unlike most modern semantic-based SLAM approaches [6][20][31][37][38], our approach does not require a GPU and can run online on an onboard computer. In our experiments, we show that our approach is able to: (i) localize in a sparse, floor plan-like map with high symmetry using semantic cues, (ii) localize long-term without updating the map, (iii) localize in previously unseen environments, and (iv) localize the robot online using an onboard computer. These claims are backed up by our experimental evaluation.
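As an illustration of what such an abstract, hierarchical 2D semantic map could look like, the sketch below encodes each object with a coarse position, a semantic label, and an optional parent category; the schema and field names are our illustrative assumptions, not the exact format proposed in this paper.

```python
from dataclasses import dataclass, field

# Illustrative sketch of a 2D, high-level semantic map entry. The schema
# (label, position, parent) is a hypothetical assumption for exposition,
# not the exact map format introduced in this paper.
@dataclass
class SemanticObject:
    label: str                     # e.g., "extinguisher", "desk", or a room-number text
    position: tuple[float, float]  # coarse 2D position in the map frame [m]
    parent: str | None = None      # hierarchical parent category, e.g., "furniture"

@dataclass
class SemanticMap:
    objects: list[SemanticObject] = field(default_factory=list)

    def query(self, label: str) -> list[SemanticObject]:
        # Matching on the parent as well lets coarse detections
        # (e.g., "furniture") still be associated with map objects.
        return [o for o in self.objects if label in (o.label, o.parent)]
```

Editing such a map then amounts to adding or removing entries, without re-running any 3D mapping pipeline.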
II. RELATED WORK
Localization in 2D maps has been thoroughly researched [5][35][36][40]. Among the most robust and commonly used approaches are the probabilistic methods for pose estimation, including Markov localization by Fox et al. [11], the extended Kalman filter (EKF) [16], and particle filters, also known as Monte Carlo localization (MCL), introduced by Dellaert et al. [8]. These works laid the foundation for localization using range sensors and cameras.
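As a minimal sketch of one MCL update cycle, assuming hypothetical motion and observation models and treating the LiDAR and semantic likelihoods as independent factors in the particle weight:

```python
import random

# Minimal sketch of one Monte Carlo localization (MCL) update step in the
# spirit of Dellaert et al. [8]. sample_motion, lidar_likelihood, and
# semantic_likelihood are hypothetical placeholders for the motion model
# and the two observation models, not an implementation from any cited work.
def mcl_update(particles, odometry, scan, detections,
               sample_motion, lidar_likelihood, semantic_likelihood):
    # 1) Predict: propagate every particle through the motion model.
    particles = [sample_motion(p, odometry) for p in particles]

    # 2) Correct: weight each particle by the product of the LiDAR and
    #    semantic likelihoods (assumed conditionally independent).
    weights = [lidar_likelihood(scan, p) * semantic_likelihood(detections, p)
               for p in particles]
    total = sum(weights) or 1.0  # guard against all-zero weights
    weights = [w / total for w in weights]

    # 3) Resample: draw a new particle set proportionally to the weights.
    return random.choices(particles, weights=weights, k=len(particles))
```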
Localization in detailed, feature-rich maps, usually constructed from range sensor data, is extensively studied [23], but few works address the problem of localization in sparse, floor plan-like maps, despite their benefits. Floor plans are readily available in many facilities, and therefore do not depend on prior mapping. As they only include information about permanent structures, they do not require frequent updates when objects, such as furniture, are relocated. Their main drawback comes from their sparse nature, as the lack of detailed geometric information can result in global localization failures when multiple rooms look alike. Another concern is the possible mismatch between the floor plan and the constructed building [3]. Li et al. [17] address the scale difference between the constructed structure and the floor plan by introducing a new state variable. Boniardi et al. [4] use cameras to infer the room layout via edge extraction and match it against the floor plan. In their evaluation, the authors initialized the pose within 10 cm and 15° of the ground-truth pose, and did not evaluate global localization. We speculate that edge extraction of walls is not sufficient in highly repetitive indoor environments where many rooms have the same size. Both approaches provide tracking capabilities, but not global localization.
Recent work on extracting semantic information with deep learning models has shown significant performance improvements in both text spotting [18][33] and object detection [2][41]. The use of textual cues for localization is surprisingly uncommon, with notable works by Cui et al. [7] and Zimmerman et al. [43]. Both works consider textual information within an MCL framework, but use different approaches to integrate it. In this paper, we extend our previous work [43] to consider semantic cues from object detection, not only textual ones.
The use of semantic information for localization and place recognition has been explored with a variety of sensors, including 2D and 3D LiDARs and RGB and RGB-D cameras. Rottmann et al. [30] use AdaBoost features from 2D LiDAR scans to infer semantic labels such as office, corridor, and kitchen. They combine the semantic information with an occupancy grid map in an MCL framework. Unlike our approach, their method requires a detailed map and manual assignment of a semantic label to every grid cell. Hendrikx et al. [13] utilize an available building information model (BIM) to extract both geometric and semantic information, and localize by matching 2D LiDAR-based features corresponding to walls, corners, and columns. While the automatic extraction of semantic and geometric maps from a BIM is promising, the approach is not suitable for global localization as it cannot overcome the challenges of a repetitively-structured environment.
Atanasov et al. [1] treat semantic objects as landmarks that include their 3D pose, semantic label, and possible shape priors. They detect objects using a deformable part model [9], and use their semantic observation model in an MCL framework. The results they report do not outperform LiDAR-based localization. An alternative representation for semantic information is a constellation model, as suggested by Ranganathan et al. [29]. In their approach, they use stereo cameras, exploiting depth information, and rely on hand-crafted features, including SIFT [19], to detect objects. Places are associated with constellations of objects, where every object has a shape and appearance distribution and a relative transformation to the base location. Unlike these two approaches, ours does not require exact poses for the semantic objects. A more flexible representation is proposed by Yi et al. [39], who use topological-semantic graphs to represent the environment. They extract topological nodes from an occupancy grid map, and characterize each node by the semantic objects in its vicinity. Their method suffers when objects are far from the camera and can easily diverge when objects cannot be detected, while our approach is more robust as it additionally relies on LiDAR observations and textual cues. Similarly to the above-mentioned approaches, we also use a sparse representation for semantic objects. However, by using deep learning to detect objects, we are able to detect a larger variety of objects with greater confidence, and to localize in previously unseen places.
Sünderhauf et al. [34] construct semantic maps from camera images by assigning a place category to each occupancy