datasets, and evaluation metrics. We present our results in
Section V and draw conclusions on this work and future
directions in Section VI.
II. RELATED WORK
Several visual place recognition techniques have been
proposed in the literature over the last decades. One popular
approach is to compute local image descriptors from the most
informative features in images from a training set [16], [17],
building an internal feature map. The map is then coupled
with a retrieval algorithm, such as Bag-of-Words (BoW),
which searches the feature map for the best match for an
input image at runtime. While this pipeline has been heavily
explored for VPR [18], [19], it faces important challenges
such as consistently selecting the most distinctive regions of an image and efficiently storing and matching the computed descriptors over the long term. The use of convolutional neural networks (CNNs) as feature extractors [20] addresses the first issue by having the network learn what information to extract from the image, and techniques based on CNN feature extraction have recently achieved state-of-the-art VPR performance
[21], [22], [23]. However, the use of often large CNNs
for feature extraction followed by searching the developed
feature map for an optimal match makes these algorithms
highly computationally intensive and hence unsuitable for resource-constrained robotic platforms.
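For illustration only, the following sketch outlines this retrieval pipeline; the visual vocabulary, local descriptors, and similarity measure are placeholders rather than the implementation of any cited technique.

```python
# A minimal sketch of the descriptor-map retrieval pipeline described above.
# The vocabulary, descriptors and similarity measure are illustrative only.
import numpy as np

def bow_histogram(descriptors, vocabulary):
    """Quantise local descriptors (n x d) against a visual vocabulary (k x d)."""
    # Assign each descriptor to its nearest visual word.
    dists = np.linalg.norm(descriptors[:, None, :] - vocabulary[None, :, :], axis=2)
    words = dists.argmin(axis=1)
    hist = np.bincount(words, minlength=len(vocabulary)).astype(float)
    return hist / (hist.sum() + 1e-9)  # normalise so images are comparable

def match_place(query_descriptors, reference_histograms, vocabulary):
    """Return the index of the reference image with the most similar histogram."""
    q = bow_histogram(query_descriptors, vocabulary)
    similarities = reference_histograms @ q
    return int(np.argmax(similarities))
```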
A possible alternative is to treat VPR as an image classification problem, where each place is a class and each image is a training example belonging to that class.
This approach has the advantage of encoding an image and
inferring the place match in a single step, which can be de-
signed to be highly efficient. Both DrosoNet [24] and FlyNet
[25] are proposed classifier models designed for extremely
lightweight VPR, with the downside of reduced performance.
These small classifiers tend to exhibit inconsistent performance, with two models of the same architecture trained on the same environment sequence often outputting different place predictions for the same input image. [24] exploits this observation by using a large number of classifiers and combining their outputs with a handcrafted voting mechanism, aiming to cancel out the weak spots of individual units.
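The following sketch illustrates this classification formulation with a plain majority vote over several small classifiers; the classifier interface is hypothetical, and the vote shown is not the handcrafted mechanism of [24].

```python
# Sketch of VPR framed as classification, with several small classifiers
# voting on the place prediction. The predict_scores interface is hypothetical.
import numpy as np

def predict_place(image, classifiers, num_places):
    votes = np.zeros(num_places)
    for clf in classifiers:
        scores = clf.predict_scores(image)   # one score per known place
        votes[int(np.argmax(scores))] += 1   # each classifier votes for its best place
    return int(np.argmax(votes))             # place with the most votes wins
```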
Combining the estimations of multiple classifiers to im-
prove classification performance has been heavily researched
in the broader machine learning (ML) field [26], [27],
leading to powerful and popular ensemble algorithms [28],
[29]. For a combination of classifiers to perform well, it
is important that their class estimations for a given input
are, to some degree, different [30]. Several approaches are
possible to choose a set of baseline classifiers that meets this
requirement. The most obvious is to simply use different classifier algorithms as baselines. When multiple instances of the same baseline classifier are used, training the different models on different training sets will also generate variation in the predicted scores. When combining multiple identical neural
networks, it is also possible to rely on the random weight
initializations [31], [32] to achieve different estimations,
even when trained on the same training data. Regardless of the chosen approach, it is also important that the baseline
classifiers do not overfit the training data. Focusing on
efficiency, using multiple fully binary neural networks has
been shown to improve performance in image classification
tasks [33] while retaining the benefits of a low computational
footprint and power usage. Binary neural networks are particularly attractive candidates for ensembling, as each individual network is a weak but efficient classifier [34]. The
combination method itself has also been thoroughly explored
in the ML literature [26], [35]. The most common approach is to
use a standard mathematical operation, such as multiplication
or summation, on the outputs of the different classifiers. In
contrast, it is also possible to train an output classifier using
the scores of the baseline classifiers as inputs [36].
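The two families of combination rules can be sketched as follows; the score shapes and the stacked model's interface are assumed for illustration.

```python
# Sketch contrasting the two combination strategies discussed above.
# "scores" holds each baseline classifier's output for one input,
# with shape (num_classifiers, num_classes).
import numpy as np

def combine_fixed(scores, rule="sum"):
    """Fixed mathematical combination of class scores (summation or product)."""
    return scores.sum(axis=0) if rule == "sum" else scores.prod(axis=0)

def combine_learned(scores, output_classifier):
    """Learned combination: a trained model takes the baseline scores as input [36]."""
    return output_classifier.predict(scores.reshape(1, -1))  # hypothetical model API
```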
The combination of multiple baseline techniques to im-
prove place matching performance has also been proposed
for visual place recognition. [13] fuses the features obtained
by multiple state-of-the-art (SOTA) VPR techniques in a
Hidden Markov Model that also incorporates sequential
information. Rather than parallel fusion, [14] proposes using the baseline techniques hierarchically. Measuring how different baseline techniques complement each
other’s weaknesses has also been a recent research topic [37],
with [12] proposing a frame-by-frame selection of optimal
techniques to combine. While fusion-based techniques often achieve SOTA performance, they are computationally demanding and hence unsuitable for hardware-constrained robotic applications. [24] shows that
it is possible to design lightweight VPR systems based on
merging classifiers as long as both the baseline unit and
the combination method are efficient. Finally, to the best of
our knowledge, all of the fusion approaches proposed for VPR so far are based on variations of the fixed mathematical operations discussed above, leaving the use of a learned method for merging models largely unexplored.
This manuscript proposes using a learned classifier-merging
neural network for lightweight visual place recognition. We
start from a compact baseline classifier with binary weights
whose efficiency and compactness allow for the use of
multiple units in parallel and whose outputs are then treated
as inputs to a neural network, the merger. The merger
network is based on a one-dimensional convolutional layer
which combines the outputs of the baseline classifiers and
relates the scores of nearby places in a single operation.
We design an end-to-end training scheme based on data
augmentation, allowing the baseline classifiers and merger
network to be trained with a single environment traversal.
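A minimal sketch of the merger idea is given below; the layer sizes, kernel width, and the absence of additional layers are illustrative choices rather than our exact configuration.

```python
# Sketch of a merger built around a one-dimensional convolution: baseline
# score vectors are stacked as channels and mixed by the convolution, which
# also relates the scores of nearby places. Sizes are illustrative only.
import torch
import torch.nn as nn

class Merger(nn.Module):
    def __init__(self, num_classifiers, kernel_size=5):
        super().__init__()
        # One input channel per baseline classifier, one fused output channel.
        self.conv = nn.Conv1d(in_channels=num_classifiers, out_channels=1,
                              kernel_size=kernel_size, padding=kernel_size // 2)

    def forward(self, scores):
        # scores: (batch, num_classifiers, num_places)
        fused = self.conv(scores)   # (batch, 1, num_places)
        return fused.squeeze(1)     # one fused score per place; argmax gives the match
```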
III. METHODOLOGY
The overall system consists of two main components:
a binary-weighted neural network classifier used as the
baseline unit and a convolutional neural network that merges
multiple one-dimensional score vectors to achieve better clas-
sification performance. In this section, we start by explaining the two networks' architectures and then the merger network's training process. The last two subsections cover the
datasets that we utilize in our experiments and performance