Merging Classification Predictions with Sequential Information for
Lightweight Visual Place Recognition in Changing Environments
Bruno Arcanjo1, Bruno Ferrarini1, Michael Milford2, Klaus D. McDonald-Maier1 and Shoaib Ehsan1
Abstract: Low-overhead visual place recognition (VPR) is a
highly active research topic. Mobile robotics applications often
operate on low-end hardware, and even systems with more capable
hardware can benefit from freeing up onboard resources for
other navigation tasks. This work addresses lightweight VPR by
proposing a novel system based on the combination of
binary-weighted classifier networks with a one-dimensional
convolutional network, dubbed the merger. Recent work in fusing
multiple VPR techniques has mainly focused on increasing VPR
performance, with computational efficiency given low priority.
In contrast, we design our technique to prioritize low inference
times, taking inspiration from the machine learning literature,
where the efficient combination of classifiers is a heavily
researched topic. Our experiments show that the merger achieves
inference times as low as 1 millisecond, significantly faster
than other well-established lightweight VPR techniques, while
achieving comparable or superior VPR performance under several
visual changes such as seasonal variations and lateral
viewpoint shifts.
I. INTRODUCTION
Visual place recognition (VPR) allows a system to localize
itself in its operating environment using visual information,
matching the currently observed place to a previously seen
one. The localization information can then be used in down-
stream tasks such as Simultaneous Localization and Mapping
(SLAM), allowing for internal map correction in mobile
robotics systems [1].
VPR remains a difficult task with an array of non-trivial
challenges. The same place can appear drastically different
due to changes in illumination [2], seasonal variations [3],
dynamic agents [4] and viewpoint variations [5]. Conversely,
perceptual aliasing occurs when two different places are
identified as the same place due to visual similarities.
Developing VPR techniques that are resilient to all, or even
just a subset, of these challenges is made harder by the fact
that many mobile robotic applications operate on
hardware-restricted platforms [6], [7], making computational
efficiency an additional important consideration. Due to size,
payload limitations or cost, these platforms cannot carry
powerful hardware such as graphics processing units (GPUs) and
hence rely on efficient localization algorithms to autonomously
navigate their environment. Moreover, developing lightweight
techniques also benefits applications with more capable
hardware, freeing up resources for other important navigation
tasks [8], [9].
1B. Arcanjo, B. Ferrarini, K. D. McDonald-Maier and S.
Ehsan are with the School of Computer Science and Electronic
Engineering, University of Essex, United Kingdom (email:
bq17319@essex.ac.uk; bferra@essex.ac.uk;
kdm@essex.ac.uk; sehsan@essex.ac.uk)
2M. Milford is with the School of Electrical Engineering and Computer
Science, Queensland University of Technology, Brisbane, QLD 4000,
Australia (email: michael.milford@qut.edu.au)
The different challenges that VPR presents have driven the
development of several techniques, some more resilient to
certain visual variations than others [10], [11], [12], and each
with a different computational cost. The availability of
multiple VPR approaches with different characteristics has led
to research on fusing different techniques into a single
algorithm, enabling more accurate VPR [13]. However, most
current fusion-based techniques [13], [14] focus heavily on VPR
performance, with little attention given to computational
efficiency, and may hence be unsuitable for resource-constrained
robotic platforms.
This work aims to achieve highly efficient VPR, with low
inference times, without relying on a graphics unit or any
other dedicated computational hardware. We propose a novel
lightweight system based on combining multiple compact
classifiers using a one-dimensional convolutional neural
network, dubbed the merger. We start by introducing an
efficient binary-weighted neural network [15] as the baseline
classifying unit. The merger network is then trained on score
vectors obtained from multiple units, where the convolutional
layer learns to relate the estimations of the baseline models
while incorporating sequential information. The proposed
training scheme relies on data augmentation to train both the
baseline stage and the merger end-to-end with a single
environment sequence.
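As a rough illustration of the merging idea described above, the sketch below stacks the score vectors of K baseline classifiers as K input channels and applies a single one-dimensional convolution along the place axis, so each merged score mixes every classifier's scores for a place and its neighbours along the traversal. The kernel width, the single output channel, zero padding, the absence of an activation, and the names `merger_forward` and `predict_place` are assumptions made for illustration, not the paper's configuration; in the actual system the kernel weights are learned rather than hand-set.

```python
from typing import List


def merger_forward(scores: List[List[float]],
                   kernel: List[List[float]],
                   bias: float = 0.0) -> List[float]:
    """Merge K per-place score vectors with one 1D convolution (illustrative sketch).

    scores: K rows, one score vector of length N per baseline classifier,
            with places ordered as in the reference traversal (assumption).
    kernel: K rows of odd width w, one per input channel (classifier).
    Returns N merged scores; zero padding keeps the output length equal to N.
    """
    n = len(scores[0])
    half = len(kernel[0]) // 2
    merged = []
    for place in range(n):
        total = bias
        for channel, weights in zip(scores, kernel):   # sum over classifiers
            for j, w in enumerate(weights):            # neighbouring places
                idx = place + j - half
                if 0 <= idx < n:                       # zero padding at the ends
                    total += w * channel[idx]
        merged.append(total)
    return merged


def predict_place(scores: List[List[float]],
                  kernel: List[List[float]],
                  bias: float = 0.0) -> int:
    """Return the index of the place with the highest merged score."""
    merged = merger_forward(scores, kernel, bias)
    return max(range(len(merged)), key=merged.__getitem__)


# Three classifiers disagree on a 5-place route: their top places are 2, 3 and 0.
scores = [
    [0.1, 0.1, 0.6, 0.1, 0.1],
    [0.1, 0.1, 0.2, 0.5, 0.1],
    [0.5, 0.1, 0.2, 0.1, 0.1],
]
# Hand-set weights standing in for learned ones: favour a place and its neighbours.
kernel = [[0.25, 0.5, 0.25] for _ in scores]
print(predict_place(scores, kernel))  # prints 2
```

Place 2 wins because it is supported directly by one classifier and indirectly by a neighbouring peak from another. Like the convolution layers of common deep-learning frameworks, this computes a cross-correlation; the distinction does not matter once the kernel is learned.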
To the best of our knowledge, this is the first VPR
technique to fuse multiple baseline models with a learned
network. This manuscript makes the following contributions:
• a lightweight neural network with binary weights, used as
the baseline classifier of our VPR system;
• a one-dimensional convolutional neural network which
efficiently combines the outputs of several baseline
classifiers;
• a data-augmentation-based training scheme for the
proposed VPR system, which allows both the baseline
classifiers and the merger network to be trained with a
single environment traversal sequence.
The paper is structured as follows. In Section II we give
an overview of the state-of-the-art VPR techniques and prior
work on combining multiple models into one system. In
Section III we detail our proposed system, from the binary-
weighted classifier to the merger network and its training.
Section IV details our implementation settings, the datasets
used, and the evaluation metrics. We present our results in
Section V and draw conclusions on this work and future
directions in Section VI.
arXiv:2210.00834v1 [cs.CV] 3 Oct 2022
II. RELATED WORK
Several visual place recognition techniques have been
proposed in the literature over the last decades. One popular
approach is to compute local image descriptors from the most
informative features in images from a training set [16], [17],
building an internal feature map. The map is then coupled
with a retrieval algorithm, such as Bag-of-Words (BoW),
which searches the feature map for the best match for an
input image at runtime. While this pipeline has been heavily
explored for VPR [18], [19], it faces important challenges,
such as consistently selecting the most distinctive regions of
an image and efficiently storing and matching the computed
descriptors over the long term. The use of convolutional neural
networks (CNNs) as feature extractors [20] addresses the first
issue by having the network learn what information to extract
from the image, and techniques based on CNN feature extraction
have recently achieved state-of-the-art VPR performance [21],
[22], [23]. However, running an often large CNN for feature
extraction and then searching the resulting feature map for an
optimal match makes these algorithms highly computationally
intensive, and hence unsuitable for resource-constrained
robotic platforms.
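For readers unfamiliar with the descriptor-plus-retrieval pipeline discussed above, here is a minimal Bag-of-Words sketch: each local descriptor is quantized to its nearest visual word, an image becomes a histogram of word counts, and the query is matched to the map image with the highest cosine similarity. The toy 2-D descriptors, the fixed vocabulary, and all function names are hypothetical; practical BoW systems typically add tf-idf weighting and inverted indices, which are omitted here.

```python
import math
from typing import List, Sequence

Descriptor = Sequence[float]


def nearest_word(desc: Descriptor, vocabulary: List[Descriptor]) -> int:
    """Quantize a local descriptor to its closest visual word (squared L2 distance)."""
    return min(range(len(vocabulary)),
               key=lambda i: sum((a - b) ** 2 for a, b in zip(desc, vocabulary[i])))


def bow_histogram(descs: List[Descriptor], vocabulary: List[Descriptor]) -> List[float]:
    """Count how many of an image's descriptors fall on each visual word."""
    hist = [0.0] * len(vocabulary)
    for d in descs:
        hist[nearest_word(d, vocabulary)] += 1.0
    return hist


def cosine(a: List[float], b: List[float]) -> float:
    """Cosine similarity; returns 0.0 if either vector is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb) if na and nb else 0.0


def best_match(query_descs: List[Descriptor],
               map_hists: List[List[float]],
               vocabulary: List[Descriptor]) -> int:
    """Return the index of the map image whose histogram best matches the query."""
    q = bow_histogram(query_descs, vocabulary)
    return max(range(len(map_hists)), key=lambda i: cosine(q, map_hists[i]))


# Toy 2-D "descriptors" and a 3-word vocabulary (hypothetical values).
vocabulary = [(0.0, 0.0), (1.0, 0.0), (0.0, 1.0)]
map_images = [[(0.1, 0.0), (0.9, 0.1)],   # place 0 -> words 0 and 1
              [(0.0, 0.9), (0.0, 1.1)]]   # place 1 -> word 2 twice
map_hists = [bow_histogram(d, vocabulary) for d in map_images]
print(best_match([(0.05, 0.95), (0.1, 1.0)], map_hists, vocabulary))  # prints 1
```

Both the stored histograms and the linear search over them grow with the size of the map, which is the storage and matching cost the paragraph refers to.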
A possible alternative is to treat VPR as an image clas-
sification problem, where each place is treated as a class
and each image is a training example belonging to a class.
This approach has the advantage of encoding an image and
inferring the place match in a single step, which can be
designed to be highly efficient. DrosoNet [24] and FlyNet [25]
are both classifier models designed for extremely lightweight
VPR, at the cost of reduced performance. These small
classifiers tend to be inconsistent: two models with the same
architecture, trained on the same environment sequence, often
output different place predictions for the same input image.
[24] exploits this observation by using a large number of
classifiers and combining their outputs with a handcrafted
voting mechanism, aiming to cancel out the weak spots of
individual units.
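Treating VPR as classification means each classifier outputs one score per place and predicts the highest-scoring place. The sketch below combines several such classifiers with a simple plurality vote. It is a generic voting scheme chosen for illustration, not the specific handcrafted mechanism of [24], and the tie-breaking rule (lowest place index) and function names are assumptions.

```python
from collections import Counter
from typing import List


def argmax(scores: List[float]) -> int:
    """Index of the highest score (first one on ties)."""
    return max(range(len(scores)), key=scores.__getitem__)


def plurality_vote(all_scores: List[List[float]]) -> int:
    """Each classifier votes for its top place; the most-voted place wins.

    Ties between places are broken by the lowest place index (an assumption).
    """
    votes = Counter(argmax(s) for s in all_scores)
    top = max(votes.values())
    return min(place for place, count in votes.items() if count == top)


# Three hypothetical classifiers scoring 4 places for the same query image.
all_scores = [
    [0.1, 0.7, 0.1, 0.1],
    [0.2, 0.1, 0.6, 0.1],
    [0.1, 0.5, 0.3, 0.1],
]
print(plurality_vote(all_scores))  # prints 1
```

The second classifier's error on place 2 is outvoted by the other two, which is the "cancel out weak spots" effect in its simplest form.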
Combining the estimations of multiple classifiers to im-
prove classification performance has been heavily researched
in the broader machine learning (ML) field [26], [27],
leading to powerful and popular ensemble algorithms [28],
[29]. For a combination of classifiers to perform well, it
is important that their class estimations for a given input
differ to some degree [30]. Several approaches can be used to
choose a set of baseline classifiers that meets this
requirement, the most obvious being to simply use different
classifier algorithms as the baseline. When using multiple
instances of the same baseline classifier, training each model
on a different training set also generates variation in the
predicted scores. When combining multiple identical neural
networks, it is also possible to rely on random weight
initialization [31], [32] to obtain different estimations,
even when the networks are trained on the same data.
Regardless of the chosen approach, it is also important that
the baseline classifiers do not overfit the training data.
Focusing on efficiency, using multiple fully binary neural
networks has been shown to improve performance in image
classification tasks [33] while retaining the benefits of a low
computational footprint and power usage. Binary neural networks
are particularly attractive candidates for an ensemble, as each
individual network is a weak but efficient classifier [34]. The
combination method itself has also been thoroughly explored in
the ML literature [26], [35]. The most common approach is to
apply a standard mathematical operation, such as multiplication
or summation, to the outputs of the different classifiers.
Alternatively, it is possible to train an output classifier
using the scores of the baseline classifiers as inputs [36].
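The two fixed combination rules mentioned above can be sketched in a few lines, assuming each classifier outputs one non-negative score per class. The toy scores and function names are hypothetical and were chosen to show that the rules can disagree: under the product rule a single near-zero score vetoes a class, while the sum rule lets strong support from other classifiers outweigh it.

```python
import math
from typing import List


def argmax(v: List[float]) -> int:
    """Index of the highest value (first one on ties)."""
    return max(range(len(v)), key=v.__getitem__)


def sum_rule(all_scores: List[List[float]]) -> List[float]:
    """Add each class's scores across classifiers."""
    return [sum(col) for col in zip(*all_scores)]


def product_rule(all_scores: List[List[float]]) -> List[float]:
    """Multiply each class's scores across classifiers.

    A near-zero score from any single classifier effectively vetoes that class.
    """
    return [math.prod(col) for col in zip(*all_scores)]


# Hypothetical per-class scores from three classifiers over three classes.
all_scores = [
    [0.70, 0.30, 0.00],
    [0.70, 0.30, 0.00],
    [0.01, 0.30, 0.69],
]
print(argmax(sum_rule(all_scores)))      # prints 0
print(argmax(product_rule(all_scores)))  # prints 1
```

Neither rule has trainable parameters; a learned combiner, as in [36], instead fits its weights to the baseline classifiers' scores.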
The combination of multiple baseline techniques to im-
prove place matching performance has also been proposed
for visual place recognition. [13] fuses the features obtained
by multiple state-of-the-art (SOTA) VPR techniques in a
Hidden Markov Model that also incorporates sequential
information. Rather than parallel fusion, [14] proposes using
the baseline techniques hierarchically. Measuring how different
baseline techniques complement each other's weaknesses has also
been a recent research topic [37], with [12] proposing a
frame-by-frame selection of the optimal techniques to combine.
While fusion-based techniques often achieve SOTA performance,
they are computationally demanding, which makes them unsuitable
for hardware-constrained robotic applications. [24] shows that
it is possible to design lightweight VPR systems based on
merging classifiers, as long as both the baseline unit and
the combination method are efficient. Finally, to the best of
our knowledge, all fusion approaches proposed so far for VPR
are based on variations of the fixed mathematical operations
discussed above, leaving the use of a learned method for
merging models in VPR largely unexplored.
This manuscript proposes using a learned classifier-merging
neural network for lightweight visual place recognition. We
start from a compact baseline classifier with binary weights
whose efficiency and compactness allow for the use of
multiple units in parallel and whose outputs are then treated
as inputs to a neural network, the merger. The merger
network is based on a one-dimensional convolutional layer
which combines the outputs of the baseline classifiers and
relates the scores of nearby places in a single operation.
We design an end-to-end training scheme based on data
augmentation, allowing the baseline classifiers and merger
network to be trained with a single environment traversal.
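To illustrate why binary weights keep a baseline unit cheap, the sketch below implements a fully connected layer whose real-valued weights are replaced by their signs plus one per-neuron scale equal to the mean absolute weight, a binarization in the style of XNOR-Net. This scheme is assumed for illustration; the exact binarization used by the network in [15] may differ. With ±1 weights each dot product needs only additions and subtractions, and storing one bit per weight instead of a 32-bit float cuts weight memory by roughly 32×.

```python
from typing import List, Tuple


def binarize(row: List[float]) -> Tuple[float, List[int]]:
    """Replace real weights by their signs plus one scale, alpha = mean(|w|).

    XNOR-Net-style binarization, assumed here for illustration only.
    """
    alpha = sum(abs(w) for w in row) / len(row)
    signs = [1 if w >= 0 else -1 for w in row]
    return alpha, signs


def binary_dense(x: List[float], weights: List[List[float]]) -> List[float]:
    """Fully connected layer with binarized weights (no bias, no activation)."""
    out = []
    for row in weights:
        alpha, signs = binarize(row)
        # With +1/-1 weights the dot product needs only additions and subtractions.
        acc = sum(xi if s > 0 else -xi for xi, s in zip(x, signs))
        out.append(alpha * acc)
    return out


x = [1.0, 2.0, 3.0]
weights = [[0.5, -0.5, 1.0],    # signs +1, -1, +1; alpha = 2/3
           [-0.2, -0.4, 0.6]]   # signs -1, -1, +1; alpha = 0.4
print(binary_dense(x, weights))
```

Only the weights are binarized here; the inputs stay real-valued, matching a binary-weighted rather than fully binary network.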
III. METHODOLOGY
The overall system consists of two main components:
a binary-weighted neural network classifier used as the
baseline unit and a convolutional neural network that merges
multiple one-dimensional score vectors to achieve better
classification performance. In this section, we first explain
the architectures of the two networks and then the merger
network's training process. The last two subsections cover the
datasets that we utilize in our experiments and performance