Fast and Efficient Scene Categorization
for Autonomous Driving using VAEs
Saravanabalagi Ramachandran1, Jonathan Horgan2, Ganesh Sistu2, and John McDonald1
1Department of Computer Science, Maynooth University, Ireland
2Valeo Vision Systems, Ireland
Abstract
Scene categorization is a useful precursor task that provides prior knowledge for many advanced computer vision tasks, with a broad range of applications in content-based image indexing and retrieval systems. Despite the success of data-driven approaches in computer vision tasks such as object detection and semantic segmentation, their application to learning high-level features for scene recognition has not achieved the same level of success. We propose to generate a fast, efficient, and interpretable intermediate generalized global descriptor that captures coarse features from the image, and to use a classification head to map the descriptors to 3 scene categories: Rural, Urban and Suburban. We train a Variational Autoencoder (VAE) in an unsupervised manner to map images to a constrained multi-dimensional latent space, and use the latent vectors as compact embeddings that serve as global descriptors for images. The experimental results show that the VAE latent vectors capture coarse information from the image, supporting their use as global descriptors. The proposed global descriptor is very compact, with an embedding length of 128, is significantly faster to compute, and is robust to seasonal and illumination changes, while capturing sufficient scene information for scene categorization.
Keywords: Scene Categorization, Image Embeddings, Coarse Features, Variational Autoencoders
1 Introduction
Figure 1: Images from the Scene Categorization Dataset. Top left: Rural (Utah); top right: Urban (Toronto); bottom left: Rural (Stockport to Buxton); bottom right: Suburban (Melbourne).
Scene categorization is a precursor task with a broad range of applications in content-based image indexing and retrieval systems. Content-Based Image Retrieval (CBIR) uses the visual content of a given query image to find the closest match in a large image database [Aliajni and Rahtu, 2020]. The retrieval accuracy of CBIR depends on both the feature representation and the similarity metric. The retrieval process can be accelerated by selectively searching based on certain scene categories, e.g. given a query image with multiple high-rise buildings, searching rural regions would not be beneficial and can be skipped. Knowledge of the scene category can also assist in context-aware object detection, action recognition, and scene understanding, and provides prior knowledge for other advanced computer vision tasks [Khan et al., 2016, Xiao et al., 2010].
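As an illustration of this filtering step (a minimal sketch in Python with illustrative names and a cosine-similarity metric of our own choosing, not code from the paper), the predicted scene category can be used to mask the database before nearest-neighbour search:

import numpy as np

def retrieve(query_desc, db_descs, db_categories, query_category, top_k=5):
    """Category-filtered nearest-neighbour retrieval (illustrative sketch).

    query_desc:     (D,) global descriptor of the query image
    db_descs:       (N, D) global descriptors of the database images
    db_categories:  (N,) scene category label per database image
    query_category: predicted scene category of the query image
    """
    # Restrict the search to database images of the same scene category.
    mask = db_categories == query_category
    candidates = db_descs[mask]
    candidate_ids = np.flatnonzero(mask)

    # Cosine similarity between the query and the remaining candidates.
    q = query_desc / np.linalg.norm(query_desc)
    c = candidates / np.linalg.norm(candidates, axis=1, keepdims=True)
    sims = c @ q

    # Return indices (into the full database) of the best matches.
    order = np.argsort(-sims)[:top_k]
    return candidate_ids[order], sims[order]

The saving is roughly proportional to the fraction of the database excluded by the category mask.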
1 {saravanabalagi.ramachandran, john.mcdonald}@mu.ie. 2 {jonathan.horgan, ganesh.sistu}@valeo.com. This research was supported by Science Foundation Ireland grant 13/RC/2094 to Lero - the Irish Software Research Centre and grant 16/RI/3399.
In autonomous driving scenarios, location context provides an important prior for parameterising autonomous behaviour. Generally, GPS data is used to determine whether the vehicle has entered city limits, where additional caution is required, e.g. adjusting the pedestrian detection threshold to watch for pedestrians in populated regions. However, such an approach requires a priori labelling of the environment, and due to the rapid development of regions around cities and suburbs, it has become increasingly hard to distinguish such regions of interest using GPS coordinates alone. A more scalable and lower-cost approach would be to automatically determine the scene type at the edge using locally sensed data.
We present a deep-learning-based unsupervised holistic approach that directly encodes coarse information in a multi-dimensional latent space without explicitly recognizing objects or their semantics, or capturing fine details. Models equipped with intermediate representations train faster, achieve higher task performance, and generalize better to previously unseen environments [Zhou et al., 2019]. To this end, rather than directly mapping the input image to the required scene categories as in classic data-driven classification solutions, we propose to generate an intermediate generalized global descriptor that captures coarse features from the image, and to use a separate classification head to map the descriptors to scene categories. More specifically, we use an unsupervised convolutional Variational Autoencoder (VAE) to map images to a multi-dimensional latent space. We propose to use the latent vectors directly as global descriptors, which are then mapped to 3 scene categories: Rural, Urban and Suburban, using a supervised classification head that takes these descriptors as input.
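The proposed two-stage design can be sketched as follows. This is a minimal PyTorch illustration under our own assumptions (a generic convolutional encoder, the 128-dimensional latent space mentioned in the abstract, and a small MLP head); it is not the authors' exact architecture:

import torch
import torch.nn as nn

class ConvEncoder(nn.Module):
    """Toy convolutional encoder producing the VAE latent parameters."""
    def __init__(self, latent_dim=128):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(3, 32, 4, stride=2, padding=1), nn.ReLU(),   # H/2
            nn.Conv2d(32, 64, 4, stride=2, padding=1), nn.ReLU(),  # H/4
            nn.Conv2d(64, 128, 4, stride=2, padding=1), nn.ReLU(), # H/8
            nn.AdaptiveAvgPool2d(1), nn.Flatten(),
        )
        self.fc_mu = nn.Linear(128, latent_dim)      # mean of q(z|x)
        self.fc_logvar = nn.Linear(128, latent_dim)  # log-variance of q(z|x)

    def forward(self, x):
        h = self.features(x)
        return self.fc_mu(h), self.fc_logvar(h)

class SceneClassifierHead(nn.Module):
    """Maps a 128-d latent descriptor to the 3 scene categories."""
    def __init__(self, latent_dim=128, num_classes=3):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(latent_dim, 64), nn.ReLU(),
            nn.Linear(64, num_classes),
        )

    def forward(self, z):
        return self.net(z)

# Inference: the latent mean serves as the global descriptor.
encoder, head = ConvEncoder(), SceneClassifierHead()
image = torch.randn(1, 3, 128, 128)          # placeholder input
with torch.no_grad():
    mu, _ = encoder(image)                   # 128-d global descriptor
    logits = head(mu)                        # Rural / Urban / Suburban scores

At inference time only the encoder and the classification head are required; the decoder is used solely during the unsupervised training of the VAE.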
2 Background
The success of deep learning in the field of computer vision over the past decade has resulted in dramatic improvements in performance in areas such as object recognition, detection and segmentation. However, the performance of scene recognition is still not sufficient, to some extent because of the complex configurations of scenes [Xie et al., 2020]. Early work on scene categorization includes [Oliva and Torralba, 2001], where the authors proposed a computational model of the recognition of real-world scenes that bypasses segmentation and the processing of individual objects or regions. Notable early global image descriptor approaches include the aggregation of local keypoint descriptors through Bag of Words (BoW) [Csurka et al., 2004], Fisher Vectors (FV) [Perronnin et al., 2010, Sanchez et al., 2013] and the Vector of Locally Aggregated Descriptors (VLAD) [Jégou et al., 2010]. More recently, researchers have also used the Histogram of Oriented Gradients (HOG) and its extensions, such as Pyramid HOG (PHOG), for mapping and localization [Garcia-Fidalgo and Ortiz, 2017]. Although these approaches have shown strong performance in constrained settings, they lack the repeatability and robustness required to deal with the challenging variability that occurs in natural scenes due to different times of day, weather, lighting and seasons [Ramachandran and McDonald, 2019].
To overcome these issues, recent research has focussed on the use of learned global descriptors. Probably the most notable here is NetVLAD, which reformulated VLAD through the use of a deep learning architecture [Arandjelovic et al., 2016], resulting in a CNN-based feature extractor that uses weak supervision to learn a distance metric based on the triplet loss.
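For context, metric learning with a triplet loss has the general form below (our paraphrase of the standard formulation; NetVLAD itself employs a weakly supervised ranking variant of this loss):

$$ \mathcal{L}_{\text{triplet}} = \max\big(0,\; d(f(q), f(p)) - d(f(q), f(n)) + m \big), $$

where f(·) is the learned descriptor, q is a query image, p a matching (positive) image, n a non-matching (negative) image, d a distance function and m a margin.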
The Variational Autoencoder (VAE), introduced by [Kingma and Welling, 2013], maps images to a multi-dimensional standard normal latent space. Although multiple implementations of VAEs have shown success in generating human faces since the introduction of the CelebA dataset [Liu et al., 2015], VAEs often produce blurry and less saturated reconstructions, and have been shown to lack the ability to generalize and generate high-resolution images for domains that exhibit multiple complex variations, e.g. realistic natural landscape images. Besides their use as generative models, VAEs have also been used to infer one or more scalar variables from images in the context of autonomous driving, e.g. for vehicle control [Amini et al., 2018].
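For reference, a minimal sketch of the standard VAE training objective from [Kingma and Welling, 2013] is shown below: reconstruction error plus the KL divergence of the approximate posterior from the standard normal prior, optimised via the reparameterization trick. The mean-squared-error reconstruction term and the optional beta weighting are our own illustrative choices, not details taken from the paper:

import torch
import torch.nn.functional as F

def reparameterize(mu, logvar):
    """Sample z ~ N(mu, sigma^2) using the reparameterization trick."""
    std = torch.exp(0.5 * logvar)
    eps = torch.randn_like(std)
    return mu + eps * std

def vae_loss(recon_x, x, mu, logvar, beta=1.0):
    """Negative ELBO: reconstruction error + beta * KL(q(z|x) || N(0, I))."""
    recon = F.mse_loss(recon_x, x, reduction="sum")
    # Closed-form KL divergence between N(mu, sigma^2) and N(0, I).
    kl = -0.5 * torch.sum(1 + logvar - mu.pow(2) - logvar.exp())
    return recon + beta * kl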
A number of researchers have developed datasets to accelerate progress in general scene recognition. Examples include MIT Indoor67 [Quattoni and Torralba, 2009], SUN [Xiao et al., 2010], and Places365 [Zhou et al., 2017]. Whilst these datasets capture a very wide variety of scenes, they lack suitability when developing scene categorisation techniques that are specific to autonomous driving. Given this, in our research we choose to use images from public driving datasets such as Oxford RobotCar [Maddern et al., 2017] in an unsupervised manner and curate our own evaluation dataset targeted at our domain of interest.