In autonomous driving scenarios, location context provides an important prior for parameterising autonomous
behaviour. Typically, GPS data is used to determine whether the vehicle has entered city limits, where additional
caution is required, e.g. lowering the pedestrian detection threshold to watch out for pedestrians in populated
regions. However, such an approach requires a priori labelling of the environment, and due to the rapid development
of regions around cities and suburbs, it has become increasingly difficult to distinguish such regions of interest
using GPS coordinates alone. A more scalable and lower-cost approach is to automatically determine the
scene type at the edge using locally sensed data.
We present a deep learning-based unsupervised holistic approach that directly encodes coarse information
in a multi-dimensional latent space without explicitly recognizing objects, their semantics, or capturing fine
details. Models equipped with intermediate representations train faster, achieve higher task performance, and
generalize better to previously unseen environments [Zhou et al., 2019]. To this end, rather than directly
mapping the input image to the required scene categories as in classic data-driven classification solutions, we
propose to generate an intermediate generalized global descriptor that captures coarse features from the image
and to use a separate classification head to map the descriptors to scene categories. More specifically, we use an
unsupervised convolutional Variational Autoencoder (VAE) to map images to a multi-dimensional latent space.
We propose to use the latent vectors directly as global descriptors, which a supervised classification head then
maps to three scene categories: Rural, Urban and Suburban.
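As an illustrative sketch of this pipeline (not the exact architecture of our system; the layer sizes, latent dimensionality and 64x64 input resolution below are assumptions made for the example), a convolutional VAE encoder produces a latent descriptor which a small supervised head maps to the three categories:

```python
import torch
import torch.nn as nn

class ConvVAEEncoder(nn.Module):
    """Illustrative convolutional VAE encoder: image -> (mu, logvar)."""
    def __init__(self, latent_dim=128):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(3, 32, 4, stride=2, padding=1), nn.ReLU(),    # 64 -> 32
            nn.Conv2d(32, 64, 4, stride=2, padding=1), nn.ReLU(),   # 32 -> 16
            nn.Conv2d(64, 128, 4, stride=2, padding=1), nn.ReLU(),  # 16 -> 8
            nn.Flatten(),
        )
        self.fc_mu = nn.Linear(128 * 8 * 8, latent_dim)
        self.fc_logvar = nn.Linear(128 * 8 * 8, latent_dim)

    def forward(self, x):
        h = self.features(x)
        return self.fc_mu(h), self.fc_logvar(h)

class SceneClassifierHead(nn.Module):
    """Supervised head mapping latent descriptors to scene categories."""
    def __init__(self, latent_dim=128, num_classes=3):  # Rural, Urban, Suburban
        super().__init__()
        self.mlp = nn.Sequential(
            nn.Linear(latent_dim, 64), nn.ReLU(),
            nn.Linear(64, num_classes),
        )

    def forward(self, z):
        return self.mlp(z)

# Usage: encode an image batch, take the latent mean as the global descriptor,
# then classify. The VAE decoder and its reconstruction/KL training objective
# are omitted from this sketch.
encoder = ConvVAEEncoder()
head = SceneClassifierHead()
images = torch.randn(4, 3, 64, 64)   # placeholder batch of 64x64 RGB images
mu, logvar = encoder(images)
logits = head(mu)                    # shape: (4, 3)
```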
2 Background
The success of deep learning in the field of computer vision over the past decade has resulted in dramatic
improvements in performance in areas such as object recognition, detection and segmentation. However,
scene recognition performance remains limited because of the complex configurations found in scenes
[Xie et al., 2020]. Early work on scene categorization includes [Oliva and Torralba, 2001], where the authors
proposed a computational model for the recognition of real-world scenes that bypasses segmentation and
the processing of individual objects or regions. Notable early global image descriptor approaches include
aggregation of local keypoint descriptors through Bag of Words (BoW) [Csurka et al., 2004], Fisher Vectors
(FV) [Perronnin et al., 2010, Sanchez et al., 2013] and Vector of Locally Aggregated Descriptors (VLAD)
[Jégou et al., 2010]. More recently, researchers have also used Histogram of Oriented Gradients (HOG) and
its extensions such as Pyramid HOG (PHOG) for mapping and localization [Garcia-Fidalgo and Ortiz, 2017].
Although these approaches have shown strong performance in constrained settings, they lack the repeatability
and robustness required to deal with the challenging variability of natural scenes caused by different times of
day, weather, lighting and seasons [Ramachandran and McDonald, 2019].
To overcome these issues, recent research has focused on the use of learned global descriptors. Probably
the most notable is NetVLAD [Arandjelovic et al., 2016], which reformulated VLAD as a differentiable deep
learning architecture, resulting in a CNN-based feature extractor trained with weak supervision to learn a
distance metric based on the triplet loss.
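For reference, the generic triplet margin objective underlying this form of metric learning can be written as (with anchor $a$, positive $p$, negative $n$, embedding function $f$, distance $d$ and margin $m$):
\[
\mathcal{L}_{\text{triplet}} = \max\left(0,\; d\!\left(f(a), f(p)\right) - d\!\left(f(a), f(n)\right) + m\right)
\]
NetVLAD trains with a weakly supervised ranking variant of this loss, with potential positives and negatives derived from GPS-tagged imagery rather than exact place-level labels.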
The Variational Autoencoder (VAE), introduced by [Kingma and Welling, 2013], maps images to a multi-
dimensional standard normal latent space. Although multiple VAE implementations have shown success in
generating human faces since the introduction of the CelebA dataset [Liu et al., 2015], VAEs often produce
blurry and less saturated reconstructions and have been shown to lack the ability to generalize and generate
high-resolution images for domains that exhibit multiple complex variations, e.g. realistic natural landscape
images. Besides their use as generative models, VAEs have also been used to infer one or more scalar variables
from images in the context of autonomous driving, such as for vehicle control [Amini et al., 2018].
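For reference, the VAE of [Kingma and Welling, 2013] is trained by maximising the evidence lower bound (ELBO), which balances reconstruction quality against the divergence of the approximate posterior $q_\phi(z \mid x)$ from the standard normal prior $p(z) = \mathcal{N}(0, I)$:
\[
\mathcal{L}(\theta, \phi; x) = \mathbb{E}_{q_\phi(z \mid x)}\left[\log p_\theta(x \mid z)\right] - D_{\mathrm{KL}}\left(q_\phi(z \mid x) \,\|\, p(z)\right)
\]
The KL term is what constrains the latent space to remain close to a standard normal distribution, which is the property exploited when latent vectors are reused as descriptors.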
A number of researchers have developed datasets to accelerate progress in general scene recognition. Ex-
amples include MIT Indoor67 [Quattoni and Torralba, 2009], SUN [Xiao et al., 2010], and Places365 [Zhou
et al., 2017]. Whilst these datasets capture a very wide variety of scenes, they are not well suited to developing
scene categorisation techniques specific to autonomous driving. Given this, in our research we choose
to use images from public driving datasets such as Oxford RobotCar [Maddern et al., 2017] in an unsupervised
manner and curate our own evaluation dataset targeted at our domain of interest.