In autonomous driving scenarios, location context provides an important prior for parameterising autonomous
behaviour. Typically, GPS data is used to determine whether the vehicle has entered city limits, where additional
caution is required, e.g. lowering the pedestrian detection threshold to watch out for pedestrians in populated
regions. However, such an approach requires a priori labelling of the environment, and due to the rapid development
of regions around cities and suburbs, it has become increasingly difficult to distinguish such regions of interest
using GPS coordinates alone. A more scalable and lower-cost approach is to automatically determine the
scene type at the edge using locally sensed data.
We present a deep learning-based unsupervised holistic approach that directly encodes coarse information
in a multi-dimensional latent space without explicitly recognizing objects, their semantics, or capturing fine
details. Models equipped with intermediate representations train faster, achieve higher task performance, and
generalize better to previously unseen environments [Zhou et al., 2019]. To this end, rather than directly
mapping the input image to the required scene categories as in classic data-driven classification solutions, we
propose to generate an intermediate generalized global descriptor that captures coarse features from the image
and to use a separate classification head to map the descriptors to scene categories. More specifically, we use an
unsupervised convolutional Variational Autoencoder (VAE) to map images to a multi-dimensional latent space.
We propose to use the latent vectors directly as global descriptors, which a supervised classification head then
maps to three scene categories: Rural, Urban and Suburban.
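As an illustrative sketch of this pipeline (not the exact architecture of our system; the layer sizes, latent dimensionality and 64x64 input resolution below are assumptions made for the example), a convolutional VAE encoder produces a latent descriptor which a small supervised head maps to the three categories:

```python
import torch
import torch.nn as nn

class ConvVAEEncoder(nn.Module):
    """Illustrative convolutional VAE encoder: image -> (mu, logvar)."""
    def __init__(self, latent_dim=128):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(3, 32, 4, stride=2, padding=1), nn.ReLU(),    # 64 -> 32
            nn.Conv2d(32, 64, 4, stride=2, padding=1), nn.ReLU(),   # 32 -> 16
            nn.Conv2d(64, 128, 4, stride=2, padding=1), nn.ReLU(),  # 16 -> 8
            nn.Flatten(),
        )
        self.fc_mu = nn.Linear(128 * 8 * 8, latent_dim)
        self.fc_logvar = nn.Linear(128 * 8 * 8, latent_dim)

    def forward(self, x):
        h = self.features(x)
        return self.fc_mu(h), self.fc_logvar(h)

class SceneClassifierHead(nn.Module):
    """Supervised head mapping latent descriptors to scene categories."""
    def __init__(self, latent_dim=128, num_classes=3):  # Rural, Urban, Suburban
        super().__init__()
        self.mlp = nn.Sequential(
            nn.Linear(latent_dim, 64), nn.ReLU(),
            nn.Linear(64, num_classes),
        )

    def forward(self, z):
        return self.mlp(z)

# Usage: encode an image batch, take the latent mean as the global descriptor,
# then classify. The VAE decoder and its reconstruction/KL training objective
# are omitted from this sketch.
encoder = ConvVAEEncoder()
head = SceneClassifierHead()
images = torch.randn(4, 3, 64, 64)   # placeholder batch of 64x64 RGB images
mu, logvar = encoder(images)
logits = head(mu)                    # shape: (4, 3)
```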
2 Background
The success of deep learning in the field of computer vision over the past decade has resulted in dramatic
improvements in performance in areas such as object recognition, detection and segmentation. However,
scene recognition performance remains limited because of the complex configurations found in scenes
[Xie et al., 2020]. Early work on scene categorization includes [Oliva and Torralba, 2001], where the authors
proposed a computational model for the recognition of real-world scenes that bypasses segmentation and
the processing of individual objects or regions. Notable early global image descriptor approaches include
aggregation of local keypoint descriptors through Bag of Words (BoW) [Csurka et al., 2004], Fisher Vectors
(FV) [Perronnin et al., 2010, Sanchez et al., 2013] and Vector of Locally Aggregated Descriptors (VLAD)
[Jégou et al., 2010]. More recently, researchers have also used Histogram of Oriented Gradients (HOG) and
its extensions such as Pyramid HOG (PHOG) for mapping and localization [Garcia-Fidalgo and Ortiz, 2017].
Although these approaches have shown strong performance in constrained settings, they lack the repeatability
and robustness required to deal with the challenging variability of natural scenes caused by different times of
day, weather, lighting and seasons [Ramachandran and McDonald, 2019].
To overcome these issues, recent research has focused on the use of learned global descriptors. Probably
the most notable is NetVLAD [Arandjelovic et al., 2016], which reformulated VLAD as a differentiable deep
learning architecture, resulting in a CNN-based feature extractor trained with weak supervision to learn a
distance metric based on the triplet loss.
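For reference, the generic triplet margin objective underlying this form of metric learning can be written as (with anchor $a$, positive $p$, negative $n$, embedding function $f$, distance $d$ and margin $m$):
\[
\mathcal{L}_{\text{triplet}} = \max\left(0,\; d\!\left(f(a), f(p)\right) - d\!\left(f(a), f(n)\right) + m\right)
\]
NetVLAD trains with a weakly supervised ranking variant of this loss, with potential positives and negatives derived from GPS-tagged imagery rather than exact place-level labels.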
The Variational Autoencoder (VAE), introduced by [Kingma and Welling, 2013], maps images to a multi-
dimensional standard normal latent space. Although multiple VAE implementations have shown success in
generating human faces since the introduction of the CelebA dataset [Liu et al., 2015], VAEs often produce
blurry and less saturated reconstructions and have been shown to lack the ability to generalize and generate
high-resolution images for domains that exhibit multiple complex variations, e.g. realistic natural landscape
images. Besides their use as generative models, VAEs have also been used to infer one or more scalar variables
from images in the context of autonomous driving, such as for vehicle control [Amini et al., 2018].
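For reference, the VAE of [Kingma and Welling, 2013] is trained by maximising the evidence lower bound (ELBO), which balances reconstruction quality against the divergence of the approximate posterior $q_\phi(z \mid x)$ from the standard normal prior $p(z) = \mathcal{N}(0, I)$:
\[
\mathcal{L}(\theta, \phi; x) = \mathbb{E}_{q_\phi(z \mid x)}\left[\log p_\theta(x \mid z)\right] - D_{\mathrm{KL}}\left(q_\phi(z \mid x) \,\|\, p(z)\right)
\]
The KL term is what constrains the latent space to remain close to a standard normal distribution, which is the property exploited when latent vectors are reused as descriptors.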
A number of researchers have developed datasets to accelerate progress in general scene recognition. Ex-
amples include MIT Indoor67 [Quattoni and Torralba, 2009], SUN [Xiao et al., 2010], and Places365 [Zhou
et al., 2017]. Whilst these datasets capture a very wide variety of scenes, they are not well suited to developing
scene categorisation techniques specific to autonomous driving. Given this, in our research we choose
to use images from public driving datasets such as Oxford RobotCar [Maddern et al., 2017] in an unsupervised
manner and curate our own evaluation dataset targeted at our domain of interest.