images. This applies to all other countries as well and is even more noticeable in developing countries, where panoramic images are, for the most part, not available at all.
Due to recent advances in autonomous vehicles and Advanced Driver Assistance Systems (ADAS), frontal street-view videos are easily accessible from the dash cams of current vehicles. Instead of relying on scarce panoramic images [31, 35, 30], extending cross-view geo-localization algorithms to work on sequences of images is more practical and more applicable in real-world scenarios. However, current cross-view geo-localization approaches [31, 35, 30, 16, 32, 43, 37] deal mainly with a single image and cannot be used directly to capture the temporal structure that lies within a sequence of limited FOV frames. Thus, it is a natural extension to expand cross-view geo-localization methods to sequences of limited FOV images, a task we name cross-view image sequence geo-localization.
In this paper, we propose a new cross-view geo-localization approach that works on sequences of limited FOV images. Our model is trained end-to-end to capture the temporal feature representations that lie within the images for better geo-localization. Although our model is trained on fixed-length temporal sequences, it tackles the challenge of variable-length sequences during the inference phase through a novel sequential dropout scheme. To the best of our knowledge, we are the first to propose end-to-end cross-view geo-localization from sequences of images. We refer to this task as cross-view image sequence geo-localization. Furthermore, to facilitate future research in cross-view geo-localization from sequences, we put forward a new dataset and compare our proposed model with several recent baselines. In summary, our main contributions are as follows:
1) We propose a new end-to-end approach to cross-view image sequence geo-localization that geo-localizes a query sequence of limited FOV ground images against its corresponding aerial images.
2) We introduce the first large-scale cross-view image sequence geo-localization dataset.
3) We propose a novel temporal feature aggregation technique that learns an end-to-end feature representation from a sequence of limited FOV images for sequence geo-localization.
4) We propose a new sequence dropout method to predict coherent features on sequences of different lengths. The proposed dropout method helps regularize our model and achieves more robust results.
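The idea behind contribution 4 can be illustrated with a minimal sketch. The function below is our own assumption about how such a sequence dropout might be realized, not the paper's implementation: it zeroes the descriptors of randomly chosen frames and rescales the survivors (inverted dropout), so the downstream temporal aggregation is exposed to effectively shorter sequences during training. The name `sequence_dropout` and the NumPy formulation are illustrative.

```python
import numpy as np

def sequence_dropout(features, keep_prob=0.7, rng=None):
    """Drop whole frames from a (T, D) sequence of frame descriptors.

    Zeroing random frames during training (and rescaling the kept ones,
    as in inverted dropout) exposes the aggregation step to effectively
    shorter sequences, so the model predicts coherent features on
    variable-length inputs at inference time.
    """
    rng = np.random.default_rng(rng)
    t = features.shape[0]
    mask = rng.random(t) < keep_prob
    if not mask.any():                 # always keep at least one frame
        mask[rng.integers(t)] = True
    out = np.zeros_like(features)
    out[mask] = features[mask] / mask.mean()  # inverted-dropout rescaling
    return out
```

At inference, shorter sequences can simply be treated as longer ones with the missing frames dropped, which is what this training-time regularizer simulates.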
2. Related Work
Cross-view Image Geo-localization: Before the deep learning era, cross-view image geo-localization methods were based on hand-crafted features [19, 8] such as HoG [10], GIST [24], self-similarity [28], and color histograms. These conventional methods struggled with matching accuracy because of the limited quality of the features. Following the resurgence of deep learning in numerous computer vision applications, several deep learning based geo-localization methods [40, 20, 37] have been proposed to extract features from fine-tuned CNN models to improve cross-view
geo-localization accuracy. More recently, Hu et al. [16] proposed to aggregate features with a NetVLAD [3] layer, which achieved significant performance improvements. Shi et al. [32] proposed a feature transport module for aligning features from aerial-view and street-view images. Liu et al. [21] explored fusing orientation information into the model, which boosted performance. With the development of Generative Adversarial Networks (GANs) [14], Regmi et al. [26] proposed a GAN-based cross-view image geo-localization approach using a feature fusion training strategy. Zhu et al. [43] recently proposed a new approach (VIGOR) that does not require a one-to-one correspondence between ground images and aerial images.
It is also worth mentioning that some methods [30, 31, 35] based on ground-level panoramas employ the polar transformation, which bridges the domain gap between reference images and query images through prior geometric knowledge. By leveraging this geometric prior, Shi et al. [30] proposed Spatial Aware Feature Aggregation (SAFA), which improves the results on CVUSA [40] and CVACT [21] by a large margin. Similar to [26], Toker et al. [35] combined SAFA [30] with a GAN; their method achieved state-of-the-art results on CVUSA [40] and CVACT [21]. However, to perform the polar transformation, the query image is assumed to be aligned at the center of its reference aerial image, which is not always guaranteed in real-world scenarios. All of the above-mentioned methods rely on panoramic ground-level images; by contrast, our method uses more easily available limited FOV images.
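For concreteness, the polar transformation employed by [30, 31, 35] can be sketched as follows. This is a minimal NumPy version under simplifying assumptions (a square aerial image and nearest-neighbour sampling, whereas bilinear interpolation is the usual choice in practice); the function name and signature are illustrative, not taken from those works.

```python
import numpy as np

def polar_transform(aerial, height, width):
    """Resample a square S x S aerial image onto a (height, width) polar grid.

    Column j corresponds to the azimuth angle 2*pi*j/width around the
    image centre, and row i to a radius shrinking from S/2 (top row)
    towards 0 (bottom row).  The result has a panorama-like layout whose
    geometry roughly matches a ground-level 360-degree view, which is
    what lets these methods narrow the cross-view domain gap.
    """
    size = aerial.shape[0]
    i = np.arange(height)[:, None].astype(float)   # target rows (radius)
    j = np.arange(width)[None, :].astype(float)    # target cols (azimuth)
    radius = (size / 2.0) * (height - i) / height
    theta = 2.0 * np.pi * j / width
    rows = size / 2.0 - radius * np.cos(theta)
    cols = size / 2.0 + radius * np.sin(theta)
    ri = np.clip(np.rint(rows).astype(int), 0, size - 1)
    ci = np.clip(np.rint(cols).astype(int), 0, size - 1)
    return aerial[ri, ci]                          # nearest-neighbour sampling
```

Note how the transform hard-codes the assumption criticized above: every radius is measured from the exact centre of the aerial patch, so it only matches the ground view when the query camera sits at that centre.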
Some previous works [31, 37, 34] have studied the cross-view image geo-localization problem using a single limited FOV image as a query. Tian et al. [34] proposed a graph-based method that matches the buildings detected in both ground images and aerial images; this method is only applicable in metropolitan areas that contain dense buildings. DBL, proposed by Vo et al. [37], focuses on geo-localizing the scene in the image rather than the location of the camera. Dynamic Similarity Matching, proposed by Shi et al. [31], requires polar-transformed aerial images as input. Compared to these methods, we neither assume aligned ground-level images nor restrict our method to metropolitan areas. Furthermore, instead of geo-localizing a single limited FOV image, our approach geo-localizes a sequence of limited FOV images.
Recently, Regmi and Shah [27] proposed to geo-localize
video sequences in the same-view setting by using a geo-
temporal feature learning network and a trajectory smooth-