Cross-View Image Sequence Geo-localization
Xiaohan Zhang1, Waqas Sultani2, and Safwan Wshah1
1Department of Computer Science, University of Vermont, USA
2Intelligent Machine Lab, Information Technology University, Pakistan
{Xiaohan.Zhang, Safwan.Wshah}@uvm.edu waqas.sultani@itu.edu.pk
Abstract

Cross-view geo-localization aims to estimate the GPS location of a query ground-view image by matching it to images from a reference database of geo-tagged aerial images. To address this challenging problem, recent approaches use panoramic ground-view images to increase the range of visibility. Although appealing, panoramic images are not readily available compared to videos of limited Field-Of-View (FOV) images. In this paper, we present the first cross-view geo-localization method that works on a sequence of limited FOV images. Our model is trained end-to-end to capture the temporal structure that lies within the frames using an attention-based temporal feature aggregation module. To robustly handle different sequence lengths and GPS noise during inference, we propose a sequential dropout scheme to simulate variable-length sequences. To evaluate the proposed approach in realistic settings, we present a new large-scale dataset containing ground-view sequences along with the corresponding aerial-view images. Extensive experiments and comparisons demonstrate the superiority of the proposed approach over several competitive baselines.
1. Introduction
Cross-view image geo-localization aims to determine the geospatial location from where an image was taken (the query image) by matching it against a database of geo-tagged aerial images (the reference images) [40, 30, 19, 43]. Estimating geospatial locations from images has many important applications such as autonomous driving [29], robot navigation [4, 17], augmented reality (AR) [9], and unmanned aerial vehicle (UAV) navigation [29].
Despite the huge research efforts devoted to this problem, image geo-localization remains far from solved and is considered one of the most challenging tasks in computer vision due to: 1) the drastic appearance differences between query images and reference images; 2) the time gap between capturing the query image and the reference image, which results in different illumination conditions, weather, and objects; and 3) differences in the resolution at which ground and aerial images are captured.

Figure 1: Comparison of the coverage area (green lines) of user-uploaded street-view images between panoramas (left) and limited FOV images (right) in San Francisco, USA, from Mapillary [2].
Recent research in cross-view image geo-localization has shown tremendous progress on large-scale datasets [40, 21, 43], but these methods heavily rely on panoramic query images [40, 16, 30, 26, 32, 6, 21, 42, 43, 39]. Even though panoramic images provide richer contextual information than normal limited Field-Of-View (FOV) images, in practice, limited FOV images are more common and easier to capture with smartphones, dash cams, and digital single-lens reflex (DSLR) cameras. Fig. 1 compares the coverage area of user-uploaded street-view images between panoramas and limited FOV images in San Francisco, USA, on Mapillary [2]. Moreover, even map platforms such as Google Street View (GSV) provide panoramas only for a few historic or tourist attractions in countries such as China, Qatar, and Pakistan, whereas limited FOV street-view images are available in most regions across 190 countries, as shown on Mapillary [2]. Clearly, limited FOV images are far more widespread than panoramic images; this is especially noticeable in developing countries, where panoramic images are often not available at all.
Due to the recent advancement of autonomous vehicles and Advanced Driver Assistance Systems (ADAS), frontal street-view videos are easily accessible from the dash cams in current vehicles. Instead of relying on less common panoramic images [31, 35, 30], extending cross-view geo-localization algorithms to work on sequences of images is more practical and better suited to real-world scenarios. However, current cross-view geo-localization approaches [31, 35, 30, 16, 32, 43, 37] deal mainly with a single image and cannot be used directly to capture the temporal structure that lies within a sequence of FOV frames. Thus, it is a natural extension to apply cross-view geo-localization methods to sequences of limited FOV images, a task we name cross-view image sequence geo-localization.
In this paper, we propose a new cross-view geo-localization approach that works on sequences of limited FOV images. Our model is trained end-to-end to capture the temporal feature representations that lie within the images for better geo-localization. Although our model is trained on fixed-length temporal sequences, it tackles the challenge of variable-length sequences during the inference phase through a novel sequential dropout scheme. To the best of our knowledge, we are the first to propose end-to-end cross-view geo-localization from sequences of images; we refer to this task as cross-view image sequence geo-localization. Furthermore, to facilitate future research in cross-view geo-localization from sequences, we put forward a new dataset and compare our proposed model with several recent baselines. In summary, our main contributions are as follows:
1) We propose a new end-to-end approach, cross-view image sequence geo-localization, which geo-localizes a query sequence of limited FOV ground images by matching it to the corresponding aerial images.
2) We introduce the first large-scale cross-view image sequence geo-localization dataset.
3) We propose a novel temporal feature aggregation technique that learns an end-to-end feature representation from a sequence of limited FOV images for sequence geo-localization.
4) We propose a new sequential dropout method to predict coherent features on sequences of different lengths (sketched below). The proposed dropout method helps regularize our model and achieves more robust results.
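As a concrete illustration of contribution 4, the following is a minimal sketch of one way a sequential dropout scheme could be realized, assuming per-frame feature vectors of shape (B, T, D); the function name `sequential_dropout`, the drop probability, and the masking details are our own illustrative choices, not the paper's exact implementation.

```python
import torch

def sequential_dropout(seq_feats: torch.Tensor, p: float = 0.3,
                       training: bool = True):
    """Randomly mask whole frames of a feature sequence.

    seq_feats: (B, T, D) per-frame descriptors.
    p: probability of dropping each frame during training (assumed value).
    Returns the masked features and a boolean padding mask
    (True = frame dropped) for a downstream attention module.
    """
    B, T, _ = seq_feats.shape
    if not training or p <= 0.0:
        return seq_feats, torch.zeros(B, T, dtype=torch.bool,
                                      device=seq_feats.device)
    drop = torch.rand(B, T, device=seq_feats.device) < p
    # Never drop every frame of a sequence: keep frame 0 if all were dropped.
    all_dropped = drop.all(dim=1)
    drop[all_dropped, 0] = False
    masked = seq_feats.masked_fill(drop.unsqueeze(-1), 0.0)
    return masked, drop
```

Randomly masking whole frames during training exposes the aggregation module to effectively shorter sequences, which is the intuition behind simulating variable-length inputs at inference time.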
2. Related Work
Cross-view Image Geo-localization: Before the deep learning era, cross-view image geo-localization methods were based on hand-crafted features [19, 8] such as HoG [10], GIST [24], self-similarity [28], and color histograms. These conventional methods struggled with matching accuracy because of the limited quality of the features. With the resurgence of deep learning in numerous computer vision applications, several deep learning based geo-localization methods [40, 20, 37] were proposed to extract features from fine-tuned CNN models to improve cross-view geo-localization accuracy. More recently, Hu et al. [16] proposed to aggregate features with a NetVLAD [3] layer, which achieved significant performance improvements. Shi et al. [32] proposed a feature transport module for aligning features from aerial-view and street-view images. Liu et al. [21] explored fusing orientation information into the model, which boosted performance. With the development of Generative Adversarial Networks (GANs) [14], Regmi et al. [26] proposed a GAN-based cross-view image geo-localization approach using a feature fusion training strategy. Zhu et al. [43] recently proposed a new approach (VIGOR) that does not require a one-to-one correspondence between ground images and aerial images. It is also worth mentioning that some methods [30, 31, 35] based on ground-level panoramas employ the polar transformation, which bridges the domain gap between reference images and query images using prior geometric knowledge. By leveraging this geometric prior, Shi et al. [30] proposed Spatial Aware Feature Aggregation (SAFA), which improves the results on CVUSA [40] and CVACT [21] by a large margin. Similar to [26], Toker et al. [35] combined SAFA [30] with a GAN, and their method achieved state-of-the-art results on CVUSA [40] and CVACT [21]. However, to perform the polar transformation, the query image is assumed to be aligned at the center of its reference aerial image, which is not always guaranteed in real-world scenarios. The above-mentioned methods rely on panoramic ground-level images; by contrast, our method uses more easily available limited FOV images.
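For reference, the polar transformation employed by the panorama-based methods above [30, 31, 35] can be sketched as follows. This is a minimal NumPy version of the standard center-aligned mapping from an S × S aerial image to a panorama-like H × W target; the exact sampling and orientation conventions in the cited papers may differ.

```python
import numpy as np

def polar_transform(aerial: np.ndarray, H: int = 128, W: int = 512) -> np.ndarray:
    """Warp a square aerial image into panorama-like polar coordinates.

    aerial: (S, S, C) image assumed centered on the camera location.
    Each target column w corresponds to an azimuth angle, each row h
    to a radial distance from the image center (far ground at the top).
    """
    S = aerial.shape[0]
    h, w = np.meshgrid(np.arange(H), np.arange(W), indexing="ij")
    radius = (S / 2.0) * (H - h) / H           # radial distance per row
    theta = 2.0 * np.pi * w / W                # azimuth angle per column
    x = S / 2.0 + radius * np.sin(theta)       # aerial-image column
    y = S / 2.0 - radius * np.cos(theta)       # aerial-image row
    xi = np.clip(np.round(x).astype(int), 0, S - 1)
    yi = np.clip(np.round(y).astype(int), 0, S - 1)
    return aerial[yi, xi]                      # nearest-neighbor sampling
```

This center-alignment assumption is exactly what fails in unaligned real-world settings, which motivates methods that do not depend on the transform.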
Some previous works [31, 37, 34] studied the cross-view image geo-localization problem using a single limited FOV image as the query. Tian et al. [34] proposed a graph-based method that matches the buildings detected in both ground images and aerial images; this method is only applicable in metropolitan areas that contain dense buildings. DBL, proposed by Vo et al. [37], focuses on geo-localizing the scene in the image rather than the location of the camera. Dynamic Similarity Matching, proposed by Shi et al. [31], requires polar-transformed aerial images as input. Compared to these methods, ours neither assumes center-aligned ground-level images nor is restricted to metropolitan areas. Furthermore, instead of geo-localizing a single limited FOV image, our approach geo-localizes a sequence of limited FOV images.
Recently, Regmi and Shah [27] proposed to geo-localize video sequences in the same-view setting using a geo-temporal feature learning network and a trajectory smoothing network. In this paper, by contrast, we incorporate aerial images and ground video sequences to address the problem of cross-view image sequence geo-localization with a transformer-based model. Current cross-view geo-localization approaches can be applied to sequential cross-view geo-localization trivially, frame by frame, as proposed in [17]. However, we propose an end-to-end approach that processes a whole sequence of images and correlates their features with the corresponding aerial image by building a better feature representation in both the temporal and spatial domains. We compare our results with the best models in the literature that could be applied to our dataset, as discussed in the experiments section.
Transformer/multi-head attention: Vaswani et al. [36] proposed the transformer module and demonstrated its ability to capture temporal correlations in sequential data. Using the transformer, several works [22, 5, 12] achieved remarkable results in natural language processing tasks. In computer vision, transformers have been used for image classification [13], video segmentation [38], object detection [7], and same-view video geo-localization [27]. In this paper, we apply the transformer to cross-view image sequence geo-localization to effectively utilize the full range of visibility offered by the sequential data. Our experiments show that the transformer learns to fuse and summarize features from a sequence of images and produces robust predictions; a minimal sketch of such attention-based aggregation follows.
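The sketch below pools per-frame CNN descriptors with a standard PyTorch transformer encoder and a learned summary token; the class name, depth, positional encoding, and single-token pooling are our assumptions, not the paper's exact module.

```python
from typing import Optional

import torch
import torch.nn as nn

class TemporalAggregator(nn.Module):
    """Fuse a sequence of per-frame descriptors into one embedding."""

    def __init__(self, dim: int = 512, heads: int = 8, layers: int = 2,
                 max_len: int = 16):
        super().__init__()
        self.cls = nn.Parameter(torch.zeros(1, 1, dim))             # summary token
        self.pos = nn.Parameter(torch.zeros(1, max_len + 1, dim))   # learned positions
        enc_layer = nn.TransformerEncoderLayer(d_model=dim, nhead=heads,
                                               batch_first=True)
        self.encoder = nn.TransformerEncoder(enc_layer, num_layers=layers)

    def forward(self, seq_feats: torch.Tensor,
                pad_mask: Optional[torch.Tensor] = None) -> torch.Tensor:
        # seq_feats: (B, T, dim); pad_mask: (B, T), True = ignore frame.
        B, T, _ = seq_feats.shape
        tokens = torch.cat([self.cls.expand(B, -1, -1), seq_feats], dim=1)
        tokens = tokens + self.pos[:, : T + 1]
        if pad_mask is not None:
            # The summary token is never masked.
            keep_cls = torch.zeros(B, 1, dtype=torch.bool,
                                   device=pad_mask.device)
            pad_mask = torch.cat([keep_cls, pad_mask], dim=1)
        out = self.encoder(tokens, src_key_padding_mask=pad_mask)
        return out[:, 0]  # the summary token's output embedding
```

At inference, the padding mask produced by a sequential dropout scheme, or by genuinely shorter sequences, can be passed as `pad_mask` so that missing frames are excluded from attention.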
3. Dataset
3.1. Previous Datasets
Many datasets have been proposed for cross-view image geo-localization [40, 21, 43, 37]. Vo et al. [37] proposed a large-scale cross-view geo-localization dataset consisting of more than 1 million pairs of satellite and ground images; the authors collected aerial images from Google Maps and the corresponding ground images from Google Street View across eleven US cities. Workman et al. [40] proposed the Cross-View USA (CVUSA) dataset containing more than 1 million ground-level images across the whole USA. Later, Zhai et al. [41] refined the CVUSA dataset by pairing 44,416 aerial-ground images, and this refined version has become one of the most popular datasets in the field; in this paper, we refer to it as CVUSA. CVACT [21] follows the same structure as CVUSA and has the same number of training samples, but 10 times more testing pairs. Recently, Zhu et al. [43] proposed the VIGOR dataset, the first cross-view image geo-localization dataset without one-to-one correspondence, collected randomly from four major US cities. To support practical scenarios in which query and reference image pairs are not guaranteed to be perfectly aligned, VIGOR defines 'positive' and 'semi-positive' ground images within a single aerial image. Note that current cross-view geo-localization datasets cannot be easily converted into sequential datasets. To the best of our knowledge, no existing dataset provides sequential ground-level images and their corresponding aerial images for cross-view image geo-localization.
3.2. Proposed Dataset
Since existing cross-view geo-localization datasets [37, 40, 21, 43] contain only discrete ground images, we collected a new cross-view image sequence geo-localization dataset containing limited FOV images, which are far more widely available and applicable to real-world systems. Table 1 compares our proposed dataset with the existing cross-view image geo-localization datasets. Below, we first explain the procedure we followed to collect the ground-level images and then describe the process of capturing the aerial imagery.
3.2.1 Ground-Level Imagery
Our data was collected using the Fugro Automatic Road Analyzer (ARAN)¹, a road-data capture vehicle capable of collecting different data modalities such as images, LiDAR, and pavement laser scans. ARAN is also equipped with a GPS and an inertial measurement unit (IMU) for providing precise GPS locations and camera poses. The raw dataset covers over 5,000 km of urban and suburban roads and highways, driven in both directions, in the state of Vermont, US. In our dataset, we only used the frontal camera images, which have a resolution of 1920 × 1080. The distance between consecutive capture points is approximately 8 m, and the FOV of the camera is around 120°. The GPS location and camera heading (compass direction) are also provided for each ground-level image. To represent more real-world scenarios, our dataset contains approximately 70% of images from suburban areas and 30% from urban areas, collected from one-way or two-way driving directions. Around 30% of the data was collected in two-way driving directions, in which the same street is captured from both directions, for example, north-to-south and south-to-north. The total number of ground images is 118,549, resulting in 38,863 aerial pairs, as explained in the following sections. Our dataset covers around 500 km of roads in Vermont. Please refer to the supplementary material for more information.
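Because each frame carries a GPS fix, the roughly 8 m spacing between capture points can be checked, and long recordings cut wherever that spacing breaks, using the standard haversine distance. The helper below is a generic sketch, not code from the paper; the 20 m gap threshold is an illustrative assumption.

```python
import math

def haversine_m(lat1, lon1, lat2, lon2):
    """Great-circle distance in meters between two GPS fixes."""
    R = 6_371_000.0  # mean Earth radius in meters
    p1, p2 = math.radians(lat1), math.radians(lat2)
    dp = math.radians(lat2 - lat1)
    dl = math.radians(lon2 - lon1)
    a = math.sin(dp / 2) ** 2 + math.cos(p1) * math.cos(p2) * math.sin(dl / 2) ** 2
    return 2 * R * math.asin(math.sqrt(a))

def split_into_sequences(frames, gap_m=20.0):
    """Cut a GPS-ordered list of (lat, lon) frames wherever consecutive
    capture points are farther apart than gap_m (e.g., a recording gap)."""
    sequences, current = [], [frames[0]]
    for prev, cur in zip(frames, frames[1:]):
        if haversine_m(*prev, *cur) > gap_m:
            sequences.append(current)
            current = []
        current.append(cur)
    sequences.append(current)
    return sequences
```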
3.2.2 Sequence Formation
After obtaining the raw ground-level data as described in Section 3.2.1, long sequences of raw data should be segmented into shorter sequences.

¹ https://www.fugro.com/our-services/asset-integrity/roadware/equipment-and-software