images. This applies to all other countries as well and is even more noticeable in developing countries, where panoramic images are, for the most part, not available at all.
Due to recent advances in autonomous vehicles and Advanced Driver Assistance Systems (ADAS), frontal street-view videos are easily accessible from the dash cams of current vehicles. Instead of relying on scarce panoramic images [31, 35, 30], extending cross-view geo-localization algorithms to work on sequences of images is more practical and more applicable in real-world scenarios. However, current cross-view geo-localization approaches [31, 35, 30, 16, 32, 43, 37] deal mainly with a single image and cannot be used directly to capture the temporal structure that lies within a sequence of limited FOV frames. Thus, it is a natural extension to expand cross-view geo-localization methods to sequences of limited FOV images, a task we name cross-view image sequence geo-localization.
In this paper, we propose a new cross-view geo-localization approach that works on sequences of limited FOV images. Our model is trained end-to-end to capture the temporal feature representations that lie within the images for better geo-localization. Although our model is trained on fixed-length temporal sequences, it tackles the challenge of variable-length sequences during the inference phase through a novel sequential dropout scheme. To the best of our knowledge, we are the first to propose end-to-end cross-view geo-localization from sequences of images. We refer to this task as cross-view image sequence geo-localization. Furthermore, to facilitate future research in cross-view geo-localization from sequences, we put forward a new dataset and compare our proposed model with several recent baselines. In summary, our main contributions are as follows:
1) We propose a new end-to-end approach to cross-view image sequence geo-localization that geo-localizes a query sequence of limited FOV ground images against its corresponding aerial images.
2) We introduce the first large-scale cross-view image sequence geo-localization dataset.
3) We propose a novel temporal feature aggregation technique that learns an end-to-end feature representation from a sequence of limited FOV images for sequence geo-localization.
4) We propose a new sequence dropout method to predict coherent features on sequences of different lengths. The proposed dropout method helps regularize our model and achieves more robust results.
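The idea behind contribution 4 can be illustrated with a minimal sketch. The function below is our own assumption about how such a sequence dropout might be realized, not the paper's implementation: it zeroes the descriptors of randomly chosen frames and rescales the survivors (inverted dropout), so the downstream temporal aggregation is exposed to effectively shorter sequences during training. The name `sequence_dropout` and the NumPy formulation are illustrative.

```python
import numpy as np

def sequence_dropout(features, keep_prob=0.7, rng=None):
    """Drop whole frames from a (T, D) sequence of frame descriptors.

    Zeroing random frames during training (and rescaling the kept ones,
    as in inverted dropout) exposes the aggregation step to effectively
    shorter sequences, so the model predicts coherent features on
    variable-length inputs at inference time.
    """
    rng = np.random.default_rng(rng)
    t = features.shape[0]
    mask = rng.random(t) < keep_prob
    if not mask.any():                 # always keep at least one frame
        mask[rng.integers(t)] = True
    out = np.zeros_like(features)
    out[mask] = features[mask] / mask.mean()  # inverted-dropout rescaling
    return out
```

At inference, shorter sequences can simply be treated as longer ones with the missing frames dropped, which is what this training-time regularizer simulates.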
2. Related Work
Cross-view Image Geo-localization: Before the deep learning era, cross-view image geo-localization methods were based on hand-crafted features [19, 8] such as HoG [10], GIST [24], self-similarity [28], and color histograms. These conventional methods struggled with matching accuracy because of the limited quality of the features. Following the resurgence of deep learning in numerous computer vision applications, several deep learning based geo-localization methods [40, 20, 37] have been proposed to extract features from fine-tuned CNN models to improve cross-view
geo-localization accuracy. More recently, Hu et al. [16] proposed to aggregate features with a NetVLAD [3] layer, which achieved significant performance improvements. Shi et al. [32] proposed a feature transport module for aligning features from aerial-view and street-view images. Liu et al. [21] explored fusing orientation information into the model, which boosted performance. With the development of Generative Adversarial Networks (GANs) [14], Regmi et al. [26] proposed a GAN-based cross-view image geo-localization approach using a feature fusion training strategy. Zhu et al. [43] recently proposed a new approach (VIGOR) that does not require a one-to-one correspondence between ground images and aerial images.
It is also worth mentioning that some methods [30, 31, 35] based on ground-level panoramas employ the polar transformation, which bridges the domain gap between reference images and query images through prior geometric knowledge. By leveraging this geometric prior, Shi et al. [30] proposed Spatial Aware Feature Aggregation (SAFA), which improves the results on CVUSA [40] and CVACT [21] by a large margin. Similar to [26], Toker et al. [35] combined SAFA [30] with a GAN; their method achieved state-of-the-art results on CVUSA [40] and CVACT [21]. However, to perform the polar transformation, the query image is assumed to be aligned at the center of its reference aerial image, which is not always guaranteed in real-world scenarios. All of the above-mentioned methods rely on panoramic ground-level images; by contrast, our method uses more easily available limited FOV images.
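For concreteness, the polar transformation employed by [30, 31, 35] can be sketched as follows. This is a minimal NumPy version under simplifying assumptions (a square aerial image and nearest-neighbour sampling, whereas bilinear interpolation is the usual choice in practice); the function name and signature are illustrative, not taken from those works.

```python
import numpy as np

def polar_transform(aerial, height, width):
    """Resample a square S x S aerial image onto a (height, width) polar grid.

    Column j corresponds to the azimuth angle 2*pi*j/width around the
    image centre, and row i to a radius shrinking from S/2 (top row)
    towards 0 (bottom row).  The result has a panorama-like layout whose
    geometry roughly matches a ground-level 360-degree view, which is
    what lets these methods narrow the cross-view domain gap.
    """
    size = aerial.shape[0]
    i = np.arange(height)[:, None].astype(float)   # target rows (radius)
    j = np.arange(width)[None, :].astype(float)    # target cols (azimuth)
    radius = (size / 2.0) * (height - i) / height
    theta = 2.0 * np.pi * j / width
    rows = size / 2.0 - radius * np.cos(theta)
    cols = size / 2.0 + radius * np.sin(theta)
    ri = np.clip(np.rint(rows).astype(int), 0, size - 1)
    ci = np.clip(np.rint(cols).astype(int), 0, size - 1)
    return aerial[ri, ci]                          # nearest-neighbour sampling
```

Note how the transform hard-codes the assumption criticized above: every radius is measured from the exact centre of the aerial patch, so it only matches the ground view when the query camera sits at that centre.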
Some previous works [31, 37, 34] have studied the cross-view image geo-localization problem using a single limited FOV image as a query. Tian et al. [34] proposed a graph-based method that matches the buildings detected in both ground images and aerial images; this method is only applicable in metropolitan areas that contain dense buildings. DBL, proposed by Vo et al. [37], focuses on geo-localizing the scene in the image rather than the location of the camera. Dynamic Similarity Matching, proposed by Shi et al. [31], requires polar-transformed aerial images as input. Compared to these methods, we neither assume aligned ground-level images nor restrict our method to metropolitan areas. Furthermore, instead of geo-localizing a single limited FOV image, our approach geo-localizes a sequence of limited FOV images.
Recently, Regmi and Shah [27] proposed to geo-localize
video sequences in the same-view setting by using a geo-
temporal feature learning network and a trajectory smooth-