we benchmark a wide range of SOTA visual grounding methods on the RSVGD dataset. The existing methods can be divided into two-stage methods [16–31], one-stage methods [32–39], and transformer-based methods [40–45]. The experimental results show that transferring visual grounding methods designed for natural images to RS images yields only acceptable results. Although these methods have achieved success in the natural image domain, several challenges remain to be tackled for the RSVG task.
Based on the characteristics of RS imagery and visual grounding, we propose a Multi-Level Cross-Modal feature learning (MLCM) module, which effectively improves the performance of RSVG. First, unlike natural scene images, RS images are gathered from an overhead view by satellites, which results in large scale variations and cluttered backgrounds. Owing to these characteristics, models for RS tasks have to consider multi-scale inputs. Methods designed for natural images fail to fully account for multi-scale features, which leads to suboptimal results on RS imagery. In addition, the background of RS images contains numerous objects unrelated to the query, whereas natural images generally contain salient objects. Without filtering out redundant features, previous models struggle to understand RS image-expression pairs.
Therefore, we attempt to design a network that includes multi-scale fusion and adaptive filtering functions to refine visual features. Second, previous frameworks that extract visual and textual features in isolation do not conform to human perceptual habits, and such visual features lack the effective information needed for multi-modal reasoning. Motivated by the above discussion, we address the problem of learning fine-grained, semantically salient image representations from multi-scale visual feature inputs. Based on the cross-attention mechanism, the MLCM module first utilizes multi-scale visual features and multi-granularity textual embeddings to guide visual feature refinement and achieve multi-level cross-modal feature learning.
Considering that objects in an RS image are usually correlated, e.g., stadiums usually co-occur with ground track fields, MLCM further discovers the relations between object regions based on the self-attention mechanism. In short, our MLCM combines multi-level cross-modal learning and self-attention learning. Our contributions can be summarized as follows:
1) To foster research on RSVG, we design an automatic RS image-query generation method with manual assistance and construct a new large-scale dataset. Specifically, the new dataset contains 38,320 image-query pairs and 17,402 RS images.
2) We benchmark a wide range of SOTA natural image visual grounding methods on our RSVGD dataset. Based on the experimental results, we analyze the behavior of the different methods, which provides useful insights into the RSVG task.
3) To address the scale variation and cluttered backgrounds of RS images and to capture the rich contextual dependencies between semantically salient regions, a novel transformer-based MLCM module is devised to learn more discriminative visual representations (a minimal illustrative sketch is given after this list). MLCM can incorporate effective information from multi-level and multi-modal features, which enables our method to achieve competitive performance.
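To make the core idea of MLCM concrete, a minimal PyTorch-style sketch is given below. It only illustrates the general mechanism, i.e., text-guided cross-attention over multi-scale visual tokens followed by self-attention among the refined regions; the class name, dimensions, and fusion order are illustrative assumptions rather than the exact implementation, which is detailed in Section IV.

import torch
import torch.nn as nn

class MLCMSketch(nn.Module):
    def __init__(self, dim=256, heads=8):
        super().__init__()
        self.cross_attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.self_attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.norm1 = nn.LayerNorm(dim)
        self.norm2 = nn.LayerNorm(dim)

    def forward(self, visual_tokens, text_tokens):
        # visual_tokens: (B, Nv, dim), multi-scale visual features flattened
        # and concatenated along the token axis (assumed input format).
        # text_tokens: (B, Nt, dim), word- and sentence-level embeddings.
        # Step 1: text-guided refinement, visual tokens attend to the query.
        refined, _ = self.cross_attn(visual_tokens, text_tokens, text_tokens)
        refined = self.norm1(visual_tokens + refined)
        # Step 2: relation modeling among the refined region features.
        related, _ = self.self_attn(refined, refined, refined)
        return self.norm2(refined + related)

# Toy usage with assumed shapes.
v = torch.randn(2, 1024, 256)  # tokens gathered from several feature scales
t = torch.randn(2, 20, 256)    # query word embeddings
out = MLCMSketch()(v, t)       # (2, 1024, 256)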
This paper is organized as follows. We review the related work on natural image visual grounding in Section II. In Section III, the construction procedure of the new dataset is described and its characteristics are analyzed. In Section IV, we present our transformer-based RSVG method. Evaluation methods and extensive experimental results are presented in Section V. Finally, we conclude this work in Section VI.
II. RELATED WORK
In this section, we comprehensively review related work on natural image visual grounding. Specifically, two-stage, one-stage, and transformer-based methods are summarized as follows.
A. Two-stage Visual Grounding Methods
With the development of visual grounding, various two-
stage methods have been proposed. Yu et al. [17] introduced
better visual context feature extraction methods and found
that visual comparison with other objects in the image helps
to improve the performance. In [18], a Spatial Context Re-
current ConvNet (SCRC) is presented, which contains two
CNNs to extract local image features and global scene-
level contextual features. Zhang et al. [19] proposed a variational Bayesian method for complex visual context modeling. They also proposed a localization score function, a variational lower bound consisting of multimodal modules for three specific cues, which can be trained end-to-end with supervised or unsupervised losses. Hu et al. [20] parsed the natural language expression into three components, i.e., subject, relationship, and object, aligned these components with candidate regions, and used three corresponding modules to predict the score of each candidate region. Attention mechanisms
have been further introduced [21, 22] in each module to
better model the interaction between language expressions
and candidate regions. In addition, an attention mechanism [23] was utilized to reconstruct the input phrase, and a parallel attention network (ParalAttn) [24], including image-level and proposal-level attention, was proposed. Yu et al. [25] found that existing two-stage methods pay more attention to multi-modal representation and region proposal ranking. Therefore, they proposed DDPN to improve region proposal generation, considering both diversity and discrimination. Chen et al.
[26] designed a reinforcement learning mechanism to guide
the network to select more discriminative candidate boxes.
In addition to the above methods, NMTree [27] and RvG-Tree [28] utilized tree networks built by parsing the language. To capture object relation information, several researchers [29–31] constructed graphs. Yang et al. [29] and Wang et al. [30] proposed graph attention networks to accomplish visual grounding, while CMRIN [31] utilized a Gated Graph Convolutional Network to fuse multimodal information.
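Despite their differences, the above methods share a common two-stage skeleton: region proposals are generated first, and each proposal is then scored against the query embedding. The minimal sketch below illustrates only this generic skeleton with assumed feature dimensions and projection layers; it is not a reproduction of any specific cited method.

import torch
import torch.nn as nn
import torch.nn.functional as F

class TwoStageGroundingSketch(nn.Module):
    def __init__(self, region_dim=2048, text_dim=768, joint_dim=512):
        super().__init__()
        # Project both modalities into a joint embedding space (assumed design).
        self.region_proj = nn.Linear(region_dim, joint_dim)
        self.text_proj = nn.Linear(text_dim, joint_dim)

    def forward(self, region_feats, region_boxes, text_feat):
        # region_feats: (N, region_dim), features of N pre-extracted proposals
        # region_boxes: (N, 4), boxes from the first (proposal) stage
        # text_feat: (text_dim,), sentence-level embedding of the query
        r = F.normalize(self.region_proj(region_feats), dim=-1)
        t = F.normalize(self.text_proj(text_feat), dim=-1)
        scores = r @ t            # cosine similarity per proposal
        best = scores.argmax()    # second stage: rank and select
        return region_boxes[best], scores

# Toy usage with assumed feature dimensions.
boxes = torch.rand(100, 4)
feats = torch.randn(100, 2048)
query = torch.randn(768)
box, scores = TwoStageGroundingSketch()(feats, boxes, query)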