RSVG: Exploring Data and Models for Visual
Grounding on Remote Sensing Data
Yang Zhan, Zhitong Xiong, Member, IEEE, and Yuan Yuan, Senior Member, IEEE
Abstract—In this paper, we introduce the task of visual
grounding for remote sensing data (RSVG). RSVG aims to
localize the referred objects in remote sensing (RS) images with
the guidance of natural language. To retrieve rich information
from RS imagery using natural language, many research tasks,
like RS image visual question answering, RS image captioning,
and RS image-text retrieval have been extensively investigated. However,
the object-level visual grounding on RS images is still under-
explored. Thus, in this work, we propose to construct the
dataset and explore deep learning models for the RSVG task.
Specifically, our contributions can be summarized as follows.
1) We build a new large-scale benchmark dataset for RSVG,
termed RSVGD, to fully advance the research of RSVG. This new
dataset includes image/expression/box triplets for training and
evaluating visual grounding models. 2) We benchmark extensive
state-of-the-art (SOTA) natural image visual grounding methods
on the constructed RSVGD dataset, and some insightful analyses
are provided based on the results. 3) A novel transformer-based
Multi-Level Cross-Modal feature learning (MLCM) module is
proposed. Remotely sensed images usually exhibit large scale
variations and cluttered backgrounds. To deal with the scale-
variation problem, the MLCM module takes advantage of multi-
scale visual features and multi-granularity textual embeddings
to learn more discriminative representations. To cope with
the cluttered background problem, MLCM adaptively filters
irrelevant noise and enhances salient features. In this way, our
proposed model can incorporate more effective multi-level and
multi-modal features to boost performance. Furthermore, this
work also provides useful insights for developing better RSVG
models. The dataset and code will be publicly available at
https://github.com/ZhanYang-nwpu/RSVG-pytorch.
Index Terms—Visual grounding for remote sensing data
(RSVG), transformer, multi-level cross-modal feature learning
(MLCM).
I. INTRODUCTION
WITH the rapid development of remote sensing (RS)
technology, the quantity and resolution of RS images
have been greatly improved [1–3]. To efficiently process and
retrieve RS imagery, tasks of integrating natural language
and RS imagery have become a hot research topic. Although
there are many studies combining natural language processing
(NLP) with RS, like RS image captioning [4–6], RS image-
text retrieval [7–9], and RS image visual question answering
[10–12], the task of visual grounding for RS data (RSVG) is
still under-explored.
RSVG aims to localize the object referred to by the query
expression in RS images, as shown in Fig. 1. Given an RS
image and a natural language expression, RSVG is asked to
provide the referred object's bounding box. Query expressions
include phrases and sentences. Multimodal machine learning
(MML) [13, 14] enables computers to understand image-text
pairs. Therefore, RSVG makes it possible for ordinary users,
not limited to professionals or researchers, to retrieve objects
in RS images, realizing human-computer interaction. It has
wide application prospects in scenarios such as military target
detection, military intelligence generation, natural disaster
monitoring, agricultural production, search and rescue
activities, and urban planning [3, 4, 10].

Yang Zhan and Yuan Yuan are with the School of Artificial Intelligence,
Optics, and Electronics (iOPEN), Northwestern Polytechnical University,
Xi'an 710072, China (e-mail: y.yuan@nwpu.edu.cn).
Zhitong Xiong is with the Chair of Data Science in Earth Observation,
Technical University of Munich (TUM), 80333 Munich, Germany.

Fig. 1. Illustration of our task and approach. Top row: the input is an image-
query pair and the output is a bounding box of the referred object. Each pair
consists of an RS image and a query expression, and the query can be a phrase
or a sentence. Bottom row: our approach is an end-to-end transformer-based
framework with four steps: 1) multi-modal encoding, 2) multi-level cross-
modal feature learning, 3) multi-modal fusion, and 4) localizing.
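As a reading aid, the following PyTorch sketch shows how the four steps of Fig. 1 fit together. The concrete choices here (a ResNet-50 backbone, a single cross-attention layer standing in for the MLCM module, a two-layer fusion transformer, and an MLP box head) are illustrative assumptions, not the released implementation.

```python
# Minimal sketch of the four-step framework in Fig. 1 (illustrative, not the official code).
import torch
import torch.nn as nn
from torchvision.models import resnet50
from transformers import BertModel, BertTokenizer

class RSVGSketch(nn.Module):
    def __init__(self, d_model=256):
        super().__init__()
        # 1) multi-modal encoding: CNN for the image, BERT for the expression
        cnn = resnet50(weights=None)
        self.backbone = nn.Sequential(*list(cnn.children())[:-2])   # last conv feature map
        self.visual_proj = nn.Conv2d(2048, d_model, kernel_size=1)
        self.bert = BertModel.from_pretrained("bert-base-uncased")
        self.text_proj = nn.Linear(768, d_model)
        # 2) cross-modal feature learning (simplified stand-in for the MLCM module)
        self.cross_attn = nn.MultiheadAttention(d_model, 8, batch_first=True)
        # 3) multi-modal fusion
        fusion_layer = nn.TransformerEncoderLayer(d_model, 8, batch_first=True)
        self.fusion = nn.TransformerEncoder(fusion_layer, num_layers=2)
        # 4) localizing: regress one normalized box (cx, cy, w, h)
        self.box_head = nn.Sequential(nn.Linear(d_model, d_model), nn.ReLU(),
                                      nn.Linear(d_model, 4), nn.Sigmoid())

    def forward(self, image, input_ids, attention_mask):
        v = self.visual_proj(self.backbone(image))            # (B, d, H, W)
        v = v.flatten(2).transpose(1, 2)                       # (B, HW, d) visual tokens
        t = self.text_proj(self.bert(input_ids,
                                     attention_mask=attention_mask).last_hidden_state)
        v, _ = self.cross_attn(query=v, key=t, value=t)        # text-guided visual tokens
        fused = self.fusion(torch.cat([v, t], dim=1))          # joint multi-modal reasoning
        return self.box_head(fused.mean(dim=1))                # one box per image-query pair

tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
tokens = tokenizer(["the big baseball field"], return_tensors="pt", padding=True)
model = RSVGSketch()
box = model(torch.randn(1, 3, 800, 800), tokens["input_ids"], tokens["attention_mask"])
print(box.shape)   # torch.Size([1, 4])
```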
Since RSVG has high potential in real-world applications,
this paper explores the novel task and constructs a new large-
scale dataset. We build a benchmark dataset, named RSVGD,
using an automatic generation method with manual assistance.
The construction procedure is shown in Fig. 2, including four
steps: 1) box sampling, 2) attribute extraction, 3) expression
generation, and 4) worker verification. The RSVGD dataset is
sampled from the target detection dataset DIOR [15]. DIOR
is large-scale in terms of the number of object categories, object
instances, and images, and has significant object size varia-
tions, image quality variations, inter-class similarity, and intra-
class diversity. Thus, this new dataset provides researchers
with a good data source to foster the research of RSVG.
Specifically, RSVGD contains 38,320 RS image-query pairs
and 17,402 RS images, and the average length of expressions
is 7.47 words. Nowadays, natural image visual grounding has
developed significantly. To fully advance the task of RSVG,
we benchmark extensive SOTA visual grounding methods on
the RSVGD dataset. The existing methods can be divided into
two-stage methods [16–31], one-stage methods [32–39], and
transformer-based methods [40–45]. The experimental results
show that transferring visual grounding methods for natural
images to RS images yields only acceptable results. Even
though these methods have achieved success in the natural
image domain, several challenges remain to be tackled
for the RSVG task.
Based on the characteristics of RS imagery and visual
grounding, we propose a Multi-Level Cross-Modal feature
learning (MLCM) module, which effectively improves the per-
formance of RSVG. Firstly, unlike natural scene images, RS
images are gathered from an overhead view by satellites, which
results in large scale variations and cluttered backgrounds. Due
to these characteristics, models for RS tasks have to
consider multi-scale inputs. Methods designed for natural images
fail to fully take multi-scale features into account, which leads to
suboptimal results on RS imagery. In addition, the background
content of RS images contains numerous objects unrelated to
the query, whereas natural images generally have salient objects.
Because they lack a mechanism for filtering redundant features, previous
models struggle to understand RS image-expression pairs.
Therefore, we attempt to design a network that includes multi-
scale fusion and adaptive filtering functions to refine visual
features. Second, previous frameworks that extract visual
and textual features in isolation do not conform to human
perceptual habits, and such visual features lack the effective
information needed for multi-modal reasoning. Inspired by the
above discussion, we address the problem of how to learn
fine-grained semantically salient image representations under
multi-scale visual feature inputs. Based on the cross-attention
mechanism, the MLCM module first utilizes multi-scale visual
features and multi-granularity textual embeddings to guide
visual feature refinement and achieve multi-level cross-modal
feature learning. Considering that objects in an RS image
are usually correlated, e.g., stadiums usually co-occur with
ground track fields, MLCM discovers the relations between
object regions based on the self-attention mechanism. Specifically,
our MLCM includes multi-level cross-modal learning and
self-attention learning. To sum up, our contributions can be
summarized in the following aspects:
1) To foster the research of RSVG, we design an automatic
RS image-query generation method with manual assis-
tance, and a new large-scale dataset, RSVGD, is constructed.
Specifically, the new dataset contains 38,320 image-
query pairs and 17,402 RS images.
2) We benchmark extensive SOTA natural image visual
grounding methods on our RSVGD dataset. Based on
experimental results, some analyses about the effects
of different methods are given, which provide useful
insights on the RSVG task.
3) To address the problems of scale-variation and cluttered
background of RS images and capture the rich contextual
dependencies between semantically salient regions, a
novel transformer-based MLCM module is devised to
learn more discriminative visual representations. MLCM
can incorporate effective information from multi-level
and multi-modal features, which enables our method to
achieve competitive performance (a minimal sketch of the
idea is given after this list).
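To make the idea more concrete, the following PyTorch sketch illustrates the two ingredients described above: cross-attention between visual tokens and multi-granularity textual embeddings at several feature levels, followed by self-attention over the merged tokens. The layer sizes, residual and normalization choices, and fusion order are illustrative assumptions, not the exact MLCM design, which is presented in Section IV.

```python
# Illustrative sketch of multi-level cross-modal feature learning followed by self-attention.
import torch
import torch.nn as nn

class MLCMSketch(nn.Module):
    def __init__(self, d_model=256, num_heads=8, num_levels=3):
        super().__init__()
        # one cross-attention layer per visual feature level (assumption)
        self.cross_attn = nn.ModuleList(
            [nn.MultiheadAttention(d_model, num_heads, batch_first=True)
             for _ in range(num_levels)])
        self.self_attn = nn.MultiheadAttention(d_model, num_heads, batch_first=True)
        self.norm = nn.LayerNorm(d_model)

    def forward(self, visual_levels, word_feats, sent_feat):
        # visual_levels: list of (B, N_l, d) token maps at different scales
        # word_feats:    (B, L, d) word-level embeddings
        # sent_feat:     (B, 1, d) sentence-level embedding
        text = torch.cat([word_feats, sent_feat], dim=1)       # multi-granularity text
        refined = []
        for feats, attn in zip(visual_levels, self.cross_attn):
            # text-guided refinement: queries are visual tokens, keys/values are text,
            # so query-irrelevant background tokens receive weaker language support
            out, _ = attn(query=feats, key=text, value=text)
            refined.append(self.norm(feats + out))             # residual connection
        tokens = torch.cat(refined, dim=1)                      # merge all scales
        out, _ = self.self_attn(tokens, tokens, tokens)         # relations between regions
        return self.norm(tokens + out)

# toy usage with three feature levels of 400/100/25 tokens
levels = [torch.randn(2, n, 256) for n in (400, 100, 25)]
mlcm = MLCMSketch()
fused = mlcm(levels, torch.randn(2, 12, 256), torch.randn(2, 1, 256))
print(fused.shape)   # torch.Size([2, 525, 256])
```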
This paper is organized as follows. We review the related
work of natural image visual grounding in Section II. In
Section III, the construction procedure of the new dataset is
described and the characteristics are analyzed. In Section IV,
we present our transformer-based RSVG method. Evaluation
methods and extensive experimental results are shown in Section
V. Finally, we conclude this work in Section VI.
II. RELATED WORK
In this section, we comprehensively review the related
work on natural image visual grounding methods. To be
more specific, two-stage, one-stage, and transformer-based
methods are summarized in detail as follows.
A. Two-stage Visual Grounding Methods
With the development of visual grounding, various two-
stage methods have been proposed. Yu et al. [17] introduced
better visual context feature extraction methods and found
that visual comparison with other objects in the image helps
to improve the performance. In [18], a Spatial Context Re-
current ConvNet (SCRC) is presented, which contains two
CNNs to extract local image features and global scene-
level contextual features. Zhang et al. [19] proposed a varia-
tional Bayesian method for complex visual context modeling.
Besides, a localization score function was also proposed,
which is a variational lower bound consisting of multimodal
modules of three specific cues and can be trained end-to-
end using supervised or unsupervised losses. Hu et al. [20]
attempted to parse the natural language into three modules:
subject, relationship, and object, and align these components
to candidate regions. The three modules are used to predict
the scores of each candidate region. Attention mechanisms
have been further introduced [21, 22] in each module to
better model the interaction between language expressions
and candidate regions. In addition, the attention mechanism
[23] is utilized to reconstruct the input phrase and a parallel
attention network (ParalAttn) [24], including image-level and
proposal-level attention, is proposed. Yu et al. [25] found
that existing two-stage methods pay more attention to multi-
modal representation and region proposal ranking than to
region proposal generation. Therefore,
they proposed DDPN to improve region proposal generation,
considering both diversity and discrimination. Chen et al.
[26] designed a reinforcement learning mechanism to guide
the network to select more discriminative candidate boxes.
In addition to the above methods, NMTree [27] and RvG-
Tree [28] utilized tree networks by parsing the language.
To capture object relation information, several researchers
[29–31] constructed graphs. Yang et al. [29] and Wang et al.
[30] proposed graph attention networks to accomplish visual
grounding. CMRIN [31] utilized a Gated Graph Convolutional
Network to fuse multimodal information.
B. One-stage Visual Grounding Methods
One-stage methods are more computationally efficient and can
avoid error accumulation in multi-stage frameworks. Thus,
many one-stage methods have been investigated. Some works
use CNN and LSTM or Bi-LSTM to extract visual features
and textual features [32–34]. Multimodal Compact Bilinear
pooling (MCB) was first proposed in [32] to fuse the multi-
modal features. Chen et al. [33] designed a multimodal inter-
actor to summarize the complex relationship between visual
features and textual features. Besides, a new guided attention
mechanism was designed to focus visual attention on the
central area of the referred object. In [34], multi-scale features
are extracted and multi-modal features are fed to the fully
convolutional network to regress box coordinates. Significant
improvement was observed when Yang et al. [35] fused textual
embeddings with visual features from the YOLOv3 detector and augmented the
visual features with spatial features. Liao et al. [36] defined
the visual grounding problem as a correlation filtering process.
They mapped textual features into three filtering kernels and
performed correlation filtering on the image feature map. To
address the limitations of FAOA [35] in complex queries
for visual grounding, Yang et al. [37] proposed a recursive
sub-query construction (ReSC) network. The latest one-stage
methods [38, 39] focus on visual branching and use language
expression to guide the visual feature extraction. A landmark
feature convolution module [38] is designed to transmit visual
features under the guidance of language and encode spatial
relations between the object and its context. Liao et al. [39]
proposed a language-guided visual feature learning mechanism
to customize visual features in each stage and transfer them
to the next stage.
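As an illustration of the correlation-filtering formulation of [36] described above, the sketch below maps a sentence embedding to a dynamic convolution kernel and correlates it with the visual feature map. The single-kernel simplification (the original work uses three kernels), the dimensions, and the module name are assumptions for illustration, not the original implementation.

```python
# Generic sketch of language-to-kernel correlation filtering on an image feature map.
import torch
import torch.nn as nn
import torch.nn.functional as F

class CorrelationFilterSketch(nn.Module):
    def __init__(self, text_dim=768, feat_dim=256, kernel_size=1):
        super().__init__()
        # map the sentence embedding to a (feat_dim x k x k) filtering kernel
        self.to_kernel = nn.Linear(text_dim, feat_dim * kernel_size * kernel_size)
        self.k = kernel_size
        self.feat_dim = feat_dim

    def forward(self, feat_map, sent_embed):
        # feat_map: (B, C, H, W); sent_embed: (B, text_dim)
        B = feat_map.shape[0]
        kernels = self.to_kernel(sent_embed).view(B, 1, self.feat_dim, self.k, self.k)
        responses = []
        for b in range(B):   # per-sample dynamic convolution (correlation filtering)
            responses.append(F.conv2d(feat_map[b:b + 1], kernels[b], padding=self.k // 2))
        return torch.cat(responses, dim=0)   # (B, 1, H, W) response map; peak = referred region

heat = CorrelationFilterSketch()(torch.randn(2, 256, 25, 25), torch.randn(2, 768))
print(heat.shape)   # torch.Size([2, 1, 25, 25])
```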
C. Transformer-based Visual Grounding Methods
Recently, transformer-based methods have attracted more
and more research attention due to their high efficiency and
strong visual grounding performance. Du et al. [40] and Deng et al.
[41] proposed the earliest end-to-end transformer-based visual
grounding networks, i.e., VGTR and TransVG. VGTR [40] was
a transformer structure that can learn visual features under
the guidance of the expression. TransVG [41] was a network
stacked with multiple transformers, including BERT, a visual
transformer, and a multimodal fusion transformer. Some studies
[42, 43] proposed multi-task frameworks. Li and Sigal [42]
utilized a transformer encoder to refine visual and textual fea-
tures and designed a query encoder and decoder for referring
expression comprehension (REC) and segmentation (RES) at
the same time. Sun et al. [43] proposed the transformer model
for REC and referring expression generation (REG), which
uses the same cross-attention module and fusion module to
perform multi-modal interaction. Similar to the latest one-stage
methods, the latest transformer-based methods [44, 45] also
focus on the improvement of visual branches and adjusting
visual features by combining multi-modal features. VLTVG
[44] aims to adjust visual features with a visual-linguistic ver-
ification module and aggregate visual context with a language-
guided context encoder. The core of these modules is multi-
head attention. QRNet [45] contains a language query aware
dynamic attention mechanism and a language query aware
multi-scale fusion to adjust visual features.

Fig. 2. Illustration of the dataset construction process. Step 1: the red box
is the sampling result; yellow boxes are ignored. Step 2: attribute extraction
examples for the RS image shown in Step 1. Step 3: expression generation
examples for the same RS image. Step 4: the dataset is manually validated
using a data correction system.
III. DATASET CONSTRUCTION
In this section, we will introduce the construction procedure
of the new dataset in Section III-A. The statistical analysis of
our RSVGD is shown in Section III-B.
A. RSVGD: A New Dataset for RSVG
A dataset for RSVG requires a large number of RS images with
annotations and descriptions of different objects. Therefore,
we utilize the existing target detection dataset DIOR [15]
as the basic data to construct a new benchmark dataset.
Over the years, various visual grounding datasets [17, 46–
60] based on real-world and computer-generated images have
been proposed to study visual grounding. The construction
methods of these datasets can be divided into manual annotation
[46, 47, 49, 50, 55, 56, 60], game collection [17, 48, 51], and
automatic generation [52, 54, 57–59]. We design an automatic
image-query generation method with manual assistance to
collect image/expression/box triplets, as shown in Fig. 2.
A detailed description of the generation of different query
expressions is given in what follows.
Step 1: Box sampling. The DIOR dataset includes 23,463 RS
images, 192,472 object instances, and 20 object categories.
The image size is 800 × 800 pixels and the spatial resolution
ranges from 0.5 m to 30 m. First, the data containing annota-
tion errors in the DIOR dataset are removed, e.g., axis-aligned
bounding boxes whose coordinates satisfy xmin ≥ xmax or ymin ≥ ymax,
where (xmin, ymin, xmax, ymax) denotes the coordinates of the ground-
truth bounding box. Then, bounding boxes that are less than
0.02% or greater than 99% of the image size are also removed.
Finally, we sample no more than 5 objects of the same category
in each RS image to avoid ambiguous references caused by
many objects of the same category in one image.
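A minimal Python sketch of the above filtering rules is given below. The dictionary-based annotation format is hypothetical, "image size" is interpreted here as image area, and the 0.02%/99% thresholds and the limit of five objects per category follow the text.

```python
# Illustrative sketch of Step-1 box sampling for DIOR-style axis-aligned boxes.
import random
from collections import defaultdict

IMG_AREA = 800 * 800              # DIOR images are 800 x 800 pixels
MIN_RATIO, MAX_RATIO = 0.0002, 0.99   # 0.02% and 99% of the image area (assumed interpretation)
MAX_PER_CATEGORY = 5

def sample_boxes(annotations):
    """annotations: list of dicts with keys 'category' and 'bbox' = (xmin, ymin, xmax, ymax)."""
    valid = []
    for ann in annotations:
        xmin, ymin, xmax, ymax = ann["bbox"]
        if xmin >= xmax or ymin >= ymax:            # drop annotation errors
            continue
        area_ratio = (xmax - xmin) * (ymax - ymin) / IMG_AREA
        if not (MIN_RATIO <= area_ratio <= MAX_RATIO):   # drop too-small / too-large boxes
            continue
        valid.append(ann)
    # keep at most 5 objects per category to avoid ambiguous references
    by_category = defaultdict(list)
    for ann in valid:
        by_category[ann["category"]].append(ann)
    sampled = []
    for anns in by_category.values():
        sampled.extend(random.sample(anns, min(len(anns), MAX_PER_CATEGORY)))
    return sampled
```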
Step 2: Attribute extraction. By analyzing visual ground-
ing datasets from the real world, such as ReferItGame [48],
RefCOCO [17], and RefCOCO+ [17], a set of attributes
widely contained in referring expressions is summarized. We
extract the attribute set and define it as a 7-tuple A =
{a1, a2, a3, a4, a5, a6, a7}. The symbol, type, and example of
each attribute are shown in Table I. The object category can
be obtained directly from the DIOR dataset. The HSV color
recognition method is used to obtain the object’s color. Object
size is measured by the ratio of bounding box area to image
size. The geometry attribute is set in advance for some objects
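The color and size attributes described above can be illustrated with the following sketch. The specific HSV hue ranges and size cut-offs are assumptions for illustration and are not the thresholds used to build RSVGD.

```python
# Illustrative sketch of Step-2 color (HSV statistics) and size (area ratio) attributes.
import cv2
import numpy as np

def color_attribute(image_bgr, bbox):
    """Return a coarse color name for the object inside bbox using HSV statistics."""
    xmin, ymin, xmax, ymax = bbox
    crop = cv2.cvtColor(image_bgr[ymin:ymax, xmin:xmax], cv2.COLOR_BGR2HSV)
    hue = int(np.median(crop[:, :, 0]))        # OpenCV hue range is [0, 179]
    sat = int(np.median(crop[:, :, 1]))
    if sat < 40:
        return "gray"                           # low saturation: treat as achromatic
    if hue < 10 or hue >= 170:
        return "red"
    if hue < 25:
        return "orange"
    if hue < 35:
        return "yellow"
    if hue < 85:
        return "green"
    if hue < 130:
        return "blue"
    return "purple"

def size_attribute(bbox, img_w=800, img_h=800):
    """Map the box-to-image area ratio to a coarse size word (thresholds assumed)."""
    xmin, ymin, xmax, ymax = bbox
    ratio = (xmax - xmin) * (ymax - ymin) / (img_w * img_h)
    return "small" if ratio < 0.02 else "big" if ratio > 0.2 else "medium"
```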