we benchmark a wide range of SOTA visual grounding methods on the RSVGD dataset. The existing methods can be divided into two-stage methods [16–31], one-stage methods [32–39], and transformer-based methods [40–45]. The experimental results show that transferring visual grounding methods designed for natural images to RS images yields only acceptable results. Although these methods have achieved success in the natural image domain, several challenges remain to be tackled for the RSVG task.
Based on the characteristics of RS imagery and visual grounding, we propose a Multi-Level Cross-Modal feature learning (MLCM) module, which effectively improves the performance of RSVG. First, unlike natural scene images, RS images are gathered from an overhead view by satellites, which results in large scale variations and cluttered backgrounds. Owing to these characteristics, models for RS tasks have to consider multi-scale inputs. Methods designed for natural images fail to fully account for multi-scale features, which leads to suboptimal results on RS imagery. In addition, the background of RS images contains numerous objects unrelated to the query, whereas natural images generally contain salient objects. Without filtering out redundant features, previous models struggle to understand RS image-expression pairs.
Therefore, we attempt to design a network that includes multi-scale fusion and adaptive filtering functions to refine visual features. Second, previous frameworks that extract visual and textual features in isolation do not conform to human perceptual habits, and such visual features lack the effective information needed for multi-modal reasoning. Motivated by the above discussion, we address the problem of learning fine-grained, semantically salient image representations from multi-scale visual feature inputs. Based on the cross-attention mechanism, the MLCM module first utilizes multi-scale visual features and multi-granularity textual embeddings to guide visual feature refinement and achieve multi-level cross-modal feature learning.
Considering that objects in an RS image are usually correlated, e.g., stadiums usually co-occur with ground track fields, MLCM further discovers the relations between object regions based on the self-attention mechanism. In short, our MLCM combines multi-level cross-modal learning and self-attention learning. Our contributions can be summarized as follows:
1) To foster research on RSVG, we design an automatic RS image-query generation method with manual assistance and construct a new large-scale dataset. Specifically, the new dataset contains 38,320 image-query pairs and 17,402 RS images.
2) We benchmark a wide range of SOTA natural image visual grounding methods on our RSVGD dataset. Based on the experimental results, we analyze the behavior of the different methods, which provides useful insights into the RSVG task.
3) To address the scale variation and cluttered backgrounds of RS images and to capture the rich contextual dependencies between semantically salient regions, a novel transformer-based MLCM module is devised to learn more discriminative visual representations (a minimal illustrative sketch is given after this list). MLCM can incorporate effective information from multi-level and multi-modal features, which enables our method to achieve competitive performance.
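To make the core idea of MLCM concrete, a minimal PyTorch-style sketch is given below. It only illustrates the general mechanism, i.e., text-guided cross-attention over multi-scale visual tokens followed by self-attention among the refined regions; the class name, dimensions, and fusion order are illustrative assumptions rather than the exact implementation, which is detailed in Section IV.

import torch
import torch.nn as nn

class MLCMSketch(nn.Module):
    def __init__(self, dim=256, heads=8):
        super().__init__()
        self.cross_attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.self_attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.norm1 = nn.LayerNorm(dim)
        self.norm2 = nn.LayerNorm(dim)

    def forward(self, visual_tokens, text_tokens):
        # visual_tokens: (B, Nv, dim), multi-scale visual features flattened
        # and concatenated along the token axis (assumed input format).
        # text_tokens: (B, Nt, dim), word- and sentence-level embeddings.
        # Step 1: text-guided refinement, visual tokens attend to the query.
        refined, _ = self.cross_attn(visual_tokens, text_tokens, text_tokens)
        refined = self.norm1(visual_tokens + refined)
        # Step 2: relation modeling among the refined region features.
        related, _ = self.self_attn(refined, refined, refined)
        return self.norm2(refined + related)

# Toy usage with assumed shapes.
v = torch.randn(2, 1024, 256)  # tokens gathered from several feature scales
t = torch.randn(2, 20, 256)    # query word embeddings
out = MLCMSketch()(v, t)       # (2, 1024, 256)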
This paper is organized as follows. We review the related work on natural image visual grounding in Section II. In Section III, the construction procedure of the new dataset is described and its characteristics are analyzed. In Section IV, we present our transformer-based RSVG method. Evaluation methods and extensive experimental results are presented in Section V. Finally, we conclude this work in Section VI.
II. RELATED WORK
In this section, we comprehensively review related work on natural image visual grounding. Specifically, two-stage, one-stage, and transformer-based methods are summarized as follows.
A. Two-stage Visual Grounding Methods
With the development of visual grounding, various two-
stage methods have been proposed. Yu et al. [17] introduced
better visual context feature extraction methods and found
that visual comparison with other objects in the image helps
to improve the performance. In [18], a Spatial Context Re-
current ConvNet (SCRC) is presented, which contains two
CNNs to extract local image features and global scene-
level contextual features. Zhang et al. [19] proposed a variational Bayesian method for complex visual context modeling. They also proposed a localization score function, a variational lower bound consisting of multimodal modules for three specific cues, which can be trained end-to-end with supervised or unsupervised losses. Hu et al. [20] parsed the natural language expression into three components, i.e., subject, relationship, and object, aligned these components with candidate regions, and used three corresponding modules to predict the score of each candidate region. Attention mechanisms
have been further introduced [21, 22] in each module to
better model the interaction between language expressions
and candidate regions. In addition, an attention mechanism [23] was utilized to reconstruct the input phrase, and a parallel attention network (ParalAttn) [24], including image-level and proposal-level attention, was proposed. Yu et al. [25] found that existing two-stage methods pay more attention to multi-modal representation and region proposal ranking. Therefore, they proposed DDPN to improve region proposal generation, considering both diversity and discrimination. Chen et al.
[26] designed a reinforcement learning mechanism to guide
the network to select more discriminative candidate boxes.
In addition to the above methods, NMTree [27] and RvG-Tree [28] utilized tree networks built by parsing the language. To capture object relation information, several researchers [29–31] constructed graphs. Yang et al. [29] and Wang et al. [30] proposed graph attention networks to accomplish visual grounding, while CMRIN [31] utilized a Gated Graph Convolutional Network to fuse multimodal information.
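Despite their differences, the above methods share a common two-stage skeleton: region proposals are generated first, and each proposal is then scored against the query embedding. The minimal sketch below illustrates only this generic skeleton with assumed feature dimensions and projection layers; it is not a reproduction of any specific cited method.

import torch
import torch.nn as nn
import torch.nn.functional as F

class TwoStageGroundingSketch(nn.Module):
    def __init__(self, region_dim=2048, text_dim=768, joint_dim=512):
        super().__init__()
        # Project both modalities into a joint embedding space (assumed design).
        self.region_proj = nn.Linear(region_dim, joint_dim)
        self.text_proj = nn.Linear(text_dim, joint_dim)

    def forward(self, region_feats, region_boxes, text_feat):
        # region_feats: (N, region_dim), features of N pre-extracted proposals
        # region_boxes: (N, 4), boxes from the first (proposal) stage
        # text_feat: (text_dim,), sentence-level embedding of the query
        r = F.normalize(self.region_proj(region_feats), dim=-1)
        t = F.normalize(self.text_proj(text_feat), dim=-1)
        scores = r @ t            # cosine similarity per proposal
        best = scores.argmax()    # second stage: rank and select
        return region_boxes[best], scores

# Toy usage with assumed feature dimensions.
boxes = torch.rand(100, 4)
feats = torch.randn(100, 2048)
query = torch.randn(768)
box, scores = TwoStageGroundingSketch()(feats, boxes, query)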