tics of image-to-text, which is essential but has
received little attention previously. Spatial seman-
tics is a fundamental aspect of both language and
image interpretation in relation to human cogni-
tion (Zlatev, 2007), and it has shown great value in spatially-oriented applications such as automatic navigation, personal assistance, and unmanned manipulation (Irshad et al., 2021; Raychaudhuri et al., 2021; Zeng et al., 2018). Here, we introduce a new
task, Visual Spatial Description (VSD), which gen-
erates text pieces to describe the spatial semantics
in the image. The task takes as input an image with two specified objects and outputs a sentence that describes the detailed spatial relation between the objects. We manually annotate a dataset to benchmark this task.
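For concreteness, a single VSD instance might be represented as sketched below; this is a minimal illustrative example, and the field names and values are hypothetical rather than the actual annotation schema of our dataset.

```python
# A hypothetical VSD instance (illustrative only; not the dataset's real schema).
vsd_example = {
    "image": "kitchen_001.jpg",   # input image
    "object_1": "cup",            # first specified object
    "object_2": "table",          # second specified object
    # target output: one sentence describing the spatial relation of the two objects
    "description": "A white cup is sitting on top of the wooden table.",
}
```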
VSD is a typical vision-language generation
problem that can be addressed by multi-modal
encoder-decoder modeling. Multi-modal models accept both visual and linguistic inputs and encode them into a joint representation that captures information from both modalities. Moreover, recent studies show that vision-language pretraining brings remarkable gains on most image-to-text tasks (Lu et al., 2019; Sun et al., 2019; Tan and Bansal, 2019; Zhou et al., 2020; Li et al., 2019; Hu and Singh, 2021; Li et al., 2021; Xiao et al., 2022). Here, we follow this line of work and adopt VL-BART and VL-T5 (Cho et al., 2021) as backbones, which exhibit state-of-the-art performance in vision-language generation.
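As a rough illustration of this setup, the sketch below shows how region-level visual features and an embedded text prompt could be concatenated and passed to a generic pretrained sequence-to-sequence backbone; the class, argument, and method names are placeholders following a common Hugging Face-style seq2seq interface, not the actual VL-BART/VL-T5 code.

```python
import torch
from torch import nn

class VSDGenerator(nn.Module):
    """Schematic vision-language encoder-decoder for VSD (illustrative only)."""

    def __init__(self, seq2seq_backbone, visual_dim=2048, hidden_dim=768):
        super().__init__()
        self.backbone = seq2seq_backbone                      # pretrained text seq2seq model
        self.visual_proj = nn.Linear(visual_dim, hidden_dim)  # project detector features

    def forward(self, region_feats, prompt_embeds, labels=None):
        # region_feats:  (batch, n_regions, visual_dim) object-detector features,
        #                with the two queried objects marked by the data loader
        # prompt_embeds: (batch, n_tokens, hidden_dim) embedded text prompt
        visual_tokens = self.visual_proj(region_feats)
        inputs_embeds = torch.cat([visual_tokens, prompt_embeds], dim=1)
        # The decoder generates the spatial description conditioned on both modalities.
        return self.backbone(inputs_embeds=inputs_embeds, labels=labels)
```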
In particular, a closely-related task, visual spatial
relationship classification (VSRC), which outputs
the spatial relationship between two objects inside
an image, might be beneficial for our proposed
VSD. The predefined discrete spatial relations in VSRC, such as “next to” and “behind”, should be able to effectively guide VSD generation. To this end, we first make a thorough comparison of the connections between VSRC and VSD, which can be regarded as shallow and deep analyses of spatial semantics, respectively, and then investigate VSRC-enhanced VSD models, performing visual spatial understanding from shallow to deep. Specifically, we present two straightforward architectures that integrate VSRC into VSD: one adopts a pipeline strategy and the other an end-to-end joint strategy.
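To make the two strategies concrete, the following sketch contrasts them: the pipeline variant first predicts a relation label and injects it into the text prompt before generation, whereas the joint variant shares the backbone between a VSRC classification head and the VSD decoder and sums the two losses. All names, the prompt format, and the pooling choice are illustrative assumptions (following a Hugging Face-style seq2seq interface) rather than our actual implementation.

```python
import torch
from torch import nn

def pipeline_vsd(vsrc_classifier, vsd_generator, region_feats, obj1, obj2):
    """Pipeline strategy (sketch): classify the relation first, then condition
    the generator on the predicted label through the text prompt."""
    relation = vsrc_classifier.predict(region_feats, obj1, obj2)  # e.g., "next to"
    prompt = f"describe: {obj1} [{relation}] {obj2}"              # hypothetical prompt format
    return vsd_generator.generate(region_feats, prompt)

class JointVSD(nn.Module):
    """End-to-end joint strategy (sketch): a shared encoder-decoder backbone with
    an extra VSRC head; the classification and generation losses are summed."""

    def __init__(self, shared_backbone, hidden_dim=768, n_relations=9):
        super().__init__()
        self.backbone = shared_backbone                       # seq2seq backbone (e.g., VL-T5-like)
        self.vsrc_head = nn.Linear(hidden_dim, n_relations)   # n_relations is a placeholder

    def forward(self, inputs_embeds, labels, relation_labels):
        out = self.backbone(inputs_embeds=inputs_embeds, labels=labels)
        # Mean-pool the encoder states for relation classification.
        pooled = out.encoder_last_hidden_state.mean(dim=1)
        vsrc_loss = nn.functional.cross_entropy(self.vsrc_head(pooled), relation_labels)
        return out.loss + vsrc_loss                           # joint multi-task objective
```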
Finally, we conduct experiments on our con-
structed dataset to evaluate all proposed models.
First, we examine the two VSD-only models built directly on VL-BART and VL-T5. The results show that the two models are comparable in performance, and both can produce highly accurate, fluent, human-like descriptions of spatial semantics. Second, we verify the effectiveness of VSRC for VSD and find that: (1) VSRC has great potential for VSD, since gold-standard VSRC labels lead to striking improvements on VSD; (2) VSD can also benefit from automatic VSRC, where the end-to-end joint framework performs slightly better than the pipeline. We further perform several analyses to gain a deeper understanding of VSD and the proposed models.
2 Related Work
Image-to-text has been intensively investigated with the support of neural networks in recent years (He and Deng, 2017). The encoder-decoder architecture is a commonly adopted framework, where the encoder extracts visual features from the image and the decoder generates text for specific tasks. Early works employ a convolutional neural network (CNN) as the visual encoder and a recurrent neural network (RNN) as the text decoder (Vinyals et al., 2015; Rennie et al., 2017). Recently, the Transformer (Vaswani et al., 2017), which is highly effective for representation learning in both vision and language, has gained increasing interest. Transformer-based encoder-decoder models have been adopted in a wide range of image-to-text tasks (Cornia et al., 2020; Herdade et al., 2019; Fei et al., 2021a). These models, coupled with vision-language pretraining, have achieved top performance on these tasks (Lu et al., 2019; Sun et al., 2019; Tan and Bansal, 2019; Zhou et al., 2020; Li et al., 2019; Hu and Singh, 2021; Li et al., 2021). In this work, we exploit the Transformer-based architecture and two pretrained vision-language models, VL-BART and VL-T5 (Cho et al., 2021), obtaining several strong benchmark models for our task.
Image-to-text tasks vary depending on the objective of the visual description. Image captioning is the most well-studied task, which aims to summarize a given image or to describe a particular region in it (Karpathy and Fei-Fei, 2015; Vinyals et al., 2015). Several subsequent studies have attempted to produce captions with specified patterns and styles (Cornia et al., 2019; Kim et al., 2019; Deng et al., 2020; Zhong et al., 2020; Zheng et al., 2019).
For example, VQA and visual reasoning can be
regarded as such attempts, which are conditioned