Visual Spatial Description: Controlled Spatial-Oriented Image-to-Text
Generation
Yu Zhao1, Jianguo Wei1, Zhichao Lin2, Yueheng Sun1, Meishan Zhang3, Min Zhang3
1College of Intelligence and Computing, Tianjin University
2School of New Media and Communication, Tianjin University
3Institute of Computing and Intelligence, Harbin Institute of Technology (Shenzhen)
{zhaoyucs,jianguo,chaosmyth,yhs}@tju.edu.cn,
{zhangmeishan,zhangmin2021}@hit.edu.cn
Abstract
Image-to-text tasks, such as open-ended image
captioning and controllable image description,
have received extensive attention for decades.
Here, we further advance this line of work by
presenting Visual Spatial Description (VSD), a
new perspective for image-to-text toward spa-
tial semantics. Given an image and two objects
inside it, VSD aims to produce one description
focusing on the spatial perspective between the
two objects. Accordingly, we manually an-
notate a dataset to facilitate the investigation
of the newly-introduced task and build several
benchmark encoder-decoder models by using
VL-BART and VL-T5 as backbones. In ad-
dition, we investigate pipeline and joint end-
to-end architectures for incorporating visual
spatial relationship classification (VSRC) in-
formation into our model. Finally, we con-
duct experiments on our benchmark dataset to
evaluate all our models. Results show that
our models perform well, providing accurate
and human-like spatial-oriented text descrip-
tions. Meanwhile, VSRC has great potential
for VSD, and the joint end-to-end architecture
is the better choice for their integration. We
make the dataset and codes public for research
purposes.1
1 Introduction
Text generation from images is a widely-adopted
means for deep understanding of cross-modal data
that has received increasing interest from both computer
vision (CV) and natural language processing
(NLP) communities (He and Deng, 2017). Image-
to-text tasks generate natural language texts to as-
sist in understanding the scene meaning of a spe-
cific image, which might be beneficial for a variety
of applications such as image retrieval (Diao et al.,
2021;Ahmed et al.,2021), perception assistance
(Xu et al., 2018; Shashirangana et al., 2021), pedestrian detection (Hasan et al., 2021), and medical systems (Miura et al., 2021).

Task | Condition | Target Text
Image Captioning | — | A man is walking past a car.
VSR-guided Captioning | walk; Arg, Loc | A man is walking across a street.
Visual Question Answering | What color is the car? | The car is red.
Our Task: VSD | man, car | A man is walking behind a red car from right to left.
Our Task: VSD | car, pole | A red car is parked to the left of a pole.
Figure 1: A comparison of three example image-to-text generation tasks and the proposed VSD in this work.

∗Corresponding author.
1https://github.com/zhaoyucs/VSD
Image-to-text tasks take on various forms when
serving different purposes. Figure 1 illustrates
a comparison of three example tasks. First, the
generic open-ended image captioning aims to provide
a summarized description of an input image,
reflecting the overall understanding of
the image (Lindh et al., 2020; Vinyals et al., 2015;
Ji et al.,2020). Furthermore, the verb-specific se-
mantic roles (VSR) guided captioning (Chen et al.,
2021) and visual question answering (VQA) (Antol
et al.,2015) are two examples of controllable image
description, which produce human-like and styl-
ized descriptions under specified conditions based
on a thorough comprehension of the input image
(Chen et al.,2021;Fei et al.,2021b;Mathews et al.,
2018;Cornia et al.,2019;Lindh et al.,2020;Pont-
Tuset et al.,2020;Deng et al.,2020;Zhong et al.,
2020;Kim et al.,2019;Chen et al.,2020a;Fei
et al.,2022;Jhamtani and Berg-Kirkpatrick,2018).
The VSR-guided captioning produces a description
focusing on a verb with specified semantic roles,
and the VQA generates a reasoning answer based
on a given question.
In this work, we extend the line of controllable
image description by addressing the spatial semantics
of image-to-text generation, which is essential but has
received little attention previously. Spatial seman-
tics is a fundamental aspect of both language and
image interpretation in relation to human cogni-
tion (Zlatev,2007), and it has shown great value in
spatial-based applications such as automatic navi-
gation, personal assistance, and unmanned manip-
ulation (Irshad et al.,2021;Raychaudhuri et al.,
2021;Zeng et al.,2018). Here, we introduce a new
task, Visual Spatial Description (VSD), which gen-
erates text pieces to describe the spatial semantics
in the image. The task takes an image with two
specified objects in it as input and outputs one
sentence that describes the detailed spatial relation
between the objects. We manually annotate a dataset
to benchmark this task.
VSD is a typical vision-language generation
problem that can be addressed by multi-modal
encoder-decoder modeling. Multi-modal models
accept both visual and linguistic inputs and encode
them into a joint representation that captures information
from both modalities. Moreover, recent
studies show that vision-language pretraining can
bring remarkable achievements in most image-to-
text tasks (Lu et al.,2019;Sun et al.,2019;Tan
and Bansal,2019;Zhou et al.,2020;Li et al.,2019;
Hu and Singh,2021;Li et al.,2021;Xiao et al.,
2022). Here, we follow this line of work and adopt
VL-BART and VL-T5 (Cho et al., 2021) as back-
bones, which exhibit state-of-the-art performance
in vision-language generation.
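To make the input format concrete, the sketch below shows one way a multi-modal encoder input for VSD could be assembled, assuming pre-extracted detector region features and a plain-text prompt naming the two query objects; the class name, projection dimensions, and prompt wording are illustrative assumptions rather than the released implementation.

```python
import torch
import torch.nn as nn


class VSDInputBuilder(nn.Module):
    """Hypothetical helper that merges visual region features and text
    embeddings into one encoder input sequence (a sketch, not the authors' code)."""

    def __init__(self, visual_dim: int = 2048, hidden_dim: int = 768):
        super().__init__()
        # project detector region features (e.g. 2048-d Faster R-CNN features)
        # into the embedding space shared with the text tokens
        self.visual_proj = nn.Linear(visual_dim, hidden_dim)

    def forward(self, region_feats: torch.Tensor, text_embeds: torch.Tensor) -> torch.Tensor:
        # region_feats: (num_regions, visual_dim); text_embeds: (seq_len, hidden_dim)
        visual_embeds = self.visual_proj(region_feats)
        # concatenate visual tokens and text tokens into one joint sequence
        return torch.cat([visual_embeds, text_embeds], dim=0)


# Hypothetical usage for the object pair <man, car>:
builder = VSDInputBuilder()
region_feats = torch.randn(36, 2048)       # 36 detected regions
text_embeds = torch.randn(12, 768)         # embedded prompt tokens, e.g.
                                           # "describe the spatial relation between man and car"
encoder_input = builder(region_feats, text_embeds)   # shape: (48, 768)
```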
In particular, a closely-related task, visual spatial
relationship classification (VSRC), which outputs
the spatial relationship between two objects inside
an image, might be beneficial for our proposed
VSD. The predefined discrete spatial relations in VSRC,
such as “next to” and “behind”, should be able
to effectively guide VSD generation. To this
end, we first make a thorough comparison of the
connections between VSD and VSRC, which can
be regarded as shallow and deep analyses of spatial
semantics, respectively, and further investigate the
VSRC-enhanced VSD models, performing visual
spatial understanding from shallow to deep. Specifically,
we present two straightforward architectures
to integrate VSRC into VSD: one is a pipeline
strategy, and the other is an end-to-end joint
strategy.
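As a rough illustration of the two strategies (not the paper's exact architectures), the sketch below treats the VSRC classifier and the VL-BART/VL-T5 generator as black-box callables; the function names, prompt formats, and the "relation : description" output convention of the joint model are assumptions made for this example.

```python
def pipeline_vsd(image, obj1, obj2, classify_relation, generate_description):
    """Pipeline strategy: first predict a discrete spatial relation (VSRC),
    then inject it into the prompt that conditions the VSD generator."""
    relation = classify_relation(image, obj1, obj2)      # e.g. "behind"
    prompt = f"{obj1} {relation} {obj2}"                 # relation passed on as text
    return generate_description(image, prompt)


def joint_vsd(image, obj1, obj2, generate_sequence):
    """Joint end-to-end strategy: a single decoder emits the relation and the
    description in one sequence, so VSRC supervision is learned jointly with VSD."""
    prompt = f"{obj1} [SEP] {obj2}"
    output = generate_sequence(image, prompt)            # e.g. "behind : A man is walking ..."
    relation, description = output.split(":", 1)
    return relation.strip(), description.strip()
```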
Finally, we conduct experiments on our con-
structed dataset to evaluate all proposed models.
First, we examine the two start-up models for VSD
only with VL-BART and VL-T5. The results show
that the two models are comparable in terms of
performance, and both models can provide highly
accurate and fluent human-like outputs of spatial
understanding. Second, we verify the effective-
ness of VSRC for VSD and find that: (1) VSRC
has great potential for VSD, as gold-standard
VSRC labels lead to striking improvements on VSD;
(2) VSD can also benefit from automatic VSRC,
and the end-to-end joint framework is slightly bet-
ter. We further perform several analyses to inten-
sively understand VSD and the proposed models.
2 Related Work
Image-to-text has been intensively investigated
with the support of neural networks in recent
years (He and Deng, 2017). The encoder-decoder
architecture is a commonly adopted framework,
where the encoder extracts visual features from
the image and the decoder generates text for spe-
cific tasks. Early works employ a convolutional
neural network (CNN) as the visual encoder and a
recurrent neural network (RNN) as the text decoder
(Vinyals et al.,2015;Rennie et al.,2017). Re-
cently, the Transformer neural network (Vaswani
et al.,2017), which is impressively powerful in
feature representation learning on both vision and
language, has gained increasing interest. The
Transformer-based encoder-decoder models have
been adopted in a wide range of image-to-text
tasks (Cornia et al.,2020;Herdade et al.,2019;Fei
et al.,2021a). These models coupled with visual-
language pretraining have achieved the top perfor-
mance for these tasks (Lu et al.,2019;Sun et al.,
2019;Tan and Bansal,2019;Zhou et al.,2020;Li
et al.,2019;Hu and Singh,2021;Li et al.,2021). In
this work, we exploit the Transformer-based archi-
tecture and two pretrained visual-language models:
VL-BART and VL-T5 (Cho et al.,2021), reaching
several strong benchmark models for our task.
Image-to-text tasks vary depending on the objective
of the visual description. Image captioning
is the most well-studied task, which aims to summa-
rize a given image or to describe a particular region
in it (Karpathy and Fei-Fei,2015;Vinyals et al.,
2015). Several subsequent studies have attempted
to produce captions with specified patterns and
styles (Cornia et al.,2019;Kim et al.,2019;Deng
et al.,2020;Zhong et al.,2020;Zheng et al.,2019).
For example, VQA and visual reasoning can be
regarded as such attempts, which are conditioned
by a specific question directed at the input image
(Antol et al.,2015;Agrawal et al.,2018;Hudson
and Manning,2019;Johnson et al.,2017). The
VSR-guided image captioning (Chen et al., 2021)
is the closest to our work; it generates a
sentence for a particular event in the image with
well-specified semantic roles. Here we focus on
spatial semantics instead, generating a description
based on the spatial relationship.
Spatial semantics is an important topic in both
language and visual analysis. Kordjamshidi et al.
(2011) propose a preliminary study on text-based
spatial role labeling. Later, spatial element extraction
and relation extraction from texts are investigated
by Nichols and Botros (2015). Pustejovsky
et al. (2015) present a fine-grained spatial semantic
analysis in texts with rich spatial roles. Based on
the image input, Yang et al. (2019) propose VSRC
and benchmark it with a manually-crafted dataset.
VSRC is a relatively shallow task for visual spatial
analysis, based on a closed relation set and
a simple classification schema. Subsequently,
Chiou et al. (2021) build a much stronger model
on the dataset. Many studies have exploited spatial
semantics to assist other image understanding tasks
(Kim et al.,2021;Wu et al.,2021;Collell et al.,
2021;Xiao et al.,2021;Pierrard et al.,2021). In
addition, learning spatial representations from mul-
tiple modalities also receives particular attention
(Collell and Moens,2018;Dan et al.,2020). In
this work, we extend image-to-text and propose
VSD, which aims for the spatial understanding of
the image.
3 Visual Spatial Description
3.1 Task Description
Formally, we define the task of VSD as follows: given an image $I$ and an object pair $\langle O_1, O_2 \rangle$ inside $I$, VSD aims to output a word sequence $S = \{w_1, \ldots, w_n\}$ that describes the spatial semantics between $O_1$ and $O_2$. The provided $O_1$ and $O_2$ include both the object tags and their bounding boxes. In Figure 1, we would receive “A man is walking behind a red car from right to left.” for $\langle$man, car$\rangle$ and “A red car is parked to the left of a pole.” for $\langle$car, pole$\rangle$ based on the same input image. The generated sentences of VSD must encode the spatial semantics between the two given objects, which differs from conventional image-to-text generation.
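For concreteness, one VSD instance can be represented roughly as follows; the field names and the (x1, y1, x2, y2) box convention are illustrative assumptions, and the coordinates in the example are invented for the sake of the sketch.

```python
from dataclasses import dataclass
from typing import Tuple


@dataclass
class VSDObject:
    tag: str                         # object category label, e.g. "man"
    box: Tuple[int, int, int, int]   # bounding box as (x1, y1, x2, y2) pixels


@dataclass
class VSDInstance:
    image_path: str                  # the input image I
    obj1: VSDObject                  # O1
    obj2: VSDObject                  # O2
    description: str                 # the target sentence S = (w_1, ..., w_n)


# Example mirroring Figure 1 (coordinates are made up for illustration):
example = VSDInstance(
    image_path="street.jpg",
    obj1=VSDObject(tag="man", box=(410, 120, 520, 380)),
    obj2=VSDObject(tag="car", box=(150, 200, 470, 390)),
    description="A man is walking behind a red car from right to left.",
)
```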
3.2 Compared with VSRC
Noticeably, VSRC is another representative task of
visual spatial understanding that decides the spatial
relation of two objects in an image. The relation is
chosen from a closed set which is manually prede-
fined. We can regard VSRC as a shallow analysis
task for spatial semantics understanding, while the
VSD task can offer a deeper spatial analysis by
using the much more flexible output.
In particular, compared with VSRC, VSD has
three major advantages. First, VSD can offer richer
semantics, which can be necessary for spatial understanding.
In contrast, VSRC generally outputs only a
spatial relation from a closed set, while VSD
can introduce further semantic elements beyond the relation,
such as predicates and object attributes, to deepen
the spatial understanding. Second, spatial
relations may overlap. For example, the
two relationships “behind” and “to the right of”
might both be correct for VSRC given the “man”
and “car” in Figure 1; the newly proposed
VSD task can describe such multiple
spatial semantics more accurately. Third, from the viewpoint of
downstream tasks, especially systems that require
automatic content-based image indexing or
visual dialogue, VSD is more straightforward and
better suited to support them.
3.3 Data Collection
We build an initial dataset to benchmark the VSD
task. The constructed dataset is extended from
a VSRC dataset to facilitate the investigation be-
tween VSD and VSRC. Thus, our final corpus in-
cludes both VSRC and VSD annotations.
Our VSRC dataset is sourced from two existing
datasets: SpatialSense (Yang et al.,2019) and Vi-
sualGenome (Krishna et al.,2017). SpatialSense is
a dataset initially constructed for VSRC with nine
well-defined spatial relations, namely, “on”, “in”,
“next to”, “under”, “above”, “behind”, “in front of”,
“to the left of”, and “to the right of”. The only
disadvantage of SpatialSense is its relatively small
scale. Consequently, we enlarge the corpus with
the help of VisualGenome, a widely adopted dataset
for scene graph generation with annotations in the
form of $\langle$subject, predicate, object$\rangle$. We add the
triplets in VisualGenome whose predicates can be
easily aligned with the nine spatial relations in
SpatialSense.2 Accordingly, we can obtain a larger
2The alignment is achieved by a map, which will be
released along with the dataset.
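A rough sketch of this alignment step is given below; the actual predicate map is released with the dataset, so the map entries here are merely plausible examples, and the function name is hypothetical.

```python
# The nine spatial relations defined in SpatialSense.
SPATIAL_RELATIONS = {
    "on", "in", "next to", "under", "above",
    "behind", "in front of", "to the left of", "to the right of",
}

# Hypothetical VisualGenome predicate -> SpatialSense relation entries
# (the real map is released with the dataset).
PREDICATE_MAP = {
    "on top of": "on",
    "inside": "in",
    "beside": "next to",
    "underneath": "under",
    "left of": "to the left of",
    "right of": "to the right of",
}


def align_triplets(vg_triplets):
    """Keep only <subject, predicate, object> triplets whose predicate maps onto
    one of the nine SpatialSense relations, rewriting the predicate accordingly."""
    aligned = []
    for subj, pred, obj in vg_triplets:
        key = pred.lower().strip()
        if key in SPATIAL_RELATIONS:
            relation = key
        else:
            relation = PREDICATE_MAP.get(key)
        if relation is not None:
            aligned.append((subj, relation, obj))
    return aligned


# Example:
triplets = [("car", "left of", "pole"), ("man", "wearing", "hat")]
print(align_triplets(triplets))   # [('car', 'to the left of', 'pole')]
```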