tics of image-to-text, which is essential but has
received little attention previously. Spatial seman-
tics is a fundamental aspect of both language and
image interpretation in relation to human cogni-
tion (Zlatev, 2007), and it has shown great value in spatially-oriented applications such as automatic navigation, personal assistance, and unmanned manipulation (Irshad et al., 2021; Raychaudhuri et al., 2021; Zeng et al., 2018). Here, we introduce a new
task, Visual Spatial Description (VSD), which gen-
erates text pieces to describe the spatial semantics
in the image. The task takes as input an image with two specified objects and outputs a sentence that describes the detailed spatial relation between the objects. We manually annotate a dataset to benchmark this task.
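For concreteness, a single VSD instance might be represented as sketched below; this is a minimal illustrative example, and the field names and values are hypothetical rather than the actual annotation schema of our dataset.

```python
# A hypothetical VSD instance (illustrative only; not the dataset's real schema).
vsd_example = {
    "image": "kitchen_001.jpg",   # input image
    "object_1": "cup",            # first specified object
    "object_2": "table",          # second specified object
    # target output: one sentence describing the spatial relation of the two objects
    "description": "A white cup is sitting on top of the wooden table.",
}
```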
VSD is a typical vision-language generation
problem that can be addressed by multi-modal
encoder-decoder modeling. Multi-modal models accept both visual and linguistic inputs and encode them into a joint representation that captures information from both modalities. Moreover, recent studies show that vision-language pretraining brings remarkable gains on most image-to-text tasks (Lu et al., 2019; Sun et al., 2019; Tan and Bansal, 2019; Zhou et al., 2020; Li et al., 2019; Hu and Singh, 2021; Li et al., 2021; Xiao et al., 2022). Here, we follow this line of work and adopt VL-BART and VL-T5 (Cho et al., 2021) as backbones, which exhibit state-of-the-art performance in vision-language generation.
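As a rough illustration of this setup, the sketch below shows how region-level visual features and an embedded text prompt could be concatenated and passed to a generic pretrained sequence-to-sequence backbone; the class, argument, and method names are placeholders following a common Hugging Face-style seq2seq interface, not the actual VL-BART/VL-T5 code.

```python
import torch
from torch import nn

class VSDGenerator(nn.Module):
    """Schematic vision-language encoder-decoder for VSD (illustrative only)."""

    def __init__(self, seq2seq_backbone, visual_dim=2048, hidden_dim=768):
        super().__init__()
        self.backbone = seq2seq_backbone                      # pretrained text seq2seq model
        self.visual_proj = nn.Linear(visual_dim, hidden_dim)  # project detector features

    def forward(self, region_feats, prompt_embeds, labels=None):
        # region_feats:  (batch, n_regions, visual_dim) object-detector features,
        #                with the two queried objects marked by the data loader
        # prompt_embeds: (batch, n_tokens, hidden_dim) embedded text prompt
        visual_tokens = self.visual_proj(region_feats)
        inputs_embeds = torch.cat([visual_tokens, prompt_embeds], dim=1)
        # The decoder generates the spatial description conditioned on both modalities.
        return self.backbone(inputs_embeds=inputs_embeds, labels=labels)
```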
In particular, a closely-related task, visual spatial
relationship classification (VSRC), which outputs
the spatial relationship between two objects inside
an image, might be beneficial for our proposed
VSD. The predefined discrete spatial relations in VSRC, such as “next to” and “behind”, should be able to effectively guide VSD generation. To this end, we first make a thorough comparison of the connections between VSRC and VSD, which can be regarded as shallow and deep analyses of spatial semantics, respectively, and then investigate VSRC-enhanced VSD models, performing visual spatial understanding from shallow to deep. Specifically, we present two straightforward architectures that integrate VSRC into VSD: one adopts a pipeline strategy and the other an end-to-end joint strategy.
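To make the two strategies concrete, the following sketch contrasts them: the pipeline variant first predicts a relation label and injects it into the text prompt before generation, whereas the joint variant shares the backbone between a VSRC classification head and the VSD decoder and sums the two losses. All names, the prompt format, and the pooling choice are illustrative assumptions (following a Hugging Face-style seq2seq interface) rather than our actual implementation.

```python
import torch
from torch import nn

def pipeline_vsd(vsrc_classifier, vsd_generator, region_feats, obj1, obj2):
    """Pipeline strategy (sketch): classify the relation first, then condition
    the generator on the predicted label through the text prompt."""
    relation = vsrc_classifier.predict(region_feats, obj1, obj2)  # e.g., "next to"
    prompt = f"describe: {obj1} [{relation}] {obj2}"              # hypothetical prompt format
    return vsd_generator.generate(region_feats, prompt)

class JointVSD(nn.Module):
    """End-to-end joint strategy (sketch): a shared encoder-decoder backbone with
    an extra VSRC head; the classification and generation losses are summed."""

    def __init__(self, shared_backbone, hidden_dim=768, n_relations=9):
        super().__init__()
        self.backbone = shared_backbone                       # seq2seq backbone (e.g., VL-T5-like)
        self.vsrc_head = nn.Linear(hidden_dim, n_relations)   # n_relations is a placeholder

    def forward(self, inputs_embeds, labels, relation_labels):
        out = self.backbone(inputs_embeds=inputs_embeds, labels=labels)
        # Mean-pool the encoder states for relation classification.
        pooled = out.encoder_last_hidden_state.mean(dim=1)
        vsrc_loss = nn.functional.cross_entropy(self.vsrc_head(pooled), relation_labels)
        return out.loss + vsrc_loss                           # joint multi-task objective
```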
Finally, we conduct experiments on our con-
structed dataset to evaluate all proposed models.
First, we examine the two VSD-only models built directly on VL-BART and VL-T5. The results show that the two models are comparable in performance, and both can produce highly accurate, fluent, human-like descriptions of spatial semantics. Second, we verify the effectiveness of VSRC for VSD and find that: (1) VSRC has great potential for VSD, since gold-standard VSRC labels lead to striking improvements on VSD; (2) VSD can also benefit from automatic VSRC, where the end-to-end joint framework performs slightly better than the pipeline. We further perform several analyses to gain a deeper understanding of VSD and the proposed models.
2 Related Work
Image-to-text has been intensively investigated with the support of neural networks in recent years (He and Deng, 2017). The encoder-decoder architecture is a commonly adopted framework, where the encoder extracts visual features from the image and the decoder generates text for specific tasks. Early works employ a convolutional neural network (CNN) as the visual encoder and a recurrent neural network (RNN) as the text decoder (Vinyals et al., 2015; Rennie et al., 2017). Recently, the Transformer (Vaswani et al., 2017), which is highly effective for representation learning in both vision and language, has gained increasing interest. Transformer-based encoder-decoder models have been adopted in a wide range of image-to-text tasks (Cornia et al., 2020; Herdade et al., 2019; Fei et al., 2021a). These models, coupled with vision-language pretraining, have achieved top performance on these tasks (Lu et al., 2019; Sun et al., 2019; Tan and Bansal, 2019; Zhou et al., 2020; Li et al., 2019; Hu and Singh, 2021; Li et al., 2021). In this work, we exploit the Transformer-based architecture and two pretrained vision-language models, VL-BART and VL-T5 (Cho et al., 2021), obtaining several strong benchmark models for our task.
Image-to-text tasks vary depending on the objective of the visual description. Image captioning is the most well-studied task, which aims to summarize a given image or to describe a particular region in it (Karpathy and Fei-Fei, 2015; Vinyals et al., 2015). Several subsequent studies have attempted to produce captions with specified patterns and styles (Cornia et al., 2019; Kim et al., 2019; Deng et al., 2020; Zhong et al., 2020; Zheng et al., 2019).
For example, VQA and visual reasoning can be
regarded as such attempts, which are conditioned