Visual Semantic Parsing: From Images to Abstract Meaning Representation

Mohamed A. Abdelsalam1, Zhan Shi1,2*, Federico Fancellu3†, Kalliopi Basioti1,4*, Dhaivat J. Bhatt1, Vladimir Pavlovic1,4, Afsaneh Fazly1

1Samsung AI Centre - Toronto, 2Queen's University, 3 3M, 4Rutgers University
{m.abdelsalam, d.bhatt, a.fazly}@samsung.com, z.shi@queensu.ca
f.fancellu0@gmail.com, {kalliopi.basioti, vladimir}@rutgers.edu

*Work done during an internship at Samsung AI Centre - Toronto.
†Work done while at Samsung AI Centre - Toronto.
Abstract
The success of scene graphs for visual scene understanding has brought attention to the benefits of abstracting a visual input (e.g., an image) into a structured representation, where entities (people and objects) are nodes connected by edges specifying their relations. Building these representations, however, requires expensive manual annotation in the form of images paired with their scene graphs or frames. These formalisms also remain limited in the nature of the entities and relations they can capture. In this paper, we propose to leverage a widely-used meaning representation from the field of natural language processing, the Abstract Meaning Representation (AMR), to address these shortcomings. Compared to scene graphs, which largely emphasize spatial relationships, our visual AMR graphs are more linguistically informed, with a focus on higher-level semantic concepts extrapolated from visual input. Moreover, they allow us to generate meta-AMR graphs that unify the information contained in multiple image descriptions under one representation. Through extensive experimentation and analysis, we demonstrate that we can re-purpose an existing text-to-AMR parser to parse images into AMRs. Our findings point to important future research directions for improved scene understanding.
1 Introduction
The ability to understand and describe a scene is fundamental for the development of truly intelligent systems, including autonomous vehicles, robots navigating an environment, or even simpler applications such as language-based image retrieval. Much work in computer vision has focused on two key aspects of scene understanding, namely, recognizing entities, including object detection (Liu et al., 2016; Ren et al., 2015; Carion et al., 2020; Liu et al., 2020a) and activity recognition (Herath et al., 2017; Kong and Fu, 2022; Li et al., 2018; Gao et al., 2018), as well as understanding how entities are related to each other, e.g., human–object interaction (Hou et al., 2020; Zou et al., 2021) and relation detection (Lu et al., 2016; Zhang et al., 2017; Zellers et al., 2018).
A natural way of representing scene entities and their relations is in graph form, so it is perhaps unsurprising that a lot of work has focused on graph-based scene representations, and especially on scene graphs (Johnson et al., 2015a). Scene graphs encode the salient regions in an image (mainly, objects) as nodes, and the relations among these (mostly spatial in nature) as edges, both labelled via natural language tags; see Fig. 1(b) for an example scene graph. Along the same lines, Yatskar et al. (2016) propose to represent a scene as a semantic-role-labelled frame drawn from FrameNet (Ruppenhofer et al., 2016), a linguistically-motivated approach that draws on the semantic role labelling literature.
Scene graphs and situation frames can capture important aspects of an image, yet they are limited in important ways. They both require expensive manual annotation in the form of images paired with their corresponding scene graphs or frames. Scene graphs in particular also suffer from being limited in the nature of entities and relations that they capture (see Section 2 for a detailed analysis). Ideally, we would like to capture event-level semantics (same as in situation recognition), but as a structured graph that captures a diverse set of relations and goes beyond low-level visual semantics.
Figure 1: An image from the MSCOCO and Visual Genome datasets, along with its five human-generated captions, and: (a) an image-level meta-AMR graph capturing its overall semantics, (b) its human-generated scene graph.

The five captions:
Someone riding a wave on their surfboard.
A man riding a wave on top of a surfboard.
A surfer is on a surfboard riding a large wave.
A man surfing a wave in front of a cliff.
A man surfing with the waves in the sea near mountain side.

(a) Meta-AMR (linearized):
(z0 / ride-01
    :ARG0 (z1 / man
        :location (z2 / on
            :op1 (z3 / surfboard)
            :location (z4 / top)))
    :ARG1 (z5 / wave
        :mod (z6 / large))
    :location (z7 / front
        :op1 (z8 / cliff))
    :location (z9 / sea
        :ARG1-of (z10 / near-02
            :ARG2 (z11 / side
                :part-of (z12 / mountain)))))

(b) Scene graph: entity nodes such as surfer, man, surfboard, wave, water, wetsuit, cone, crevasse, stone surface, shadow, and formation, connected by relations such as IN, ON, WEARING, behind, and hitting.
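As an illustrative aside, the linearized meta-AMR in (a) is standard PENMAN notation, so it can be decoded with the off-the-shelf penman Python library (assumed installed via pip install penman); the variable and concept names below come directly from the Figure 1 example.

# Minimal sketch: decode the Figure 1(a) meta-AMR with the penman library
# and inspect its concept nodes and relations.
import penman

meta_amr = """
(z0 / ride-01
    :ARG0 (z1 / man
        :location (z2 / on
            :op1 (z3 / surfboard)
            :location (z4 / top)))
    :ARG1 (z5 / wave
        :mod (z6 / large))
    :location (z7 / front
        :op1 (z8 / cliff))
    :location (z9 / sea
        :ARG1-of (z10 / near-02
            :ARG2 (z11 / side
                :part-of (z12 / mountain)))))
"""

graph = penman.decode(meta_amr)
print(graph.instances())  # concept nodes, e.g. ('z0', ':instance', 'ride-01')
print(graph.edges())      # relations between nodes, e.g. ('z0', ':ARG0', 'z1')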
Inspired by linguistically-motivated image understanding research, we propose to represent images using a well-known graph formalism for language understanding, i.e., Abstract Meaning Representations (AMRs; Banarescu et al., 2013). Similarly to (visual) semantic role labelling, AMRs also represent "who did what to whom, where, when, and how?" (Màrquez et al., 2008), but in a more structured way, by transforming an image into a graph representation. AMRs not only encode the main events, their participants and arguments, and the relations among them (as in semantic role labelling/situation recognition), but also the relations among various other participants and arguments; see Fig. 1(a). Importantly, AMR is a broadly-adopted and dynamically evolving formalism (e.g., Bonial et al., 2020; Bonn et al., 2020; Naseem et al., 2021), and AMR parsing is an active and successful area of research (e.g., Zhang et al., 2019b; Bevilacqua et al., 2021; Xia et al., 2021; Drozdov et al., 2022). Finally, given the high quality of existing AMR parsers (for language), we do not need manual AMR annotations for images, and can rely on existing image–caption datasets to create high-quality silver data for image-to-AMR parsing. In summary, we make the following contributions:
- We introduce the novel problem of parsing images into Abstract Meaning Representations, a widely-adopted, linguistically-motivated graph formalism, and propose the first image-to-AMR parser model for the task.
- We present a detailed analysis and comparison between scene graphs and AMRs with respect to the nature of entities and relations they capture, the results of which further motivate research into the use of AMRs for better image understanding.
- Inspired by work on multi-sentence AMR, we propose a graph-to-graph transformation algorithm that combines the meanings of several image captions into image-level meta-AMR graphs. The motivation behind generating meta-AMRs is to build a graph that covers most of the entities, predicates, and semantic relations contained in the individual caption AMRs.
Our analyses suggest that AMRs encode aspects of an image's content that are not captured by the commonly-used scene graphs. Our initial results on re-purposing a text-to-AMR parser for image-to-AMR parsing, as well as on creating image-level meta-AMRs, point to exciting future research directions for improved scene understanding.
2 Motivation: AMRs vs. Scene Graphs
Scene graphs (SGs) are a widely-adopted graph formalism for representing the semantic content of an image. Scene graphs have been shown useful for various downstream tasks, such as image captioning (Yang et al., 2019; Li and Jiang, 2019; Zhong et al., 2020), visual question answering (Zhang et al., 2019a; Hildebrandt et al., 2020; Damodaran et al., 2021), and image retrieval (Johnson et al., 2015b; Schuster et al., 2015; Wang et al., 2020; Schroeder and Tripathi, 2020). However, learning to automatically generate SGs requires expensive manual annotations (object bounding boxes and their relations). SGs were also shown to be highly biased in the entity and relation types that they capture. For example, an analysis by Zellers et al. (2018) reveals that clothing (e.g., dress) and object/body parts (e.g., eyes, wheel) make up over one-third of the entity instances in the SGs corresponding to Visual Genome images (Krishna et al., 2016), and that more than 90% of all relation instances belong to the two categories of geometric (e.g., behind) and possessive (e.g., have).
One advantage of AMR graphs is that we can draw on supervision through the captions associated with images. Nonetheless, the question remains as to what types of entities and relations are encoded by AMR graphs, and how these differ from SGs. To answer this question, we follow an approach similar to that of Zellers et al. (2018), and categorize the entities and relations in the SG and AMR graphs corresponding to a sample of 50K images. We use the same categories as Zellers et al., but add a few new ones to capture relation types specific to AMRs, namely, Attribute (small), Quantifier (few), Event (soccer), and AMR-specific (date-entity). Details of our categorization process are provided in Appendix A.
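As a rough illustration of this kind of tally (not the exact procedure of Appendix A), the sketch below counts relation categories over a collection of AMR graphs, using a small hypothetical label-to-category mapping.

# Illustrative sketch of tallying relation categories over a set of AMR
# graphs; the label-to-category mapping is a small hypothetical stand-in
# for the full mapping described in Appendix A.
from collections import Counter
import penman

RELATION_CATEGORIES = {
    ":location": "geometric",
    ":mod": "attribute",
    ":quant": "quantifier",
    ":ARG0": "semantic",
    ":ARG1": "semantic",
    ":part-of": "possessive",
}

def relation_histogram(amr_strings):
    counts = Counter()
    for s in amr_strings:
        g = penman.decode(s)
        # Count both node-to-node edges and constant-valued attributes.
        for _, role, _ in g.edges() + g.attributes():
            counts[RELATION_CATEGORIES.get(role, "other")] += 1
    return counts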
Figure 2 shows the distribution of instances for each Entity and Relation category, compared across SG and AMR graphs. AMRs tend to encode a more diverse set of relations, and in particular capture more of the abstract semantic relations that are missing from SGs. This is expected, because our caption-generated AMRs by design capture the essential meaning of the image descriptions and, as such, encode how people perceive and describe scenes. In contrast, SGs are designed to capture the content of an image, including regions representing objects and (mainly spatial/geometric) visually-observable relations; see Fig. 1 for the SG and AMR graphs corresponding to an image. In the context of Entities, and in a major departure from SGs, (object/body) parts are less frequently encoded in AMRs, pointing to the well-known whole-object bias in how people perceive and describe scenes (Markman, 1990; Fei-Fei et al., 2007). In contrast, location is more frequent in AMRs.

Figure 2: Statistics on a selected set of top-frequency Entity and Relation categories, extracted from the AMR and SG graphs corresponding to around 50K images that appear in both Visual Genome and MSCOCO.
The focus of AMRs on abstract content suggests that they have the potential for improving downstream tasks, especially when the task requires an understanding of the higher-level semantics of an image. Interestingly, a recent study showed that using AMRs as an intermediate representation for textual SG parsing helps improve the quality of the parsed SGs (Choi et al., 2022), even though AMRs and SGs encode qualitatively different information. Since AMRs tend to capture higher-level semantics, we propose to use them as the final image representation. The question remains as to how difficult it is to directly learn such representations from images. The rest of the paper focuses on answering this question.
3 Method
3.1 Parsing Images into AMR Graphs
We develop image-to-AMR parsers based on a state-of-the-art seq2seq text-to-AMR parser, SPRING (Bevilacqua et al., 2021), and a multimodal VL-BART (Cho et al., 2021). Both are transformer-based architectures with a bidirectional encoder and an auto-regressive decoder. SPRING extends a pre-trained seq2seq model, BART (Lewis et al., 2020), by fine-tuning it on AMR parsing and generation. Next, we describe our models, input representation, and training.
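Since SPRING is obtained by fine-tuning BART, the seq2seq text-to-AMR interface it builds on can be sketched with the Hugging Face BART classes. The checkpoint below is a generic one used only for illustration; without SPRING's fine-tuned weights and AMR-extended vocabulary the output will not be a valid AMR.

# Sketch of the seq2seq interface underlying text-to-AMR parsing: a BART
# encoder-decoder generates a linearized output string from a sentence.
# "facebook/bart-base" is a generic checkpoint, not the SPRING model.
from transformers import BartForConditionalGeneration, BartTokenizer

tokenizer = BartTokenizer.from_pretrained("facebook/bart-base")
model = BartForConditionalGeneration.from_pretrained("facebook/bart-base")

inputs = tokenizer("A man riding a wave on top of a surfboard.", return_tensors="pt")
output_ids = model.generate(**inputs, max_length=128)
print(tokenizer.decode(output_ids[0], skip_special_tokens=True))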
Models. We build two variants of our image-to-AMR parser, as depicted in Fig. 3(a) and (b).

Our first model, which we refer to as IMG2AMRdirect, modifies SPRING by replacing BART with its vision-and-language counterpart, VL-BART (Cho et al., 2021). VL-BART extends BART with visual understanding ability through fine-tuning on multiple vision-and-language tasks. With this modification, our model can receive visual features (plus text) as input, and generate linearized AMR graphs.
Figure 3: Model architecture for our two image-to-AMR models: (a) IMG2AMRdirect, a direct model that uses a single seq2seq encoder–decoder to generate linearized AMRs from input images; and (b) IMG2AMR2stage, a two-stage model containing two independent seq2seq components. g and r stand for global and region features, q for tag embeddings, and n for the embeddings of the predicted nodes. The input and output spaces of the decoders come from the AMR vocabulary.
Our second model, inspired by text-to-graph AMR parsers (e.g., Zhang et al., 2019b; Xia et al., 2021), generates linearized AMRs in two stages, by first predicting the nodes and then the relations. Specifically, we first predict the nodes of the linearized AMR for a given image. These predicted nodes are then fed (along with the image) as input into a second seq2seq model that generates a linearized AMR (effectively adding the relations). We refer to this model as IMG2AMR2stage.
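To make the data flow concrete, the following is a schematic sketch of the two-stage inference just described; node_predictor and graph_generator are hypothetical stand-ins for the stage-1 and stage-2 seq2seq models, each mapping its inputs to a token sequence.

# Schematic sketch of the IMG2AMR2stage inference flow; the two callables
# are hypothetical stand-ins for the stage-1 and stage-2 seq2seq models.
from typing import Callable, List, Sequence

def img2amr_two_stage(
    image_features: Sequence[float],
    node_predictor: Callable[[Sequence[float]], List[str]],
    graph_generator: Callable[[Sequence[float], List[str]], List[str]],
) -> List[str]:
    # Stage 1: predict the AMR concept nodes from the image alone,
    # e.g. ["smile-01", "girl", "hold-01", "cat"] as in Fig. 3(b).
    nodes = node_predictor(image_features)
    # Stage 2: condition on the image plus the predicted nodes to generate
    # the full linearized AMR token sequence (nodes and relations).
    return graph_generator(image_features, nodes)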
Input Representation. To represent images, we follow VL-BART, which takes the output of Faster R-CNN (Ren et al., 2015) (i.e., region features and coordinates for 36 regions) and projects them onto d = 768 dimensional vectors via two separate fully-connected layers. Faster R-CNN region features are obtained via training for visual object and attribute classification (Anderson et al., 2018) on Visual Genome. The visual input to our model is composed of position-aware embeddings for the 36 regions, plus a global image-level feature (r and g in Fig. 3). To get the position-aware embeddings for the regions, we add together the projected region and coordinate embeddings. To get the global image feature, we use the output of the final hidden layer in ResNet-101 (He et al., 2016), which is passed through the same fully-connected layer as the regions to obtain a 768-dimensional vector.
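The following is a minimal PyTorch sketch of this input construction. The 2048-dimensional region/global features and 4-dimensional box coordinates are assumptions made for illustration (only the 36 regions and the d = 768 output size are stated above), and VisualInputEmbedder is a hypothetical name rather than a module from VL-BART or SPRING.

# Sketch of the visual input construction: project region features and box
# coordinates to 768-d, sum them, and prepend a projected global feature.
import torch
import torch.nn as nn

class VisualInputEmbedder(nn.Module):
    def __init__(self, region_dim=2048, coord_dim=4, d_model=768):
        super().__init__()
        self.region_proj = nn.Linear(region_dim, d_model)  # also used for the global feature
        self.coord_proj = nn.Linear(coord_dim, d_model)

    def forward(self, region_feats, region_coords, global_feat):
        # region_feats: (batch, 36, 2048), region_coords: (batch, 36, 4)
        # global_feat:  (batch, 2048), e.g. from ResNet-101's final hidden layer
        regions = self.region_proj(region_feats) + self.coord_proj(region_coords)
        global_emb = self.region_proj(global_feat).unsqueeze(1)   # (batch, 1, 768)
        return torch.cat([global_emb, regions], dim=1)            # (batch, 37, 768)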
Training. To benefit from transfer learning, we initialize the encoder and decoder weights of both our models from the pre-trained VL-BART. This is a reasonable initialization strategy, given that VL-BART has been pre-trained on input similar to ours. Moreover, a large number of AMR labels are drawn from the English vocabulary, so the pre-training of VL-BART should also be appropriate for AMR generation. We fine-tune our models on the task of image-to-AMR generation, using images paired with their automatically-generated AMR graphs. We consider two alternative AMR representations: (a) caption AMRs, created directly from captions associated with images (see Section 4 for details); and (b) image-level meta-AMRs, constructed through an algorithm we describe below in Section 3.2. We perform experiments with either caption or meta-AMRs, where we train and test on the same type of AMRs. For the various stages of training, we use the cross-entropy loss between the model predictions and the ground-truth labels for each token, where the model predictions are obtained greedily, i.e., by choosing the token with the maximum score at each step of the sequence generation.
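As a sketch of the token-level objective (with illustrative tensor names, not the authors' training code), the loss over a batch of decoder outputs can be computed as follows.

# Token-level cross-entropy between decoder predictions and gold
# linearized-AMR tokens; tensor names and pad_id are illustrative.
import torch
import torch.nn.functional as F

def amr_token_loss(logits: torch.Tensor, targets: torch.Tensor, pad_id: int = 1) -> torch.Tensor:
    # logits: (batch, seq_len, vocab_size); targets: (batch, seq_len)
    # Greedy prediction at each step corresponds to logits.argmax(dim=-1).
    return F.cross_entropy(
        logits.reshape(-1, logits.size(-1)),
        targets.reshape(-1),
        ignore_index=pad_id,  # skip padding positions
    )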
3.2 Learning per-Image meta-AMR Graphs
Recall that, in order to collect a dataset of images paired with their AMR graphs, we rely on image–caption datasets such as MSCOCO. Specifically, we use a pre-trained AMR parser to generate AMR graphs from each caption of an image. Images can be described in many different ways; e.g., each image in MSCOCO comes with five different human-generated captions. We hypothesize that these captions collectively represent the content of the image they are describing, and as such we propose to also combine the caption AMRs into image-level meta-AMR graphs, through a merge-and-refine process that we explain next.
Prior work has used graph-to-graph transformations for merging sentence-level AMRs into document-level AMRs for abstractive and multi-document summarization.
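Purely to illustrate the basic idea of merging, and not the paper's merge-and-refine algorithm, the sketch below collapses variables that share a concept label across caption AMRs and takes the union of their triples (again assuming the penman library).

# Illustrative (naive) merge of caption AMRs: variables that share a concept
# label are collapsed to one node and the triples are unioned. This is a
# simplified stand-in, not the merge-and-refine procedure from the paper.
import penman

def naive_merge(caption_amrs):
    merged_triples = []
    concept_to_var = {}          # concept label -> canonical variable name
    for amr_string in caption_amrs:
        g = penman.decode(amr_string)
        rename = {}
        for var, _, concept in g.instances():
            rename[var] = concept_to_var.setdefault(concept, f"m{len(concept_to_var)}")
        for src, role, tgt in g.triples:
            triple = (rename.get(src, src), role, rename.get(tgt, tgt))
            if triple not in merged_triples:
                merged_triples.append(triple)
    # Note: the merged graph may need re-rooting and further refinement
    # before it can be re-serialized with penman.encode.
    return penman.Graph(merged_triples)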