Visual Semantic Parsing: From Images to Abstract Meaning Representation

Mohamed A. Abdelsalam1, Zhan Shi1,2*, Federico Fancellu3†, Kalliopi Basioti1,4*, Dhaivat J. Bhatt1, Vladimir Pavlovic1,4, Afsaneh Fazly1

1Samsung AI Centre - Toronto, 2Queen's University, 3 3M, 4Rutgers University
{m.abdelsalam, d.bhatt, a.fazly}@samsung.com, z.shi@queensu.ca
f.fancellu0@gmail.com, {kalliopi.basioti, vladimir}@rutgers.edu

*Work done during an internship at Samsung AI Centre - Toronto.
†Work done while at Samsung AI Centre - Toronto.
Abstract
The success of scene graphs for visual scene understanding has brought attention to the benefits of abstracting a visual input (e.g., an image) into a structured representation, where entities (people and objects) are nodes connected by edges specifying their relations. Building these representations, however, requires expensive manual annotation in the form of images paired with their scene graphs or frames. These formalisms also remain limited in the nature of the entities and relations they can capture. In this paper, we propose to leverage a widely-used meaning representation from the field of natural language processing, the Abstract Meaning Representation (AMR), to address these shortcomings. Compared to scene graphs, which largely emphasize spatial relationships, our visual AMR graphs are more linguistically informed, with a focus on higher-level semantic concepts extrapolated from visual input. Moreover, they allow us to generate meta-AMR graphs that unify the information contained in multiple image descriptions under one representation. Through extensive experimentation and analysis, we demonstrate that we can re-purpose an existing text-to-AMR parser to parse images into AMRs. Our findings point to important future research directions for improved scene understanding.
1 Introduction
The ability to understand and describe a scene is fundamental for the development of truly intelligent systems, including autonomous vehicles, robots navigating an environment, or even simpler applications such as language-based image retrieval. Much work in computer vision has focused on two key aspects of scene understanding, namely, recognizing entities, including object detection (Liu et al., 2016; Ren et al., 2015; Carion et al., 2020; Liu et al., 2020a) and activity recognition (Herath et al., 2017; Kong and Fu, 2022; Li et al., 2018; Gao et al., 2018), as well as understanding how entities are related to each other, e.g., human–object interaction (Hou et al., 2020; Zou et al., 2021) and relation detection (Lu et al., 2016; Zhang et al., 2017; Zellers et al., 2018).
A natural way of representing scene entities and their relations is in graph form, so it is perhaps unsurprising that a lot of work has focused on graph-based scene representations, and especially on scene graphs (Johnson et al., 2015a). Scene graphs encode the salient regions in an image (mainly, objects) as nodes, and the relations among these (mostly spatial in nature) as edges, both labelled via natural language tags; see Fig. 1(b) for an example scene graph. Along the same lines, Yatskar et al. (2016) propose to represent a scene as a semantic-role-labelled frame drawn from FrameNet (Ruppenhofer et al., 2016), a linguistically-motivated approach that draws on the semantic role labelling literature.
Scene graphs and situation frames can capture important aspects of an image, yet they are limited in important ways. They both require expensive manual annotation in the form of images paired with their corresponding scene graphs or frames. Scene graphs in particular also suffer from being limited in the nature of entities and relations that they capture (see Section 2 for a detailed analysis). Ideally, we would like to capture event-level semantics (same as in situation recognition), but as a structured graph that captures a diverse set of relations and goes beyond low-level visual semantics.
Figure 1: An image from the MSCOCO and Visual Genome datasets, along with its five human-generated captions, and: (a) an image-level meta-AMR graph capturing its overall semantics, (b) its human-generated scene graph.

The five captions:
Someone riding a wave on their surfboard.
A man riding a wave on top of a surfboard.
A surfer is on a surfboard riding a large wave.
A man surfing a wave in front of a cliff.
A man surfing with the waves in the sea near mountain side.

(a) Meta-AMR (linearized):
(z0 / ride-01
    :ARG0 (z1 / man
        :location (z2 / on
            :op1 (z3 / surfboard)
            :location (z4 / top)))
    :ARG1 (z5 / wave
        :mod (z6 / large))
    :location (z7 / front
        :op1 (z8 / cliff))
    :location (z9 / sea
        :ARG1-of (z10 / near-02
            :ARG2 (z11 / side
                :part-of (z12 / mountain)))))

(b) Scene graph: entity nodes such as surfer, man, surfboard, wave, water, wetsuit, cone, crevasse, stone surface, shadow, and formation, connected by relations such as IN, ON, WEARING, behind, and hitting.
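As an illustrative aside, the linearized meta-AMR in (a) is standard PENMAN notation, so it can be decoded with the off-the-shelf penman Python library (assumed installed via pip install penman); the variable and concept names below come directly from the Figure 1 example.

# Minimal sketch: decode the Figure 1(a) meta-AMR with the penman library
# and inspect its concept nodes and relations.
import penman

meta_amr = """
(z0 / ride-01
    :ARG0 (z1 / man
        :location (z2 / on
            :op1 (z3 / surfboard)
            :location (z4 / top)))
    :ARG1 (z5 / wave
        :mod (z6 / large))
    :location (z7 / front
        :op1 (z8 / cliff))
    :location (z9 / sea
        :ARG1-of (z10 / near-02
            :ARG2 (z11 / side
                :part-of (z12 / mountain)))))
"""

graph = penman.decode(meta_amr)
print(graph.instances())  # concept nodes, e.g. ('z0', ':instance', 'ride-01')
print(graph.edges())      # relations between nodes, e.g. ('z0', ':ARG0', 'z1')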
Inspired by linguistically-motivated image understanding research, we propose to represent images using a well-known graph formalism for language understanding, i.e., Abstract Meaning Representations (AMRs; Banarescu et al., 2013). Similarly to (visual) semantic role labelling, AMRs also represent "who did what to whom, where, when, and how?" (Màrquez et al., 2008), but in a more structured way, by transforming an image into a graph representation. AMRs not only encode the main events, their participants and arguments, and the relations among them (as in semantic role labelling/situation recognition), but also the relations among various other participants and arguments; see Fig. 1(a). Importantly, AMR is a broadly-adopted and dynamically evolving formalism (e.g., Bonial et al., 2020; Bonn et al., 2020; Naseem et al., 2021), and AMR parsing is an active and successful area of research (e.g., Zhang et al., 2019b; Bevilacqua et al., 2021; Xia et al., 2021; Drozdov et al., 2022). Finally, given the high quality of existing AMR parsers (for language), we do not need manual AMR annotations for images, and can rely on existing image–caption datasets to create high-quality silver data for image-to-AMR parsing. In summary, we make the following contributions:
- We introduce the novel problem of parsing images into Abstract Meaning Representations, a widely-adopted, linguistically-motivated graph formalism, and propose the first image-to-AMR parser model for the task.
- We present a detailed analysis and comparison between scene graphs and AMRs with respect to the nature of entities and relations they capture, the results of which further motivate research into the use of AMRs for better image understanding.
- Inspired by work on multi-sentence AMR, we propose a graph-to-graph transformation algorithm that combines the meanings of several image captions into image-level meta-AMR graphs. The motivation behind generating meta-AMRs is to build a graph that covers most of the entities, predicates, and semantic relations contained in the individual caption AMRs.
Our analyses suggest that AMRs encode aspects of an image's content that are not captured by the commonly-used scene graphs. Our initial results on re-purposing a text-to-AMR parser for image-to-AMR parsing, as well as on creating image-level meta-AMRs, point to exciting future research directions for improved scene understanding.
2 Motivation: AMRs vs. Scene Graphs
Scene graphs (SGs) are a widely-adopted graph formalism for representing the semantic content of an image. Scene graphs have been shown useful for various downstream tasks, such as image captioning (Yang et al., 2019; Li and Jiang, 2019; Zhong et al., 2020), visual question answering (Zhang et al., 2019a; Hildebrandt et al., 2020; Damodaran et al., 2021), and image retrieval (Johnson et al., 2015b; Schuster et al., 2015; Wang et al., 2020; Schroeder and Tripathi, 2020). However, learning to automatically generate SGs requires expensive manual annotations (object bounding boxes and their relations). SGs were also shown to be highly biased in the entity and relation types that they capture. For example, an analysis by Zellers et al. (2018) reveals that clothing (e.g., dress) and object/body parts (e.g., eyes, wheel) make up over one-third of the entity instances in the SGs corresponding to Visual Genome images (Krishna et al., 2016), and that more than 90% of all relation instances belong to the two categories of geometric (e.g., behind) and possessive (e.g., have).
One advantage of AMR graphs is that we can draw on supervision through the captions associated with images. Nonetheless, the question remains as to what types of entities and relations are encoded by AMR graphs, and how these differ from SGs. To answer this question, we follow an approach similar to that of Zellers et al. (2018), and categorize the entities and relations in the SG and AMR graphs corresponding to a sample of 50K images. We use the same categories as Zellers et al., but add a few new ones to capture relation types specific to AMRs, namely, Attribute (small), Quantifier (few), Event (soccer), and AMR-specific (date-entity). Details of our categorization process are provided in Appendix A.
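As a rough illustration of this kind of tally (not the exact procedure of Appendix A), the sketch below counts relation categories over a collection of AMR graphs, using a small hypothetical label-to-category mapping.

# Illustrative sketch of tallying relation categories over a set of AMR
# graphs; the label-to-category mapping is a small hypothetical stand-in
# for the full mapping described in Appendix A.
from collections import Counter
import penman

RELATION_CATEGORIES = {
    ":location": "geometric",
    ":mod": "attribute",
    ":quant": "quantifier",
    ":ARG0": "semantic",
    ":ARG1": "semantic",
    ":part-of": "possessive",
}

def relation_histogram(amr_strings):
    counts = Counter()
    for s in amr_strings:
        g = penman.decode(s)
        # Count both node-to-node edges and constant-valued attributes.
        for _, role, _ in g.edges() + g.attributes():
            counts[RELATION_CATEGORIES.get(role, "other")] += 1
    return counts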
Figure 2 shows the distribution of instances for each Entity and Relation category, compared across SG and AMR graphs. AMRs tend to encode a more diverse set of relations, and in particular capture more of the abstract semantic relations that are missing from SGs. This is expected, because our caption-generated AMRs by design capture the essential meaning of the image descriptions and, as such, encode how people perceive and describe scenes. In contrast, SGs are designed to capture the content of an image, including regions representing objects and (mainly spatial/geometric) visually-observable relations; see Fig. 1 for the SG and AMR graphs corresponding to an image. In the context of Entities, and in a major departure from SGs, (object/body) parts are less frequently encoded in AMRs, pointing to the well-known whole-object bias in how people perceive and describe scenes (Markman, 1990; Fei-Fei et al., 2007). In contrast, location is more frequent in AMRs.

Figure 2: Statistics on a selected set of top-frequency Entity and Relation categories, extracted from the AMR and SG graphs corresponding to around 50K images that appear in both Visual Genome and MSCOCO.
The focus of AMRs on abstract content suggests that they have the potential for improving downstream tasks, especially when the task requires an understanding of the higher-level semantics of an image. Interestingly, a recent study showed that using AMRs as an intermediate representation for textual SG parsing helps improve the quality of the parsed SGs (Choi et al., 2022), even though AMRs and SGs encode qualitatively different information. Since AMRs tend to capture higher-level semantics, we propose to use them as the final image representation. The question remains as to how difficult it is to directly learn such representations from images. The rest of the paper focuses on answering this question.
3 Method
3.1 Parsing Images into AMR Graphs
We develop image-to-AMR parsers based on a state-of-the-art seq2seq text-to-AMR parser, SPRING (Bevilacqua et al., 2021), and a multimodal VL-BART (Cho et al., 2021). Both are transformer-based architectures with a bidirectional encoder and an auto-regressive decoder. SPRING extends a pre-trained seq2seq model, BART (Lewis et al., 2020), by fine-tuning it on AMR parsing and generation. Next, we describe our models, input representation, and training.
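Since SPRING is obtained by fine-tuning BART, the seq2seq text-to-AMR interface it builds on can be sketched with the Hugging Face BART classes. The checkpoint below is a generic one used only for illustration; without SPRING's fine-tuned weights and AMR-extended vocabulary the output will not be a valid AMR.

# Sketch of the seq2seq interface underlying text-to-AMR parsing: a BART
# encoder-decoder generates a linearized output string from a sentence.
# "facebook/bart-base" is a generic checkpoint, not the SPRING model.
from transformers import BartForConditionalGeneration, BartTokenizer

tokenizer = BartTokenizer.from_pretrained("facebook/bart-base")
model = BartForConditionalGeneration.from_pretrained("facebook/bart-base")

inputs = tokenizer("A man riding a wave on top of a surfboard.", return_tensors="pt")
output_ids = model.generate(**inputs, max_length=128)
print(tokenizer.decode(output_ids[0], skip_special_tokens=True))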
Models. We build two variants of our image-to-AMR parser, as depicted in Fig. 3(a) and (b).

Our first model, which we refer to as IMG2AMRdirect, modifies SPRING by replacing BART with its vision-and-language counterpart, VL-BART (Cho et al., 2021). VL-BART extends BART with visual understanding ability through fine-tuning on multiple vision-and-language tasks. With this modification, our model can receive visual features (plus text) as input, and generate linearized AMR graphs.
Figure 3: Model architecture for our two image-to-AMR models: (a) IMG2AMRdirect, a direct model that uses a single seq2seq encoder–decoder to generate linearized AMRs from input images; and (b) IMG2AMR2stage, a two-stage model containing two independent seq2seq components. g and r stand for global and region features, q for tag embeddings, and n for the embeddings of the predicted nodes. The input and output spaces of the decoders come from the AMR vocabulary.
Our second model, inspired by text-to-graph AMR parsers (e.g., Zhang et al., 2019b; Xia et al., 2021), generates linearized AMRs in two stages, by first predicting the nodes and then the relations. Specifically, we first predict the nodes of the linearized AMR for a given image. These predicted nodes are then fed (along with the image) as input into a second seq2seq model that generates a linearized AMR (effectively adding the relations). We refer to this model as IMG2AMR2stage.
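To make the data flow concrete, the following is a schematic sketch of the two-stage inference just described; node_predictor and graph_generator are hypothetical stand-ins for the stage-1 and stage-2 seq2seq models, each mapping its inputs to a token sequence.

# Schematic sketch of the IMG2AMR2stage inference flow; the two callables
# are hypothetical stand-ins for the stage-1 and stage-2 seq2seq models.
from typing import Callable, List, Sequence

def img2amr_two_stage(
    image_features: Sequence[float],
    node_predictor: Callable[[Sequence[float]], List[str]],
    graph_generator: Callable[[Sequence[float], List[str]], List[str]],
) -> List[str]:
    # Stage 1: predict the AMR concept nodes from the image alone,
    # e.g. ["smile-01", "girl", "hold-01", "cat"] as in Fig. 3(b).
    nodes = node_predictor(image_features)
    # Stage 2: condition on the image plus the predicted nodes to generate
    # the full linearized AMR token sequence (nodes and relations).
    return graph_generator(image_features, nodes)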
Input Representation. To represent images, we follow VL-BART, which takes the output of Faster R-CNN (Ren et al., 2015) (i.e., region features and coordinates for 36 regions) and projects them onto d = 768 dimensional vectors via two separate fully-connected layers. Faster R-CNN region features are obtained via training for visual object and attribute classification (Anderson et al., 2018) on Visual Genome. The visual input to our model is composed of position-aware embeddings for the 36 regions, plus a global image-level feature (r and g in Fig. 3). To get the position-aware embeddings for the regions, we add together the projected region and coordinate embeddings. To get the global image feature, we use the output of the final hidden layer in ResNet-101 (He et al., 2016), which is passed through the same fully-connected layer as the regions to obtain a 768-dimensional vector.
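The following is a minimal PyTorch sketch of this input construction. The 2048-dimensional region/global features and 4-dimensional box coordinates are assumptions made for illustration (only the 36 regions and the d = 768 output size are stated above), and VisualInputEmbedder is a hypothetical name rather than a module from VL-BART or SPRING.

# Sketch of the visual input construction: project region features and box
# coordinates to 768-d, sum them, and prepend a projected global feature.
import torch
import torch.nn as nn

class VisualInputEmbedder(nn.Module):
    def __init__(self, region_dim=2048, coord_dim=4, d_model=768):
        super().__init__()
        self.region_proj = nn.Linear(region_dim, d_model)  # also used for the global feature
        self.coord_proj = nn.Linear(coord_dim, d_model)

    def forward(self, region_feats, region_coords, global_feat):
        # region_feats: (batch, 36, 2048), region_coords: (batch, 36, 4)
        # global_feat:  (batch, 2048), e.g. from ResNet-101's final hidden layer
        regions = self.region_proj(region_feats) + self.coord_proj(region_coords)
        global_emb = self.region_proj(global_feat).unsqueeze(1)   # (batch, 1, 768)
        return torch.cat([global_emb, regions], dim=1)            # (batch, 37, 768)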
Training. To benefit from transfer learning, we initialize the encoder and decoder weights of both our models from the pre-trained VL-BART. This is a reasonable initialization strategy, given that VL-BART has been pre-trained on input similar to ours. Moreover, a large number of AMR labels are drawn from the English vocabulary, so the pre-training of VL-BART should also be appropriate for AMR generation. We fine-tune our models on the task of image-to-AMR generation, using images paired with their automatically-generated AMR graphs. We consider two alternative AMR representations: (a) caption AMRs, created directly from captions associated with images (see Section 4 for details); and (b) image-level meta-AMRs, constructed through an algorithm we describe below in Section 3.2. We perform experiments with either caption or meta-AMRs, where we train and test on the same type of AMRs. For the various stages of training, we use the cross-entropy loss between the model predictions and the ground-truth labels for each token, where the model predictions are obtained greedily, i.e., by choosing the token with the maximum score at each step of the sequence generation.
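As a sketch of the token-level objective (with illustrative tensor names, not the authors' training code), the loss over a batch of decoder outputs can be computed as follows.

# Token-level cross-entropy between decoder predictions and gold
# linearized-AMR tokens; tensor names and pad_id are illustrative.
import torch
import torch.nn.functional as F

def amr_token_loss(logits: torch.Tensor, targets: torch.Tensor, pad_id: int = 1) -> torch.Tensor:
    # logits: (batch, seq_len, vocab_size); targets: (batch, seq_len)
    # Greedy prediction at each step corresponds to logits.argmax(dim=-1).
    return F.cross_entropy(
        logits.reshape(-1, logits.size(-1)),
        targets.reshape(-1),
        ignore_index=pad_id,  # skip padding positions
    )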
3.2 Learning per-Image meta-AMR Graphs
Recall that, in order to collect a dataset of images paired with their AMR graphs, we rely on image–caption datasets such as MSCOCO. Specifically, we use a pre-trained AMR parser to generate AMR graphs from each caption of an image. Images can be described in many different ways; e.g., each image in MSCOCO comes with five different human-generated captions. We hypothesize that these captions collectively represent the content of the image they are describing, and as such we propose to also combine the caption AMRs into image-level meta-AMR graphs, through a merge-and-refine process that we explain next.
Prior work has used graph-to-graph transformations for merging sentence-level AMRs into document-level AMRs for abstractive and multi-document summarization.
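Purely to illustrate the basic idea of merging, and not the paper's merge-and-refine algorithm, the sketch below collapses variables that share a concept label across caption AMRs and takes the union of their triples (again assuming the penman library).

# Illustrative (naive) merge of caption AMRs: variables that share a concept
# label are collapsed to one node and the triples are unioned. This is a
# simplified stand-in, not the merge-and-refine procedure from the paper.
import penman

def naive_merge(caption_amrs):
    merged_triples = []
    concept_to_var = {}          # concept label -> canonical variable name
    for amr_string in caption_amrs:
        g = penman.decode(amr_string)
        rename = {}
        for var, _, concept in g.instances():
            rename[var] = concept_to_var.setdefault(concept, f"m{len(concept_to_var)}")
        for src, role, tgt in g.triples:
            triple = (rename.get(src, src), role, rename.get(tgt, tgt))
            if triple not in merged_triples:
                merged_triples.append(triple)
    # Note: the merged graph may need re-rooting and further refinement
    # before it can be re-serialized with penman.encode.
    return penman.Graph(merged_triples)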