
Visual Semantic Parsing: From Images to Abstract Meaning Representation
Mohamed A. Abdelsalam1, Zhan Shi1,2*, Federico Fancellu3†, Kalliopi Basioti1,4*,
Dhaivat J. Bhatt1, Vladimir Pavlovic1,4, Afsaneh Fazly1
1Samsung AI Centre - Toronto, 2Queen’s University, 3 3M, 4Rutgers University
{m.abdelsalam, d.bhatt, a.fazly}@samsung.com, z.shi@queensu.ca,
f.fancellu0@gmail.com, {kalliopi.basioti, vladimir}@rutgers.edu
*Work done during an internship at Samsung AI Centre - Toronto.
†Work done while at Samsung AI Centre - Toronto.
Abstract
The success of scene graphs for visual scene understanding has brought attention
to the benefits of abstracting a visual input (e.g., an image) into a structured
representation, where entities (people and objects) are nodes connected by edges
specifying their relations. Building these representations, however, requires
expensive manual annotation in the form of images paired with their scene graphs
or frames, and these formalisms remain limited in the nature of the entities and
relations they can capture. In this paper, we propose to leverage a widely used
meaning representation from the field of natural language processing, the
Abstract Meaning Representation (AMR), to address these shortcomings. Compared
to scene graphs, which largely emphasize spatial relationships, our visual AMR
graphs are more linguistically informed, with a focus on higher-level semantic
concepts extrapolated from the visual input. Moreover, they allow us to generate
meta-AMR graphs that unify the information contained in multiple image
descriptions under one representation. Through extensive experimentation and
analysis, we demonstrate that we can re-purpose an existing text-to-AMR parser
to parse images into AMRs. Our findings point to important future research
directions for improved scene understanding.
1 Introduction
The ability to understand and describe a scene is fundamental for the
development of truly intelligent systems, including autonomous vehicles, robots
navigating an environment, or even simpler applications such as language-based
image retrieval. Much work in computer vision has focused on two key aspects of
scene understanding, namely, recognizing entities, including object detection
(Liu et al., 2016; Ren et al., 2015; Carion et al., 2020; Liu et al., 2020a) and
activity recognition (Herath et al., 2017; Kong and Fu, 2022; Li et al., 2018;
Gao et al., 2018), as well as understanding how entities are related to each
other, e.g., human–object interaction (Hou et al., 2020; Zou et al., 2021) and
relation detection (Lu et al., 2016; Zhang et al., 2017; Zellers et al., 2018).
A natural way of representing scene entities and their relations is in graph
form, so it is perhaps unsurprising that much work has focused on graph-based
scene representations, and especially on scene graphs (Johnson et al., 2015a).
Scene graphs encode the salient regions of an image (mainly objects) as nodes,
and the relations among them (mostly spatial in nature) as edges, both labelled
with natural language tags; see Fig. 1(b) for an example scene graph. Along the
same lines, Yatskar et al. (2016) propose to represent a scene as a
semantic-role-labelled frame, drawn from FrameNet (Ruppenhofer et al., 2016), a
linguistically motivated approach that draws on the semantic role labelling
literature.
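To make the formalism concrete, a minimal scene graph for an image of a man
riding a horse might consist of labelled triples such as the following (an
illustrative example of ours, not drawn from any annotated dataset):

    (man, riding, horse)
    (man, wearing, hat)
    (horse, on, grass)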
Scene graphs and situation frames can capture important aspects of an image,
yet they are limited in important ways. Both require expensive manual
annotation in the form of images paired with their corresponding scene graphs
or frames. Scene graphs in particular also suffer from being limited in the
nature of the entities and relations they capture (see Section 2 for a detailed
analysis). Ideally, we would like to capture event-level semantics (as in
situation recognition), but as a structured graph that captures a diverse set
of relations and goes beyond low-level visual semantics.
Inspired by this linguistically motivated image understanding research, we
propose to represent images using a well-known graph formalism for language
understanding, namely Abstract Meaning Representation (AMR; Banarescu et al.,
2013). Similarly to (visual) semantic role labeling, AMRs also represent “who
did what to whom, where, when, and how”.
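As an illustration (ours, using a hypothetical caption; the formalism itself
follows Banarescu et al., 2013), the sentence “A boy is throwing a ball” would
be represented by the following AMR in standard PENMAN notation, where throw-01
is a PropBank predicate sense and :ARG0/:ARG1 mark its agent and theme:

    (t / throw-01
       :ARG0 (b / boy)
       :ARG1 (b2 / ball))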