
Four parses are located in the ball B(P1, SMATCH) of estimated IAA, with the (gold) parse P1 at its center. However, the true set of possible human candidates H is very likely much smaller than the ball, and its shape is unknown.² A superset of H is the set of acceptable parses A, i.e., parses that may contain a small flaw which does not significantly distort the sentence meaning. It can indeed happen that parse P2, as opposed to P3, has a lower distance to the reference P1, i.e., to the center of B(P1, SMATCH), but is not found in A ⊇ H, which marks it as an inaccurate candidate. On the other hand, P4 is contained in A, but not in H, which makes it acceptable, but less preferable than P3.

² Under the unrealistic assumptions of an omniscient annotator and AMR being the ideal way of representing meaning, one might require that H always has exactly one element.
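As a concrete illustration of the ball criterion, the following minimal sketch (ours, not the paper's code) uses the smatch Python package to test whether a candidate parse falls inside B(P1, SMATCH) for a reference P1; the two example AMRs and the IAA-based radius of 0.88 F1 are hypothetical placeholders.

```python
# Minimal sketch (illustration only): test whether a candidate AMR parse lies
# within the ball B(P1, SMATCH) around a reference parse P1, where the radius
# is an assumed IAA estimate in SMATCH F1 points.
# Requires the `smatch` package (pip install smatch).
import smatch

REFERENCE_P1 = "(w / want-01 :ARG0 (b / boy) :ARG1 (g / go-02 :ARG0 b))"
CANDIDATE = "(w / want-01 :ARG0 (b / boy) :ARG1 (g / go-02))"

IAA_RADIUS_F1 = 0.88  # hypothetical IAA estimate used as the ball radius


def smatch_f1(candidate: str, reference: str) -> float:
    """Return the SMATCH F1 between a candidate and a reference AMR (PENMAN strings)."""
    cand = " ".join(candidate.split())  # smatch expects each AMR on a single line
    ref = " ".join(reference.split())
    best_match, n_cand, n_ref = smatch.get_amr_match(cand, ref)
    smatch.match_triple_dict.clear()  # reset the per-pair cache, as smatch's own loop does
    _, _, f1 = smatch.compute_f(best_match, n_cand, n_ref)
    return f1


def in_ball(candidate: str, reference: str, radius_f1: float = IAA_RADIUS_F1) -> bool:
    """A candidate lies inside B(reference, SMATCH) if its F1 is at least the radius."""
    return smatch_f1(candidate, reference) >= radius_f1


print(in_ball(CANDIDATE, REFERENCE_P1))  # True if the candidate is within the estimated IAA radius
```

Of course, as argued above, such a purely distance-based test cannot tell an acceptable candidate (one in A) from an inaccurate one that merely lies close to the reference.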
Research questions
Triggered by these considerations, this paper tackles the key questions: Do high-performance AMR parsers indeed deliver accurate semantic graphs, as suggested by high benchmark scores that surpass human IAA estimates? Does a higher SMATCH against a single reference necessarily indicate better overall parse quality? And what steps can we take to mitigate potential issues when assessing the true performance of high-performance parsers?
Paper structure
After discussing background and related work (Section 2), we describe our data setup and give a survey of AMR metrics (Section 3). We then evaluate the metrics with regard to scoring i) corpora (Section 5), ii) AMR pairs (Section 6) and iii) cross-metric differences in their ranking behavior (Section 7). We conclude by discussing limitations of our study (Section 8), giving recommendations and outlining future work (Section 9).³

³ Code and data for our study are available at https://github.com/Heidelberg-nlp/AMRParseEval.
2 Background and related work
AMR parsing and applications
Over the years, we have observed a great diversity in approaches to AMR parsing, ranging from graph prediction with a pipeline (Flanigan et al., 2014) or a neural network (Lyu and Titov, 2018; Cai and Lam, 2020) to transition-based parsing (Wang et al., 2015) and sequence-to-sequence parsing, e.g., by exploiting large parallel corpora (Xu et al., 2020). A recent trend is to exploit the knowledge in large pre-trained sequence-to-sequence language models such as T5 (Raffel et al., 2019) or BART (Lewis et al., 2020) by fine-tuning them on AMR corpora, as showcased, e.g., by Bevilacqua et al. (2021). Such models are on par with, or tend to surpass, estimates of human AMR agreement (Banarescu et al., 2013) when measured in SMATCH points.
AMR, by virtue of its properties as a graph-based abstract meaning representation, is attractive for many key NLP tasks, such as machine translation (Song et al., 2019), summarization (Dohare et al., 2017; Liao et al., 2018), NLG evaluation (Opitz and Frank, 2021; Manning and Schneider, 2021; Ribeiro et al., 2022) and measuring semantic sentence similarity (Opitz and Frank, 2022).
Metric evaluation for MT evaluation
Metric evaluation for machine translation (MT) has received much attention over recent years (Ma et al., 2019; Mathur et al., 2020; Freitag et al., 2021). When evaluating metrics for MT evaluation, it is generally agreed that the main goal of an MT metric is high correlation with human ratings, mainly with respect to rating the adequacy of a candidate against one gold reference (or a set of references). A recent shared task (Freitag et al., 2021) meta-evaluates popular metrics such as BLEU (Papineni et al., 2002) or BLEURT (Sellam et al., 2020) by comparing the metrics' scores to human scores for systems and individual segments. They find that the performance of each metric varies depending on the underlying domain (e.g., TED talks or news), that most metrics struggle to penalize translations with errors that reverse negation or sentiment polarity, and that metrics show lower correlations for semantic phenomena including subordination, named entities and terminology. This indicates potential for cross-pollination: clearly, AMR metric evaluation may profit from the vast experience gathered in metric evaluation for other tasks. On the other hand, MT evaluation may profit from relating semantic representations in order to better differentiate semantic errors with respect to their type and severity. A first step in this direction may have been made by Zeidler et al. (2022), who assess the behavior of MT metrics, AMR metrics, and hybrid metrics when analyzing sentence pairs that differ in only one linguistic phenomenon.
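To make this meta-evaluation setting concrete, the sketch below (our illustration, not code from the shared task) computes segment-level Pearson and Kendall correlations between a metric's scores and human adequacy ratings with scipy; all score values are hypothetical placeholders.

```python
# Minimal sketch of segment-level metric meta-evaluation: correlate a metric's
# scores with human adequacy ratings. The numbers are hypothetical placeholders,
# not data from any shared task.
from scipy.stats import pearsonr, kendalltau

human_ratings = [0.90, 0.35, 0.70, 0.20, 0.85]  # human adequacy judgments per segment
metric_scores = [0.88, 0.41, 0.65, 0.30, 0.80]  # automatic metric scores per segment

pearson_r, _ = pearsonr(human_ratings, metric_scores)
kendall_tau, _ = kendalltau(human_ratings, metric_scores)
print(f"Pearson r = {pearson_r:.3f}, Kendall tau = {kendall_tau:.3f}")
```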
3 Study Setup: Data creation and AMR metric overview
In this section, we first select data and two popular high-performance parsers for creating candidate
AMRs. Then we describe the human quality an-