Better Smatch = Better Parser? AMR evaluation is not so simple anymore
Juri Opitz
Dept. of Computational Linguistics
Heidelberg University
69120 Heidelberg
opitz.sci@gmail.com
Anette Frank
Dept. of Computational Linguistics
Heidelberg University
69120 Heidelberg
frank@cl.uni-heidelberg.de
Abstract
Recently, astonishing advances have been observed in AMR parsing, as measured by the structural SMATCH metric. In fact, today's systems achieve performance levels that seem to surpass estimates of human inter-annotator agreement (IAA). Therefore, it is unclear how well SMATCH (still) relates to human estimates of parse quality, as in this situation potentially fine-grained errors of similar weight may impact the AMR's meaning to different degrees. We conduct an analysis of two popular and strong AMR parsers that, according to SMATCH, reach quality levels on par with human IAA, and assess how human quality ratings relate to SMATCH and other AMR metrics. Our main findings are: i) While high SMATCH scores indicate otherwise, we find that AMR parsing is far from being solved: we frequently find structurally small, but semantically unacceptable errors that substantially distort sentence meaning. ii) Considering high-performance parsers, better SMATCH scores may not necessarily indicate consistently better parsing quality. To obtain a meaningful and comprehensive assessment of quality differences of parse(r)s, we recommend augmenting evaluations with macro statistics, use of additional metrics, and more human analysis.
1 Introduction
Abstract Meaning Representation (AMR), proposed by Banarescu et al. (2013), aims at capturing the meaning of texts in an explicit graph format. Nodes describe entities, events, and states, while edges express key semantic relations, such as ARGx (indicating semantic roles as in PropBank (Palmer et al., 2005)), or instrument and cause.
Although the development of parsers can be driven by multiple desiderata, better performance on benchmarks often serves as the main criterion.
Figure 1: Sketch of the AMR IAA ball (radius r = 1 - IAA). The center (P1) is a reference AMR, while P2, P3, and P4 are candidates. Any AMR x from the ball has high structural SMATCH agreement with P1, i.e., SMATCH(x, P1) ≥ estimated human IAA. However, such AMRs may fall into different categories: H (green cloud) contains correct AMR alternatives. Its superset A (light cloud) contains acceptable AMRs that may misrepresent the sentence meaning up to a minor degree. Other parses from the ball, e.g., P2, misrepresent the sentence's meaning, despite possibly having higher SMATCH agreement with the reference than all other candidates.

For AMR, this goal is typically measured using SMATCH (Cai and Knight, 2013) against a reference corpus. The metric measures to what extent the reference has been reconstructed by the parser.
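To make the notion of structural overlap concrete, the sketch below computes a Smatch-style F1 over the triples of two toy graphs. It is a deliberate simplification and not the authors' code: real SMATCH additionally searches for the best alignment between the variables of the two graphs (e.g., via hill-climbing), whereas this version assumes the variable mapping is already fixed; the triples for "The boy wants to go" are our own illustration.

    # Smatch-style score as F1 over matching triples, assuming the variable
    # alignment between candidate and reference is already fixed.
    # (Real SMATCH searches over variable mappings, e.g., by hill-climbing.)

    def triple_f1(candidate, reference):
        cand, ref = set(candidate), set(reference)
        matching = len(cand & ref)
        precision = matching / len(cand) if cand else 0.0
        recall = matching / len(ref) if ref else 0.0
        if precision + recall == 0.0:
            return 0.0
        return 2 * precision * recall / (precision + recall)

    # Triples (relation, source, target) for "The boy wants to go".
    reference = [
        ("instance", "w", "want-01"), ("instance", "b", "boy"),
        ("instance", "g", "go-01"),
        ("ARG0", "w", "b"), ("ARG1", "w", "g"), ("ARG0", "g", "b"),
    ]
    candidate = [
        ("instance", "w", "want-01"), ("instance", "b", "boy"),
        ("instance", "g", "go-01"),
        ("ARG0", "w", "b"), ("ARG1", "w", "g"),   # misses :ARG0 of go-01
    ]

    print(round(triple_f1(candidate, reference), 2))   # -> 0.91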
However, thanks to astonishing recent advances in AMR parsing, mainly powered by the language modeling and fine-tuning paradigm (Bevilacqua et al., 2021), parsers now achieve benchmark scores that surpass IAA estimates.[1] Therefore, it is difficult to assess whether (fine) differences in SMATCH scores i) can be attributed to minor but valid divergences in interpretation or AMR structure, as they may also occur in human assessments, or ii) constitute significant meaning-distorting errors. This fundamental issue is outlined in Figure 1.
[1] Banarescu et al. (2013) find that an (optimistic) average annotator vs. consensus IAA (SMATCH) was 0.83 for newswire and 0.79 for web text. When newly trained annotators doubly annotated web text sentences, their annotator vs. annotator IAA was 0.71. Recent BART- and T5-based models range between 0.82 and 0.84 SMATCH F1 scores.
Four parses are located in the ball B(P1, SMATCH) of estimated IAA, with (gold) parse P1 being the center. However, the true set of possible human candidates H is very likely much smaller than the ball, and its shape is unknown.[2] Besides, a superset of H is a set of acceptable parses A, i.e., parses that may have a small flaw which does not significantly distort the sentence meaning. Now, it can indeed happen that parse P2, as opposed to P3, has a lower distance to reference P1, i.e., to the center of B(SMATCH), but is not found in A ⊇ H, which marks it as an inaccurate candidate. On the other hand, P4 is contained in A, but not in H, which would make it acceptable, but less preferable than P3.
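In our own notation (the paper does not formalize it this way), the situation can be summarized in two lines, where \widehat{\mathrm{IAA}} denotes the estimated inter-annotator agreement:

    % Ball of candidates whose SMATCH agreement with reference P1 is at
    % least the estimated IAA (i.e., radius r = 1 - IAA):
    \[
      B(P_1) = \{\, G \;:\; \mathrm{SMATCH}(G, P_1) \ge \widehat{\mathrm{IAA}} \,\},
      \qquad H \subseteq A .
    \]
    % Higher SMATCH within the ball does not guarantee acceptability:
    \[
      \mathrm{SMATCH}(P_2, P_1) > \mathrm{SMATCH}(P_3, P_1)
      \;\not\Rightarrow\; P_2 \in A .
    \]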
Research questions
Triggered by these considerations, this paper tackles the key questions: Do high-performance AMR parsers indeed deliver accurate semantic graphs, as suggested by high benchmark scores that surpass human IAA estimates? Does a higher SMATCH against a single reference necessarily indicate better overall parse quality? And what steps can we take to mitigate potential issues when assessing the true performance of high-performance parsers?
Paper structure
After discussing background and related work (Section 2), we describe our data setup and give a survey of AMR metrics (Section 3). We then evaluate the metrics with regard to scoring i) corpora (Section 5), ii) AMR pairs (Section 6), and iii) cross-metric differences in their ranking behavior (Section 7). We conclude by discussing limitations of our study (Section 8), giving recommendations, and outlining future work (Section 9).[3]
2 Background and related work
AMR parsing and applications
Over the years, we have observed a great diversity in approaches to AMR parsing, ranging from graph prediction with a pipeline (Flanigan et al., 2014) or a neural network (Lyu and Titov, 2018; Cai and Lam, 2020) to transition-based parsing (Wang et al., 2015) and sequence-to-sequence parsing, e.g., by exploiting large parallel corpora (Xu et al., 2020). A recent trend is to exploit the knowledge in large pre-trained sequence-to-sequence language models such as T5 (Raffel et al., 2019) or BART (Lewis et al., 2020), by fine-tuning them on AMR corpora, as showcased, e.g., by Bevilacqua et al. (2021). When measured in SMATCH points, such models are on par with, or tend to surpass, estimates of human AMR agreement (Banarescu et al., 2013).
AMR, by virtue of its properties as a graph-based abstract meaning representation, is attractive for many key NLP tasks, such as machine translation (Song et al., 2019), summarization (Dohare et al., 2017; Liao et al., 2018), NLG evaluation (Opitz and Frank, 2021; Manning and Schneider, 2021; Ribeiro et al., 2022), and measuring semantic sentence similarity (Opitz and Frank, 2022).

[2] Under the unrealistic assumptions of an omniscient annotator and AMR being the ideal way of meaning representation, one might require that H always has exactly one element.

[3] Code and data for our study are available at https://github.com/Heidelberg-nlp/AMRParseEval.
Metric evaluation for MT evaluation
Metric evaluation for machine translation (MT) has received much attention in recent years (Ma et al., 2019; Mathur et al., 2020; Freitag et al., 2021). When evaluating metrics for MT evaluation, it is generally agreed that the main goal of an MT metric is high correlation with human ratings, mainly with respect to rating the adequacy of a candidate against one (or a set of) gold reference(s). A recent shared task (Freitag et al., 2021) meta-evaluates popular metrics such as BLEU (Papineni et al., 2002) or BLEURT (Sellam et al., 2020) by comparing the metrics' scores to human scores for systems and individual segments. They find that the performance of each metric varies depending on the underlying domain (e.g., TED talks or news), that most metrics struggle to penalize translations with errors in reversing negation or sentiment polarity, and that metrics show lower correlations for semantic phenomena including subordination, named entities, and terminology. This indicates that there is potential for cross-pollination: clearly, AMR metric evaluation may profit from the vast experience of metric evaluation for other tasks. On the other hand, MT evaluation may profit from relating semantic representations, to better differentiate semantic errors with respect to their type and severity. A first step in this direction may have been made by Zeidler et al. (2022), who assess the behaviour of MT metrics, AMR metrics, and hybrid metrics when analyzing sentence pairs that differ in only one linguistic phenomenon.
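To illustrate what such segment-level meta-evaluation boils down to, here is a toy sketch with made-up numbers (not data from the cited shared task): one collects a metric's scores and human ratings for the same candidate outputs and reports a correlation statistic.

    # Toy segment-level meta-evaluation: correlate a metric's scores with
    # human ratings for the same five candidate segments. All numbers are
    # invented placeholders for illustration only.
    from scipy.stats import kendalltau, pearsonr

    metric_scores = [0.81, 0.64, 0.92, 0.55, 0.73]
    human_ratings = [4.0, 3.0, 4.5, 2.0, 3.5]

    r, _ = pearsonr(metric_scores, human_ratings)
    tau, _ = kendalltau(metric_scores, human_ratings)
    print(f"Pearson r = {r:.2f}, Kendall tau = {tau:.2f}")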
3 Study Setup: Data creation and AMR metric overview
In this section, we first select data and two popular high-performance parsers for creating candidate AMRs. Then we describe the human quality annotation, and give an overview of the automatic AMR metrics that we consider in our subsequent studies.

-----------------Reference AMR and Sentence------------------
Sentence: "Looking over to the flag"

(l / look-over-06
    :ARG1 (f / flag))
---------------------Candidate parses------------------------
Candidate 1:                     Candidate 2:
(l / look-01                     (z0 / look-01
    :direction (o / over)            :ARG2 (z1 / flag)
    :destination (f / flag))         :direction (z2 / over))
---------------------------Eval------------------------------
Smatch (ref, cand): both score 0.2 (indicates low quality)
Human (sent, cand): both are acceptable
Human (cand, cand): no preference
--------------------------------------------------------------
Figure 2: Data example: acceptable, low SMATCH. That is, P ∈ H but P ∉ B(SMATCH, ref).
Parsers and corpora
We choose the AMR3 benchmark[4] and the literary texts from the freely available Little Prince corpus.[5] As parsers, we choose T5- and BART-based systems, both on par with human IAA estimates, where BART achieves higher scores on AMR3.[6] We proceed as follows: we 1. parse the corpora with the T5 and BART parsers and use SMATCH to select diverging parse candidate pairs, and 2. sample 200 of those pairs, both for AMR3 and for Little Prince (i.e., 800 AMR candidates in total).
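A minimal sketch of this selection step is shown below, under assumptions of ours rather than the paper's exact procedure: we assume two lists of candidate graphs in Penman string format (one per parser, aligned by sentence), that the `smatch` PyPI package exposes the `get_amr_match` and `compute_f` helpers, and an illustrative divergence threshold.

    # Select sentence pairs on which the two parsers produce diverging AMRs.
    # Assumes the `smatch` package's get_amr_match / compute_f helpers;
    # depending on the smatch version, internal caches may need resetting
    # between calls. Threshold and sample size are illustrative choices.
    import random
    import smatch

    def smatch_f1(amr_a: str, amr_b: str) -> float:
        match, n_a, n_b = smatch.get_amr_match(amr_a, amr_b)
        _, _, f1 = smatch.compute_f(match, n_a, n_b)
        return f1

    def select_diverging_pairs(t5_graphs, bart_graphs, threshold=0.95, k=200):
        diverging = [
            (a, b) for a, b in zip(t5_graphs, bart_graphs)
            if smatch_f1(a, b) < threshold   # the two parsers disagree
        ]
        return random.sample(diverging, min(k, len(diverging)))

The threshold of 0.95 is arbitrary; any rule that flags structurally diverging parser outputs would serve the same purpose.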
3.1 Annotation dimensions
Annotation dimension I: pairwise ranking
The annotator is presented with the sentence and two candidate graphs, and assigns one of three labels together with a free-text rationale. The labels are +1 (prefer the first graph), -1 (prefer the second graph), or 0 (both graphs are of the same or very similar quality).
Annotation dimension II: parse acceptability
In addition, each graph is independently assigned a single label, considering only the sentence that it is supposed to represent. Here, the annotator makes a binary decision: +1 if the parse is acceptable, or 0 if the graph is not acceptable. A graph that is acceptable is fully valid, or may allow a very minor meaning deviation from the sentence, or a slightly weird but allowed interpretation that may differ from a normative interpretation. All other graphs are deemed not acceptable (0).
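For concreteness, one possible record layout for a single annotated item, covering both dimensions, is sketched below; the field names are our own and not the paper's released annotation format.

    # One annotation item: a sentence, two candidate graphs, the pairwise
    # preference (dimension I) and per-graph acceptability (dimension II).
    # Field names are illustrative, not taken from the released data.
    from dataclasses import dataclass

    @dataclass
    class AnnotationItem:
        sentence: str
        graph_a: str          # candidate AMR in Penman notation
        graph_b: str
        preference: int       # +1: prefer graph A, -1: prefer graph B, 0: tie
        rationale: str        # free-text justification for the ranking
        acceptable_a: int     # 1 = acceptable, 0 = not acceptable
        acceptable_b: int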
Example: Acceptable candidates, low SMATCH
Figure 2 shows an example of two graphs that have very low structural overlap with the reference (SMATCH = 0.2), but are acceptable. Here, the candidate graphs both differ from the reference because they tend towards a more conservative interpretation, using the more general look-01 predicate instead of the look-over-06 predicate in the human reference. In fact, the meaning of the reference can be considered, albeit valid, slightly weird, since look-over-06 is defined in PropBank as examining something idly, which is a more 'specific' interpretation of the sentence in question. On the other hand, the candidate graphs differ from each other in the semantic role assigned to flag. In the first, flag is the destination of the looking action (which can be accepted), while in the second we find a more questionable, but still acceptable, interpretation: that flag is an attribute of the thing that is looked at.

[4] LDC corpus LDC2020T02
[5] From https://amr.isi.edu/download.html
[6] See https://github.com/bjascob/amrlib-models for more benchmarking statistics.

----------------Reference AMR (excerpt)-----------------------
(i2 / imagine-01
    :ARG0 (y / you)
    :ARG1 (a / amaze-01
        :ARG1 (i / i)
        :time-of (w / wake-01
            :ARG0 (v / voice
                :mod (o / odd)
                :mod (l / little))
            :ARG1 i)))
----------------Candidate parse (excerpt)---------------------
(ii / imagine-01
    :ARG0 (y / you)
    :ARG1 (a / amaze-01
        :ARG0 (v / voice
            :mod (l / little)
            :mod (o / odd))
        :ARG1 (ii2 / i)))
Means: (..) imagine my amazement (..) by an odd little voice
Should mean: (..) imagine my amazement (..) when I was awakened by an odd little voice
---------------------------Eval------------------------------
Smatch (ref, cand): scores 0.88 (indicates high quality)
Human (sent, cand): not acceptable
--------------------------------------------------------------
Figure 3: Data example excerpt showing an unacceptable parse with high SMATCH. That is, P ∉ A ⊇ H but P ∈ B(SMATCH, ref).
Example: Candidate not acceptable, high SMATCH
An inverse example (high SMATCH, unacceptable) is shown in Figure 3, where the parse omits awaken. Although the factuality of the sentence is not (much) changed, and the structural deviation may legitimately imply that the odd voice is the cause of the amazement, the parse misses a relevant piece of meaning and is therefore rated unacceptable.
Label statistics
will be discussed in Section 5, where the human annotations are also contrasted with the parser rankings produced by the automatic metrics.
3.2 Metric overview
We distinguish metrics targeting monolingual AMR parsing evaluation from multi-purpose AMR metrics. AMR metrics that are designed for the evaluation of monolingual parsers typically have two features