
Four parses are located in the ball B(P1, SMATCH) of estimated IAA, with the (gold) parse P1 at its center. However, the true set of possible human candidates H is very likely much smaller than the ball, and its shape is unknown.² A superset of H is the set of acceptable parses A, i.e., parses that may contain a small flaw which does not significantly distort the sentence meaning. It can indeed happen that parse P2, as opposed to P3, has a lower distance to the reference P1, i.e., to the center of B(P1, SMATCH), but is not found in A ⊇ H, which marks it as an inaccurate candidate. On the other hand, P4 is contained in A, but not in H, which makes it acceptable, but less preferable than P3.

² Under the unrealistic assumptions of an omniscient annotator and AMR being the ideal way of representing meaning, one might require that H always has exactly one element.
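As a concrete illustration of the ball criterion, the following minimal sketch (ours, not the paper's code) uses the smatch Python package to test whether a candidate parse falls inside B(P1, SMATCH) for a reference P1; the two example AMRs and the IAA-based radius of 0.88 F1 are hypothetical placeholders.

```python
# Minimal sketch (illustration only): test whether a candidate AMR parse lies
# within the ball B(P1, SMATCH) around a reference parse P1, where the radius
# is an assumed IAA estimate in SMATCH F1 points.
# Requires the `smatch` package (pip install smatch).
import smatch

REFERENCE_P1 = "(w / want-01 :ARG0 (b / boy) :ARG1 (g / go-02 :ARG0 b))"
CANDIDATE = "(w / want-01 :ARG0 (b / boy) :ARG1 (g / go-02))"

IAA_RADIUS_F1 = 0.88  # hypothetical IAA estimate used as the ball radius


def smatch_f1(candidate: str, reference: str) -> float:
    """Return the SMATCH F1 between a candidate and a reference AMR (PENMAN strings)."""
    cand = " ".join(candidate.split())  # smatch expects each AMR on a single line
    ref = " ".join(reference.split())
    best_match, n_cand, n_ref = smatch.get_amr_match(cand, ref)
    smatch.match_triple_dict.clear()  # reset the per-pair cache, as smatch's own loop does
    _, _, f1 = smatch.compute_f(best_match, n_cand, n_ref)
    return f1


def in_ball(candidate: str, reference: str, radius_f1: float = IAA_RADIUS_F1) -> bool:
    """A candidate lies inside B(reference, SMATCH) if its F1 is at least the radius."""
    return smatch_f1(candidate, reference) >= radius_f1


print(in_ball(CANDIDATE, REFERENCE_P1))  # True if the candidate is within the estimated IAA radius
```

Of course, as argued above, such a purely distance-based test cannot tell an acceptable candidate (one in A) from an inaccurate one that merely lies close to the reference.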
Research questions
Triggered by these considerations, this paper tackles the key questions: Do high-performance AMR parsers indeed deliver accurate semantic graphs, as suggested by high benchmark scores that surpass human IAA estimates? Does a higher SMATCH against a single reference necessarily indicate better overall parse quality? And what steps can we take to mitigate potential issues when assessing the true performance of high-performance parsers?
Paper structure
After discussing background and related work (Section 2), we describe our data setup and give a survey of AMR metrics (Section 3). We then evaluate the metrics with regard to scoring i) corpora (Section 5), ii) AMR pairs (Section 6) and iii) cross-metric differences in their ranking behavior (Section 7). We conclude by discussing limitations of our study (Section 8), giving recommendations and outlining future work (Section 9).³

³ Code and data for our study are available at https://github.com/Heidelberg-nlp/AMRParseEval.
2 Background and related work
AMR parsing and applications
Over the years, we have observed a great diversity in approaches to AMR parsing, ranging from graph prediction with a pipeline (Flanigan et al., 2014) or a neural network (Lyu and Titov, 2018; Cai and Lam, 2020) to transition-based parsing (Wang et al., 2015) and sequence-to-sequence parsing, e.g., by exploiting large parallel corpora (Xu et al., 2020). A recent trend is to exploit the knowledge in large pre-trained sequence-to-sequence language models such as T5 (Raffel et al., 2019) or BART (Lewis et al., 2020) by fine-tuning them on AMR corpora, as showcased, e.g., by Bevilacqua et al. (2021). Such models are on par with, or tend to surpass, estimates of human AMR agreement (Banarescu et al., 2013) when measured in SMATCH points.
AMR, by virtue of its properties as a graph-based abstract meaning representation, is attractive for many key NLP tasks, such as machine translation (Song et al., 2019), summarization (Dohare et al., 2017; Liao et al., 2018), NLG evaluation (Opitz and Frank, 2021; Manning and Schneider, 2021; Ribeiro et al., 2022) and measuring semantic sentence similarity (Opitz and Frank, 2022).
Metric evaluation for MT evaluation
Metric evaluation for machine translation (MT) has received much attention over recent years (Ma et al., 2019; Mathur et al., 2020; Freitag et al., 2021). When evaluating metrics for MT evaluation, it is generally agreed that the main goal of an MT metric is high correlation with human ratings, mainly with respect to rating the adequacy of a candidate against one gold reference (or a set of references). A recent shared task (Freitag et al., 2021) meta-evaluates popular metrics such as BLEU (Papineni et al., 2002) or BLEURT (Sellam et al., 2020) by comparing the metrics' scores to human scores for systems and individual segments. They find that the performance of each metric varies depending on the underlying domain (e.g., TED talks or news), that most metrics struggle to penalize translations with errors that reverse negation or sentiment polarity, and that metrics show lower correlations for semantic phenomena including subordination, named entities and terminology. This indicates potential for cross-pollination: clearly, AMR metric evaluation may profit from the vast experience gathered in metric evaluation for other tasks. On the other hand, MT evaluation may profit from relating semantic representations in order to better differentiate semantic errors with respect to their type and severity. A first step in this direction may have been made by Zeidler et al. (2022), who assess the behavior of MT metrics, AMR metrics, and hybrid metrics when analyzing sentence pairs that differ in only one linguistic phenomenon.
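To make this meta-evaluation setting concrete, the sketch below (our illustration, not code from the shared task) computes segment-level Pearson and Kendall correlations between a metric's scores and human adequacy ratings with scipy; all score values are hypothetical placeholders.

```python
# Minimal sketch of segment-level metric meta-evaluation: correlate a metric's
# scores with human adequacy ratings. The numbers are hypothetical placeholders,
# not data from any shared task.
from scipy.stats import pearsonr, kendalltau

human_ratings = [0.90, 0.35, 0.70, 0.20, 0.85]  # human adequacy judgments per segment
metric_scores = [0.88, 0.41, 0.65, 0.30, 0.80]  # automatic metric scores per segment

pearson_r, _ = pearsonr(human_ratings, metric_scores)
kendall_tau, _ = kendalltau(human_ratings, metric_scores)
print(f"Pearson r = {pearson_r:.3f}, Kendall tau = {kendall_tau:.3f}")
```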
3 Study Setup: Data creation and AMR metric overview
In this section, we first select data and two popular high-performance parsers for creating candidate
AMRs. Then we describe the human quality an-