PriMeSRL-Eval: A Practical Quality Metric for Semantic Role Labeling Systems Evaluation

Ishan Jindal^a, Alexandre Rademaker^a, Khoi-Nguyen Tran^a, Huaiyu Zhu^a, Hiroshi Kanayama^a, Marina Danilevsky^a, Yunyao Li^b
^a IBM Research, ^b Apple (work done while at IBM Research)
ishan.jindal@ibm.com, alexrad@br.ibm.com, kndtran@ibm.com, huaiyu@us.ibm.com, hkana@jp.ibm.com, mdanile@us.ibm.com, yunyaoli@apple.com
Abstract

Semantic role labeling (SRL) identifies the predicate-argument structure in a sentence. This task is usually accomplished in four steps: predicate identification, predicate sense disambiguation, argument identification, and argument classification. Errors introduced at one step propagate to later steps. Unfortunately, the existing SRL evaluation scripts do not consider the full effect of this error propagation: they either evaluate arguments independently of the predicate sense (CoNLL09) or do not evaluate the predicate sense at all (CoNLL05), yielding an inaccurate picture of SRL model performance on the argument classification task. In this paper we address key practical issues with existing evaluation scripts and propose a stricter SRL evaluation metric, PriMeSRL. We observe that, by employing PriMeSRL, the evaluated quality of all SoTA SRL models drops significantly, and their relative rankings also change. We also show that PriMeSRL successfully penalizes actual failures in SoTA SRL models.^1

^1 https://github.com/UniversalPropositions/PriMeSRL-Eval
1 Introduction

Semantic Role Labeling (SRL) extracts predicate-argument structures from a sentence, where predicates represent relations (verbs, adjectives or nouns) and arguments are the spans attached to the predicate, demonstrating “who did what to whom, when, where and how”. As one of the fundamental natural language processing (NLP) tasks, SRL has been shown to help a wide range of downstream NLP applications such as natural language inference (Zhang et al., 2020b; Liu et al., 2022), question answering (Maqsud et al., 2014; Yih et al., 2016; Zhang et al., 2020b; Dryjański et al., 2022), machine translation (Shi et al., 2016; Rapp, 2022), content moderation and verification (Calvo Figueras et al., 2022; Fharook et al., 2022), and information extraction (Niklaus et al., 2018; Zhang et al., 2020a). In all of these applications, the quality of the underlying SRL models has a significant impact on the downstream tasks. Despite this, there has been little study of how to properly evaluate the quality of practical SRL systems.
Given a sentence, a typical SRL system obtains the predicate-argument structure by following a series of four steps: 1) predicate identification; 2) predicate sense disambiguation; 3) argument identification; and 4) argument classification. The predicate senses and their argument labels are taken from inventories of frame definitions such as Proposition Bank (PropBank) (Palmer et al., 2005), FrameNet (Baker et al., 1998), and VerbNet (Schuler, 2005).
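Viewed as code, this pipeline is simply a chain of four functions over a tokenized sentence. The sketch below is purely illustrative: the function names, signatures, and the Proposition container are our own placeholders, not part of any particular SRL system.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class Proposition:
    """One predicate together with its disambiguated sense and labeled arguments."""
    predicate: int                                            # token index of the predicate
    sense: str                                                # e.g. "break.01", from the frame inventory
    arguments: Dict[str, str] = field(default_factory=dict)   # role label -> argument head or span


# Placeholder implementations: each step would normally be a trained model.
def identify_predicates(tokens: List[str]) -> List[int]:
    return []        # step 1: indices of predicate tokens

def disambiguate_sense(tokens: List[str], pred: int) -> str:
    return ""        # step 2: pick a sense such as "break.01"

def identify_arguments(tokens: List[str], pred: int) -> List[str]:
    return []        # step 3: argument spans (or heads) of the predicate

def classify_arguments(tokens: List[str], pred: int, sense: str,
                       args: List[str]) -> Dict[str, str]:
    return {}        # step 4: role label for each identified argument


def srl_pipeline(tokens: List[str]) -> List[Proposition]:
    """Run the four steps in sequence; an error in an early step propagates to the later ones."""
    propositions = []
    for pred in identify_predicates(tokens):
        sense = disambiguate_sense(tokens, pred)
        args = identify_arguments(tokens, pred)
        labeled = classify_arguments(tokens, pred, sense, args)
        propositions.append(Proposition(pred, sense, labeled))
    return propositions
```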
The correctness of SRL extraction is affected by the correctness of these steps. Consider the example in Figure 1, using PropBank^2 annotations:

    [Derick]_A0 [broke]_break.01 the [window]_A1 with a [hammer]_A2 to [escape]_AM-PRP.

Figure 1: An SRL example with head-based semantic roles on top of a Universal Dependencies annotation (dependency arcs not reproduced here).
The SRL system must:

1. Identify the verb ‘break’ as a predicate;
2. Disambiguate its particular sense as ‘break.01’,^3 which has four associated arguments: A0 (the breaker), A1 (the thing broken), A2 (the instrument), A3 (the number of pieces), and A4 (from what A1 is broken away);^4
3. Identify each argument as it occurs (‘Derick’, ‘the window’, etc.);
4. Classify the arguments (‘Derick’ : A0).

Finally, this example has one additional modifier: AM-PRP (the purpose). Figure 1 illustrates the same analysis on top of the Universal Dependencies annotation, where only the head tokens of the phrases are annotated with the corresponding argument labels. A minimal encoding of this gold analysis is sketched below.

^2 In this paper we discuss SRL based on PropBank frames.
^3 https://verbs.colorado.edu/propbank/framesets-english-aliases/break.html
^4 Note that in PropBank each verb sense has a specific set of underspecified roles, given by numbers: A0, A1, A2, and so on. This is because of the well-known difficulty of defining a universal set of thematic roles (Jurafsky and Martin, 2021).
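As a concrete illustration, the gold analysis of the example sentence can be written down as a small record keyed by role label. The encoding below is ours, not the format used by any particular corpus or evaluation script.

```python
# Gold analysis of "Derick broke the window with a hammer to escape."
# (head words only, as in the dependency-based annotation of Figure 1).
gold_proposition = {
    "predicate": "broke",
    "sense": "break.01",
    "arguments": {
        "A0": "Derick",       # the breaker
        "A1": "window",       # thing broken
        "A2": "hammer",       # the instrument
        "AM-PRP": "escape",   # purpose modifier
    },
}
```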
To obtain a completely correct predicate-argument structure, both the predicate sense and all of its associated arguments need to be correctly extracted. Mistakes introduced at one step may propagate to later steps, leading to further errors. For instance, in the above example a wrong predicate sense ‘break.02’ (break in or gain entry) has not only a different meaning from ‘break.01’ (break), but also a different set of arguments. In many cases, even if an argument for a wrong predicate sense is labeled with the same numerical role (A1, A2, etc.), its meaning can be very different. Therefore, in general, the labels for argument roles should be considered incorrect when the predicate sense itself is incorrect. However, existing SRL evaluation metrics (e.g., Hajič et al., 2009) do not penalize argument labels in such cases.
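The stricter policy argued for here can be stated in a few lines: an argument label only earns credit when the predicate sense it belongs to is itself correct. The function below is our sketch of that rule, using the dictionary encoding from the example above; it is not the exact logic of any released scorer.

```python
def argument_correct(gold, pred, role):
    """Credit a predicted argument only when the predicate sense also matches.

    `gold` and `pred` are propositions encoded as
        {"sense": "break.01", "arguments": {"A0": "Derick", ...}}.
    Dropping the sense check below recovers the more lenient behavior of the
    existing scripts, which credit arguments regardless of the predicted sense.
    """
    if pred["sense"] != gold["sense"]:
        return False  # a wrong sense invalidates all of its argument labels
    return role in pred["arguments"] and pred["arguments"][role] == gold["arguments"].get(role)
```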
The currently used evaluation metrics also do not evaluate discontinuous arguments accurately. Some arguments in the original PropBank corpora have discontinuous spans that all refer to the same argument. This can happen for a number of reasons, such as in verb-particle constructions. In a dependency-based analysis, these arguments end up being attached to distinct syntactic heads (Surdeanu et al., 2008a). Take as an example the sentence, “I know your answer will be that those people should be allowed to live where they please as long as they pay their full locational costs.” For the predicate “allow.01”, the A1 (action allowed) is the discontinuous span “those people” (A1) and “to live where they please as long as they pay their full locational costs” (C-A1). Existing evaluation metrics treat these as two independent labels.
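One way to score such an argument as a single unit is to fold every C-X continuation into its base label X before comparing gold and predicted structures. The helper below is a hedged sketch of that idea; the input layout (role label mapped to a list of spans) is an assumption of ours, not taken from any existing script.

```python
def merge_continuations(arguments):
    """Fold continuation labels such as "C-A1" into their base argument "A1".

    `arguments` maps a role label to a list of text spans, e.g.
        {"A1": ["those people"], "C-A1": ["to live where they please ..."]}
    The result maps each base label to the concatenated list of spans, so a
    discontinuous argument is compared as one unit rather than as two labels.
    """
    merged = {}
    for label, spans in arguments.items():
        base = label[2:] if label.startswith("C-") else label
        merged.setdefault(base, []).extend(spans)
    return merged
```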
A similar problem exists for the evaluation of reference arguments (R-X). For example, in the sentence “This is exactly a road that leads nowhere”, for the predicate “lead.01”, the A0 “road” is referenced by the R-A0 “that”. If A0 is not correctly identified, the reference R-A0 would be meaningless.
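The same conditioning idea applies to reference arguments: an R-X label should only earn credit when the argument X it refers to is itself correct. The check below is a minimal sketch under that assumption, with propositions encoded as plain role-to-head dictionaries.

```python
def reference_correct(gold_args, pred_args, ref_label):
    """Score a reference argument such as "R-A0" only if its referent is correct.

    `gold_args` and `pred_args` map role labels to argument heads, e.g.
        {"A0": "road", "R-A0": "that"}.
    """
    base = ref_label[2:]  # "R-A0" -> "A0"
    if pred_args.get(base) != gold_args.get(base):
        return False      # the reference is meaningless if the referent is wrong
    return pred_args.get(ref_label) == gold_args.get(ref_label)
```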
In this paper, we conduct a systematic analysis of the pros and cons of different evaluation metrics for SRL, including:

- proper evaluation of the predicate sense disambiguation task;
- argument label evaluation in conjunction with the predicate sense;
- proper evaluation of discontinuous arguments and reference arguments; and
- unified evaluation of argument heads and spans.

We then propose a new metric for evaluating SRL systems in a more accurate and intuitive manner in Section 3, and compare it with currently used methods in Section 4.
2 Existing Evaluation Metrics for SRL

Most of the existing evaluation metrics came from shared tasks aimed at developing systems capable of extracting predicates and arguments from natural language sentences. In this section, we summarize the approaches to SRL evaluation in the shared tasks from SemEval and CoNLL.
2.1 Senseval and SemEval

SemEval (Semantic Evaluation) is a series of evaluations of computational semantic analysis systems that evolved from the Senseval (word sense evaluation) series.

SENSEVAL-3 (Litkowski, 2004) addressed the task of automatic labeling of semantic roles and was designed to encourage research into, and use of, the FrameNet dataset. A system received as input a target word and its frame, and was required to identify and label the frame elements (arguments). The evaluation metric counted the number of arguments correctly identified (complete match of span) and labeled, but did not penalize spuriously identified arguments. An overlap score was generated as the average proportion of partial matches.
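Read literally, that overlap score averages, over the scored arguments, the proportion of each gold span that the system span covers. The sketch below is one plausible reading of the description, not the official SENSEVAL-3 scorer.

```python
def span_overlap(gold_span, sys_span):
    """Proportion of the gold span covered by the system span.

    Spans are (start, end) token offsets with `end` exclusive.
    """
    gold = set(range(*gold_span))
    sys = set(range(*sys_span))
    return len(gold & sys) / len(gold) if gold else 0.0

# Overlap score over a list of aligned (gold_span, sys_span) pairs:
# sum(span_overlap(g, s) for g, s in pairs) / len(pairs)
```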
SemEval-2007 contained three tasks that evaluated SRL. Tasks 17 and 18 identified arguments for given predicates using two different role label sets: PropBank and VerbNet (Pradhan et al., 2007). They used the srl-eval.pl script from the CoNLL-2005 scoring package (Carreras and Màrquez, 2005a) (see below). Task 19 consisted of recognizing words and phrases that evoke semantic frames from FrameNet, together with their semantic dependents, which are usually, but not always, their syntactic dependents. The evaluation measured precision and recall for frames and frame elements, with partial credit for incorrect but closely related frames. Two types of evaluation were carried out. The first was label matching evaluation, in which the participants' labeled data were compared directly with the gold-standard labels using the same evaluation procedure as the previous SRL tasks at SemEval. The second was semantic dependency evaluation, in which both the gold standard and the submitted data were first converted to semantic dependency graphs and then compared.
SemEval-2012 (Kordjamshidi et al., 2012) and SemEval-2013 (Kolomiyets et al., 2013) introduced the ‘Spatial Role Labeling’ task, but this is somewhat different from the standard SRL task and will not be discussed in this paper. Since SemEval-2014 (Marelli et al., 2014), a deeper semantic representation of sentences in a single graph-based structure via semantic parsing has superseded the previous ‘shallow’ SRL tasks.
2.2 CoNLL

The CoNLL-2004 shared task (Carreras and Màrquez, 2004) was based on the PropBank corpus, comprising six sections of the Wall Street Journal portion of the Penn Treebank (Kingsbury and Palmer, 2002) enriched with predicate-argument structures. The task was to identify and label the arguments of each marked verb. The precision, recall and F1 of arguments were evaluated using the srl-eval.pl program. For an argument to be correctly recognized, both the words spanning the argument and its semantic role have to be correct. The verb argument is the lexicalization of the predicate of the proposition. Most of the time, the verb corresponds to the target verb of the proposition, which is provided as input, and only in a few cases does the verb participant span more words than the target verb. This situation makes the verb easy to identify and, since there is one verb for each proposition, evaluating its recognition overestimates the overall performance of a system. For this reason, the verb argument is excluded from evaluation. The shared task proceedings do not detail how non-continuous arguments are evaluated. In CoNLL-2005 (Carreras and Màrquez, 2005b) a system had to recognize and label the arguments of each target verb. The evaluation method remained the same as in CoNLL-2004, using the same evaluation code.
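In the spirit of srl-eval.pl, this criterion can be restated as exact-match precision, recall and F1 over (predicate, role, span) tuples. The sketch below is a simplification of ours (it ignores the excluded verb argument and any special treatment of discontinuous spans), not the original Perl scorer.

```python
def span_prf(gold_args, sys_args):
    """Precision, recall and F1 over argument tuples (predicate_index, role, start, end).

    An argument counts as correct only when its full span and its role label
    exactly match a gold argument, as in the CoNLL-2004/2005 evaluation.
    """
    gold, sys = set(gold_args), set(sys_args)
    correct = len(gold & sys)
    precision = correct / len(sys) if sys else 0.0
    recall = correct / len(gold) if gold else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1
```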
The CoNLL-2008 shared task (Surdeanu et al., 2008b) was dedicated to the joint parsing of syntactic and semantic dependencies. The shared task was divided into three subtasks: (i) parsing of syntactic dependencies, (ii) identification and disambiguation of semantic predicates, and (iii) identification of arguments and assignment of semantic roles for each predicate. SRL was performed and evaluated using a dependency-based representation for both syntactic and semantic dependencies.
The official evaluation measures consist of three different scores: (i) syntactic dependencies are scored using the labeled attachment score (LAS), (ii) semantic dependencies are evaluated using a labeled F1 score, and (iii) the overall task is scored with a macro average of the two previous scores. The semantic propositions are evaluated by converting them to semantic dependencies, i.e., a semantic dependency is created from every predicate to each of its individual arguments. These dependencies are labeled with the labels of the corresponding arguments. Additionally, a semantic dependency from each predicate to a virtual ROOT node is created and labeled with the predicate sense. This approach guarantees that the semantic dependency structure conceptually forms a single-rooted, connected (not necessarily acyclic) graph. More importantly, this scoring strategy implies that if a system assigns the incorrect predicate sense, it still receives some points for the correctly assigned arguments. Several additional evaluation measures were applied to further analyze the performance of the participating systems. The Exact Match reports the percentage of sentences that are completely correct, i.e., all the generated syntactic dependencies are correct and all the semantic propositions are present and correct. The Perfect Proposition F1 scores entire semantic frames or propositions. A further measure reports the ratio between the labeled F1 score for semantic dependencies and the LAS for syntactic dependencies.
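This conversion can be pictured as turning each proposition into a bag of labeled head-to-head edges plus one ROOT edge that carries the sense, and then computing labeled F1 over those edges. The sketch below is our reconstruction of that scheme, not the official CoNLL scorer; note how the sense and the arguments become independent edges, which is exactly why a wrong sense still leaves the argument edges eligible for credit.

```python
def to_semantic_dependencies(propositions):
    """Convert propositions to labeled semantic dependencies, CoNLL-2008/2009 style.

    Each proposition is assumed to look like
        {"predicate": 2, "sense": "break.01", "arguments": {"A0": 1, "A1": 4}}
    with token indices for the predicate and the argument heads. Every proposition
    contributes one (ROOT -> predicate) edge labeled with the sense and one
    (predicate -> head) edge per argument labeled with its role.
    """
    edges = set()
    for prop in propositions:
        edges.add(("ROOT", prop["predicate"], prop["sense"]))
        for role, head in prop["arguments"].items():
            edges.add((prop["predicate"], head, role))
    return edges
```

Labeled precision, recall and F1 are then computed over the gold and predicted edge sets, in the same way as the span-based scores above.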
As in CoNLL-2008, the CoNLL-2009 shared task (Hajič et al., 2009) combined syntactic dependency parsing with the task of identifying and labeling semantic arguments of verbs or nouns, for six more languages in addition to the original English from CoNLL-2008. Predicate disambiguation