PriMeSRL-Eval: A Practical Quality Metric for Semantic Role Labeling Systems Evaluation

Ishan Jindal^a, Alexandre Rademaker^a, Khoi-Nguyen Tran^a, Huaiyu Zhu^a, Hiroshi Kanayama^a, Marina Danilevsky^a, Yunyao Li^b
^a IBM Research, ^b Apple (work done while at IBM Research)
ishan.jindal@ibm.com, alexrad@br.ibm.com, kndtran@ibm.com, huaiyu@us.ibm.com, hkana@jp.ibm.com, mdanile@us.ibm.com, yunyaoli@apple.com
Abstract

Semantic role labeling (SRL) identifies the predicate-argument structure in a sentence. This task is usually accomplished in four steps: predicate identification, predicate sense disambiguation, argument identification, and argument classification. Errors introduced at one step propagate to later steps. Unfortunately, the existing SRL evaluation scripts do not consider the full effect of this error propagation: they either evaluate arguments independently of the predicate sense (CoNLL09) or do not evaluate the predicate sense at all (CoNLL05), yielding an inaccurate picture of SRL model performance on the argument classification task. In this paper we address key practical issues with existing evaluation scripts and propose a stricter SRL evaluation metric, PriMeSRL. We observe that, by employing PriMeSRL, the evaluated quality of all SoTA SRL models drops significantly, and their relative rankings also change. We also show that PriMeSRL successfully penalizes actual failures in SoTA SRL models.^1

^1 https://github.com/UniversalPropositions/PriMeSRL-Eval
1 Introduction

Semantic Role Labeling (SRL) extracts predicate-argument structures from a sentence, where predicates represent relations (verbs, adjectives or nouns) and arguments are the spans attached to the predicate, demonstrating “who did what to whom, when, where and how”. As one of the fundamental natural language processing (NLP) tasks, SRL has been shown to help a wide range of downstream NLP applications such as natural language inference (Zhang et al., 2020b; Liu et al., 2022), question answering (Maqsud et al., 2014; Yih et al., 2016; Zhang et al., 2020b; Dryjański et al., 2022), machine translation (Shi et al., 2016; Rapp, 2022), content moderation and verification (Calvo Figueras et al., 2022; Fharook et al., 2022), and information extraction (Niklaus et al., 2018; Zhang et al., 2020a). In all of these applications, the quality of the underlying SRL models has a significant impact on the downstream tasks. Despite this, there has been little study of how to properly evaluate the quality of practical SRL systems.
Given a sentence, a typical SRL system obtains the predicate-argument structure by following a series of four steps: 1) predicate identification; 2) predicate sense disambiguation; 3) argument identification; and 4) argument classification. The predicate senses and their argument labels are taken from inventories of frame definitions such as Proposition Bank (PropBank) (Palmer et al., 2005), FrameNet (Baker et al., 1998), and VerbNet (Schuler, 2005).
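Viewed as code, this pipeline is simply a chain of four functions over a tokenized sentence. The sketch below is purely illustrative: the function names, signatures, and the Proposition container are our own placeholders, not part of any particular SRL system.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class Proposition:
    """One predicate together with its disambiguated sense and labeled arguments."""
    predicate: int                                            # token index of the predicate
    sense: str                                                # e.g. "break.01", from the frame inventory
    arguments: Dict[str, str] = field(default_factory=dict)   # role label -> argument head or span


# Placeholder implementations: each step would normally be a trained model.
def identify_predicates(tokens: List[str]) -> List[int]:
    return []        # step 1: indices of predicate tokens

def disambiguate_sense(tokens: List[str], pred: int) -> str:
    return ""        # step 2: pick a sense such as "break.01"

def identify_arguments(tokens: List[str], pred: int) -> List[str]:
    return []        # step 3: argument spans (or heads) of the predicate

def classify_arguments(tokens: List[str], pred: int, sense: str,
                       args: List[str]) -> Dict[str, str]:
    return {}        # step 4: role label for each identified argument


def srl_pipeline(tokens: List[str]) -> List[Proposition]:
    """Run the four steps in sequence; an error in an early step propagates to the later ones."""
    propositions = []
    for pred in identify_predicates(tokens):
        sense = disambiguate_sense(tokens, pred)
        args = identify_arguments(tokens, pred)
        labeled = classify_arguments(tokens, pred, sense, args)
        propositions.append(Proposition(pred, sense, labeled))
    return propositions
```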
The correctness of SRL extraction is affected by the correctness of these steps. Consider the example in Figure 1, using PropBank^2 annotations:

    [Derick]_A0 [broke]_break.01 the [window]_A1 with a [hammer]_A2 to [escape]_AM-PRP.

Figure 1: An SRL example with head-based semantic roles on top of a Universal Dependencies annotation (dependency arcs not reproduced here).
The SRL system must:

1. Identify the verb ‘break’ as a predicate;
2. Disambiguate its particular sense as ‘break.01’,^3 which has four associated arguments: A0 (the breaker), A1 (the thing broken), A2 (the instrument), A3 (the number of pieces), and A4 (from what A1 is broken away);^4
3. Identify each argument as it occurs (‘Derick’, ‘the window’, etc.);
4. Classify the arguments (‘Derick’ : A0).

Finally, this example has one additional modifier: AM-PRP (the purpose). Figure 1 illustrates the same analysis on top of the Universal Dependencies annotation, where only the head tokens of the phrases are annotated with the corresponding argument labels. A minimal encoding of this gold analysis is sketched below.

^2 In this paper we discuss SRL based on PropBank frames.
^3 https://verbs.colorado.edu/propbank/framesets-english-aliases/break.html
^4 Note that in PropBank each verb sense has a specific set of underspecified roles, given by numbers: A0, A1, A2, and so on. This is because of the well-known difficulty of defining a universal set of thematic roles (Jurafsky and Martin, 2021).
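As a concrete illustration, the gold analysis of the example sentence can be written down as a small record keyed by role label. The encoding below is ours, not the format used by any particular corpus or evaluation script.

```python
# Gold analysis of "Derick broke the window with a hammer to escape."
# (head words only, as in the dependency-based annotation of Figure 1).
gold_proposition = {
    "predicate": "broke",
    "sense": "break.01",
    "arguments": {
        "A0": "Derick",       # the breaker
        "A1": "window",       # thing broken
        "A2": "hammer",       # the instrument
        "AM-PRP": "escape",   # purpose modifier
    },
}
```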
To obtain a completely correct predicate-argument structure, both the predicate sense and all of its associated arguments need to be correctly extracted. Mistakes introduced at one step may propagate to later steps, leading to further errors. For instance, in the above example a wrong predicate sense ‘break.02’ (break in or gain entry) has not only a different meaning from ‘break.01’ (break), but also a different set of arguments. In many cases, even if an argument for a wrong predicate sense is labeled with the same numerical role (A1, A2, etc.), its meaning can be very different. Therefore, in general, the labels for argument roles should be considered incorrect when the predicate sense itself is incorrect. However, existing SRL evaluation metrics (e.g., Hajič et al., 2009) do not penalize argument labels in such cases.
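The stricter policy argued for here can be stated in a few lines: an argument label only earns credit when the predicate sense it belongs to is itself correct. The function below is our sketch of that rule, using the dictionary encoding from the example above; it is not the exact logic of any released scorer.

```python
def argument_correct(gold, pred, role):
    """Credit a predicted argument only when the predicate sense also matches.

    `gold` and `pred` are propositions encoded as
        {"sense": "break.01", "arguments": {"A0": "Derick", ...}}.
    Dropping the sense check below recovers the more lenient behavior of the
    existing scripts, which credit arguments regardless of the predicted sense.
    """
    if pred["sense"] != gold["sense"]:
        return False  # a wrong sense invalidates all of its argument labels
    return role in pred["arguments"] and pred["arguments"][role] == gold["arguments"].get(role)
```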
The currently used evaluation metrics also do not evaluate discontinuous arguments accurately. Some arguments in the original PropBank corpora have discontinuous spans that all refer to the same argument. This can happen for a number of reasons, such as in verb-particle constructions. In a dependency-based analysis, these arguments end up being attached to distinct syntactic heads (Surdeanu et al., 2008a). Take as an example the sentence, “I know your answer will be that those people should be allowed to live where they please as long as they pay their full locational costs.” For the predicate “allow.01”, the A1 (action allowed) is the discontinuous span “those people” (A1) and “to live where they please as long as they pay their full locational costs” (C-A1). Existing evaluation metrics treat these as two independent labels.
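One way to score such an argument as a single unit is to fold every C-X continuation into its base label X before comparing gold and predicted structures. The helper below is a hedged sketch of that idea; the input layout (role label mapped to a list of spans) is an assumption of ours, not taken from any existing script.

```python
def merge_continuations(arguments):
    """Fold continuation labels such as "C-A1" into their base argument "A1".

    `arguments` maps a role label to a list of text spans, e.g.
        {"A1": ["those people"], "C-A1": ["to live where they please ..."]}
    The result maps each base label to the concatenated list of spans, so a
    discontinuous argument is compared as one unit rather than as two labels.
    """
    merged = {}
    for label, spans in arguments.items():
        base = label[2:] if label.startswith("C-") else label
        merged.setdefault(base, []).extend(spans)
    return merged
```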
A similar problem exists for the evaluation of reference arguments (R-X). For example, in the sentence “This is exactly a road that leads nowhere”, for the predicate “lead.01”, the A0 “road” is referenced by the R-A0 “that”. If A0 is not correctly identified, the reference R-A0 would be meaningless.
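The same conditioning idea applies to reference arguments: an R-X label should only earn credit when the argument X it refers to is itself correct. The check below is a minimal sketch under that assumption, with propositions encoded as plain role-to-head dictionaries.

```python
def reference_correct(gold_args, pred_args, ref_label):
    """Score a reference argument such as "R-A0" only if its referent is correct.

    `gold_args` and `pred_args` map role labels to argument heads, e.g.
        {"A0": "road", "R-A0": "that"}.
    """
    base = ref_label[2:]  # "R-A0" -> "A0"
    if pred_args.get(base) != gold_args.get(base):
        return False      # the reference is meaningless if the referent is wrong
    return pred_args.get(ref_label) == gold_args.get(ref_label)
```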
In this paper, we conduct a systematic analysis of the pros and cons of different evaluation metrics for SRL, including:

- proper evaluation of the predicate sense disambiguation task;
- argument label evaluation in conjunction with the predicate sense;
- proper evaluation of discontinuous arguments and reference arguments; and
- unified evaluation of argument heads and spans.

We then propose a new metric for evaluating SRL systems in a more accurate and intuitive manner in Section 3, and compare it with currently used methods in Section 4.
2 Existing Evaluation Metrics for SRL

Most of the existing evaluation metrics came from shared tasks aimed at developing systems capable of extracting predicates and arguments from natural language sentences. In this section, we summarize the approaches to SRL evaluation in the shared tasks from SemEval and CoNLL.
2.1 Senseval and SemEval

SemEval (Semantic Evaluation) is a series of evaluations of computational semantic analysis systems that evolved from the Senseval (word sense evaluation) series.

SENSEVAL-3 (Litkowski, 2004) addressed the task of automatic labeling of semantic roles and was designed to encourage research into, and use of, the FrameNet dataset. A system received as input a target word and its frame, and was required to identify and label the frame elements (arguments). The evaluation metric counted the number of arguments correctly identified (complete match of span) and labeled, but did not penalize spuriously identified arguments. An overlap score was generated as the average proportion of partial matches.
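Read literally, that overlap score averages, over the scored arguments, the proportion of each gold span that the system span covers. The sketch below is one plausible reading of the description, not the official SENSEVAL-3 scorer.

```python
def span_overlap(gold_span, sys_span):
    """Proportion of the gold span covered by the system span.

    Spans are (start, end) token offsets with `end` exclusive.
    """
    gold = set(range(*gold_span))
    sys = set(range(*sys_span))
    return len(gold & sys) / len(gold) if gold else 0.0

# Overlap score over a list of aligned (gold_span, sys_span) pairs:
# sum(span_overlap(g, s) for g, s in pairs) / len(pairs)
```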
SemEval-2007 contained three tasks that evaluated SRL. Tasks 17 and 18 identified arguments for given predicates using two different role label sets: PropBank and VerbNet (Pradhan et al., 2007). They used the srl-eval.pl script from the CoNLL-2005 scoring package (Carreras and Màrquez, 2005a) (see below). Task 19 consisted of recognizing words and phrases that evoke semantic frames from FrameNet, together with their semantic dependents, which are usually, but not always, their syntactic dependents. The evaluation measured precision and recall for frames and frame elements, with partial credit for incorrect but closely related frames. Two types of evaluation were carried out. The first was label matching evaluation, in which the participants' labeled data were compared directly with the gold-standard labels using the same evaluation procedure as the previous SRL tasks at SemEval. The second was semantic dependency evaluation, in which both the gold standard and the submitted data were first converted to semantic dependency graphs and then compared.
SemEval-2012 (Kordjamshidi et al., 2012) and SemEval-2013 (Kolomiyets et al., 2013) introduced the ‘Spatial Role Labeling’ task, but this is somewhat different from the standard SRL task and will not be discussed in this paper. Since SemEval-2014 (Marelli et al., 2014), a deeper semantic representation of sentences in a single graph-based structure via semantic parsing has superseded the previous ‘shallow’ SRL tasks.
2.2 CoNLL

The CoNLL-2004 shared task (Carreras and Màrquez, 2004) was based on the PropBank corpus, comprising six sections of the Wall Street Journal portion of the Penn Treebank (Kingsbury and Palmer, 2002) enriched with predicate-argument structures. The task was to identify and label the arguments of each marked verb. The precision, recall and F1 of arguments were evaluated using the srl-eval.pl program. For an argument to be correctly recognized, both the words spanning the argument and its semantic role have to be correct. The verb argument is the lexicalization of the predicate of the proposition. Most of the time, the verb corresponds to the target verb of the proposition, which is provided as input, and only in a few cases does the verb participant span more words than the target verb. This situation makes the verb easy to identify and, since there is one verb for each proposition, evaluating its recognition overestimates the overall performance of a system. For this reason, the verb argument is excluded from evaluation. The shared task proceedings do not detail how non-continuous arguments are evaluated. In CoNLL-2005 (Carreras and Màrquez, 2005b) a system had to recognize and label the arguments of each target verb. The evaluation method remained the same as in CoNLL-2004, using the same evaluation code.
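In the spirit of srl-eval.pl, this criterion can be restated as exact-match precision, recall and F1 over (predicate, role, span) tuples. The sketch below is a simplification of ours (it ignores the excluded verb argument and any special treatment of discontinuous spans), not the original Perl scorer.

```python
def span_prf(gold_args, sys_args):
    """Precision, recall and F1 over argument tuples (predicate_index, role, start, end).

    An argument counts as correct only when its full span and its role label
    exactly match a gold argument, as in the CoNLL-2004/2005 evaluation.
    """
    gold, sys = set(gold_args), set(sys_args)
    correct = len(gold & sys)
    precision = correct / len(sys) if sys else 0.0
    recall = correct / len(gold) if gold else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1
```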
The CoNLL-2008 shared task (Surdeanu et al., 2008b) was dedicated to the joint parsing of syntactic and semantic dependencies. The shared task was divided into three subtasks: (i) parsing of syntactic dependencies, (ii) identification and disambiguation of semantic predicates, and (iii) identification of arguments and assignment of semantic roles for each predicate. SRL was performed and evaluated using a dependency-based representation for both syntactic and semantic dependencies.
The official evaluation measures consist of three different scores: (i) syntactic dependencies are scored using the labeled attachment score (LAS), (ii) semantic dependencies are evaluated using a labeled F1 score, and (iii) the overall task is scored with a macro average of the two previous scores. The semantic propositions are evaluated by converting them to semantic dependencies, i.e., a semantic dependency is created from every predicate to each of its individual arguments. These dependencies are labeled with the labels of the corresponding arguments. Additionally, a semantic dependency from each predicate to a virtual ROOT node is created and labeled with the predicate sense. This approach guarantees that the semantic dependency structure conceptually forms a single-rooted, connected (not necessarily acyclic) graph. More importantly, this scoring strategy implies that if a system assigns the incorrect predicate sense, it still receives some points for the correctly assigned arguments. Several additional evaluation measures were applied to further analyze the performance of the participating systems. The Exact Match reports the percentage of sentences that are completely correct, i.e., all the generated syntactic dependencies are correct and all the semantic propositions are present and correct. The Perfect Proposition F1 scores entire semantic frames or propositions. A further measure reports the ratio between the labeled F1 score for semantic dependencies and the LAS for syntactic dependencies.
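This conversion can be pictured as turning each proposition into a bag of labeled head-to-head edges plus one ROOT edge that carries the sense, and then computing labeled F1 over those edges. The sketch below is our reconstruction of that scheme, not the official CoNLL scorer; note how the sense and the arguments become independent edges, which is exactly why a wrong sense still leaves the argument edges eligible for credit.

```python
def to_semantic_dependencies(propositions):
    """Convert propositions to labeled semantic dependencies, CoNLL-2008/2009 style.

    Each proposition is assumed to look like
        {"predicate": 2, "sense": "break.01", "arguments": {"A0": 1, "A1": 4}}
    with token indices for the predicate and the argument heads. Every proposition
    contributes one (ROOT -> predicate) edge labeled with the sense and one
    (predicate -> head) edge per argument labeled with its role.
    """
    edges = set()
    for prop in propositions:
        edges.add(("ROOT", prop["predicate"], prop["sense"]))
        for role, head in prop["arguments"].items():
            edges.add((prop["predicate"], head, role))
    return edges
```

Labeled precision, recall and F1 are then computed over the gold and predicted edge sets, in the same way as the span-based scores above.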
As in CoNLL-2008, the CoNLL-2009 shared task (Hajič et al., 2009) combined syntactic dependency parsing with the task of identifying and labeling semantic arguments of verbs or nouns, for six more languages in addition to the original English from CoNLL-2008. Predicate disambiguation