Màrquez, 2005a) (see below). Task 19 consists
of recognizing words and phrases that evoke se-
mantic frames from FrameNet and their semantic
dependents, which are usually, but not always, their
syntactic dependents. The evaluation measured pre-
cision and recall for frames and frame elements,
with partial credit for incorrect but closely related
frames. Two types of evaluation were carried out.
The first is the label matching evaluation: the participant's labeled data were compared directly with the gold standard labels, using the same evaluation procedure as in the previous SRL tasks at SemEval. The second is the semantic dependency
evaluation, in which both the gold standard and
the submitted data were first converted to semantic
dependency graphs and compared.
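To make this concrete, the following Python sketch illustrates the idea of the semantic dependency evaluation under assumed data structures: a frame annotation is reduced to labeled edges from the frame-evoking target to the heads of its frame elements, plus an edge carrying the frame name, and the gold and submitted edge sets are compared. It is not the official Task 19 scorer, and it omits the partial credit given for closely related frames.

# Illustrative sketch of the semantic dependency evaluation: gold and
# submitted frame annotations are turned into labeled edges and compared.
# The edge encoding (a ROOT edge carrying the frame name, plus one edge
# per frame element) is an assumption made for this example.

def frame_to_edges(target_idx, frame, frame_elements):
    """Convert one frame annotation into a set of labeled edges."""
    edges = {("ROOT", target_idx, frame)}            # frame-evoking target
    for fe_head, fe_name in frame_elements.items():  # its semantic dependents
        edges.add((target_idx, fe_head, fe_name))
    return edges

def precision_recall(gold_edges, pred_edges):
    correct = len(gold_edges & pred_edges)
    precision = correct / len(pred_edges) if pred_edges else 0.0
    recall = correct / len(gold_edges) if gold_edges else 0.0
    return precision, recall

# Hypothetical example: the Goods frame element is attached to the wrong head.
gold = frame_to_edges(3, "Commerce_buy", {1: "Buyer", 5: "Goods"})
pred = frame_to_edges(3, "Commerce_buy", {1: "Buyer", 6: "Goods"})
print(precision_recall(gold, pred))  # (0.666..., 0.666...)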
SemEval-2012 (Kordjamshidi et al., 2012) and SemEval-2013 (Kolomiyets et al., 2013) introduced the 'Spatial Role Labeling' task, but this is somewhat different from the standard SRL task and will not be discussed in this paper. Since SemEval-2014 (Marelli et al., 2014), a deeper semantic representation of sentences in a single graph-based structure via semantic parsing has superseded the previous 'shallow' SRL tasks.
2.2 CoNLL
The CoNLL-2004 shared task (Carreras and Màrquez, 2004) was based on the PropBank corpus, comprising six sections of the Wall Street Journal part of the Penn Treebank (Kingsbury and Palmer, 2002) enriched with predicate–argument
structures. The task was to identify and label the
arguments of each marked verb. The precision,
recall and F1 of arguments were evaluated using
the srl-eval.pl program. For an argument to
be correctly recognized, the words spanning the
argument as well as its semantic role have to be
correct. The verb argument is the lexicalization of the predicate of the proposition. Most of the time, the verb argument coincides with the target verb of the proposition, which is provided as input, and only in a few cases does it span more words than the target verb. This makes the verb easy to identify and, since there is exactly one verb per proposition, evaluating its recognition overestimates the overall performance of a system. For this reason, the verb argument is excluded from evaluation. The shared task proceedings do not detail how non-continuous arguments are evaluated. In
CoNLL-2005
(Carreras and Màrquez,
2005b), a system had to recognize and label the arguments of each target verb. The evaluation method remained the same as in CoNLL-2004, using the same evaluation code.
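As a rough illustration of what this evaluation computes, the sketch below scores predicted arguments against the gold standard as exact (span, role) matches. It is a simplification for exposition, not the srl-eval.pl script itself, and the tuple-based input format is an assumption of the example.

def score_arguments(gold, predicted):
    """Precision, recall and F1 over labeled argument spans.

    gold and predicted are sets of (start, end, role) triples; an argument
    counts as correct only if both its word span and its semantic role
    match the gold standard exactly.
    """
    correct = len(gold & predicted)
    precision = correct / len(predicted) if predicted else 0.0
    recall = correct / len(gold) if gold else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1

# Hypothetical example: one argument gets the wrong label, one is missed.
gold = {(0, 1, "A0"), (3, 6, "A1"), (7, 7, "AM-TMP")}
pred = {(0, 1, "A0"), (3, 6, "A2")}
print(score_arguments(gold, pred))  # (0.5, 0.333..., 0.4)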
The CoNLL-2008 shared task (Surdeanu et al., 2008b) was dedicated to the joint parsing of syntac-
tic and semantic dependencies. The shared task was
divided into three subtasks: (i) parsing of syntactic
dependencies, (ii) identification and disambigua-
tion of semantic predicates, and (iii) identification
of arguments and assignment of semantic roles for
each predicate. SRL was performed and evaluated
using a dependency-based representation for both
syntactic and semantic dependencies.
The official evaluation measures consist of three
different scores: (i) syntactic dependencies are
scored using the labeled attachment score (LAS),
(ii) semantic dependencies are evaluated using a
labeled F1 score, and (iii) the overall task is scored
with a macro average of the two previous scores.
The semantic propositions are evaluated by converting them to semantic dependencies, i.e., a semantic dependency is created from every predicate to each of its individual arguments. These dependencies are labeled with the labels of the corresponding arguments. Additionally, a semantic dependency from each predicate to a virtual ROOT node is created; these dependencies are labeled with the predicate senses. This approach guarantees that
the semantic dependency structure conceptually
forms a single-rooted, connected (not necessarily
acyclic) graph. More importantly, this scoring strat-
egy implies that if a system assigns the incorrect
predicate sense, it still receives some points for the
arguments correctly assigned. Several additional
evaluation measures were applied to further ana-
lyze the performance of the participating systems.
The Exact Match reports the percentage of sen-
tences that are completely correct, i.e., all the gen-
erated syntactic dependencies are correct and all
the semantic propositions are present and correct.
The Perfect Proposition F1 scores entire semantic frames or propositions. A third measure is the ratio between the labeled F1 score for semantic dependencies and the LAS for syntactic dependencies.
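The following sketch illustrates the proposition-to-dependency conversion and the resulting scores under assumed data structures (a proposition given as a predicate position, a sense label, and a map from argument heads to role labels); it is an illustration of the description above, not the official CoNLL-2008 evaluation script.

def proposition_to_dependencies(predicate_idx, sense, arguments):
    """Convert one proposition into labeled semantic dependencies.

    arguments maps argument head positions to role labels; a virtual ROOT
    dependency labeled with the predicate sense is added, so a wrong sense
    costs one dependency while correctly assigned arguments still score.
    """
    deps = {("ROOT", predicate_idx, sense)}
    for arg_idx, role in arguments.items():
        deps.add((predicate_idx, arg_idx, role))
    return deps

def labeled_f1(gold_deps, pred_deps):
    correct = len(gold_deps & pred_deps)
    p = correct / len(pred_deps) if pred_deps else 0.0
    r = correct / len(gold_deps) if gold_deps else 0.0
    return 2 * p * r / (p + r) if p + r else 0.0

def macro_score(las, semantic_f1):
    # Overall CoNLL-2008 score: macro average of syntactic LAS and
    # labeled semantic F1.
    return (las + semantic_f1) / 2

# Hypothetical proposition: wrong sense, one missing argument.
gold = proposition_to_dependencies(2, "give.01", {0: "A0", 4: "A1", 5: "A2"})
pred = proposition_to_dependencies(2, "give.02", {0: "A0", 4: "A1"})
print(labeled_f1(gold, pred))      # ~0.571: the wrong sense costs only the ROOT edge
print(macro_score(0.88, labeled_f1(gold, pred)))  # with a hypothetical LAS of 0.88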
As in CoNLL-2008, the CoNLL-2009 shared
task (Hajič et al., 2009) combined syntactic de-
pendency parsing and the task of identifying and
labeling semantic arguments of verbs or nouns for
six more languages in addition to the original En-
glish from CoNLL-2008. Predicate disambiguation