
numerous NLP applications. For instance, event schema induction (Dror et al., 2023) relies on event-centric information extraction to derive graphical representations of events from text. In this context, understanding essentiality can enhance the quality of induced schemas by eliminating hallucinations and suggesting the addition of missing crucial events. Moreover, grasping essentiality can potentially benefit intelligent systems for QA tasks (Bisk et al., 2020) and task-oriented dialogue processing (Madotto et al., 2020).
In this paper, we aim to assess the depth of understanding that current NLU models possess regarding events in comparison to human cognition. To accomplish this, we introduce a new cognitively inspired problem of detecting essential step events in goal event processes and establish a novel benchmark, Essential Step Detection (ESD), to promote research in this area. Specifically, we gather goals and their corresponding steps from WikiHow² and manually annotate the essentiality of various steps in relation to the goal. Our experimental findings reveal that although humans consistently perceive event essentiality, current models still have a long way to go to match this level of understanding.
2 Task and Data
The essential step detection task is defined as follows: for each goal G and one of its sub-steps S, the objective is to predict whether the failure of S will result in the failure of G. In our formulation, G and S are presented as natural language sentences.
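As an illustration, a goal-step pair can be cast as a yes/no query to a model. The following minimal sketch shows one possible framing; the prompt wording is our own illustration, not a template taken from the evaluation itself.

```python
def make_prompt(goal: str, step: str) -> str:
    # Illustrative wording only; the actual evaluation prompt may differ.
    return (
        f"Goal: {goal}\n"
        f"Step: {step}\n"
        "Question: if this step fails, will the goal fail? Answer yes or no."
    )

print(make_prompt("Make a cup of tea", "Boil water"))
```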
The construction of ESD includes two steps: (1)
Data Preparation and (2) Essentiality Annotation.
Details of these steps are provided below.
2.1 Data Preparation
WikiHow is a widely-used and well-structured
resource for exploring the relationship between
goal-oriented processes and their corresponding
steps (Koupaee and Wang, 2018; Zhang et al., 2020b). To the best of our knowledge, it is the most appropriate resource for the purpose of our research. Consequently, we begin by collecting 1,000 random goal-oriented processes from WikiHow. To avoid oversimplified and overly complex processes,
we only retain those with three to ten steps.

                      Essential  Non-essential  Total
Number of instances       1,118            397  1,515
Average step length        17.1           17.4   17.2

Table 1: Dataset statistics of ESD. The average step length represents the mean number of tokens per step.

² WikiHow is a community website featuring extensive collections of step-by-step guidelines.

Furthermore, given that all WikiHow processes and their associated steps are carefully crafted by humans, the majority of the steps mentioned are essential.
To achieve balance in the dataset, we enlist crowdsourcing workers to contribute optional steps (i.e., those that could occur as part of the process but are not essential)³. We employ three annotators from Amazon Mechanical Turk⁴, who are native English speakers, to provide optional steps for each goal. To ensure high-quality annotations, we require annotators to hold the "Master annotator" title. The average cost and time for supplying annotations are 0.1 USD and 32 seconds per instance (approximately 12 USD per hour).
2.2 Essentiality Annotation
Given that our task necessitates a profound understanding of the events and careful consideration, we ensure annotation quality by employing three well-trained research assistants from our department rather than ordinary annotators to conduct the essentiality annotations. For each goal-step pair, annotators are asked to rate it as 0 (non-essential), 1 (essential), or -1 (the step is not a valid step for the target goal, or the goal/step contains confidential or hostile information)⁵. Since all annotators are well-trained and fully comprehend our task, we discard any pair that is deemed invalid (i.e., -1) by at least one annotator. This results in 1,515 pairs being retained. We determine the final label based on majority voting. The dataset statistics can be found in Table 1. Altogether, we compile 1,118 essential and 397 non-essential "goal-step" pairs. The inter-annotator agreement, measured by Fleiss's Kappa⁶, is 0.611, signifying the high quality of ESD.
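The aggregation procedure described above can be sketched in a few lines. This is an illustrative re-implementation, not the authors' code (the agreement score itself was computed with the Shamya/FleissKappa tool): `aggregate` drops any pair rated -1 by at least one annotator and majority-votes the rest, and `fleiss_kappa` implements the standard formula over an item-by-category count table.

```python
from collections import Counter

def aggregate(ratings):
    """Per-pair annotator labels (e.g., [1, 1, 0]) -> final dataset labels."""
    labels = []
    for r in ratings:
        if -1 in r:
            continue  # invalid for at least one annotator: discard the pair
        labels.append(Counter(r).most_common(1)[0][0])
    return labels

def fleiss_kappa(table):
    """Fleiss' kappa; table[i][j] = number of raters giving item i category j."""
    n_items = len(table)
    n_raters = sum(table[0])
    # mean observed agreement across items
    p_bar = sum(
        (sum(c * c for c in row) - n_raters) / (n_raters * (n_raters - 1))
        for row in table
    ) / n_items
    # chance agreement from marginal category proportions
    totals = [sum(row[j] for row in table) for j in range(len(table[0]))]
    p_e = sum((t / (n_items * n_raters)) ** 2 for t in totals)
    return (p_bar - p_e) / (1 - p_e)
```

With three annotators and binary labels, majority voting never ties, so `Counter.most_common` is safe here.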
3 Experiments
Recently, large-scale pre-trained language models
have exhibited impressive language understanding
capabilities. To assess the extent to which these
models truly understand events, we evaluate them
³ The survey template is shown in Appendix Figure 2.
⁴ https://www.mturk.com/
⁵ The survey template is shown in Appendix Figure 3.
⁶ We utilize tools from https://github.com/Shamya/FleissKappa.