
previous datasets cover the semantic relations be-
tween words or concepts. These relations include
the concept hierarchies, such as those covered by
WordNet or ConceptNet, and word collocations and
associations. For example, the early work Winograd
(Levesque et al., 2012) evaluates a model's
ability to capture word collocations, associations
between objects, and their attributes as a pronoun
resolution task. Talmor et al. (2019) present
one of the first datasets to cover ConceptNet
relational tuple validation as a question-answering
task. Each question asks about a relation of a source
object, and the model selects the target object that
satisfies the relation from four candidates.
Mullenbach et al. (2019) focus on the collocations
between adjectives and objects. Their task takes the
form of textual inference, where a premise describes
an object and the corresponding hypothesis consists of
the object modified by an adjective. Jiang et al.
(2020) study associations among multiple words,
i.e., whether a word can be associated with two or
more given words (the work does not formally define
the types of associations). They propose a new task
format in games where the player produces as many
words as possible by combining existing words.
Causes/Effects between Events or States. Previous
work proposes datasets that require causal
knowledge between events and states (Sap et al.,
2019a; Bhagavatula et al., 2019; Huang et al.,
2019). Sap et al. (2019a) take a text generation or
inference form between a cause and an effect.
Bhagavatula et al. (2019) take a form similar to ours:
a sequence of two observations is given, and the
model selects the plausible hypothesis from multiple
candidates. Their data construction idea could also
be applied to other types of knowledge; however,
their dataset focuses only on causal relations
between events. Huang et al. (2019) use
multi-choice QA on a background paragraph, which
covers a wider range of causal knowledge for both
events and statements.
Other Commonsense Datasets. Zhou et al. (2019)
propose a unique temporal commonsense
dataset. The task is to predict a follow-up event's
duration or frequency, given a short paragraph
describing an event. Bisk et al. (2020) focus on
physical interactions between actions and objects,
namely whether an action over an object leads to a
target effect in the physical world. These datasets
can be solved mostly by applying the correct
commonsense facts; thus, they do not require reasoning.
Sap et al. (2019b) propose a task of inferring
people's emotions and behaviors in a given
situation. Compared to the others, this task contains
a larger portion of instances that require reasoning
beyond fact validation. All of the above tasks take
the multi-choice question-answering form.
Next-Sentence Prediction. Next-sentence prediction
tasks, such as SWAG (Zellers et al., 2018),
are also related to our work. These tasks naturally
cover various types of commonsense knowledge
and sometimes require reasoning. The issue is that
the way they guarantee that distractor candidates
are irrelevant greatly simplifies the task. In
comparison, our task uses the IF game engine to ensure
that actions uniquely determine the candidates, and
ours has human-written texts.
Finally, our idea is closely related to Yao et al.
(2020), which creates a task of predicting valid
actions for each IF game state. Yao et al. (2020,
2021) also discuss the advantages of supervised
tasks derived from IF games for natural language
understanding.
3 Dataset Construction: Commonsense
Challenges from IF Games
We pick games supported by the Jericho environment
(Hausknecht et al., 2019) to construct the
JECC dataset.⁴ We pick games in the Zork Universe
for the ZUCC dataset.⁵ We first introduce
the necessary definitions in the IF game domain
and then describe how we construct our ZUCC
and JECC datasets as forward prediction tasks
based on human players' gameplay records,
followed by a summary of the improved properties of
our dataset compared to previous ones. The dataset
will be released for public use. It can be created
with our released code under the MIT License.
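The game-level splits given in footnotes 4 and 5 can be written down as a small configuration. The following is only a minimal sketch: the names JECC_SPLITS, ZUCC_SOURCES, and check_disjoint are hypothetical and are not taken from the released code, and the lowercase game identifiers for ZUCC are an assumption made for consistency with the JECC list.

```python
# Game-level splits transcribed from footnotes 4 and 5.
# All names below are illustrative assumptions, not the released code's API.

JECC_SPLITS = {
    "train": [
        "905", "acorncourt", "advent", "adventureland", "afflicted",
        "awaken", "balances", "deephome", "dragon", "enchanter",
        "inhumane", "library", "moonlit", "omniquest", "pentari",
        "reverb", "snacktime", "sorcerer", "zork1",
    ],
    "dev": ["zork3", "detective", "ztuu", "jewel", "zork2"],
    "test": ["temple", "gold", "karn", "zenon", "wishbringer"],
}

# For ZUCC, dev and test are both drawn from Zork3 as non-overlapping
# splits, so only the source games are listed here, not the instances.
ZUCC_SOURCES = {
    "train": ["zork1", "enchanter", "sorcerer"],
    "dev": ["zork3"],
    "test": ["zork3"],
}


def check_disjoint(splits):
    """Return True if no game appears in more than one split."""
    seen = set()
    for games in splits.values():
        for game in games:
            if game in seen:
                return False
            seen.add(game)
    return True
```

Under this reading, the JECC splits are disjoint at the game level, whereas ZUCC shares Zork3 between its dev and test sets and separates them only at the instance level.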
3.1 Interactive Fiction Game Background
Each IF game can be defined as a Partially Observ-
able Markov Decision Process (POMDP), namely a
7-tuple $\langle S, A, T, O, \Omega, R, \gamma \rangle$, representing the
hidden game state set, the action set, the state tran-
sition function, the set of textual observations com-
⁴ We collect the games 905, acorncourt, advent,
adventureland, afflicted, awaken, balances, deephome, dragon,
enchanter, inhumane, library, moonlit, omniquest, pentari,
reverb, snacktime, sorcerer, and zork1 for training; zork3,
detective, ztuu, jewel, and zork2 as the development set; and
temple, gold, karn, zenon, and wishbringer as the test set.
⁵ We pick Zork1, Enchanter, and Sorcerer as the training
set; the dev and test sets are non-overlapping splits of Zork3.