JECC: Commonsense Reasoning Tasks Derived from Interactive Fictions

Mo Yu*1  Yi Gu*2  Xiaoxiao Guo3  Yufei Feng4  Xiaodan Zhu4  Michael Greenspan4  Murray Campbell5  Chuang Gan5

1WeChat AI  2UC San Diego  3LinkedIn  4Queen's University  5IBM Research
moyumyu@tencent.com  yig025@ucsd.edu

* Equal contribution. Work done when MY and XG were working at IBM.
Abstract

Commonsense reasoning simulates the human ability to make presumptions about our physical world, and it is an essential cornerstone in building general AI systems. We propose a new commonsense reasoning dataset based on human Interactive Fiction (IF) gameplay walkthroughs, as human players demonstrate plentiful and diverse commonsense reasoning. The new dataset provides a natural mixture of various reasoning types and requires multi-hop reasoning. Moreover, the IF game-based construction procedure requires much less human intervention than previous ones. Unlike existing benchmarks, our dataset focuses on the assessment of functional commonsense knowledge rules rather than factual knowledge. Hence, to achieve higher performance on our tasks, models need to effectively utilize such functional knowledge to infer the outcomes of actions, rather than relying solely on memorizing facts. Experiments show that the introduced dataset is challenging for previous machine reading models as well as new large language models, with a significant 20% performance gap compared to human experts.¹

¹ Our code and data are released at https://github.com/Gorov/zucc.
1 Introduction
There has been a flurry of datasets and benchmarks proposed to address natural language-based commonsense reasoning (Levesque et al., 2012; Zhou et al., 2019; Talmor et al., 2019; Mullenbach et al., 2019; Jiang et al., 2020; Sap et al., 2019a; Bhagavatula et al., 2019; Huang et al., 2019; Bisk et al., 2020; Sap et al., 2019b; Zellers et al., 2018). These benchmarks usually adopt a multiple-choice form: given an input query and an optional short paragraph of background description, each candidate forms a statement, and the task is to predict the statement that is consistent with some commonsense knowledge facts.

Figure 1: Classic dungeon game Zork1 gameplay sample. The player receives textual observations describing the current game state and sends textual action commands to control the protagonist.
These benchmarks share some limitations, as they are mostly constructed to focus on a single reasoning type and require similar validation-based reasoning. First, most benchmarks concentrate on a specific facet and ask human annotators to write candidate statements related to that particular type of commonsense. As a result, the distribution of these datasets is unnatural and biased toward a specific facet. For example, most benchmarks focus on collocation, association, or other relations (e.g., ConceptNet (Speer et al., 2017) relations) between words or concepts (Levesque et al., 2012; Talmor et al., 2019; Mullenbach et al., 2019; Jiang et al., 2020). Other examples include temporal commonsense (Zhou et al., 2019), physical interactions between actions and objects (Bisk et al., 2020), emotions and behaviors of people in a given situation (Sap et al., 2019b), and cause-effect relations between events and states (Sap et al., 2019a; Bhagavatula et al., 2019; Huang et al., 2019). Second, most datasets require validation-based reasoning between a commonsense fact and a text statement but neglect hops over multiple facts.²

These limitations bias model evaluation. For example, pre-trained language models (PLMs), such as BERT (Devlin et al., 2019), can handle most benchmarks well, because their pre-training corpora may include texts on the required facts and thus provide shortcuts to a dominating portion of commonsense-validation instances. In summary, the above limitations of previous benchmarks create a discrepancy with practical NLP tasks, which require broad reasoning ability over various facets.

² Some datasets include a portion of instances that require explicit reasoning capacity, such as (Bhagavatula et al., 2019; Huang et al., 2019; Bisk et al., 2020; Sap et al., 2019b). But still, standalone facts can solve most such instances.
Our Contribution. We derive a new commonsense reasoning dataset from the model-based reinforcement learning challenge of Interactive Fiction (IF) games to address the above limitations. Recent advances (Hausknecht et al., 2019; Ammanabrolu and Hausknecht, 2020; Guo et al., 2020) in IF games have recognized several commonsense reasoning challenges, such as detecting valid actions and predicting the effects of different actions. Figure 1 illustrates sample gameplay of the classic game Zork1 and the required commonsense knowledge. We derive a commonsense dataset from human players' gameplay records related to the second challenge, i.e., predicting which textual observation is most likely after applying an action or a sequence of actions to a given game state.
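To make the task format concrete, here is a minimal sketch of what a single forward-prediction instance could look like. The field names, the sample texts, and the four-way candidate pool are illustrative assumptions for exposition, not the released data schema.

```python
# A hypothetical JECC/ZUCC-style instance: given the current textual
# observation and the player's action, pick the true next observation.
# Field names and texts are illustrative, not the released schema.
instance = {
    "observation": "You are facing the north side of a white house. "
                   "There is no door here, and all the windows are boarded up.",
    "action": "go east",
    "candidates": [
        "You can't go that way.",                              # distractor
        "Behind House. You are behind the white house. In one "
        "corner of the house there is a small window which is "
        "slightly ajar.",                                      # true next observation
        "Forest. This is a dimly lit forest.",                 # distractor
        "Taken.",                                              # distractor
    ],
    "label": 1,  # index of the observation that actually follows the action
}
```

A model scores each candidate against the (observation, action) pair; predicting the correct index requires spatial and object-interaction commonsense rather than fact lookup.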
The derived dataset naturally addresses the aforementioned limitations of previous datasets. First, predicting the next observation naturally requires various commonsense knowledge and reasoning types. As shown in Figure 1, a primary commonsense type is spatial reasoning, e.g., "climb the tree" puts the protagonist up in a tree. Another primary type is reasoning about object interactions. For example, keys can open locks (object relationships); "hatch egg" will reveal "things" inside the egg (object properties); "burn repellent" leads to an explosion that kills the player (physical reasoning). These interactions are more comprehensive than the relationships defined in ConceptNet as used in previous datasets. Second, the rich textual observations enable more complex reasoning beyond direct commonsense validation. Due to the narrative nature of the observations, a large portion of them are not a sole statement of the action effect, but an extended narrative about what happens because of the effect.³ Third, our commonsense reasoning task formulation shares the essence of dynamics-model learning for model-based RL solutions related to world models and MuZero (Ha and Schmidhuber, 2018; Schrittwieser et al., 2019). Therefore, models developed on our benchmarks provide direct value to model-based RL for text-game playing.

Finally, compared to previous works that rely heavily on human annotation, our dataset construction requires minimal human effort, providing great expansibility. For example, with the large number of available IF games in dungeon crawl, Sci-Fi, mystery, comedy, and horror genres, it is straightforward to extend our dataset to include more samples and cover a wide range of genres. We can also naturally increase the reasoning difficulty by increasing the prediction horizon, i.e., predicting the observation after taking multi-step actions instead of a single one.

³ For some actions, such as getting and dropping objects, the next observations are simple statements. We removed some of these actions; details can be found in Section 3.
In summary, we introduce a new commonsense reasoning dataset construction paradigm, together with two datasets. The larger dataset covers 29 games in multiple domains from the Jericho environment (Hausknecht et al., 2019), named the Jericho Environment Commonsense Comprehension task (JECC). The smaller dataset, aimed at single-domain testing and fast model development, includes four IF games in the Zork Universe, named Zork Universe Commonsense Comprehension (ZUCC). We provide strong baselines on the datasets and categorize their performance gap compared to human experts.
2 Related Work
Previous work has identified various types of commonsense knowledge that humans master for text understanding. As discussed in the introduction, most existing datasets cover one or a few limited types. Also, they mostly take the form of commonsense fact validation based on a text statement.
Semantic Relations between Concepts. Most previous datasets cover semantic relations between words or concepts. These relations include concept hierarchies, such as those covered by WordNet or ConceptNet, as well as word collocations and associations. For example, the early Winograd work (Levesque et al., 2012) evaluates a model's ability to capture word collocations, associations between objects, and their attributes as a pronoun resolution task. Talmor et al. (2019) provide one of the first datasets covering ConceptNet relational tuple validation as a question-answering task: a question asks about the relation of a source object, and the model selects the target object that satisfies the relation from four candidates. Mullenbach et al. (2019) focus on the collocations between adjectives and objects. Their task takes the form of textual inference, where a premise describes an object and the corresponding hypothesis consists of the object modified by an adjective. Jiang et al. (2020) study associations among multiple words, i.e., whether a word can be associated with two or more given others (though the work does not formally define the types of associations). They propose a new task format in games where the player produces as many words as possible by combining existing words.
Causes/Effects between Events or States. Previous work proposes datasets that require causal knowledge between events and states (Sap et al., 2019a; Bhagavatula et al., 2019; Huang et al., 2019). Sap et al. (2019a) take a text generation or inference form between a cause and an effect. Bhagavatula et al. (2019) take a form similar to ours: a sequence of two observations is given, and the model selects the plausible hypothesis from multiple candidates. Their data construction idea can also be applied to include other types of knowledge; however, their dataset focuses only on causal relations between events. Huang et al. (2019) utilize multiple-choice QA on a background paragraph, which covers a wider range of causal knowledge for both events and statements.
Other Commonsense Datasets. Zhou et al. (2019) propose a unique temporal commonsense dataset. The task is to predict a follow-up event's duration or frequency, given a short paragraph describing the event. Bisk et al. (2020) focus on physical interactions between actions and objects, namely whether an action over an object leads to a target effect in the physical world. These datasets can mostly be solved by applying the correct commonsense facts; thus, they do not require reasoning. Sap et al. (2019b) propose a task of inferring people's emotions and behaviors in a given situation. Compared to the others, this task contains a larger portion of instances that require reasoning beyond fact validation. All the above tasks take the multiple-choice question-answering form.
Next-Sentence Prediction. Next-sentence prediction tasks, such as SWAG (Zellers et al., 2018), are also related to our work. These tasks naturally cover various types of commonsense knowledge and sometimes require reasoning. The issue is that the way they guarantee distractor candidates are irrelevant greatly simplifies the task. In comparison, our task utilizes the IF game engine to ensure that actions uniquely determine the correct candidates, and ours uses human-written texts.

Finally, our idea is closely related to (Yao et al., 2020), which creates a task of predicting valid actions for each IF game state. Yao et al. (2020, 2021) also discuss the advantages of supervised tasks derived from IF games for natural language understanding purposes.
3 Dataset Construction: Commonsense Challenges from IF Games

We pick games supported by the Jericho environment (Hausknecht et al., 2019) to construct the JECC dataset.⁴ We pick games in the Zork Universe for the ZUCC dataset.⁵ We first introduce the necessary definitions in the IF game domain and then describe how we construct our ZUCC and JECC datasets as forward prediction tasks based on human players' gameplay records, followed by a summary of the improved properties of our dataset compared to previous ones. The dataset will be released for public usage; it can be created with our released code under the MIT License.
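As a concrete illustration of this construction, the following is a minimal sketch that enumerates (observation, action, next observation) tuples along a human walkthrough, using the public Jericho API (FrotzEnv, get_walkthrough, step). The prefix-based filter for trivial actions is an illustrative stand-in for the paper's actual filtering (see footnote 3); the released code is the authoritative pipeline.

```python
# Minimal sketch: derive forward-prediction tuples from a human
# walkthrough via the Jericho API. The simple prefix filter for trivial
# "take"/"get"/"drop" actions stands in for the paper's actual filtering.
from jericho import FrotzEnv

def walkthrough_tuples(rom_path):
    """Yield (observation, action, next_observation) tuples along the walkthrough."""
    env = FrotzEnv(rom_path)
    obs, _ = env.reset()
    for action in env.get_walkthrough():        # human-expert action sequence
        next_obs, reward, done, _ = env.step(action)
        if not action.lower().startswith(("take", "get", "drop")):
            yield obs, action, next_obs
        obs = next_obs
        if done:
            break

# Example: print the first derived tuple from Zork1 (ROM path assumed).
for obs, action, next_obs in walkthrough_tuples("zork1.z5"):
    print(action, "->", next_obs[:60], "...")
    break
```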
3.1 Interactive Fiction Game Background

Each IF game can be defined as a Partially Observable Markov Decision Process (POMDP), namely a 7-tuple ⟨S, A, T, O, Ω, R, γ⟩, representing the hidden game state set, the action set, the state transition function, the set of textual observations, the textual observation function, the reward function, and the discount factor, respectively.
⁴ We collect the games 905, acorncourt, advent, adventureland, afflicted, awaken, balances, deephome, dragon, enchanter, inhumane, library, moonlit, omniquest, pentari, reverb, snacktime, sorcerer, and zork1 for training; zork3, detective, ztuu, jewel, and zork2 as the development set; and temple, gold, karn, zenon, and wishbringer as the test set.
⁵ We pick Zork1, Enchanter, and Sorcerer as the training set; the dev and test sets are non-overlapping splits from Zork3.
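To ground this POMDP interface, the sketch below shows how Jericho exposes the hidden state (get_state/set_state) and the valid actions at a state, and how the next observations of alternative valid actions could serve as distractor candidates. Treating other actions' outcomes as distractors is an assumption made here for illustration; the paper's exact candidate construction is described in the remainder of this section.

```python
# Illustrating the POMDP interface in Jericho: the hidden state s can be
# snapshotted and restored, an action a triggers the transition T, and
# step() returns the textual observation o. Using other valid actions'
# next observations as distractors is an illustrative assumption.
from jericho import FrotzEnv

env = FrotzEnv("zork1.z5")                # ROM path assumed
obs, _ = env.reset()

saved = env.get_state()                   # snapshot of the hidden game state s
outcomes = {}
for action in env.get_valid_actions():   # valid subset of the action set A
    next_obs, reward, done, _ = env.step(action)
    outcomes[action] = next_obs          # textual observation of the resulting state
    env.set_state(saved)                 # restore s before trying the next action

# The taken action's outcome is the answer; others can act as distractors.
for action, next_obs in list(outcomes.items())[:3]:
    print(f"{action!r} -> {next_obs[:50]}...")
```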