JECC: Commonsense Reasoning Tasks Derived from Interactive Fictions

Mo Yu*1  Yi Gu*2  Xiaoxiao Guo3  Yufei Feng4  Xiaodan Zhu4  Michael Greenspan4  Murray Campbell5  Chuang Gan5

1WeChat AI  2UC San Diego  3LinkedIn  4Queen's University  5IBM Research
moyumyu@tencent.com  yig025@ucsd.edu

* Equal contribution. Work done when MY and XG were working at IBM.
Abstract

Commonsense reasoning simulates the human ability to make presumptions about our physical world, and it is an essential cornerstone in building general AI systems. We propose a new commonsense reasoning dataset based on human Interactive Fiction (IF) gameplay walkthroughs, as human players demonstrate plentiful and diverse commonsense reasoning. The new dataset provides a natural mixture of various reasoning types and requires multi-hop reasoning. Moreover, the IF game-based construction procedure requires much less human intervention than previous ones. Unlike existing benchmarks, our dataset focuses on the assessment of functional commonsense knowledge rules rather than factual knowledge. Hence, to achieve higher performance on our tasks, models need to effectively utilize such functional knowledge to infer the outcomes of actions, rather than relying solely on memorizing facts. Experiments show that the introduced dataset is challenging for previous machine reading models as well as new large language models, with a significant 20% performance gap compared to human experts.¹

¹ Our code and data are released at https://github.com/Gorov/zucc.
1 Introduction
There has been a flurry of datasets and benchmarks proposed to address natural language-based commonsense reasoning (Levesque et al., 2012; Zhou et al., 2019; Talmor et al., 2019; Mullenbach et al., 2019; Jiang et al., 2020; Sap et al., 2019a; Bhagavatula et al., 2019; Huang et al., 2019; Bisk et al., 2020; Sap et al., 2019b; Zellers et al., 2018). These benchmarks usually adopt a multiple-choice form: given an input query and an optional short paragraph of background description, each candidate forms a statement, and the task is to predict the statement that is consistent with some commonsense knowledge facts.

Figure 1: Classic dungeon game Zork1 gameplay sample. The player receives textual observations describing the current game state and sends textual action commands to control the protagonist.
These benchmarks share some limitations, as they are mostly constructed to focus on a single reasoning type and require similar validation-based reasoning. First, most benchmarks concentrate on a specific facet and ask human annotators to write candidate statements related to that particular type of commonsense. As a result, the distribution of these datasets is unnatural and biased toward a specific facet. For example, most benchmarks focus on collocation, association, or other relations (e.g., ConceptNet (Speer et al., 2017) relations) between words or concepts (Levesque et al., 2012; Talmor et al., 2019; Mullenbach et al., 2019; Jiang et al., 2020). Other examples include temporal commonsense (Zhou et al., 2019), physical interactions between actions and objects (Bisk et al., 2020), emotions and behaviors of people in a given situation (Sap et al., 2019b), and cause-effect relations between events and states (Sap et al., 2019a; Bhagavatula et al., 2019; Huang et al., 2019). Second, most datasets require validation-based reasoning between a commonsense fact and a text statement but neglect hops over multiple facts.²

These limitations bias model evaluation. For example, pre-trained language models (PLMs), such as BERT (Devlin et al., 2019), can handle most benchmarks well, because their pre-training corpora may include texts on the required facts and thus provide shortcuts to a dominating portion of commonsense-validation instances. In summary, the above limitations of previous benchmarks create a discrepancy with practical NLP tasks, which require broad reasoning ability over various facets.

² Some datasets include a portion of instances that require explicit reasoning capacity, such as (Bhagavatula et al., 2019; Huang et al., 2019; Bisk et al., 2020; Sap et al., 2019b). But still, standalone facts can solve most such instances.
Our Contribution. We derive a new commonsense reasoning dataset from the model-based reinforcement learning challenge of Interactive Fiction (IF) games to address the above limitations. Recent advances (Hausknecht et al., 2019; Ammanabrolu and Hausknecht, 2020; Guo et al., 2020) in IF games have recognized several commonsense reasoning challenges, such as detecting valid actions and predicting the effects of different actions. Figure 1 illustrates sample gameplay of the classic game Zork1 and the required commonsense knowledge. We derive a commonsense dataset from human players' gameplay records related to the second challenge, i.e., predicting which textual observation is most likely after applying an action or a sequence of actions to a given game state.
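To make the task format concrete, here is a minimal sketch of what a single forward-prediction instance could look like. The field names, the sample texts, and the four-way candidate pool are illustrative assumptions for exposition, not the released data schema.

```python
# A hypothetical JECC/ZUCC-style instance: given the current textual
# observation and the player's action, pick the true next observation.
# Field names and texts are illustrative, not the released schema.
instance = {
    "observation": "You are facing the north side of a white house. "
                   "There is no door here, and all the windows are boarded up.",
    "action": "go east",
    "candidates": [
        "You can't go that way.",                              # distractor
        "Behind House. You are behind the white house. In one "
        "corner of the house there is a small window which is "
        "slightly ajar.",                                      # true next observation
        "Forest. This is a dimly lit forest.",                 # distractor
        "Taken.",                                              # distractor
    ],
    "label": 1,  # index of the observation that actually follows the action
}
```

A model scores each candidate against the (observation, action) pair; predicting the correct index requires spatial and object-interaction commonsense rather than fact lookup.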
The derived dataset naturally addresses the aforementioned limitations of previous datasets. First, predicting the next observation naturally requires various commonsense knowledge and reasoning types. As shown in Figure 1, a primary commonsense type is spatial reasoning, e.g., "climb the tree" puts the protagonist up in a tree. Another primary type is reasoning about object interactions. For example, keys can open locks (object relationships); "hatch egg" will reveal "things" inside the egg (object properties); "burn repellent" leads to an explosion that kills the player (physical reasoning). These interactions are more comprehensive than the relationships defined in ConceptNet as used in previous datasets. Second, the rich textual observations enable more complex reasoning beyond direct commonsense validation. Due to the narrative nature of the observations, a large portion of them are not a sole statement of the action effect, but an extended narrative about what happens because of the effect.³ Third, our commonsense reasoning task formulation shares the essence of dynamics-model learning for model-based RL solutions related to world models and MuZero (Ha and Schmidhuber, 2018; Schrittwieser et al., 2019). Therefore, models developed on our benchmarks provide direct value to model-based RL for text-game playing.

Finally, compared to previous works that rely heavily on human annotation, our dataset construction requires minimal human effort, providing great expansibility. For example, with the large number of available IF games in dungeon crawl, Sci-Fi, mystery, comedy, and horror genres, it is straightforward to extend our dataset to include more samples and cover a wide range of genres. We can also naturally increase the reasoning difficulty by increasing the prediction horizon, i.e., predicting the observation after taking multi-step actions instead of a single one.

³ For some actions, such as getting and dropping objects, the next observations are simple statements. We removed some of these actions; details can be found in Section 3.
In summary, we introduce a new commonsense reasoning dataset construction paradigm, together with two datasets. The larger dataset covers 29 games in multiple domains from the Jericho environment (Hausknecht et al., 2019), named the Jericho Environment Commonsense Comprehension task (JECC). The smaller dataset, aimed at single-domain testing and fast model development, includes four IF games in the Zork Universe, named Zork Universe Commonsense Comprehension (ZUCC). We provide strong baselines on the datasets and categorize their performance gap compared to human experts.
2 Related Work
Previous work has identified various types of commonsense knowledge that humans master for text understanding. As discussed in the introduction, most existing datasets cover one or a few limited types. Also, they mostly take the form of commonsense fact validation based on a text statement.
Semantic Relations between Concepts. Most previous datasets cover semantic relations between words or concepts. These relations include concept hierarchies, such as those covered by WordNet or ConceptNet, as well as word collocations and associations. For example, the early Winograd work (Levesque et al., 2012) evaluates a model's ability to capture word collocations, associations between objects, and their attributes as a pronoun resolution task. Talmor et al. (2019) provide one of the first datasets covering ConceptNet relational tuple validation as a question-answering task: a question asks about the relation of a source object, and the model selects the target object that satisfies the relation from four candidates. Mullenbach et al. (2019) focus on the collocations between adjectives and objects. Their task takes the form of textual inference, where a premise describes an object and the corresponding hypothesis consists of the object modified by an adjective. Jiang et al. (2020) study associations among multiple words, i.e., whether a word can be associated with two or more given others (though the work does not formally define the types of associations). They propose a new task format in games where the player produces as many words as possible by combining existing words.
Causes/Effects between Events or States. Previous work proposes datasets that require causal knowledge between events and states (Sap et al., 2019a; Bhagavatula et al., 2019; Huang et al., 2019). Sap et al. (2019a) take a text generation or inference form between a cause and an effect. Bhagavatula et al. (2019) take a form similar to ours: a sequence of two observations is given, and the model selects the plausible hypothesis from multiple candidates. Their data construction idea can also be applied to include other types of knowledge; however, their dataset focuses only on causal relations between events. Huang et al. (2019) utilize multiple-choice QA on a background paragraph, which covers a wider range of causal knowledge for both events and statements.
Other Commonsense Datasets. Zhou et al. (2019) propose a unique temporal commonsense dataset. The task is to predict a follow-up event's duration or frequency, given a short paragraph describing the event. Bisk et al. (2020) focus on physical interactions between actions and objects, namely whether an action over an object leads to a target effect in the physical world. These datasets can mostly be solved by applying the correct commonsense facts; thus, they do not require reasoning. Sap et al. (2019b) propose a task of inferring people's emotions and behaviors in a given situation. Compared to the others, this task contains a larger portion of instances that require reasoning beyond fact validation. All the above tasks take the multiple-choice question-answering form.
Next-Sentence Prediction. Next-sentence prediction tasks, such as SWAG (Zellers et al., 2018), are also related to our work. These tasks naturally cover various types of commonsense knowledge and sometimes require reasoning. The issue is that the way they guarantee distractor candidates are irrelevant greatly simplifies the task. In comparison, our task utilizes the IF game engine to ensure that actions uniquely determine the correct candidates, and ours uses human-written texts.

Finally, our idea is closely related to (Yao et al., 2020), which creates a task of predicting valid actions for each IF game state. Yao et al. (2020, 2021) also discuss the advantages of supervised tasks derived from IF games for natural language understanding purposes.
3 Dataset Construction: Commonsense Challenges from IF Games

We pick games supported by the Jericho environment (Hausknecht et al., 2019) to construct the JECC dataset.⁴ We pick games in the Zork Universe for the ZUCC dataset.⁵ We first introduce the necessary definitions in the IF game domain and then describe how we construct our ZUCC and JECC datasets as forward prediction tasks based on human players' gameplay records, followed by a summary of the improved properties of our dataset compared to previous ones. The dataset will be released for public usage; it can be created with our released code under the MIT License.
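As a concrete illustration of this construction, the following is a minimal sketch that enumerates (observation, action, next observation) tuples along a human walkthrough, using the public Jericho API (FrotzEnv, get_walkthrough, step). The prefix-based filter for trivial actions is an illustrative stand-in for the paper's actual filtering (see footnote 3); the released code is the authoritative pipeline.

```python
# Minimal sketch: derive forward-prediction tuples from a human
# walkthrough via the Jericho API. The simple prefix filter for trivial
# "take"/"get"/"drop" actions stands in for the paper's actual filtering.
from jericho import FrotzEnv

def walkthrough_tuples(rom_path):
    """Yield (observation, action, next_observation) tuples along the walkthrough."""
    env = FrotzEnv(rom_path)
    obs, _ = env.reset()
    for action in env.get_walkthrough():        # human-expert action sequence
        next_obs, reward, done, _ = env.step(action)
        if not action.lower().startswith(("take", "get", "drop")):
            yield obs, action, next_obs
        obs = next_obs
        if done:
            break

# Example: print the first derived tuple from Zork1 (ROM path assumed).
for obs, action, next_obs in walkthrough_tuples("zork1.z5"):
    print(action, "->", next_obs[:60], "...")
    break
```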
3.1 Interactive Fiction Game Background

Each IF game can be defined as a Partially Observable Markov Decision Process (POMDP), namely a 7-tuple ⟨S, A, T, O, Ω, R, γ⟩, representing the hidden game state set, the action set, the state transition function, the set of textual observations, the textual observation function, the reward function, and the discount factor, respectively.
⁴ We collect the games 905, acorncourt, advent, adventureland, afflicted, awaken, balances, deephome, dragon, enchanter, inhumane, library, moonlit, omniquest, pentari, reverb, snacktime, sorcerer, and zork1 for training; zork3, detective, ztuu, jewel, and zork2 as the development set; and temple, gold, karn, zenon, and wishbringer as the test set.
⁵ We pick Zork1, Enchanter, and Sorcerer as the training set; the dev and test sets are non-overlapping splits from Zork3.
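To ground this POMDP interface, the sketch below shows how Jericho exposes the hidden state (get_state/set_state) and the valid actions at a state, and how the next observations of alternative valid actions could serve as distractor candidates. Treating other actions' outcomes as distractors is an assumption made here for illustration; the paper's exact candidate construction is described in the remainder of this section.

```python
# Illustrating the POMDP interface in Jericho: the hidden state s can be
# snapshotted and restored, an action a triggers the transition T, and
# step() returns the textual observation o. Using other valid actions'
# next observations as distractors is an illustrative assumption.
from jericho import FrotzEnv

env = FrotzEnv("zork1.z5")                # ROM path assumed
obs, _ = env.reset()

saved = env.get_state()                   # snapshot of the hidden game state s
outcomes = {}
for action in env.get_valid_actions():   # valid subset of the action set A
    next_obs, reward, done, _ = env.step(action)
    outcomes[action] = next_obs          # textual observation of the resulting state
    env.set_state(saved)                 # restore s before trying the next action

# The taken action's outcome is the answer; others can act as distractors.
for action, next_obs in list(outcomes.items())[:3]:
    print(f"{action!r} -> {next_obs[:50]}...")
```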