External knowledge from humans, represented as logical functions, helps the training of the agent (Kimura et al. 2021b). New modality information extracted from observations or action text can be introduced to support decision making (Carta et al. 2020; Murugesan, Chaudhury, and Talamadupula 2021; Shridhar et al. 2020). Among these methods, visual information from images or videos is commonly used, as many other studies (Tanaka and Simo-Serra 2021; Kimura et al. 2020) have used it to understand attention and sequential information in decision making. Logical rule extraction can be exploited to improve the training speed and the interpretability of the agent (Kimura et al. 2021c; Chaudhury et al. 2021). Since commonsense knowledge is normally represented as a graph structure, such logical rule representations are compatible with it.
However, at the time of writing, no research has utilized the benefits of these multiple extensions to compensate for missing information. In particular, we hypothesize that the commonsense knowledge about object-location relationships used in TWC games can easily be obtained from visual information. For example, instead of stating the place name of each object, operators can display a picture of a tidy room, which is a quicker way to convey the information to humans.
In this paper, we propose a novel agent that tackles TWC games by leveraging visual scene graph datasets to obtain commonsense knowledge. The original TWC agent (Murugesan et al. 2021) constructs a commonsense subgraph from ConceptNet (Speer, Chin, and Havasi 2017a), which is textual knowledge, but it is necessary to combine many graphs to obtain a single piece of commonsense, resulting in a complicated subgraph. In fact, Murugesan et al. prepared a ‘manual’ commonsense subgraph from ConceptNet in their study to cope with this graph complexity. In contrast, since scene graph recognition achieves high accuracy even on complex images, visual information can deliver detailed and well-organized graph information all at once. Figure 1 shows an example of acquiring commonsense knowledge from the scene graphs in an image. In this example, whereas ConceptNet contains redundant information for extracting a commonsense subgraph, the proposed extraction from scene graphs provides the necessary and sufficient information for the cleaning-up task. Furthermore, scene graphs contain direct spatial relationships between objects, such as “on” or “in” (Figure 2), which is useful because agents need to determine an object’s place in TWC games. Therefore, we use scene graph datasets as visual external knowledge. A scene graph dataset contains a large number of graphs that represent the relationships between entities in images. We use Visual Genome (VG) (Krishna et al. 2017), the most commonly used scene graph dataset, and compare its statistics with those of ConceptNet. We also conduct experiments to evaluate the performance of agents that use commonsense knowledge from a scene graph dataset in RL on text-based games.
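To make the idea concrete, the following is a minimal Python sketch of how object-location commonsense could be mined from the Visual Genome relationship annotations by keeping only spatial predicates such as “on” and “in”. The file name, the predicate list, and the exact annotation field names (some VG releases use “name”, others “names”) are assumptions for illustration rather than a description of our actual pipeline.

import json
from collections import Counter

# Spatial predicates treated as object--location relationships. This particular
# list is an assumption for illustration; only "on" and "in" are mentioned above.
SPATIAL_PREDICATES = {"on", "in", "inside", "on top of"}

def entity_name(entity):
    """Return a lowercase entity name, covering both the 'name' and 'names'
    fields that appear in different Visual Genome annotation releases."""
    if entity.get("names"):
        return entity["names"][0].lower()
    return entity.get("name", "").lower()

def extract_location_triples(relationships_path):
    """Count (object, predicate, location) triples over all VG scene graphs."""
    with open(relationships_path) as f:
        images = json.load(f)  # relationships.json: one record per image

    triples = Counter()
    for image in images:
        for rel in image.get("relationships", []):
            predicate = rel["predicate"].lower().strip()
            if predicate in SPATIAL_PREDICATES:
                subj = entity_name(rel["subject"])
                obj = entity_name(rel["object"])
                if subj and obj:
                    triples[(subj, predicate, obj)] += 1
    return triples

if __name__ == "__main__":
    counts = extract_location_triples("relationships.json")
    # Frequent triples such as ('apple', 'on', 'table') act as commonsense
    # knowledge about where an object is usually placed.
    for triple, count in counts.most_common(10):
        print(triple, count)

Aggregated counts of such (object, predicate, location) triples can then be organized into a commonsense subgraph analogous to the one TWC builds from ConceptNet.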
Related Work
Text-based RL Games
Text-based interactive RL games have been attracting the attention of many researchers owing to the development of environments such as TextWorld (Côté et al. 2018) and Jericho (Hausknecht et al. 2019). In these games, RL agents are required to understand high-level context information from textual observations alone. To overcome this difficulty, a number of prior works on these environments have extracted new information from textual observations: knowledge graphs, visual information, and logical rules.
Knowledge graphs represent relationships between entities such as real-world objects and events, or abstract concepts. Murugesan et al. (2021) proposed a new text-based environment, called “TextWorld Commonsense”, to infuse RL agents with commonsense knowledge, and developed baseline agents that use a commonsense subgraph constructed from ConceptNet (Liu and Singh 2004; Speer, Chin, and Havasi 2017a) as external knowledge. We use this work as a baseline method and introduce a new type of commonsense from visual datasets. Worldformer (Ammanabrolu and Riedl 2021) represents the environment state as a knowledge graph and uses a world model to predict the changes caused by an agent’s actions and to generate a set of contextually relevant actions.
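For reference, the kind of textual commonsense edge the TWC baselines rely on (e.g., an AtLocation relation) can be inspected through the public ConceptNet web API. The short Python sketch below is illustrative only and assumes that API; it is not the offline subgraph construction used by Murugesan et al.

import requests

def atlocation_edges(concept, limit=20):
    """Query the public ConceptNet API for AtLocation edges that start at
    the given English concept, e.g. 'apple' -> 'refrigerator'."""
    url = "https://api.conceptnet.io/query"
    params = {"start": f"/c/en/{concept}", "rel": "/r/AtLocation", "limit": limit}
    response = requests.get(url, params=params, timeout=10)
    response.raise_for_status()
    return [
        (edge["start"]["label"], edge["end"]["label"], edge.get("weight", 1.0))
        for edge in response.json().get("edges", [])
    ]

if __name__ == "__main__":
    for start, end, weight in atlocation_edges("apple"):
        print(f"{start} --AtLocation--> {end}  (weight {weight:.2f})")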
While knowledge graphs are useful for organizing abstract information from text descriptions alone, visual information enables the agent to grasp a detailed spatial situation, much like human imagination and visualization. The most important issue in using visual information is how to obtain it from only textual observations in text-based games. VisualHints (Carta et al. 2020) proposed an environment that automatically generates various hints about game states from textual observations and changes the difficulty level depending on the hint type. In (Murugesan, Chaudhury, and Talamadupula 2021), images are mainly retrieved from the Internet or generated from the given text descriptions with a pre-trained text-to-image model, AttnGAN (Xu et al. 2018). ALFWorld (Shridhar et al. 2021) combines TextWorld with an embodied simulator called ALFRED (Shridhar et al. 2020) to obtain information in two modalities. Shridhar et al. proposed an agent that first learns to solve abstract tasks in TextWorld and then transfers the learned high-level policies to low-level embodied tasks in ALFRED.
In addition, even with the aforementioned methods, improvements in training speed are limited, and the trained network still lacks interpretability. Several studies (Kimura et al. 2021c; Chaudhury et al. 2021) proposed novel approaches that extract symbolic first-order logical rules from text observations and select actions by using neuro-symbolic Logical Neural Networks (Riegel et al. 2020). These logical representations are compatible with commonsense graph structures.
As previously described, there have been various approaches using knowledge graphs, visual information, and logical rules. However, at the time of writing, no method has combined any of them. Therefore, we pro-