Commonsense Knowledge from Scene Graphs for Textual Environments
Tsunehiko Tanaka, 1,2*Daiki Kimura, 1Michiaki Tatsubori 1
1IBM Research 2Waseda University
tsunehiko@fuji.waseda.jp, daiki@jp.ibm.com, mich@jp.ibm.com
arXiv:2210.14162v1 [cs.CV] 19 Oct 2022
Abstract
Text-based games are becoming commonly used in reinforcement learning as real-world simulation environments. They are usually imperfect-information games, and their interactions take place only in the textual modality. To tackle these games, it is effective to complement the missing information by providing knowledge from outside the game, such as human common sense. However, in previous works such knowledge has only been available from textual sources. In this paper, we investigate the advantage of employing commonsense reasoning obtained from visual datasets such as scene graph datasets. In general, images convey more comprehensive information to humans than text does. This property makes it possible to extract commonsense relationship knowledge that is more useful for acting effectively in a game. We compare the statistics of the spatial relationships available in Visual Genome (a scene graph dataset) and ConceptNet (a text-based knowledge base) to analyze the effectiveness of introducing scene graph datasets. We also conducted experiments on a text-based game task that requires commonsense reasoning. Our experimental results demonstrate that our proposed methods achieve performance higher than or competitive with existing state-of-the-art methods.
Introduction
Reinforcement learning (RL) is a type of machine learning that has the great advantage of not requiring labeled data and has been used in various simulation environments (Mnih et al. 2015; Silver, Huang, and et al. 2016; Kimura 2018; Kimura et al. 2018). Since textual conversation agents are commonly used in our daily lives, text-based environments, where both the observation and action spaces are restricted to the modality of text, have been attracting attention. RL in such environments requires developing an agent that combines language comprehension through natural language processing with sequential decision-making in a complex environment: the textual observation contains a great deal of noisy information and suffers from partial observability.
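To make the setting concrete, the following is a minimal sketch of the observation-action loop an agent faces in such an environment. The stub environment, its observation text, and the action strings are invented for illustration and merely stand in for a real text-based game engine:

```python
class StubTextEnv:
    """Stand-in for a text-based game: observations are text and
    reveal only part of the underlying state (partial observability)."""

    def __init__(self):
        self.fork_location = "table"  # hidden state, never shown directly
        self.done = False

    def reset(self):
        return ("You've entered a kitchen. You see a dishwasher, "
                "a fridge, and a dirty fork on the table.")

    def step(self, action):
        # Only the cleaning-up action solves this toy task.
        if action == "put the dirty fork in the dishwasher":
            self.fork_location = "dishwasher"
            self.done = True
            return "The dirty fork is now in the dishwasher.", 1.0, True
        return "Nothing interesting happens.", 0.0, False

env = StubTextEnv()
obs = env.reset()
actions = ["open the fridge", "put the dirty fork in the dishwasher"]
total_reward = 0.0
for act in actions:
    obs, reward, done = env.step(act)
    total_reward += reward
    if done:
        break
print(total_reward)  # 1.0
```

Note that the agent sees only the returned text, not `fork_location`; this is what makes the problem a POMDP rather than a fully observable MDP.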
Text-based games are a partially observable Markov deci-
sion process (POMDP) (Kaelbling, Littman, and Cassandra
*This work was done during an internship at IBM Research.
Copyright © 2022, Association for the Advancement of Artificial
Intelligence (www.aaai.org). All rights reserved.
[Figure 1: panels showing an observation ("You've entered a kitchen. You see a dishwasher and a fridge. Here's a dining table. You see a dirty fork and a red apple on the table."), candidate actions (e.g., "Put the dirty fork in the dishwasher"), a scene graph with the edge dirty fork IN dishwasher, and the corresponding ConceptNet subgraph.]
Figure 1: Illustration of our commonsense acquisition from scene graphs. To provide the commonsense fact dirty fork IN dishwasher to an agent, a single image is sufficient for scene graphs (top left), but ConceptNet requires several graphs to be combined, which is redundant.
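The extraction illustrated in Figure 1 can be sketched in code. In the toy example below, the scene-graph triples and the set of spatial predicates are invented for illustration; it shows how object-place facts such as (dirty fork, in, dishwasher) can be filtered out of a scene graph:

```python
# Toy scene graph for the kitchen image in Figure 1:
# each relationship is a (subject, predicate, object) triple.
scene_graph = [
    ("dirty fork", "in", "dishwasher"),
    ("red apple", "on", "dining table"),
    ("dishwasher", "near", "fridge"),
]

# Spatial predicates that directly indicate an object's place.
SPATIAL_PREDICATES = {"in", "on"}

def extract_place_facts(triples):
    """Keep only triples whose predicate is a spatial relation."""
    return [(s, p, o) for (s, p, o) in triples if p in SPATIAL_PREDICATES]

facts = extract_place_facts(scene_graph)
print(facts)  # [('dirty fork', 'in', 'dishwasher'), ('red apple', 'on', 'dining table')]
```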
1998), in which the agent cannot observe all of the information about the state from the text given by the environment. TextWorld (Côté et al. 2018) is a textual game generator and extensible sandbox learning environment for RL agents, and various methods have been proposed for this game to compensate for the missing information (Kimura et al. 2021b; Murugesan et al. 2021; Carta et al. 2020; Murugesan, Chaudhury, and Talamadupula 2021; Shridhar et al. 2020; Kimura et al. 2021a,c; Chaudhury et al. 2021). There are three types of
extensions: external knowledge, new modality, and logical rule extraction. External knowledge that is useful for training agents can be introduced from humans or other domain sources. A study reports that commonsense knowledge is an important aspect of human intelligence (Murugesan et al. 2021). In that study, TextWorld Commonsense (TWC), which requires commonsense as external knowledge, is proposed as an extension of TextWorld. The task in a TWC game is cleaning up a room, and the commonsense in this game is mainly place information for each object. The same study also includes a baseline agent for TWC games that uses a commonsense subgraph extracted from external knowledge (we call this model the TWC agent and the environment TWC games to distinguish them). Another study reported that introducing external knowledge from humans as logical functions helps the training of the agent (Kimura et al. 2021b). New modality information extracted from observations or action text
can be introduced to make decisions (Carta et al. 2020;
Murugesan, Chaudhury, and Talamadupula 2021; Shridhar
et al. 2020). In these methods, visual information from images or videos is commonly used, as in many other studies (Tanaka and Simo-Serra 2021; Kimura et al. 2020), to understand attention and sequential information in decision making. Logical rule extraction can be exploited to improve the speed of training and the interpretability of the agent (Kimura et al. 2021c; Chaudhury et al. 2021).
Since commonsense knowledge is normally represented by
a graph structure, the logical rule representation is compati-
ble with commonsense knowledge.
However, at the time of writing, there has been no research that utilizes the benefits of these multiple extensions to compensate for missing information. In particular, we hypothesize that the commonsense knowledge of object place relationships used in TWC games can be easily obtained from visual information. For example, instead of stating the place name of each object, an operator can display a picture of a tidy room, which is a quicker explanation for humans.
In this paper, we propose a novel agent that tackles TWC games by leveraging visual scene graph datasets to obtain commonsense. The original TWC agent (Murugesan et al. 2021) constructs a commonsense subgraph from ConceptNet (Speer, Chin, and Havasi 2017a), which is textual knowledge, but many graphs must be combined to obtain a single piece of commonsense, resulting in a complicated subgraph. In fact, Murugesan et al. prepared a 'manual' commonsense subgraph from ConceptNet in their study to cope with this complexity. In contrast, since scene graph recognition achieves high accuracy even on complex images, visual information can deliver detailed and organized graph information all at once. Figure 1 shows an example of the acquisition of commonsense knowledge from scene graphs in an image. In this example, whereas ConceptNet contains redundant information for extracting a commonsense subgraph, the proposed extraction from scene graphs yields necessary and sufficient information for the cleaning-up task. Furthermore, relationships in scene graphs include direct spatial relationships between objects, such as "on" or "in" (Figure 2), which is useful because agents need to determine an object's place in TWC games. Therefore, we use scene graph datasets as visual external knowledge. A scene graph dataset contains a large number of graphs that represent the relationships between entities in images. We use Visual Genome (VG) (Krishna et al. 2017), the most commonly used scene graph dataset, and compare its statistics with ConceptNet. We also conduct experiments to evaluate the performance of agents with commonsense knowledge from a scene graph dataset in RL on text-based games.
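As a rough sketch of how such statistics can be gathered, the snippet below counts predicate frequencies over records shaped like Visual Genome's relationship annotations. The field names follow the public VG release, but the sample records themselves are invented for illustration:

```python
from collections import Counter

# Toy records mimicking the structure of Visual Genome's
# relationships.json; the contents are invented for illustration.
images = [
    {"relationships": [
        {"predicate": "ON", "subject": {"names": ["cup"]},
         "object": {"names": ["table"]}},
        {"predicate": "IN", "subject": {"names": ["fork"]},
         "object": {"names": ["drawer"]}},
    ]},
    {"relationships": [
        {"predicate": "ON", "subject": {"names": ["book"]},
         "object": {"names": ["shelf"]}},
    ]},
]

def predicate_counts(images):
    """Count how often each (normalized) predicate occurs."""
    counts = Counter()
    for image in images:
        for rel in image["relationships"]:
            counts[rel["predicate"].strip().lower()] += 1
    return counts

print(predicate_counts(images))  # Counter({'on': 2, 'in': 1})
```

Running the same pass over the full VG release would yield the spatial-relationship statistics compared against ConceptNet in this paper.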
Related Work
Text-based RL Games
Text-based interactive RL games have been gaining the attention of many researchers due to the development of environments such as TextWorld (Côté et al. 2018) and Jericho (Hausknecht et al. 2019). In these games, RL agents are required to understand high-level context information from only textual observations. To overcome this difficulty, a number of prior works on these environments have extracted new information from textual observations: knowledge graphs, visual information, and logical rules.
Knowledge graphs represent relationships between enti-
ties like real-world objects and events, or abstract concepts.
A new text-based environment, called "TextWorld Commonsense", was proposed in (Murugesan et al. 2021) to infuse RL agents with commonsense knowledge, and the same work developed baseline agents using a commonsense subgraph constructed from ConceptNet (Liu and Singh 2004; Speer, Chin, and Havasi 2017a) as external knowledge. We use this work as a baseline method and introduce a new type of commonsense from visual datasets. Worldformer (Ammanabrolu and Riedl 2021) represents the environment status as a knowledge graph and uses a world model to predict the changes caused by an agent's actions and to generate a set of contextually relevant actions.
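As an illustration of how place commonsense can be read out of ConceptNet, the snippet below parses a response of the shape returned by ConceptNet's public REST API (a query URL is shown in the comment). The sample edges and weights are invented for illustration:

```python
# Querying ConceptNet's public REST API for AtLocation edges of
# "fork" would use a URL like:
#   http://api.conceptnet.io/query?start=/c/en/fork&rel=/r/AtLocation
# Below we parse a response of that shape offline.
sample_response = {
    "edges": [
        {"rel": {"label": "AtLocation"},
         "start": {"label": "fork"}, "end": {"label": "drawer"},
         "weight": 2.0},
        {"rel": {"label": "AtLocation"},
         "start": {"label": "fork"}, "end": {"label": "kitchen"},
         "weight": 1.0},
    ]
}

def locations_from_response(response):
    """Extract candidate places from AtLocation edges, best first."""
    edges = [e for e in response["edges"]
             if e["rel"]["label"] == "AtLocation"]
    edges.sort(key=lambda e: e["weight"], reverse=True)
    return [e["end"]["label"] for e in edges]

print(locations_from_response(sample_response))  # ['drawer', 'kitchen']
```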
While knowledge graphs are useful for organizing abstract information from text descriptions alone, visual information enables the agent to grasp a detailed locational situation, much as humans do through imagination and visualization. The most important issue in using visual information is how to obtain it from only the textual observation in text-based games.
VisualHints (Carta et al. 2020) is an environment that can automatically generate various hints about game states from the textual observation and change the difficulty level depending on their type. The images in (Murugesan, Chaudhury, and Talamadupula 2021) are mainly retrieved from the Internet or generated from given text descriptions by a pre-trained text-to-image model, AttnGAN (Xu et al. 2018). ALFWorld (Shridhar et al. 2021)
combines TextWorld and an embodied simulator called AL-
FRED (Shridhar et al. 2020) to obtain information on two
modalities. Shridhar et al. proposed an agent that first learns
to solve abstract tasks in TextWorld, then transfers the
learned high-level policies to low-level embodied tasks in
ALFRED.
In addition, even if we use the aforementioned methods, improvements in the speed of training are small and the interpretability of the trained network is still lacking. A number
of studies (Kimura et al. 2021c; Chaudhury et al. 2021) proposed novel approaches to extract symbolic first-order logic rules from text observations and to select actions using neuro-symbolic Logical Neural Networks (Riegel et al. 2020).
These logical representations are compatible with common-
sense graph structures.
As previously described, there have been various ap-
proaches using knowledge graphs, visual information, and
logical rules. However, at the time of writing, there has been
no method that combines any of them. Therefore, we pro-