External knowledge from humans, represented as logical functions, helps the training of the agent (Kimura et al. 2021b). New modality information extracted from observations or action text can be introduced to support decision making (Carta et al. 2020; Murugesan, Chaudhury, and Talamadupula 2021; Shridhar et al. 2020). Among these methods, visual information from images or videos is commonly used, as many other studies (Tanaka and Simo-Serra 2021; Kimura et al. 2020) have used it to understand attention and sequential information in decision making. Logical rule extraction can be exploited to improve the training speed and the interpretability of the agent (Kimura et al. 2021c; Chaudhury et al. 2021). Since commonsense knowledge is normally represented as a graph structure, such logical rule representations are compatible with it.
However, at the time of writing, no research has utilized the benefits of these multiple extensions to compensate for missing information. In particular, we hypothesize that the commonsense knowledge about object-location relationships used in TWC games can easily be obtained from visual information. For example, instead of stating the place name of each object, operators can display a picture of a tidy room, which is a quicker way to convey the information to humans.
In this paper, we propose a novel agent that tackles TWC games by leveraging visual scene graph datasets to obtain commonsense knowledge. The original TWC agent (Murugesan et al. 2021) constructs a commonsense subgraph from ConceptNet (Speer, Chin, and Havasi 2017a), which is textual knowledge, but it is necessary to combine many graphs to obtain a single piece of commonsense, resulting in a complicated subgraph. In fact, Murugesan et al. prepared a ‘manual’ commonsense subgraph from ConceptNet in their study to cope with this graph complexity. In contrast, since scene graph recognition achieves high accuracy even on complex images, visual information can deliver detailed and well-organized graph information all at once. Figure 1 shows an example of acquiring commonsense knowledge from the scene graphs in an image. In this example, whereas ConceptNet contains redundant information for extracting a commonsense subgraph, the proposed extraction from scene graphs provides the necessary and sufficient information for the cleaning-up task. Furthermore, scene graphs contain direct spatial relationships between objects, such as “on” or “in” (Figure 2), which is useful because agents need to determine an object’s place in TWC games. Therefore, we use scene graph datasets as visual external knowledge. A scene graph dataset contains a large number of graphs that represent the relationships between entities in images. We use Visual Genome (VG) (Krishna et al. 2017), the most commonly used scene graph dataset, and compare its statistics with those of ConceptNet. We also conduct experiments to evaluate the performance of agents that use commonsense knowledge from a scene graph dataset in RL on text-based games.
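To make the idea concrete, the following is a minimal Python sketch of how object-location commonsense could be mined from the Visual Genome relationship annotations by keeping only spatial predicates such as “on” and “in”. The file name, the predicate list, and the exact annotation field names (some VG releases use “name”, others “names”) are assumptions for illustration rather than a description of our actual pipeline.

import json
from collections import Counter

# Spatial predicates treated as object--location relationships. This particular
# list is an assumption for illustration; only "on" and "in" are mentioned above.
SPATIAL_PREDICATES = {"on", "in", "inside", "on top of"}

def entity_name(entity):
    """Return a lowercase entity name, covering both the 'name' and 'names'
    fields that appear in different Visual Genome annotation releases."""
    if entity.get("names"):
        return entity["names"][0].lower()
    return entity.get("name", "").lower()

def extract_location_triples(relationships_path):
    """Count (object, predicate, location) triples over all VG scene graphs."""
    with open(relationships_path) as f:
        images = json.load(f)  # relationships.json: one record per image

    triples = Counter()
    for image in images:
        for rel in image.get("relationships", []):
            predicate = rel["predicate"].lower().strip()
            if predicate in SPATIAL_PREDICATES:
                subj = entity_name(rel["subject"])
                obj = entity_name(rel["object"])
                if subj and obj:
                    triples[(subj, predicate, obj)] += 1
    return triples

if __name__ == "__main__":
    counts = extract_location_triples("relationships.json")
    # Frequent triples such as ('apple', 'on', 'table') act as commonsense
    # knowledge about where an object is usually placed.
    for triple, count in counts.most_common(10):
        print(triple, count)

Aggregated counts of such (object, predicate, location) triples can then be organized into a commonsense subgraph analogous to the one TWC builds from ConceptNet.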
Related Work
Text-based RL Games
Text-based interactive RL games have been attracting the attention of many researchers owing to the development of environments such as TextWorld (Côté et al. 2018) and Jericho (Hausknecht et al. 2019). In these games, RL agents are required to understand high-level context information from textual observations alone. To overcome this difficulty, a number of prior works on these environments have extracted new information from textual observations: knowledge graphs, visual information, and logical rules.
Knowledge graphs represent relationships between entities such as real-world objects and events, or abstract concepts. Murugesan et al. (2021) proposed a new text-based environment, called “TextWorld Commonsense”, to infuse RL agents with commonsense knowledge, and developed baseline agents that use a commonsense subgraph constructed from ConceptNet (Liu and Singh 2004; Speer, Chin, and Havasi 2017a) as external knowledge. We use this work as a baseline method and introduce a new type of commonsense from visual datasets. Worldformer (Ammanabrolu and Riedl 2021) represents the environment state as a knowledge graph and uses a world model to predict the changes caused by an agent’s actions and to generate a set of contextually relevant actions.
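For reference, the kind of textual commonsense edge the TWC baselines rely on (e.g., an AtLocation relation) can be inspected through the public ConceptNet web API. The short Python sketch below is illustrative only and assumes that API; it is not the offline subgraph construction used by Murugesan et al.

import requests

def atlocation_edges(concept, limit=20):
    """Query the public ConceptNet API for AtLocation edges that start at
    the given English concept, e.g. 'apple' -> 'refrigerator'."""
    url = "https://api.conceptnet.io/query"
    params = {"start": f"/c/en/{concept}", "rel": "/r/AtLocation", "limit": limit}
    response = requests.get(url, params=params, timeout=10)
    response.raise_for_status()
    return [
        (edge["start"]["label"], edge["end"]["label"], edge.get("weight", 1.0))
        for edge in response.json().get("edges", [])
    ]

if __name__ == "__main__":
    for start, end, weight in atlocation_edges("apple"):
        print(f"{start} --AtLocation--> {end}  (weight {weight:.2f})")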
While knowledge graphs are useful for organizing abstract information from text descriptions alone, visual information enables the agent to grasp a detailed spatial situation, much like human imagination and visualization. The most important issue in using visual information is how to obtain it from only textual observations in text-based games. VisualHints (Carta et al. 2020) proposed an environment that automatically generates various hints about game states from textual observations and changes the difficulty level depending on the hint type. In (Murugesan, Chaudhury, and Talamadupula 2021), images are mainly retrieved from the Internet or generated from the given text descriptions with a pre-trained text-to-image model, AttnGAN (Xu et al. 2018). ALFWorld (Shridhar et al. 2021) combines TextWorld with an embodied simulator called ALFRED (Shridhar et al. 2020) to obtain information in two modalities. Shridhar et al. proposed an agent that first learns to solve abstract tasks in TextWorld and then transfers the learned high-level policies to low-level embodied tasks in ALFRED.
In addition, even with the aforementioned methods, improvements in training speed are limited, and the trained network still lacks interpretability. Several studies (Kimura et al. 2021c; Chaudhury et al. 2021) proposed novel approaches that extract symbolic first-order logical rules from text observations and select actions by using neuro-symbolic Logical Neural Networks (Riegel et al. 2020). These logical representations are compatible with commonsense graph structures.
As previously described, there have been various approaches using knowledge graphs, visual information, and logical rules. However, at the time of writing, no method has combined any of them. Therefore, we pro-