EVALUATING LONG-TERM MEMORY IN 3D MAZES
Jurgis Pasukonis 1,2   Timothy Lillicrap 3,5   Danijar Hafner 3,4,6,7
ABSTRACT
Intelligent agents need to remember salient information to reason in partially-
observed environments. For example, agents with a first-person view should
remember the positions of relevant objects even if they go out of view. Similarly, to
effectively navigate through rooms agents need to remember the floor plan of how
rooms are connected. However, most benchmark tasks in reinforcement learning
do not test long-term memory in agents, slowing down progress in this important
research direction. In this paper, we introduce the Memory Maze, a 3D domain
of randomized mazes specifically designed for evaluating long-term memory in
agents. Unlike existing benchmarks, Memory Maze measures long-term memory
separate from confounding agent abilities and requires the agent to localize itself
by integrating information over time. With Memory Maze, we propose an online
reinforcement learning benchmark, a diverse offline dataset, and an offline probing
evaluation. Recording a human player establishes a strong baseline and verifies the
need to build up and retain memories, which is reflected in their gradually increasing
rewards within each episode. We find that current algorithms benefit from training
with truncated backpropagation through time and succeed on small mazes, but
fall short of human performance on the large mazes, leaving room for future
algorithmic designs to be evaluated on the Memory Maze. Videos are available on
the website: https://github.com/jurgisp/memory-maze
1 INTRODUCTION
Deep reinforcement learning (RL) has made tremendous progress in recent years, outperforming
humans on Atari games (Mnih et al., 2015; Badia et al., 2020) and board games (Silver et al., 2016;
Schrittwieser et al., 2019), and making advances in robot learning (Akkaya et al., 2019; Wu et al., 2022).
Much of this progress has been driven by the availability of challenging benchmarks that are easy to
use and allow for standardized comparison (Bellemare et al., 2013; Tassa et al., 2018; Cobbe et al.,
2020). What is more, the RL algorithms developed on these benchmarks are often general enough to
later solve completely unrelated challenges, such as finetuning large language models from
human preferences (Ziegler et al., 2019), optimizing video compression parameters (Mandhane et al.,
2022), or achieving promising results in controlling the plasma of nuclear fusion reactors (Degrave et al., 2022).
Figure 1: The first 150 time steps of an episode in the Memory Maze 9x9 environment (agent inputs and underlying trajectory at t = 0, 30, 60, 90, 120, 150). The bottom row shows the top-down view of a randomly generated maze with 3 colored objects. The agent only observes the first-person view (top row), which includes a prompt for the next object to find as a border of the corresponding color. The agent receives +1 reward when it reaches the object of the prompted color. During the episode, the agent has to visit the same objects multiple times, testing its ability to memorize their positions, the way the rooms are connected, and its own location.
1 Verses Research Lab, 2 Minds.ai, 3 DeepMind, 4 Google Research, 5 University College London, 6 University of Toronto, 7 University of California, Berkeley. Corresponding author: Jurgis Pasukonis <jurgisp@gmail.com>
Despite the progress in RL, many current algorithms are still limited to environments that are mostly
fully observed and struggle in partially-observed scenarios where the agent needs to integrate and
retain information over many time steps. Yet the ability to remember over long time
horizons is a central aspect of human intelligence and a major limitation on the applicability of
current algorithms. While many existing benchmarks are partially observable to some extent, memory
is rarely the limiting factor of agent performance (Oh et al., 2015; Cobbe et al., 2020; Beattie et al.,
2016; Hafner, 2021). Instead, these benchmarks evaluate a wide range of skills at once, making it
challenging to measure improvements in an agent's ability to remember.
Ideally, we would like a memory benchmark to fulfill the following requirements: (1) It should isolate
the challenge of long-term memory from confounding challenges such as exploration and credit
assignment, so that performance improvements can be attributed to better memory. (2) The tasks
should challenge an average human player but be solvable for them, giving an estimate of how
far current algorithms are from human memory abilities. (3) The tasks should require remembering
multiple pieces of information rather than a single bit or position, e.g. whether to go left or right at
the end of a long corridor. (4) The benchmark should be open source and easy to use.
We introduce the Memory Maze, a benchmark platform for evaluating long-term memory in RL
agents and sequence models. The Memory Maze features randomized 3D mazes in which the agent
is tasked with repeatedly navigating to one of multiple objects. To find the objects quickly, the
agent has to remember their locations, the wall layout of the maze, as well as its own location. The
contributions of this paper are summarized as follows:
Environment
We introduce the Memory Maze environment, which is specifically designed
to measure memory isolated from other challenges and overcomes the limitations of existing
benchmarks. We open source the environment and make it easy to install and use.
Human Performance
We record the performance of a human player and find that the benchmark
is challenging but solvable for them. This offers an estimate of how far current algorithms are
from the memory ability of a human.
Memory Challenge
We confirm that memory is indeed the leading challenge in this benchmark,
by observing that the rewards of the human player increase within each episode, as well as by
finding strong improvements from training agents with truncated backpropagation through time.
Offline Dataset
We collect a diverse offline dataset that includes semantic information, such
as the top-down view, object positions, and the wall layout. This enables offline RL as well as
evaluating representations through probing of both task-specific and task-agnostic information.
Baseline Scores
We benchmark a strong model-free agent and a strong model-based agent on the four sizes of
the Memory Maze and find that they make progress on the smaller mazes but lag far behind human
performance on the larger mazes, showing that the benchmark is of appropriate difficulty.
2 RELATED WORK
Several benchmarks for measuring memory abilities have been proposed. This section summarizes
important examples and discusses the limitations that motivated the design of the Memory Maze.
DMLab
(Beattie et al., 2016) features various tasks, some of which require memory among other
challenges. Parisotto et al. (2020) identified a subset of 8 DMLab tasks relating to memory, but these
tasks have largely been solved by R2D2 and IMPALA (see Figure 11 in Kapturowski et al. (2018)).
Moreover, DMLab features a skyline in the background that makes it trivial for the agent to localize
itself, so the agent does not need to remember its location in the maze.
SimCore
(Gregor et al., 2019) studied the memory abilities of agents by probing representations
and compared a range of agent objectives and memory mechanisms, an approach that we build upon
in this paper. However, their datasets and implementations were not released, making it difficult for
the research community to build upon the work. A standardized probe benchmark is available for
Atari (Anand et al., 2019), but those tasks require almost no memory.
DM Memory Suite
(Fortunato et al., 2019) consists of 5 existing DMLab tasks and 7 variations of
T-Maze and Watermaze tasks implemented in the Unity game engine, which necessitates interfacing
with a provided Docker container via networking. These tasks pose an exploration challenge due
to the initialization far away from the goal, creating a confounding factor in agent performance.
Moreover, the tasks tend to require only a small memory capacity, namely 1 bit for T-Mazes and 1
coordinate for Watermazes.
Figure 2: Examples of randomly generated Memory Maze layouts of the four sizes (Memory 9x9, Memory 11x11, Memory 13x13, and Memory 15x15).
3 THE MEMORY MAZE
Memory Maze is a 3D domain of randomized mazes specifically designed for evaluating the long-term
memory abilities of RL agents. Memory Maze isolates long-term memory from confounding agent
abilities, such as exploration, and requires remembering several pieces of information: the positions
of objects, the wall layout, and the agent’s own position. This section introduces three aspects of the
benchmark: (1) an online reinforcement learning environment with four tasks, (2) an offline dataset,
and (3) a protocol for evaluating representations on this dataset by probing.
3.1 ENVIRONMENT
The Memory Maze environment is implemented using MuJoCo (Todorov et al., 2012) as the physics
and graphics engine and the dm_control (Tunyasuvunakool et al., 2020) library for building RL
environments. The environment can be installed as the pip package memory-maze or from the source
code, available on the project website (https://github.com/jurgisp/memory-maze). There are four
Memory Maze tasks of varying size and difficulty: Memory 9x9, Memory 11x11, Memory 13x13, and Memory 15x15.
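For concreteness, the following minimal sketch creates one of the tasks through the Gym interface and runs a random policy. The environment IDs below are an assumption derived from the task names, and the observation size is indicative; the project README documents the exact registered names and observation spaces.

import gym

# Assumed IDs: MemoryMaze-9x9-v0, -11x11-v0, -13x13-v0, -15x15-v0 (check the README).
env = gym.make("memory_maze:MemoryMaze-9x9-v0")

obs = env.reset()                        # first-person image observation (e.g. 64x64x3)
done, episode_return = False, 0.0
while not done:                          # the episode ends after a fixed number of steps
    action = env.action_space.sample()             # random policy, for illustration only
    obs, reward, done, info = env.step(action)
    episode_return += reward                        # +1 whenever the prompted object is reached
print("episode return:", episode_return)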
The task is inspired by a game known as scavenger hunt or treasure hunt. The agent starts in a
randomly generated maze containing several objects of different colors. The agent is prompted to
find the target object of a specific color, indicated by the border color in the observation image. Once
the agent finds and touches the correct object, it gets a +1 reward, and the next random object is
chosen as a target. If the agent touches an object of the wrong color, there is no effect. Throughout
the episode, the maze layout and the locations of the objects do not change. The episode continues
for a fixed amount of time, so the total episode return is equal to the number of targets the agent can
find in the given time. See Figure 1 for an illustration.
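As an illustration of these mechanics (not the actual environment code), a per-step sketch of the prompted scavenger-hunt logic might look as follows, with the hypothetical touch_radius parameter standing in for the environment's contact check:

import random

def scavenger_step(agent_pos, objects, target_color, touch_radius=0.5):
    # objects maps color -> (x, y) position and stays fixed for the whole episode.
    reward, next_target = 0.0, target_color
    tx, ty = objects[target_color]
    if (agent_pos[0] - tx) ** 2 + (agent_pos[1] - ty) ** 2 <= touch_radius ** 2:
        reward = 1.0                                 # prompted object reached
        next_target = random.choice(list(objects))   # next target prompted at random
    # touching an object of a different color has no effect
    return reward, next_target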
The episode return is inversely proportional to the average time it takes for the agent to locate the
target objects. If the agent remembers the location of the prompted object and how the rooms are
connected, the agent can take the shortest path to the object and thus reach it quickly. On the other
hand, an agent without memory cannot remember the object positions and wall layout and thus has
to randomly explore the maze until it sees the requested object, which takes several times longer.
Thus, the score on the Memory Maze tasks correlates with the ability to remember the maze
layout, particularly the object locations and the paths to them.
Memory Maze sidesteps the hard exploration problem present in many T-Maze and Watermaze tasks.
Due to the random maze layout in each episode, the agent will sometimes spawn close to the object
of the prompted color and easily collect the reward. This allows the agent to quickly bootstrap to
a policy that navigates to the target object once it is visible, and from that point, it can improve by
developing memory. This makes training much faster compared to, for example, DM Memory Suite
(Fortunato et al., 2019).
The sizes are designed such that the Memory 15x15 environment is challenging for a human player
and out of reach for state-of-the-art RL algorithms, whereas Memory 9x9 is easy for a human player
and solvable with RL, with 11x11 and 13x13 as intermediate stepping stones. See Table 1 for details
and Figure 2 for an illustration.
3.2 OFFLINE DATASET
We collect a diverse offline dataset of recorded experience from the Memory Maze environments.
This dataset is used in the present work for the offline probing benchmark and also enables other
applications, such as offline RL.
Parameter | Memory 9x9 | Memory 11x11 | Memory 13x13 | Memory 15x15
Number of objects | 3 | 4 | 5 | 6
Number of rooms | 3-4 | 4-6 | 5-6 | 9
Room size | 3-5 | 3-5 | 3-5 | 3
Episode length (steps at 4 Hz) | 1000 | 2000 | 3000 | 4000
Mean maximum score (oracle) | 34.8 | 58.0 | 74.5 | 87.7
Table 1: Memory Maze environment details.
We release two datasets: Memory Maze 9x9 (30M) and Memory Maze 15x15 (30M). Each dataset
contains 30 thousand trajectories from the Memory Maze 9x9 and 15x15 environments, respectively.
A single trajectory is 1000 steps long, even for the larger maze, to increase the diversity of mazes
included while keeping the download size small. The datasets are split into 29k trajectories for
training and 1k for evaluation.
The data is generated by running a scripted policy on the corresponding environment. The policy
uses an MPC planner (Richards, 2005) that performs breadth-first search to navigate to randomly
chosen points in the maze under action noise. This choice of policy was made to generate diverse
trajectories that explore the maze effectively and that form loops in space, which can be important for
learning long-term memory. We intentionally avoid recording data with a trained agent to ensure a
diverse data distribution (Yarats et al., 2022) and to avoid dataset bias that could favor some methods
over others.
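As a sketch of the planning component, assuming the maze is represented as a boolean grid of walkable tiles, the breadth-first search could be implemented as below; the released scripted policy additionally wraps this in an MPC-style controller with action noise, which is omitted here.

from collections import deque

def bfs_path(walkable, start, goal):
    # Shortest tile path on a grid; walkable[y][x] is True for free tiles.
    parent = {start: None}
    frontier = deque([start])
    while frontier:
        node = frontier.popleft()
        if node == goal:
            path = []
            while node is not None:      # walk back through parents to recover the path
                path.append(node)
                node = parent[node]
            return path[::-1]
        x, y = node
        for nx, ny in ((x + 1, y), (x - 1, y), (x, y + 1), (x, y - 1)):
            if (0 <= ny < len(walkable) and 0 <= nx < len(walkable[0])
                    and (nx, ny) not in parent and walkable[ny][nx]):
                parent[(nx, ny)] = node
                frontier.append((nx, ny))
    return None                          # goal not reachable from start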
The trajectories include not only the information visible to the agent – first-person image observations,
actions, rewards – but also additional semantic information about the environment, including the
maze layout, agent position, and the object locations. The details of the data keys are in Table 2.
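As a hedged sketch of consuming the dataset: the snippet below assumes each trajectory is stored as a NumPy archive and uses illustrative key names (only maze_layout and targets_vec are named in this paper; the remaining keys, file format, and directory names are placeholders, with Table 2 and the project README being authoritative).

import numpy as np
from pathlib import Path

def load_trajectory(path):
    # Key names below are illustrative; see Table 2 for the released data keys.
    data = np.load(path)
    return {
        "image": data["image"],              # first-person observations
        "action": data["action"],            # actions of the scripted policy
        "reward": data["reward"],            # +1 when the prompted object was reached
        "maze_layout": data["maze_layout"],  # wall layout (Walls probe target)
        "agent_pos": data["agent_pos"],      # agent position in maze coordinates
        "targets_vec": data["targets_vec"],  # agent-centric object locations (Objects probe target)
    }

# Hypothetical directory layout: one .npz file per 1000-step trajectory.
for path in sorted(Path("memory-maze-9x9-30m/train").glob("*.npz")):
    trajectory = load_trajectory(path)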
3.3 OFFLINE PROBING
Unsupervised representation learning aims to learn representations that can later be used for down-
stream tasks of interest. In the context of partially observable environments, we would like unsuper-
vised representations to summarize the history of observations into a representation that contains
information about the state of the environment beyond what is visible in the current observation by
remembering salient information about the environment. Unsupervised representations are commonly
evaluated by probing (Oord et al.,2018;Chen et al.,2020;Gregor et al.,2019;Anand et al.,2019),
where a separate network is trained to predict relevant properties from the frozen representations.
We introduce the following four Memory Maze offline probing benchmarks: Memory 9x9 Walls,
Memory 15x15 Walls, Memory 9x9 Objects, and Memory 15x15 Objects. These are based on using
either the maze wall layout (maze_layout) or the agent-centric object locations (targets_vec) as
the probe prediction target, trained and evaluated on either the Memory Maze 9x9 (30M) or the
Memory Maze 15x15 (30M) offline dataset.
The evaluation procedure is as follows. First, a sequence representation model (which may be a
component of a model-based RL agent) is trained on the offline dataset with an unsupervised
loss based on the first-person image observations, conditioned on the actions. Then a separate probe
network is trained to predict the probe observation (either the maze wall layout or the agent-centric
object locations) from the internal state of the model. Crucially, the gradients from the probe network
are not propagated into the model, so the probe only learns to decode the information already present
in the internal state and does not shape the representation. Finally, the predictions of the probe network are
evaluated on the hold-out dataset. When predicting the wall layout, the evaluation metric is prediction
accuracy, averaged across all tiles of the maze layout. When predicting the object locations, the
evaluation metric is the mean-squared error (MSE), averaged over the objects. The final score is
calculated by averaging the evaluation metric over the second half (500 steps) of each trajectory in
the evaluation dataset. This is done to remove the initial exploratory part of each trajectory, during
which the model has no way of knowing the full layout of the maze (see Figure C.1). We make this
choice so that a model with perfect memory could reach 0.0 MSE on the Objects benchmark and
100% accuracy on the Walls benchmark.
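A sketch of these scores, under the assumption that wall predictions form a [T, H, W] binary tile grid and object predictions a [T, num_objects, 2] array of agent-centric coordinates, with only the final 500 steps scored:

import numpy as np

EVAL_STEPS = 500  # second half of a 1000-step trajectory

def walls_accuracy(pred_layout, true_layout):
    # Accuracy averaged over tiles and over the evaluated steps.
    pred, true = pred_layout[-EVAL_STEPS:], true_layout[-EVAL_STEPS:]
    return float((pred == true).mean())

def objects_mse(pred_locations, true_locations):
    # Mean-squared error averaged over objects, coordinates, and evaluated steps.
    pred, true = pred_locations[-EVAL_STEPS:], true_locations[-EVAL_STEPS:]
    return float(((pred - true) ** 2).mean())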
The architecture of the probe network is defined as part of the benchmark to ensure comparability:
it is an MLP with 4 hidden layers, 1024 units each, with layer normalization and ELU activation
after each layer (see Table E.3). The input to the probe network is the representation of the model —
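The probe architecture described above can be sketched as follows; PyTorch is used here only for illustration, since the benchmark fixes the architecture (4 hidden layers of 1024 units with layer normalization and ELU) but not the framework, and the linear output head is an assumption.

import torch.nn as nn

def make_probe(input_dim, output_dim, hidden=1024, num_layers=4):
    layers, dim = [], input_dim
    for _ in range(num_layers):
        layers += [nn.Linear(dim, hidden), nn.LayerNorm(hidden), nn.ELU()]
        dim = hidden
    layers.append(nn.Linear(dim, output_dim))  # linear head for the probe target
    return nn.Sequential(*layers)

# The probe is trained on frozen representations; gradients must not reach the model:
#   prediction = probe(model_state.detach())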