EVALUATING LONG-TERM MEMORY IN 3D MAZES
Jurgis Pasukonis 1,2   Timothy Lillicrap 3,5   Danijar Hafner 3,4,6,7
ABSTRACT
Intelligent agents need to remember salient information to reason in partially-
observed environments. For example, agents with a first-person view should
remember the positions of relevant objects even if they go out of view. Similarly, to
effectively navigate through rooms agents need to remember the floor plan of how
rooms are connected. However, most benchmark tasks in reinforcement learning
do not test long-term memory in agents, slowing down progress in this important
research direction. In this paper, we introduce the Memory Maze, a 3D domain
of randomized mazes specifically designed for evaluating long-term memory in
agents. Unlike existing benchmarks, Memory Maze measures long-term memory
separate from confounding agent abilities and requires the agent to localize itself
by integrating information over time. With Memory Maze, we propose an online
reinforcement learning benchmark, a diverse offline dataset, and an offline probing
evaluation. Recording a human player establishes a strong baseline and verifies the
need to build up and retain memories, which is reflected in their gradually increasing
rewards within each episode. We find that current algorithms benefit from training
with truncated backpropagation through time and succeed on small mazes, but
fall short of human performance on the large mazes, leaving room for future
algorithmic designs to be evaluated on the Memory Maze. Videos are available on
the website: https://github.com/jurgisp/memory-maze
1 INTRODUCTION
Deep reinforcement learning (RL) has made tremendous progress in recent years, outperforming
humans on Atari games (Mnih et al., 2015; Badia et al., 2020) and board games (Silver et al., 2016;
Schrittwieser et al., 2019), and making advances in robot learning (Akkaya et al., 2019; Wu et al., 2022).
Much of this progress has been driven by the availability of challenging benchmarks that are easy to
use and allow for standardized comparison (Bellemare et al., 2013; Tassa et al., 2018; Cobbe et al.,
2020). What is more, the RL algorithms developed on these benchmarks are often general enough to
later solve completely unrelated challenges, such as finetuning large language models from
human preferences (Ziegler et al., 2019), optimizing video compression parameters (Mandhane et al.,
2022), or achieving promising results in controlling the plasma of nuclear fusion reactors (Degrave et al., 2022).
Figure 1: The first 150 time steps of an episode in the Memory Maze 9x9 environment (agent inputs and underlying trajectory at t = 0, 30, 60, 90, 120, 150). The bottom row shows the top-down view of a randomly generated maze with 3 colored objects. The agent only observes the first-person view (top row), which includes a prompt for the next object to find as a border of the corresponding color. The agent receives +1 reward when it reaches the object of the prompted color. During the episode, the agent has to visit the same objects multiple times, testing its ability to memorize their positions, the way the rooms are connected, and its own location.
1 Verses Research Lab, 2 Minds.ai, 3 DeepMind, 4 Google Research, 5 University College London, 6 University of Toronto, 7 University of California, Berkeley. Corresponding author: Jurgis Pasukonis <jurgisp@gmail.com>
Despite the progress in RL, many current algorithms are still limited to environments that are mostly
fully observed and struggle in partially-observed scenarios where the agent needs to integrate and
retain information over many time steps. Yet the ability to remember over long time
horizons is a central aspect of human intelligence and a major limitation on the applicability of
current algorithms. While many existing benchmarks are partially observable to some extent, memory
is rarely the limiting factor of agent performance (Oh et al., 2015; Cobbe et al., 2020; Beattie et al.,
2016; Hafner, 2021). Instead, these benchmarks evaluate a wide range of skills at once, making it
challenging to measure improvements in an agent's ability to remember.
Ideally, we would like a memory benchmark to fulfill the following requirements: (1) It should isolate
the challenge of long-term memory from confounding challenges such as exploration and credit
assignment, so that performance improvements can be attributed to better memory. (2) The tasks
should challenge an average human player but be solvable for them, giving an estimate of how
far current algorithms are from human memory abilities. (3) The tasks should require remembering
multiple pieces of information rather than a single bit or position, e.g. whether to go left or right at
the end of a long corridor. (4) The benchmark should be open source and easy to use.
We introduce the Memory Maze, a benchmark platform for evaluating long-term memory in RL
agents and sequence models. The Memory Maze features randomized 3D mazes in which the agent
is tasked with repeatedly navigating to one of multiple objects. To find the objects quickly, the
agent has to remember their locations, the wall layout of the maze, as well as its own location. The
contributions of this paper are summarized as follows:
Environment
We introduce the Memory Maze environment, which is specifically designed
to measure memory isolated from other challenges and overcomes the limitations of existing
benchmarks. We open source the environment and make it easy to install and use.
Human Performance
We record the performance of a human player and find that the benchmark
is challenging but solvable for them. This offers an estimate of how far current algorithms are
from the memory ability of a human.
Memory Challenge
We confirm that memory is indeed the leading challenge in this benchmark,
by observing that the rewards of the human player increase within each episode, as well as by
finding strong improvements from training agents with truncated backpropagation through time.
Offline Dataset
We collect a diverse offline dataset that includes semantic information, such
as the top-down view, object positions, and the wall layout. This enables offline RL as well as
evaluating representations through probing of both task-specific and task-agnostic information.
Baseline Scores
We benchmark a strong model-free agent and a strong model-based agent on the four sizes of
the Memory Maze and find that they make progress on the smaller mazes but lag far behind human
performance on the larger mazes, showing that the benchmark is of appropriate difficulty.
2 RELATED WORK
Several benchmarks for measuring memory abilities have been proposed. This section summarizes
important examples and discusses the limitations that motivated the design of the Memory Maze.
DMLab
(Beattie et al., 2016) features various tasks, some of which require memory among other
challenges. Parisotto et al. (2020) identified a subset of 8 DMLab tasks relating to memory, but these
tasks have largely been solved by R2D2 and IMPALA (see Figure 11 in Kapturowski et al. (2018)).
Moreover, DMLab features a skyline in the background that makes it trivial for the agent to localize
itself, so the agent does not need to remember its location in the maze.
SimCore
(Gregor et al., 2019) studied the memory abilities of agents by probing representations
and compared a range of agent objectives and memory mechanisms, an approach that we build upon
in this paper. However, their datasets and implementations were not released, making it difficult for
the research community to build upon the work. A standardized probe benchmark is available for
Atari (Anand et al., 2019), but those tasks require almost no memory.
DM Memory Suite
(Fortunato et al., 2019) consists of 5 existing DMLab tasks and 7 variations of
T-Maze and Watermaze tasks implemented in the Unity game engine, which necessitates interfacing
with a provided Docker container via networking. These tasks pose an exploration challenge due
to the initialization far away from the goal, creating a confounding factor in agent performance.
Moreover, the tasks tend to require only a small memory capacity, namely 1 bit for T-Mazes and 1
coordinate for Watermazes.
Figure 2: Examples of randomly generated Memory Maze layouts of the four sizes (Memory 9x9, Memory 11x11, Memory 13x13, and Memory 15x15).
3 THE MEMORY MAZE
Memory Maze is a 3D domain of randomized mazes specifically designed for evaluating the long-term
memory abilities of RL agents. Memory Maze isolates long-term memory from confounding agent
abilities, such as exploration, and requires remembering several pieces of information: the positions
of objects, the wall layout, and the agent’s own position. This section introduces three aspects of the
benchmark: (1) an online reinforcement learning environment with four tasks, (2) an offline dataset,
and (3) a protocol for evaluating representations on this dataset by probing.
3.1 ENVIRONMENT
The Memory Maze environment is implemented using MuJoCo (Todorov et al., 2012) as the physics
and graphics engine and the dm_control (Tunyasuvunakool et al., 2020) library for building RL
environments. The environment can be installed as the pip package memory-maze or from the source
code, available on the project website (https://github.com/jurgisp/memory-maze). There are four
Memory Maze tasks of varying size and difficulty: Memory 9x9, Memory 11x11, Memory 13x13, and Memory 15x15.
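For concreteness, the following minimal sketch creates one of the tasks through the Gym interface and runs a random policy. The environment IDs below are an assumption derived from the task names, and the observation size is indicative; the project README documents the exact registered names and observation spaces.

import gym

# Assumed IDs: MemoryMaze-9x9-v0, -11x11-v0, -13x13-v0, -15x15-v0 (check the README).
env = gym.make("memory_maze:MemoryMaze-9x9-v0")

obs = env.reset()                        # first-person image observation (e.g. 64x64x3)
done, episode_return = False, 0.0
while not done:                          # the episode ends after a fixed number of steps
    action = env.action_space.sample()             # random policy, for illustration only
    obs, reward, done, info = env.step(action)
    episode_return += reward                        # +1 whenever the prompted object is reached
print("episode return:", episode_return)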
The task is inspired by a game known as scavenger hunt or treasure hunt. The agent starts in a
randomly generated maze containing several objects of different colors. The agent is prompted to
find the target object of a specific color, indicated by the border color in the observation image. Once
the agent finds and touches the correct object, it gets a +1 reward, and the next random object is
chosen as a target. If the agent touches an object of the wrong color, there is no effect. Throughout
the episode, the maze layout and the locations of the objects do not change. The episode continues
for a fixed amount of time, so the total episode return is equal to the number of targets the agent can
find in the given time. See Figure 1 for an illustration.
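As an illustration of these mechanics (not the actual environment code), a per-step sketch of the prompted scavenger-hunt logic might look as follows, with the hypothetical touch_radius parameter standing in for the environment's contact check:

import random

def scavenger_step(agent_pos, objects, target_color, touch_radius=0.5):
    # objects maps color -> (x, y) position and stays fixed for the whole episode.
    reward, next_target = 0.0, target_color
    tx, ty = objects[target_color]
    if (agent_pos[0] - tx) ** 2 + (agent_pos[1] - ty) ** 2 <= touch_radius ** 2:
        reward = 1.0                                 # prompted object reached
        next_target = random.choice(list(objects))   # next target prompted at random
    # touching an object of a different color has no effect
    return reward, next_target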
The episode return is inversely proportional to the average time it takes for the agent to locate the
target objects. If the agent remembers the location of the prompted object and how the rooms are
connected, the agent can take the shortest path to the object and thus reach it quickly. On the other
hand, an agent without memory cannot remember the object positions and wall layout and thus has
to randomly explore the maze until it sees the requested object, which takes several times longer.
Thus, the score on the Memory Maze tasks correlates with the ability to remember the maze
layout, particularly the object locations and the paths to them.
Memory Maze sidesteps the hard exploration problem present in many T-Maze and Watermaze tasks.
Due to the random maze layout in each episode, the agent will sometimes spawn close to the object
of the prompted color and easily collect the reward. This allows the agent to quickly bootstrap to
a policy that navigates to the target object once it is visible, and from that point, it can improve by
developing memory. This makes training much faster compared to, for example, DM Memory Suite
(Fortunato et al., 2019).
The sizes are designed such that the Memory 15x15 environment is challenging for a human player
and out of reach for state-of-the-art RL algorithms, whereas Memory 9x9 is easy for a human player
and solvable with RL, with 11x11 and 13x13 as intermediate stepping stones. See Table 1 for details
and Figure 2 for an illustration.
3.2 OFFLINE DATASET
We collect a diverse offline dataset of recorded experience from the Memory Maze environments.
This dataset is used in the present work for the offline probing benchmark and also enables other
applications, such as offline RL.
Parameter | Memory 9x9 | Memory 11x11 | Memory 13x13 | Memory 15x15
Number of objects | 3 | 4 | 5 | 6
Number of rooms | 3-4 | 4-6 | 5-6 | 9
Room size | 3-5 | 3-5 | 3-5 | 3
Episode length (steps at 4 Hz) | 1000 | 2000 | 3000 | 4000
Mean maximum score (oracle) | 34.8 | 58.0 | 74.5 | 87.7
Table 1: Memory Maze environment details.
We release two datasets: Memory Maze 9x9 (30M) and Memory Maze 15x15 (30M). Each dataset
contains 30 thousand trajectories from the Memory Maze 9x9 and 15x15 environments, respectively.
A single trajectory is 1000 steps long, even for the larger maze, to increase the diversity of mazes
included while keeping the download size small. The datasets are split into 29k trajectories for
training and 1k for evaluation.
The data is generated by running a scripted policy on the corresponding environment. The policy
uses an MPC planner (Richards, 2005) that performs breadth-first search to navigate to randomly
chosen points in the maze under action noise. This choice of policy was made to generate diverse
trajectories that explore the maze effectively and that form loops in space, which can be important for
learning long-term memory. We intentionally avoid recording data with a trained agent to ensure a
diverse data distribution (Yarats et al., 2022) and to avoid dataset bias that could favor some methods
over others.
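As a sketch of the planning component, assuming the maze is represented as a boolean grid of walkable tiles, the breadth-first search could be implemented as below; the released scripted policy additionally wraps this in an MPC-style controller with action noise, which is omitted here.

from collections import deque

def bfs_path(walkable, start, goal):
    # Shortest tile path on a grid; walkable[y][x] is True for free tiles.
    parent = {start: None}
    frontier = deque([start])
    while frontier:
        node = frontier.popleft()
        if node == goal:
            path = []
            while node is not None:      # walk back through parents to recover the path
                path.append(node)
                node = parent[node]
            return path[::-1]
        x, y = node
        for nx, ny in ((x + 1, y), (x - 1, y), (x, y + 1), (x, y - 1)):
            if (0 <= ny < len(walkable) and 0 <= nx < len(walkable[0])
                    and (nx, ny) not in parent and walkable[ny][nx]):
                parent[(nx, ny)] = node
                frontier.append((nx, ny))
    return None                          # goal not reachable from start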
The trajectories include not only the information visible to the agent – first-person image observations,
actions, rewards – but also additional semantic information about the environment, including the
maze layout, agent position, and the object locations. The details of the data keys are in Table 2.
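As a hedged sketch of consuming the dataset: the snippet below assumes each trajectory is stored as a NumPy archive and uses illustrative key names (only maze_layout and targets_vec are named in this paper; the remaining keys, file format, and directory names are placeholders, with Table 2 and the project README being authoritative).

import numpy as np
from pathlib import Path

def load_trajectory(path):
    # Key names below are illustrative; see Table 2 for the released data keys.
    data = np.load(path)
    return {
        "image": data["image"],              # first-person observations
        "action": data["action"],            # actions of the scripted policy
        "reward": data["reward"],            # +1 when the prompted object was reached
        "maze_layout": data["maze_layout"],  # wall layout (Walls probe target)
        "agent_pos": data["agent_pos"],      # agent position in maze coordinates
        "targets_vec": data["targets_vec"],  # agent-centric object locations (Objects probe target)
    }

# Hypothetical directory layout: one .npz file per 1000-step trajectory.
for path in sorted(Path("memory-maze-9x9-30m/train").glob("*.npz")):
    trajectory = load_trajectory(path)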
3.3 OFFLINE PROBING
Unsupervised representation learning aims to learn representations that can later be used for down-
stream tasks of interest. In the context of partially observable environments, we would like unsuper-
vised representations to summarize the history of observations into a representation that contains
information about the state of the environment beyond what is visible in the current observation by
remembering salient information about the environment. Unsupervised representations are commonly
evaluated by probing (Oord et al.,2018;Chen et al.,2020;Gregor et al.,2019;Anand et al.,2019),
where a separate network is trained to predict relevant properties from the frozen representations.
We introduce the following four Memory Maze offline probing benchmarks: Memory 9x9 Walls,
Memory 15x15 Walls, Memory 9x9 Objects, and Memory 15x15 Objects. These are based on using
either the maze wall layout (maze_layout) or the agent-centric object locations (targets_vec) as
the probe prediction target, trained and evaluated on either the Memory Maze 9x9 (30M) or the
Memory Maze 15x15 (30M) offline dataset.
The evaluation procedure is as follows. First, a sequence representation model (which may be a
component of a model-based RL agent) is trained on the offline dataset with an unsupervised
loss based on the first-person image observations, conditioned on the actions. Then a separate probe
network is trained to predict the probe observation (either the maze wall layout or the agent-centric
object locations) from the internal state of the model. Crucially, the gradients from the probe network
are not propagated into the model, so the probe only learns to decode the information already present
in the internal state and does not shape the representation. Finally, the predictions of the probe network are
evaluated on the hold-out dataset. When predicting the wall layout, the evaluation metric is prediction
accuracy, averaged across all tiles of the maze layout. When predicting the object locations, the
evaluation metric is the mean-squared error (MSE), averaged over the objects. The final score is
calculated by averaging the evaluation metric over the second half (500 steps) of each trajectory in
the evaluation dataset. This is done to remove the initial exploratory part of each trajectory, during
which the model has no way of knowing the full layout of the maze (see Figure C.1). We make this
choice so that a model with perfect memory could reach 0.0 MSE on the Objects benchmark and
100% accuracy on the Walls benchmark.
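A sketch of these scores, under the assumption that wall predictions form a [T, H, W] binary tile grid and object predictions a [T, num_objects, 2] array of agent-centric coordinates, with only the final 500 steps scored:

import numpy as np

EVAL_STEPS = 500  # second half of a 1000-step trajectory

def walls_accuracy(pred_layout, true_layout):
    # Accuracy averaged over tiles and over the evaluated steps.
    pred, true = pred_layout[-EVAL_STEPS:], true_layout[-EVAL_STEPS:]
    return float((pred == true).mean())

def objects_mse(pred_locations, true_locations):
    # Mean-squared error averaged over objects, coordinates, and evaluated steps.
    pred, true = pred_locations[-EVAL_STEPS:], true_locations[-EVAL_STEPS:]
    return float(((pred - true) ** 2).mean())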
The architecture of the probe network is defined as part of the benchmark to ensure comparability:
it is an MLP with 4 hidden layers, 1024 units each, with layer normalization and ELU activation
after each layer (see Table E.3). The input to the probe network is the representation of the model —
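The probe architecture described above can be sketched as follows; PyTorch is used here only for illustration, since the benchmark fixes the architecture (4 hidden layers of 1024 units with layer normalization and ELU) but not the framework, and the linear output head is an assumption.

import torch.nn as nn

def make_probe(input_dim, output_dim, hidden=1024, num_layers=4):
    layers, dim = [], input_dim
    for _ in range(num_layers):
        layers += [nn.Linear(dim, hidden), nn.LayerNorm(hidden), nn.ELU()]
        dim = hidden
    layers.append(nn.Linear(dim, output_dim))  # linear head for the probe target
    return nn.Sequential(*layers)

# The probe is trained on frozen representations; gradients must not reach the model:
#   prediction = probe(model_state.detach())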