Despite the progress in RL, many current algorithms are still limited to environments that are mostly
fully observed and struggle in partially observed scenarios where the agent needs to integrate and
retain information over many time steps. Yet the ability to remember over long time
horizons is a central aspect of human intelligence and a major limitation on the applicability of
current algorithms. While many existing benchmarks are partially observable to some extent, memory
is rarely the limiting factor of agent performance (Oh et al., 2015; Cobbe et al., 2020; Beattie et al.,
2016; Hafner, 2021). Instead, these benchmarks evaluate a wide range of skills at once, making it
challenging to measure improvements in an agent’s ability to remember.
Ideally, a memory benchmark should fulfill the following requirements: (1) It should isolate
the challenge of long-term memory from confounding challenges such as exploration and credit
assignment, so that performance improvements can be attributed to better memory. (2) The tasks
should challenge an average human player but remain solvable for them, giving an estimate of how
far current algorithms are from human memory abilities. (3) The tasks should require remembering
multiple pieces of information rather than a single bit or position, such as whether to go left or right at
the end of a long corridor. (4) The benchmark should be open source and easy to use.
We introduce the Memory Maze, a benchmark platform for evaluating long-term memory in RL
agents and sequence models. The Memory Maze features randomized 3D mazes in which the agent
is tasked with repeatedly navigating to one of multiple objects. To find the objects quickly, the
agent has to remember their locations, the wall layout of the maze, and its own location. The
contributions of this paper are summarized as follows:
• Environment: We introduce the Memory Maze environment, which is specifically designed
to measure memory in isolation from other challenges and overcomes the limitations of existing
benchmarks. We open source the environment and make it easy to install and use (see the usage
sketch after this list).
• Human Performance: We record the performance of a human player and find that the benchmark
is challenging but solvable for them. This offers an estimate of how far current algorithms are
from the memory ability of a human.
• Memory Challenge: We confirm that memory is indeed the leading challenge in this benchmark
by observing that the reward of the human player increases within each episode, as well as by
finding strong improvements when training agents with truncated backpropagation through time
(illustrated in the sketch after this list).
• Offline Dataset: We collect a diverse offline dataset that includes semantic information, such
as the top-down view, object positions, and the wall layout. This enables offline RL as well as
evaluating representations through probing of both task-specific and task-agnostic information.
• Baseline Scores: We benchmark strong model-free and model-based agents on the four sizes of
the Memory Maze and find that they make progress on the smaller mazes but lag far behind human
performance on the larger mazes, showing that the benchmark is of appropriate difficulty.
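As a concrete usage sketch (referenced from the Environment bullet above), the snippet below assumes the released memory-maze package is installed via pip and registers Gym entry points of the form shown; the exact environment IDs and the reset/step signature should be verified against the repository rather than taken from this illustration.

```python
# Minimal usage sketch. Assumes `pip install memory-maze` and a Gym entry
# point named "memory_maze:MemoryMaze-9x9-v0"; check the released package
# for the exact environment IDs and API version.
import gym

env = gym.make("memory_maze:MemoryMaze-9x9-v0")  # other sizes: 11x11, 13x13, 15x15
obs = env.reset()                                 # first-person image observation
done, episode_return = False, 0.0
while not done:
    action = env.action_space.sample()            # placeholder for an agent's policy
    obs, reward, done, info = env.step(action)    # reward for reaching the prompted object
    episode_return += reward
print("episode return:", episode_return)
```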
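The truncated backpropagation through time (TBPTT) comparison referenced in the Memory Challenge bullet can be illustrated with the following generic PyTorch sketch; the network, shapes, and regression loss are placeholders rather than the agents' actual RL training code. The key property is that the recurrent state is carried across segments while gradients are cut at segment boundaries, so longer truncation windows let the network learn to exploit older information.

```python
# Generic sketch of truncated backpropagation through time (TBPTT) for a
# recurrent network; names, shapes, and the regression loss are placeholders.
import torch
import torch.nn as nn

obs_dim, out_dim, hidden_dim = 64, 6, 256
segment_len, episode_len = 100, 1000

rnn = nn.GRU(obs_dim, hidden_dim, batch_first=True)
head = nn.Linear(hidden_dim, out_dim)
optim = torch.optim.Adam(list(rnn.parameters()) + list(head.parameters()), lr=3e-4)

obs = torch.randn(1, episode_len, obs_dim)      # dummy episode of observations
targets = torch.randn(1, episode_len, out_dim)  # dummy prediction targets

hidden = torch.zeros(1, 1, hidden_dim)          # (num_layers, batch, hidden_dim)
for t in range(0, episode_len, segment_len):
    hidden = hidden.detach()                    # truncate: no gradient into earlier segments
    features, hidden = rnn(obs[:, t:t + segment_len], hidden)
    loss = nn.functional.mse_loss(head(features), targets[:, t:t + segment_len])
    optim.zero_grad()
    loss.backward()
    optim.step()
```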
2 RELATED WORK
Several benchmarks for measuring memory abilities have been proposed. This section summarizes
important examples and discusses the limitations that motivated the design of the Memory Maze.
DMLab (Beattie et al., 2016) features various tasks, some of which require memory among other
challenges. Parisotto et al. (2020) identified a subset of 8 DMLab tasks relating to memory, but these
tasks have largely been solved by R2D2 and IMPALA (see Figure 11 in Kapturowski et al. (2018)).
Moreover, DMLab features a skyline in the background that makes it trivial for the agent to localize
itself, so the agent does not need to remember its location in the maze.
SimCore (Gregor et al., 2019) studied the memory abilities of agents by probing representations
and compared a range of agent objectives and memory mechanisms, an approach that we build upon
in this paper. However, their datasets and implementations were not released, making it difficult for
the research community to build upon the work. A standardized probe benchmark is available for
Atari (Anand et al., 2019), but those tasks require almost no memory.
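As a concrete illustration of this probing protocol (a minimal sketch under assumed shapes, not SimCore's or our implementation): the agent's representation is frozen and a small readout, typically linear, is trained to predict semantic quantities such as object positions; the probe's accuracy then indicates how much of that information the representation retains.

```python
# Sketch of linear probing of frozen agent representations; all tensors,
# shapes, and targets are placeholders for illustration.
import torch
import torch.nn as nn

feature_dim, target_dim, num_steps = 256, 2, 10_000

# In practice these would come from rolling out a trained agent: its recurrent
# state at each step and the ground-truth quantity to decode (e.g., an
# object's x, y position in the maze).
features = torch.randn(num_steps, feature_dim)  # treated as frozen inputs
targets = torch.randn(num_steps, target_dim)

probe = nn.Linear(feature_dim, target_dim)      # linear probe only
optim = torch.optim.Adam(probe.parameters(), lr=1e-3)

for epoch in range(100):
    loss = nn.functional.mse_loss(probe(features), targets)
    optim.zero_grad()
    loss.backward()
    optim.step()
print("final probe MSE:", loss.item())
```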
DM Memory Suite (Fortunato et al., 2019) consists of 5 existing DMLab tasks and 7 variations of
T-Maze and Watermaze tasks implemented in the Unity game engine, which necessitates interfacing
with a provided Docker container via networking. These tasks pose an exploration challenge due
to the initialization far away from the goal, creating a confounding factor in agent performance.
Moreover, the tasks tend to require only a small memory capacity, namely 1 bit for T-Mazes and 1
coordinate for Watermazes.