In this paper, we introduce the NetHack Learning Dataset (NLD), an open and accessible dataset for large-scale offline RL and learning from demonstrations. NLD consists of three parts: first, NLD-NAO: a collection of state-only trajectories from 1.5 million human games of NetHack played on nethack.alt.org (NAO) servers between 2009 and 2020; second, NLD-AA: a collection of state-action-score trajectories from 100,000 NLE games played by the symbolic-bot winner of the NeurIPS 2021 NetHack Challenge [15]; third, TtyrecDataset: a highly scalable tool for efficient training on any NetHack- and NLE-generated trajectories and metadata.
NLD, in combination with NLE, enables computationally accessible research in multiple areas, including imitation learning, offline RL, learning from sequences of observations only, and combining learning from offline data with learning from online interactions. In contrast with other large-scale datasets of demonstrations, NLD is highly efficient in both memory and compute. NLD-NAO fits on a $30 hard drive after being compressed (by a factor of 160) from 38TB to 229GB. In addition, NLD-NAO can be processed in under 15 hours, achieving a throughput of 288,000 frames per second with only 10 CPUs. NLD's low memory and computational requirements make large-scale learning from demonstrations more accessible for academic and independent researchers.
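As a concrete illustration of the intended workflow, the sketch below shows how trajectories indexed by TtyrecDataset might be streamed in batches for training. The module, class, argument, and key names (e.g. nle.dataset, batch_size, tty_chars, keypresses) reflect our understanding of the open-sourced code and should be treated as illustrative assumptions rather than the exact published API; the local path is a placeholder.

```python
# Illustrative sketch of streaming NLD batches with TtyrecDataset.
# Names and signatures are assumptions; check the released nle.dataset code.
import nle.dataset as nld

dbfile = "ttyrecs.db"
if not nld.db.exists(dbfile):
    nld.db.create(dbfile)
    # Index a local copy of NLD-AA (path is a placeholder).
    nld.add_nledata_directory("/path/to/nld-aa", "nld-aa", dbfile)

dataset = nld.TtyrecDataset(
    "nld-aa",          # dataset name registered above
    batch_size=32,     # trajectories unrolled in parallel
    seq_length=128,    # time steps per mini-batch
    dbfilename=dbfile,
)

for batch in dataset:
    # Each batch is a dict of arrays shaped [batch_size, seq_length, ...],
    # e.g. terminal characters/colors and (for NLD-AA) keypresses and scores.
    tty_chars = batch["tty_chars"]
    actions = batch.get("keypresses")
    # ... feed into an imitation-learning or offline-RL update here.
    break
```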
To summarize, the key characteristics of NLD are that: it is a scalable dataset of demonstrations (i.e., large and cheap) for a highly complex sequential decision-making challenge; it enables research in multiple areas such as imitation learning, offline RL, learning from observations of demonstrations, and learning from both static data and environment interaction; and it has many properties of real-world domains, such as partial observability, stochastic dynamics, sparse rewards, long trajectories, a rich environment, diverse behaviors, and procedural generation.
In this paper, we make the following core contributions: (i) we introduce NLD-NAO, a large-scale dataset of almost 10 billion state transitions from 1.5 million NetHack games played by humans; (ii) we also introduce NLD-AA, a large-scale dataset of over 3 billion state-action-score transitions from 100,000 games played by the symbolic-bot winner of the NeurIPS 2021 NetHack Challenge; (iii) we open-source code for users to record, load, and stream any collection of NetHack trajectories in a highly compressed form; and (iv) we show that, while current state-of-the-art methods in offline RL and learning from demonstrations can effectively make use of the dataset, playing NetHack at human-level performance remains an open research challenge.
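To make contribution (iv) concrete, the snippet below sketches a minimal behavioral-cloning loop over NLD-AA-style batches (terminal observations paired with keypress actions). This is an illustrative sketch, not the paper's baseline implementation: the PyTorch model is a toy, and the dataset and key names rest on the same assumptions as the loader sketch above (e.g. 24x80 terminal frames and byte-valued keypresses).

```python
# Minimal behavioral-cloning sketch on NLD-AA-style batches.
# Illustrative only; not the baseline used in the paper.
import torch
import torch.nn as nn
import nle.dataset as nld  # assumed module name, as in the loader sketch

NUM_ACTIONS = 256  # ttyrec keypresses are assumed to be single bytes

dataset = nld.TtyrecDataset(
    "nld-aa", batch_size=32, seq_length=32, dbfilename="ttyrecs.db"
)

model = nn.Sequential(
    nn.Flatten(start_dim=2),     # assumed frames [B, T, 24, 80] -> [B, T, 1920]
    nn.Linear(24 * 80, 512),
    nn.ReLU(),
    nn.Linear(512, NUM_ACTIONS),
)
optim = torch.optim.Adam(model.parameters(), lr=1e-4)

for batch in dataset:
    obs = torch.as_tensor(batch["tty_chars"], dtype=torch.float32)
    act = torch.as_tensor(batch["keypresses"], dtype=torch.long)
    logits = model(obs)                                  # [B, T, NUM_ACTIONS]
    loss = nn.functional.cross_entropy(
        logits.flatten(0, 1), act.flatten(0, 1)
    )
    optim.zero_grad()
    loss.backward()
    optim.step()
```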
2 Related Work
Offline RL Benchmarks. Recently, there has been growing interest in developing better offline RL methods [25, 39, 11, 60, 61, 21, 20, 6, 2, 49] which aim to learn from datasets of trajectories. Alongside these methods, a number of offline RL benchmarks have been released [1, 62, 10, 42, 22, 43]. While these
benchmarks focus specifically on offline RL, our datasets enable research on multiple areas including imitation learning, learning from observations only (i.e., without access to actions or rewards), as well as learning from both offline and online interactions. In order to make progress on difficult RL tasks such as NetHack, we will likely need to learn from both human data and environment interaction, as was the case with other challenging games like StarCraft [58] or Dota [3]. In contrast, the tasks proposed in current offline RL benchmarks are much easier and can be solved by training either only online or only offline [10, 62]. In addition, current offline RL benchmarks test agents on the exact same environment where the data was collected. As emphasized by [57], imitation learning algorithms drastically overfit to their environments, so it is essential to evaluate them on new scenarios in order to develop robust methods. In contrast, NLE has long, procedurally generated episodes which require both long-term planning and systematic generalization in order to succeed. This is shown in [30], which investigates transfer learning between policies trained on different NLE-based environments.
For many real-world applications such as autonomous driving or robotic manipulation, learning from human data is essential due to safety concerns and time constraints [40, 10, 8, 5, 27, 52, 55, 17, 38, 18].
However, most offline RL benchmarks contain synthetic trajectories generated by either random exploration, pretrained RL agents, or simple hard-coded behaviors [10]. In contrast, one of our datasets consists entirely of human replays, while the other is generated by the winner of the NetHack Challenge at NeurIPS 2021, a complex symbolic bot with built-in knowledge of the game. Human data (like the set contained in NLD-NAO) is significantly more diverse and messy than synthetic data, as humans can vary widely in their expertise, optimize for different objectives (such as fun or discovery), have access to external information (such as the NetHack Wiki [34]),