Dungeons and Data: A Large-Scale NetHack Dataset

Eric Hambro∗ (Meta AI), Roberta Raileanu (Meta AI), Danielle Rothermel (Meta AI), Vegard Mella (Meta AI), Tim Rocktäschel† (University College London), Heinrich Küttler† (Inflection AI), Naila Murray (Meta AI)
Abstract
Recent breakthroughs in the development of agents to solve challenging sequential decision making problems such as Go [50], StarCraft [58], or DOTA [3] have relied on both simulated environments and large-scale datasets. However, progress on this research has been hindered by the scarcity of open-sourced datasets and the prohibitive computational cost of working with them. Here we present the NetHack Learning Dataset (NLD), a large and highly-scalable dataset of trajectories from the popular game of NetHack, which is both extremely challenging for current methods and very fast to run [23]. NLD consists of three parts: 10 billion state transitions from 1.5 million human trajectories collected on the NAO public NetHack server from 2009 to 2020; 3 billion state-action-score transitions from 100,000 trajectories collected from the symbolic bot winner of the NetHack Challenge 2021; and accompanying code for users to record, load, and stream any collection of such trajectories in a highly compressed form. We evaluate a wide range of existing algorithms, including online and offline RL as well as learning from demonstrations, showing that significant research advances are needed to fully leverage large-scale datasets for challenging sequential decision making tasks.
1 Introduction
Recent progress on deep reinforcement learning (RL) methods has led to significant breakthroughs such as training autonomous agents to play Atari [33], Go [50], StarCraft [58], and Dota [3], or to perform complex robotic tasks [24, 44, 26, 35, 16, 41]. In many of these cases, success relied on having access to large-scale datasets of human demonstrations [58, 3, 41, 26]. Without access to such demonstrations, training RL agents to operate effectively in these environments remains challenging due to the hard exploration problem posed by their vast state and action spaces. In addition, having access to a simulator is key for training agents that can discover new strategies not exhibited by human demonstrations. Therefore, training agents using a combination of offline data and online interaction has proven to be a highly successful approach for solving a variety of challenging sequential decision making tasks. However, this requires access to complex simulators and large-scale offline datasets, which tend to be computationally expensive.

The NetHack Learning Environment (NLE) was recently introduced as a testbed for RL, providing an environment which is both extremely challenging [15] and exceptionally fast to simulate [23]. NLE is a stochastic, partially observed, and procedurally generated RL environment based on the popular game of NetHack. Due to its long episodes (i.e. tens or hundreds of thousands of steps) and large state and action spaces, NLE poses a uniquely hard exploration challenge for current RL methods. Thus, one of the most promising research avenues towards progress on NetHack is leveraging human or symbolic-bot demonstrations to bootstrap performance, which also proved successful for StarCraft [58] and Dota [3].
∗ Correspondence to ehambro@fb.com.
† Work done while at Meta AI.
36th Conference on Neural Information Processing Systems (NeurIPS 2022) Track on Datasets and Benchmarks.
arXiv:2211.00539v3 [cs.LG] 24 Nov 2023
In this paper, we introduce the NetHack Learning Dataset (NLD), an open and accessible dataset for large-scale offline RL and learning from demonstrations. NLD consists of three parts: first, NLD-NAO: a collection of state-only trajectories from 1.5 million human games of NetHack played on nethack.alt.org (NAO) servers between 2009 and 2020; second, NLD-AA: a collection of state-action-score trajectories from 100,000 NLE games played by the symbolic-bot winner of the 2021 NetHack Challenge [15]; third, TtyrecDataset: a highly-scalable tool for efficient training on any NetHack and NLE-generated trajectories and metadata. NLD, in combination with NLE, enables computationally accessible research in multiple areas including imitation learning, offline RL, learning from sequences of only observations, as well as combining learning from offline data with learning from online interactions. In contrast with other large-scale datasets of demonstrations, NLD is highly efficient in both memory and compute. NLD-NAO can fit on a $30 hard drive, after being compressed (by a factor of 160) from 38TB to 229GB. In addition, NLD-NAO can be processed in under 15 hours, achieving a throughput of 288,000 frames per second with only 10 CPUs. NLD's low memory and computational requirements make large-scale learning from demonstrations more accessible for academic and independent researchers.
To summarize, the key characteristics of NLD are that: it is a scalable dataset of demonstrations (i.e. large and cheap) for a highly complex sequential decision making challenge; it enables research in multiple areas such as imitation learning, offline RL, learning from observations of demonstrations, and learning from both static data and environment interaction; and it has many properties of real-world domains, such as partial observability, stochastic dynamics, sparse reward, long trajectories, a rich environment, diverse behaviors, and procedural generation.
In this paper, we make the following core contributions: (i) we introduce NLD-NAO, a large-scale dataset of almost 10 billion state transitions from 1.5 million NetHack games played by humans; (ii) we also introduce NLD-AA, a large-scale dataset of over 3 billion state-action-score transitions from 100,000 games played by the symbolic-bot winner of the NeurIPS 2021 NetHack Challenge; (iii) we open-source code for users to record, load, and stream any collection of NetHack trajectories in a highly compressed form; and (iv) we show that, while current state-of-the-art methods in offline RL and learning from demonstrations can effectively make use of the dataset, playing NetHack at human-level performance remains an open research challenge.
2 Related Work
Offline RL Benchmarks. Recently, there has been growing interest in developing better offline RL methods [25, 39, 11, 60, 61, 21, 20, 6, 2, 49] which aim to learn from datasets of trajectories. Alongside this work, a number of offline RL benchmarks have been released [1, 62, 10, 42, 22, 43]. While these benchmarks focus specifically on offline RL, our datasets enable research on multiple areas including imitation learning, learning from observations only (i.e. without access to actions or rewards), as well as learning from both offline and online interactions. In order to make progress on difficult RL tasks such as NetHack, we will likely need to learn from both human data and environment interaction, as was the case with other challenging games like StarCraft [58] or Dota [3]. In contrast, the tasks proposed in current offline RL benchmarks are much easier and can be solved by training either only online or only offline [10, 62]. In addition, current offline RL benchmarks test agents on the exact same environment where the data was collected. As emphasized by [57], imitation learning algorithms drastically overfit to their environments, so it is essential to evaluate them on new scenarios in order to develop robust methods. In contrast, NLE has long procedurally generated episodes which require both long-term planning and systematic generalization in order to succeed. This is shown in [30], which investigates transfer learning between policies trained on different NLE-based environments.
For many real-world applications such as autonomous driving or robotic manipulation, learning from human data is essential due to safety concerns and time constraints [40, 10, 8, 5, 27, 52, 55, 17, 38, 18]. However, most offline RL benchmarks contain synthetic trajectories generated by either random exploration, pretrained RL agents, or simple hard-coded behaviors [10]. In contrast, one of our datasets consists entirely of human replays, while the other is generated by the winner of the NetHack Challenge at NeurIPS 2021, a complex symbolic bot with built-in knowledge of the game. Human data (like the set contained in NLD-NAO) is significantly more diverse and messy than synthetic data, as humans can vary widely in their expertise, optimize for different objectives (such as fun or discovery), have access to external information (such as the NetHack Wiki [34]), and even have different observation or action spaces than RL agents. Hence, learning directly from human data is essential for making progress on real-world problems in RL.
Large-Scale Human Datasets. A number of large-scale datasets of human replays have been released for StarCraft [59], Dota [3], and MineRL [14]. However, training models on these datasets requires massive computational resources, which makes it infeasible for academic or independent researchers. In contrast, NLD strikes a better balance between scale (i.e. a large number of diverse human demonstrations on a complex task) and efficiency (i.e. cheap to use and fast to run).

For many real-world applications such as robotic manipulation, we only have access to the demonstrator's observations and not their actions [29, 48, 40, 8, 51]. Research on this setting has been slower [9, 56, 7], in part due to the lack of efficient large-scale datasets. While there are some datasets containing only observations, they are either much smaller than NLD [32, 57, 48], too computationally expensive [59, 3], or lack a simulator, which prevents learning from online interactions [12, 28, 8].
3 Background: The NetHack Learning Environment
The NetHack Learning Environment (NLE) is a gym environment [4] based on the popular "dungeon-crawler" game NetHack [19]. Despite its visual simplicity, NetHack is widely considered one of the hardest video games in history, since it can take years for humans to win the game [54]. Players need to explore the dungeon, manage their resources, and learn about the many entities and their dynamics (often by relying on external knowledge sources like the NetHack Wiki [34]). NetHack has a clearly defined goal, namely to descend the dungeon, retrieve an amulet, and ascend back to win the game. At the beginning of each game, players are randomly assigned a multidimensional character defined by role, race, alignment, and gender (with varying properties and challenges), so they need to master all characters in order to win consistently. Thus, NLE offers a unique set of properties which make it well positioned to advance research on RL and learning from demonstrations: it is a highly complex environment, containing hundreds of entities with different dynamics; it is procedurally generated, allowing researchers to test generalization; and it is partially observed, highly stochastic, and has very long episodes (i.e. one or two orders of magnitude longer than StarCraft II [59]).
Following its release, several works have built on NLE to leverage its complexity in different ways. MiniHack [45] allows researchers to design their own environments to test particular capabilities of RL agents, by leveraging the full set of entities and dynamics from NetHack. The NetHack Challenge [15] was a competition at NeurIPS 2021 which sought to incentivise a showdown between symbolic and deep RL methods on NLE. Symbolic bots decisively outperformed deep RL methods, with the best-performing symbolic bots surpassing state-of-the-art deep RL methods by a factor of 5.
4 The NetHack Learning Dataset
The NetHack Learning Dataset (NLD) contains three components:
1. NLD-NAO — a directory of ttyrec.bz2 files containing almost 10 billion state transitions and metadata from 1,500,000 human games of NetHack played on nethack.alt.org.
2. NLD-AA — a directory of ttyrec3.bz2 files containing over 3 billion state-action-score transitions and metadata from 100,000 games collected from the winning bot of the NetHack Challenge [15].
3. TtyrecDataset — a Python class that can scalably load directories of ttyrec.bz2 / ttyrec3.bz2 files and their metadata into numpy arrays.
We are also releasing a new version of the NetHack environment, NLE v0.9.0, which contains new features and ensures compatibility with NLD (see Appendix C).
File Format. The ttyrec file format stores sequences of terminal instructions (equivalent to observations in RL), along with the times at which to play them. In NLE v0.9.0, we adapt this format to also store keypress inputs to the terminal (equivalent to actions in RL) and in-game scores over time (equivalent to rewards in RL), allowing a reconstruction of state-action-score trajectories. This adapted format is known as ttyrec3. The ttyrec.bz2 and ttyrec3.bz2 formats, compressed versions of ttyrec and ttyrec3, are the primary data formats used in NLD. Using TtyrecDataset, these compressed files can be written and read on-the-fly, resulting in data compression ratios of more than 138. The files can be decompressed into the state trajectory on a terminal by using a terminal emulator and querying its screen. For more details, see Appendix D.

3 https://nethackwiki.com/wiki/Ttyrec
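As a rough illustration of the underlying format, the following sketch decodes frames from a plain ttyrec.bz2 file in Python. It assumes the standard ttyrec header of three little-endian uint32 fields (seconds, microseconds, payload length) followed by the raw terminal output; the additional keypress and score channels of ttyrec3, and the terminal emulation that TtyrecDataset performs in C/C++, are not reproduced here.

import bz2
import struct

def iter_ttyrec_frames(path):
    """Yield (timestamp, payload) pairs from a plain ttyrec.bz2 recording."""
    with bz2.open(path, "rb") as f:
        while True:
            header = f.read(12)
            if len(header) < 12:
                return  # end of file
            sec, usec, length = struct.unpack("<III", header)
            payload = f.read(length)  # raw terminal escape sequences for this frame
            yield sec + usec / 1e6, payload

# Example: count the frames in a single recording.
# n_frames = sum(1 for _ in iter_ttyrec_frames("game.ttyrec.bz2"))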
Throughout the paper, we refer to the screen a player sees at a given time as either state or observation. Note, however, that NetHack is partially observed, so the player does not have access to the full state of the game. We also sometimes use the terms score and reward interchangeably, since the increment in in-game score is a natural choice of reward for training RL agents on NetHack. Similarly, a human's keypress corresponds to an agent's action in the game.
Metadata. NetHack has an optional built-in feature for the logging of game metadata, used for the maintenance of all-time high-score lists. At the end of a game, 26 fields of data are logged to a common xlogfile for posterity. These fields include the character's name, race, role, score, cause of death, maximum dungeon depth, and more. See Appendix E for more details on these fields. With NLE v0.9.0, an xlogfile is generated for all NLE recordings. These files are used to populate all metadata contained in NLD.
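As an illustration only, the sketch below parses such an xlogfile into per-game dictionaries. It assumes the common xlogfile convention of one game per line with key=value fields; the separator is ':' on many servers, though some configurations use a tab, and field names such as 'points' or 'role' follow the standard xlogfile layout.

def parse_xlogfile(path, sep=":"):
    """Parse an xlogfile into a list of {field: value} dicts, one per game."""
    games = []
    with open(path, "r", errors="replace") as f:
        for line in f:
            fields = {}
            for item in line.rstrip("\n").split(sep):
                if "=" in item:
                    key, _, value = item.partition("=")
                    fields[key] = value
            games.append(fields)
    return games

# Example: in-game scores of all recorded games.
# scores = [int(g["points"]) for g in parse_xlogfile("xlogfile")]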
State-Action-Score Transitions. As mentioned, NLD-AA contains sequences of state-action-score transitions from symbolic-bot play, while NLD-NAO contains sequences of state transitions from human play. These transitions are efficiently processed using the TtyrecDataset. The states consist of: tty_chars (the characters at each point on the screen), tty_colors (the colors at each point on the screen), tty_cursor (the location of the cursor on the screen), timestamps (when the state was recorded), and gameids (an identifier for the game in question). Additionally, keypresses and score observations are available for ttyrec3 files, as in the NLD-AA dataset. The states, keypresses, and scores from NLD map directly to an agent's observations, actions, and rewards in NLE. More information about these transitions can be found in Appendix F.
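To make the state representation concrete, the sketch below renders a single timestep of tty_chars back into human-readable text. It assumes tty_chars is a (rows, columns) uint8 array of character codes for a standard 24 x 80 terminal, as produced by the terminal emulation; colors and the cursor are ignored for brevity.

import numpy as np

def render_screen(tty_chars: np.ndarray) -> str:
    """Convert a (rows, cols) array of character codes into a printable screen."""
    return "\n".join("".join(map(chr, row)) for row in tty_chars)

# Example with a blank 24 x 80 screen:
blank = np.full((24, 80), ord(" "), dtype=np.uint8)
print(render_screen(blank))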
API. The TtyrecDataset follows the API of an IterableDataset as defined in PyTorch. This allows for the batched streaming of ttyrec.bz2 / ttyrec3.bz2 files directly into fixed NumPy arrays of a chosen shape. Episodes are read sequentially in chunks defined by the unroll length, batched with a different game per batch index. The order of these games can be shuffled, predetermined, or even looped to provide a never-ending dataset. This class allows users to load any state-action-score trajectory recorded from NLE v0.9.0 onwards.
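The sketch below shows how such an iterator might be consumed in a training loop. The import path, constructor arguments, and minibatch keys are assumptions chosen for illustration and may differ from the released code; the overall pattern (a PyTorch-style IterableDataset yielding fixed-shape NumPy arrays) is what the paragraph above describes.

from nle.dataset import TtyrecDataset  # assumed import path

# Assumed constructor: a dataset name as registered in the metadata database,
# plus the batch size (games streamed in parallel) and unroll length.
dataset = TtyrecDataset(
    "nld-aa",
    batch_size=32,
    seq_length=32,
    shuffle=True,
)

for minibatch in dataset:
    # Fixed-shape NumPy arrays, e.g. [batch_size, seq_length, 24, 80] for
    # tty_chars; keypresses and scores are only present for ttyrec3 data.
    chars = minibatch["tty_chars"]
    actions = minibatch.get("keypresses")
    break  # a real training loop would update a model here instead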
The TtyrecDataset wraps a small sqlite3 database where it stores metadata and the paths to files. This design allows for the simple querying of metadata for any game, along with the dynamic subselection of games streamed from the TtyrecDataset itself. For example, in Figure 1, we generate sub-datasets from NLD-NAO and NLD-AA, selecting trajectories where the player has completed the game ('Ascended') or played a 'Human Monk' character, respectively. Appendix G shows how to load only a subset of trajectories, for example where the player has ascended or a certain role has been used.
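As a hypothetical example of such subselection, one could query the sqlite3 database directly to collect the ids of ascended games and restrict streaming to them. The database filename, table name, and column names below are assumptions for illustration; the released code provides helpers for exactly this kind of filtering (see Appendix G).

import sqlite3

# Assumed schema: a "games" table with xlogfile-style columns, including the
# "death" field, which is "ascended" for completed games.
conn = sqlite3.connect("ttyrecs.db")
ascended_ids = [
    row[0]
    for row in conn.execute("SELECT gameid FROM games WHERE death = ?", ("ascended",))
]
conn.close()

# A TtyrecDataset restricted to these game ids would then stream only
# trajectories in which the player completed the game.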
Scalability. The TtyrecDataset is designed to make our large-scale NLD datasets accessible even when computational resources are limited. To that end, several optimizations are made to improve the memory efficiency and throughput of the data. Most notably, TtyrecDataset streams recordings directly from their compressed ttyrec.bz2 files. This format compresses the 38TB of frame data included in NLD-NAO down to 229GB, which can fit on a $30 SSD. Decompression requires on-the-fly unzipping and terminal emulation, which the TtyrecDataset performs in GIL-released C/C++. This process is fast and trivially parallelizable with a Python ThreadPool, resulting in throughputs of over 1.7 GB/s on 80 CPUs. This performance allows the processing of almost 10 billion frames of NLD-NAO in under 15 hours with 10 CPUs. See Table 1 for a quantitative description of our two datasets.
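For intuition about these throughput numbers, a minimal timing loop like the one below (assuming the illustrative minibatch layout used earlier) is enough to measure frames per second on a given machine.

import time

def measure_throughput(dataset, max_batches=100):
    """Return the streaming rate in frames per second over a few minibatches."""
    frames = 0
    start = time.perf_counter()
    for i, minibatch in enumerate(dataset):
        batch_size, seq_length = minibatch["tty_chars"].shape[:2]
        frames += batch_size * seq_length
        if i + 1 >= max_batches:
            break
    return frames / (time.perf_counter() - start)

# print(f"{measure_throughput(dataset):,.0f} frames per second")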
5 Dataset Analysis
In this section we perform an in-depth analysis of the characteristics of NLD-AA and NLD-NAO.
4 https://nethackwiki.com/wiki/Xlogfile
5 https://www.amazon.com/HP-240GB-Internal-Solid-State/dp/B09KFHTYWH
Table 1: NLD-AA and NLD-NAO in numbers.

                              NLD-AA                  NLD-NAO
Episodes                      109,545                 1,511,228
Transitions                   3,481,605,009           9,858,127,896
Policies (Players)            1                       48,454
Policy Type                   symbolic bot            human
Transition Type               (state, action, score)  state
Disk Size (Compressed)        96.7 GB                 229 GB
Data Size (Uncompressed)      13.4 TB                 38.0 TB
Compression Ratio             138                     166
Mean Episode Score            10,105                  127,218
Median Episode Score          5,422                   836
Median Episode Transitions    28,181                  1,724
Median Episode Game Turns     20,414                  3,766
Epoch Time (10 CPUs)          4h 49m                  14h 37m
5.1 NLD-AA
To our knowledge, AutoAscend is currently the best open-sourced artificial agent for playing NetHack 3.6.6, having achieved first place in the 2021 NeurIPS NetHack Challenge by a considerable margin [15]. AutoAscend is a symbolic bot, forgoing any neural network and instead relying on a priority list of hard-coded subroutines. These subroutines are complex, context-dependent, and make significant use of NetHack domain knowledge and all NetHack actions. For instance, the bot keeps track of multiple properties for encountered entities and can even solve challenging puzzles such as Sokoban. A full description of AutoAscend's algorithm and behavior can be found in the NetHack Challenge report [15].
NLD-AA was generated by running AutoAscend on the NetHackChallenge-v0 [15] task in NLE v0.9.0, utilising its built-in recording feature to generate ttyrec3.bz2 files. It consists of over 3 billion state-action-score transitions, drawn from 100,000 episodes generated by AutoAscend. Of the NLE tasks, NetHackChallenge-v0 most closely reproduces the full game of NetHack 3.6.6; it was introduced in NLE v0.7.0 to grant NetHack Challenge competitors access to the widest possible action space, and to enforce an automated randomisation of the starting character (by race, role, alignment, and gender). By virtue of using ttyrec3.bz2 files, in-game scores and actions (in the form of keypresses) are stored along with the states, as well as metadata about the episodes.
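For readers who want to produce recordings in the same format, the sketch below runs a policy on NetHackChallenge-v0 with recording enabled. The savedir keyword is our assumption for how the built-in recording feature is exposed, and the random policy is a stand-in; AutoAscend itself would select the actions.

import gym
import nle  # noqa: F401  # registers the NetHack environments with gym

# Assumed: passing savedir enables the built-in ttyrec3.bz2 recording.
env = gym.make("NetHackChallenge-v0", savedir="recordings")
obs = env.reset()
done = False
while not done:
    action = env.action_space.sample()  # a real bot would choose actions here
    obs, reward, done, info = env.step(action)
env.close()  # recordings and xlogfile metadata are written under savedir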
Dataset Skill. The AutoAscend trajectories demonstrate a strong and reliable NetHack player, far exceeding all deep learning based approaches but still falling short of an expert human player. NetHack broadly defines a character with less than 2000 score as a ‘Beginner’. AutoAscend comfortably surpasses the ‘Beginner’ classification in more than 75% of games for all roles, and in 95% of games for easier roles like Monk (see Figure 1). Given the high-variance nature of NetHack games, and the challenge of playing with the more difficult roles, this is an impressive feat.

Compared to the human players in NLD-NAO, AutoAscend's policy finds itself just within the top 15% of all players when ranked by mean score. When ranked by median score it comes within the top 7%. However, these metrics are somewhat distorted by the long tail of dilettante players who played only a few games. If we instead define a ‘competent’ human player as one who has ever advanced beyond the Beginner classification, then AutoAscend ranks in the top 33% of players by mean score, and in the top 15% by median.
The competence of AutoAscend contrasts both with the poor performance of deep RL bots and with the exceptional performance required to beat the game. As the winning symbolic bot of the NetHack Challenge, AutoAscend beat the winning deep learning bot by almost a factor of 3 in median score, and by close to a factor of 5 in mean score. This performance is far outside what deep RL agents can currently achieve, and in some domains NLD-AA may be considered an "expert" dataset. However,
6 https://github.com/maciej-sypetkowski/autoascend
7 https://nethackwiki.com/wiki/Beginner
8 Median score was the primary metric in the NetHack Challenge.