Dungeons and Data: A Large-Scale NetHack Dataset

Eric Hambro∗ (Meta AI), Roberta Raileanu (Meta AI), Danielle Rothermel (Meta AI), Vegard Mella (Meta AI), Tim Rocktäschel† (University College London), Heinrich Küttler† (Inflection AI), Naila Murray (Meta AI)
Abstract
Recent breakthroughs in the development of agents to solve challenging sequential decision making problems such as Go [50], StarCraft [58], or DOTA [3] have relied on both simulated environments and large-scale datasets. However, progress on this research has been hindered by the scarcity of open-sourced datasets and the prohibitive computational cost of working with them. Here we present the NetHack Learning Dataset (NLD), a large and highly-scalable dataset of trajectories from the popular game of NetHack, which is both extremely challenging for current methods and very fast to run [23]. NLD consists of three parts: 10 billion state transitions from 1.5 million human trajectories collected on the NAO public NetHack server from 2009 to 2020; 3 billion state-action-score transitions from 100,000 trajectories collected from the symbolic bot winner of the NetHack Challenge 2021; and accompanying code for users to record, load, and stream any collection of such trajectories in a highly compressed form. We evaluate a wide range of existing algorithms, including online and offline RL as well as learning from demonstrations, showing that significant research advances are needed to fully leverage large-scale datasets for challenging sequential decision making tasks.
1 Introduction
Recent progress on deep reinforcement learning (RL) methods has led to significant breakthroughs such as training autonomous agents to play Atari [33], Go [50], StarCraft [58], and Dota [3], or to perform complex robotic tasks [24, 44, 26, 35, 16, 41]. In many of these cases, success relied on having access to large-scale datasets of human demonstrations [58, 3, 41, 26]. Without access to such demonstrations, training RL agents to operate effectively in these environments remains challenging due to the hard exploration problem posed by their vast state and action spaces. In addition, having access to a simulator is key for training agents that can discover new strategies not exhibited by human demonstrations. Therefore, training agents using a combination of offline data and online interaction has proven to be a highly successful approach for solving a variety of challenging sequential decision making tasks. However, this requires access to complex simulators and large-scale offline datasets, which tend to be computationally expensive.

The NetHack Learning Environment (NLE) was recently introduced as a testbed for RL, providing an environment which is both extremely challenging [15] and exceptionally fast to simulate [23]. NLE is a stochastic, partially observed, and procedurally generated RL environment based on the popular game of NetHack. Due to its long episodes (i.e. tens or hundreds of thousands of steps) and large state and action spaces, NLE poses a uniquely hard exploration challenge for current RL methods. Thus, one of the most promising research avenues towards progress on NetHack is leveraging human or symbolic-bot demonstrations to bootstrap performance, which also proved successful for StarCraft [58] and Dota [3].
∗ Correspondence to ehambro@fb.com.
† Work done while at Meta AI.
36th Conference on Neural Information Processing Systems (NeurIPS 2022) Track on Datasets and Benchmarks.
arXiv:2211.00539v3 [cs.LG] 24 Nov 2023
In this paper, we introduce the NetHack Learning Dataset (NLD), an open and accessible dataset for large-scale offline RL and learning from demonstrations. NLD consists of three parts: first, NLD-NAO: a collection of state-only trajectories from 1.5 million human games of NetHack played on nethack.alt.org (NAO) servers between 2009 and 2020; second, NLD-AA: a collection of state-action-score trajectories from 100,000 NLE games played by the symbolic-bot winner of the 2021 NetHack Challenge [15]; third, TtyrecDataset: a highly-scalable tool for efficient training on any NetHack and NLE-generated trajectories and metadata. NLD, in combination with NLE, enables computationally accessible research in multiple areas including imitation learning, offline RL, learning from sequences of only observations, as well as combining learning from offline data with learning from online interactions. In contrast with other large-scale datasets of demonstrations, NLD is highly efficient in both memory and compute. NLD-NAO can fit on a $30 hard drive, after being compressed (by a factor of 160) from 38TB to 229GB. In addition, NLD-NAO can be processed in under 15 hours, achieving a throughput of 288,000 frames per second with only 10 CPUs. NLD's low memory and computational requirements make large-scale learning from demonstrations more accessible for academic and independent researchers.
To summarize, the key characteristics of NLD are that: it is a scalable dataset of demonstrations (i.e. large and cheap) for a highly complex sequential decision making challenge; it enables research in multiple areas such as imitation learning, offline RL, learning from observations of demonstrations, and learning from both static data and environment interaction; and it has many properties of real-world domains, such as partial observability, stochastic dynamics, sparse reward, long trajectories, a rich environment, diverse behaviors, and procedural generation.
In this paper, we make the following core contributions: (i) we introduce NLD-NAO, a large-scale dataset of almost 10 billion state transitions from 1.5 million NetHack games played by humans; (ii) we also introduce NLD-AA, a large-scale dataset of over 3 billion state-action-score transitions from 100,000 games played by the symbolic-bot winner of the NeurIPS 2021 NetHack Challenge; (iii) we open-source code for users to record, load, and stream any collection of NetHack trajectories in a highly compressed form; and (iv) we show that, while current state-of-the-art methods in offline RL and learning from demonstrations can effectively make use of the dataset, playing NetHack at human-level performance remains an open research challenge.
2 Related Work
Offline RL Benchmarks. Recently, there has been growing interest in developing better offline RL methods [25, 39, 11, 60, 61, 21, 20, 6, 2, 49] which aim to learn from datasets of trajectories. Alongside this work, a number of offline RL benchmarks have been released [1, 62, 10, 42, 22, 43]. While these benchmarks focus specifically on offline RL, our datasets enable research on multiple areas including imitation learning, learning from observations only (i.e. without access to actions or rewards), as well as learning from both offline and online interactions. In order to make progress on difficult RL tasks such as NetHack, we will likely need to learn from both human data and environment interaction, as was the case with other challenging games like StarCraft [58] or Dota [3]. In contrast, the tasks proposed in current offline RL benchmarks are much easier and can be solved by training either only online or only offline [10, 62]. In addition, current offline RL benchmarks test agents on the exact same environment where the data was collected. As emphasized by [57], imitation learning algorithms drastically overfit to their environments, so it is essential to evaluate them on new scenarios in order to develop robust methods. In contrast, NLE has long procedurally generated episodes which require both long-term planning and systematic generalization in order to succeed. This is shown in [30], which investigates transfer learning between policies trained on different NLE-based environments.
For many real-world applications such as autonomous driving or robotic manipulation, learning from human data is essential due to safety concerns and time constraints [40, 10, 8, 5, 27, 52, 55, 17, 38, 18]. However, most offline RL benchmarks contain synthetic trajectories generated by either random exploration, pretrained RL agents, or simple hard-coded behaviors [10]. In contrast, one of our datasets consists entirely of human replays, while the other is generated by the winner of the NetHack Challenge at NeurIPS 2021, a complex symbolic bot with built-in knowledge of the game. Human data (like the set contained in NLD-NAO) is significantly more diverse and messy than synthetic data, as humans can vary widely in their expertise, optimize for different objectives (such as fun or discovery), have access to external information (such as the NetHack Wiki [34]), and even have different observation or action spaces than RL agents. Hence, learning directly from human data is essential for making progress on real-world problems in RL.
Large-Scale Human Datasets. A number of large-scale datasets of human replays have been released for StarCraft [59], Dota [3], and MineRL [14]. However, training models on these datasets requires massive computational resources, which makes it infeasible for academic or independent researchers. In contrast, NLD strikes a better balance between scale (i.e. a large number of diverse human demonstrations on a complex task) and efficiency (i.e. cheap to use and fast to run).

For many real-world applications such as robotic manipulation, we only have access to the demonstrator's observations and not their actions [29, 48, 40, 8, 51]. Research on this setting has been slower [9, 56, 7], in part due to the lack of efficient large-scale datasets. While there are some datasets containing only observations, they are either much smaller than NLD [32, 57, 48], too computationally expensive [59, 3], or lack a simulator, which prevents learning from online interactions [12, 28, 8].
3 Background: The NetHack Learning Environment
The NetHack Learning Environment (NLE) is a gym environment [4] based on the popular "dungeon-crawler" game NetHack [19]. Despite its visual simplicity, NetHack is widely considered one of the hardest video games in history, since it can take years for humans to win the game [54]. Players need to explore the dungeon, manage their resources, and learn about the many entities and their dynamics (often by relying on external knowledge sources like the NetHack Wiki [34]). NetHack has a clearly defined goal, namely to descend the dungeon, retrieve an amulet, and ascend back to win the game. At the beginning of each game, players are randomly assigned a multidimensional character defined by role, race, alignment, and gender (with varying properties and challenges), so they need to master all characters in order to win consistently. Thus, NLE offers a unique set of properties which make it well positioned to advance research on RL and learning from demonstrations: it is a highly complex environment, containing hundreds of entities with different dynamics; it is procedurally generated, allowing researchers to test generalization; and it is partially observed, highly stochastic, and has very long episodes (i.e. one or two orders of magnitude longer than StarCraft II [59]).
Following its release, several works have built on NLE to leverage its complexity in different ways. MiniHack [45] allows researchers to design their own environments to test particular capabilities of RL agents, by leveraging the full set of entities and dynamics from NetHack. The NetHack Challenge [15] was a competition at NeurIPS 2021 which sought to incentivise a showdown between symbolic and deep RL methods on NLE. Symbolic bots decisively outperformed deep RL methods, with the best-performing symbolic bots surpassing state-of-the-art deep RL methods by a factor of 5.
4 The NetHack Learning Dataset
The NetHack Learning Dataset (NLD) contains three components:
1. NLD-NAO — a directory of ttyrec.bz2 files containing almost 10 billion state transitions and metadata from 1,500,000 human games of NetHack played on nethack.alt.org.
2. NLD-AA — a directory of ttyrec3.bz2 files containing over 3 billion state-action-score transitions and metadata from 100,000 games collected from the winning bot of the NetHack Challenge [15].
3. TtyrecDataset — a Python class that can scalably load directories of ttyrec.bz2 / ttyrec3.bz2 files and their metadata into numpy arrays.
We are also releasing a new version of the NetHack environment, NLE v0.9.0, which contains new features and ensures compatibility with NLD (see Appendix C).
File Format. The ttyrec file format stores sequences of terminal instructions (equivalent to observations in RL), along with the times at which to play them. In NLE v0.9.0, we adapt this format to also store keypress inputs to the terminal (equivalent to actions in RL) and in-game scores over time (equivalent to rewards in RL), allowing a reconstruction of state-action-score trajectories. This adapted format is known as ttyrec3. The ttyrec.bz2 and ttyrec3.bz2 formats, compressed versions of ttyrec and ttyrec3, are the primary data formats used in NLD. Using TtyrecDataset, these compressed files can be written and read on-the-fly, resulting in data compression ratios of more than 138. The files can be decompressed into the state trajectory on a terminal by using a terminal emulator and querying its screen. For more details, see Appendix D.

3 https://nethackwiki.com/wiki/Ttyrec
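As a rough illustration of the underlying format, the following sketch decodes frames from a plain ttyrec.bz2 file in Python. It assumes the standard ttyrec header of three little-endian uint32 fields (seconds, microseconds, payload length) followed by the raw terminal output; the additional keypress and score channels of ttyrec3, and the terminal emulation that TtyrecDataset performs in C/C++, are not reproduced here.

import bz2
import struct

def iter_ttyrec_frames(path):
    """Yield (timestamp, payload) pairs from a plain ttyrec.bz2 recording."""
    with bz2.open(path, "rb") as f:
        while True:
            header = f.read(12)
            if len(header) < 12:
                return  # end of file
            sec, usec, length = struct.unpack("<III", header)
            payload = f.read(length)  # raw terminal escape sequences for this frame
            yield sec + usec / 1e6, payload

# Example: count the frames in a single recording.
# n_frames = sum(1 for _ in iter_ttyrec_frames("game.ttyrec.bz2"))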
Throughout the paper, we refer to the screen a player sees at a given time as either state or observation. Note, however, that NetHack is partially observed, so the player does not have access to the full state of the game. We also sometimes use the terms score and reward interchangeably, since the increment in in-game score is a natural choice of reward for training RL agents on NetHack. Similarly, a human's keypress corresponds to an agent's action in the game.
Metadata. NetHack has an optional built-in feature for the logging of game metadata, used for the maintenance of all-time high-score lists. At the end of a game, 26 fields of data are logged to a common xlogfile for posterity. These fields include the character's name, race, role, score, cause of death, maximum dungeon depth, and more. See Appendix E for more details on these fields. With NLE v0.9.0, an xlogfile is generated for all NLE recordings. These files are used to populate all metadata contained in NLD.
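As an illustration only, the sketch below parses such an xlogfile into per-game dictionaries. It assumes the common xlogfile convention of one game per line with key=value fields; the separator is ':' on many servers, though some configurations use a tab, and field names such as 'points' or 'role' follow the standard xlogfile layout.

def parse_xlogfile(path, sep=":"):
    """Parse an xlogfile into a list of {field: value} dicts, one per game."""
    games = []
    with open(path, "r", errors="replace") as f:
        for line in f:
            fields = {}
            for item in line.rstrip("\n").split(sep):
                if "=" in item:
                    key, _, value = item.partition("=")
                    fields[key] = value
            games.append(fields)
    return games

# Example: in-game scores of all recorded games.
# scores = [int(g["points"]) for g in parse_xlogfile("xlogfile")]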
State-Action-Score Transitions. As mentioned, NLD-AA contains sequences of state-action-score transitions from symbolic-bot play, while NLD-NAO contains sequences of state transitions from human play. These transitions are efficiently processed using the TtyrecDataset. The states consist of: tty_chars (the characters at each point on the screen), tty_colors (the colors at each point on the screen), tty_cursor (the location of the cursor on the screen), timestamps (when the state was recorded), and gameids (an identifier for the game in question). Additionally, keypresses and score observations are available for ttyrec3 files, as in the NLD-AA dataset. The states, keypresses, and scores from NLD map directly to an agent's observations, actions, and rewards in NLE. More information about these transitions can be found in Appendix F.
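To make the state representation concrete, the sketch below renders a single timestep of tty_chars back into human-readable text. It assumes tty_chars is a (rows, columns) uint8 array of character codes for a standard 24 x 80 terminal, as produced by the terminal emulation; colors and the cursor are ignored for brevity.

import numpy as np

def render_screen(tty_chars: np.ndarray) -> str:
    """Convert a (rows, cols) array of character codes into a printable screen."""
    return "\n".join("".join(map(chr, row)) for row in tty_chars)

# Example with a blank 24 x 80 screen:
blank = np.full((24, 80), ord(" "), dtype=np.uint8)
print(render_screen(blank))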
API. The TtyrecDataset follows the API of an IterableDataset as defined in PyTorch. This allows for the batched streaming of ttyrec.bz2 / ttyrec3.bz2 files directly into fixed NumPy arrays of a chosen shape. Episodes are read sequentially in chunks defined by the unroll length, batched with a different game per batch index. The order of these games can be shuffled, predetermined, or even looped to provide a never-ending dataset. This class allows users to load any state-action-score trajectory recorded from NLE v0.9.0 onwards.
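The sketch below shows how such an iterator might be consumed in a training loop. The import path, constructor arguments, and minibatch keys are assumptions chosen for illustration and may differ from the released code; the overall pattern (a PyTorch-style IterableDataset yielding fixed-shape NumPy arrays) is what the paragraph above describes.

from nle.dataset import TtyrecDataset  # assumed import path

# Assumed constructor: a dataset name as registered in the metadata database,
# plus the batch size (games streamed in parallel) and unroll length.
dataset = TtyrecDataset(
    "nld-aa",
    batch_size=32,
    seq_length=32,
    shuffle=True,
)

for minibatch in dataset:
    # Fixed-shape NumPy arrays, e.g. [batch_size, seq_length, 24, 80] for
    # tty_chars; keypresses and scores are only present for ttyrec3 data.
    chars = minibatch["tty_chars"]
    actions = minibatch.get("keypresses")
    break  # a real training loop would update a model here instead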
The TtyrecDataset wraps a small sqlite3 database where it stores metadata and the paths to files. This design allows for the simple querying of metadata for any game, along with the dynamic subselection of games streamed from the TtyrecDataset itself. For example, in Figure 1, we generate sub-datasets from NLD-NAO and NLD-AA, selecting trajectories where the player has completed the game ('Ascended') or played a 'Human Monk' character, respectively. Appendix G shows how to load only a subset of trajectories, for example where the player has ascended or a certain role has been used.
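As a hypothetical example of such subselection, one could query the sqlite3 database directly to collect the ids of ascended games and restrict streaming to them. The database filename, table name, and column names below are assumptions for illustration; the released code provides helpers for exactly this kind of filtering (see Appendix G).

import sqlite3

# Assumed schema: a "games" table with xlogfile-style columns, including the
# "death" field, which is "ascended" for completed games.
conn = sqlite3.connect("ttyrecs.db")
ascended_ids = [
    row[0]
    for row in conn.execute("SELECT gameid FROM games WHERE death = ?", ("ascended",))
]
conn.close()

# A TtyrecDataset restricted to these game ids would then stream only
# trajectories in which the player completed the game.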
Scalability. The TtyrecDataset is designed to make our large-scale NLD datasets accessible even when computational resources are limited. To that end, several optimizations are made to improve the memory efficiency and throughput of the data. Most notably, TtyrecDataset streams recordings directly from their compressed ttyrec.bz2 files. This format compresses the 38TB of frame data included in NLD-NAO down to 229GB, which can fit on a $30 SSD. Decompression requires on-the-fly unzipping and terminal emulation, which the TtyrecDataset performs in GIL-released C/C++. This process is fast and trivially parallelizable with a Python ThreadPool, resulting in throughputs of over 1.7 GB/s on 80 CPUs. This performance allows the processing of almost 10 billion frames of NLD-NAO in under 15 hours with 10 CPUs. See Table 1 for a quantitative description of our two datasets.
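For intuition about these throughput numbers, a minimal timing loop like the one below (assuming the illustrative minibatch layout used earlier) is enough to measure frames per second on a given machine.

import time

def measure_throughput(dataset, max_batches=100):
    """Return the streaming rate in frames per second over a few minibatches."""
    frames = 0
    start = time.perf_counter()
    for i, minibatch in enumerate(dataset):
        batch_size, seq_length = minibatch["tty_chars"].shape[:2]
        frames += batch_size * seq_length
        if i + 1 >= max_batches:
            break
    return frames / (time.perf_counter() - start)

# print(f"{measure_throughput(dataset):,.0f} frames per second")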
5 Dataset Analysis
In this section we perform an in-depth analysis of the characteristics of NLD-AA and NLD-NAO.
4 https://nethackwiki.com/wiki/Xlogfile
5 https://www.amazon.com/HP-240GB-Internal-Solid-State/dp/B09KFHTYWH
Table 1: NLD-AA and NLD-NAO in numbers.

                              NLD-AA                  NLD-NAO
Episodes                      109,545                 1,511,228
Transitions                   3,481,605,009           9,858,127,896
Policies (Players)            1                       48,454
Policy Type                   symbolic bot            human
Transition Type               (state, action, score)  state
Disk Size (Compressed)        96.7 GB                 229 GB
Data Size (Uncompressed)      13.4 TB                 38.0 TB
Compression Ratio             138                     166
Mean Episode Score            10,105                  127,218
Median Episode Score          5,422                   836
Median Episode Transitions    28,181                  1,724
Median Episode Game Turns     20,414                  3,766
Epoch Time (10 CPUs)          4h 49m                  14h 37m
5.1 NLD-AA
To our knowledge, AutoAscend is currently the best open-sourced artificial agent for playing NetHack 3.6.6, having achieved first place in the 2021 NeurIPS NetHack Challenge by a considerable margin [15]. AutoAscend is a symbolic bot, forgoing any neural network and instead relying on a priority list of hard-coded subroutines. These subroutines are complex, context-dependent, and make significant use of NetHack domain knowledge and all NetHack actions. For instance, the bot keeps track of multiple properties for encountered entities and can even solve challenging puzzles such as Sokoban. A full description of AutoAscend's algorithm and behavior can be found in the NetHack Challenge report [15].
NLD-AA was generated by running AutoAscend on the NetHackChallenge-v0 [15] task in NLE v0.9.0, utilising its built-in recording feature to generate ttyrec3.bz2 files. It consists of over 3 billion state-action-score transitions, drawn from 100,000 episodes generated by AutoAscend. Of the NLE tasks, NetHackChallenge-v0 most closely reproduces the full game of NetHack 3.6.6; it was introduced in NLE v0.7.0 to grant NetHack Challenge competitors access to the widest possible action space, and to enforce an automated randomisation of the starting character (by race, role, alignment, and gender). By virtue of using ttyrec3.bz2 files, in-game scores and actions (in the form of keypresses) are stored along with the states, as well as metadata about the episodes.
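For readers who want to produce recordings in the same format, the sketch below runs a policy on NetHackChallenge-v0 with recording enabled. The savedir keyword is our assumption for how the built-in recording feature is exposed, and the random policy is a stand-in; AutoAscend itself would select the actions.

import gym
import nle  # noqa: F401  # registers the NetHack environments with gym

# Assumed: passing savedir enables the built-in ttyrec3.bz2 recording.
env = gym.make("NetHackChallenge-v0", savedir="recordings")
obs = env.reset()
done = False
while not done:
    action = env.action_space.sample()  # a real bot would choose actions here
    obs, reward, done, info = env.step(action)
env.close()  # recordings and xlogfile metadata are written under savedir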
Dataset Skill. The AutoAscend trajectories demonstrate a strong and reliable NetHack player, far exceeding all deep learning based approaches but still falling short of an expert human player. NetHack broadly defines a character with less than 2000 score as a ‘Beginner’. AutoAscend comfortably surpasses the ‘Beginner’ classification in more than 75% of games for all roles, and in 95% of games for easier roles like Monk (see Figure 1). Given the high-variance nature of NetHack games, and the challenge of playing with the more difficult roles, this is an impressive feat.

Compared to the human players in NLD-NAO, AutoAscend's policy finds itself just within the top 15% of all players when ranked by mean score. When ranked by median score it comes within the top 7%. However, these metrics are somewhat distorted by the long tail of dilettante players who played only a few games. If we instead define a ‘competent’ human player as one who has ever advanced beyond the Beginner classification, then AutoAscend ranks in the top 33% of players by mean score, and in the top 15% by median.
The competence of AutoAscend contrasts both with the poor performance of deep RL bots and with the exceptional performance required to beat the game. As the winning symbolic bot of the NetHack Challenge, AutoAscend beat the winning deep learning bot by almost a factor of 3 in median score, and by close to a factor of 5 in mean score. This performance is far outside what deep RL agents can currently achieve, and in some domains NLD-AA may be considered an "expert" dataset. However,
6 https://github.com/maciej-sypetkowski/autoascend
7 https://nethackwiki.com/wiki/Beginner
8 Median score was the primary metric in the NetHack Challenge.