Atari-5: Distilling the Arcade Learning Environment
down to Five Games
Matthew Aitchison
Australian National University
matthew.aitchison@anu.edu.au
Penny Sweetser
Australian National University
Marcus Hutter
Australian National University / DeepMind
Abstract
The Arcade Learning Environment (ALE) has become an essential benchmark for
assessing the performance of reinforcement learning algorithms. However, the
computational cost of generating results on the entire 57-game dataset limits ALE’s
use and makes the reproducibility of many results infeasible. We propose a novel
solution to this problem in the form of a principled methodology for selecting small
but representative subsets of environments within a benchmark suite. We applied
our method to identify a subset of five ALE games, called Atari-5, which produces
57-game median score estimates within 10% of their true values. Extending the
subset to ten games recovers 80% of the variance in log-scores for all games within
the 57-game set. We show this level of compression is possible due to a high degree
of correlation between many of the games in ALE.
1 Introduction
The Arcade Learning Environment (ALE) [5] has become the gold standard for evaluating the
performance of reinforcement learning (RL) algorithms on complex discrete control tasks. Since its
release in 2013, the benchmark has gained thousands of citations, and almost all state-of-the-art RL
algorithms have featured it in their work [25, 16, 14, 13, 12, 21]. However, results generated from
the full benchmark have typically been limited to a few large research groups.¹ We posit that the cost
of producing evaluations on the full dataset makes them infeasible for many researchers. Additionally, the
verification of superiority claims through statistical tests on multiple seeds remains exceedingly rare,
likely due to the high cost.
While technological improvements have reduced computation costs, this decrease has been greatly
outpaced by an increase in the scale at which RL algorithms are run. For example, the number
of frames used for training to produce state-of-the-art results on ALE increased by more than one-
thousand times in the five years between 2015's Deep Q-learning model [22] and the more recent
Agent57 [2]. Training modern machine learning models can also produce non-trivial amounts of
carbon emissions [26], for which we provide some analysis in Appendix A. Previous works have
dealt with this challenge by self-selecting ad hoc subsets of games. However, this approach adds bias
and makes comparison between works more challenging.
In this paper,² we outline a principled approach to selecting subsets of environments from an RL
benchmark suite. We show that, when carefully chosen, a surprisingly small subset of ALE can
1 Of the 17 algorithms listed on paperswithcode.com with full Atari-57 results, only one was from a research group outside Google or DeepMind.
2 Source code for this paper can be found at https://github.com/maitchison/Atari-5.
Figure 1: The five games used in the Atari-5 subset. From left to right: Battle Zone, Double Dunk, Name This Game, Phoenix, and Q*bert.
capture most of the useful information of a full run. We present a new benchmark, Atari-5, which
produces scores that correlate very closely with median score estimates on the full dataset, but at less
than one-tenth the cost. We hope that this new benchmark will allow more researchers to participate in
this important field of research, speed up the development of novel algorithms through faster iteration,
and make the replication of results in RL more feasible. Our primary contributions are as follows. First,
a methodology for selecting representative subsets of multi-environment RL benchmarks according
to a target summary score. Second, the introduction of the Atari-5 benchmark. Finally, evidence
demonstrating the high degree of correlation between scores for many games within ALE.
2 Background and Related Work
The Arcade Learning Environment (ALE).
Despite deceptively simple graphics, ALE [5] provides
a challenging set of environments for RL algorithms. Much of this challenge stems from the fact
that ALE contains games spanning multiple genres, including sport, shooter, maze and action games,
unlike most other RL benchmarks. This is important because being able to achieve goals in a
wide range of environments has often been suggested as a useful definition of machine intelligence
[18]. In addition, unlike many other RL benchmarks, the environments within ALE were designed
explicitly for human play, rather than machine play. As such, ALE includes games that are extremely
challenging for machines but for which we know a learnable (human) solution exists. Human
reference scores also provide a means by which ‘good’ scores can be quantified.
ALE Subsets.
Many researchers make use of ALE subsets when presenting results. However,
decisions about which games to include have varied from paper to paper. The Deep Q-learning
Network (DQN) paper [21] used a seven-game subset, while the Asynchronous Advantage Actor-
Critic (A3C) paper [20] used only five of these games. A useful taxonomy was introduced by
Bellemare et al. [4], which includes the often-cited hard exploration subset.
list of papers and the subsets they used is given in Appendix B, highlighting that there is currently no
standard ALE subset. The key difference in our work is that our selection of games has been chosen
by a principled approach to be representative of the dataset as a whole. To our knowledge, this is the
first work that investigates the selection and weighting of an RL benchmark suite.
Computationally Feasible Research.
As has been pointed out, the computational requirements for
generating results on ALE can be excessive [7]. There have been several attempts to reduce this cost,
including optimizations to the simulator codebase³ and a GPU implementation of the environments
[10]. However, as these changes reduce the cost of simulating the environment and not the much
higher cost of training a policy, they result in only minor improvements. Asynchronous parallel
training [14] partially addresses this by making better use of parallel hardware but still requires access
to large computer clusters to be effective. While these improvements have been helpful, they do not
go far enough to make ALE tractable for many researchers. Furthermore, our subsetting approach is
complementary to the approaches above.
Alternatives to ALE.
Since ALE’s introduction, many other RL benchmarks have been put forward,
such as [8, 17, 3, 23]. However, ALE still dominates research, and at the time of publishing, ALE
has more than double the citations of these other benchmarks combined (see Table 1). Minatar [29]
addresses some of ALE’s issues by creating a new benchmark inspired by Atari that presents agents
with objects rather than pixels. This simplification speeds up training but requires previous algorithms
3 https://github.com/mgbellemare/Arcade-Learning-Environment/pull/265
Benchmark           Citations
ALE [5]                 2,295
Vizdoom [17]              610
DeepMind Lab [3]          380
Procgen [8]               166
Gym Retro [23]            119

Table 1: Popular discrete-action vision-based benchmarks in RL by citation. Citations are as of March 2022.
to be trained on the new environments and lacks the visual component of the task, an important part of
ALE's challenge.
3 Summary Scores on Subsets of Benchmark Suites
When working with a benchmark made up of multiple environments, it is often useful to distil the
performance of an algorithm, over all environments, down to a single ‘summary score’ [5]. If this
summary score uses scores from only a subset of the environments, then the time taken to generate it
can be reduced. Ideally, this subset summary score would:
1. Preserve order.
2. Provide useful information about the performance of the algorithm.
3. Minimize the number of environments needed to be evaluated.
We formalize these desiderata as follows. Let $M = \{\mu_i \mid i \in [1..n]\}$ be an ordered set of $n$ unique
environments that make up the benchmark. For some algorithm $a$ we write the empirical score on
the $i$-th environment as $a_i$. Then, for this set of environments, we say that for a pair of algorithms
$a, b$, we have $a \geq b$ if and only if $a_i \geq b_i \ \forall i \in [1..n]$, and that $a > b$ if there also exists
some $i$ such that $a_i > b_i$. We also define a summary score $\Phi_M$ on scores from benchmark $M$ as
a function $\Phi_M : \mathbb{R}^n \to \mathbb{R}$ reducing a score vector $\langle a_1, a_2, \ldots, a_n \rangle$ taken from benchmark $M$ to a
scalar indicating performance. For convenience, we write the summary of individual scores for an
algorithm, $\Phi_M(\langle a_1, a_2, \ldots, a_n \rangle)$, as $\Phi_M(a)$. Finally, we define a subset score as a summary score
which uses a strict subset of $M$.
3.1 Order Preservation
Ideally, a subset summary score would be strictly order preserving, in that if $a > b$ then $\Phi(a) > \Phi(b)$.
Unfortunately, this is not possible.

Proposition 1: No subset score can be strictly increasing.

Proof. Let $M'$ be the strict subset of environments used by $\Phi_M$, then select $z \in [1..n]$ such that
$\mu_z \in M \setminus M'$. Now consider two algorithms $a, b$, where $a_i = b_i \ \forall i \neq z$, and $a_z > b_z$. Hence we
have that $a > b$, but not $\Phi(a) > \Phi(b)$.

Therefore, the next best thing we can do is to require that our subset score be at least weakly order
preserving; that is, if $a > b$ we cannot have $\Phi_M(a) < \Phi_M(b)$.
Proposition 2: Any summary score of the form
$$\Phi_M(a) = \sum_{i=1}^{n} c_i a_i$$
with $c_i \in \mathbb{R}_{\geq 0}$, is weakly increasing.
Proof. See Appendix C.
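For intuition, a one-line version of the argument (ours, and not a substitute for the appendix proof): if $a \geq b$ then $a_i \geq b_i$ for every $i$, so
$$\Phi_M(a) - \Phi_M(b) = \sum_{i=1}^{n} c_i (a_i - b_i) \geq 0,$$
since each term is a product of non-negative factors.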
We therefore find that a linear combination of scores, if constrained to non-negative weights, is weakly
order preserving. We note that this is not the case for more sophisticated non-linear models, such
as deep neural networks, which might reverse the order of algorithms. We also note that any
monotonically increasing function $\phi : \mathbb{R} \to \mathbb{R}$ can be applied to the individual game scores, which
we will take advantage of by using a log transform. We can therefore guarantee that if algorithm $a$
is strictly better than algorithm $b$, it can be no worse when measured using a weighted subset score,
which is, as shown, the best we can do.
3.2 Provide Useful Information
As shown, a weighted subset score produces a monotonic measure of an algorithm's performance. However,
the question of which subset to use and how to weigh the scores remains. A trivial solution would
be to weigh each score by 0, which would be fast to generate and weakly order-preserving but
completely uninformative. Therefore, we propose that individual scores should be weighted such
that the summary score provides meaningful information about the performance of the algorithm.
For established benchmarks, meaningful summary performance measures have often already been
established. For example, with ALE, the most common summary score is median score performance.
Therefore, we suggest that the prediction of an established summary score (target score) provides a
sensible strategy for selecting and weighting the environments within the suite. That is, to find the
subset of environments that best predicts the target score and weigh them according to their linear
regression coefficients.
3.3 Procedure
We therefore propose that using linear regression models to predict a meaningful target summary
score provides a good solution to the desiderata outlined above, and we formalize the procedure as
follows. Suppose we are given $m$ algorithms $a^k$, $k = 1 \ldots m$, together with their individual
evaluations $s^k_i \in \mathbb{R}$ on environments $i \in \{1:n\}$ and a total/aggregate score $t^k \in \mathbb{R}$,
e.g. $t^k = s^k_1 + \ldots + s^k_n$, or the average, or the median. The “training” data set is hence
$D = \{(s^k_1, \ldots, s^k_n; t^k) : k = 1 \ldots m\}$. Let $I \subseteq \{1:n\}$ be a subset of games and
$s_I := (s_i : i \in I)$ be a corresponding score sub-vector. We want to find a mapping
$f^*_I : \mathbb{R}^I \to \mathbb{R}$ that best predicts $t$ from $s_I$, i.e. formally
$f^*_I = \arg\min_f \sum_{k=1}^{m} (t^k - f(s^k_I))^2$. Since for $I = \{1:n\}$ the best $f$ is perfect
and linear, it is natural to consider linear $f$, which leads to an efficiently solvable linear regression
problem. We further want to find a small $I$ with small error. We hence further minimize over all
$I \subseteq \{1:n\}$ of some fixed size $|I| = C$. The optimal set of games $I^*$, with the best linear
evaluation function $f^*$ based on the games in $I$, is hence

$$(I^*, f^*) := \underset{|I| = C}{\arg\min}\ \underset{f \in \mathrm{Linear}_I}{\arg\min}\ \sum_{k=1}^{m} \left( t^k - f(s^k_I) \right)^2 .$$
4 Application to ALE
This section provides details for the application of our method to the popular ALE benchmark,
but note that this method could also be applied to other benchmark suites, such as Procgen [8] or
MuJoCo [27], or even to benchmarks outside of RL.
4.1 Data and Processing
We used the website paperswithcode⁴ as the primary source of data for our experiments. This website
contains scores for algorithms with published results on various benchmarks, including ALE. The
dataset was then supplemented with additional results from papers not included on the website.
Additions included various algorithms, such as evolutionary algorithms and shallow neural networks.
A full list of algorithms and their sources contained within the dataset can be found in Appendix D.
We removed any algorithms that had results for fewer than 40 of the 57 games in the standard ALE
dataset on the grounds that a reasonable median estimate could not be obtained. This reduced our
dataset of 116 algorithms down to 62. When fitting models for subsets, any algorithm missing a
score for a required game was excluded for that subset. We also excluded any games with results for
fewer than 40 of the remaining 62 algorithms. The only game meeting this criterion was Surround. All
scores were normalized using the standard approach from [5]:
$$Z_i(x) := 100 \times \frac{x_i - r_i}{h_i - r_i} \tag{1}$$
4 https://paperswithcode.com/dataset/arcade-learning-environment
where $r_i$ and $h_i$ are the random and human scores for environment $i$, respectively, which we include in
Appendix 12. We found that, even after normalization, scores differed across games by many orders
of magnitude and produced non-normal residuals. We therefore applied the following transform
$$\phi(x) := \log_{10}(1 + \max(0, x)) \tag{2}$$
$$\phi^{-1}(x) = 10^x - 1 \tag{3}$$
to the normalized scores, producing log-normalized scores. Clipping was required as algorithms
occasionally produce scores slightly worse than random, generating a negative normalized score.
Both input and target variables were transformed using $\phi$, and all loss measures were calculated
in transformed space. This transformation also has the effect that higher-scoring games do not
dominate the loss.
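The preprocessing above can be summarised in a few lines. This is a sketch under our own naming assumptions (`human_normalise`, `phi`, `phi_inverse`, and the per-game `random_score`/`human_score` arguments); only the formulas themselves, Equations (1)-(3), come from the text.

```python
import numpy as np

def human_normalise(raw, random_score, human_score):
    """Eq. (1): 100 * (x_i - r_i) / (h_i - r_i), with r_i random and h_i human."""
    return 100.0 * (raw - random_score) / (human_score - random_score)

def phi(x):
    """Eq. (2): log-transform of the clipped normalised score."""
    return np.log10(1.0 + np.maximum(0.0, x))

def phi_inverse(x):
    """Eq. (3): map a predicted log-score back to a normalised score."""
    return 10.0 ** x - 1.0
```

A random policy maps to $Z_i = 0$ and hence $\phi(Z_i) = 0$, while human-level play maps to $Z_i = 100$, which is why the clipping in Equation (2) only affects worse-than-random scores.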
4.2 Predictive Ability of Individual Games
To better understand the games within ALE, we produced single-game linear models where the score
from a single environment was used to predict the median Atari-57 score. We then ranked games
according to their predictive ability as measured by their $R^2$. Due to a small parameter count ($n = 2$),
cross-validation was not used.
4.3 Subset Search
To assess the best selection of games for each subset, we evaluated all subsets of the ALE games
of size five, three, or one that met our criteria as defined in Section 4.1.⁵ Subsets were evaluated
by fitting linear regression models to the data, using the log-normalized scores of games within the
subset to predict the overall median, and selecting the model with the lowest 10-fold cross-validated
mean squared error. For these models, we disabled intercepts, as we wanted a random policy to
produce a score of 0.
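As one illustration of how a single candidate subset could be scored under this procedure, the sketch below assumes scikit-learn; the helper name `cv_mse` and the data layout (`S_log` as an (m, n) matrix of log-normalized scores, `t_log` as the log-transformed target) are our assumptions rather than the paper's code.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_score

def cv_mse(S_log: np.ndarray, t_log: np.ndarray, subset) -> float:
    """10-fold cross-validated MSE of a no-intercept linear model that
    predicts the target summary score from one candidate game subset."""
    X = S_log[:, list(subset)]
    model = LinearRegression(fit_intercept=False)
    # cross_val_score maximises its scoring function, hence the negated MSE.
    neg_mse = cross_val_score(model, X, t_log, cv=10,
                              scoring="neg_mean_squared_error")
    return float(-neg_mse.mean())
```

Setting `fit_intercept=False` mirrors the choice above of forcing a random policy (all-zero log-scores) to map to a predicted score of 0.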
To make the best use of the 57 games available, we carefully ordered the subset searches to maximize
the performance of the most common subset size while maintaining the property that smaller subsets
are true subsets of their larger counterparts. All subsets of five games were evaluated, with the best
subset selected as Atari-5. The Atari-3 subset was then selected as the best triple game subset of
the Atari-5 subset and Atari-1 as the best single-game subset of Atari-3. Doing this ensures that
only five games are included in the test set and also that evaluations on Atari-1 or Atari-3 can more
easily be upgraded to Atari-5 if required. For the validation subsets, we selected the best three games,
excluding the games used in Atari-5, as the validation set. We then extended this set to five games by
choosing the best two additional games. Finally, the ten-game Atari-10 set was generated by selecting
the best five additional games in neither Atari-5 nor Atari-5-Val.
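The nested ordering could be expressed as a small wrapper around the cross-validation helper sketched earlier; this is a hypothetical illustration (reusing our `cv_mse` and the same assumed data layout), not the paper's code.

```python
import itertools

def best_nested_subset(S_log, t_log, parent, size):
    """Best sub-subset of `parent` of the given size, ranked by cv_mse()."""
    return min(itertools.combinations(parent, size),
               key=lambda I: cv_mse(S_log, t_log, I))

# Hypothetical usage, so Atari-3 is drawn from within Atari-5 and Atari-1 from Atari-3:
# atari3 = best_nested_subset(S_log, t_log, atari5, 3)
# atari1 = best_nested_subset(S_log, t_log, atari3, 1)
```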
Searching over all subsets took approximately one hour on a 12-core machine. No GPU resources were
used. Source code to reproduce the results is provided in the supplementary material.
4.4 Prediction over All Game Scores
To better understand how much information our subsets capture about the 57 game dataset, we trained
57 linear regression models for the best five-game and ten-game subsets, predicting log-normalized
scores for each game within the 57-game set. We then evaluated each of the 57 models according to
their $R^2$, allowing us to identify which environments are well or poorly captured by the subset. These models
also enable score predictions for any game within the 57-game set.
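A minimal sketch of this per-game analysis, under the same assumed data layout as the earlier snippets (our function name `per_game_r2`; scikit-learn assumed):

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import r2_score

def per_game_r2(S_log: np.ndarray, subset) -> np.ndarray:
    """R^2 of predicting each game's log-normalized score from the subset's scores."""
    X = S_log[:, list(subset)]
    n_games = S_log.shape[1]
    r2 = np.empty(n_games)
    for j in range(n_games):
        y = S_log[:, j]                      # target: one game's log-normalized scores
        model = LinearRegression(fit_intercept=False).fit(X, y)
        r2[j] = r2_score(y, model.predict(X))
    return r2
```

Games in the subset trivially achieve $R^2 = 1$; the interesting output is how well the remaining games are predicted.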
5 Results
In this section, we present the results of our subset search on the ALE dataset using the Atari-57
median score as the target metric.
5 We found diminishing returns after five games; however, we also provide models for a larger ten-game set if additional precision is required.