Atari-5: Distilling the Arcade Learning Environment
down to Five Games
Matthew Aitchison
Australian National University
matthew.aitchison@anu.edu.au
Penny Sweetser
Australian National University
Marcus Hutter
Australian National University / DeepMind
Abstract
The Arcade Learning Environment (ALE) has become an essential benchmark for
assessing the performance of reinforcement learning algorithms. However, the
computational cost of generating results on the entire 57-game dataset limits ALE’s
use and makes the reproducibility of many results infeasible. We propose a novel
solution to this problem in the form of a principled methodology for selecting small
but representative subsets of environments within a benchmark suite. We applied
our method to identify a subset of five ALE games, called Atari-5, which produces
57-game median score estimates within 10% of their true values. Extending the
subset to ten games recovers 80% of the variance in log-scores for all games within
the 57-game set. We show this level of compression is possible due to a high degree
of correlation between many of the games in ALE.
1 Introduction
The Arcade Learning Environment (ALE) [5] has become the gold standard for evaluating the
performance of reinforcement learning (RL) algorithms on complex discrete control tasks. Since its
release in 2013, the benchmark has gained thousands of citations, and almost all state-of-the-art RL
algorithms have featured it in their work [25, 16, 14, 13, 12, 21]. However, results generated from
the full benchmark have typically been limited to a few large research groups.¹ We posit that the cost
of producing evaluations on the full dataset makes them infeasible for many researchers. Additionally, the
verification of superiority claims through statistical tests on multiple seeds remains exceedingly rare,
likely due to the high cost.
While technological improvements have reduced computation costs, this decrease has been greatly
outpaced by an increase in the scale at which RL algorithms are run. For example, the number
of frames used for training to produce state-of-the-art results on ALE increased by more than one-
thousand times in the five years between 2015's Deep Q-learning model [22] and the more recent
Agent57 [2]. Training modern machine learning models can also produce non-trivial amounts of
carbon emissions [26], for which we provide some analysis in Appendix A. Previous works have
dealt with this challenge by self-selecting ad hoc subsets of games. However, this approach adds bias
and makes comparison between works more challenging.
In this paper,² we outline a principled approach to selecting subsets of environments from an RL
benchmark suite. We show that, when carefully chosen, a surprisingly small subset of ALE can
1 Of the 17 algorithms listed on paperswithcode.com with full Atari-57 results, only one was from a research group outside Google or DeepMind.
2 Source code for this paper can be found at https://github.com/maitchison/Atari-5.
Figure 1: The five games used in the Atari-5 subset. From left to right: Battle Zone, Double Dunk, Name This Game, Phoenix, and Q*bert.
capture most of the useful information of a full run. We present a new benchmark, Atari-5, which
produces scores that correlate very closely with median score estimates on the full dataset, but at less
than one-tenth the cost. We hope that this new benchmark will allow more researchers to participate in
this important field of research, speed up the development of novel algorithms through faster iteration,
and make the replication of results in RL more feasible. Our primary contributions are as follows. First,
a methodology for selecting representative subsets of multi-environment RL benchmarks according
to a target summary score. Second, the introduction of the Atari-5 benchmark. Finally, evidence
demonstrating the high degree of correlation between scores for many games within ALE.
2 Background and Related Work
The Arcade Learning Environment (ALE).
Despite deceptively simple graphics, ALE [5] provides
a challenging set of environments for RL algorithms. Much of this challenge stems from the fact
that ALE contains games spanning multiple genres, including sport, shooter, maze and action games,
unlike most other RL benchmarks. This is important because being able to achieve goals in a
wide range of environments has often been suggested as a useful definition of machine intelligence
[18]. In addition, unlike many other RL benchmarks, the environments within ALE were designed
explicitly for human play, rather than machine play. As such, ALE includes games that are extremely
challenging for machines but for which we know a learnable (human) solution exists. Human
reference scores also provide a means by which ‘good’ scores can be quantified.
ALE Subsets.
Many researchers make use of ALE subsets when presenting results. However,
decisions about which games to include have varied from paper to paper. The Deep Q-learning
Network (DQN) paper [21] used a seven-game subset, while the Asynchronous Advantage Actor-
Critic (A3C) paper [20] used only five of these games. A useful taxonomy was introduced by
Bellemare et al. [4], which includes the often-cited hard exploration subset.
list of papers and the subsets they used is given in Appendix B, highlighting that there is currently no
standard ALE subset. The key difference in our work is that our selection of games has been chosen
by a principled approach to be representative of the dataset as a whole. To our knowledge, this is the
first work that investigates the selection and weighting of an RL benchmark suite.
Computationally Feasible Research.
As has been pointed out, the computational requirements for
generating results on ALE can be excessive [7]. There have been several attempts to reduce this cost,
including optimizations to the simulator codebase³ and a GPU implementation of the environments
[10]. However, as these changes reduce the cost of simulating the environment and not the much
higher cost of training a policy, they result in only minor improvements. Asynchronous parallel
training [14] partially addresses this by making better use of parallel hardware but still requires access
to large computer clusters to be effective. While these improvements have been helpful, they do not
go far enough to make ALE tractable for many researchers. Furthermore, our subsetting approach is
complementary to the approaches above.
Alternatives to ALE.
Since ALE’s introduction, many other RL benchmarks have been put forward,
such as [8, 17, 3, 23]. However, ALE still dominates research, and at the time of publishing, ALE
has more than double the citations of these other benchmarks combined (see Table 1). Minatar [29]
addresses some of ALE’s issues by creating a new benchmark inspired by Atari that presents agents
with objects rather than pixels. This simplification speeds up training but requires previous algorithms
3 https://github.com/mgbellemare/Arcade-Learning-Environment/pull/265
Benchmark           Citations
ALE [5]                 2,295
Vizdoom [17]              610
DeepMind Lab [3]          380
Procgen [8]               166
Gym Retro [23]            119

Table 1: Popular discrete-action vision-based benchmarks in RL by citation. Citations are as of March 2022.
to be trained on the new environments and lacks the visual component of the task, an important part of
ALE's challenge.
3 Summary Scores on Subsets of Benchmark Suites
When working with a benchmark made up of multiple environments, it is often useful to distil the
performance of an algorithm, over all environments, down to a single ‘summary score’ [5]. If this
summary score uses scores from only a subset of the environments, then the time taken to generate it
can be reduced. Ideally, this subset summary score would:
1. Preserve order.
2. Provide useful information about the performance of the algorithm.
3. Minimize the number of environments needed to be evaluated.
We formalize these desiderata as follows. Let $M = \{\mu_i \mid i \in [1..n]\}$ be an ordered set of $n$ unique
environments that make up the benchmark. For some algorithm $a$ we write the empirical score on
the $i$-th environment as $a_i$. Then, for this set of environments, we say that for a pair of algorithms
$a, b$, we have $a \geq b$ if and only if $a_i \geq b_i \ \forall i \in [1..n]$, and that $a > b$ if there also exists
some $i$ such that $a_i > b_i$. We also define a summary score $\Phi_M$ on scores from benchmark $M$ as
a function $\Phi_M : \mathbb{R}^n \to \mathbb{R}$ reducing a score vector $\langle a_1, a_2, \ldots, a_n \rangle$ taken from benchmark $M$ to a
scalar indicating performance. For convenience, we write the summary of individual scores for an
algorithm, $\Phi_M(\langle a_1, a_2, \ldots, a_n \rangle)$, as $\Phi_M(a)$. Finally, we define a subset score as a summary score
which uses a strict subset of $M$.
3.1 Order Preservation
Ideally, a subset summary score would be strictly order preserving, in that if $a > b$ then $\Phi(a) > \Phi(b)$.
Unfortunately, this is not possible.

Proposition 1: No subset score can be strictly increasing.

Proof. Let $M'$ be the strict subset of environments used by $\Phi_M$, then select $z \in [1..n]$ such that
$\mu_z \in M \setminus M'$. Now consider two algorithms $a, b$, where $a_i = b_i \ \forall i \neq z$, and $a_z > b_z$. Hence we
have that $a > b$, but not $\Phi(a) > \Phi(b)$.

Therefore, the next best thing we can do is to require that our subset score be at least weakly order
preserving; that is, if $a > b$ we cannot have $\Phi_M(a) < \Phi_M(b)$.
Proposition 2: Any summary score of the form
$$\Phi_M(a) = \sum_{i=1}^{n} c_i a_i$$
with $c_i \in \mathbb{R}_{\geq 0}$, is weakly increasing.
Proof. See Appendix C.
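For intuition, a one-line version of the argument (ours, and not a substitute for the appendix proof): if $a \geq b$ then $a_i \geq b_i$ for every $i$, so
$$\Phi_M(a) - \Phi_M(b) = \sum_{i=1}^{n} c_i (a_i - b_i) \geq 0,$$
since each term is a product of non-negative factors.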
We therefore find that a linear combination of scores, if constrained to non-negative weights, is weakly
order preserving. We note that this is not the case for more sophisticated non-linear models, such
as deep neural networks, which might reverse the order of algorithms. We also note that any
monotonically increasing function $\phi : \mathbb{R} \to \mathbb{R}$ can be applied to the individual game scores, which
we will take advantage of by using a log transform. We can therefore guarantee that if algorithm $a$
is strictly better than algorithm $b$, it can be no worse when measured using a weighted subset score,
which is, as shown, the best we can do.
3.2 Provide Useful Information
As shown, a weighted subset score produces a monotonic measure of an algorithm's performance. However,
the question of which subset to use and how to weigh the scores remains. A trivial solution would
be to weigh each score by 0, which would be fast to generate and weakly order-preserving but
completely uninformative. Therefore, we propose that individual scores should be weighted such
that the summary score provides meaningful information about the performance of the algorithm.
For established benchmarks, meaningful summary performance measures have often already been
established. For example, with ALE, the most common summary score is median score performance.
Therefore, we suggest that the prediction of an established summary score (target score) provides a
sensible strategy for selecting and weighting the environments within the suite. That is, to find the
subset of environments that best predicts the target score and weigh them according to their linear
regression coefficients.
3.3 Procedure
We therefore propose that using linear regression models to predict a meaningful target summary
score provides a good solution to the desiderata outlined above, and we formalize the procedure as
follows. Suppose we are given $m$ algorithms $a^k$, $k = 1 \ldots m$, together with their individual
evaluations $s^k_i \in \mathbb{R}$ on environments $i \in \{1:n\}$ and a total/aggregate score $t^k \in \mathbb{R}$,
e.g. $t^k = s^k_1 + \ldots + s^k_n$, or the average, or the median. The “training” data set is hence
$D = \{(s^k_1, \ldots, s^k_n; t^k) : k = 1 \ldots m\}$. Let $I \subseteq \{1:n\}$ be a subset of games and
$s_I := (s_i : i \in I)$ be a corresponding score sub-vector. We want to find a mapping
$f^*_I : \mathbb{R}^I \to \mathbb{R}$ that best predicts $t$ from $s_I$, i.e. formally
$f^*_I = \arg\min_f \sum_{k=1}^{m} (t^k - f(s^k_I))^2$. Since for $I = \{1:n\}$ the best $f$ is perfect
and linear, it is natural to consider linear $f$, which leads to an efficiently solvable linear regression
problem. We further want to find a small $I$ with small error. We hence further minimize over all
$I \subseteq \{1:n\}$ of some fixed size $|I| = C$. The optimal set of games $I^*$, with the best linear
evaluation function $f^*$ based on the games in $I$, is hence

$$(I^*, f^*) := \underset{|I| = C}{\arg\min}\ \underset{f \in \mathrm{Linear}_I}{\arg\min}\ \sum_{k=1}^{m} \left( t^k - f(s^k_I) \right)^2 .$$
4 Application to ALE
This section provides details for the application of our method to the popular ALE benchmark,
but note that this method could also be applied to other benchmark suites, such as Procgen [8] or
MuJoCo [27], or even to benchmarks outside of RL.
4.1 Data and Processing
We used the website paperswithcode⁴ as the primary source of data for our experiments. This website
contains scores for algorithms with published results on various benchmarks, including ALE. The
dataset was then supplemented with additional results from papers not included on the website.
Additions included various algorithms, such as evolutionary algorithms and shallow neural networks.
A full list of algorithms and their sources contained within the dataset can be found in Appendix D.
We removed any algorithms that had results for fewer than 40 of the 57 games in the standard ALE
dataset on the grounds that a reasonable median estimate could not be obtained. This reduced our
dataset of 116 algorithms down to 62. When fitting models for subsets, any algorithm missing a
score for a required game was excluded for that subset. We also excluded any games with results for
fewer than 40 of the remaining 62 algorithms. The only game meeting this criterion was Surround. All
scores were normalized using the standard approach from [5]:
$$Z_i(x) := 100 \times \frac{x_i - r_i}{h_i - r_i} \tag{1}$$
4 https://paperswithcode.com/dataset/arcade-learning-environment
where $r_i$ and $h_i$ are the random and human scores for environment $i$, respectively, which we include in
Appendix 12. We found that, even after normalization, scores differed across games by many orders
of magnitude and produced non-normal residuals. We therefore applied the following transform
$$\phi(x) := \log_{10}(1 + \max(0, x)) \tag{2}$$
$$\phi^{-1}(x) = 10^x - 1 \tag{3}$$
to the normalized scores, producing log-normalized scores. Clipping was required as algorithms
occasionally produce scores slightly worse than random, generating a negative normalized score.
Both input and target variables were transformed using $\phi$, and all loss measures were calculated
in transformed space. This transformation also has the effect that higher-scoring games do not
dominate the loss.
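The preprocessing above can be summarised in a few lines. This is a sketch under our own naming assumptions (`human_normalise`, `phi`, `phi_inverse`, and the per-game `random_score`/`human_score` arguments); only the formulas themselves, Equations (1)-(3), come from the text.

```python
import numpy as np

def human_normalise(raw, random_score, human_score):
    """Eq. (1): 100 * (x_i - r_i) / (h_i - r_i), with r_i random and h_i human."""
    return 100.0 * (raw - random_score) / (human_score - random_score)

def phi(x):
    """Eq. (2): log-transform of the clipped normalised score."""
    return np.log10(1.0 + np.maximum(0.0, x))

def phi_inverse(x):
    """Eq. (3): map a predicted log-score back to a normalised score."""
    return 10.0 ** x - 1.0
```

A random policy maps to $Z_i = 0$ and hence $\phi(Z_i) = 0$, while human-level play maps to $Z_i = 100$, which is why the clipping in Equation (2) only affects worse-than-random scores.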
4.2 Predictive Ability of Individual Games
To better understand the games within ALE, we produced single-game linear models where the score
from a single environment was used to predict the median Atari-57 score. We then ranked games
according to their predictive ability as measured by their $R^2$. Due to a small parameter count ($n = 2$),
cross-validation was not used.
4.3 Subset Search
To assess the best selection of games for each subset, we evaluated all subsets of the ALE games
of size five, three, or one that met our criteria as defined in Section 4.1.⁵ Subsets were evaluated
by fitting linear regression models to the data, using the log-normalized scores of games within the
subset to predict the overall median, and selecting the model with the lowest 10-fold cross-validated
mean squared error. For these models, we disabled intercepts, as we wanted a random policy to
produce a score of 0.
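As one illustration of how a single candidate subset could be scored under this procedure, the sketch below assumes scikit-learn; the helper name `cv_mse` and the data layout (`S_log` as an (m, n) matrix of log-normalized scores, `t_log` as the log-transformed target) are our assumptions rather than the paper's code.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_score

def cv_mse(S_log: np.ndarray, t_log: np.ndarray, subset) -> float:
    """10-fold cross-validated MSE of a no-intercept linear model that
    predicts the target summary score from one candidate game subset."""
    X = S_log[:, list(subset)]
    model = LinearRegression(fit_intercept=False)
    # cross_val_score maximises its scoring function, hence the negated MSE.
    neg_mse = cross_val_score(model, X, t_log, cv=10,
                              scoring="neg_mean_squared_error")
    return float(-neg_mse.mean())
```

Setting `fit_intercept=False` mirrors the choice above of forcing a random policy (all-zero log-scores) to map to a predicted score of 0.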
To make the best use of the 57 games available, we carefully ordered the subset searches to maximize
the performance of the most common subset size while maintaining the property that smaller subsets
are true subsets of their larger counterparts. All subsets of five games were evaluated, with the best
subset selected as Atari-5. The Atari-3 subset was then selected as the best triple game subset of
the Atari-5 subset and Atari-1 as the best single-game subset of Atari-3. Doing this ensures that
only five games are included in the test set and also that evaluations on Atari-1 or Atari-3 can more
easily be upgraded to Atari-5 if required. For the validation subsets, we selected the best three games,
excluding the games used in Atari-5, as the validation set. We then extended this set to five games by
choosing the best two additional games. Finally, the ten-game Atari-10 set was generated by selecting
the best five additional games in neither Atari-5 nor Atari-5-Val.
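The nested ordering could be expressed as a small wrapper around the cross-validation helper sketched earlier; this is a hypothetical illustration (reusing our `cv_mse` and the same assumed data layout), not the paper's code.

```python
import itertools

def best_nested_subset(S_log, t_log, parent, size):
    """Best sub-subset of `parent` of the given size, ranked by cv_mse()."""
    return min(itertools.combinations(parent, size),
               key=lambda I: cv_mse(S_log, t_log, I))

# Hypothetical usage, so Atari-3 is drawn from within Atari-5 and Atari-1 from Atari-3:
# atari3 = best_nested_subset(S_log, t_log, atari5, 3)
# atari1 = best_nested_subset(S_log, t_log, atari3, 1)
```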
Searching over all subsets took approximately one hour on a 12-core machine. No GPU resources were
used. Source code to reproduce the results is provided in the supplementary material.
4.4 Prediction over All Game Scores
To better understand how much information our subsets capture about the 57 game dataset, we trained
57 linear regression models for the best five-game and ten-game subsets, predicting log-normalized
scores for each game within the 57-game set. We then evaluated each of the 57 models according to
their $R^2$, allowing us to identify which environments are well or poorly captured by the subset. These models
also enable score predictions for any game within the 57-game set.
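A minimal sketch of this per-game analysis, under the same assumed data layout as the earlier snippets (our function name `per_game_r2`; scikit-learn assumed):

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import r2_score

def per_game_r2(S_log: np.ndarray, subset) -> np.ndarray:
    """R^2 of predicting each game's log-normalized score from the subset's scores."""
    X = S_log[:, list(subset)]
    n_games = S_log.shape[1]
    r2 = np.empty(n_games)
    for j in range(n_games):
        y = S_log[:, j]                      # target: one game's log-normalized scores
        model = LinearRegression(fit_intercept=False).fit(X, y)
        r2[j] = r2_score(y, model.predict(X))
    return r2
```

Games in the subset trivially achieve $R^2 = 1$; the interesting output is how well the remaining games are predicted.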
5 Results
In this section, we present the results of our subset search on the ALE dataset using the Atari-57
median score as the target metric.
5 We found diminishing returns after five games; however, we also provide models for a larger ten-game set if additional precision is required.