
Figure 1: The five games used in the Atari-5 subset. From left to right: Battle Zone, Double Dunk, Name This Game, Phoenix, and Q*bert.
capture most of the useful information of a full run. We present a new benchmark, Atari-5, which
produces scores that correlate very closely with median score estimates on the full dataset, but at less
than one-tenth the cost. We hope that this new benchmark will allow more researchers to participate in
this important field of research, speed up the development of novel algorithms through faster iteration,
and make the replication of results in RL more feasible. Our primary contributions are as follows. First,
a methodology for selecting representative subsets of multi-environment RL benchmarks according
to a target summary score. Second, the introduction of the Atari-5 benchmark. Finally, evidence
demonstrating the high degree of correlation between scores for many games within ALE.
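To make the first contribution concrete, the sketch below shows one way a small subset can be calibrated against a target summary score: a log-linear regression is fit from per-game scores on five subset games to the median score over the full suite. The game list, the human-normalized inputs, the score clipping, and the use of scikit-learn are assumptions made for this illustration; they are not a description of the exact procedure developed later in the paper.

```python
# Illustrative sketch only: a log-linear regression from five subset scores to
# the full-suite median. Game names, clipping, and data layout are assumptions.
import numpy as np
from sklearn.linear_model import LinearRegression

SUBSET = ["BattleZone", "DoubleDunk", "NameThisGame", "Phoenix", "Qbert"]

def fit_subset_model(runs: dict[str, dict[str, float]]) -> LinearRegression:
    """Fit log(median full-suite score) ~ log(subset scores) over past runs.

    `runs` maps algorithm names to human-normalized per-game scores for the
    full suite; scores are clipped to a small positive value before the log.
    """
    X = np.log10([[max(r[g], 1e-3) for g in SUBSET] for r in runs.values()])
    y = np.log10([max(np.median(list(r.values())), 1e-3) for r in runs.values()])
    return LinearRegression().fit(X, y)

def predict_median(model: LinearRegression, subset_scores: dict[str, float]) -> float:
    """Estimate an algorithm's full-suite median from its five subset scores."""
    x = np.log10([[max(subset_scores[g], 1e-3) for g in SUBSET]])
    return float(10 ** model.predict(x)[0])
```

Under these assumptions, held-out algorithms can be used to check how closely the predicted medians track the true full-suite medians.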
2 Background and Related Work
The Arcade Learning Environment (ALE).
Despite deceptively simple graphics, ALE [5] provides
a challenging set of environments for RL algorithms. Much of this challenge stems from the fact
that ALE contains games spanning multiple genres, including sport, shooter, maze and action games,
unlike most other RL benchmarks. This is important because being able to achieve goals in a
wide range of environments has often been suggested as a useful definition of machine intelligence
[18]. In addition, unlike many other RL benchmarks, the environments within ALE were designed
explicitly for human play, rather than machine play. As such, ALE includes games that are extremely
challenging for machines but for which we know a learnable (human) solution exists. Human
reference scores also provide a means by which ‘good’ scores can be quantified.
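For concreteness, the human-normalized score convention widely used with ALE (the notation here is ours) rescales an agent's raw score so that 0 corresponds to random play and 100 to the human reference:

$$\text{score}_{\text{norm}} = 100 \times \frac{\text{score}_{\text{agent}} - \text{score}_{\text{random}}}{\text{score}_{\text{human}} - \text{score}_{\text{random}}}.$$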
ALE Subsets.
Many researchers make use of ALE subsets when presenting results. However,
decisions about which games to include have varied from paper to paper. The Deep Q-learning
Network (DQN) paper [21] used a seven-game subset, while the Asynchronous Advantage Actor-Critic (A3C) paper [20] used only five of these games. A useful taxonomy was introduced by Bellemare et al. [4], which includes the often-cited hard exploration subset. A more comprehensive
list of papers and the subsets they used is given in Appendix B, highlighting that there is currently no
standard ALE subset. The key difference in our work is that our games were selected by a principled approach to be representative of the dataset as a whole. To our knowledge, this is the
first work that investigates the selection and weighting of an RL benchmark suite.
Computationally Feasible Research.
As has been pointed out, the computational requirements for
generating results on ALE can be excessive [7]. There have been several attempts to reduce this cost, including optimizations to the simulator codebase³, and a GPU implementation of the environments [10]. However, as these changes reduce the cost of simulating the environment and not the much higher cost of training a policy, they result in only minor improvements. Asynchronous parallel training [14] partially addresses this by making better use of parallel hardware but still requires access
to large computer clusters to be effective. While these improvements have been helpful, they do not
go far enough to make ALE tractable for many researchers. Furthermore, our subsetting approach is
complementary to the approaches above.
Alternatives to ALE.
Since ALE’s introduction, many other RL benchmarks have been put forward,
such as [8, 17, 3, 23]. However, ALE still dominates research, and at the time of publishing, ALE has more than double the citations of these other benchmarks combined (see Table 1). MinAtar [29]
addresses some of ALE’s issues by creating a new benchmark inspired by Atari that presents agents
with objects rather than pixels. This simplification speeds up training but requires previous algorithms
³ https://github.com/mgbellemare/Arcade-Learning-Environment/pull/265