
SCALING LAWS FOR A MULTI-AGENT REINFORCEMENT LEARNING MODEL
Oren Neumann & Claudius Gros
Institute for Theoretical Physics
Goethe University Frankfurt
Frankfurt am Main, Germany
{neumann,gros}@itp.uni-frankfurt.de
ABSTRACT
The recent observation of neural power-law scaling relations has made a significant impact in the field of deep learning. As a consequence, substantial attention has been devoted to describing scaling laws, although mostly for supervised learning and only to a limited extent for reinforcement learning frameworks. In this paper we present an extensive study of performance scaling for a cornerstone reinforcement learning algorithm, AlphaZero. On the basis of a relationship between Elo rating, playing strength, and power-law scaling, we train AlphaZero agents on the games Connect Four and Pentago and analyze their performance. We find that player strength scales as a power law in neural network parameter count when agents are not bottlenecked by available compute, and as a power of compute when training optimally sized agents. We observe nearly identical scaling exponents for both games. Combining the two observed scaling laws, we obtain a power law relating optimal neural network size to compute, similar to the ones observed for language models. We find that the predicted scaling of optimal neural network size fits our data for both games. We also show that large AlphaZero models are more sample efficient, performing better than smaller models given the same amount of training data.
1 INTRODUCTION
In recent years, power-law scaling of performance indicators has been observed in a range of machine-learning architectures (Hestness et al., 2017; Kaplan et al., 2020; Henighan et al., 2020; Gordon et al., 2021; Hernandez et al., 2021; Zhai et al., 2022), such as Transformers, LSTMs, Routing Networks (Clark et al., 2022), and ResNets (Bello et al., 2021). The fields investigated include natural language processing and computer vision (Rosenfeld et al., 2019). Most of these scaling laws describe the dependency of the test loss on either dataset size, the number of neural network parameters, or training compute. The robustness of the observed scaling laws across many orders of magnitude has led to the creation of large models, with parameter counts in the tens and hundreds of billions (Brown et al., 2020; Hoffmann et al., 2022; Alayrac et al., 2022).
Until now, evidence for power-law scaling has come for the most part from supervised learning methods. Considerably less effort has been dedicated to the scaling of reinforcement learning algorithms, such as performance scaling with model size (Reed et al., 2022; Lee et al., 2022). At times, scaling laws remained unnoticed, since they show up not as power laws but as log-linear relations when Elo scores are taken as the performance measure in multi-agent reinforcement learning (MARL) (Jones, 2021; Liu et al., 2021) (see Section 3.2 and the sketch below). Of particular interest in this context is the AlphaZero family of models, AlphaGo Zero (Silver et al., 2017b), AlphaZero (Silver et al., 2017a), and MuZero (Schrittwieser et al., 2020), which achieved state-of-the-art performance on several board games without access to human gameplay datasets by applying a tree search guided by a neural network.
Here we present an extensive study of power-law scaling in the context of two-player open-information games. Our study constitutes, to our knowledge, the first investigation of power-law scaling phenomena for a MARL algorithm. Measuring the performance of the AlphaZero algorithm using Elo ratings, we follow a path similar to that of Kaplan et al. (2020) by providing evidence of power-law