SCALING LAWS FOR A MULTI-AGENT REINFORCEMENT LEARNING MODEL
Oren Neumann & Claudius Gros
Institute for Theoretical Physics
Goethe University Frankfurt
Frankfurt am Main, Germany
{neumann,gros}@itp.uni-frankfurt.de
ABSTRACT
The recent observation of neural power-law scaling relations has made a significant impact in the field of deep learning. A substantial amount of attention has consequently been dedicated to the description of scaling laws, although mostly for supervised learning and only to a reduced extent for reinforcement learning frameworks. In this paper we present an extensive study of performance scaling for a cornerstone reinforcement learning algorithm, AlphaZero. On the basis of a relationship between Elo rating, playing strength and power-law scaling, we train AlphaZero agents on the games Connect Four and Pentago and analyze their performance. We find that player strength scales as a power law in neural network parameter count when not bottlenecked by available compute, and as a power of compute when training optimally sized agents. We observe nearly identical scaling exponents for both games. Combining the two observed scaling laws, we obtain a power law relating optimal size to compute, similar to the ones observed for language models. We find that the predicted scaling of optimal neural network size fits our data for both games. We also show that large AlphaZero models are more sample efficient, performing better than smaller models with the same amount of training data.
1 INTRODUCTION
In recent years, power-law scaling of performance indicators has been observed in a range of machine-learning architectures (Hestness et al., 2017; Kaplan et al., 2020; Henighan et al., 2020; Gordon et al., 2021; Hernandez et al., 2021; Zhai et al., 2022), such as Transformers, LSTMs, Routing Networks (Clark et al., 2022) and ResNets (Bello et al., 2021). The fields investigated include natural language processing and computer vision (Rosenfeld et al., 2019). Most of these scaling laws regard the dependency of test loss on either dataset size, number of neural network parameters, or training compute. The robustness of the observed scaling laws across many orders of magnitude led to the creation of large models, with parameters numbering in the tens and hundreds of billions (Brown et al., 2020; Hoffmann et al., 2022; Alayrac et al., 2022).
Until now, evidence for power-law scaling has come for the most part from supervised learning methods. Considerably less effort has been dedicated to the scaling of reinforcement learning algorithms, such as performance scaling with model size (Reed et al., 2022; Lee et al., 2022). At times, scaling laws remained unnoticed, given that they show up not as power laws, but as log-linear relations when Elo scores are taken as the performance measure in multi-agent reinforcement learning (MARL) (Jones, 2021; Liu et al., 2021) (see Section 3.2). Of particular interest in this context is the AlphaZero family of models, AlphaGo Zero (Silver et al., 2017b), AlphaZero (Silver et al., 2017a), and MuZero (Schrittwieser et al., 2020), which achieved state-of-the-art performance on several board games without access to human gameplay datasets by applying a tree search guided by a neural network.
Figure 1: Left: Optimal number of neural network parameters for different amounts of available compute. The optimal agent size scales for both Connect Four and Pentago as a single power law with available compute. The predicted slope α_C^opt = α_C/α_N of Eq. (7) matches the observed data, where α_C and α_N are the compute and model-size scaling exponents, respectively. See Table 3 for the numerical values. Right: The same graph zoomed out to include the resources used to create AlphaZero (Silver et al., 2017a) and AlphaGo Zero (Silver et al., 2017b). These models stand well below the optimal trend for Connect Four and Pentago.

Here we present an extensive study of power-law scaling in the context of two-player open-information games. Our study constitutes, to our knowledge, the first investigation of power-law scaling phenomena for a MARL algorithm. Measuring the performance of the AlphaZero algorithm using Elo rating, we follow a similar path as Kaplan et al. (2020) by providing evidence of power-law
scaling of playing strength with model size and compute, as well as a power law of optimal model
size with respect to available compute. Focusing on AlphaZero-agents that are guided by neural nets
with fully connected layers, we test our hypothesis on two popular board games: Connect Four and
Pentago. These games are selected for being different from each other with respect to branching
factors and game lengths.
Using the Bradley-Terry model definition of playing strength (Bradley & Terry, 1952), we start by showing that playing strength scales as a power law with neural network size when models are trained until convergence in the limit of abundant compute. We find that agents trained on Connect Four and Pentago scale with similar exponents.
In a second step we investigate the trade-off between model size and compute. Similar to the scaling observed in the game Hex (Jones, 2021), we observe power-law scaling when compute is limited, again with similar exponents for Connect Four and Pentago. Finally, we utilize these two scaling laws to find a scaling law for the optimal model size given the amount of compute available, as shown in Fig. 1. We find that the optimal neural network size scales as a power law with compute, with an exponent that can be derived from the individual size-scaling and compute-scaling exponents. All code and data used in our experiments are available online at https://github.com/OrenNeumann/AlphaZero-scaling-laws.
2 RELATED WORK
Little work on power-law scaling has been published for MARL algorithms. Schrittwieser et al. (2021) report reward scaling as a power law with data frames when training a data-efficient variant of MuZero. Jones (2021), the closest work to our own, shows evidence of power-law scaling of performance with compute, by measuring the performance of AlphaZero agents on small-board variants of the game of Hex. For board sizes 3-9, log-scaling of Elo rating with compute is found when plotting the maximal scores reached among training runs. Without making an explicit connection to power-law scaling, the results reported by Jones (2021) can be characterized by a compute exponent of α_C ≈ 1.3, which can be shown using Eq. 3. In the paper, the author suggests a phenomenological explanation for the observation that an agent with twice the compute of its opponent seems to win with a probability of roughly 2/3, which in fact corresponds to a compute exponent of α_C = 1, since Eq. 4 with X_i = 2X_j and α = 1 gives an expected score of 1/(1 + 1/2) = 2/3. Similarly, Liu et al. (2021) report Elo scores that appear to scale as a log of environment frames for humanoid agents playing football, which would correspond to a power-law exponent of roughly 0.5 for playing strength scaling with data. Lee et al. (2022) apply the Transformer architecture to Atari games and plot performance scaling with the number of model parameters. Due to the substantially increased cost of calculating model-size scaling compared to compute or dataset-size scaling, they obtain only a limited number of data points, each generated by a single training seed. On this basis, analyzing the scaling behavior seems difficult. First indications of the model-size scaling law presented in this work can be found in Neumann & Gros (2022).
3 BACKGROUND
3.1 LANGUAGE MODEL SCALING LAWS
Kaplan et al. (2020) showed that the cross-entropy loss of autoregressive Transformers (Vaswani et al., 2017) scales as a power law with model size, dataset size and compute. These scaling laws hold when training is not bottlenecked by the other two resources. Specifically, the model-size power law applies when models are trained to convergence, while the compute scaling law is valid when training optimally sized models. By combining these laws, a power-law scaling of optimal model size with compute is obtained, with an exponent derived from the loss scaling exponents. These exponents, later recalculated by Hoffmann et al. (2022), tend to fall in the range [0, 1].
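As a concrete numerical check of how two such laws combine, the short Python sketch below compares the ratio α_C/α_N with the optimal-size exponents reported later in Table 1; the relation N_opt ∝ C^{α_C/α_N} is the one quoted with Fig. 1 (Eq. 7), while the variable names and the script itself are only an illustration of ours.

```python
# Minimal consistency check (ours): if playing strength scales as N^alpha_N for
# converged models and as C^alpha_C for compute-optimal models, the optimal model
# size is expected to scale as N_opt ~ C^(alpha_C / alpha_N), cf. Eq. (7).
exponents = {
    "Connect Four": {"alpha_N": 0.88, "alpha_C": 0.55, "alpha_C_opt": 0.62},
    "Pentago":      {"alpha_N": 0.87, "alpha_C": 0.55, "alpha_C_opt": 0.63},
}

for game, e in exponents.items():
    predicted = e["alpha_C"] / e["alpha_N"]   # predicted optimal-size exponent
    print(f"{game}: predicted {predicted:.3f} vs. measured {e['alpha_C_opt']:.2f}")
```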
3.2 POWER LAW SCALING AND ELO RATING
The Elo rating system (Elo, 1978) is a popular standard for benchmarking MARL algorithms, as well as for rating human players in zero-sum games. The ratings r_i are calculated to fit the expected score of player i in games against player j:

E_i = 1 / (1 + 10^{(r_j - r_i)/400}) ,    (1)

where possible game scores are {0, 0.5, 1} for a loss/draw/win, respectively. This rating system is built on the assumption that game statistics adhere to the Bradley-Terry model (Bradley & Terry, 1952), for which each player i is assigned a number γ_i representing their strength. The player-specific strength determines the expected game outcomes according to:

E_i = γ_i / (γ_i + γ_j) .    (2)

Elo rating is a log-scale representation of player strengths (Coulom, 2007): r_i = 400 log10(γ_i). An observed logarithmic scaling of the Elo score is hence equivalent to a power-law scaling of the individual strengths: if there exists a constant c such that r_i = c · log10(X_i) for some variable X_i (e.g. number of parameters), then the playing strength of player i scales as a power of X_i:

γ_i ∝ X_i^α ,    (3)

where α = c/400. Note that multiplying γ_i by a constant is equivalent to adding a constant to the Elo score r_i, which would not change the predicted game outcomes. This power law produces a simple expression for the expected result of a game between i and j:

E_i = 1 / (1 + (X_j/X_i)^α) .    (4)
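To make the link between Eqs. (1)-(4) explicit, the following minimal Python sketch (ours; function names and example numbers are illustrative only) converts an Elo-versus-log slope into a power-law exponent and predicts game outcomes from either Elo differences or resource ratios.

```python
def expected_score_from_elo(r_i: float, r_j: float) -> float:
    """Eq. (1): expected score of player i against player j, given Elo ratings."""
    return 1.0 / (1.0 + 10.0 ** ((r_j - r_i) / 400.0))

def exponent_from_elo_slope(c: float) -> float:
    """Eq. (3): if r_i = c * log10(X_i), strength scales as X_i^alpha with alpha = c/400."""
    return c / 400.0

def expected_score_from_ratio(x_i: float, x_j: float, alpha: float) -> float:
    """Eq. (4): expected score when playing strength is a power of a resource X."""
    return 1.0 / (1.0 + (x_j / x_i) ** alpha)

# An Elo slope of 400 points per decade of X corresponds to alpha = 1; doubling X
# then gives an expected score of 2/3 (the observation discussed in Section 2).
alpha = exponent_from_elo_slope(400.0)
print(expected_score_from_ratio(2.0, 1.0, alpha))   # -> 0.666...
print(expected_score_from_elo(1200.0, 1000.0))      # -> ~0.76 for a 200-point gap
```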
4 EXPERIMENTAL SETTING
Table 1: Summary of scaling exponents. α_N describes the scaling of performance with model size, α_C the scaling of performance with compute, and α_C^opt the scaling of the optimal model size with compute. Connect Four and Pentago have nearly identical values.

Exponents   Connect Four   Pentago
α_N         0.88           0.87
α_C         0.55           0.55
α_C^opt     0.62           0.63

We train agents with the AlphaZero algorithm using Monte Carlo tree search (MCTS) (Coulom, 2006; Browne et al., 2012) guided by a multilayer perceptron. Kaplan et al. (2020) have observed
that autoregressive Transformer models display the same size-scaling trend regardless of shape details, provided the number of layers is at least two. We therefore use a neural network architecture where the policy and value heads, each containing a fully-connected hidden layer, are mounted on a torso of two fully-connected hidden layers. All hidden layers are of equal width, which is varied between 4 and 256 neurons. Training is done using the AlphaZero Python implementation available in OpenSpiel (Lanctot et al., 2019).
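For concreteness, the following PyTorch sketch is our own illustrative rendering of the described network shape (a two-layer fully-connected torso with one-hidden-layer policy and value heads, all of equal width); the class name, activation functions and input/output sizes are assumptions, not the OpenSpiel implementation.

```python
import torch
import torch.nn as nn

class AlphaZeroMLP(nn.Module):
    """Illustrative torso-plus-heads layout with all hidden layers of equal width."""

    def __init__(self, input_size: int, num_actions: int, width: int):
        super().__init__()
        # Torso: two fully-connected hidden layers.
        self.torso = nn.Sequential(
            nn.Linear(input_size, width), nn.ReLU(),
            nn.Linear(width, width), nn.ReLU(),
        )
        # Policy head: one fully-connected hidden layer, then move logits.
        self.policy_head = nn.Sequential(
            nn.Linear(width, width), nn.ReLU(),
            nn.Linear(width, num_actions),
        )
        # Value head: one fully-connected hidden layer, then a scalar in [-1, 1].
        self.value_head = nn.Sequential(
            nn.Linear(width, width), nn.ReLU(),
            nn.Linear(width, 1), nn.Tanh(),
        )

    def forward(self, x: torch.Tensor):
        h = self.torso(x)
        return self.policy_head(h), self.value_head(h)

# Example instantiation with a Connect Four sized observation (assumed shape)
# and 7 legal columns; the hidden width is what is varied between 4 and 256.
model = AlphaZeroMLP(input_size=3 * 6 * 7, num_actions=7, width=64)
policy_logits, value = model(torch.zeros(1, 3 * 6 * 7))
```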
In order to keep our analysis as general as possible, we avoid hyperparameter tuning whenever possible, fixing most hyperparameters to the ones suggested in OpenSpiel's AlphaZero example code, see Appendix A. The only parameter tailored to each board game is the temperature drop, as we find that it has a substantial influence on agents' ability to learn effectively within a span of 10^4 training steps, the number used in our simulations.
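For readers unfamiliar with the temperature drop: it is the move number after which self-play stops sampling moves in proportion to MCTS visit counts and instead plays the most visited move greedily. The sketch below is a simplified illustration of ours (temperature fixed at 1 before the drop), not the OpenSpiel code.

```python
import numpy as np

def select_move(visit_counts: np.ndarray, move_number: int, temperature_drop: int,
                rng: np.random.Generator) -> int:
    """Toy AlphaZero-style move selection with a temperature drop."""
    if move_number < temperature_drop:
        # Early game: sample in proportion to visit counts (exploration).
        probs = visit_counts / visit_counts.sum()
        return int(rng.choice(len(visit_counts), p=probs))
    # After the drop: play the most visited move deterministically.
    return int(np.argmax(visit_counts))

rng = np.random.default_rng(0)
counts = np.array([10.0, 50.0, 25.0, 15.0])                  # toy visit counts
print(select_move(counts, move_number=3, temperature_drop=10, rng=rng))   # sampled
print(select_move(counts, move_number=12, temperature_drop=10, rng=rng))  # -> 1
```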
We focus on two different games, Connect Four and Pentago, which are both popular two-player
zero-sum open-information games. In Connect Four, players drop tokens in turn into a vertical
board, trying to win by connecting four of their tokens in a line. In Pentago, players place tokens on
a board in an attempt to connect a line of five, with each turn ending with a rotation of one of the
four quadrants of the board. These two games are non-trivial to learn and light enough to allow for
training a larger number of agents with a reasonable amount of resources. Furthermore, they allow
for the benchmarking of the trained agents against solvers that are either perfect (for Connect Four)
or near-to-perfect (for Pentago). Our analysis starts with Connect Four, for which we train agents with varying model sizes. Elo scores are evaluated both within the group of trained agents and with respect to an open-source game solver (Pons, 2015). We follow up the Connect Four results with agents trained on Pentago, which has a much larger branching factor and shorter games (on average, see Table 2).
For both games we train AlphaZero agents with different neural network sizes and/or distinct compute budgets, repeating each training six times with different random seeds to ensure reproducibility (Henderson et al., 2018). We run matches between all trained agents in order to calculate the Elo ratings of the total pools of 1326 and 714 agents for Connect Four and Pentago, respectively. Elo score calculation is done using BayesElo (Coulom, 2008).
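The ratings themselves are computed with BayesElo; purely to illustrate what such a fit does, the sketch below (ours, with toy data) estimates Bradley-Terry strengths from a matrix of pairwise wins via the standard minorization-maximization updates and converts them to Elo scores.

```python
import numpy as np

def fit_elo(wins: np.ndarray, iterations: int = 200) -> np.ndarray:
    """Maximum-likelihood Bradley-Terry fit, returned as Elo ratings.

    wins[i, j] is the number of games player i won against player j
    (diagonal zero, draws ignored in this toy version).
    """
    games = wins + wins.T                 # games played between each pair
    total_wins = wins.sum(axis=1)         # total wins of each player
    gamma = np.ones(len(wins))            # Bradley-Terry strengths
    for _ in range(iterations):
        # MM update: gamma_i <- W_i / sum_j n_ij / (gamma_i + gamma_j)
        denom = (games / (gamma[:, None] + gamma[None, :])).sum(axis=1)
        gamma = total_wins / denom
        gamma /= np.exp(np.log(gamma).mean())   # fix the arbitrary Elo offset
    return 400.0 * np.log10(gamma)

# Toy pool of three agents: agent 0 mostly beats 1 and 2, agent 1 mostly beats 2.
wins = np.array([[0, 8, 9],
                 [2, 0, 7],
                 [1, 3, 0]])
print(np.round(fit_elo(wins), 1))
```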
5 RESULTS
5.1 NEURAL NETWORK SIZE SCALING
Fig. 2 (top) shows the improvement of Connect Four Elo scores when increasing neural network sizes across several orders of magnitude. We look at the infinite-compute limit, in the sense that performance is not bottlenecked by compute-time limitations, by training all agents for exactly 10^4 optimization steps. One observes that Elo scores follow a clear logarithmic trend which breaks only when approaching the hard boundary of perfect play. In order to verify that the observed plateau in Elo is indeed the maximal rating achievable, we plot the Elo difference between each agent and an optimal player guided by a game solver (Pons, 2015) using alpha-beta pruning (Knuth & Moore, 1975). A vertical line in both plots marks the point beyond which we exclude data from the fit. This is also the point where agents get within 10 Elo points of the solver.
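To spell out the fitting step, the sketch below (ours; the parameter counts and Elo values are placeholders chosen to have a slope near the reported α_N ≈ 0.88, not measured data) fits Elo scores to c · log10(N) and converts the slope into a strength exponent via Eq. (3).

```python
import numpy as np

# Placeholder data: network parameter counts and Elo scores of converged agents.
params = np.array([1e3, 1e4, 1e5, 1e6, 1e7])
elo = np.array([-400.0, -50.0, 300.0, 650.0, 1000.0])

# Linear fit of Elo against log10(N): elo ≈ c * log10(N) + b.
c, b = np.polyfit(np.log10(params), elo, deg=1)
alpha_N = c / 400.0   # Eq. (3): logarithmic Elo scaling <=> power-law strength scaling
print(f"slope c = {c:.1f} Elo per decade of parameters, alpha_N = {alpha_N:.2f}")
```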
Combining the logarithmic fit with Eq. 1 yields a simple relation between the expected game score E_i for player i against player j and the ratio N_i/N_j of the respective numbers of neural network parameters (cf. Eq. 4 with X_i = N_i).

Table 2: Game details. Connect Four games last longer on average, but the branching factor of Pentago is substantially larger. Game lengths are averaged over all training runs; the maximal allowed number of turns is cited as well.

Game           Branching Factor   Average Game Length   Maximal Game Length
Connect Four   7                  25.4                  42
Pentago        288                16.8                  36