SCALING LAWS FOR A MULTI-AGENT REINFORCEMENT LEARNING MODEL
Oren Neumann & Claudius Gros
Institute for Theoretical Physics
Goethe University Frankfurt
Frankfurt am Main, Germany
{neumann,gros}@itp.uni-frankfurt.de
ABSTRACT
The recent observation of neural power-law scaling relations has made a significant impact in the field of deep learning. A substantial amount of attention has consequently been dedicated to the description of scaling laws, although mostly for supervised learning and only to a reduced extent for reinforcement learning frameworks. In this paper we present an extensive study of performance scaling for a cornerstone reinforcement learning algorithm, AlphaZero. On the basis of a relationship between Elo rating, playing strength and power-law scaling, we train AlphaZero agents on the games Connect Four and Pentago and analyze their performance. We find that player strength scales as a power law in neural network parameter count when not bottlenecked by available compute, and as a power of compute when training optimally sized agents. We observe nearly identical scaling exponents for both games. Combining the two observed scaling laws, we obtain a power law relating optimal size to compute, similar to the ones observed for language models. We find that the predicted scaling of optimal neural network size fits our data for both games. We also show that large AlphaZero models are more sample efficient, performing better than smaller models with the same amount of training data.
1 INTRODUCTION
In recent years, power-law scaling of performance indicators has been observed in a range of machine-learning architectures (Hestness et al., 2017; Kaplan et al., 2020; Henighan et al., 2020; Gordon et al., 2021; Hernandez et al., 2021; Zhai et al., 2022), such as Transformers, LSTMs, Routing Networks (Clark et al., 2022) and ResNets (Bello et al., 2021). The fields investigated include natural language processing and computer vision (Rosenfeld et al., 2019). Most of these scaling laws regard the dependency of test loss on either dataset size, number of neural network parameters, or training compute. The robustness of the observed scaling laws across many orders of magnitude led to the creation of large models, with parameters numbering in the tens and hundreds of billions (Brown et al., 2020; Hoffmann et al., 2022; Alayrac et al., 2022).
Until now, evidence for power-law scaling has come for the most part from supervised learning methods. Considerably less effort has been dedicated to the scaling of reinforcement learning algorithms, such as performance scaling with model size (Reed et al., 2022; Lee et al., 2022). At times, scaling laws remained unnoticed, given that they show up not as power laws, but as log-linear relations when Elo scores are taken as the performance measure in multi-agent reinforcement learning (MARL) (Jones, 2021; Liu et al., 2021) (see Section 3.2). Of particular interest in this context is the AlphaZero family of models, AlphaGo Zero (Silver et al., 2017b), AlphaZero (Silver et al., 2017a), and MuZero (Schrittwieser et al., 2020), which achieved state-of-the-art performance on several board games without access to human gameplay datasets by applying a tree search guided by a neural network.
Figure 1: Left: Optimal number of neural network parameters for different amounts of available compute. The optimal agent size scales for both Connect Four and Pentago as a single power law with available compute. The predicted slope α_C^opt = α_C/α_N of Eq. (7) matches the observed data, where α_C and α_N are the compute and model-size scaling exponents, respectively. See Table 3 for the numerical values. Right: The same graph zoomed out to include the resources used to create AlphaZero (Silver et al., 2017a) and AlphaGo Zero (Silver et al., 2017b). These models stand well below the optimal trend for Connect Four and Pentago.

Here we present an extensive study of power-law scaling in the context of two-player open-information games. Our study constitutes, to our knowledge, the first investigation of power-law scaling phenomena for a MARL algorithm. Measuring the performance of the AlphaZero algorithm using Elo rating, we follow a similar path as Kaplan et al. (2020) by providing evidence of power-law
scaling of playing strength with model size and compute, as well as a power law of optimal model
size with respect to available compute. Focusing on AlphaZero-agents that are guided by neural nets
with fully connected layers, we test our hypothesis on two popular board games: Connect Four and
Pentago. These games are selected for being different from each other with respect to branching
factors and game lengths.
Using the Bradley-Terry model definition of playing strength (Bradley & Terry, 1952), we start by showing that playing strength scales as a power law with neural network size when models are trained until convergence in the limit of abundant compute. We find that agents trained on Connect Four and Pentago scale with similar exponents.
In a second step we investigate the trade-off between model size and compute. Similar to the scaling observed in the game Hex (Jones, 2021), we observe power-law scaling when compute is limited, again with similar exponents for Connect Four and Pentago. Finally, we utilize these two scaling laws to find a scaling law for the optimal model size given the amount of compute available, as shown in Fig. 1. We find that the optimal neural network size scales as a power law with compute, with an exponent that can be derived from the individual size-scaling and compute-scaling exponents. All code and data used in our experiments are available online at https://github.com/OrenNeumann/AlphaZero-scaling-laws.
2 RELATED WORK
Little work on power-law scaling has been published for MARL algorithms. Schrittwieser et al. (2021) report reward scaling as a power law with data frames when training a data-efficient variant of MuZero. Jones (2021), the closest work to our own, shows evidence of power-law scaling of performance with compute, by measuring the performance of AlphaZero agents on small-board variants of the game of Hex. For board sizes 3-9, log-scaling of Elo rating with compute is found when plotting the maximal scores reached among training runs. Without making an explicit connection to power-law scaling, the results reported by Jones (2021) can be characterized by a compute exponent of α_C ≈ 1.3, which can be shown using Eq. 3. In the paper, the author suggests a phenomenological explanation for the observation that an agent with twice the compute of its opponent seems to win with a probability of roughly 2/3, which in fact corresponds to a compute exponent of α_C = 1, since Eq. 4 with X_i = 2X_j and α = 1 gives an expected score of 1/(1 + 1/2) = 2/3. Similarly, Liu et al. (2021) report Elo scores that appear to scale as a log of environment frames for humanoid agents playing football, which would correspond to a power-law exponent of roughly 0.5 for playing strength scaling with data. Lee et al. (2022) apply the Transformer architecture to Atari games and plot performance scaling with the number of model parameters. Due to the substantially increased cost of calculating model-size scaling compared to compute or dataset-size scaling, they obtain only a limited number of data points, each generated by a single training seed. On this basis, analyzing the scaling behavior seems difficult. First indications of the model-size scaling law presented in this work can be found in Neumann & Gros (2022).
3 BACKGROUND
3.1 LANGUAGE MODEL SCALING LAWS
Kaplan et al. (2020) showed that the cross-entropy loss of autoregressive Transformers (Vaswani et al., 2017) scales as a power law with model size, dataset size and compute. These scaling laws hold when training is not bottlenecked by the other two resources. Specifically, the model-size power law applies when models are trained to convergence, while the compute scaling law is valid when training optimally sized models. By combining these laws, a power-law scaling of optimal model size with compute is obtained, with an exponent derived from the loss scaling exponents. These exponents, later recalculated by Hoffmann et al. (2022), tend to fall in the range [0, 1].
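As a concrete numerical check of how two such laws combine, the short Python sketch below compares the ratio α_C/α_N with the optimal-size exponents reported later in Table 1; the relation N_opt ∝ C^{α_C/α_N} is the one quoted with Fig. 1 (Eq. 7), while the variable names and the script itself are only an illustration of ours.

```python
# Minimal consistency check (ours): if playing strength scales as N^alpha_N for
# converged models and as C^alpha_C for compute-optimal models, the optimal model
# size is expected to scale as N_opt ~ C^(alpha_C / alpha_N), cf. Eq. (7).
exponents = {
    "Connect Four": {"alpha_N": 0.88, "alpha_C": 0.55, "alpha_C_opt": 0.62},
    "Pentago":      {"alpha_N": 0.87, "alpha_C": 0.55, "alpha_C_opt": 0.63},
}

for game, e in exponents.items():
    predicted = e["alpha_C"] / e["alpha_N"]   # predicted optimal-size exponent
    print(f"{game}: predicted {predicted:.3f} vs. measured {e['alpha_C_opt']:.2f}")
```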
3.2 POWER LAW SCALING AND ELO RATING
The Elo rating system (Elo, 1978) is a popular standard for benchmarking MARL algorithms, as well as for rating human players in zero-sum games. The ratings r_i are calculated to fit the expected score of player i in games against player j:

E_i = 1 / (1 + 10^{(r_j - r_i)/400}) ,    (1)

where possible game scores are {0, 0.5, 1} for a loss/draw/win, respectively. This rating system is built on the assumption that game statistics adhere to the Bradley-Terry model (Bradley & Terry, 1952), for which each player i is assigned a number γ_i representing their strength. The player-specific strength determines the expected game outcomes according to:

E_i = γ_i / (γ_i + γ_j) .    (2)

Elo rating is a log-scale representation of player strengths (Coulom, 2007): r_i = 400 log10(γ_i). An observed logarithmic scaling of the Elo score is hence equivalent to a power-law scaling of the individual strengths: if there exists a constant c such that r_i = c · log10(X_i) for some variable X_i (e.g. number of parameters), then the playing strength of player i scales as a power of X_i:

γ_i ∝ X_i^α ,    (3)

where α = c/400. Note that multiplying γ_i by a constant is equivalent to adding a constant to the Elo score r_i, which would not change the predicted game outcomes. This power law produces a simple expression for the expected result of a game between i and j:

E_i = 1 / (1 + (X_j/X_i)^α) .    (4)
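To make the link between Eqs. (1)-(4) explicit, the following minimal Python sketch (ours; function names and example numbers are illustrative only) converts an Elo-versus-log slope into a power-law exponent and predicts game outcomes from either Elo differences or resource ratios.

```python
def expected_score_from_elo(r_i: float, r_j: float) -> float:
    """Eq. (1): expected score of player i against player j, given Elo ratings."""
    return 1.0 / (1.0 + 10.0 ** ((r_j - r_i) / 400.0))

def exponent_from_elo_slope(c: float) -> float:
    """Eq. (3): if r_i = c * log10(X_i), strength scales as X_i^alpha with alpha = c/400."""
    return c / 400.0

def expected_score_from_ratio(x_i: float, x_j: float, alpha: float) -> float:
    """Eq. (4): expected score when playing strength is a power of a resource X."""
    return 1.0 / (1.0 + (x_j / x_i) ** alpha)

# An Elo slope of 400 points per decade of X corresponds to alpha = 1; doubling X
# then gives an expected score of 2/3 (the observation discussed in Section 2).
alpha = exponent_from_elo_slope(400.0)
print(expected_score_from_ratio(2.0, 1.0, alpha))   # -> 0.666...
print(expected_score_from_elo(1200.0, 1000.0))      # -> ~0.76 for a 200-point gap
```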
4 EXPERIMENTAL SETTING
Table 1: Summary of scaling exponents. α_N describes the scaling of performance with model size, α_C the scaling of performance with compute, and α_C^opt the scaling of the optimal model size with compute. Connect Four and Pentago have nearly identical values.

Exponents   Connect Four   Pentago
α_N         0.88           0.87
α_C         0.55           0.55
α_C^opt     0.62           0.63

We train agents with the AlphaZero algorithm using Monte Carlo tree search (MCTS) (Coulom, 2006; Browne et al., 2012) guided by a multilayer perceptron. Kaplan et al. (2020) have observed
that autoregressive Transformer models display the same size-scaling trend regardless of shape details, provided the number of layers is at least two. We therefore use a neural network architecture where the policy and value heads, each containing a fully-connected hidden layer, are mounted on a torso of two fully-connected hidden layers. All hidden layers are of equal width, which is varied between 4 and 256 neurons. Training is done using the AlphaZero Python implementation available in OpenSpiel (Lanctot et al., 2019).
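For concreteness, the following PyTorch sketch is our own illustrative rendering of the described network shape (a two-layer fully-connected torso with one-hidden-layer policy and value heads, all of equal width); the class name, activation functions and input/output sizes are assumptions, not the OpenSpiel implementation.

```python
import torch
import torch.nn as nn

class AlphaZeroMLP(nn.Module):
    """Illustrative torso-plus-heads layout with all hidden layers of equal width."""

    def __init__(self, input_size: int, num_actions: int, width: int):
        super().__init__()
        # Torso: two fully-connected hidden layers.
        self.torso = nn.Sequential(
            nn.Linear(input_size, width), nn.ReLU(),
            nn.Linear(width, width), nn.ReLU(),
        )
        # Policy head: one fully-connected hidden layer, then move logits.
        self.policy_head = nn.Sequential(
            nn.Linear(width, width), nn.ReLU(),
            nn.Linear(width, num_actions),
        )
        # Value head: one fully-connected hidden layer, then a scalar in [-1, 1].
        self.value_head = nn.Sequential(
            nn.Linear(width, width), nn.ReLU(),
            nn.Linear(width, 1), nn.Tanh(),
        )

    def forward(self, x: torch.Tensor):
        h = self.torso(x)
        return self.policy_head(h), self.value_head(h)

# Example instantiation with a Connect Four sized observation (assumed shape)
# and 7 legal columns; the hidden width is what is varied between 4 and 256.
model = AlphaZeroMLP(input_size=3 * 6 * 7, num_actions=7, width=64)
policy_logits, value = model(torch.zeros(1, 3 * 6 * 7))
```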
In order to keep our analysis as general as possible, we avoid hyperparameter tuning whenever possible, fixing most hyperparameters to the ones suggested in OpenSpiel's AlphaZero example code, see Appendix A. The only parameter tailored to each board game is the temperature drop, as we find that it has a substantial influence on agents' ability to learn effectively within a span of 10^4 training steps, the number used in our simulations.
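For readers unfamiliar with the temperature drop: it is the move number after which self-play stops sampling moves in proportion to MCTS visit counts and instead plays the most visited move greedily. The sketch below is a simplified illustration of ours (temperature fixed at 1 before the drop), not the OpenSpiel code.

```python
import numpy as np

def select_move(visit_counts: np.ndarray, move_number: int, temperature_drop: int,
                rng: np.random.Generator) -> int:
    """Toy AlphaZero-style move selection with a temperature drop."""
    if move_number < temperature_drop:
        # Early game: sample in proportion to visit counts (exploration).
        probs = visit_counts / visit_counts.sum()
        return int(rng.choice(len(visit_counts), p=probs))
    # After the drop: play the most visited move deterministically.
    return int(np.argmax(visit_counts))

rng = np.random.default_rng(0)
counts = np.array([10.0, 50.0, 25.0, 15.0])                  # toy visit counts
print(select_move(counts, move_number=3, temperature_drop=10, rng=rng))   # sampled
print(select_move(counts, move_number=12, temperature_drop=10, rng=rng))  # -> 1
```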
We focus on two different games, Connect Four and Pentago, which are both popular two-player
zero-sum open-information games. In Connect Four, players drop tokens in turn into a vertical
board, trying to win by connecting four of their tokens in a line. In Pentago, players place tokens on
a board in an attempt to connect a line of five, with each turn ending with a rotation of one of the
four quadrants of the board. These two games are non-trivial to learn and light enough to allow for
training a larger number of agents with a reasonable amount of resources. Furthermore, they allow
for the benchmarking of the trained agents against solvers that are either perfect (for Connect Four)
or near-to-perfect (for Pentago). Our analysis starts with Connect Four, for which we train agents with varying model sizes. Elo scores are evaluated both within the group of trained agents and with respect to an open-source game solver (Pons, 2015). We follow up the Connect Four results with agents trained on Pentago, which has a much larger branching factor and shorter games (on average, see Table 2).
For both games we train AlphaZero agents with different neural network sizes and/or distinct compute budgets, repeating each training six times with different random seeds to ensure reproducibility (Henderson et al., 2018). We run matches between all trained agents in order to calculate the Elo ratings of the total pools of 1326 and 714 agents for Connect Four and Pentago, respectively. Elo score calculation is done using BayesElo (Coulom, 2008).
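The ratings themselves are computed with BayesElo; purely to illustrate what such a fit does, the sketch below (ours, with toy data) estimates Bradley-Terry strengths from a matrix of pairwise wins via the standard minorization-maximization updates and converts them to Elo scores.

```python
import numpy as np

def fit_elo(wins: np.ndarray, iterations: int = 200) -> np.ndarray:
    """Maximum-likelihood Bradley-Terry fit, returned as Elo ratings.

    wins[i, j] is the number of games player i won against player j
    (diagonal zero, draws ignored in this toy version).
    """
    games = wins + wins.T                 # games played between each pair
    total_wins = wins.sum(axis=1)         # total wins of each player
    gamma = np.ones(len(wins))            # Bradley-Terry strengths
    for _ in range(iterations):
        # MM update: gamma_i <- W_i / sum_j n_ij / (gamma_i + gamma_j)
        denom = (games / (gamma[:, None] + gamma[None, :])).sum(axis=1)
        gamma = total_wins / denom
        gamma /= np.exp(np.log(gamma).mean())   # fix the arbitrary Elo offset
    return 400.0 * np.log10(gamma)

# Toy pool of three agents: agent 0 mostly beats 1 and 2, agent 1 mostly beats 2.
wins = np.array([[0, 8, 9],
                 [2, 0, 7],
                 [1, 3, 0]])
print(np.round(fit_elo(wins), 1))
```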
5 RESULTS
5.1 NEURAL NETWORK SIZE SCALING
Fig. 2 (top) shows the improvement of Connect Four Elo scores when increasing neural network sizes across several orders of magnitude. We look at the infinite-compute limit, in the sense that performance is not bottlenecked by compute-time limitations, by training all agents for exactly 10^4 optimization steps. One observes that Elo scores follow a clear logarithmic trend which breaks only when approaching the hard boundary of perfect play. In order to verify that the observed plateau in Elo is indeed the maximal rating achievable, we plot the Elo difference between each agent and an optimal player guided by a game solver (Pons, 2015) using alpha-beta pruning (Knuth & Moore, 1975). A vertical line in both plots marks the point beyond which we exclude data from the fit. This is also the point where agents get within 10 Elo points of the solver.
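To spell out the fitting step, the sketch below (ours; the parameter counts and Elo values are placeholders chosen to have a slope near the reported α_N ≈ 0.88, not measured data) fits Elo scores to c · log10(N) and converts the slope into a strength exponent via Eq. (3).

```python
import numpy as np

# Placeholder data: network parameter counts and Elo scores of converged agents.
params = np.array([1e3, 1e4, 1e5, 1e6, 1e7])
elo = np.array([-400.0, -50.0, 300.0, 650.0, 1000.0])

# Linear fit of Elo against log10(N): elo ≈ c * log10(N) + b.
c, b = np.polyfit(np.log10(params), elo, deg=1)
alpha_N = c / 400.0   # Eq. (3): logarithmic Elo scaling <=> power-law strength scaling
print(f"slope c = {c:.1f} Elo per decade of parameters, alpha_N = {alpha_N:.2f}")
```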
Combining the logarithmic fit with Eq. 1 yields a simple relation between the expected game score E_i for player i against player j and the ratio N_i/N_j of the respective numbers of neural network parameters (cf. Eq. 4 with X_i = N_i).

Table 2: Game details. Connect Four games last longer on average, but the branching factor of Pentago is substantially larger. Game lengths are averaged over all training runs; the maximal allowed number of turns is cited as well.

Game           Branching Factor   Average Game Length   Maximal Game Length
Connect Four   7                  25.4                  42
Pentago        288                16.8                  36