Turbocharging Solution Concepts: Solving NEs, CEs
and CCEs with Neural Equilibrium Solvers
Luke Marris
DeepMind (marris@deepmind.com), UCL
Ian Gemp
DeepMind
Thomas Anthony
DeepMind
Andrea Tacchetti
DeepMind
Siqi Liu
DeepMind, UCL
Karl Tuyls
DeepMind
Abstract
Solution concepts such as Nash Equilibria, Correlated Equilibria, and Coarse
Correlated Equilibria are useful components for many multiagent machine learning
algorithms. Unfortunately, solving a normal-form game could take prohibitive
or non-deterministic time to converge, and could fail. We introduce the Neural
Equilibrium Solver which utilizes a special equivariant neural network architecture
to approximately solve the space of all games of fixed shape, buying speed and
determinism. We define a flexible equilibrium selection framework that is capable
of uniquely selecting an equilibrium that minimizes relative entropy or maximizes
welfare. The network is trained without needing to generate any supervised training
data. We show remarkable zero-shot generalization to larger games. We argue that
such a network is a powerful component for many possible multiagent algorithms.
1 Introduction
Normal-form solution concepts such as Nash Equilibrium (NE) [28, 46], Correlated Equilibrium (CE) [3], and Coarse Correlated Equilibrium (CCE) [45] are useful components and subroutines for many multiagent machine learning algorithms. For example, value-based reinforcement learning algorithms for solving Markov games, such as Nash Q-learning [27] and Correlated Q-Learning [19], maintain state-action values for every player in the game. These action values are equivalent to per-state normal-form games, and policies are equilibrium solutions of these games. Critically, this policy will need to be recomputed each time the action-value is updated during training, and, for large or continuous state-space Markov games, every time the agents need to take an action. Another class of multiagent algorithms are those in the space of Empirical Game Theoretic Analysis (EGTA) [60, 61], including PSRO [35, 44], JPSRO [40], and NeuPL [39, 38]. These algorithms are capable of training policies in extensive-form games, and require finding equilibria of empirically estimated normal-form games as a subroutine (the "meta-solver" step). In particular, these algorithms have been critical in driving agents to superhuman performance in Go [54], Chess [55], and StarCraft [59].
Unfortunately, solving for an equilibrium can be computationally complex. Computing NEs is known to be PPAD-complete [9, 8]. (C)CEs are defined by linear constraints and, if a linear objective is used to select an equilibrium, can be solved by linear programs (LPs) in polynomial time. However, in general, the solutions to LPs are non-unique (e.g. in zero-sum games), and therefore are unsuitable equilibrium selection methods for many algorithms, and unsuitable for training neural networks, which benefit from unambiguous targets. Objectives such as Maximum Gini [41] (a quadratic program (QP)) and Maximum Entropy [48] (a nonlinear program) are unique but are more complex to solve.
As a result, solving for equilibria often requires deploying iterative solvers, which theoretically can
scale to large normal-form games but may (i) take an unpredictable amount of time to converge, (ii)
take a prohibitive amount of time to do so, and (iii) may fail unpredictably on ill-conditioned problems.
Furthermore, classical methods [16, 36, 43] (i) do not scale, and (ii) are non-differentiable. This
limits the applicability of equilibrium solution concepts in multiagent machine learning algorithms.
Therefore, there exists an important niche for approximately solving equilibria in medium sized
normal-form games, quickly, in batches, reliably, and in a deterministic amount of time. With
appropriate care, this goal can be accomplished with a Neural Network which amortizes up-front
training cost to map normal-form payoffs to equilibrium solution concepts quickly at test time. We
propose the Neural Equilibrium Solver (NES). This network is trained to optimize a composite
objective function that weights accuracy of the returned equilibrium against auxiliary objectives
that a user may desire such as maximum entropy, maximum welfare, or minimum distance to some
target distribution. We introduce several innovations into the design and training of NES so that
it is efficient and accurate. Unlike most supervised deep learning models, NES avoids the need to
explicitly construct a labeled dataset of (game, equilibrium) pairs. Instead we derive a loss function
that can be minimized in an unsupervised fashion from only game inputs. We also exploit the
duality of the equilibrium problem. Instead of solving for equilibria in the primal space, we train
NES to solve for them in the dual space, which has a much smaller representation. We utilize a
training distribution that efficiently represents the space of all normal-form games of a desired shape
and use an invariant preprocessing step to map games at test time to this space. In terms of the
network architecture, we design a series of layers that are equivariant to symmetries in games such
as permutations of players and strategies, which reduces the number of training steps and improves
generalization performance. The network architecture is independent of the number of strategies in
the game and we show interesting zero-shot generalization to larger games. This network can either
be pretrained before being deployed, trained online alongside another machine learning algorithm, or
a mixture of both.
2 Preliminaries
Game Theory
Game theory is the study of the interactive behaviour of rational, payoff-maximizing agents in the presence of other agents. The environment that the agents operate in is called a game. We focus on a particular type of single-shot, simultaneous-move game called a normal-form game. A normal-form game consists of $N$ players, a set of strategies available to each player, $a_p \in \mathcal{A}_p$, and a payoff for each player under a particular joint action, $G_p(a)$, where $a = (a_1, \dots, a_N) = (a_p, a_{-p}) \in \mathcal{A} = \prod_p \mathcal{A}_p$. The subscript notation $-p$ is used to mean "all players apart from player $p$". Games are sometimes referred to by their shape, for example $|\mathcal{A}_1| \times \dots \times |\mathcal{A}_N|$. The distribution of play is described by a joint $\sigma(a)$. The goal of each player is to maximize their expected payoff, $\sum_{a \in \mathcal{A}} \sigma(a) G_p(a)$. Players could play independently by selecting a strategy according to their marginal $\sigma(a_p)$ over joint strategies, such that $\sigma(a) = \prod_p \sigma(a_p)$. However, this is limiting because it does not allow players to coordinate. A mediator called a correlation device could be employed to allow players to execute arbitrary joint strategies $\sigma(a)$ that do not necessarily factorize into their marginals. Such a mediator would sample from a publicly known joint $\sigma(a)$ and secretly communicate to each player their recommended strategy. Game theory is most developed in a subset of games: those with two players and a restriction on the payoffs, $G_1(a_1, a_2) = -G_2(a_1, a_2)$, known as zero-sum. Particularly in N-player, general-sum games, it is difficult to define a single criterion to find solutions to games. One approach is instead to consider joints that are in equilibrium: distributions such that no player has incentive to unilaterally deviate from a recommendation.
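As a concrete illustration of this notation, the sketch below (illustrative code, not from the paper; the payoff values are arbitrary examples) stores a 2×2 normal-form game as a payoff tensor and computes each player's expected payoff under a correlated joint and under a factorized (independent) one.

```python
# Illustrative sketch: payoff tensor G[p, a_1, a_2] for a 2x2 game and expected
# payoffs under a joint distribution sigma(a). Payoff values are arbitrary examples.
import numpy as np

G = np.array([[[3.0, 0.0],
               [5.0, 1.0]],    # player 1's payoffs G_1(a_1, a_2)
              [[3.0, 5.0],
               [0.0, 1.0]]])   # player 2's payoffs G_2(a_1, a_2)

sigma = np.array([[0.5, 0.0],
                  [0.0, 0.5]])                     # a correlated joint over A_1 x A_2
expected = np.tensordot(G, sigma, axes=([1, 2], [0, 1]))
print(expected)                                    # expected payoff per player

mu_1, mu_2 = np.array([0.5, 0.5]), np.array([0.5, 0.5])
sigma_indep = np.outer(mu_1, mu_2)                 # factorized joint: sigma(a) = prod_p sigma(a_p)
print(np.tensordot(G, sigma_indep, axes=([1, 2], [0, 1])))
```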
Equilibrium Solution Concepts
Correlated Equilibria (CEs) [3] can be defined in terms of linear inequality constraints. The deviation gain of a player is the change in payoff the player achieves when deviating to action $a'_p$ from a recommended action $a''_p$, when the other players play $a_{-p}$.

$$A^{CE}_p(a'_p, a''_p, a) = A^{CE}_p(a'_p, a''_p, a_p, a_{-p}) = \begin{cases} G_p(a'_p, a_{-p}) - G_p(a''_p, a_{-p}) & a_p = a''_p \\ 0 & \text{otherwise} \end{cases} \quad (1)$$

A distribution, $\sigma(a)$, is in $\epsilon$-CE if the deviation gain is no more than some constant $\epsilon_p$ for every pair of recommendation, $a''_p$, and deviation strategies, $a'_p$, for every player, $p$. These linear constraints geometrically form a convex polytope of feasible solutions.

$$\epsilon\text{-CE:}\quad \sum_{a \in \mathcal{A}} \sigma(a)\, A^{CE}_p(a'_p, a''_p, a) \le \epsilon_p \quad \forall\, p \in [1, N],\; a''_p \ne a'_p \in \mathcal{A}_p \quad (2)$$
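To make Equation (2) concrete, the sketch below (illustrative, not the paper's code) computes the largest expected CE deviation gain of a joint $\sigma$; $\sigma$ is an exact CE when this quantity is at most zero, and an $\epsilon$-CE when it is at most $\epsilon_p$ for each player. The example game is a Chicken-style coordination game with assumed payoffs.

```python
# Illustrative sketch: largest expected CE deviation gain (Equations (1)-(2)).
import numpy as np

def max_ce_gain(G, sigma):
    """G: [N, |A_1|, ..., |A_N|] payoffs; sigma: joint over A. Returns the largest
    expected gain from switching a recommendation a''_p to a deviation a'_p."""
    n_p = G.shape[0]
    worst = -np.inf
    for p in range(n_p):
        for a_rec in range(G.shape[1 + p]):           # recommended strategy a''_p
            for a_dev in range(G.shape[1 + p]):       # deviation strategy a'_p
                if a_dev == a_rec:
                    continue
                sl_rec = [slice(None)] * n_p; sl_rec[p] = a_rec
                sl_dev = [slice(None)] * n_p; sl_dev[p] = a_dev
                # sigma restricted to joints where p is recommended a_rec,
                # times the payoff change from playing a_dev instead.
                gain = np.sum(sigma[tuple(sl_rec)] *
                              (G[p][tuple(sl_dev)] - G[p][tuple(sl_rec)]))
                worst = max(worst, gain)
    return worst

# Chicken-style game: half weight on each asymmetric outcome is a classic exact CE.
G = np.array([[[-10.0, 2.0], [1.0, 0.0]],
              [[-10.0, 1.0], [2.0, 0.0]]])
sigma = np.array([[0.0, 0.5], [0.5, 0.0]])
print(max_ce_gain(G, sigma))  # <= 0, so sigma is an exact CE
```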
Coarse Correlated Equilibria (CCEs) [45] are similar to CEs but a player may only consider deviating before receiving a recommended strategy. Therefore the deviation gain for CCEs can be derived from the CE definition by summing over all possible recommended strategies $a''_p$.

$$A^{CCE}_p(a'_p, a) = \sum_{a''_p \in \mathcal{A}_p} A^{CE}_p(a'_p, a''_p, a) = G_p(a'_p, a_{-p}) - G_p(a) \quad (3)$$

A distribution, $\sigma(a)$, is in $\epsilon$-CCE if the deviation gain is no more than some constant $\epsilon_p$ for every deviation strategy, $a'_p$, and for every player, $p$.

$$\epsilon\text{-CCE:}\quad \sum_{a \in \mathcal{A}} \sigma(a)\, A^{CCE}_p(a'_p, a) \le \epsilon_p \quad \forall\, p \in [1, N],\; a'_p \in \mathcal{A}_p \quad (4)$$
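Analogously, a minimal sketch (illustrative) of the $\epsilon$-CCE check in Equation (4): the deviation $a'_p$ is fixed before any recommendation, so its gain is measured against the player's expected payoff under the full joint.

```python
# Illustrative sketch: largest expected CCE deviation gain (Equations (3)-(4)).
import numpy as np

def max_cce_gain(G, sigma):
    """Largest expected gain from any fixed deviation a'_p, for any player p."""
    n_p = G.shape[0]
    worst = -np.inf
    for p in range(n_p):
        base = np.sum(sigma * G[p])              # expected payoff of following sigma
        marg = np.sum(sigma, axis=p)             # marginal over the other players, sigma(a_{-p})
        for a_dev in range(G.shape[1 + p]):
            sl = [slice(None)] * n_p
            sl[p] = a_dev
            dev = np.sum(marg * G[p][tuple(sl)]) # expected payoff of always playing a_dev
            worst = max(worst, dev - base)
    return worst

# Matching Pennies: the uniform joint is a CCE (indeed an NE), so the maximum gain is 0.
G = np.array([[[1.0, -1.0], [-1.0, 1.0]],
              [[-1.0, 1.0], [1.0, -1.0]]])
print(max_cce_gain(G, np.full((2, 2), 0.25)))    # 0.0
```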
NEs [46] have similar definitions to CCEs but have an extra constraint: the joint distribution factorizes, $\prod_p \sigma(a_p) = \sigma(a)$, resulting in nonlinear constraints¹.

$$\epsilon\text{-NE:}\quad \sum_{a \in \mathcal{A}} \Big[\prod_q \sigma(a_q)\Big] A^{CCE}_p(a'_p, a) \le \epsilon_p \quad \forall\, p \in [1, N],\; a'_p \in \mathcal{A}_p \quad (5)$$

Note that the definition of the NE uses the same deviation gain as the CCE definition. Another remarkable fact is that the marginals of any joint CCE in two-player constant-sum games, $\sigma(a_p) = \sum_{a_{-p}} \sigma(a)$, form an NE when $\epsilon_p = 0$. Therefore we can use CCE machinery to solve for NEs in such classes of games.

¹This is why NEs are harder to compute than (C)CEs.
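As a quick numerical illustration of the marginalization fact (a sketch; Matching Pennies serves as the two-player zero-sum example):

```python
# Illustrative sketch: marginals of a CCE joint in a two-player zero-sum game form an NE.
import numpy as np

sigma = np.full((2, 2), 0.25)      # uniform joint: an exact CCE of Matching Pennies
mu_1 = sigma.sum(axis=1)           # player 1's marginal, sigma(a_1)
mu_2 = sigma.sum(axis=0)           # player 2's marginal, sigma(a_2)
print(mu_1, mu_2)                  # [0.5 0.5] [0.5 0.5]: the unique NE of Matching Pennies
```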
When a distribution is in equilibrium, no player has incentive to unilaterally deviate from it to achieve a better payoff. There can however be many equilibria in a game; choosing amongst these is known as the equilibrium selection problem [20]. For (C)CEs, the valid solution space is convex (Figure 1), so any strictly convex function will suffice (in particular, Maximum Entropy (ME) [48] and Maximum Gini (MG) [40] have been proposed). For NEs, the solution space is convex for only certain classes of games such as two-player constant-sum games. Indeed, NEs are considered fundamental in this class, where they have powerful properties, such as being unexploitable, interchangeable, and tractable to compute. However, for N-player general-sum games, (C)CEs may be more suitable as they remain tractable and permit coordination between players, which results in higher-welfare equilibria. If $\epsilon_p \ge 0$, there must exist at least one NE, CE, and CCE for any finite game, because an NE always exists and NE ⊆ CE ⊆ CCE. Learning NEs [44] and CCEs [21] on a single game is well studied.
Neural Network Solvers
Approximating NEs using neural networks is known to be agnostic PAC learnable [14]. There is also work learning (C)CEs [4, 32] and training neural networks to approximate NEs [14, 22] on subclasses of games. Learned NE meta-solvers have been deployed in PSRO [17]. Differentiable neural networks have been developed to learn QREs [37]. NEs for contextual games have been learned using fixed point (deep equilibrium) networks [24]. A related field, L2O [7], aims to learn an iterative optimizer more suited to a particular distribution of inputs, while this work focuses on learning a direct mapping. To the best of our knowledge, no work exists training a general approximate mapping from the full space of games to (C)CEs with flexible selection criteria.
3 Maximum Welfare Minimum Relative Entropy (C)CEs
Previous work has argued that having a unique objective to solve for equilibrium selection is important. The principle of maximum entropy [30] has been used to find unique equilibria [48]. In maximum entropy selection, payoffs are ignored and selection is based on minimizing the distance to the uniform distribution. This has two interesting properties: (i) it makes defining unique solutions easy, and (ii) the solution is invariant to transformations (such as offset and positive scaling) of the payoff tensor. While these solutions are unique, they can result in weak and low-payoff equilibria because they find solutions on the boundary of the polytope. Meanwhile, the literature tends to favour Maximum Welfare (MW) because it results in high value for the agents and is a linear objective; however, in general it is not unique. We consider a composite objective function composed of (i) Minimum Relative Entropy (MRE, also known as Kullback-Leibler divergence) between a target joint, $\hat\sigma(a)$, and the equilibrium joint, $\sigma(a)$, (ii) distance between a target approximation, $\hat\epsilon_p$, and the equilibrium approximation, $\epsilon_p$, and (iii) the maximum of a linear objective, $W(a)$. The objective is constrained by (i) the distribution constraints ($\sum_{a} \sigma(a) = 1$ and $\sigma(a) \ge 0$) and (ii) either the CCE constraints (Equation (4)) or the CE constraints (Equation (2)).
$$\arg\max_{\sigma,\,\epsilon_p} \;\; \mu \sum_{a \in \mathcal{A}} \sigma(a) W(a) \;-\; \sum_{a \in \mathcal{A}} \sigma(a) \ln\frac{\sigma(a)}{\hat{\sigma}(a)} \;-\; \rho \sum_p \left(\epsilon^{+}_p - \epsilon_p\right) \ln\!\left(\frac{\epsilon^{+}_p - \epsilon_p}{\exp(1)\left(\epsilon^{+}_p - \hat{\epsilon}_p\right)}\right) \quad (6)$$
The approximation weight, $\rho$, and welfare weight, $\mu$, are hyperparameters that control the balance of the optimization. The maximum approximation parameter, $\epsilon^{+}_p$, is another constant that is usually chosen to be equal to the payoff scale (Section 4.1). The approximation term is designed to have a similar form to the relative entropy and is maximum when $\hat\epsilon_p = \epsilon_p$. We refer to this equilibrium selection framework as Target Approximate Maximum Welfare Minimum Relative Entropy ($\hat\epsilon$-MWMRE).
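As a concrete reading of Equation (6), the sketch below (illustrative, and relying on the form of the approximation term as reconstructed above; all names are ours) evaluates the composite objective for a candidate joint $\sigma$ and per-player approximations $\epsilon$, given targets and weights.

```python
# Illustrative sketch: evaluate the composite MWMRE selection objective (Equation (6)).
import numpy as np

def mwmre_objective(sigma, eps, W, sigma_hat, eps_hat, eps_plus, mu, rho):
    welfare = mu * np.sum(sigma * W)                            # welfare term
    rel_entropy = np.sum(sigma * np.log(sigma / sigma_hat))     # KL(sigma || sigma_hat)
    gap = eps_plus - eps                                        # per-player distance to the cap
    gap_hat = eps_plus - eps_hat
    approx = rho * np.sum(gap * np.log(gap / (np.e * gap_hat))) # approximation term
    return welfare - rel_entropy - approx                       # maximized over (sigma, eps)

# 2x2 example: uniform joint and target, zero welfare weights, exact equilibria targeted.
sigma = np.full((2, 2), 0.25)
W = np.zeros((2, 2))
eps, eps_hat, eps_plus = np.zeros(2), np.zeros(2), np.ones(2)
print(mwmre_objective(sigma, eps, W, sigma, eps_hat, eps_plus, mu=1.0, rho=1.0))
```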
3.1 Dual of $\hat\epsilon$-MWMRE (C)CEs
Rather than performing a constrained optimization, it is easier to solve the dual problem, $\arg\min_{\alpha_p} L^{(C)CE}$ (derived in Section A), where $\alpha^{CE}_p(a'_p, a''_p) \ge 0$ are the dual deviation gains corresponding to the CE constraints, and $\alpha^{CCE}_p(a'_p) \ge 0$ are the dual deviation gains corresponding to the CCE constraints. Note that we do not need to optimize over the primal joint, $\sigma(a)$. Choosing one of the elements in the curly brackets, the Lagrangian is defined:
$$L^{(C)CE} = \ln\!\left(\sum_{a \in \mathcal{A}} \hat{\sigma}(a) \exp\!\big(l^{(C)CE}(a)\big)\right) + \sum_p \epsilon^{+}_p \left\{\sum_{a'_p, a''_p} \alpha^{CE}_p(a'_p, a''_p),\; \sum_{a'_p} \alpha^{CCE}_p(a'_p)\right\} - \rho \sum_p \epsilon^{(C)CE}_p \quad (7)$$
The logits $l^{(C)CE}(a)$ are defined:

$$l^{(C)CE}(a) = \mu W(a) - \sum_p \left\{\sum_{a'_p, a''_p} \alpha^{CE}_p(a'_p, a''_p)\, A^{CE}_p(a'_p, a''_p, a),\; \sum_{a'_p} \alpha^{CCE}_p(a'_p)\, A^{CCE}_p(a'_p, a)\right\} \quad (8)$$
The primal joint and primal approximation parameters are defined:

$$\sigma^{(C)CE}(a) = \frac{\hat{\sigma}(a)\exp\big(l^{(C)CE}(a)\big)}{\sum_{a \in \mathcal{A}} \hat{\sigma}(a)\exp\big(l^{(C)CE}(a)\big)} \quad (9)$$

$$\epsilon^{(C)CE}_p = \left(\hat{\epsilon}_p - \epsilon^{+}_p\right) \exp\!\left(-\frac{1}{\rho}\left\{\sum_{a'_p, a''_p} \alpha^{CE}_p(a'_p, a''_p),\; \sum_{a'_p} \alpha^{CCE}_p(a'_p)\right\}\right) + \epsilon^{+}_p \quad (10)$$
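The sketch below (CCE case only, illustrative names, following Equations (8)-(10) as reconstructed above) recovers the primal joint and approximations analytically from a set of dual deviation gains.

```python
# Illustrative sketch: recover sigma(a) and eps_p from CCE dual deviation gains.
import numpy as np

def cce_primal_from_duals(G, alpha, W, sigma_hat, eps_hat, eps_plus, mu, rho):
    """G: [N, |A_1|, ..., |A_N|]; alpha[p]: nonnegative duals of shape [|A_p|]."""
    n_p = G.shape[0]
    logits = mu * W.astype(float)                       # l(a), accumulated over players
    for p in range(n_p):
        base = G[p]                                     # G_p(a)
        for a_dev in range(G.shape[1 + p]):
            sl = [slice(None)] * n_p
            sl[p] = a_dev
            dev = np.expand_dims(base[tuple(sl)], axis=p)  # G_p(a'_p, a_{-p}), broadcast over a_p
            logits -= alpha[p][a_dev] * (dev - base)       # subtract alpha * A^CCE_p(a'_p, a)
    weights = sigma_hat * np.exp(logits)
    sigma = weights / weights.sum()                        # Equation (9)
    eps = np.array([(eps_hat[p] - eps_plus[p]) * np.exp(-alpha[p].sum() / rho) + eps_plus[p]
                    for p in range(n_p)])                  # Equation (10)
    return sigma, eps

# Example: zero duals reproduce the target joint and the target approximation.
G = np.array([[[1.0, -1.0], [-1.0, 1.0]], [[-1.0, 1.0], [1.0, -1.0]]])
sigma, eps = cce_primal_from_duals(
    G, alpha=[np.zeros(2), np.zeros(2)], W=np.zeros((2, 2)),
    sigma_hat=np.full((2, 2), 0.25), eps_hat=np.zeros(2), eps_plus=np.ones(2),
    mu=0.0, rho=1.0)
print(sigma)   # uniform joint (= sigma_hat)
print(eps)     # equals eps_hat when the duals are zero
```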
4 Neural Network Training
The network maps the payoffs of a game, $G_p(a)$, and the targets ($\hat\sigma(a)$, $\hat\epsilon_p$, $W(a)$) to the dual deviation gains, $\alpha^{(C)CE}_p$, that define the equilibrium. The duals are a significantly more space-efficient objective target ($\sum_p |\mathcal{A}_p|^2$ for CEs and $\sum_p |\mathcal{A}_p|$ for CCEs) than the full joint ($\prod_p |\mathcal{A}_p|$), particularly when scaling the number of strategies and players. The joint, $\sigma(a)$, and approximation, $\epsilon_p$, can be computed analytically from the dual deviation gains and the inputs using Equations (9) and (10). The network is trained by minimizing the loss, $L^{(C)CE}$ (Equation (7)). We call the resulting architecture a Neural Equilibrium Solver (NES).
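For intuition on the size difference, a quick calculation (illustrative numbers) for a game with N players and k strategies each:

```python
# Dual target sizes versus the full joint, for an N-player game with k strategies each.
N, k = 4, 8
ce_duals = N * k * k          # sum_p |A_p|^2
cce_duals = N * k             # sum_p |A_p|
full_joint = k ** N           # prod_p |A_p|
print(ce_duals, cce_duals, full_joint)   # 256 32 4096
```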
4.1 Input and Preprocessing
The MRE objective and the (C)CE constraints are invariant to payoff offset and positive scaling. Therefore we can assume that the payoffs have been standardized without loss of generality. Each player's payoff should have zero mean. Furthermore, it should be scaled such that the $L_m = \|\cdot\|_m$ norm of the payoff tensor equals $Z_m$, where $Z_m$ is a scale hyperparameter chosen such that the elements of the inputs have unit variance (a property that ensures neural networks train quickly with standard parameter initializations [18]). We ensure both these properties by including a preprocessing layer in NES. The preprocessed inputs ($G_p(a)$, $\hat\sigma(a)$, $\hat\epsilon_p$, $W(a)$) are then broadcast and concatenated together so that they result in an input of shape $[C, N, |\mathcal{A}_1|, \dots, |\mathcal{A}_N|]$, where the channel dimension $C = 4$ if all inputs are required.

Figure 1: Diagrams for three 2×2 normal-form games ((a) Traffic Lights, (b) Pure Coordination, (c) Prisoner's Dilemma), showing their (C)CE solution polytope on the joint simplex (in two-strategy games CEs and CCEs are equivalent). An MWME NES, trained by sampling over the space of payoffs and welfare targets, is used to approximate the MW(C)CE solution (×). An MRE NES, trained by sampling over the space of payoffs and joint targets, is used to approximate the ME(C)CE (+), and all pure joint target MRE(C)CEs. The networks have never been trained on these games.
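Returning to the standardization step above, a minimal sketch (illustrative; the paper implements this as a preprocessing layer inside NES):

```python
# Illustrative sketch: zero-mean each player's payoffs, then rescale so the L2 norm
# of each player's payoff tensor equals Z_2 = sqrt(|A|), giving unit-variance elements.
import numpy as np

def standardize_payoffs(G):
    """G: payoff tensor of shape [N, |A_1|, ..., |A_N|]."""
    n_p = G.shape[0]
    num_joints = np.prod(G.shape[1:])                        # |A|
    player_axes = tuple(range(1, G.ndim))
    G = G - G.mean(axis=player_axes, keepdims=True)          # zero mean per player
    norm = np.linalg.norm(G.reshape(n_p, -1), axis=1)        # per-player L2 norm
    scale = np.sqrt(num_joints) / np.maximum(norm, 1e-12)    # Z_2 = sqrt(|A|)
    return G * scale.reshape((n_p,) + (1,) * (G.ndim - 1))

G_std = standardize_payoffs(np.random.normal(size=(2, 3, 3)))
print(G_std.reshape(2, -1).mean(axis=1))                     # ~0 per player
print(np.linalg.norm(G_std.reshape(2, -1), axis=1))          # sqrt(9) = 3 per player
```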
4.2 Training Distribution
The literature favours sampling games from the uniform or normal distribution. This introduces two problems: (i) it biases the distribution of games solvable by the network, and (ii) it unnecessarily requires the network to learn offset and scale invariance in the payoffs. Recall that the space of equilibria is invariant to offset and positive scale transformations of the payoff. Zero-mean and $L_2$ norm scaling geometrically results in the surface of an $(|\mathcal{A}| - 1)$-ball centered on the origin (Section D.4). We propose using this invariant subspace for training. We choose the norm scaling constant, $Z_2 = \sqrt{|\mathcal{A}|}$, such that the elements of the payoffs maintain unit variance.
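A minimal sketch (illustrative) of sampling a batch of training payoffs from this invariant subspace: isotropic Gaussian samples are projected to zero mean per player and rescaled onto the sphere of $L_2$ radius $\sqrt{|\mathcal{A}|}$, which yields a uniform distribution over that sphere.

```python
# Illustrative sketch: sample training payoffs on the zero-mean, fixed-norm sphere.
import numpy as np

def sample_training_payoffs(batch_size, shape, rng):
    """shape: (N, |A_1|, ..., |A_N|). Returns payoffs of shape (batch_size,) + shape."""
    n_p, num_joints = shape[0], int(np.prod(shape[1:]))
    G = rng.normal(size=(batch_size, n_p, num_joints))
    G -= G.mean(axis=-1, keepdims=True)                       # zero mean per player
    G *= np.sqrt(num_joints) / np.linalg.norm(G, axis=-1, keepdims=True)
    return G.reshape((batch_size,) + shape)

rng = np.random.default_rng(0)
batch = sample_training_payoffs(batch_size=4, shape=(2, 3, 3), rng=rng)
print(batch.shape)  # (4, 2, 3, 3)
```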
4.3 Gradient Calculation
The gradient update is found by taking the derivative of the loss (Equation (7)) with respect to the dual variables, $\alpha$. Note that computing a gradient does not require knowing the optimal joint $\sigma(a)$, so the network can be trained in an unsupervised fashion, from randomly generated inputs, $G_p(a)$, $\hat\sigma(a)$, $\hat\epsilon_p$, and $W(a)$.
$$\frac{\partial L^{(C)CE}}{\partial \left\{\alpha^{CE}_p(a'_p, a''_p),\, \alpha^{CCE}_p(a'_p)\right\}} = \epsilon^{(C)CE}_p - \sum_{a \in \mathcal{A}} \left\{A^{CE}_p(a'_p, a''_p, a),\, A^{CCE}_p(a'_p, a)\right\} \sigma^{(C)CE}(a) \quad (11)$$
The dual variables, $\{\alpha^{CE}_p(a'_p, a''_p), \alpha^{CCE}_p(a'_p)\}$, are outputs of the neural network, with learned parameters $\theta$. Gradients for these parameters can be derived using the chain rule:

$$\frac{\partial L^{(C)CE}}{\partial \theta} = \frac{\partial L^{(C)CE}}{\partial \left\{\alpha^{CE}_p(a'_p, a''_p),\, \alpha^{CCE}_p(a'_p)\right\}} \cdot \frac{\partial \left\{\alpha^{CE}_p(a'_p, a''_p),\, \alpha^{CCE}_p(a'_p)\right\}}{\partial \theta}$$
Backprop efficiently calculates these gradients, and many powerful neural network optimizers
[57, 34, 15] and ML frameworks [1, 5, 49] can be leveraged to update the network parameters.
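For the CCE case, the analytic gradient in Equation (11) can be formed directly from the recovered $\sigma$ and $\epsilon$ (see the cce_primal_from_duals sketch above), with no labelled equilibria required; a minimal sketch with illustrative names:

```python
# Illustrative sketch: unsupervised gradient of the loss w.r.t. the CCE duals (Equation (11)).
import numpy as np

def cce_dual_gradient(G, sigma, eps):
    """Returns a list: grads[p][a_dev] = eps_p - E_sigma[A^CCE_p(a_dev, a)]."""
    n_p = G.shape[0]
    grads = []
    for p in range(n_p):
        base_payoff = np.sum(sigma * G[p])               # E_sigma[G_p(a)]
        marg = np.sum(sigma, axis=p)                     # sigma(a_{-p})
        g_p = np.empty(G.shape[1 + p])
        for a_dev in range(G.shape[1 + p]):
            sl = [slice(None)] * n_p
            sl[p] = a_dev
            dev_payoff = np.sum(marg * G[p][tuple(sl)])  # E_sigma[G_p(a'_p, a_{-p})]
            g_p[a_dev] = eps[p] - (dev_payoff - base_payoff)
        grads.append(g_p)
    return grads
```

Descending this gradient (with the duals kept nonnegative), or backpropagating it through the network outputs to the parameters $\theta$, then minimizes $L^{(C)CE}$ without ever constructing a (game, equilibrium) training pair.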