Turbocharging Solution Concepts: Solving NEs, CEs
and CCEs with Neural Equilibrium Solvers
Luke Marris
DeepMind (marris@deepmind.com), UCL
Ian Gemp
DeepMind
Thomas Anthony
DeepMind
Andrea Tacchetti
DeepMind
Siqi Liu
DeepMind, UCL
Karl Tuyls
DeepMind
Abstract
Solution concepts such as Nash Equilibria, Correlated Equilibria, and Coarse
Correlated Equilibria are useful components for many multiagent machine learning
algorithms. Unfortunately, solving a normal-form game could take prohibitive
or non-deterministic time to converge, and could fail. We introduce the Neural
Equilibrium Solver which utilizes a special equivariant neural network architecture
to approximately solve the space of all games of fixed shape, buying speed and
determinism. We define a flexible equilibrium selection framework that is capable
of uniquely selecting an equilibrium that minimizes relative entropy or maximizes
welfare. The network is trained without needing to generate any supervised training
data. We show remarkable zero-shot generalization to larger games. We argue that
such a network is a powerful component for many possible multiagent algorithms.
1 Introduction
Normal-form solution concepts such as Nash Equilibrium (NE) [28, 46], Correlated Equilibrium (CE) [3], and Coarse Correlated Equilibrium (CCE) [45] are useful components and subroutines for many multiagent machine learning algorithms. For example, value-based reinforcement learning algorithms for solving Markov games, such as Nash Q-learning [27] and Correlated Q-Learning [19], maintain state-action values for every player in the game. These action values are equivalent to per-state normal-form games, and policies are equilibrium solutions of these games. Critically, this policy will need to be recomputed each time the action-value is updated during training, and, for large or continuous state-space Markov games, every time the agents need to take an action. Another class of multiagent algorithms are those in the space of Empirical Game Theoretic Analysis (EGTA) [60, 61], including PSRO [35, 44], JPSRO [40], and NeuPL [39, 38]. These algorithms are capable of training policies in extensive-form games, and require finding equilibria of empirically estimated normal-form games as a subroutine (the "meta-solver" step). In particular, these algorithms have been critical in driving agents to superhuman performance in Go [54], Chess [55], and StarCraft [59].
Unfortunately, solving for an equilibrium can be computationally complex. Computing NEs is known to be PPAD-complete [9, 8]. (C)CEs are defined by linear constraints and, if a linear objective is used to select an equilibrium, can be solved by linear programs (LPs) in polynomial time. However, in general, the solutions to LPs are non-unique (e.g. in zero-sum games), and therefore are unsuitable equilibrium selection methods for many algorithms, and unsuitable for training neural networks, which benefit from unambiguous targets. Objectives such as Maximum Gini [41] (a quadratic program (QP)) and Maximum Entropy [48] (a nonlinear program) are unique but are more complex to solve.
As a result, solving for equilibria often requires deploying iterative solvers, which theoretically can
scale to large normal-form games but may (i) take an unpredictable amount of time to converge, (ii)
take a prohibitive amount of time to do so, and (iii) may fail unpredictably on ill-conditioned problems.
Furthermore, classical methods [16, 36, 43] (i) do not scale, and (ii) are non-differentiable. This
limits the applicability of equilibrium solution concepts in multiagent machine learning algorithms.
Therefore, there exists an important niche for approximately solving equilibria in medium sized
normal-form games, quickly, in batches, reliably, and in a deterministic amount of time. With
appropriate care, this goal can be accomplished with a Neural Network which amortizes up-front
training cost to map normal-form payoffs to equilibrium solution concepts quickly at test time. We
propose the Neural Equilibrium Solver (NES). This network is trained to optimize a composite
objective function that weights accuracy of the returned equilibrium against auxiliary objectives
that a user may desire such as maximum entropy, maximum welfare, or minimum distance to some
target distribution. We introduce several innovations into the design and training of NES so that
it is efficient and accurate. Unlike most supervised deep learning models, NES avoids the need to
explicitly construct a labeled dataset of (game, equilibrium) pairs. Instead we derive a loss function
that can be minimized in an unsupervised fashion from only game inputs. We also exploit the
duality of the equilibrium problem. Instead of solving for equilibria in the primal space, we train
NES to solve for them in the dual space, which has a much smaller representation. We utilize a
training distribution that efficiently represents the space of all normal-form games of a desired shape
and use an invariant preprocessing step to map games at test time to this space. In terms of the
network architecture, we design a series of layers that are equivariant to symmetries in games such
as permutations of players and strategies, which reduces the number of training steps and improves
generalization performance. The network architecture is independent of the number of strategies in
the game and we show interesting zero-shot generalization to larger games. This network can either
be pretrained before being deployed, trained online alongside another machine learning algorithm, or
a mixture of both.
2 Preliminaries
Game Theory
Game theory is the study of the interactive behaviour of rational, payoff-maximizing agents in the presence of other agents. The environment that the agents operate in is called a game. We focus on a particular type of single-shot, simultaneous-move game called a normal-form game. A normal-form game consists of $N$ players, a set of strategies available to each player, $a_p \in \mathcal{A}_p$, and a payoff for each player under a particular joint action, $G_p(a)$, where $a = (a_1, \dots, a_N) = (a_p, a_{-p}) \in \mathcal{A} = \prod_p \mathcal{A}_p$. The subscript notation $-p$ is used to mean "all players apart from player $p$". Games are sometimes referred to by their shape, for example $|\mathcal{A}_1| \times \dots \times |\mathcal{A}_N|$. The distribution of play is described by a joint $\sigma(a)$. The goal of each player is to maximize their expected payoff, $\sum_{a \in \mathcal{A}} \sigma(a) G_p(a)$. Players could play independently by selecting a strategy according to their marginal $\sigma(a_p)$ over joint strategies, such that $\sigma(a) = \prod_p \sigma(a_p)$. However, this is limiting because it does not allow players to coordinate. A mediator called a correlation device could be employed to allow players to execute arbitrary joint strategies $\sigma(a)$ that do not necessarily factorize into their marginals. Such a mediator would sample from a publicly known joint $\sigma(a)$ and secretly communicate to each player their recommended strategy. Game theory is most developed in a subset of games: those with two players and a restriction on the payoffs, $G_1(a_1, a_2) = -G_2(a_1, a_2)$, known as zero-sum. Particularly in N-player, general-sum games, it is difficult to define a single criterion to find solutions to games. One approach is instead to consider joints that are in equilibrium: distributions such that no player has incentive to unilaterally deviate from a recommendation.
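As a concrete illustration of this notation, the sketch below (illustrative code, not from the paper; the payoff values are arbitrary examples) stores a 2×2 normal-form game as a payoff tensor and computes each player's expected payoff under a correlated joint and under a factorized (independent) one.

```python
# Illustrative sketch: payoff tensor G[p, a_1, a_2] for a 2x2 game and expected
# payoffs under a joint distribution sigma(a). Payoff values are arbitrary examples.
import numpy as np

G = np.array([[[3.0, 0.0],
               [5.0, 1.0]],    # player 1's payoffs G_1(a_1, a_2)
              [[3.0, 5.0],
               [0.0, 1.0]]])   # player 2's payoffs G_2(a_1, a_2)

sigma = np.array([[0.5, 0.0],
                  [0.0, 0.5]])                     # a correlated joint over A_1 x A_2
expected = np.tensordot(G, sigma, axes=([1, 2], [0, 1]))
print(expected)                                    # expected payoff per player

mu_1, mu_2 = np.array([0.5, 0.5]), np.array([0.5, 0.5])
sigma_indep = np.outer(mu_1, mu_2)                 # factorized joint: sigma(a) = prod_p sigma(a_p)
print(np.tensordot(G, sigma_indep, axes=([1, 2], [0, 1])))
```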
Equilibrium Solution Concepts
Correlated Equilibria (CEs) [3] can be defined in terms of linear inequality constraints. The deviation gain of a player is the change in payoff the player achieves when deviating to action $a'_p$ from a recommended action $a''_p$, when the other players play $a_{-p}$.

$$A^{CE}_p(a'_p, a''_p, a) = A^{CE}_p(a'_p, a''_p, a_p, a_{-p}) = \begin{cases} G_p(a'_p, a_{-p}) - G_p(a''_p, a_{-p}) & a_p = a''_p \\ 0 & \text{otherwise} \end{cases} \quad (1)$$

A distribution, $\sigma(a)$, is in $\epsilon$-CE if the deviation gain is no more than some constant $\epsilon_p$ for every pair of recommendation, $a''_p$, and deviation strategies, $a'_p$, for every player, $p$. These linear constraints geometrically form a convex polytope of feasible solutions.

$$\epsilon\text{-CE:}\quad \sum_{a \in \mathcal{A}} \sigma(a)\, A^{CE}_p(a'_p, a''_p, a) \le \epsilon_p \quad \forall\, p \in [1, N],\; a''_p \ne a'_p \in \mathcal{A}_p \quad (2)$$
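To make Equation (2) concrete, the sketch below (illustrative, not the paper's code) computes the largest expected CE deviation gain of a joint $\sigma$; $\sigma$ is an exact CE when this quantity is at most zero, and an $\epsilon$-CE when it is at most $\epsilon_p$ for each player. The example game is a Chicken-style coordination game with assumed payoffs.

```python
# Illustrative sketch: largest expected CE deviation gain (Equations (1)-(2)).
import numpy as np

def max_ce_gain(G, sigma):
    """G: [N, |A_1|, ..., |A_N|] payoffs; sigma: joint over A. Returns the largest
    expected gain from switching a recommendation a''_p to a deviation a'_p."""
    n_p = G.shape[0]
    worst = -np.inf
    for p in range(n_p):
        for a_rec in range(G.shape[1 + p]):           # recommended strategy a''_p
            for a_dev in range(G.shape[1 + p]):       # deviation strategy a'_p
                if a_dev == a_rec:
                    continue
                sl_rec = [slice(None)] * n_p; sl_rec[p] = a_rec
                sl_dev = [slice(None)] * n_p; sl_dev[p] = a_dev
                # sigma restricted to joints where p is recommended a_rec,
                # times the payoff change from playing a_dev instead.
                gain = np.sum(sigma[tuple(sl_rec)] *
                              (G[p][tuple(sl_dev)] - G[p][tuple(sl_rec)]))
                worst = max(worst, gain)
    return worst

# Chicken-style game: half weight on each asymmetric outcome is a classic exact CE.
G = np.array([[[-10.0, 2.0], [1.0, 0.0]],
              [[-10.0, 1.0], [2.0, 0.0]]])
sigma = np.array([[0.0, 0.5], [0.5, 0.0]])
print(max_ce_gain(G, sigma))  # <= 0, so sigma is an exact CE
```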
Coarse Correlated Equilibria (CCEs) [45] are similar to CEs but a player may only consider deviating before receiving a recommended strategy. Therefore the deviation gain for CCEs can be derived from the CE definition by summing over all possible recommended strategies $a''_p$.

$$A^{CCE}_p(a'_p, a) = \sum_{a''_p \in \mathcal{A}_p} A^{CE}_p(a'_p, a''_p, a) = G_p(a'_p, a_{-p}) - G_p(a) \quad (3)$$

A distribution, $\sigma(a)$, is in $\epsilon$-CCE if the deviation gain is no more than some constant $\epsilon_p$ for every deviation strategy, $a'_p$, and for every player, $p$.

$$\epsilon\text{-CCE:}\quad \sum_{a \in \mathcal{A}} \sigma(a)\, A^{CCE}_p(a'_p, a) \le \epsilon_p \quad \forall\, p \in [1, N],\; a'_p \in \mathcal{A}_p \quad (4)$$
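Analogously, a minimal sketch (illustrative) of the $\epsilon$-CCE check in Equation (4): the deviation $a'_p$ is fixed before any recommendation, so its gain is measured against the player's expected payoff under the full joint.

```python
# Illustrative sketch: largest expected CCE deviation gain (Equations (3)-(4)).
import numpy as np

def max_cce_gain(G, sigma):
    """Largest expected gain from any fixed deviation a'_p, for any player p."""
    n_p = G.shape[0]
    worst = -np.inf
    for p in range(n_p):
        base = np.sum(sigma * G[p])              # expected payoff of following sigma
        marg = np.sum(sigma, axis=p)             # marginal over the other players, sigma(a_{-p})
        for a_dev in range(G.shape[1 + p]):
            sl = [slice(None)] * n_p
            sl[p] = a_dev
            dev = np.sum(marg * G[p][tuple(sl)]) # expected payoff of always playing a_dev
            worst = max(worst, dev - base)
    return worst

# Matching Pennies: the uniform joint is a CCE (indeed an NE), so the maximum gain is 0.
G = np.array([[[1.0, -1.0], [-1.0, 1.0]],
              [[-1.0, 1.0], [1.0, -1.0]]])
print(max_cce_gain(G, np.full((2, 2), 0.25)))    # 0.0
```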
NEs [46] have similar definitions to CCEs but have an extra constraint: the joint distribution factorizes, $\prod_p \sigma(a_p) = \sigma(a)$, resulting in nonlinear constraints¹.

$$\epsilon\text{-NE:}\quad \sum_{a \in \mathcal{A}} \Big[\prod_q \sigma(a_q)\Big] A^{CCE}_p(a'_p, a) \le \epsilon_p \quad \forall\, p \in [1, N],\; a'_p \in \mathcal{A}_p \quad (5)$$

Note that the definition of the NE uses the same deviation gain as the CCE definition. Another remarkable fact is that the marginals of any joint CCE in two-player constant-sum games, $\sigma(a_p) = \sum_{a_{-p}} \sigma(a)$, form an NE when $\epsilon_p = 0$. Therefore we can use CCE machinery to solve for NEs in such classes of games.

¹This is why NEs are harder to compute than (C)CEs.
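As a quick numerical illustration of the marginalization fact (a sketch; Matching Pennies serves as the two-player zero-sum example):

```python
# Illustrative sketch: marginals of a CCE joint in a two-player zero-sum game form an NE.
import numpy as np

sigma = np.full((2, 2), 0.25)      # uniform joint: an exact CCE of Matching Pennies
mu_1 = sigma.sum(axis=1)           # player 1's marginal, sigma(a_1)
mu_2 = sigma.sum(axis=0)           # player 2's marginal, sigma(a_2)
print(mu_1, mu_2)                  # [0.5 0.5] [0.5 0.5]: the unique NE of Matching Pennies
```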
When a distribution is in equilibrium, no player has incentive to unilaterally deviate from it to achieve a better payoff. There can however be many equilibria in a game; choosing amongst these is known as the equilibrium selection problem [20]. For (C)CEs, the valid solution space is convex (Figure 1), so any strictly convex function will suffice (in particular, Maximum Entropy (ME) [48] and Maximum Gini (MG) [40] have been proposed). For NEs, the solution space is convex for only certain classes of games such as two-player constant-sum games. Indeed, NEs are considered fundamental in this class, where they have powerful properties, such as being unexploitable, interchangeable, and tractable to compute. However, for N-player general-sum games, (C)CEs may be more suitable as they remain tractable and permit coordination between players, which results in higher-welfare equilibria. If $\epsilon_p \ge 0$, there must exist at least one NE, CE, and CCE for any finite game, because an NE always exists and NE ⊆ CE ⊆ CCE. Learning NEs [44] and CCEs [21] on a single game is well studied.
Neural Network Solvers
Approximating NEs using neural networks is known to be agnostic PAC learnable [14]. There is also work learning (C)CEs [4, 32] and training neural networks to approximate NEs [14, 22] on subclasses of games. Learned NE meta-solvers have been deployed in PSRO [17]. Differentiable neural networks have been developed to learn QREs [37]. NEs for contextual games have been learned using fixed point (deep equilibrium) networks [24]. A related field, L2O [7], aims to learn an iterative optimizer more suited to a particular distribution of inputs, while this work focuses on learning a direct mapping. To the best of our knowledge, no work exists training a general approximate mapping from the full space of games to (C)CEs with flexible selection criteria.
3 Maximum Welfare Minimum Relative Entropy (C)CEs
Previous work has argued that having a unique objective to solve for equilibrium selection is important. The principle of maximum entropy [30] has been used to find unique equilibria [48]. In maximum entropy selection, payoffs are ignored and selection is based on minimizing the distance to the uniform distribution. This has two interesting properties: (i) it makes defining unique solutions easy, and (ii) the solution is invariant to transformations (such as offset and positive scaling) of the payoff tensor. While these solutions are unique, they can result in weak and low-payoff equilibria because they find solutions on the boundary of the polytope. Meanwhile, the literature tends to favour Maximum Welfare (MW) because it results in high value for the agents and is a linear objective; however, in general it is not unique. We consider a composite objective function composed of (i) Minimum Relative Entropy (MRE, also known as Kullback-Leibler divergence) between a target joint, $\hat\sigma(a)$, and the equilibrium joint, $\sigma(a)$, (ii) distance between a target approximation, $\hat\epsilon_p$, and the equilibrium approximation, $\epsilon_p$, and (iii) the maximum of a linear objective, $W(a)$. The objective is constrained by (i) the distribution constraints ($\sum_{a} \sigma(a) = 1$ and $\sigma(a) \ge 0$) and (ii) either the CCE constraints (Equation (4)) or the CE constraints (Equation (2)).
$$\arg\max_{\sigma,\,\epsilon_p} \;\; \mu \sum_{a \in \mathcal{A}} \sigma(a) W(a) \;-\; \sum_{a \in \mathcal{A}} \sigma(a) \ln\frac{\sigma(a)}{\hat{\sigma}(a)} \;-\; \rho \sum_p \left(\epsilon^{+}_p - \epsilon_p\right) \ln\!\left(\frac{\epsilon^{+}_p - \epsilon_p}{\exp(1)\left(\epsilon^{+}_p - \hat{\epsilon}_p\right)}\right) \quad (6)$$
The approximation weight, $\rho$, and welfare weight, $\mu$, are hyperparameters that control the balance of the optimization. The maximum approximation parameter, $\epsilon^{+}_p$, is another constant that is usually chosen to be equal to the payoff scale (Section 4.1). The approximation term is designed to have a similar form to the relative entropy and is maximum when $\hat\epsilon_p = \epsilon_p$. We refer to this equilibrium selection framework as Target Approximate Maximum Welfare Minimum Relative Entropy ($\hat\epsilon$-MWMRE).
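As a concrete reading of Equation (6), the sketch below (illustrative, and relying on the form of the approximation term as reconstructed above; all names are ours) evaluates the composite objective for a candidate joint $\sigma$ and per-player approximations $\epsilon$, given targets and weights.

```python
# Illustrative sketch: evaluate the composite MWMRE selection objective (Equation (6)).
import numpy as np

def mwmre_objective(sigma, eps, W, sigma_hat, eps_hat, eps_plus, mu, rho):
    welfare = mu * np.sum(sigma * W)                            # welfare term
    rel_entropy = np.sum(sigma * np.log(sigma / sigma_hat))     # KL(sigma || sigma_hat)
    gap = eps_plus - eps                                        # per-player distance to the cap
    gap_hat = eps_plus - eps_hat
    approx = rho * np.sum(gap * np.log(gap / (np.e * gap_hat))) # approximation term
    return welfare - rel_entropy - approx                       # maximized over (sigma, eps)

# 2x2 example: uniform joint and target, zero welfare weights, exact equilibria targeted.
sigma = np.full((2, 2), 0.25)
W = np.zeros((2, 2))
eps, eps_hat, eps_plus = np.zeros(2), np.zeros(2), np.ones(2)
print(mwmre_objective(sigma, eps, W, sigma, eps_hat, eps_plus, mu=1.0, rho=1.0))
```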
3.1 Dual of $\hat\epsilon$-MWMRE (C)CEs
Rather than performing a constrained optimization, it is easier to solve the dual problem, $\arg\min_{\alpha_p} L^{(C)CE}$ (derived in Section A), where $\alpha^{CE}_p(a'_p, a''_p) \ge 0$ are the dual deviation gains corresponding to the CE constraints, and $\alpha^{CCE}_p(a'_p) \ge 0$ are the dual deviation gains corresponding to the CCE constraints. Note that we do not need to optimize over the primal joint, $\sigma(a)$. Choosing one of the elements in the curly brackets, the Lagrangian is defined:
$$L^{(C)CE} = \ln\!\left(\sum_{a \in \mathcal{A}} \hat{\sigma}(a) \exp\!\big(l^{(C)CE}(a)\big)\right) + \sum_p \epsilon^{+}_p \left\{\sum_{a'_p, a''_p} \alpha^{CE}_p(a'_p, a''_p),\; \sum_{a'_p} \alpha^{CCE}_p(a'_p)\right\} - \rho \sum_p \epsilon^{(C)CE}_p \quad (7)$$
The logits $l^{(C)CE}(a)$ are defined:

$$l^{(C)CE}(a) = \mu W(a) - \sum_p \left\{\sum_{a'_p, a''_p} \alpha^{CE}_p(a'_p, a''_p)\, A^{CE}_p(a'_p, a''_p, a),\; \sum_{a'_p} \alpha^{CCE}_p(a'_p)\, A^{CCE}_p(a'_p, a)\right\} \quad (8)$$
The primal joint and primal approximation parameters are defined:

$$\sigma^{(C)CE}(a) = \frac{\hat{\sigma}(a)\exp\big(l^{(C)CE}(a)\big)}{\sum_{a \in \mathcal{A}} \hat{\sigma}(a)\exp\big(l^{(C)CE}(a)\big)} \quad (9)$$

$$\epsilon^{(C)CE}_p = \left(\hat{\epsilon}_p - \epsilon^{+}_p\right) \exp\!\left(-\frac{1}{\rho}\left\{\sum_{a'_p, a''_p} \alpha^{CE}_p(a'_p, a''_p),\; \sum_{a'_p} \alpha^{CCE}_p(a'_p)\right\}\right) + \epsilon^{+}_p \quad (10)$$
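The sketch below (CCE case only, illustrative names, following Equations (8)-(10) as reconstructed above) recovers the primal joint and approximations analytically from a set of dual deviation gains.

```python
# Illustrative sketch: recover sigma(a) and eps_p from CCE dual deviation gains.
import numpy as np

def cce_primal_from_duals(G, alpha, W, sigma_hat, eps_hat, eps_plus, mu, rho):
    """G: [N, |A_1|, ..., |A_N|]; alpha[p]: nonnegative duals of shape [|A_p|]."""
    n_p = G.shape[0]
    logits = mu * W.astype(float)                       # l(a), accumulated over players
    for p in range(n_p):
        base = G[p]                                     # G_p(a)
        for a_dev in range(G.shape[1 + p]):
            sl = [slice(None)] * n_p
            sl[p] = a_dev
            dev = np.expand_dims(base[tuple(sl)], axis=p)  # G_p(a'_p, a_{-p}), broadcast over a_p
            logits -= alpha[p][a_dev] * (dev - base)       # subtract alpha * A^CCE_p(a'_p, a)
    weights = sigma_hat * np.exp(logits)
    sigma = weights / weights.sum()                        # Equation (9)
    eps = np.array([(eps_hat[p] - eps_plus[p]) * np.exp(-alpha[p].sum() / rho) + eps_plus[p]
                    for p in range(n_p)])                  # Equation (10)
    return sigma, eps

# Example: zero duals reproduce the target joint and the target approximation.
G = np.array([[[1.0, -1.0], [-1.0, 1.0]], [[-1.0, 1.0], [1.0, -1.0]]])
sigma, eps = cce_primal_from_duals(
    G, alpha=[np.zeros(2), np.zeros(2)], W=np.zeros((2, 2)),
    sigma_hat=np.full((2, 2), 0.25), eps_hat=np.zeros(2), eps_plus=np.ones(2),
    mu=0.0, rho=1.0)
print(sigma)   # uniform joint (= sigma_hat)
print(eps)     # equals eps_hat when the duals are zero
```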
4 Neural Network Training
The network maps the payoffs of a game, $G_p(a)$, and the targets ($\hat\sigma(a)$, $\hat\epsilon_p$, $W(a)$) to the dual deviation gains, $\alpha^{(C)CE}_p$, that define the equilibrium. The duals are a significantly more space-efficient objective target ($\sum_p |\mathcal{A}_p|^2$ for CEs and $\sum_p |\mathcal{A}_p|$ for CCEs) than the full joint ($\prod_p |\mathcal{A}_p|$), particularly when scaling the number of strategies and players. The joint, $\sigma(a)$, and approximation, $\epsilon_p$, can be computed analytically from the dual deviation gains and the inputs using Equations (9) and (10). The network is trained by minimizing the loss, $L^{(C)CE}$ (Equation (7)). We call the resulting architecture a Neural Equilibrium Solver (NES).
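For intuition on the size difference, a quick calculation (illustrative numbers) for a game with N players and k strategies each:

```python
# Dual target sizes versus the full joint, for an N-player game with k strategies each.
N, k = 4, 8
ce_duals = N * k * k          # sum_p |A_p|^2
cce_duals = N * k             # sum_p |A_p|
full_joint = k ** N           # prod_p |A_p|
print(ce_duals, cce_duals, full_joint)   # 256 32 4096
```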
4.1 Input and Preprocessing
The MRE objective and the (C)CE constraints are invariant to payoff offset and positive scaling. Therefore we can assume that the payoffs have been standardized without loss of generality. Each player's payoff should have zero mean. Furthermore, it should be scaled such that the $L_m = \|\cdot\|_m$ norm of the payoff tensor equals $Z_m$, where $Z_m$ is a scale hyperparameter chosen such that the elements of the inputs have unit variance (a property that ensures neural networks train quickly with standard parameter initializations [18]). We ensure both these properties by including a preprocessing layer in NES. The preprocessed inputs ($G_p(a)$, $\hat\sigma(a)$, $\hat\epsilon_p$, $W(a)$) are then broadcast and concatenated together so that they result in an input of shape $[C, N, |\mathcal{A}_1|, \dots, |\mathcal{A}_N|]$, where the channel dimension $C = 4$ if all inputs are required.

Figure 1: Diagrams for three 2×2 normal-form games ((a) Traffic Lights, (b) Pure Coordination, (c) Prisoner's Dilemma), showing their (C)CE solution polytope on the joint simplex (in two-strategy games CEs and CCEs are equivalent). An MWME NES, trained by sampling over the space of payoffs and welfare targets, is used to approximate the MW(C)CE solution (×). An MRE NES, trained by sampling over the space of payoffs and joint targets, is used to approximate the ME(C)CE (+), and all pure joint target MRE(C)CEs. The networks have never been trained on these games.
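Returning to the standardization step above, a minimal sketch (illustrative; the paper implements this as a preprocessing layer inside NES):

```python
# Illustrative sketch: zero-mean each player's payoffs, then rescale so the L2 norm
# of each player's payoff tensor equals Z_2 = sqrt(|A|), giving unit-variance elements.
import numpy as np

def standardize_payoffs(G):
    """G: payoff tensor of shape [N, |A_1|, ..., |A_N|]."""
    n_p = G.shape[0]
    num_joints = np.prod(G.shape[1:])                        # |A|
    player_axes = tuple(range(1, G.ndim))
    G = G - G.mean(axis=player_axes, keepdims=True)          # zero mean per player
    norm = np.linalg.norm(G.reshape(n_p, -1), axis=1)        # per-player L2 norm
    scale = np.sqrt(num_joints) / np.maximum(norm, 1e-12)    # Z_2 = sqrt(|A|)
    return G * scale.reshape((n_p,) + (1,) * (G.ndim - 1))

G_std = standardize_payoffs(np.random.normal(size=(2, 3, 3)))
print(G_std.reshape(2, -1).mean(axis=1))                     # ~0 per player
print(np.linalg.norm(G_std.reshape(2, -1), axis=1))          # sqrt(9) = 3 per player
```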
4.2 Training Distribution
The literature favours sampling games from the uniform or normal distribution. This introduces two problems: (i) it biases the distribution of games solvable by the network, and (ii) it unnecessarily requires the network to learn offset and scale invariance in the payoffs. Recall that the space of equilibria is invariant to offset and positive scale transformations of the payoff. Zero-mean and $L_2$ norm scaling geometrically results in the surface of an $(|\mathcal{A}| - 1)$-ball centered on the origin (Section D.4). We propose using this invariant subspace for training. We choose the norm scaling constant, $Z_2 = \sqrt{|\mathcal{A}|}$, such that the elements of the payoffs maintain unit variance.
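A minimal sketch (illustrative) of sampling a batch of training payoffs from this invariant subspace: isotropic Gaussian samples are projected to zero mean per player and rescaled onto the sphere of $L_2$ radius $\sqrt{|\mathcal{A}|}$, which yields a uniform distribution over that sphere.

```python
# Illustrative sketch: sample training payoffs on the zero-mean, fixed-norm sphere.
import numpy as np

def sample_training_payoffs(batch_size, shape, rng):
    """shape: (N, |A_1|, ..., |A_N|). Returns payoffs of shape (batch_size,) + shape."""
    n_p, num_joints = shape[0], int(np.prod(shape[1:]))
    G = rng.normal(size=(batch_size, n_p, num_joints))
    G -= G.mean(axis=-1, keepdims=True)                       # zero mean per player
    G *= np.sqrt(num_joints) / np.linalg.norm(G, axis=-1, keepdims=True)
    return G.reshape((batch_size,) + shape)

rng = np.random.default_rng(0)
batch = sample_training_payoffs(batch_size=4, shape=(2, 3, 3), rng=rng)
print(batch.shape)  # (4, 2, 3, 3)
```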
4.3 Gradient Calculation
The gradient update is found by taking the derivative of the loss (Equation (7)) with respect to the dual variables, $\alpha$. Note that computing a gradient does not require knowing the optimal joint $\sigma(a)$, so the network can be trained in an unsupervised fashion, from randomly generated inputs, $G_p(a)$, $\hat\sigma(a)$, $\hat\epsilon_p$, and $W(a)$.
$$\frac{\partial L^{(C)CE}}{\partial \left\{\alpha^{CE}_p(a'_p, a''_p),\, \alpha^{CCE}_p(a'_p)\right\}} = \epsilon^{(C)CE}_p - \sum_{a \in \mathcal{A}} \left\{A^{CE}_p(a'_p, a''_p, a),\, A^{CCE}_p(a'_p, a)\right\} \sigma^{(C)CE}(a) \quad (11)$$
The dual variables, $\{\alpha^{CE}_p(a'_p, a''_p), \alpha^{CCE}_p(a'_p)\}$, are outputs of the neural network, with learned parameters $\theta$. Gradients for these parameters can be derived using the chain rule:

$$\frac{\partial L^{(C)CE}}{\partial \theta} = \frac{\partial L^{(C)CE}}{\partial \left\{\alpha^{CE}_p(a'_p, a''_p),\, \alpha^{CCE}_p(a'_p)\right\}} \cdot \frac{\partial \left\{\alpha^{CE}_p(a'_p, a''_p),\, \alpha^{CCE}_p(a'_p)\right\}}{\partial \theta}$$
Backprop efficiently calculates these gradients, and many powerful neural network optimizers
[57, 34, 15] and ML frameworks [1, 5, 49] can be leveraged to update the network parameters.
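For the CCE case, the analytic gradient in Equation (11) can be formed directly from the recovered $\sigma$ and $\epsilon$ (see the cce_primal_from_duals sketch above), with no labelled equilibria required; a minimal sketch with illustrative names:

```python
# Illustrative sketch: unsupervised gradient of the loss w.r.t. the CCE duals (Equation (11)).
import numpy as np

def cce_dual_gradient(G, sigma, eps):
    """Returns a list: grads[p][a_dev] = eps_p - E_sigma[A^CCE_p(a_dev, a)]."""
    n_p = G.shape[0]
    grads = []
    for p in range(n_p):
        base_payoff = np.sum(sigma * G[p])               # E_sigma[G_p(a)]
        marg = np.sum(sigma, axis=p)                     # sigma(a_{-p})
        g_p = np.empty(G.shape[1 + p])
        for a_dev in range(G.shape[1 + p]):
            sl = [slice(None)] * n_p
            sl[p] = a_dev
            dev_payoff = np.sum(marg * G[p][tuple(sl)])  # E_sigma[G_p(a'_p, a_{-p})]
            g_p[a_dev] = eps[p] - (dev_payoff - base_payoff)
        grads.append(g_p)
    return grads
```

Descending this gradient (with the duals kept nonnegative), or backpropagating it through the network outputs to the parameters $\theta$, then minimizes $L^{(C)CE}$ without ever constructing a (game, equilibrium) training pair.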