Game Theoretic Rating in N-player general-sum games with Equilibria

Luke Marris¹ ² Marc Lanctot¹ Ian Gemp¹ Shayegan Omidshafiei³ Stephen McAleer⁴ Jerome Connor¹ Karl Tuyls¹ Thore Graepel²
Abstract
Rating strategies in a game is an important area of research in game theory and artificial intelligence, and can be applied to any real-world competitive or cooperative setting. Traditionally, only transitive dependencies between strategies have been used to rate strategies (e.g. Elo); however, recent work has expanded ratings to utilize game theoretic solutions to better rate strategies in non-transitive games. This work generalizes these ideas and proposes novel algorithms suitable for N-player, general-sum rating of strategies in normal-form games according to the payoff rating system. This enables well-established solution concepts, such as equilibria, to be leveraged to efficiently rate strategies in games with complex strategic interactions, which arise in multiagent training and real-world interactions between many agents. We empirically validate our methods on real-world normal-form data (Premier League) and multiagent reinforcement learning agent evaluation.
1. Introduction
Traditionally, rating systems assume transitive dependen-
cies of strategies in a game (such as Elo (Elo, 1978) and
TrueSkill (Herbrich et al., 2007)). That is, there exists an
unambiguous ordering of all strategies according to their
relative strengths. This ignores all other interesting inter-
actions between strategies, including cycles where strategy
S beats P beats R beats S in the classic game of Rock, Pa-
per, Scissors (Table 2). Many interesting games have this
so-called “strategic” dimension (Czarnecki et al., 2020), or
“gamescapes” (Balduzzi et al., 2018), that cannot be captured
by pairwise transitivity constraints.
Game theoretic rating of strategies is an emerging area of
study which seeks to overcome some of these drawbacks.
*Equal contribution. ¹DeepMind ²University College London ³Google ⁴Carnegie Mellon University. Correspondence to: Luke Marris <marris@deepmind.com>.
These methods can be employed in normal-form games, or
in empirical games, constructed normal-form games where
strategies are policies competing in a multiagent interac-
tion (e.g. a simulation or a game) and the payoffs are ap-
proximate expected returns of the players employing these
policies (Wellman, 2006; Walsh et al., 2002; Tuyls et al.,
2020).
The Nash Average (NA) (Balduzzi et al., 2018) algorithm
proposed a way of rating strategies in two-player, zero-sum,
normal-form games. This approach is known as maximal
lottery (Kreweras, 1965; Fishburn, 1984) in social choice
theory, where it first arose, and is so fundamental it has
been rediscovered across many fields (Brandt, 2017). In
particular, NA proposed two applications of rating: agent-
vs-agent interactions and agent-vs-task interactions. NA
possesses a number of interesting properties: its ratings are
invariant to strategy duplication, and it captures interesting
non-transitive interactions between strategies. However, the
technique is difficult to apply outside of two-player, zero-sum domains due to computational tractability and equilibrium selection difficulties. More recent work, α-Rank (Omidshafiei et al., 2019), sought to remedy this by introducing a novel, computationally feasible solution concept based on the stationary distribution of a discrete-time evolutionary process. Its main advantages are its uniqueness and efficient computation in N-player and general-sum games.
This work expands game theoretic rating techniques to established equilibrium concepts: correlated equilibrium (CE) (Aumann, 1974) and coarse correlated equilibrium (CCE) (Moulin & Vial, 1978). In Section 2 we provide background
to rating algorithms, game theory, and equilibrium based
solution concepts. In particular, we describe CEs and CCEs
that are suitable in the N-player, general-sum setting. In
Section 3 we define a novel general rating definition: payoff
rating, which is equivalent to NA if the game is two-player
zero-sum. Payoff rating is the expected payoff under a joint
distribution, conditioned on taking a certain strategy. The
choice of joint distribution is what provides payoff ratings
with its interesting properties. In Section 4 we suggest joint
distributions to parameterize game theoretic rating algo-
rithms. In Section 5 we test these algorithms on instances of
N-player, general-sum games using real-world data. Finally,
Section 6 discusses the connections of this work to other areas of machine learning.
2. Preliminaries
First we cover background on game theory (Section 2.1),
rating and ranking (Section 2.2), equilibria (Section 2.3),
and game theoretic rating (Section 2.4).
2.1. Empirical Game Theory
A normal-form game is a single-step game where all players take an action simultaneously, and each receives a payoff. With $n$ players, we denote a joint strategy as the tuple $a = (a_1, \ldots, a_n) \in \mathcal{A}$, where $a_p \in \mathcal{A}_p$ and $\mathcal{A}_p$ is the strategy space available to player $p$. The payoff given to player $p$ from a joint strategy $a$ is $G_p(a)$. A player's objective is to maximize their payoff in the presence of other players, who are maximizing their own payoffs.
It is possible to construct normal-form game representations from observations of much larger systems. This process is known as empirical game theoretic analysis (EGTA) (Wellman, 2006; Walsh et al., 2002). The most common example in artificial intelligence is when studying a set of policies¹ in large extensive-form games (e.g., Chess or Go). Often the set of policies is too large to enumerate entirely, so we retain a subset of them and track their performance against one another, thereby constructing a normal-form game from the policies' expected returns within the environment (Lanctot et al., 2017). For example, a given season of the Premier League can be modeled as a normal-form game involving a set of 20 team policies, out of many possible football teams². EGTA has recently proved essential in scaling multiagent reinforcement learning (MARL) to human-level StarCraft (Vinyals et al., 2019).
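To make the construction concrete, the following is a minimal sketch (not from the paper) of assembling an empirical payoff matrix from hypothetical per-policy match results; the team names and point values are made up.

```python
import numpy as np

# Hypothetical match records between three policies, averaged over many
# simulated matches. Entry [i][j] is the average points policy i earned
# against policy j (3 = win, 1 = draw, 0 = loss), as in a football league.
policies = ["TeamA", "TeamB", "TeamC"]
avg_points = np.array([
    [0.0, 2.1, 1.4],
    [0.6, 0.0, 2.4],
    [1.2, 0.3, 0.0],
])

# The empirical normal-form game: a symmetric two-player game where each
# "player" picks one of the policies and receives its average points as payoff.
G1 = avg_points        # payoff to the row player
G2 = avg_points.T      # payoff to the column player
print(G1, G2, sep="\n")
```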
2.2. Rating and Ranking
Ranking is the problem of assigning a partial ordering to
a set. Rating is the more general problem of assigning a
scalar value to each element of a set, which then could be
used to describe a ranking. The simplest rating procedure
is to take the mean performance of a strategy against all
other strategies. Viewed through a game theoretic lens, this
is equivalent to assuming each opponent strategy is equally
likely to be encountered: the opponent player is playing a
uniform distribution. The key drawback of this approach
is that it is heavily influenced by the strategies available,
and that an opponent player, in practice, will favour their
strongest strategies. This argument is made thoroughly in
Balduzzi et al. (2018).
¹A mapping from information states to actions.
²The full permutation of all players and coaches in the world.
Perhaps the best known rating algorithm is Elo (Elo, 1978). It is used to rate players in many two-player, zero-sum games, most famously in Chess. It is also commonly used for evaluating artificial agents in two-player settings (Silver et al., 2016; Schrittwieser et al., 2019). The skill (Elo) of each competitor, $a_1$, is described using a single variable $r(a_1)$ which maps to a win probability against other competitors, $a_2$: $G_1(a_1, a_2) = \left(1 + 10^{-(r(a_1) - r(a_2))/400}\right)^{-1}$. This therefore defines a two-player, symmetric, constant-sum game where competitors are strategies, with payoff defined as $G_2 = 1 - G_1 = G_1^T$. It is only suitable for describing highly transitive games. Multi-dimensional Elo (Balduzzi et al., 2018) is a method that attempts to overcome the limitations of Elo by introducing additional variables for each competitor which describe cyclic structure in the game. This gives a more accurate approximation of the payoff; however, it does not provide a way of rating strategies on its own³. Decomposed Elo (Jaderberg et al., 2019) is a method that works for $m$ vs $m$ constant-sum, team games. It is capable of assigning a rating to each competitor in the team as well as a rating for the team. However, it is also only suitable for transitive games, where both team compositions and between-team interactions are transitive.
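For reference, the Elo model above can be evaluated directly; the sketch below uses made-up ratings and only illustrates the win-probability formula and the induced payoff matrix.

```python
import numpy as np

def elo_win_prob(r1, r2):
    """Probability that a competitor rated r1 beats one rated r2 under the Elo model."""
    return 1.0 / (1.0 + 10.0 ** (-(r1 - r2) / 400.0))

# Hypothetical ratings for three competitors.
ratings = np.array([1400.0, 1600.0, 1800.0])

# Payoff matrix G1 of the induced two-player, symmetric, constant-sum game.
G1 = elo_win_prob(ratings[:, None], ratings[None, :])
G2 = 1.0 - G1          # equals G1.T under the Elo model
print(np.round(G1, 3))
```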
2.3. Equilibria
The joint strategy distribution over the set of all joint strategies is denoted $\sigma(a) = \sigma(a_1, \ldots, a_n)$, where $\sum_{a \in \mathcal{A}} \sigma(a) = 1$ and $\sigma(a) \geq 0 \;\; \forall a \in \mathcal{A}$. Furthermore we define the marginal strategy distribution as $\sigma(a_p) = \sum_{a_{-p} \in \mathcal{A}_{-p}} \sigma(a_p, a_{-p})$, where $-p = \{1, \ldots, p-1, p+1, \ldots, n\}$ denotes all players other than $p$. Sometimes it is possible to factorize the joint strategy distribution into its marginals, $\sigma(a) = \prod_p \sigma(a_p)$. Finally, the conditional distribution $\sigma(a_{-p}|a_p) = \frac{\sigma(a)}{\sigma(a_p)}$ is defined if $\sigma(a_p) > 0$. Sometimes we may denote the joint in terms of one player's strategies versus all other players' strategies: $\sigma(a) \equiv \sigma(a_1, \ldots, a_n) \equiv \sigma(a_p, a_{-p})$.
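These quantities are simple to compute numerically; a minimal sketch with an arbitrary (randomly generated) three-player joint distribution follows.

```python
import numpy as np

rng = np.random.default_rng(0)

# A made-up joint distribution over 3 players with 2, 3 and 2 strategies each.
sigma = rng.random((2, 3, 2))
sigma /= sigma.sum()                       # sum_a sigma(a) = 1, sigma(a) >= 0

p = 1                                      # player index (0-based)
other_axes = tuple(i for i in range(sigma.ndim) if i != p)

# Marginal sigma(a_p): sum the joint over all other players' strategies.
marginal_p = sigma.sum(axis=other_axes)

# Conditional sigma(a_{-p} | a_p) = sigma(a) / sigma(a_p), defined when sigma(a_p) > 0.
shape = [1] * sigma.ndim
shape[p] = sigma.shape[p]
conditional = sigma / marginal_p.reshape(shape)

print(marginal_p, conditional.sum(axis=other_axes))  # the latter is all ones
```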
A popular class of solution concepts is equilibrium based: joint distributions, $\sigma(a)$, where, under certain definitions,
no player has incentive to deviate. The most well known
is Nash equilibrium (Nash, 1951) (NE), which is tractable,
interchangeable and unexploitable in two-player, zero-sum
games (Shoham & Leyton-Brown, 2009). NEs are always
factorizable joint distributions. A related solution concept is
correlated equilibrium (Aumann, 1974) (CE) which is more
suitable for N-player, general-sum settings where players
are allowed to coordinate strategies with each other if it is
mutually beneficial. Furthermore, CEs are more compatible
with the Bayesian perspective, and arise as a result of learn-
ing rules (Foster & Vohra, 1997; Cesa-Bianchi & Lugosi,
2006). The mechanism of implementing a CE is via a corre-
³It assigns vectors, not scalars, to strategies.
lation device which samples a joint strategy from a known
public distribution and recommends the sampled strategy
secretly to each player. A distribution is in correlated equi-
librium if no player is incentivised to unilaterally deviate
from the recommendation after receiving it. CEs that are fac-
torizable are also NEs. An additional solution concept, the
coarse correlated equilibrium (Moulin & Vial, 1978) (CCE),
requires players to commit to the recommendation before
it has been sampled. It is less computationally expensive
and permits even higher equilibrium payoffs. These sets are related to each other: NE ⊆ CE ⊆ CCE. The empirical average joint policy of no-regret learning algorithms in self-play is known to converge to the set of CCEs (Foster & Vohra, 1997; Hart & Mas-Colell, 2000).
All these equilibria have approximate forms which are parameterized by the approximation parameter $\epsilon$, which describes the maximum allowed incentive to deviate to a best response (across all players). There are two common methods of defining an approximate equilibrium: the standard approximate equilibrium (Shoham & Leyton-Brown, 2009) describes the bound on the incentive to deviate under the joint, and the well-supported (WS) approximate equilibrium (Goldberg & Papadimitriou, 2006) describes the bound on the incentive to deviate under the conditionals. When $\epsilon = 0$, these definitions become equivalent. The standard method has the property that any $\epsilon > \epsilon_{\min}$ will permit a full-support equilibrium (Section A), where $\epsilon_{\min} \leq 0$ is the minimum $\epsilon$ that permits a feasible solution in a game. Each player may have an individual tolerance to deviation, $\epsilon_p$. For the rest of this work we only consider the standard definition; similar derivations can be adopted for the well-supported definition.
CEs can be defined in terms of linear inequality constraints, defined in terms of the deviation gain of a player: $A^{\text{CE}}_p(a'_p, a_p, a_{-p}) = G_p(a'_p, a_{-p}) - G_p(a_p, a_{-p})$, for all $a'_p \neq a_p \in \mathcal{A}_p$, $a_{-p} \in \mathcal{A}_{-p}$, $\forall p$, where each constraint corresponds to a pair of strategies: $a_p$ deviating to $a'_p$.

$\epsilon$-CE: $\quad \sum_{a_{-p} \in \mathcal{A}_{-p}} \sigma(a_p, a_{-p}) \, A^{\text{CE}}_p(a'_p, a_p, a_{-p}) \leq \epsilon_p \quad (1)$

CCEs can be derived from the CE definition by summing the constraints over the strategies available to a player, $\sum_{a_p \in \mathcal{A}_p}(\cdot)$; therefore there is only a constraint for each possible deviation $a'_p \in \mathcal{A}_p$, $\forall p$, with a deviation gain of $A^{\text{CCE}}_p(a'_p, a) = G_p(a'_p, a_{-p}) - G_p(a)$.

$\epsilon$-CCE: $\quad \sum_{a \in \mathcal{A}} \sigma(a_p, a_{-p}) \, A^{\text{CCE}}_p(a'_p, a) \leq \epsilon_p \quad (2)$

NEs have similar definitions to CEs but have an extra constraint: the joint distribution factorizes, $\prod_p \sigma(a_p) = \sigma(a)$, resulting in nonlinear constraints⁴.

$\epsilon$-NE: $\quad \sum_{a_{-p} \in \mathcal{A}_{-p}} \prod_q \sigma(a_q) \, A^{\text{CE}}_p(a'_p, a_p, a_{-p}) \leq \epsilon_p \quad (3)$
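The CCE condition in Equation (2) is straightforward to check for a given joint distribution; the sketch below (with an illustrative Rock, Paper, Scissors example and a hypothetical helper name) evaluates the deviation gains for an N-player game stored as a list of payoff tensors.

```python
import numpy as np

def cce_deviation_gains(payoffs, sigma):
    """For each player p and deviation strategy a'_p, return
    sum_a sigma(a) * [G_p(a'_p, a_{-p}) - G_p(a)], the left side of Eq. (2)."""
    gains = []
    for p, G_p in enumerate(payoffs):
        expected = (sigma * G_p).sum()          # expected payoff under the joint
        sigma_minus_p = sigma.sum(axis=p)       # marginal over the other players
        player_gains = []
        for dev in range(G_p.shape[p]):
            dev_payoff = (sigma_minus_p * np.take(G_p, dev, axis=p)).sum()
            player_gains.append(dev_payoff - expected)
        gains.append(np.array(player_gains))
    return gains

# Rock, Paper, Scissors: the uniform joint is an exact CCE (all gains <= 0).
rps = np.array([[0, -1, 1], [1, 0, -1], [-1, 1, 0]], dtype=float)
payoffs = [rps, -rps]                           # two-player, zero-sum
sigma = np.full((3, 3), 1.0 / 9.0)
epsilon = 0.0
print(all((g <= epsilon + 1e-9).all() for g in cce_deviation_gains(payoffs, sigma)))
```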
When a distribution is in equilibrium, no player has incentive
to unilaterally deviate from it to achieve a better payoff.
There can, however, be many equilibria in a game; choosing
amongst these is known as the equilibrium selection problem
(Harsanyi & Selten, 1988).
For NEs it has been suggested to use a maximum entropy cri-
terion (MENE) (Balduzzi et al., 2018), which always exists
and is unique in two-player, constant-sum settings. Another
strategy is to regularize the NE of the game with Shannon
entropy resulting in the quantal response equilibrium (QRE)
(McKelvey & Palfrey, 1995). There exists a continuum of
QREs starting at the uniform distribution, finishing at the
limiting logit equilibrium (LLE), which is unique for almost
all games. Solvers (Gemp et al., 2021) can find LLEs, even
in scenarios with stochastic payoffs.
(C)CEs permit a convex polytope of valid solutions which are defined by their linear inequality constraints. Convex functions can be used to uniquely select from this set. Multiple objectives have been proposed to select from the set of valid solutions including maximum entropy, $-\sum_a \sigma(a) \ln(\sigma(a))$ (ME(C)CE) (Ortiz et al., 2007), maximum Gini, $1 - \sum_a \sigma(a)^2$ (MG(C)CE) (Marris et al., 2021a), and maximum welfare⁵, $\sum_p \sum_a \sigma(a) G_p(a)$ (MW(C)CE).
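To make the selection step concrete, the following is a naive sketch of maximum-entropy selection over the CCE polytope for a tiny two-player game, using a general-purpose solver; it is not the scalable method of the cited works, and the game, tolerance, and function names are illustrative.

```python
import numpy as np
from scipy.optimize import minimize
from scipy.special import xlogy

# Tiny two-player game (here RPS); payoffs G[p] are indexed by (a_1, a_2).
rps = np.array([[0, -1, 1], [1, 0, -1], [-1, 1, 0]], dtype=float)
G = [rps, -rps]
shape = rps.shape
epsilon = 0.0                                   # CCE approximation parameter

def neg_entropy(x):
    return xlogy(x, x).sum()                    # sum_a sigma(a) ln sigma(a)

def cce_slack(x):
    """epsilon minus the deviation gain for every (player, deviation) pair; must be >= 0."""
    sigma = x.reshape(shape)
    slacks = []
    for p in range(2):
        expected = (sigma * G[p]).sum()
        sigma_minus_p = sigma.sum(axis=p)
        for dev in range(shape[p]):
            gain = (sigma_minus_p * np.take(G[p], dev, axis=p)).sum() - expected
            slacks.append(epsilon - gain)
    return np.array(slacks)

x0 = np.full(rps.size, 1.0 / rps.size)
res = minimize(
    neg_entropy, x0, method="SLSQP",
    bounds=[(0.0, 1.0)] * rps.size,
    constraints=[{"type": "eq", "fun": lambda x: x.sum() - 1.0},
                 {"type": "ineq", "fun": cce_slack}],
)
print(res.x.reshape(shape).round(3))            # MECCE of RPS: the uniform joint
```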
2.4. Game Theoretic Rating
It is natural to formulate rating problems as rating strategies
in a normal-form game. For example, a football league is a
symmetric, two-player game where strategies are teams and
payoffs are win probabilities or points. Teams can therefore
be ranked by analysing this empirical game. Single player
multi-task reinforcement learning can be formulated as a
game between an agent player and an environment player,
with strategies describing different policies and different
tasks respectively (Balduzzi et al., 2018). We call the prob-
lem of assigning a rating to each player’s strategies the game
theoretic rating problem. While any joint distribution can
be used to calculate a rating (Section 3), game theoretic
distributions, such as ones that are in equilibrium have a
number of advantages.
Nash Average (Balduzzi et al., 2018) leverages the proper-
ties of NE to rate strategies in two-player, zero-sum games
according to the payoff rating definition (Section 3.1). This
approach can also be used to rate relative strengths between
populations of strategies in sub-games: relative population
performance (RPP) (Balduzzi et al., 2019; Czarnecki et al.,
⁴This is why NEs are harder to compute than (C)CEs.
⁵Not convex and hence not always unique.
2020). It also has an interesting property that it is invariant
to strategy repeats.
Using NEs to rate general-sum games has not been explored; however, LLEs (McKelvey & Palfrey, 1995) could be leveraged to select an equilibrium that is unique in almost all games. However, LLEs are difficult to compute, and sophisticated solvers such as gambit-logit (Turocy, 2005; McKelvey et al., 2016) do not scale well to large N-player games (Gemp et al., 2021). In contrast, (C)CEs, which have not been considered as rating algorithms until now, a) have a convex
optimization formulation, b) have unique principled equi-
librium selection, c) can capture coordination in games, d)
are established and understood, and e) have a number of
interesting rating properties.
3. Game Theoretic Rating
We introduce a novel generalized rating for N-player, general-sum games: the payoff rating. The definition functions for arbitrary joint strategy distributions; however, we propose using equilibrium distributions to ensure the rating is game theoretic.
3.1. Payoff Rating
The rating is defined in terms of the payoff, $G_p$, and the joint distribution players are assumed to be playing under, $\sigma$. We define the payoff rating:

$r^\sigma_p(a_p) = \frac{\partial}{\partial \sigma(a_p)} \sum_{a \in \mathcal{A}} G_p(a) \, \sigma(a) \quad (4)$
$\qquad\quad\; = \frac{\partial}{\partial \sigma(a_p)} \sum_{a_{-p} \in \mathcal{A}_{-p}} G_p(a_p, a_{-p}) \, \sigma(a_{-p}|a_p) \, \sigma(a_p)$
$\qquad\quad\; = \sum_{a_{-p} \in \mathcal{A}_{-p}} G_p(a_p, a_{-p}) \, \sigma(a_{-p}|a_p) \quad (5)$
Theorem 3.1 (Nash Average Equivalence). When using an MENE for the joint strategy distribution in two-player, zero-sum games, payoff rating is equivalent to Nash Average (NA).

Proof. For NE, a player's strategies are independent of the other player's strategies, $\sigma(a_2|a_1) = \sigma(a_2)$. Therefore $r^\sigma_1(a_1) = \sum_{a_2 \in \mathcal{A}_2} G_1(a_1, a_2)\sigma(a_2)$ and $r^\sigma_2(a_2) = \sum_{a_1 \in \mathcal{A}_1} G_2(a_1, a_2)\sigma(a_1)$, which is the definition of NA.
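This equivalence is easy to verify numerically; a small sketch on Rock, Paper, Scissors (where the maximum entropy NE is uniform for both players) follows.

```python
import numpy as np

# RPS payoffs for player 1; player 2 receives the negation.
G1 = np.array([[0, -1, 1], [1, 0, -1], [-1, 1, 0]], dtype=float)

# The (maximum entropy) Nash equilibrium of RPS: both players uniform.
nash1 = np.full(3, 1.0 / 3.0)
nash2 = np.full(3, 1.0 / 3.0)
sigma = np.outer(nash1, nash2)             # factorized joint, sigma(a) = sigma(a_1) sigma(a_2)

# Payoff rating (Eq. 5) under the factorized joint ...
rating1 = (G1 * sigma).sum(axis=1) / sigma.sum(axis=1)
# ... equals the Nash Average rating, sum_{a_2} G_1(a_1, a_2) sigma(a_2).
nash_avg1 = G1 @ nash2
print(np.allclose(rating1, nash_avg1))     # True
```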
This definition has two interpretations: a) the change in the player's payoff under a joint strategy distribution, $\sum_{a \in \mathcal{A}} G_p(a)\sigma(a)$, with respect to the probability of selecting that strategy, $\sigma(a_p)$; and b) the expected strategy payoff under a joint strategy distribution conditioned on that strategy. When defined, the payoff rating is bounded between the minimum and maximum values of a strategy's payoff, $\min_{a_{-p}} G_p(a_p, a_{-p}) \leq r^\sigma_p(a_p) \leq \max_{a_{-p}} G_p(a_p, a_{-p})$.
Note the mathematical edge case that strategies with zero marginal probability, $\sigma(a_p) = 0$, have undefined conditional probability, $\sigma(a_{-p}|a_p)$, and therefore have undefined payoff rating. Consider a symmetric two-player zero-sum transitive game where strategy $S$ dominates $A$, and $A$ dominates $W$. Many game theoretic distributions (including NE, CE and CCE) will place all probability mass on $(S, S)$, leaving strategies $A$ and $W$ with undefined ratings. This may be unsatisfying for two reasons: firstly, there could be a further ordering between $A$ and $W$, such that $S > A > W$ is reflected in the ranking; and secondly, all strategies should receive a rating value. It could be argued that if a strategy dominates all others then an ordering over the rest is redundant. However, there are ways to achieve such an ordering: a) with approximate equilibria (Section 4), certain joint strategies (such as the $\epsilon_{\min}^{+}$-MECCE) are guaranteed to place at least some mass on all strategies (Section A), b) assign $r^\sigma_p(a_p) = \min_a G_p(a)$ for undefined values, and c) rate using a sub-game with dominating strategies pruned.
3.2. Joint Strategy Distributions
We now turn our discussion to the joint distributions we
measure such a rating under. The most ubiquitous approach
is the uniform distribution which is equivalent to calculating
the mean payoff across all opponent strategies. As discussed
previously, this approach does not consider any interesting
dynamics of the game. It is, however, the distribution with
maximum entropy and therefore makes the fewest assump-
tions (Jaynes, 1957).
In order to be more game theoretic, it is beneficial to use distributions that are in certain types of equilibrium. Firstly, consider the definitions of several equilibria (Equations (1)-(3)). These equations are linear⁶ inequality constraints between strategies, so already closely resemble a partial ordering.
Rankings are nothing more than partial orderings between
elements. Secondly, values of the payoff ratings depend
entirely on the payoffs under distributions that all players
are not incentivized to deviate from. Therefore this set of
joint distributions are representative of ones which rational
agents may employ in practice. In contrast, the uniform
distribution is rarely within an equilibrium set. Therefore,
we argue, equilibrium distributions are a much more natu-
ral approach. Further mathematical justification is given in
Section B.
It is possible to mix the opinionated properties of an equilibrium with the zero-assumption properties of the uniform: there exists a principled continuum between the uniform distribution and an equilibrium distribution (Marris et al., 2021a) to achieve this balance. The uniform distribution is recovered when using a large enough approximation parameter, $\epsilon_p \geq \epsilon^{\text{uni}}_p$. The value of $\epsilon^{\text{uni}}_p$ depends on the solution concept, and can be determined directly from a payoff (Section A).
⁶In joint distribution space.
3.3. Properties of Equilibria Ratings
Naively, one may want a rating method to differentiate the
strategies it is rating. Game theoretic rating does the oppo-
site: it groups strategies into similar ratings that should not
be differentiated, such as strategies that are in a cycle with
one another (Tables 1a, and 1b). We call this phenomenon
the grouping property. Other properties, such as strategic
dominance resulting in dominated ratings, and consistent
ratings over repeated strategies or between players in a sym-
metric game can also be achieved when using the maximum
entropy criterion (Section B).
4. Rating Algorithms
A generalized payoff rating algorithm (Algorithm 1) is there-
fore parameterized over an equilibrium concept, and an
equilibrium selection criterion. This section makes some
recommendations on suitable parameterizations.
Algorithm 1 Generalized Payoff Rating
1: $\sigma(a) \leftarrow \textsc{ConceptAndSelection}(G(a), \epsilon)$
2: for $p \leftarrow 1 \ldots n$ do
3: $\quad r^\sigma_p(a_p) \leftarrow \sum_{a_{-p} \in \mathcal{A}_{-p}} G_p(a_p, a_{-p}) \, \sigma(a_{-p}|a_p)$
4: return $(r^\sigma_1(a_1), \ldots, r^\sigma_n(a_n))$
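A direct translation of the rating step of Algorithm 1 into numpy might look as follows; this sketch assumes the joint distribution sigma has already been produced by some concept-and-selection routine, and the example game and names are illustrative.

```python
import numpy as np

def payoff_ratings(payoffs, sigma):
    """Payoff rating r_p(a_p) = sum_{a_{-p}} G_p(a_p, a_{-p}) sigma(a_{-p} | a_p)
    for every player p, given payoff tensors payoffs[p] and a joint distribution
    sigma produced by some equilibrium concept and selection criterion."""
    n = sigma.ndim
    ratings = []
    for p in range(n):
        other_axes = tuple(i for i in range(n) if i != p)
        marginal_p = sigma.sum(axis=other_axes)          # sigma(a_p)
        numerator = (payoffs[p] * sigma).sum(axis=other_axes)
        with np.errstate(invalid="ignore", divide="ignore"):
            ratings.append(numerator / marginal_p)       # undefined where sigma(a_p) = 0
    return ratings

# Example: a biased RPS-like game rated under the uniform joint distribution.
G1 = np.array([[0.0, -1.0, 2.0], [1.0, 0.0, -1.0], [-2.0, 1.0, 0.0]])
payoffs = [G1, -G1]
sigma = np.full((3, 3), 1.0 / 9.0)
print([r.round(3) for r in payoff_ratings(payoffs, sigma)])
```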
4.1. $\epsilon_{\min}^{+}$-MECCE Payoff Rating
We recommend using Coarse Correlated Equilibrium (CCE) as the joint strategy distribution and maximum entropy (ME) as the equilibrium selection function. We consider the solution as $\epsilon \to \epsilon_{\min}$ from above (or equivalently, with a sufficiently small increment, $\epsilon = \epsilon_{\min}^{+}$), where $\epsilon_{\min} \leq 0$ is the minimum approximation parameter that permits a feasible solution (Section A) (Marris et al., 2021b). We call the resulting rating the $\epsilon_{\min}^{+}$-MECCE Payoff Rating.
Using CCEs as the solution concept has a number of advantages: a) full joint distributions allow cooperative as well as competitive games to be rated, whereas factorizable distributions such as NE struggle with cooperative components of a game; b) CCEs are more tractable to compute than CEs and NEs; c) full-support CCEs only require a single variable per strategy to define⁷; d) they are amenable to equilibrium selection because they permit a convex polytope of solutions; e) under a CCE, no player has an incentive to unilaterally deviate from the joint (possibly correlated) distribution to any of their own strategies, since it would not result in a higher payoff; and f) the empirical joint strategy of no-regret algorithms in self-play converges to a CCE.
⁷With the payoff tensor.
In combination with CCEs, ME with any $\epsilon > \epsilon_{\min}$ (Section A) spreads at least some mass over all joint strategies (“full support” (Ortiz et al., 2007)), meaning that the conditional distribution, and hence the payoff rating, is always well defined. This equilibrium selection method is also invariant under affine transforms (Marris et al., 2021a) of the payoff, scales well to large numbers of players and strategies, and is principled in that it makes minimal assumptions about the distribution (Jaynes, 1957). Empirically, it groups strategies within strategic cycles with each other. Using a solution near $\epsilon_{\min}$ allows a strong, high-value equilibrium to be selected, which is particularly important for coordination games.
4.2. $\epsilon/\epsilon^{\text{uni}}$-MECCE Payoff Rating
A drawback of using $\epsilon = \epsilon_{\min}^{+}$ is that sometimes (usually when strategies are strictly dominated by others) the distribution needs to be computed to a very high precision, otherwise numerical issues will complicate the calculation of the conditional distributions.
In order to mitigate this problem, we can use an approximate equilibrium distribution which spreads more mass. It is advantageous to normalize the approximation parameter (Marris et al., 2021a) by $\epsilon^{\text{uni}}$, where $\epsilon^{\text{uni}}$ is the minimum $\epsilon$ that permits the uniform distribution in the feasible set. When $\epsilon/\epsilon^{\text{uni}} = 1$ the uniform distribution is selected by ME; when $\epsilon/\epsilon^{\text{uni}} = 0$ the MECCE solution is recovered. For some games, it is possible to set $\epsilon/\epsilon^{\text{uni}} < 0$ to produce ratings with very robust distributions. This is similar in spirit to the continuum of QREs (McKelvey & Palfrey, 1995). Figure 7 shows how ratings change with $\epsilon/\epsilon^{\text{uni}}$ for a two-player, zero-sum game.
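As a concrete illustration (a sketch under the CCE constraints, collapsing per-player tolerances to a single scalar), $\epsilon^{\text{uni}}$ can be obtained by evaluating the largest deviation gain of the uniform joint distribution.

```python
import numpy as np

def epsilon_uni_cce(payoffs):
    """Smallest epsilon for which the uniform joint satisfies the CCE constraints,
    i.e. the largest CCE deviation gain under the uniform distribution."""
    shape = payoffs[0].shape
    sigma = np.full(shape, 1.0 / np.prod(shape))
    worst = -np.inf
    for p, G_p in enumerate(payoffs):
        expected = (sigma * G_p).sum()
        sigma_minus_p = sigma.sum(axis=p)
        for dev in range(shape[p]):
            gain = (sigma_minus_p * np.take(G_p, dev, axis=p)).sum() - expected
            worst = max(worst, gain)
    return worst

G1 = np.array([[0.0, -1.0, 1.0], [1.0, 0.0, -1.0], [-1.0, 1.0, 0.0]])
print(epsilon_uni_cce([G1, -G1]))   # 0.0 for RPS: the uniform joint is already a CCE
```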
5. Experiments
In order to build intuition and demonstrate the flexibility
of the rating algorithms presented, this section shows rat-
ings for a number of standard games and games built from real-world data. We compare against uniform and α-Rank rating methods.
Section F contains further experiments.
5.1. Standard Normal Form Games
First let us consider the payoffs, equilibria, and payoff ratings of some two-player normal-form games (Table 1). RPS
has three strategies in a cycle, and therefore equilibrium
ratings dictate that we cannot differentiate between their
strengths. This is true even if the cycles are biased (Ta-
ble 1a) for the MECCE rating. In this case the probability