Oracles & Followers: Stackelberg Equilibria in Deep
Multi-Agent Reinforcement Learning
Matthias Gerstgrasser 1,2 and David C. Parkes 1,3
1John A. Paulson School of Engineering and Applied Science, Harvard University,
Cambridge, MA, USA
2Computer Science Department, Stanford University, Stanford, CA, USA
3Deepmind, London, UK
matthias@seas.harvard.edu, parkes@seas.harvard.edu
June 5, 2023
Abstract
Stackelberg equilibria arise naturally in a range of popular learning problems, such as in
security games or indirect mechanism design, and have received increasing attention in the
reinforcement learning literature. We present a general framework for implementing Stackelberg
equilibria search as a multi-agent RL problem, allowing a wide range of algorithmic design choices.
We discuss how previous approaches can be seen as specific instantiations of this framework.
As a key insight, we note that the design space allows for approaches not previously seen in the
literature, for instance by leveraging multitask and meta-RL techniques for follower convergence.
We propose one such approach using contextual policies, and evaluate it experimentally on both
standard and novel benchmark domains, showing greatly improved sample efficiency compared
to previous approaches. Finally, we explore the effect of adopting algorithm designs outside the
borders of our framework.
1 Introduction
Stackelberg equilibria are an important concept in economics, and in recent years have received
increasing attention in computer science and specifically in the multiagent learning community. They
model an asymmetric setting: a leader who commits to a strategy, and one or more followers who
respond. The leader aims to maximize their reward, knowing that followers in turn will best-respond
to the leader’s choice of strategy. These equilibria appear in a wide range of settings. In security
games, a defender wishes to choose an optimal strategy considering attackers will adapt to it (An
et al., 2017; Sinha et al., 2018). In mechanism design, a designer wants to allocate resources in an
efficient manner, knowing that participants may strategize (Nisan & Ronen, 1999; Swamy, 2007;
Brero et al., 2021a). More broadly, many multi-agent system design problems can be viewed as
Stackelberg equilibrium problems: we as the designer take on the role of the Stackelberg leader,
wishing to design a system that is robust to agent behavior.
In this paper, we are particularly interested in Stackelberg equilibria in sequential decision
making settings, i.e., stochastic Markov games, and using multi-agent reinforcement learning to
learn these equilibria. We make two key contributions:
1. We introduce a new theoretical framework for framing Stackelberg equilibria as a multi-agent reinforcement learning problem.

   (a) We give a theorem (Theorem 1) that unifies and generalizes several prior algorithmic design choices in the literature.

   (b) Complementing this, we show that algorithm designs outside our framework can provably fail (Theorem 2). In particular, we give to our knowledge the first demonstration that reinforcement learning (RL) against best-responding opponents can provably diverge.

2. Inspired by Theorem 1, we introduce a novel approach to accelerating follower best-response convergence, borrowing ideas from multitask and meta-RL, including an experimental evaluation on existing and new benchmark domains.
Our theoretical framework (Section 3) allows a black-box reduction from learning Stackelberg
equilibria into separate leader and follower learning problems. This theory encompasses and
generalizes several prior approaches from the literature, in particular Brero et al. (2021a), and
gives a large design space beyond what has been explored previously. In particular, our broader
framework generalizes beyond learning Stackelberg equilibria along with follower learning dynamics
to general, query-based follower algorithms. The theory is complemented by necessary conditions,
with a demonstration of impossibility for successful Stackelberg learning through RL by the leader
when used together with followers that immediately best respond to the leader policy (Theorem 2).
Our practical meta-RL approach applies contextual policies, a common tool in multitask and meta-RL (Wang et al., 2016), to the follower learning problem. This allows followers to generalize and quickly
adapt to leader policies, as demonstrated on existing benchmark domains, with greatly improved
speed of convergence compared to previous approaches. Beyond this, we use this meta-RL approach
to scale up Stackelberg learning beyond what has previously been shown, to a state-of-the-art RL
benchmark domain built on Atari 2600.
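To give a flavor of the contextual-policy idea before its formal treatment in Section 4, the sketch below conditions a single follower network on a context vector meant to summarize the current leader policy, so that one follower can adapt across many leaders. The PyTorch layers, dimensions, and the way the context is obtained (e.g., from the leader's actions in a few probe states) are illustrative assumptions, not the paper's implementation.

```python
# Minimal sketch of a contextual follower policy: the follower conditions on a
# context vector intended to summarize the current leader policy, so a single
# network can best-respond to many different leaders. Layer sizes and how the
# context is constructed are illustrative assumptions, not the paper's design.
import torch
import torch.nn as nn


class ContextualFollowerPolicy(nn.Module):
    def __init__(self, obs_dim: int, context_dim: int, n_actions: int, hidden: int = 64):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(obs_dim + context_dim, hidden),
            nn.ReLU(),
            nn.Linear(hidden, n_actions),
        )

    def forward(self, obs: torch.Tensor, leader_context: torch.Tensor) -> torch.Tensor:
        # Concatenate the follower's observation with the leader context and
        # return action logits.
        return self.net(torch.cat([obs, leader_context], dim=-1))


# Example usage: the context might, for instance, be the leader's actions in a
# few fixed probe observations, obtained purely through query access.
policy = ContextualFollowerPolicy(obs_dim=8, context_dim=4, n_actions=3)
logits = policy(torch.zeros(1, 8), torch.zeros(1, 4))
action = torch.distributions.Categorical(logits=logits).sample()
```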
In the remainder of the paper, we will introduce Stackelberg equilibria and Markov games in
Section 2. In Section 3, we motivate and define our framework for learning Stackelberg equilibria
using multi-agent RL, and discuss its scope and limitations. We show sufficient conditions in Lemma 1
and Theorem 1. In Theorem 2 we show that RL against best-responding opponents can fail, which
implies that the construction in Theorem 1 is also necessary for RL leaders (Appendix C.1 discusses
a case of non-RL leaders that does not require Theorem 1). In Appendix B and Appendix C we
discuss further ablations and edge cases of our framework. We define our novel Meta-RL approach
in Section 4, and empirically evaluate it in Section 4.1 on existing and new benchmark domains.
1.1 Prior Work
Learning Stackelberg Equilibria. Most prior work on Stackelberg equilibria focuses on single-
shot settings such as normal-form games, a significantly simpler setting than Markov games. Some of
this work studies solving an optimization problem to find a Stackelberg equilibrium, given an explicit description of the problem (Paruchuri et al., 2008; Xu et al., 2014; Blum et al., 2014; Li et al., 2022). Among the first works to learn a Stackelberg equilibrium was Letchford et al. (2009), who focus
on single-shot Bayesian games. Peng et al. (2019) also give results for sample access to the payoffs
of matrix games. Wang et al. (2022) give an approach that differentiates through optimality (KKT)
conditions, again for normal-form games. Bai et al. (2021) give lower and upper bounds on learning
Stackelberg equilibria in general-sum games, including so-called “bandit RL” games that have one
step for the leader and sequential decision-making for the followers. Few works consider Markov
games: Zhong et al. (2021) give algorithms that find Stackelberg equilibria in Markov games, but
assume myopic followers, a significant limitation compared to the general case. Brero et al. (2021a,b, 2022) use an inner-outer-loop approach, which they call the Stackelberg POMDP, and which can be seen as a special case of our framework.
Mechanism Design. One of the first works specifically discussing Stackelberg equilibria in a
learning context is Swamy (2007), who designs interventions in traffic patterns. More recently, several strands of work have focused on using multi-agent RL for economic design, often framing this as a bi-level or inner-outer-loop optimization problem. Zheng et al. (2022) use a bi-level RL approach to
design optimal tax policies, and Yang et al. (2022) use meta-gradients in a specific incentive design
setting. Shu & Tian (2018) and Shi et al. (2019) learn leader policies in a type of incentive-shaping
setting, using a form of modeling other agents coupled with rule-based followers. Balaguer et al. (2022) use an inner-loop/outer-loop gradient-descent approach for mechanism design on iterated matrix games (which we also use as an experimental testbed). They mainly focus on the case where both the environment transitions and the follower learning behavior are differentiable, and otherwise fall back to evolutionary strategies for the leader.
Opponent Shaping. Also broadly related is a line of work on opponent shaping, such as M-FOS (Lu et al., 2022) and LOLA (Foerster et al., 2017). While these works also learn while taking
into account other agents’ responses, they intentionally only let the other agents learn “a little bit”
between the opponent-shaping agent’s learning steps, as opposed to letting them learn until they
best-respond. This reflects a difference in goals: Opponent-shaping aims to predict and exploit other
agents’ learning behaviors, whereas Stackelberg work aims to learn strategies that are robust even
when other agents are able to learn and adapt as much as they like.
2 Preliminaries
Markov games. We consider partially observable stochastic Markov games, which are a multi-
agent generalization of a partially observable Markov Decision Process (POMDP).
Definition 1 (Markov Game). A Markov Game, $M$, with $n$ agents is a tuple $(S, A, T, r, \Omega, O, \gamma)$, consisting of a state space $S$, an action space $A = (A_1, \ldots, A_n)$, a (stochastic) transition function $T: S \times A \to S$, a (stochastic) reward function $r: S \times A \to \mathbb{R}^n$, an observation space $\Omega = (\Omega_1, \ldots, \Omega_n)$, a (stochastic) observation function $O: S \times A \to \Omega$, and a discount factor $\gamma$.
At each step $t$ of the game, every agent $i$ chooses an action $a_{i,t}$ from their action space $A_i$, the game state evolves according to the joint action $(a_{1,t}, \ldots, a_{n,t})$ and the transition function $T$, and agents receive observations and rewards according to $O$ and $r$. An agent's behavior in the game is characterized by its policy $\pi_i: o_i \mapsto a_i$, which maps observations to actions.¹ Each agent in a Markov Game individually seeks to maximize its own (discounted) total return $\sum_t \gamma^t r_i(s_t, a_{i,t}, a_{-i,t})$.
This gives rise to the usual definitions of Nash equilibria (NE), correlated equilibria (CE), and coarse correlated equilibria (CCE), which we do not repeat in full here, as well as their Bayesian counterparts. Note that strategies in Markov games and in each of these equilibrium definitions are policies, not actions: a pair of policies $\pi_1, \pi_2$ in a two-player Markov game is a Nash equilibrium if neither agent can increase their expected total reward by unilaterally deviating to a different policy $\pi_i'$.
Stackelberg Equilibria. Unlike the above equilibrium concepts, a Stackelberg equilibrium is
not symmetric: There is a special player, the leader, who commits to their strategy first; the other
player (the follower) then chooses their best strategy given the leader’s choice of strategy. This
makes the leader potentially more powerful.
Example 1. In a game often called the “battle of the sexes,” you and I wish to have dinner together,
but you prefer restaurant A (deriving happiness 2, but I only get happiness 1), and I prefer restaurant
B (I get happiness 2, you get happiness 1)—but we would both rather eat together at our less-preferred
venue, than to eat separately (we both get happiness 0). Table 1 shows the payoff matrix of this
game. There are two pure Nash equilibria in this game: We both go to restaurant A, or we both
go to restaurant B. But there is only a single Stackelberg equilibrium (per leader): If you commit
to going to restaurant A, then my only best response is to also go to restaurant A. In doing so I
receive happiness 1, whereas my only alternative would be to eat alone at restaurant B for happiness
0. Notice that this hinges on the leader strictly committing to their choice of restaurant.
Table 1: "Battle of the Sexes" game.

        A      B
   A   2,1    0,0
   B   0,0    1,2
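As a sanity check of Example 1, the toy snippet below enumerates pure commitments for the committing player and computes the other player's best response; the payoff tuples follow the example's description (the committing player, "you", gets 2 at restaurant A). It is a worked illustration only, not an algorithm from the paper.

```python
# Worked illustration of Example 1: enumerate pure commitments for the committing
# player in the "Battle of the Sexes" and compute the other player's best response.
# Payoff tuples are (committing player, responder), taken from the example's text.
payoffs = {
    ("A", "A"): (2, 1), ("A", "B"): (0, 0),
    ("B", "A"): (0, 0), ("B", "B"): (1, 2),
}
actions = ["A", "B"]

best_value, best_profile = None, None
for leader_action in actions:
    # The responder best-responds to the committed action.
    follower_action = max(actions, key=lambda f: payoffs[(leader_action, f)][1])
    value = payoffs[(leader_action, follower_action)][0]
    if best_value is None or value > best_value:
        best_value, best_profile = value, (leader_action, follower_action)

print(best_profile, best_value)  # ('A', 'A') 2: committing to A yields the committer payoff 2
```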
This Stackelberg concept also extends to Markov games: here a leader agent $L$ decides on their policy (i.e., strategy), and the remaining (follower) agents—knowing the leader's choice of policy—best-respond. The leader seeks to maximize their own reward, considering that followers will best-respond. For instance, in an Iterated Prisoners' Dilemma (Robinson & Goforth, 2005), a leader might commit to a Tit-for-Tat strategy, in turn leading the follower to cooperate.
Typically, a Stackelberg equilibrium is formally defined using a max-min-
style condition: The leader maximizes its own reward knowing that the follower
best-responds, i.e., maximizes its own reward, with the leader-follower dynamic giving the order
of the two nested max-operators. This in turn suggests a nested outer-inner loop (reinforcement)
learning approach, where the follower trains until convergence every time the leader updates its policy.
As a key innovation in this work, we instead use a statement of the follower best-response through an
oracle abstraction. An oracle definition has been used before in order to extend Stackelberg equilibria
to multiple followers Nakamura (2015); Zhang et al. (2016); Liu (1998); Solis et al. (2016); Sinha
et al. (2014); Wang et al. (2022); Brero et al. (2021a). In contrast, we use the oracle abstraction to
greatly simplify the statement and proof of our main theorem, while simultaneously generalizing it
beyond prior approaches. In turn, this allows us to develop a novel Meta-RL approach in the second
part of the paper.
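For intuition, the pseudocode sketch below shows the generic "outer loop over the leader, inner-loop follower oracle" pattern that the oracle abstraction captures; all names (train_followers_to_equilibrium, env.rollout, leader.update) are placeholders, and the conditions under which such a scheme succeeds or fails are exactly what Theorems 1 and 2 address.

```python
# Pseudocode sketch of the generic outer/inner-loop pattern that the oracle
# abstraction captures: whenever the leader policy changes, a follower oracle E
# produces (an approximation of) the induced follower equilibrium E(F_{pi_L}).
# All names below are placeholders, not the paper's API.

def train_followers_to_equilibrium(env, leader_policy):
    """Follower oracle E: e.g., run follower RL against the fixed leader policy
    until the followers (approximately) best-respond."""
    ...


def stackelberg_outer_loop(env, leader, num_iterations):
    for _ in range(num_iterations):
        # Inner loop: the follower oracle needs only query access to leader.policy.
        followers = train_followers_to_equilibrium(env, leader.policy)
        # Outer loop: the leader improves its policy against the best-responding
        # followers, e.g., via a policy-gradient step on the leader's return.
        trajectories = env.rollout(leader.policy, followers)
        leader.update(trajectories)
    return leader
```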
With multiple followers, any choice of leader strategy, $\pi_L$, induces a Markov game, $F_{\pi_L}$, between the followers, which could feature multiple equilibria as well as equilibria of different types, such as Nash, correlated, and coarse correlated equilibria, each giving rise to a corresponding Stackelberg-Nash, Stackelberg-CE, and Stackelberg-CCE concept. This motivates an oracle abstraction. For any choice of leader strategy, we denote as $E(F_{\pi_L})$ a follower equilibrium (or a probability distribution over equilibria), where we refer to $E$ as an oracle. This oracle will later be realized as an algorithm. See also Wang et al. (2022).

¹To keep notation concise we discuss here the memory-less case, but all our results generalize to a stateful leader policy in a straightforward manner, as we discuss in Appendix B.
Definition 2 (Stackelberg equilibrium). Given a Markov Game $M$ and a follower best-response oracle $E$, a leader strategy $\pi_L$ together with a tuple of follower strategies $\pi_F$ is a Stackelberg equilibrium if and only if $\pi_L$ maximizes the leader's expected reward under the condition that follower strategies are drawn from $E(F_{\pi_L})$:
$$\pi_L \in \arg\max_{\pi_L} \; \mathbb{E}_{\pi_F \sim E(F_{\pi_L})}\Big[\sum_t \mathbb{E}\big[r_L(s_t, a_{L,t}, a_{F,t})\big]\Big],$$
where the second expectation is drawing actions and state transitions from their respective policies $\pi_L, \pi_F$ and transition function $T$, and the reward function is $r$, all as in Definition 1. If the follower oracle $E$ gives a Nash equilibrium in the induced game $F_{\pi_L}$, we call this a Stackelberg-Nash equilibrium, and similarly for CE and CCE.
In the remainder of the paper, when we say "oracle" we mean an algorithm that computes or learns the follower best-response equilibrium, $E(F_{\pi_L})$, given the leader strategy $\pi_L$. In full generality, an oracle algorithm could take many forms, including RL, optimization, or any other algorithm that takes as input $\pi_L$ and outputs $E(F_{\pi_L})$. These algorithms may vary in regard to how they access information about $\pi_L$. Our main positive result (Theorem 1) will rely on oracle algorithms that only require query access (also called sample access) to the leader policy; i.e., algorithms that only interact with the leader policy by receiving as input the leader's actions in each of a set of leader observation states, these states queried by the algorithm in some order.
Definition 3 (Query Oracle). An algorithm implementing an oracle $E$ is called a query oracle if its interactions with the leader policy $\pi_L$ are exclusively through queries, where a query is an input to the leader policy; i.e., an observation $o$, together with the response from the leader policy, which is the action $\pi_L(o)$ the leader would take given $o$.
A query oracle can also receive additional information, for example an interaction with the
environment that is associated with the Markov Game (or a description of the environment). The
definition of a query oracle is only in regard to how the oracle algorithm interacts with the leader
policy. In particular, any oracle implemented using RL or another typical learning approach is a
query oracle, as learning algorithms generally only require query access to their environment and
the actions taken by other agents in different states. For example, a suitably formulated RL leader together with RL-based followers (i.e., an "outer loop - inner loop" approach) can be interpreted within our framework as an oracle querying the leader policy.
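The short sketch below illustrates the restriction in Definition 3: a query oracle may treat the leader only as a callable from observations to actions; the type alias and helper function are our illustration, not an interface defined in the paper.

```python
# Illustration of Definition 3: a query oracle interacts with the leader policy
# only through observation -> action queries, never through its parameters.
from typing import Callable, Hashable, List, Tuple

LeaderPolicy = Callable[[Hashable], int]  # the query interface: observation -> action


def query_leader(leader: LeaderPolicy, observations: List[Hashable]) -> List[Tuple[Hashable, int]]:
    """Collect (observation, action) pairs, which is the only information a
    query oracle is allowed to obtain about the leader policy."""
    return [(obs, leader(obs)) for obs in observations]


# Any follower learning algorithm built only on such queries (e.g., follower RL
# that observes the leader's actions during rollouts) is a query oracle; an
# algorithm that inspects the leader's parameters directly is not.
leader = lambda obs: 0 if obs == "opponent_cooperated" else 1  # toy rule-based leader
print(query_leader(leader, ["opponent_cooperated", "opponent_defected"]))
```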
An algorithm that operates directly on a description of the leader policy, $\pi_L$, such as a parametrization $\theta$ of $\pi_L$, is not a query oracle. For instance, if $M$ has a small number of discrete states and actions, $\pi_L$ could be directly parametrized, with $\theta_{ij}$ denoting the probability of taking action $j$ in state $i$. An optimization approach could then compute a best response directly from knowledge of $\theta$, but these parameters are not available through query access. Similarly, $\theta$ could be the weights of a neural network.
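To contrast with query access, here is a toy example of an oracle that reads a tabular parametrization $\theta$ of the leader policy directly and computes a follower best response from it in a single-state game; the payoff matrix and numbers are invented purely for illustration.

```python
# Toy contrast to a query oracle: this "oracle" reads the leader's tabular
# parameters theta directly (theta[s][j] = probability of leader action j in
# state s) and computes a follower best response from them. The payoff matrix
# and single-state setup are invented for illustration only.
import numpy as np

# follower_payoff[leader_action, follower_action] in a single-state game
follower_payoff = np.array([[1.0, 0.0],
                            [0.0, 2.0]])

theta = np.array([[0.8, 0.2]])  # leader plays action 0 with prob. 0.8 in state 0


def best_response_from_parameters(theta: np.ndarray, state: int = 0) -> int:
    # Expected follower payoff of each follower action under the leader's mixed strategy.
    expected = theta[state] @ follower_payoff
    return int(np.argmax(expected))


print(best_response_from_parameters(theta))  # uses theta directly, hence not a query oracle
```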