Oracles & Followers: Stackelberg Equilibria in Deep
Multi-Agent Reinforcement Learning
Matthias Gerstgrasser 1,2 and David C. Parkes 1,3
1John A. Paulson School of Engineering and Applied Science, Harvard University,
Cambridge, MA, USA
2Computer Science Department, Stanford University, Stanford, CA, USA
3Deepmind, London, UK
matthias@seas.harvard.edu, parkes@seas.harvard.edu
June 5, 2023
Abstract
Stackelberg equilibria arise naturally in a range of popular learning problems, such as in
security games or indirect mechanism design, and have received increasing attention in the
reinforcement learning literature. We present a general framework for implementing Stackelberg
equilibria search as a multi-agent RL problem, allowing a wide range of algorithmic design choices.
We discuss how previous approaches can be seen as specific instantiations of this framework.
As a key insight, we note that the design space allows for approaches not previously seen in the
literature, for instance by leveraging multitask and meta-RL techniques for follower convergence.
We propose one such approach using contextual policies, and evaluate it experimentally on both
standard and novel benchmark domains, showing greatly improved sample efficiency compared
to previous approaches. Finally, we explore the effect of adopting algorithm designs outside the
borders of our framework.
1 Introduction
Stackelberg equilibria are an important concept in economics, and in recent years have received
increasing attention in computer science and specifically in the multiagent learning community. They
model an asymmetric setting: a leader who commits to a strategy, and one or more followers who
respond. The leader aims to maximize their reward, knowing that followers in turn will best-respond
to the leader’s choice of strategy. These equilibria appear in a wide range of settings. In security
games, a defender wishes to choose an optimal strategy considering attackers will adapt to it (An
et al., 2017; Sinha et al., 2018). In mechanism design, a designer wants to allocate resources in an
efficient manner, knowing that participants may strategize (Nisan & Ronen, 1999; Swamy, 2007;
Brero et al., 2021a). More broadly, many multi-agent system design problems can be viewed as
Stackelberg equilibrium problems: we as the designer take on the role of the Stackelberg leader,
wishing to design a system that is robust to agent behavior.
In this paper, we are particularly interested in Stackelberg equilibria in sequential decision
making settings, i.e., stochastic Markov games, and using multi-agent reinforcement learning to
learn these equilibria. We make two key contributions:
1. We introduce a new theoretical framework for framing Stackelberg equilibria as a multi-agent reinforcement learning problem.

   (a) We give a theorem (Theorem 1) that unifies and generalizes several prior algorithmic design choices in the literature.

   (b) Complementing this, we show that algorithm designs outside our framework can provably fail (Theorem 2). In particular, we give to our knowledge the first demonstration that reinforcement learning (RL) against best-responding opponents can provably diverge.

2. Inspired by Theorem 1, we introduce a novel approach to accelerating follower best-response convergence, borrowing ideas from multitask and meta-RL, including an experimental evaluation on existing and new benchmark domains.
Our theoretical framework (Section 3) allows a black-box reduction from learning Stackelberg
equilibria into separate leader and follower learning problems. This theory encompasses and
generalizes several prior approaches from the literature, in particular Brero et al. (2021a), and
gives a large design space beyond what has been explored previously. In particular, our broader
framework generalizes beyond learning Stackelberg equilibria along with follower learning dynamics
to general, query-based follower algorithms. The theory is complemented by necessary conditions,
with a demonstration of impossibility for successful Stackelberg learning through RL by the leader
when used together with followers that immediately best respond to the leader policy (Theorem 2).
Our practical meta-RL approach applies contextual policies, a common tool in multitask and meta-RL (Wang et al., 2016), to the follower learning problem. This allows followers to generalize and quickly
adapt to leader policies, as demonstrated on existing benchmark domains, with greatly improved
speed of convergence compared to previous approaches. Beyond this, we use this meta-RL approach
to scale up Stackelberg learning beyond what has previously been shown, to a state-of-the-art RL
benchmark domain built on Atari 2600.
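To give a flavor of the contextual-policy idea before its formal treatment in Section 4, the sketch below conditions a single follower network on a context vector meant to summarize the current leader policy, so that one follower can adapt across many leaders. The PyTorch layers, dimensions, and the way the context is obtained (e.g., from the leader's actions in a few probe states) are illustrative assumptions, not the paper's implementation.

```python
# Minimal sketch of a contextual follower policy: the follower conditions on a
# context vector intended to summarize the current leader policy, so a single
# network can best-respond to many different leaders. Layer sizes and how the
# context is constructed are illustrative assumptions, not the paper's design.
import torch
import torch.nn as nn


class ContextualFollowerPolicy(nn.Module):
    def __init__(self, obs_dim: int, context_dim: int, n_actions: int, hidden: int = 64):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(obs_dim + context_dim, hidden),
            nn.ReLU(),
            nn.Linear(hidden, n_actions),
        )

    def forward(self, obs: torch.Tensor, leader_context: torch.Tensor) -> torch.Tensor:
        # Concatenate the follower's observation with the leader context and
        # return action logits.
        return self.net(torch.cat([obs, leader_context], dim=-1))


# Example usage: the context might, for instance, be the leader's actions in a
# few fixed probe observations, obtained purely through query access.
policy = ContextualFollowerPolicy(obs_dim=8, context_dim=4, n_actions=3)
logits = policy(torch.zeros(1, 8), torch.zeros(1, 4))
action = torch.distributions.Categorical(logits=logits).sample()
```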
In the remainder of the paper, we will introduce Stackelberg equilibria and Markov games in
Section 2. In Section 3, we motivate and define our framework for learning Stackelberg equilibria
using multi-agent RL, and discuss its scope and limitations. We show sufficient conditions in Lemma 1
and Theorem 1. In Theorem 2 we show that RL against best-responding opponents can fail, which
implies that the construction in Theorem 1 is also necessary for RL leaders (Appendix C.1 discusses
a case of non-RL leaders that does not require Theorem 1). In Appendix B and Appendix C we
discuss further ablations and edge cases of our framework. We define our novel Meta-RL approach
in Section 4, and empirically evaluate it in Section 4.1 on existing and new benchmark domains.
1.1 Prior Work
Learning Stackelberg Equilibria. Most prior work on Stackelberg equilibria focuses on single-
shot settings such as normal-form games, a significantly simpler setting than Markov games. Some of
this work studies solving an optimization problem to find a Stackelberg equilibrium, given an explicit description of the problem (Paruchuri et al., 2008; Xu et al., 2014; Blum et al., 2014; Li et al., 2022). Among the first works to learn a Stackelberg equilibrium was Letchford et al. (2009), who focus
on single-shot Bayesian games. Peng et al. (2019) also give results for sample access to the payoffs
of matrix games. Wang et al. (2022) give an approach that differentiates through optimality (KKT)
conditions, again for normal-form games. Bai et al. (2021) give lower and upper bounds on learning
Stackelberg equilibria in general-sum games, including so-called “bandit RL” games that have one
step for the leader and sequential decision-making for the followers. Few works consider Markov
games: Zhong et al. (2021) give algorithms that find Stackelberg equilibria in Markov games, but
assume myopic followers, a significant limitation compared to the general case. Brero et al. (2021a,b, 2022) use an inner-outer-loop approach, which they call the Stackelberg POMDP, and which can be seen as a special case of our framework.
Mechanism Design. One of the first works specifically discussing Stackelberg equilibria in a
learning context is Swamy (2007), who designs interventions in traffic patterns. More recently, several strands of work have focused on using multi-agent RL for economic design, often framing this as a bi-level or inner-outer-loop optimization problem. Zheng et al. (2022) use a bi-level RL approach to
design optimal tax policies, and Yang et al. (2022) use meta-gradients in a specific incentive design
setting. Shu & Tian (2018) and Shi et al. (2019) learn leader policies in a type of incentive-shaping
setting, using a form of modeling other agents coupled with rule-based followers. Balaguer et al. (2022) use an inner-loop/outer-loop gradient-descent approach for mechanism design on iterated matrix games (which we also use as an experimental testbed). They mainly focus on the case where both the environment transitions and the follower learning behavior are differentiable, and otherwise fall back to evolutionary strategies for the leader.
Opponent Shaping. Also broadly related is a line of work on opponent shaping, such as M-FOS (Lu et al., 2022) and LOLA (Foerster et al., 2017). While these works also learn while taking
into account other agents’ responses, they intentionally only let the other agents learn “a little bit”
between the opponent-shaping agent’s learning steps, as opposed to letting them learn until they
best-respond. This reflects a difference in goals: Opponent-shaping aims to predict and exploit other
agents’ learning behaviors, whereas Stackelberg work aims to learn strategies that are robust even
when other agents are able to learn and adapt as much as they like.
2 Preliminaries
Markov games. We consider partially observable stochastic Markov games, which are a multi-
agent generalization of a partially observable Markov Decision Process (POMDP).
Definition 1 (Markov Game). A Markov Game, $M$, with $n$ agents is a tuple $(S, A, T, r, \Omega, O, \gamma)$, consisting of a state space $S$, an action space $A = (A_1, \ldots, A_n)$, a (stochastic) transition function $T: S \times A \to S$, a (stochastic) reward function $r: S \times A \to \mathbb{R}^n$, an observation space $\Omega = (\Omega_1, \ldots, \Omega_n)$, a (stochastic) observation function $O: S \times A \to \Omega$, and a discount factor $\gamma$.
At each step $t$ of the game, every agent $i$ chooses an action $a_{i,t}$ from their action space $A_i$, the game state evolves according to the joint action $(a_{1,t}, \ldots, a_{n,t})$ and the transition function $T$, and agents receive observations and rewards according to $O$ and $r$. An agent's behavior in the game is characterized by its policy $\pi_i: o_i \mapsto a_i$, which maps observations to actions.¹ Each agent in a Markov Game individually seeks to maximize its own (discounted) total return $\sum_t \gamma^t r_i(s_t, a_{i,t}, a_{-i,t})$.
This gives rise to the usual definitions of Nash equilibria (NE), correlated equilibria (CE), and coarse correlated equilibria (CCE), which we do not repeat in full here, as well as their Bayesian counterparts. Note that strategies in Markov games and in each of these equilibrium definitions are policies, not actions: a pair of policies $\pi_1, \pi_2$ in a two-player Markov game is a Nash equilibrium if neither agent can increase their expected total reward by unilaterally deviating to a different policy $\pi_i'$.
Stackelberg Equilibria. Unlike the above equilibrium concepts, a Stackelberg equilibrium is
not symmetric: There is a special player, the leader, who commits to their strategy first; the other
player (the follower) then chooses their best strategy given the leader’s choice of strategy. This
makes the leader potentially more powerful.
Example 1. In a game often called the “battle of the sexes,” you and I wish to have dinner together,
but you prefer restaurant A (deriving happiness 2, but I only get happiness 1), and I prefer restaurant
B (I get happiness 2, you get happiness 1)—but we would both rather eat together at our less-preferred
venue, than to eat separately (we both get happiness 0). Table 1 shows the payoff matrix of this
game. There are two pure Nash equilibria in this game: We both go to restaurant A, or we both
go to restaurant B. But there is only a single Stackelberg equilibrium (per leader): If you commit
to going to restaurant A, then my only best response is to also go to restaurant A. In doing so I
receive happiness 1, whereas my only alternative would be to eat alone at restaurant B for happiness
0. Notice that this hinges on the leader strictly committing to their choice of restaurant.
Table 1: "Battle of the Sexes" game.

        A      B
   A   2,1    0,0
   B   0,0    1,2
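As a sanity check of Example 1, the toy snippet below enumerates pure commitments for the committing player and computes the other player's best response; the payoff tuples follow the example's description (the committing player, "you", gets 2 at restaurant A). It is a worked illustration only, not an algorithm from the paper.

```python
# Worked illustration of Example 1: enumerate pure commitments for the committing
# player in the "Battle of the Sexes" and compute the other player's best response.
# Payoff tuples are (committing player, responder), taken from the example's text.
payoffs = {
    ("A", "A"): (2, 1), ("A", "B"): (0, 0),
    ("B", "A"): (0, 0), ("B", "B"): (1, 2),
}
actions = ["A", "B"]

best_value, best_profile = None, None
for leader_action in actions:
    # The responder best-responds to the committed action.
    follower_action = max(actions, key=lambda f: payoffs[(leader_action, f)][1])
    value = payoffs[(leader_action, follower_action)][0]
    if best_value is None or value > best_value:
        best_value, best_profile = value, (leader_action, follower_action)

print(best_profile, best_value)  # ('A', 'A') 2: committing to A yields the committer payoff 2
```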
This Stackelberg concept also extends to Markov games: here a leader agent $L$ decides on their policy (i.e., strategy), and the remaining (follower) agents—knowing the leader's choice of policy—best-respond. The leader seeks to maximize their own reward, considering that followers will best-respond. For instance, in an Iterated Prisoners' Dilemma (Robinson & Goforth, 2005), a leader might commit to a Tit-for-Tat strategy, in turn leading the follower to cooperate.
Typically, a Stackelberg equilibrium is formally defined using a max-min-
style condition: The leader maximizes its own reward knowing that the follower
best-responds, i.e., maximizes its own reward, with the leader-follower dynamic giving the order
of the two nested max-operators. This in turn suggests a nested outer-inner loop (reinforcement)
learning approach, where the follower trains until convergence every time the leader updates its policy.
As a key innovation in this work, we instead use a statement of the follower best-response through an
oracle abstraction. An oracle definition has been used before in order to extend Stackelberg equilibria
to multiple followers Nakamura (2015); Zhang et al. (2016); Liu (1998); Solis et al. (2016); Sinha
et al. (2014); Wang et al. (2022); Brero et al. (2021a). In contrast, we use the oracle abstraction to
greatly simplify the statement and proof of our main theorem, while simultaneously generalizing it
beyond prior approaches. In turn, this allows us to develop a novel Meta-RL approach in the second
part of the paper.
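For intuition, the pseudocode sketch below shows the generic "outer loop over the leader, inner-loop follower oracle" pattern that the oracle abstraction captures; all names (train_followers_to_equilibrium, env.rollout, leader.update) are placeholders, and the conditions under which such a scheme succeeds or fails are exactly what Theorems 1 and 2 address.

```python
# Pseudocode sketch of the generic outer/inner-loop pattern that the oracle
# abstraction captures: whenever the leader policy changes, a follower oracle E
# produces (an approximation of) the induced follower equilibrium E(F_{pi_L}).
# All names below are placeholders, not the paper's API.

def train_followers_to_equilibrium(env, leader_policy):
    """Follower oracle E: e.g., run follower RL against the fixed leader policy
    until the followers (approximately) best-respond."""
    ...


def stackelberg_outer_loop(env, leader, num_iterations):
    for _ in range(num_iterations):
        # Inner loop: the follower oracle needs only query access to leader.policy.
        followers = train_followers_to_equilibrium(env, leader.policy)
        # Outer loop: the leader improves its policy against the best-responding
        # followers, e.g., via a policy-gradient step on the leader's return.
        trajectories = env.rollout(leader.policy, followers)
        leader.update(trajectories)
    return leader
```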
With multiple followers, any choice of leader strategy, $\pi_L$, induces a Markov game, $F_{\pi_L}$, between the followers, which could feature multiple equilibria as well as equilibria of different types, such as Nash, correlated, and coarse correlated equilibria, each giving rise to a corresponding Stackelberg-Nash, Stackelberg-CE, and Stackelberg-CCE concept. This motivates an oracle abstraction. For any choice of leader strategy, we denote as $E(F_{\pi_L})$ a follower equilibrium (or a probability distribution over equilibria), where we refer to $E$ as an oracle. This oracle will later be realized as an algorithm. See also Wang et al. (2022).

¹To keep notation concise we discuss here the memory-less case, but all our results generalize to a stateful leader policy in a straightforward manner, as we discuss in Appendix B.
Definition 2 (Stackelberg equilibrium). Given a Markov Game $M$ and a follower best-response oracle $E$, a leader strategy $\pi_L$ together with a tuple of follower strategies $\pi_F$ is a Stackelberg equilibrium if and only if $\pi_L$ maximizes the leader's expected reward under the condition that follower strategies are drawn from $E(F_{\pi_L})$:
$$\pi_L \in \arg\max_{\pi_L} \; \mathbb{E}_{\pi_F \sim E(F_{\pi_L})}\Big[\sum_t \mathbb{E}\big[r_L(s_t, a_{L,t}, a_{F,t})\big]\Big],$$
where the second expectation is drawing actions and state transitions from their respective policies $\pi_L, \pi_F$ and transition function $T$, and the reward function is $r$, all as in Definition 1. If the follower oracle $E$ gives a Nash equilibrium in the induced game $F_{\pi_L}$, we call this a Stackelberg-Nash equilibrium, and similarly for CE and CCE.
In the remainder of the paper, when we say "oracle" we mean an algorithm that computes or learns the follower best-response equilibrium, $E(F_{\pi_L})$, given the leader strategy $\pi_L$. In full generality, an oracle algorithm could take many forms, including RL, optimization, or any other algorithm that takes as input $\pi_L$ and outputs $E(F_{\pi_L})$. These algorithms may vary in regard to how they access information about $\pi_L$. Our main positive result (Theorem 1) will rely on oracle algorithms that only require query access (also called sample access) to the leader policy; i.e., algorithms that only interact with the leader policy by receiving as input the leader's actions in each of a set of leader observation states, these states queried by the algorithm in some order.
Definition 3 (Query Oracle). An algorithm implementing an oracle $E$ is called a query oracle if its interactions with the leader policy $\pi_L$ are exclusively through queries, where a query is an input to the leader policy; i.e., an observation $o$, together with the response from the leader policy, which is the action $\pi_L(o)$ the leader would take given $o$.
A query oracle can also receive additional information, for example an interaction with the
environment that is associated with the Markov Game (or a description of the environment). The
definition of a query oracle is only in regard to how the oracle algorithm interacts with the leader
policy. In particular, any oracle implemented using RL or another typical learning approach is a
query oracle, as learning algorithms generally only require query access to their environment and
the actions taken by other agents in different states. For example, a suitably formulated RL leader together with RL-based followers (i.e., an "outer loop - inner loop" approach) can be interpreted within our framework as an oracle querying the leader policy.
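The short sketch below illustrates the restriction in Definition 3: a query oracle may treat the leader only as a callable from observations to actions; the type alias and helper function are our illustration, not an interface defined in the paper.

```python
# Illustration of Definition 3: a query oracle interacts with the leader policy
# only through observation -> action queries, never through its parameters.
from typing import Callable, Hashable, List, Tuple

LeaderPolicy = Callable[[Hashable], int]  # the query interface: observation -> action


def query_leader(leader: LeaderPolicy, observations: List[Hashable]) -> List[Tuple[Hashable, int]]:
    """Collect (observation, action) pairs, which is the only information a
    query oracle is allowed to obtain about the leader policy."""
    return [(obs, leader(obs)) for obs in observations]


# Any follower learning algorithm built only on such queries (e.g., follower RL
# that observes the leader's actions during rollouts) is a query oracle; an
# algorithm that inspects the leader's parameters directly is not.
leader = lambda obs: 0 if obs == "opponent_cooperated" else 1  # toy rule-based leader
print(query_leader(leader, ["opponent_cooperated", "opponent_defected"]))
```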
An algorithm that operates directly on a description of the leader policy, $\pi_L$, such as a parametrization $\theta$ of $\pi_L$, is not a query oracle. For instance, if $M$ has a small number of discrete states and actions, $\pi_L$ could be directly parametrized, with $\theta_{ij}$ denoting the probability of taking action $j$ in state $i$. An optimization approach could then compute a best response directly from knowledge of $\theta$, but these parameters are not available through query access. Similarly, $\theta$ could be the weights of a neural network.
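To contrast with query access, here is a toy example of an oracle that reads a tabular parametrization $\theta$ of the leader policy directly and computes a follower best response from it in a single-state game; the payoff matrix and numbers are invented purely for illustration.

```python
# Toy contrast to a query oracle: this "oracle" reads the leader's tabular
# parameters theta directly (theta[s][j] = probability of leader action j in
# state s) and computes a follower best response from them. The payoff matrix
# and single-state setup are invented for illustration only.
import numpy as np

# follower_payoff[leader_action, follower_action] in a single-state game
follower_payoff = np.array([[1.0, 0.0],
                            [0.0, 2.0]])

theta = np.array([[0.8, 0.2]])  # leader plays action 0 with prob. 0.8 in state 0


def best_response_from_parameters(theta: np.ndarray, state: int = 0) -> int:
    # Expected follower payoff of each follower action under the leader's mixed strategy.
    expected = theta[state] @ follower_payoff
    return int(np.argmax(expected))


print(best_response_from_parameters(theta))  # uses theta directly, hence not a query oracle
```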