Stackelberg POMDP: A Reinforcement Learning Approach for
Economic Design
Gianluca Brero∗1, Alon Eden∗2, Darshan Chakrabarti3, Matthias Gerstgrasser4, Amy Greenwald5,
Vincent Li6, and David C. Parkes7
1Data Science Initiative, Brown University
2School of Computer Science and Engineering, Hebrew University of Jerusalem
3Department of Industrial Engineering and Operations Research, Columbia University
5Department of Computer Science, Brown University
4,6,7John A. Paulson School of Engineering and Applied Sciences, Harvard University
1,5{gianluca_brero,amy_greenwald}@brown.edu
2alon.eden@mail.huji.ac.il
3dc3595@columbia.edu
4,7{matthias,parkes}@g.harvard.edu
6vincentli@college.harvard.edu
July 22, 2024

∗These authors contributed equally to this work. This is the extended version of Brero et al. [2021a], which was presented at the ICML Workshop for Reinforcement Learning Theory (ICML 2021).
Abstract
We introduce a reinforcement learning framework for economic design where the interaction between the environment designer and the participants is modeled as a Stackelberg game. In this game, the designer (leader) sets up the rules of the economic system, while the participants (followers) respond strategically. We integrate algorithms for determining followers' response strategies into the leader's learning environment, providing a formulation of the leader's learning problem as a POMDP that we call the Stackelberg POMDP. We prove that the optimal leader's strategy in the Stackelberg game is the optimal policy in our Stackelberg POMDP under a limited set of possible policies, establishing a connection between solving POMDPs and Stackelberg games. We solve our POMDP under a limited set of policy options via the centralized training with decentralized execution framework. For the specific case of followers that are modeled as no-regret learners, we solve an array of increasingly complex settings, including problems of indirect mechanism design where there is turn-taking and limited communication by agents. We demonstrate the effectiveness of our training framework through ablation studies. We also give convergence results for no-regret learners to a Bayesian version of a coarse-correlated equilibrium, extending known results to correlated types.
1 Introduction
A digital transformation is turning markets into algorithmic platforms that make use of complex mechanisms
to facilitate the matching of supply and demand. Examples include Amazon’s marketplace, Uber, GrubHub,
and AirBnB. These market platforms allow for an unprecedented level of control over mechanisms for
matching and pricing. However, mechanism design—the engineering side of economic theory—does not
always provide enough guidance in regard to how to design these markets to achieve properties of interest,
such as economic efficiency or other specific objectives.
Mechanism design theory has mainly focused on direct revelation mechanisms, where participants directly report their preferences in an incentive-aligned way. While in theory this is without loss of generality due to the revelation principle [Laffont and Tirole, 1993], in practice most mechanisms deployed on market platforms are indirect: participants respond to product choices and prices, and do not communicate directly about their preferences. This emphasis on indirect mechanisms gives rise to the need to consider the behavior of agents that is induced by the rules of a mechanism: no longer can one simply look for direct mechanisms in which it is incentive compatible to make truthful reports of preferences. Rather, we need to ask, for example, how agents will make use of simpler messaging schemes and respond to an array of products and prices.
This problem of indirect mechanism design can be modeled as one of finding an optimal leader strategy
in a Stackelberg game. In this game, the strategy of the leader corresponds to a commitment to the rules of
the mechanism that is used for pricing and allocation, and the followers correspond to the market participants
who respond to the induced economic environment. We introduce a new methodology for finding optimal
leader strategies in complex Stackelberg games, demonstrating how it can be applied to economic scenarios
capturing many aspects of indirect mechanisms, including communication, multi-round interaction, and
uncertainty regarding participants’ types.
Stackelberg games have been widely studied in computer science, with applications to security [Sinha et al., 2018], wildlife conservation [Fang et al., 2016], and taxation policies [Zhou et al., 2019, Wang, 1999], among others. Much of the research has focused on solving these games in simple settings, such as those with a single follower [Conitzer and Sandholm, 2006, e.g.], complete-information settings where followers move simultaneously [Basilico et al., 2017, e.g.], or settings where the leader's strategies can be enumerated and optimized via bandit approaches [Bai et al., 2021, e.g.]. A more recent research thread [Zhang et al., 2023a,b] focuses on designing payment schemes that steer no-regret-learning agents to play desirable equilibria in extensive-form games.
Our work differs by considering multi-round games with incomplete information, where the leader’s
interventions involve not only monetary compensation but also soliciting bids and allocating items.1 We
approach this by treating these interventions as policies and refining them using reinforcement learning
techniques. Furthermore, we don’t restrict the followers to just adapting via no-regret algorithms in response
to the leader’s strategy. Our framework is broader, allowing for various follower behavior models, including
those emulating human behavior through inverse reinforcement learning [Arora and Doshi, 2021].

1 In one of our scenarios, the leader's task involves generating a game that allows for communication between a mechanism and agents, where the semantics of this communication arises from the strategic response of followers, and where the order in which agents participate in the market and the prices they see depend, through the rules of the mechanism, on this communication.
Formally, we model our settings as Stackelberg Partially Observable Markov Games, where the leader
commits to a strategy, and the followers respond to this commitment. The partial observability in these
games arises from the followers' private types. To our knowledge, this work is the first to provide a framework that supports a proof of convergence to a Stackelberg policy in Partially Observable Markov Games (POMGs).2 We achieve this result through the suitable construction of a single-agent leader learning problem that integrates the behavior of the followers, as they learn to adapt to the leader's strategy, into the leader's learning environment.

2 Zhong et al. [2021] support convergence to an optimal leader strategy in POMGs, but they restrict their followers to be myopic.
The learning method that we develop for solving for the Stackelberg leader’s policy in POMGs works
for any model of followers in which they learn to respond to the leader's policy by interacting with it across trajectories of states sampled from the POMG. We refer to these follower algorithms
as policy-interactive learning algorithms. Our insight is that since the followers’ policy-interactive learning
algorithms adapt to a leader’s strategy by taking actions in game states, we can incorporate these algorithms
into a carefully constructed POMDP representing the leader’s learning problem. In this POMDP, the policy’s
actions determine the followers’ response (indirectly through the interaction of the leader’s policy with
the followers’ learning algorithms) and in turn the reward to the leader. This idea allows us to establish a
connection between solving POMDPs and solving Stackelberg games: we prove that the optimal policy in
our POMDP under a limited set of policy options is the optimal Stackelberg leader strategy for the given
model of follower behavior. We handle learning under limited policy options via a novel use of actor-critic
algorithms based on centralized training and decentralized execution [see, e.g., Lowe et al., 2017].
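To make this construction concrete, the following minimal Python sketch shows one way to picture a single episode of the leader's learning environment, with the followers' policy-interactive learning running inside it. All names here (stackelberg_episode, play_game, follower_update) are our own illustration, not the paper's API, and the actor-critic machinery used for centralized training with decentralized execution is omitted.

```python
from typing import Callable, List

# Schematic sketch (our own naming) of the leader's learning environment:
# the followers' policy-interactive learner runs *inside* the leader's episode,
# so the leader's actions shape the followers' response and, in turn, the
# leader's end-of-episode reward.
def stackelberg_episode(play_game: Callable,      # plays G once: (leader_policy, follower_policies) -> (outcome, leader_return)
                        leader_policy: Callable,  # pi_ell, queried for actions in every play of G
                        follower_policies: List,  # current follower policies (the PI algorithm's state)
                        follower_update: Callable,  # PI response update: (follower_policies, outcome) -> new policies
                        n_learning_plays: int) -> float:
    # Learning phase: followers adapt to the committed leader policy by
    # repeatedly playing G against it (the PI response algorithms of Section 3).
    for _ in range(n_learning_plays):
        outcome, _ = play_game(leader_policy, follower_policies)
        follower_policies = follower_update(follower_policies, outcome)
    # Reward phase: the leader is evaluated against the followers' learned response.
    _, leader_return = play_game(leader_policy, follower_policies)
    return leader_return
```

Under this sketch, the leader's reward only arrives after the followers have adapted to the committed policy, which is what ties the return of the leader's single-agent problem to the Stackelberg objective.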
With this general result in place, we give algorithmic results for a model of followers as no-regret
learners, and demonstrate the robustness and flexibility of the resulting Stackelberg POMDP by evaluating
it across an array of increasingly complex settings.
Further related work. The Stackelberg POMDP framework introduced in the present paper has been further explored in subsequent studies. In Brero et al. [2022], Stackelberg POMDP is utilized for discovering rules for economic platforms that mitigate collusive behavior among pricing agents employing Q-learning algorithms. Gerstgrasser and Parkes [2023] have further generalized Stackelberg POMDP to accommodate domain-specific queries for faster computation of followers' responses via meta-learning. However, their work only studied settings with complete information, where followers lack private types. This limitation allowed them to achieve good results without using the centralized training with decentralized execution framework we discuss in this work, as detailed in our experiments section.
2 Preliminaries
We consider a multi-round Stackelberg game with n + 1 agents: a leader (ℓ) and n followers (indexed 1, . . . , n). The leader gains a strategic advantage by committing to an initial strategy, influencing the subsequent choices made by the followers. This early commitment allows the leader to anticipate the followers' reactions to their chosen strategy.
We formulate our Stackelberg game through the formalism of a partially observable Markov game [Hansen et al., 2004] between the leader and the followers:
Definition 2.1 (Partially Observable Markov game). A partially observable Markov game (POMG) G with k agents (indexed 1, . . . , k) is defined by the following components:
• A set of states, S, with each state s ∈ S containing the necessary information on game history to predict the game's progression based on agents' actions.
• A set of actions, A_i, for each agent i ∈ [k]. The action profile is denoted by a = (a_1, . . . , a_k) ∈ A, where A = A_1 × · · · × A_k.
• A set of observations, O_i, for each agent i ∈ [k]. The agents' observation profile is denoted o = (o_1, . . . , o_k) ∈ O, where O = O_1 × · · · × O_k.
• A state transition function, Tr : S × A × S → [0, 1], which defines the probability of transitioning from one state to another given the current state and action profile: Tr(s, a, s′) = Pr(s′ | s, a).
• An observation generation rule, O : S × A × O → [0, 1], which defines the probability distribution over observations given the current state and action profile: O(s, a, o) = Pr(o | s, a).
• A reward function, R_i : S × A → ℝ, which maps, for each agent i ∈ [k], the current state and action profile to the immediate reward received by agent i, R_i(s, a).
• A finite time horizon, T.
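As a concrete illustration of these components, here is a minimal Python sketch of a POMG interface together with a single-episode rollout; the field and function names are ours, not part of the paper's formalism.

```python
from dataclasses import dataclass
from typing import Callable, Sequence, Tuple

# Illustrative container for the components of Definition 2.1 (all names are ours).
@dataclass
class POMG:
    n_agents: int                                  # k
    horizon: int                                   # T
    initial: Callable[[], Tuple[object, Tuple]]    # draws s_0 and the initial observations (o_{1,0}, ..., o_{k,0})
    transition: Callable[[object, Tuple], object]  # samples s' ~ Tr(s, a, .)
    observe: Callable[[object, Tuple], Tuple]      # samples o ~ O(s, a, .)
    reward: Callable[[int, object, Tuple], float]  # R_i(s, a)

def rollout(game: POMG, policies: Sequence[Callable]):
    """Play one episode; policies[i] maps agent i's observation history to an action (pi_i : H_i -> A_i)."""
    s, o = game.initial()
    histories = [[o[i]] for i in range(game.n_agents)]
    returns = [0.0] * game.n_agents
    trajectory = []
    for _ in range(game.horizon):
        a = tuple(policies[i](tuple(histories[i])) for i in range(game.n_agents))
        for i in range(game.n_agents):
            returns[i] += game.reward(i, s, a)
        trajectory.append((s, a))
        o = game.observe(s, a)
        for i in range(game.n_agents):
            histories[i].append(o[i])
        s = game.transition(s, a)
    return trajectory, returns
```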
Each agent i aims to maximize their expected total reward over T steps. Going forward, we use a subscript t with every POMG variable to refer to its value at the t-th step of the game. The history of agent i's observations at step t is denoted by h_{i,t} = (o_{i,0}, . . . , o_{i,t}), while h_i represents a generic history with unspecified time step and H_i is the set of these histories. Agent i's behavior in the game is characterized by their policy, π_i : H_i → A_i, where π_i(h_i) is the policy action when observing history h_i.3 We let Π_i be the set of these policies.

3 We only consider deterministic policies for simplicity, but our results extend to randomized policies.
Given a policy profile π_I for a subset I ⊆ [k] of agents in G, we can define a subgame in G, denoted G_{π_I}, for the remaining agents in [k] \ I:
Definition 2.2 (Subgame of a POMG (Informal Definition, Formal in Appendix)). Given a POMG G, a subset I of the agents, and a policy profile π_I for agents in I, we define the subgame POMG, G_{π_I}, as the game induced among agents [k] \ I by policy profile π_I. Actions, rewards, and observations remain the same as in G, while states are augmented to include policies π_I and observation histories h_I. State transitions in G_{π_I} are induced by the state transition function in G, with the actions of the agents in I in G computed "internally" by the game G_{π_I} by applying policies π_I on histories h_I.
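In code, the subgame can be pictured as a wrapper that plays the fixed policies π_I internally, reusing the illustrative POMG interface sketched after Definition 2.1; again, the class and method names are our own.

```python
# Sketch of Definition 2.2 (names are ours): fixing policies pi_I for a subset I of
# agents turns G into a game G_{pi_I} for the remaining agents in [k] \ I.
class Subgame:
    def __init__(self, game, fixed_policies):
        self.game = game
        self.fixed = fixed_policies                 # dict {i: pi_i} for agents i in I
        self.free = [i for i in range(game.n_agents) if i not in fixed_policies]

    def rollout(self, free_policies):
        """free_policies: dict {j: pi_j} for the remaining agents j in [k] \\ I."""
        s, o = self.game.initial()
        histories = [[o[i]] for i in range(self.game.n_agents)]
        returns = {j: 0.0 for j in self.free}
        for _ in range(self.game.horizon):
            # Agents in I act "internally" by applying pi_I to their own histories h_I,
            # while the remaining agents act through the supplied policies.
            a = tuple((self.fixed.get(i) or free_policies[i])(tuple(histories[i]))
                      for i in range(self.game.n_agents))
            for j in self.free:
                returns[j] += self.game.reward(j, s, a)
            o = self.game.observe(s, a)
            for i in range(self.game.n_agents):
                histories[i].append(o[i])
            s = self.game.transition(s, a)
        return returns
```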
We can now formalize our Stackelberg game as follows:
Definition 2.3 (Stackelberg Partially Observable Markov Game). Let G be a partially observable Markov game (POMG) involving a leader and n followers. We define the Stackelberg POMG SG based on G as follows:
• Leader's Commitment: The leader selects a policy π_ℓ within G.
• Induced Subgame: The followers play the subgame POMG G_{π_ℓ}, induced by the leader's committed policy. This results in a state-action trajectory τ = (s_0, a_0, . . . , s_{T-1}, a_{T-1}) within G.
• Leader's Reward: The leader receives a reward R_ℓ(τ) = Σ_{t=0}^{T-1} R_ℓ(s_t, a_t) based on the trajectory τ.
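Putting the pieces together, and storing the leader at index 0 as a purely illustrative convention, the three steps of Definition 2.3 reduce to a short routine built on the rollout() helper sketched after Definition 2.1:

```python
LEADER = 0  # illustrative convention: the leader occupies index 0 of the policy profile

def leader_return(game, leader_policy, follower_policies):
    """Commit to pi_ell, let the followers play the induced subgame, and return the
    leader's total reward R_ell(tau) along the resulting state-action trajectory."""
    policies = [leader_policy] + list(follower_policies)
    _trajectory, returns = rollout(game, policies)  # rollout() from the sketch after Definition 2.1
    return returns[LEADER]
```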
Given our focus on economic design, in the remainder of this paper we consider Bayesian settings that model standard mechanism design problems. In these settings, when the game begins each follower i privately observes their type θ_i (i.e., o_{i,0} = θ_i), which captures their preferences for the mechanism's outcomes. The leader, in contrast, begins with an empty initial observation (i.e., o_{ℓ,0} = 0). We use θ = (θ_1, . . . , θ_n) for the follower type profile and assume that type profiles are drawn from a (possibly correlated) distribution D.
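In code terms, the Bayesian structure only changes how an episode is initialized: a type profile is drawn from D, each follower's first observation is its own type, and the leader starts with an empty observation. A hypothetical sketch (our names, leader at index 0 as above):

```python
import random

def make_bayesian_initial(draw_type_profile, initial_state):
    """Build an initial() function for the POMG sketch: theta ~ D becomes the followers'
    first observations (o_{i,0} = theta_i), while the leader's first observation is empty."""
    def initial():
        theta = draw_type_profile()              # theta = (theta_1, ..., theta_n), possibly correlated
        s0 = initial_state(theta)                # the state can carry the realized type profile
        observations = (None,) + tuple(theta)    # leader observes nothing; follower i observes theta_i
        return s0, observations
    return initial

# Hypothetical correlated type distribution D for two followers: both types share a common shock.
def draw_two_correlated_types():
    common = random.random()
    return (common + 0.1 * random.random(), common + 0.1 * random.random())
```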
3 The Leader’s Problem
The leader in a Stackelberg game optimizes their reward by anticipating how the followers will respond to
their strategy. In general-sum partially observable Markov games (POMGs), followers can exhibit a wide
range of response behaviors. This includes both equilibrium strategies (of which there may be many) and
non-equilibrium behaviors, such as the seemingly collusive ones isolated by Calvano et al. [2020] and Abada
and Lambin [2023]. Our framework accommodates all the possible behaviors that can be learned by the
followers within the standard joint environment, i.e., by repeatedly playing the POMG G against the leader.
To formalize this process, we introduce the notion of policy-interactive response algorithms, which we
define as follows:
Definition 3.1 (Policy-Interactive Response Algorithm). Given a Stackelberg POMG, SG, and a leader policy π_ℓ in SG, we define a policy-interactive (PI) response algorithm as an algorithm that determines a behavior function for the followers in the subgame G_{π_ℓ} by repeatedly playing game G against π_ℓ. A PI response algorithm consists of the following elements:
• A set of states, S_b, where each state s_b = (π_{-ℓ}, s_b^+) ∈ S_b comprises two parts: a policy profile π_{-ℓ} representing the followers' strategies in G, and an additional information state s_b^+ utilized for state transitions.
• A state transition function, Tr_b(s_b, O, s_b′) = Pr(s_b′ | s_b, O), that defines the probability of transitioning from one state to another. The transition is based on the current state s_b and the outcome O of game G played using the follower policies in s_b and the leader's actions queried from π_ℓ.4
• A finite time horizon E, corresponding to the total number of game plays run by the algorithm.
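Schematically, a PI response algorithm is the following loop over E plays of G against the committed leader policy; the state s_b pairs the followers' current policies with the auxiliary information s_b^+, and the update rule Tr_b is left abstract here (it is later instantiated with multiplicative weights). The function names are our own.

```python
def pi_response(play_game,         # plays G once: (pi_ell, follower_policies) -> outcome O
                pi_ell,            # the leader's committed policy
                initial_policies,  # followers' initial policy profile
                initial_aux,       # initial auxiliary information s_b^+ (e.g., learners' internal statistics)
                transition_b,      # Tr_b: ((policies, aux), outcome) -> (new_policies, new_aux)
                n_plays):          # horizon E
    """Run a policy-interactive response algorithm and return the followers' response policies."""
    state = (initial_policies, initial_aux)      # s_b = (pi_{-ell}, s_b^+)
    for _ in range(n_plays):
        policies, _aux = state
        outcome = play_game(pi_ell, policies)    # one play of G against the committed pi_ell
        state = transition_b(state, outcome)     # one step of Tr_b, driven by the observed outcome
    # The final state s_{b,E} contains the followers' response profile pi*_{-ell};
    # viewed as a function of pi_ell, this loop computes the behavior function B(pi_ell).
    return state[0]
```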
As previously done for POMGs, we use subscript e with every variable of our PI algorithm to refer to its value at the e-th game play. Consequently, the final state s_{b,E} of our PI algorithm includes the followers' response policy profile π*_{-ℓ}, where π*_{-ℓ} = (π*_1, . . . , π*_n). To express the leader's problem, we denote the followers' responses using a behavior function B(·). This function takes the leader's strategy π_ℓ as input and returns followers' response strategies B(π_ℓ) = π*_{-ℓ}. We let

(τ_ℓ, τ*_{-ℓ}) = (s_0, (a_{ℓ,0}, a*_{-ℓ,0}), . . . , s_{T-1}, (a_{ℓ,T-1}, a*_{-ℓ,T-1}))

be a state-action trajectory in G generated by π_ℓ and B(π_ℓ), and Pr(· | π_ℓ, B) be the distribution on trajectories. We formulate the leader's problem:
Definition 3.2 (Optimal leader strategy). Given a Stackelberg POMG, SG, and a behavior function B(·) for the followers in SG, the leader's problem is defined as finding a policy that maximizes their expected reward under B(·):

π*_ℓ ∈ arg max_{π_ℓ ∈ Π_ℓ}  E_{(τ_ℓ, τ*_{-ℓ}) ∼ Pr(· | π_ℓ, B)} [ Σ_{t=0}^{T-1} R_ℓ(s_t, (a_{ℓ,t}, a*_{-ℓ,t})) ].
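The expectation in Definition 3.2 can be estimated by Monte Carlo rollouts once the followers' response is computed; a hypothetical sketch, reusing leader_return() and a behavior function such as the pi_response() loop above:

```python
def estimate_leader_value(game, leader_policy, behavior_fn, n_samples):
    """Monte Carlo estimate of the leader's expected return under the behavior function B(.)."""
    response = behavior_fn(leader_policy)      # B(pi_ell) = pi*_{-ell}
    total = 0.0
    for _ in range(n_samples):
        # Each rollout draws fresh types via game.initial() and plays pi_ell against B(pi_ell).
        total += leader_return(game, leader_policy, response)
    return total / n_samples

# The leader's problem is then to choose pi_ell, within the allowed policy class, that
# maximizes this value; in the paper, this outer search is carried out by reinforcement
# learning on the Stackelberg POMDP rather than by enumeration.
```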
In the remainder of this paper we assume that our behavior function B returns Bayesian Coarse-Correlated Equilibria (BCCE), which we define in the Appendix. In the Appendix, we also prove that this behavior function can be derived when followers use no-regret learning algorithms, extending a result proven by Hartline et al. [2015] to the case of correlated types. We note that BCCEs are equivalent to Coarse-Correlated Equilibria (CCE) when the followers only have one type, i.e., the setting is effectively non-Bayesian. We then instantiate the policy-interactive response algorithm to the multiplicative-weights algorithm, which is a specific no-regret learning algorithm. In our Appendix, we show how to formulate the multiplicative-weights algorithm as a PI response algorithm.
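As a concrete no-regret instance, the following is the textbook multiplicative-weights update for a single follower with a finite action set, written with our own function names; the appendix's PI formulation wraps this kind of update into the loop sketched after Definition 3.1, using each play's (counterfactual) utilities as feedback.

```python
import math
import random

def multiplicative_weights_update(weights, utilities, eta):
    """One multiplicative-weights step: reweight each action exponentially in its utility."""
    new_weights = [w * math.exp(eta * u) for w, u in zip(weights, utilities)]
    total = sum(new_weights)
    return [w / total for w in new_weights]       # renormalize to a probability distribution

def sample_action(weights):
    """Sample an action index with probability proportional to its weight."""
    return random.choices(range(len(weights)), weights=weights, k=1)[0]

# Typical use inside a PI response loop: after each play of G against the leader's policy,
# the follower evaluates the utility it would have obtained from each available action
# (e.g., each bid) and reweights; with a suitable eta this yields vanishing regret.
```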
We note that our framework can accommodate many behavior functions, as PI response algorithms
are quite general. This includes conventional MARL algorithms like multi-agent Q-learning, as well as
centralized algorithms such as the Multi-Agent Deep Deterministic Policy Gradient algorithm [Lowe et al., 2017], and even human-like behaviors derived using inverse RL [Arora and Doshi, 2021].
4 For ease of exposition, we only allow PI response algorithms to update their states at the end of gameplays. However, this aspect can be easily generalized to accommodate intra-game updates.