Stackelberg POMDP: A Reinforcement Learning Approach for
Economic Design
Gianluca Brero∗1, Alon Eden∗2, Darshan Chakrabarti3, Matthias Gerstgrasser4, Amy Greenwald5,
Vincent Li6, and David C. Parkes7
1Data Science Initiative, Brown University
2School of Computer Science and Engineering, Hebrew University of Jerusalem
3Department of Industrial Engineering and Operations Research, Columbia University
5Department of Computer Science, Brown University
4,6,7John A. Paulson School of Engineering and Applied Sciences, Harvard University
1,5{gianluca_brero,amy_greenwald}@brown.edu
2alon.eden@mail.huji.ac.il
3dc3595@columbia.edu
4,7{matthias,parkes}@g.harvard.edu
6vincentli@college.harvard.edu
July 22, 2024

∗These authors contributed equally to this work. This is the extended version of Brero et al. [2021a], which was presented at the ICML Workshop for Reinforcement Learning Theory (ICML 2021).
Abstract
We introduce a reinforcement learning framework for economic design where the interaction between the environment designer and the participants is modeled as a Stackelberg game. In this game, the designer (leader) sets up the rules of the economic system, while the participants (followers) respond strategically. We integrate algorithms for determining followers' response strategies into the leader's learning environment, providing a formulation of the leader's learning problem as a POMDP that we call the Stackelberg POMDP. We prove that the optimal leader's strategy in the Stackelberg game is the optimal policy in our Stackelberg POMDP under a limited set of possible policies, establishing a connection between solving POMDPs and Stackelberg games. We solve our POMDP under a limited set of policy options via the centralized training with decentralized execution framework. For the specific case of followers that are modeled as no-regret learners, we solve an array of increasingly complex settings, including problems of indirect mechanism design where there is turn-taking and limited communication by agents. We demonstrate the effectiveness of our training framework through ablation studies. We also give convergence results for no-regret learners to a Bayesian version of a coarse-correlated equilibrium, extending known results to correlated types.
1 Introduction
A digital transformation is turning markets into algorithmic platforms that make use of complex mechanisms
to facilitate the matching of supply and demand. Examples include Amazon’s marketplace, Uber, GrubHub,
and AirBnB. These market platforms allow for an unprecedented level of control over mechanisms for
matching and pricing. However, mechanism design—the engineering side of economic theory—does not
always provide enough guidance in regard to how to design these markets to achieve properties of interest,
such as economic efficiency or other specific objectives.
Mechanism design theory has mainly focused on direct revelation mechanisms, where participants directly report their preferences in an incentive-aligned way. While in theory this is without loss of generality due to the revelation principle [Laffont and Tirole, 1993], in practice most mechanisms deployed on market platforms are indirect: participants respond to product choices and prices, and do not communicate directly about their preferences. This emphasis on indirect mechanisms gives rise to the need to consider the behavior of agents that is induced by the rules of a mechanism: no longer can one simply look for direct mechanisms in which it is incentive compatible to make truthful reports of preferences. Rather, we need to ask, for example, how agents will make use of simpler messaging schemes and respond to an array of products and prices.
This problem of indirect mechanism design can be modeled as one of finding an optimal leader strategy
in a Stackelberg game. In this game, the strategy of the leader corresponds to a commitment to the rules of
the mechanism that is used for pricing and allocation, and the followers correspond to the market participants
who respond to the induced economic environment. We introduce a new methodology for finding optimal
leader strategies in complex Stackelberg games, demonstrating how it can be applied to economic scenarios
capturing many aspects of indirect mechanisms, including communication, multi-round interaction, and
uncertainty regarding participants’ types.
Stackelberg games have been widely studied in computer science, with applications to security [Sinha et al., 2018], wildlife conservation [Fang et al., 2016], and taxation policies [Zhou et al., 2019, Wang, 1999], among others. Much of the research has focused on solving these games in simple settings, such as those with a single follower [Conitzer and Sandholm, 2006, e.g.], complete-information settings where followers move simultaneously [Basilico et al., 2017, e.g.], or settings where the leader's strategies can be enumerated and optimized via bandit approaches [Bai et al., 2021, e.g.]. A more recent research thread [Zhang et al., 2023a,b] focuses on designing payment schemes that steer no-regret-learning agents to play desirable equilibria in extensive-form games.
Our work differs by considering multi-round games with incomplete information, where the leader’s
interventions involve not only monetary compensation but also soliciting bids and allocating items.1 We
approach this by treating these interventions as policies and refining them using reinforcement learning
techniques. Furthermore, we don’t restrict the followers to just adapting via no-regret algorithms in response
to the leader’s strategy. Our framework is broader, allowing for various follower behavior models, including
those emulating human behavior through inverse reinforcement learning [Arora and Doshi, 2021].

1 In one of our scenarios, the leader's task involves generating a game that allows for communication between a mechanism and agents, where the semantics of this communication arises from the strategic response of followers, and where the order in which agents participate in the market and the prices they see depend, through the rules of the mechanism, on this communication.
Formally, we model our settings as Stackelberg Partially Observable Markov Games, where the leader
commits to a strategy, and the followers respond to this commitment. The partial observability in these
games arises from the followers' private types. To our knowledge, this work is the first to provide a framework that supports a proof of convergence to a Stackelberg policy in Partially Observable Markov Games (POMGs).2 We achieve this result through the suitable construction of a single-agent leader learning problem that integrates the behavior of the followers, as they learn to adapt to the leader's strategy, into the leader's learning environment.

2 Zhong et al. [2021] support convergence to an optimal leader strategy in POMGs, but they restrict their followers to be myopic.
The learning method that we develop for solving for the Stackelberg leader’s policy in POMGs works
for any model of followers in which they learn to respond to the leader's policy by interacting with it across trajectories of states sampled from the POMG. We refer to these follower algorithms
as policy-interactive learning algorithms. Our insight is that since the followers’ policy-interactive learning
algorithms adapt to a leader’s strategy by taking actions in game states, we can incorporate these algorithms
into a carefully constructed POMDP representing the leader’s learning problem. In this POMDP, the policy’s
actions determine the followers’ response (indirectly through the interaction of the leader’s policy with
the followers’ learning algorithms) and in turn the reward to the leader. This idea allows us to establish a
connection between solving POMDPs and solving Stackelberg games: we prove that the optimal policy in
our POMDP under a limited set of policy options is the optimal Stackelberg leader strategy for the given
model of follower behavior. We handle learning under limited policy options via a novel use of actor-critic
algorithms based on centralized training and decentralized execution [see, e.g., Lowe et al., 2017].
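To make this construction concrete, the following minimal Python sketch shows one way to picture a single episode of the leader's learning environment, with the followers' policy-interactive learning running inside it. All names here (stackelberg_episode, play_game, follower_update) are our own illustration, not the paper's API, and the actor-critic machinery used for centralized training with decentralized execution is omitted.

```python
from typing import Callable, List

# Schematic sketch (our own naming) of the leader's learning environment:
# the followers' policy-interactive learner runs *inside* the leader's episode,
# so the leader's actions shape the followers' response and, in turn, the
# leader's end-of-episode reward.
def stackelberg_episode(play_game: Callable,      # plays G once: (leader_policy, follower_policies) -> (outcome, leader_return)
                        leader_policy: Callable,  # pi_ell, queried for actions in every play of G
                        follower_policies: List,  # current follower policies (the PI algorithm's state)
                        follower_update: Callable,  # PI response update: (follower_policies, outcome) -> new policies
                        n_learning_plays: int) -> float:
    # Learning phase: followers adapt to the committed leader policy by
    # repeatedly playing G against it (the PI response algorithms of Section 3).
    for _ in range(n_learning_plays):
        outcome, _ = play_game(leader_policy, follower_policies)
        follower_policies = follower_update(follower_policies, outcome)
    # Reward phase: the leader is evaluated against the followers' learned response.
    _, leader_return = play_game(leader_policy, follower_policies)
    return leader_return
```

Under this sketch, the leader's reward only arrives after the followers have adapted to the committed policy, which is what ties the return of the leader's single-agent problem to the Stackelberg objective.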
With this general result in place, we give algorithmic results for a model of followers as no-regret
learners, and demonstrate the robustness and flexibility of the resulting Stackelberg POMDP by evaluating
it across an array of increasingly complex settings.
Further related work. The Stackelberg POMDP framework introduced in the present paper has been further explored in subsequent studies. In Brero et al. [2022], Stackelberg POMDP is utilized for discovering rules for economic platforms that mitigate collusive behavior among pricing agents employing Q-learning algorithms. Gerstgrasser and Parkes [2023] have further generalized Stackelberg POMDP to accommodate domain-specific queries for faster computation of followers' responses via meta-learning. However, their work only studied settings with complete information, where followers lack private types. This limitation allowed them to achieve good results without using the centralized training with decentralized execution framework we discuss in this work, as detailed in our experiments section.
2 Preliminaries
We consider a multi-round Stackelberg game with n + 1 agents: a leader (ℓ) and n followers (indexed 1, . . . , n). The leader gains a strategic advantage by committing to an initial strategy, influencing the subsequent choices made by the followers. This early commitment allows the leader to anticipate the followers' reactions to their chosen strategy.
We formulate our Stackelberg game through the formalism of a partially observable Markov game [Hansen et al., 2004] between the leader and the followers:
Definition 2.1 (Partially Observable Markov game). A partially observable Markov game (POMG) G with k agents (indexed 1, . . . , k) is defined by the following components:
• A set of states, S, with each state s ∈ S containing the necessary information on game history to predict the game's progression based on agents' actions.
• A set of actions, A_i, for each agent i ∈ [k]. The action profile is denoted by a = (a_1, . . . , a_k) ∈ A, where A = A_1 × · · · × A_k.
• A set of observations, O_i, for each agent i ∈ [k]. The agents' observation profile is denoted o = (o_1, . . . , o_k) ∈ O, where O = O_1 × · · · × O_k.
• A state transition function, Tr : S × A × S → [0, 1], which defines the probability of transitioning from one state to another given the current state and action profile: Tr(s, a, s′) = Pr(s′ | s, a).
• An observation generation rule, O : S × A × O → [0, 1], which defines the probability distribution over observations given the current state and action profile: O(s, a, o) = Pr(o | s, a).
• A reward function, R_i : S × A → ℝ, which maps, for each agent i ∈ [k], the current state and action profile to the immediate reward received by agent i, R_i(s, a).
• A finite time horizon, T.
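As a concrete illustration of these components, here is a minimal Python sketch of a POMG interface together with a single-episode rollout; the field and function names are ours, not part of the paper's formalism.

```python
from dataclasses import dataclass
from typing import Callable, Sequence, Tuple

# Illustrative container for the components of Definition 2.1 (all names are ours).
@dataclass
class POMG:
    n_agents: int                                  # k
    horizon: int                                   # T
    initial: Callable[[], Tuple[object, Tuple]]    # draws s_0 and the initial observations (o_{1,0}, ..., o_{k,0})
    transition: Callable[[object, Tuple], object]  # samples s' ~ Tr(s, a, .)
    observe: Callable[[object, Tuple], Tuple]      # samples o ~ O(s, a, .)
    reward: Callable[[int, object, Tuple], float]  # R_i(s, a)

def rollout(game: POMG, policies: Sequence[Callable]):
    """Play one episode; policies[i] maps agent i's observation history to an action (pi_i : H_i -> A_i)."""
    s, o = game.initial()
    histories = [[o[i]] for i in range(game.n_agents)]
    returns = [0.0] * game.n_agents
    trajectory = []
    for _ in range(game.horizon):
        a = tuple(policies[i](tuple(histories[i])) for i in range(game.n_agents))
        for i in range(game.n_agents):
            returns[i] += game.reward(i, s, a)
        trajectory.append((s, a))
        o = game.observe(s, a)
        for i in range(game.n_agents):
            histories[i].append(o[i])
        s = game.transition(s, a)
    return trajectory, returns
```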
Each agent i aims to maximize their expected total reward over T steps. Going forward, we use a subscript t with every POMG variable to refer to its value at the t-th step of the game. The history of agent i's observations at step t is denoted by h_{i,t} = (o_{i,0}, . . . , o_{i,t}), while h_i represents a generic history with unspecified time step and H_i is the set of these histories. Agent i's behavior in the game is characterized by their policy, π_i : H_i → A_i, where π_i(h_i) is the policy action when observing history h_i.3 We let Π_i be the set of these policies.

3 We only consider deterministic policies for simplicity, but our results extend to randomized policies.
Given a policy profile π_I for a subset I ⊆ [k] of agents in G, we can define a subgame in G, denoted G_{π_I}, for the remaining agents in [k] \ I:
Definition 2.2 (Subgame of a POMG (Informal Definition, Formal in Appendix)). Given a POMG G, a subset I of the agents, and a policy profile π_I for agents in I, we define the subgame POMG, G_{π_I}, as the game induced among agents [k] \ I by policy profile π_I. Actions, rewards, and observations remain the same as in G, while states are augmented to include policies π_I and observation histories h_I. State transitions in G_{π_I} are induced by the state transition function in G, with the actions of the agents in I in G computed "internally" by the game G_{π_I} by applying policies π_I on histories h_I.
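In code, the subgame can be pictured as a wrapper that plays the fixed policies π_I internally, reusing the illustrative POMG interface sketched after Definition 2.1; again, the class and method names are our own.

```python
# Sketch of Definition 2.2 (names are ours): fixing policies pi_I for a subset I of
# agents turns G into a game G_{pi_I} for the remaining agents in [k] \ I.
class Subgame:
    def __init__(self, game, fixed_policies):
        self.game = game
        self.fixed = fixed_policies                 # dict {i: pi_i} for agents i in I
        self.free = [i for i in range(game.n_agents) if i not in fixed_policies]

    def rollout(self, free_policies):
        """free_policies: dict {j: pi_j} for the remaining agents j in [k] \\ I."""
        s, o = self.game.initial()
        histories = [[o[i]] for i in range(self.game.n_agents)]
        returns = {j: 0.0 for j in self.free}
        for _ in range(self.game.horizon):
            # Agents in I act "internally" by applying pi_I to their own histories h_I,
            # while the remaining agents act through the supplied policies.
            a = tuple((self.fixed.get(i) or free_policies[i])(tuple(histories[i]))
                      for i in range(self.game.n_agents))
            for j in self.free:
                returns[j] += self.game.reward(j, s, a)
            o = self.game.observe(s, a)
            for i in range(self.game.n_agents):
                histories[i].append(o[i])
            s = self.game.transition(s, a)
        return returns
```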
We can now formalize our Stackelberg game as follows:
Definition 2.3 (Stackelberg Partially Observable Markov Game). Let G be a partially observable Markov game (POMG) involving a leader and n followers. We define the Stackelberg POMG SG based on G as follows:
• Leader's Commitment: The leader selects a policy π_ℓ within G.
• Induced Subgame: The followers play the subgame POMG G_{π_ℓ}, induced by the leader's committed policy. This results in a state-action trajectory τ = (s_0, a_0, . . . , s_{T-1}, a_{T-1}) within G.
• Leader's Reward: The leader receives a reward R_ℓ(τ) = Σ_{t=0}^{T-1} R_ℓ(s_t, a_t) based on the trajectory τ.
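Putting the pieces together, and storing the leader at index 0 as a purely illustrative convention, the three steps of Definition 2.3 reduce to a short routine built on the rollout() helper sketched after Definition 2.1:

```python
LEADER = 0  # illustrative convention: the leader occupies index 0 of the policy profile

def leader_return(game, leader_policy, follower_policies):
    """Commit to pi_ell, let the followers play the induced subgame, and return the
    leader's total reward R_ell(tau) along the resulting state-action trajectory."""
    policies = [leader_policy] + list(follower_policies)
    _trajectory, returns = rollout(game, policies)  # rollout() from the sketch after Definition 2.1
    return returns[LEADER]
```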
Given our focus on economic design, in the remainder of this paper we consider Bayesian settings that model standard mechanism design problems. In these settings, when the game begins each follower i privately observes their type θ_i (i.e., o_{i,0} = θ_i), which captures their preferences for the mechanism's outcomes. The leader, in contrast, begins with an empty initial observation (i.e., o_{ℓ,0} = 0). We use θ = (θ_1, . . . , θ_n) for the follower type profile and assume that type profiles are drawn from a (possibly correlated) distribution D.
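In code terms, the Bayesian structure only changes how an episode is initialized: a type profile is drawn from D, each follower's first observation is its own type, and the leader starts with an empty observation. A hypothetical sketch (our names, leader at index 0 as above):

```python
import random

def make_bayesian_initial(draw_type_profile, initial_state):
    """Build an initial() function for the POMG sketch: theta ~ D becomes the followers'
    first observations (o_{i,0} = theta_i), while the leader's first observation is empty."""
    def initial():
        theta = draw_type_profile()              # theta = (theta_1, ..., theta_n), possibly correlated
        s0 = initial_state(theta)                # the state can carry the realized type profile
        observations = (None,) + tuple(theta)    # leader observes nothing; follower i observes theta_i
        return s0, observations
    return initial

# Hypothetical correlated type distribution D for two followers: both types share a common shock.
def draw_two_correlated_types():
    common = random.random()
    return (common + 0.1 * random.random(), common + 0.1 * random.random())
```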
3 The Leader’s Problem
The leader in a Stackelberg game optimizes their reward by anticipating how the followers will respond to
their strategy. In general-sum partially observable Markov games (POMGs), followers can exhibit a wide
range of response behaviors. This includes both equilibrium strategies (of which there may be many) and
non-equilibrium behaviors, such as the seemingly collusive ones isolated by Calvano et al. [2020] and Abada
and Lambin [2023]. Our framework accommodates all the possible behaviors that can be learned by the
followers within the standard joint environment, i.e., by repeatedly playing the POMG G against the leader.
To formalize this process, we introduce the notion of policy-interactive response algorithms, which we
define as follows:
Definition 3.1 (Policy-Interactive Response Algorithm). Given a Stackelberg POMG, SG, and a leader policy π_ℓ in SG, we define a policy-interactive (PI) response algorithm as an algorithm that determines a behavior function for the followers in the subgame G_{π_ℓ} by repeatedly playing game G against π_ℓ. A PI response algorithm consists of the following elements:
• A set of states, S_b, where each state s_b = (π_{-ℓ}, s_b^+) ∈ S_b comprises two parts: a policy profile π_{-ℓ} representing the followers' strategies in G, and an additional information state s_b^+ utilized for state transitions.
• A state transition function, Tr_b(s_b, O, s_b′) = Pr(s_b′ | s_b, O), that defines the probability of transitioning from one state to another. The transition is based on the current state s_b and the outcome O of game G played using the follower policies in s_b and the leader's actions queried from π_ℓ.4
• A finite time horizon E, corresponding to the total number of game plays run by the algorithm.
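Schematically, a PI response algorithm is the following loop over E plays of G against the committed leader policy; the state s_b pairs the followers' current policies with the auxiliary information s_b^+, and the update rule Tr_b is left abstract here (it is later instantiated with multiplicative weights). The function names are our own.

```python
def pi_response(play_game,         # plays G once: (pi_ell, follower_policies) -> outcome O
                pi_ell,            # the leader's committed policy
                initial_policies,  # followers' initial policy profile
                initial_aux,       # initial auxiliary information s_b^+ (e.g., learners' internal statistics)
                transition_b,      # Tr_b: ((policies, aux), outcome) -> (new_policies, new_aux)
                n_plays):          # horizon E
    """Run a policy-interactive response algorithm and return the followers' response policies."""
    state = (initial_policies, initial_aux)      # s_b = (pi_{-ell}, s_b^+)
    for _ in range(n_plays):
        policies, _aux = state
        outcome = play_game(pi_ell, policies)    # one play of G against the committed pi_ell
        state = transition_b(state, outcome)     # one step of Tr_b, driven by the observed outcome
    # The final state s_{b,E} contains the followers' response profile pi*_{-ell};
    # viewed as a function of pi_ell, this loop computes the behavior function B(pi_ell).
    return state[0]
```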
As previously done for POMGs, we use subscript e with every variable of our PI algorithm to refer to its value at the e-th game play. Consequently, the final state s_{b,E} of our PI algorithm includes the followers' response policy profile π*_{-ℓ}, where π*_{-ℓ} = (π*_1, . . . , π*_n). To express the leader's problem, we denote the followers' responses using a behavior function B(·). This function takes the leader's strategy π_ℓ as input and returns followers' response strategies B(π_ℓ) = π*_{-ℓ}. We let

(τ_ℓ, τ*_{-ℓ}) = (s_0, (a_{ℓ,0}, a*_{-ℓ,0}), . . . , s_{T-1}, (a_{ℓ,T-1}, a*_{-ℓ,T-1}))

be a state-action trajectory in G generated by π_ℓ and B(π_ℓ), and Pr(· | π_ℓ, B) be the distribution on trajectories. We formulate the leader's problem:
Definition 3.2 (Optimal leader strategy). Given a Stackelberg POMG, SG, and a behavior function B(·) for the followers in SG, the leader's problem is defined as finding a policy that maximizes their expected reward under B(·):

π*_ℓ ∈ arg max_{π_ℓ ∈ Π_ℓ}  E_{(τ_ℓ, τ*_{-ℓ}) ∼ Pr(· | π_ℓ, B)} [ Σ_{t=0}^{T-1} R_ℓ(s_t, (a_{ℓ,t}, a*_{-ℓ,t})) ].
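The expectation in Definition 3.2 can be estimated by Monte Carlo rollouts once the followers' response is computed; a hypothetical sketch, reusing leader_return() and a behavior function such as the pi_response() loop above:

```python
def estimate_leader_value(game, leader_policy, behavior_fn, n_samples):
    """Monte Carlo estimate of the leader's expected return under the behavior function B(.)."""
    response = behavior_fn(leader_policy)      # B(pi_ell) = pi*_{-ell}
    total = 0.0
    for _ in range(n_samples):
        # Each rollout draws fresh types via game.initial() and plays pi_ell against B(pi_ell).
        total += leader_return(game, leader_policy, response)
    return total / n_samples

# The leader's problem is then to choose pi_ell, within the allowed policy class, that
# maximizes this value; in the paper, this outer search is carried out by reinforcement
# learning on the Stackelberg POMDP rather than by enumeration.
```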
In the remainder of this paper we assume that our behavior function B returns Bayesian Coarse-Correlated Equilibria (BCCE), which we define in the Appendix. In the Appendix, we also prove that this behavior function can be derived when followers use no-regret learning algorithms, extending a result proven by Hartline et al. [2015] to the case of correlated types. We note that BCCEs are equivalent to Coarse-Correlated Equilibria (CCE) when the followers only have one type, i.e., the setting is effectively non-Bayesian. We then instantiate the policy-interactive response algorithm to the multiplicative-weights algorithm, which is a specific no-regret learning algorithm. In our Appendix, we show how to formulate the multiplicative-weights algorithm as a PI response algorithm.
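As a concrete no-regret instance, the following is the textbook multiplicative-weights update for a single follower with a finite action set, written with our own function names; the appendix's PI formulation wraps this kind of update into the loop sketched after Definition 3.1, using each play's (counterfactual) utilities as feedback.

```python
import math
import random

def multiplicative_weights_update(weights, utilities, eta):
    """One multiplicative-weights step: reweight each action exponentially in its utility."""
    new_weights = [w * math.exp(eta * u) for w, u in zip(weights, utilities)]
    total = sum(new_weights)
    return [w / total for w in new_weights]       # renormalize to a probability distribution

def sample_action(weights):
    """Sample an action index with probability proportional to its weight."""
    return random.choices(range(len(weights)), weights=weights, k=1)[0]

# Typical use inside a PI response loop: after each play of G against the leader's policy,
# the follower evaluates the utility it would have obtained from each available action
# (e.g., each bid) and reweights; with a suitable eta this yields vanishing regret.
```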
We note that our framework can accommodate many behavior functions, as PI response algorithms
are quite general. This includes conventional MARL algorithms like multi-agent Q-learning, as well as
centralized algorithms such as the Multi-Agent Deep Deterministic Policy Gradient algorithm [Lowe et al., 2017], and even human-like behaviors derived using inverse RL [Arora and Doshi, 2021].
4 For ease of exposition, we only allow PI response algorithms to update their states at the end of gameplays. However, this aspect can be easily generalized to accommodate intra-game updates.