
matching and pricing. However, mechanism design—the engineering side of economic theory—does not
always provide enough guidance on how to design these markets to achieve properties of interest,
such as economic efficiency or other specific objectives.
Mechanism design theory has mainly focused on direct revelation mechanisms, where participants report
their preferences directly and in an incentive-aligned way. While in theory this is without loss of gen-
erality due to the revelation principle [Laffont and Tirole, 1993], in practice most mechanisms deployed on
market platforms are indirect—participants respond to product choices and prices, and do not communicate
directly about their preferences. This emphasis on indirect mechanisms gives rise to the need to consider the
behavior of agents that is induced by the rules of a mechanism: no longer can one simply look for direct
mechanisms in which it is incentive compatible to make truthful reports of preferences. Rather, we need
to ask, for example, how agents will make use of simpler messaging schemes and respond to an array of
products and prices.
This problem of indirect mechanism design can be modeled as one of finding an optimal leader strategy
in a Stackelberg game. In this game, the strategy of the leader corresponds to a commitment to the rules of
the mechanism that is used for pricing and allocation, and the followers correspond to the market participants
who respond to the induced economic environment. We introduce a new methodology for finding optimal
leader strategies in complex Stackelberg games, demonstrating how it can be applied to economic scenarios
capturing many aspects of indirect mechanisms, including communication, multi-round interaction, and
uncertainty regarding participants’ types.
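To make the Stackelberg structure concrete, the following toy sketch in Python searches over leader commitments in a small bimatrix game, where the follower best-responds to the committed mixed strategy and the leader's value is evaluated at that response. The payoff matrices are made up for illustration and carry none of the mechanism-design structure studied in this paper.

```python
import numpy as np

# Toy illustration of the Stackelberg structure (hypothetical payoffs):
# rows = leader action, cols = follower action.
L = np.array([[2.0, 0.0],
              [3.0, 1.0]])   # leader payoffs
F = np.array([[1.0, 0.0],
              [0.0, 2.0]])   # follower payoffs

best_value, best_p = -np.inf, None
for p in np.linspace(0.0, 1.0, 101):        # prob. of leader action 0
    mix = np.array([p, 1.0 - p])            # leader's committed mixed strategy
    follower_payoffs = mix @ F              # follower's expected payoff per action
    br = int(np.argmax(follower_payoffs))   # follower best response to the commitment
    value = mix @ L[:, br]                  # leader's expected payoff at that response
    if value > best_value:
        best_value, best_p = value, p

print(f"optimal commitment p={best_p:.2f}, leader value={best_value:.2f}")
```

The same logic, enumerate or optimize over leader commitments and evaluate each against the followers' induced response, is what becomes nontrivial once the "commitment" is the full rule set of an indirect mechanism.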
Stackelberg games have been widely studied in computer science, with applications to security [Sinha
et al., 2018], wildlife conservation [Fang et al., 2016], and taxation policies [Zhou et al., 2019, Wang, 1999],
among others. Much of the research has focused on solving these games in simple settings, such as those with
a single follower [Conitzer and Sandholm, 2006, e.g.], complete information settings where followers move
simultaneously [Basilico et al., 2017, e.g.], or settings where the leader’s strategies can be enumerated and
optimized via bandit approaches [Bai et al., 2021, e.g.]. A more recent research thread [Zhang et al., 2023a,b]
focused on designing payment schemes that steer no-regret-learning agents to play desirable equilibria in
extensive-form games.
Our work differs by considering multi-round games with incomplete information, where the leader’s
interventions involve not only monetary compensation but also soliciting bids and allocating items.1 We
approach this by treating these interventions as policies and refining them using reinforcement learning
techniques. Furthermore, we do not restrict the followers to adapt only via no-regret algorithms in response
to the leader’s strategy. Our framework is broader, allowing for various follower behavior models, including
those emulating human behavior through inverse reinforcement learning [Arora and Doshi, 2021].
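As one concrete instance of a follower behavior model (a no-regret learner, which is only one of the models the framework admits), the sketch below applies a multiplicative-weights (Hedge) update over a follower's actions given the per-round payoffs induced by a fixed leader policy. The payoff inputs and learning rate are illustrative assumptions, not quantities specified in this paper.

```python
import numpy as np

def hedge_follower(payoffs_per_round, eta=0.1):
    """Illustrative no-regret follower: multiplicative-weights over the
    follower's actions, given per-round payoff vectors induced by the
    leader's fixed policy (both are assumptions for this sketch)."""
    n_actions = len(payoffs_per_round[0])
    weights = np.ones(n_actions)
    plays = []
    for payoffs in payoffs_per_round:
        probs = weights / weights.sum()
        plays.append(probs)                           # mixed action this round
        weights *= np.exp(eta * np.asarray(payoffs))  # reward-maximizing update
        weights /= weights.max()                      # keep weights numerically stable
    return plays
```

An inverse-reinforcement-learning follower model would slot into the same place: any procedure that maps interaction with the leader's policy to follower behavior can be used.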
Formally, we model our settings as Stackelberg Partially Observable Markov Games, where the leader
commits to a strategy, and the followers respond to this commitment. The partial observability in these
games arises from the followers’ private types. To our knowledge, this work is the first to provide a frame-
work that supports a proof of convergence to a Stackelberg policy in Partially Observable Markov Games
(POMGs).2 We achieve this result through a suitable construction of a single-agent leader learning problem
that integrates the behavior of the followers, as they learn to adapt to the leader’s strategy, into the leader’s
learning environment.
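As a rough, purely illustrative sketch of how such a construction might look with a Gym-style interface (all class, method, and attribute names below are hypothetical and not taken from this paper), the followers' adaptation is executed inside the environment's step, so that a standard single-agent RL algorithm can be run for the leader:

```python
class LeaderEnv:
    """Illustrative sketch (hypothetical interfaces): a single-agent RL
    environment for the leader in which the followers' adaptation to the
    leader's policy is folded into the environment dynamics."""

    def __init__(self, pomg, follower_learners):
        self.pomg = pomg                    # underlying POMG, assumed interface
        self.followers = follower_learners  # learners that adapt to the leader's policy
        self.state = None

    def reset(self):
        # Sampling the initial state also samples the followers' private types,
        # the source of partial observability for the leader.
        self.state = self.pomg.sample_initial_state()
        return self.pomg.leader_observation(self.state)

    def step(self, leader_action):
        # Followers act from their current, still-adapting policies.
        follower_actions = [f.act(self.pomg.follower_observation(self.state, i))
                            for i, f in enumerate(self.followers)]
        next_state, leader_reward, follower_rewards = self.pomg.transition(
            self.state, leader_action, follower_actions)
        # The followers' learning updates happen inside the environment step,
        # so the leader's single-agent problem already accounts for how the
        # followers adapt to its committed policy.
        for i, f in enumerate(self.followers):
            f.update(self.pomg.follower_observation(next_state, i), follower_rewards[i])
        self.state = next_state
        done = self.pomg.is_terminal(next_state)
        return self.pomg.leader_observation(next_state), leader_reward, done, {}
```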
The learning method that we develop for solving for the Stackelberg leader’s policy in POMGs works
for any model of followers in which they learn to respond to the leader’s policy by interacting with it
across trajectories of states sampled from the POMG. We refer to these follower algorithms
as policy-interactive learning algorithms. Our insight is that since the followers’ policy-interactive learning
1 In one of our scenarios, the leader’s task involves generating a game that allows for communication between a mechanism and
agents, where the semantics of this communication arises from the strategic response of followers, and where the order in which
agents participate in the market and the prices they see depend, through the rules of the mechanism, on this communication.
2 Zhong et al. [2021] support convergence to an optimal leader strategy in POMGs, but they restrict their followers to be myopic.