Inverse Game Theory for Stackelberg Games:
the Blessing of Bounded Rationality
Jibang Wu
Department of Computer Science
University of Chicago
wujibang@uchicago.edu
Weiran Shen
Gaoling School of Artificial Intelligence
Renmin University of China
shenweiran@ruc.edu.cn
Fei Fang
Institute for Software Research
Carnegie Mellon University
feif@cmu.edu
Haifeng Xu
Department of Computer Science
University of Chicago
haifengxu@uchicago.edu
Abstract
Optimizing strategic decisions (a.k.a. computing equilibrium) is key to the success
of many non-cooperative multi-agent applications. However, in many real-world
situations, we may face the exact opposite of this game-theoretic problem — instead
of prescribing equilibrium of a given game, we may directly observe the agents’
equilibrium behaviors but want to infer the underlying parameters of an unknown
game. This research question, also known as inverse game theory, has been studied
in multiple recent works in the context of Stackelberg games. Unfortunately,
existing works exhibit quite negative results, showing statistical hardness [27, 37] and computational hardness [24, 25, 26] under the assumption of a perfectly rational follower. Our work relaxes the perfect rationality assumption to the classic quantal response model, a more realistic model of bounded rationality. Interestingly, we show that the smoothness brought by such a bounded rationality model actually leads to provably more efficient learning of the follower's utility parameters in general Stackelberg games. Systematic empirical experiments on synthesized games confirm our theoretical results and further suggest their robustness beyond the strict quantal response model.
1 Introduction
One primary objective of game theory is to predict the behaviors of agents through equilibrium
concepts in a given game. In practice, however, we may observe some equilibrium behaviors of
agents, but the game itself turns out to be unknown. For example, an online shopping platform
can observe the shoppers' purchase decisions at different sale prices, but the platform has limited
knowledge of the exact utilities of the shoppers. Similarly, while the policymaker could observe
the market reactions to its policy announcement, the exact motives behind traders’ reactions are
usually unclear. In various security domains, the defender may want to understand the intentions or
incentives of the attackers from their responses to different defense strategies so as to improve her
future defense strategy. As such, recovering the underlying game parameters would not only lead us to better strategic decisions, but also help explicate motives and rationales that would otherwise remain in the dark.
These potentials and prospects motivate a class of research problems known as inverse game theory [26]: given the agents' equilibrium behaviors, what are the possible utilities that induce these behaviors? In this paper, we specifically target the sequential game setting from the perspective of the
first-moving agent (e.g., Internet platform, policymaker, or defender) whose different strategies (e.g.,
price, regulation, or defense scheme) would induce different equilibrium behaviors of the following
agent (e.g., Internet users, traders, or attacker). Studies of such game settings have seen broad impacts
and extensive applications, ranging from principal-agent problems in contract design [21, 19] and the AI Economist [43] to security games modeled for social good [16]. We formalize our problem under the normal-form Stackelberg game, where a leader has the power to commit to a randomized strategy and a follower accordingly decides his response. It is known that the optimal commitment of the leader can be efficiently computed in a single linear program, given full knowledge of the game [14]. However, the inverse learning problem of determining the underlying game from the follower's
responses is more challenging: Letchford et al. [27] and Peng et al. [37] show that learning the optimal leader strategy from the follower's best responses requires a number of samples that is a high-degree polynomial in the game size and may be exponential in the worst case. This significantly limits the practicality of these algorithms, as the leader usually cannot afford the time or cost to gather feedback from so many interactions.
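For context on the forward problem, here is a minimal sketch (our illustration, not necessarily the exact formulation the paper cites as [14], and the function name is ours) of the classic multiple-LP approach: one small LP per candidate follower best response, keeping the best.

```python
# Sketch: optimal leader commitment against a perfectly rational follower,
# given full knowledge of the game. One LP is solved per candidate follower
# best response j; the best objective value wins.
import numpy as np
from scipy.optimize import linprog

def optimal_commitment(U, V):
    """U, V: (m x n) leader / follower utility matrices."""
    m, n = U.shape
    best_value, best_x = -np.inf, None
    for j in range(n):
        # Maximize x @ U[:, j]  <=>  minimize -x @ U[:, j] over the simplex,
        # subject to j being a follower best response to x.
        c = -U[:, j]
        if n > 1:
            # Incentive constraints: x @ V[:, k] <= x @ V[:, j] for all k != j.
            A_ub = np.array([V[:, k] - V[:, j] for k in range(n) if k != j])
            b_ub = np.zeros(n - 1)
        else:
            A_ub, b_ub = None, None
        res = linprog(c, A_ub=A_ub, b_ub=b_ub,
                      A_eq=np.ones((1, m)), b_eq=np.array([1.0]),
                      bounds=[(0, 1)] * m)
        if res.success and -res.fun > best_value:
            best_value, best_x = -res.fun, res.x
    return best_x, best_value
```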
More concerning is the inconvenient reality that we can hardly expect the agents’ optimal equilibrium
responses assumed in existing work. In fact, these shoppers, traders, or attackers themselves hardly
know their exact utilities and are naturally unable to determine the expected-utility-maximizing strategy. Extensive studies in behavioral economics and psychology [23, 5, 32, 11, 10] have pinpointed the cognitive limitations that make human decisions prone to noisy perception of their utilities. Among the various models for quantifying irrational agent behavior, perhaps the most popular is the quantal response (QR) model [32], which adopts the well-known logit choice model to capture agents' probabilistic selection of actions. This is also the bounded rationality model we focus on in this paper.
The Blessing of Bounded Rationality.
The key insight revealed by this paper is that the extra layer of behavioral complexity due to bounded rationality, while complicating the modeling and computation, provides a more informative source from which to learn the underlying utility of agents. To understand the intuition and motivation behind our results, consider a case where the follower has a dominated action $j_1$, as shown in Table 1, where the leader's and follower's utilities of action profile $(i, j)$ are specified by $u_{i,j}$ and $v_{i,j}$ respectively. Conventionally, such an instance is treated as a degenerate instance, because the leader could ignore the action $j_1$ that a perfectly rational follower would never play. Then, the optimal leader strategy is clearly to always play the action $i_2$.

$u_{i,j}, v_{i,j}$     $j_1$          $j_2$
$i_1$                  100, 0.9       0.9, 1
$i_2$                  -99, 0.9       1.1, 1

Table 1: An example of a dangerously "degenerate" Stackelberg game.

However, when facing a
boundedly rational follower, it becomes possible to observe the response $j_1$ and estimate the utilities associated with this dominated action. For example, if the follower plays his actions $j_1$ and $j_2$ at almost the same frequency, the follower's expected utilities on the two actions should be close. Although such a dominated action has no effect on the leader's optimal strategy against a perfectly rational follower, it could be a potentially damaging (or beneficial) action that the leader wants to steer a boundedly rational follower away from (or toward). That is, in the above game instance, if a somewhat irrational follower plays action $j_1$, it would be dangerous for the leader to play action $i_2$ yet rewarding to play action $i_1$; therefore, a more robust leader strategy should randomize by assigning some probability to playing action $i_1$. We remark that in general, even without such an extreme case of dominated actions, extra payoff information about how much worse (or better) each action is becomes available through the empirical frequencies of the boundedly rational responses (as long as some smoothness properties are exhibited); such information is overlooked under the assumption of perfectly rational followers.
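To make this concrete, the following sketch (our illustration, not code from the paper) plugs the follower payoffs of Table 1 into the QR model defined in Section 3: the dominated action $j_1$ is played with non-negligible probability, and the observed log-odds exactly recover the follower's utility gap between $j_1$ and $j_2$. The value $\lambda = 7.6$ is one of the empirical estimates quoted later in the paper.

```python
import numpy as np

lam = 7.6                      # bounded rationality constant (empirical value cited in Section 3)
x = np.array([0.5, 0.5])       # any leader mixed strategy over {i1, i2}
V = np.array([[0.9, 1.0],      # follower utilities from Table 1, rows = i1, i2
              [0.9, 1.0]])     #                                  cols = j1, j2

scores = lam * (x @ V)                      # lambda * x^T V_j for each follower action j
y = np.exp(scores - scores.max())           # logit choice (softmax), shifted for stability
y /= y.sum()
print(y)                                    # ~[0.32, 0.68]: j1 is far from "never played"

# The blessing: the observed log-odds reveal the utility difference exactly.
gap = (np.log(y[1]) - np.log(y[0])) / lam   # equals x^T (V_j2 - V_j1)
print(round(gap, 3))                        # 0.1 = 1.0 - 0.9
```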
Our Results. We present a set of tight analyses of the number of strategies and the sample complexity that are sufficient and necessary to learn the follower's utility, both when the leader can observe the follower's full mixed strategies and when she can only observe the follower's sampled pure strategies. In the former situation of observing the follower's mixed strategies, our algorithm can recover the follower's utility parameters using $m$ follower mixed-strategy responses in any general Stackelberg game, where $m$ is the number of leader actions. Surprisingly, the required number of queries is independent of the number of follower actions! This is because the randomness introduced by bounded rationality carries much more information about the follower's payoffs than the perfect best response does. In the latter
(more realistic) situation of only observing the follower's sampled pure strategies, our algorithm learns the follower's utility parameters within precision $\epsilon$ with probability at least $1 - \delta$ using $\Theta\big(m \log(mn/\delta)/\rho^2\big)$ carefully chosen queries, where $n$ is the number of follower actions and $\rho$ depends on the agent's bounded rationality level and is of order $\Theta(1/n)$ for typical boundedly rational agents. Interestingly, the additional challenge of observing only sampled actions deteriorates the sample complexity merely by a factor of $\log(mn)$ (the logarithmic dependence on $\delta$ arises from a concentration bound and is natural when the observations, i.e., the observed follower actions, are random).
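For a rough sense of the overall scaling (our back-of-the-envelope reading of the bound above, not an additional result), plugging the typical $\rho = \Theta(1/n)$ into the query bound gives

$$\Theta\!\left(\frac{m \log(mn/\delta)}{\rho^{2}}\right) \;=\; \Theta\!\left(m\, n^{2} \log\frac{mn}{\delta}\right),$$

i.e., a number of queries polynomial in the sizes of both players' action sets.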
These sample complexity results should be compared with those of [37, 27], which study similar learning questions but from perfectly rational follower responses. The $m\log(mn)$ order in our sample complexity is in sharp contrast to their complexity, which has exponential dependence on $m$ or $n$ in the worst case. Our experimental results empirically confirm the tightness of our sample complexity analysis.
At the conceptual level, our work illustrates that the noise due to boundedly rational behavior can be leveraged as an additional source of information for learning the follower's utility. This intuition also drives the design of our analytical tools, which explain how efficient and effective learning of the follower's utility is possible in realistic situations, complementing the previous negative results developed under idealized perfectly rational behavior models [27, 37].
2 Related Work
Learning in Stackelberg Games.
The learning problem in sequential games has been studied in
several different setups. Marecki et al. [30] and Balcan et al. [7] consider the online learning problem in Stackelberg security games with adversarially chosen follower types. Bai et al. [6] consider a bandit learning setting where one can query any entry of the follower's utility under noise and use the utility estimates to approximate the optimal leader strategy; however, this learning process assumes centralization, that is, the learner controls both the leader's and the follower's actions. More similar to ours is the strategic learning setup in Stackelberg games studied by [27, 37, 9], where the leader adaptively chooses her strategies based on observations of the follower's best responses and eventually recovers the follower's utility up to some precision level.
Bounded Rationality.
McKelvey and Palfrey [32] introduced the quantal response equilibrium (QRE) by adopting the logit choice model [15, 31]. QRE serves as a strict generalization of Nash equilibrium (NE): when the agents become perfectly rational, the QRE converges to the NE. The modeling success of the QR model is attributed to the nice mathematical and statistical properties of the logit function, which can capture a variety of boundedly rational behaviors under different values of the parameter $\lambda$. QRE is widely adopted, especially in Stackelberg (security) games [42, 35, 39, 16, 20, 12] and zero-sum games [29], and has notably been deployed in various real-world applications [2, 17]. Moreover, the model structure of QR has also been used in various other contexts, such as the softmax activation in neural networks [18], multinomial logistic regression [8], and the multiplicative weight update algorithm for no-regret learning [3].
As an initial attempt at our general learning problem, we also adopt the QR model to capture the agent's boundedly rational behavior, owing to its modeling success in practice and its being the most common choice in prior work [42, 35, 16, 20, 12, 29, 2, 17]. We acknowledge that there exist other models of boundedly rational behavior beyond the QR model. For example, Kahneman [23] introduced prospect theory to model the bounded rationality of agents in games under risk; Camerer et al. [11] proposed the cognitive hierarchy theory, which classifies agents according to their degree of reasoning in forming expectations of others. We anticipate that the message of our paper (i.e., that observing suboptimal responses can provide additional information for learning the follower's preferences) would apply to many of these bounded rationality models.
Inverse Game Theory.
Vorobeychik et al. [40] considered the payoff function learning problem, using strategy profiles and the corresponding utilities through regression. Kuleshov and Schrijvers [26] introduced the concept of inverse game theory and showed that the problem of computing the agents' utilities from a set of correlated equilibria is NP-hard unless the game is known to have special structure. More recently, the inverse game theory problem has been studied under the QR model, leading to a few positive results: Sinha et al. [39] consider the offline PAC-learning setup where the follower's responses can be predicted with small error for a fixed leader strategy distribution; Haghtalab et al. [20] prove that only three strategies are sufficient to recover linear follower payoff functions in security games; Ling et al. [29] present an end-to-end learning framework that learns a zero-sum game's payoffs from its QRE. Following their success, our paper is the first work that provides theoretical guarantees of payoff recovery in general Stackelberg games. Finally, inverse problems have received significantly more attention in single-agent decision making; the most notable is inverse reinforcement learning, pioneered by Ng et al. [34] and Abbeel and Ng [1].
3 Problem Formulation
Game Setup
We consider the Stackelberg game between a single leader (she) and a follower (he). We let $U \in \mathbb{R}^{m \times n}$ (resp. $V \in \mathbb{R}^{m \times n}$) be the leader's (resp. follower's) utility matrix, where $m$ and $n$ are the numbers of actions for the leader and the follower, respectively. We use $G(U, V)$ to denote the game instance. Each entry $u_{i,j}$ (resp. $v_{i,j}$) of the utility matrix denotes the leader's utility (resp. follower's utility) when the leader plays action $i$ and the follower plays action $j$. Without loss of generality, let $u_{i,j}, v_{i,j} \in [0, 1]$. Let $V_j \in \mathbb{R}^m$ be the $j$-th column of the matrix $V$. We denote the leader's (resp. follower's) action set by $[m] := \{1, \dots, m\}$ (resp. $[n] := \{1, \dots, n\}$).
In this sequential game, the leader moves first by committing to a (possibly randomized) strategy $x = (x_1, \dots, x_m) \in \Delta^m$, where the simplex $\Delta^m = \{x : \sum_{i \in [m]} x_i = 1 \text{ and } 0 \le x_i \le 1\}$ and each $x_i$ represents the probability of the leader playing action $i$. Similarly, let $\Delta^n$ denote the follower's strategy space. Under perfect rationality, given the leader's committed strategy, the follower would in turn choose the best-response action $j^*$ that maximizes his utility, i.e., $j^* = \arg\max_{j \in [n]} \{x^\top V_j\}$. In our problem, we instead use the QR model to capture the follower's boundedly rational behavior. That is, the follower responds to the leader's committed strategy by choosing a strategy $y^*$ that maximizes his utility up to a Gibbs entropic regularizer, i.e., $y^* = \arg\max_{y \in \Delta^n} \{\lambda\, x^\top V y - y^\top \ln y\}$. This is known to be equivalent to the setting where the follower best responds to payoffs perturbed by noise from a Gumbel distribution [22]. Moreover, the closed-form solution of the follower's optimal strategy for this convex optimization program is exactly the logit choice model on the true payoffs, i.e., for each $j \in [n]$, $y^*_j = \frac{\exp(\lambda x^\top V_j)}{\sum_{k \in [n]} \exp(\lambda x^\top V_k)}$ [33].
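In code, the QR map above is simply a temperature-scaled softmax; below is a minimal sketch (ours, with NumPy as an assumed dependency and our own function name), not the authors' implementation.

```python
import numpy as np

def quantal_response(x, V, lam):
    """Follower's QR mixed strategy to a committed leader strategy.

    x: leader mixed strategy, shape (m,); V: follower utility matrix, shape (m, n);
    lam: bounded rationality constant lambda. Returns y in the simplex Delta^n with
    y_j proportional to exp(lam * x^T V_j)."""
    scores = lam * (x @ V)      # lambda * x^T V_j for every follower action j
    scores -= scores.max()      # shift for numerical stability (softmax is shift-invariant)
    y = np.exp(scores)
    return y / y.sum()
```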
We refer to $\lambda$ as the bounded rationality constant, which is given in each specific problem, since several existing works have already determined its empirical value in practice: the human behavior experiments in [38, 41] compute $\lambda = 7.6$, and the experiments in [28, 36, 32] show that $\lambda$ is in the range of 4 to 16 (these $\lambda$ estimates are normalized to the utility scale in $[0, 1]$).
Learning Problem
We consider the inverse game theory problem in a sequential game with unknown follower utility and seek to quantify how much the leader can learn about a boundedly rational follower's utility. We frame this problem under an active and strategic learning setup, where the leader can interactively choose a randomized strategy and observe the follower's strategic responses. Specifically, at each round $t \in [T]$, the leader commits to a strategy $x^{(t)}$. The follower observes the committed $x^{(t)}$ and responds with the QR strategy $y^{(t)}$. Below we consider both feedback settings, depending on whether the leader is able to observe the exact distribution $y^{(t)}$ or merely samples from it.
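As a simulation sketch of this interaction protocol (ours, not the paper's experimental code; the function and argument names are assumptions), the leader commits to a sequence of strategies and records either the full QR distribution or a single sampled action per round.

```python
import numpy as np

def simulate_rounds(V, lam, leader_strategies, observe_mixed, rng=None):
    """At each round t, the leader commits to x^(t) (a row of leader_strategies)
    and observes either the follower's full QR mixed strategy y^(t)
    (observe_mixed=True) or one action sampled from y^(t)."""
    rng = rng if rng is not None else np.random.default_rng()
    observations = []
    for x in leader_strategies:
        scores = lam * (x @ V)
        y = np.exp(scores - scores.max())
        y /= y.sum()                             # quantal response y^(t)
        observations.append(y if observe_mixed else int(rng.choice(len(y), p=y)))
    return observations

# Example: query the m pure leader strategies under sampled (pure-action) feedback.
# m, n = V.shape; obs = simulate_rounds(V, lam=7.6, leader_strategies=np.eye(m), observe_mixed=False)
```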
We set our primary learning objective to be recovering a full characterization of the follower's utility; our results below explain why it is unnecessary, and almost unrealistic, to expect an exact recovery of the follower's utility. We show in Observation 1 and Theorem 1 that such a utility characterization can be used to compute the optimal leader strategy under both perfect rationality, known as the strong Stackelberg equilibrium (SSE), and bounded rationality, known as the quantal Stackelberg equilibrium (QSE). Beyond developing optimal (or robust) leader strategies, we believe the recovered utilities are generally useful for better understanding and reasoning about the followers' motives. However, given the limited scope of the paper, we focus on the inverse game theory problems and defer the question of how to strategize using knowledge of the game (i.e., the typical game-theoretic problems) to related and future work.
Such a learning problem has been considered in [20] specifically for Stackelberg security games, where the payoff is a strictly simplified single-dimensional linear utility function. Our paper overcomes the curse of dimensionality and answers the open question of recovering payoffs in general Stackelberg games. On the other hand, Sinha et al. [39] showed a case of learning the nonparametric