
distribution; Haghtalab et al. [20] prove that only three strategies are sufficient to recover linear follower payoff functions in security games; Ling et al. [29] present an end-to-end learning framework that learns the zero-sum game payoff from its QRE. Following their success, our paper is the first work that provides a theoretical guarantee of payoff recovery in general Stackelberg games. Finally, inverse problems have received significantly more attention in single-agent decision-making problems; the most notable is inverse reinforcement learning, pioneered by Ng et al. [34] and Abbeel and Ng [1].
3 Problem Formulation
Game Setup   We consider a Stackelberg game between a single leader (she) and a single follower (he). We let $U \in \mathbb{R}^{m \times n}$ (resp. $V \in \mathbb{R}^{m \times n}$) be the leader's (resp. follower's) utility matrix, where $m$ and $n$ are the numbers of actions for the leader and the follower, respectively. We use $G(U, V)$ to denote the game instance. Each entry $u_{i,j}$ (resp. $v_{i,j}$) of the utility matrix denotes the leader's (resp. follower's) utility when the leader plays action $i$ and the follower plays action $j$. Without loss of generality, let $u_{i,j}, v_{i,j} \in [0,1]$.
Let $V_j \in \mathbb{R}^m$ be the $j$-th column of the matrix $V$. We denote the leader's (resp. follower's) action set by $[m] := \{1, \dots, m\}$ (resp. $[n] := \{1, \dots, n\}$).
In this sequential game, the leader moves first by committing to a (possibly randomized) strategy $x = (x_1, \dots, x_m) \in \Delta_m$, where the simplex $\Delta_m = \{x : \sum_{i \in [m]} x_i = 1 \text{ and } 0 \le x_i \le 1\}$ and each $x_i$ represents the probability of the leader playing action $i$. Similarly, let $\Delta_n$ denote the follower's strategy space. Under perfect rationality, given the leader's committed strategy, the follower would in turn choose the best-response action $j^*$ that maximizes his utility, i.e., $j^* = \operatorname{argmax}_{j \in [n]} \{x^\top V_j\}$.
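To make the setup concrete, the following is a minimal sketch in Python/NumPy of a perfectly rational follower's best response to a committed mixed strategy; the $2 \times 3$ game instance and all variable names are purely illustrative and not taken from our results.

```python
import numpy as np

# Illustrative game with m = 2 leader actions and n = 3 follower actions.
# Entries lie in [0, 1]; U is the leader's utility matrix, V the follower's.
U = np.array([[0.9, 0.2, 0.5],
              [0.1, 0.8, 0.4]])
V = np.array([[0.3, 0.7, 0.2],
              [0.6, 0.1, 0.9]])

# Leader commits to a mixed strategy x in the simplex Delta_m.
x = np.array([0.4, 0.6])

# Perfectly rational follower: best response j* = argmax_j x^T V_j.
follower_utils = x @ V          # expected follower utility of each action j
j_star = int(np.argmax(follower_utils))

# Leader's expected utility when the follower plays the pure action j*.
leader_util = x @ U[:, j_star]
print(j_star, leader_util)
```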
In our problem, we instead use the QR model to capture the follower's boundedly rational behavior. That is, the follower responds to the leader's committed strategy by choosing a strategy $y^*$ that maximizes his utility up to a Gibbs entropic regularizer, i.e., $y^* = \operatorname{argmax}_{y \in \Delta_n} \{\lambda x^\top V y - y^\top \ln y\}$. This is shown to be equivalent to the setting where the follower best responds to the payoff perturbed by noise from a Gumbel distribution [22]. Moreover, the closed-form solution of the follower's optimal strategy for this convex optimization program is exactly the logit choice model on the true payoff, i.e., for each $j \in [n]$, $y^*_j = \frac{\exp(\lambda x^\top V_j)}{\sum_{k \in [n]} \exp(\lambda x^\top V_k)}$ [33].
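As a companion sketch (reusing the illustrative `V` and `x` from the previous snippet; the value of $\lambda$ here is arbitrary and only for demonstration), the closed-form logit choice model can be computed directly:

```python
def quantal_response(V, x, lam):
    """QR/logit strategy: y_j proportional to exp(lam * x^T V_j)."""
    logits = lam * (x @ V)
    logits -= logits.max()       # subtract the max for numerical stability
    y = np.exp(logits)
    return y / y.sum()

lam = 8.0                        # bounded rationality constant (illustrative)
y_star = quantal_response(V, x, lam)
# y_star sums to 1; a larger lam concentrates mass on the best-response action,
# while lam -> 0 approaches the uniform distribution over the n actions.
```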
We refer to $\lambda$ as the bounded rationality constant, which is given in each specific problem, as several existing works have already determined its empirical value in practice: the human behavior experiments in [38, 41] compute $\lambda = 7.6$; the experiments in [28, 36, 32] show that $\lambda$ is in the range of 4 to 16.²
Learning Problem   We consider the inverse game theory problem in a sequential game with unknown follower utility and seek to quantify how much the leader can learn about a boundedly rational follower's utility. We frame this problem under an active and strategic learning setup, where the leader can interactively choose a randomized strategy and observe the follower's strategic responses. Specifically, at each round $t \in [T]$, the leader commits to a strategy $x^{(t)}$. The follower observes the committed $x^{(t)}$ and responds based on the QR strategy $y^{(t)}$. Below we consider both feedback settings, depending on whether the leader is able to observe the exact distribution $y^{(t)}$ or merely samples from it.
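The interaction protocol can be summarized by the sketch below, which reuses `quantal_response`, `V`, and `lam` from the earlier snippets; the follower's true $V$ is hidden from the leader and only used here to generate feedback. Choosing $x^{(t)}$ uniformly at random is purely for illustration, since in our setting the leader would choose it strategically.

```python
rng = np.random.default_rng(0)
T = 100                      # number of rounds
K = 50                       # samples per round in the sample-feedback setting
m, n = V.shape

history = []
for t in range(T):
    # Leader picks a (possibly adaptive) strategy x_t; random here only
    # for illustration.
    x_t = rng.dirichlet(np.ones(m))

    y_t = quantal_response(V, x_t, lam)   # follower's QR strategy (V hidden)

    # Feedback setting 1: the exact distribution y_t is observed.
    exact_feedback = y_t
    # Feedback setting 2: only K sampled responses j ~ y_t are observed.
    sampled_feedback = rng.choice(n, size=K, p=y_t)

    history.append((x_t, exact_feedback, sampled_feedback))
```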
We set our primary learning objective as recovering a full characterization of the follower's utility; our results below explain why it is unnecessary, and almost unrealistic, to expect an exact recovery of the follower's utility. We show in Observation 1 and Theorem 1 that such a utility characterization can be used to compute the optimal leader strategy under both perfect rationality, known as the strong Stackelberg equilibrium (SSE), and bounded rationality, known as the quantal Stackelberg equilibrium (QSE). Besides enabling the development of optimal (or robust) leader strategies, we believe the recovered utilities are generally useful for better understanding and reasoning about the follower's motives. However, given the limited scope of the paper, we focus on the inverse game theory problems and defer the question of how to strategize using the knowledge of the game (i.e., the typical game-theoretic problems) to related and future work.
Such a learning problem has been considered in [20] specifically for Stackelberg security games, where the payoff is a strictly simplified single-dimensional linear utility function. Our paper overcomes the curse of dimensionality and answers the open question of recovering payoffs in general Stackelberg games. On the other hand, Sinha et al. [39] showed a case of learning the nonparametric
²The $\lambda$ estimations are normalized to the utility scale in $[0, 1]$.