
distribution; Haghtalab et al. [20] prove that only three strategies are sufficient to recover linear follower payoff functions in security games; Ling et al. [29] present an end-to-end learning framework that learns the zero-sum game payoff from its QRE. Following their success, our paper is the first work that provides a theoretical guarantee of payoff recovery in general Stackelberg games. Finally, inverse problems have received significantly more attention in single-agent decision-making problems; the most notable is inverse reinforcement learning, pioneered by Ng et al. [34] and Abbeel and Ng [1].
3 Problem Formulation
Game Setup   We consider a Stackelberg game between a single leader (she) and a single follower (he). We let $U \in \mathbb{R}^{m \times n}$ (resp. $V \in \mathbb{R}^{m \times n}$) be the leader's (resp. follower's) utility matrix, where $m$ and $n$ are the numbers of actions for the leader and the follower, respectively. We use $G(U, V)$ to denote the game instance. Each entry $u_{i,j}$ (resp. $v_{i,j}$) of the utility matrix denotes the leader's (resp. follower's) utility when the leader plays action $i$ and the follower plays action $j$. Without loss of generality, let $u_{i,j}, v_{i,j} \in [0,1]$.
Let $V_j \in \mathbb{R}^m$ be the $j$-th column of the matrix $V$. We denote the leader's (resp. follower's) action set by $[m] := \{1, \dots, m\}$ (resp. $[n] := \{1, \dots, n\}$).
In this sequential game, the leader moves first by committing to a (possibly randomized) strategy $x = (x_1, \dots, x_m) \in \Delta_m$, where the simplex $\Delta_m = \{x : \sum_{i \in [m]} x_i = 1 \text{ and } 0 \le x_i \le 1\}$ and each $x_i$ represents the probability of the leader playing action $i$. Similarly, let $\Delta_n$ denote the follower's strategy space. Under perfect rationality, given the leader's committed strategy, the follower would in turn choose the best-response action $j^*$ that maximizes his utility, i.e., $j^* = \operatorname{argmax}_{j \in [n]} \{x^\top V_j\}$.
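To make the setup concrete, the following is a minimal sketch in Python/NumPy of a perfectly rational follower's best response to a committed mixed strategy; the $2 \times 3$ game instance and all variable names are purely illustrative and not taken from our results.

```python
import numpy as np

# Illustrative game with m = 2 leader actions and n = 3 follower actions.
# Entries lie in [0, 1]; U is the leader's utility matrix, V the follower's.
U = np.array([[0.9, 0.2, 0.5],
              [0.1, 0.8, 0.4]])
V = np.array([[0.3, 0.7, 0.2],
              [0.6, 0.1, 0.9]])

# Leader commits to a mixed strategy x in the simplex Delta_m.
x = np.array([0.4, 0.6])

# Perfectly rational follower: best response j* = argmax_j x^T V_j.
follower_utils = x @ V          # expected follower utility of each action j
j_star = int(np.argmax(follower_utils))

# Leader's expected utility when the follower plays the pure action j*.
leader_util = x @ U[:, j_star]
print(j_star, leader_util)
```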
In our problem, we instead use the QR model to capture the follower's boundedly rational behavior. That is, the follower responds to the leader's committed strategy by choosing a strategy $y^*$ that maximizes his utility up to a Gibbs entropic regularizer, i.e., $y^* = \operatorname{argmax}_{y \in \Delta_n} \{\lambda x^\top V y - y^\top \ln y\}$. This is shown to be equivalent to the setting where the follower best responds to the payoff perturbed by noise from a Gumbel distribution [22]. Moreover, the closed-form solution of the follower's optimal strategy for this convex optimization program is exactly the logit choice model on the true payoff, i.e., for each $j \in [n]$, $y^*_j = \frac{\exp(\lambda x^\top V_j)}{\sum_{k \in [n]} \exp(\lambda x^\top V_k)}$ [33].
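As a companion sketch (reusing the illustrative `V` and `x` from the previous snippet; the value of $\lambda$ here is arbitrary and only for demonstration), the closed-form logit choice model can be computed directly:

```python
def quantal_response(V, x, lam):
    """QR/logit strategy: y_j proportional to exp(lam * x^T V_j)."""
    logits = lam * (x @ V)
    logits -= logits.max()       # subtract the max for numerical stability
    y = np.exp(logits)
    return y / y.sum()

lam = 8.0                        # bounded rationality constant (illustrative)
y_star = quantal_response(V, x, lam)
# y_star sums to 1; a larger lam concentrates mass on the best-response action,
# while lam -> 0 approaches the uniform distribution over the n actions.
```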
We refer to $\lambda$ as the bounded rationality constant, which is given in each specific problem, as several existing works have already determined its empirical value in practice: the human behavior experiments in [38, 41] compute $\lambda = 7.6$; the experiments in [28, 36, 32] show that $\lambda$ is in the range of 4 to 16.²
Learning Problem   We consider the inverse game theory problem in a sequential game with unknown follower utility and seek to quantify how much the leader can learn about a boundedly rational follower's utility. We frame this problem under an active and strategic learning setup, where the leader can interactively choose a randomized strategy and observe the follower's strategic responses. Specifically, at each round $t \in [T]$, the leader commits to a strategy $x^{(t)}$. The follower observes the committed $x^{(t)}$ and responds based on the QR strategy $y^{(t)}$. Below we consider both feedback settings, depending on whether the leader is able to observe the exact distribution $y^{(t)}$ or merely samples from it.
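The interaction protocol can be summarized by the sketch below, which reuses `quantal_response`, `V`, and `lam` from the earlier snippets; the follower's true $V$ is hidden from the leader and only used here to generate feedback. Choosing $x^{(t)}$ uniformly at random is purely for illustration, since in our setting the leader would choose it strategically.

```python
rng = np.random.default_rng(0)
T = 100                      # number of rounds
K = 50                       # samples per round in the sample-feedback setting
m, n = V.shape

history = []
for t in range(T):
    # Leader picks a (possibly adaptive) strategy x_t; random here only
    # for illustration.
    x_t = rng.dirichlet(np.ones(m))

    y_t = quantal_response(V, x_t, lam)   # follower's QR strategy (V hidden)

    # Feedback setting 1: the exact distribution y_t is observed.
    exact_feedback = y_t
    # Feedback setting 2: only K sampled responses j ~ y_t are observed.
    sampled_feedback = rng.choice(n, size=K, p=y_t)

    history.append((x_t, exact_feedback, sampled_feedback))
```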
We set our primary learning objective as recovering a full characterization of the follower's utility; our results below explain why it is unnecessary, and almost unrealistic, to expect an exact recovery of the follower's utility. We show in Observation 1 and Theorem 1 that such a utility characterization can be used to compute the optimal leader strategy under both perfect rationality, known as the strong Stackelberg equilibrium (SSE), and bounded rationality, known as the quantal Stackelberg equilibrium (QSE). Besides enabling the development of optimal (or robust) leader strategies, we believe the recovered utilities are generally useful for better understanding and reasoning about the follower's motives. However, given the limited scope of the paper, we focus on the inverse game theory problems and defer the question of how to strategize using the knowledge of the game (i.e., the typical game-theoretic problems) to related and future work.
Such a learning problem has been considered in [20] specifically for Stackelberg security games, where the payoff is a strictly simplified single-dimensional linear utility function. Our paper overcomes the curse of dimensionality and answers the open question of recovering payoffs in general Stackelberg games. On the other hand, Sinha et al. [39] showed a case of learning the nonparametric
²The $\lambda$ estimations are normalized to the utility scale in $[0, 1]$.