Nash, correlated, and coarse correlated equilibria, each giving rise to a corresponding Stackelberg-
Nash, Stackelberg-CE, and Stackelberg-CCE concept. This motivates an oracle abstraction. For any
choice of leader strategy, we denote as
E
(
FπL
) a follower equilibrium (or a probability distribution
over equilibria), where we refer to
E
as an oracle. This oracle will later be realized as an algorithm.
See also Wang et al. (2022).
Definition 2 (Stackelberg equilibrium).Given a Markov Game
M
and a follower best-response
oracle
E
, a leader strategy
πL
together with a tuple of follower strategies
πF
is a Stackelberg
equilibrium, if and only if
πL
maximizes the leader’s expected reward under the condition that
follower strategies are drawn from E(FπL):
πL∈arg max
πL
E
πF∼E(FπL)hX
t
E[rLst, aL,t, aF,t]i,
where the second expectation is drawing actions and state transitions from their respective policies
πL, πF
and transition function
T
, and the reward function is
r
, all as in Definition 1. If the
follower oracle
E
gives a Nash equilibrium in the induced game
FπL
, we call this a Stackelberg-Nash
equilibrium, and similarly for CE and CCE.
In the remainder of the paper, when we say “oracle” we mean an algorithm that computes or
learns the follower best-response equilibrium,
E
(
FπL
), given the leader strategy
πL
. In full generality,
an oracle algorithm could take many forms, including RL, optimization, or any other algorithm that
takes as input
πL
and outputs
E
(
FπL
). These algorithms may vary in regard to how they access
information about
πL
. Our main positive result (Theorem 1) will rely on oracle algorithms that
only require query access (also called sample access) to the leader policy; i.e., algorithms that only
interact with the leader policy by receiving as input the leader’s actions in each of a set of leader
observation states, these states queried by the algorithm in some order.
Definition 3 (Query Oracle).An algorithm implementing an oracle
E
is called a query oracle if its
interactions with the leader policy
πL
are exclusively through queries, where a query is an input to
the leader policy; i.e., an observation
o
together with the response from the leader policy, which is
the action πL(o)the leader would take given o.
A query oracle can also receive additional information, for example an interaction with the
environment that is associated with the Markov Game (or a description of the environment). The
definition of a query oracle is only in regard to how the oracle algorithm interacts with the leader
policy. In particular, any oracle implemented using RL or another typical learning approach is a
query oracle, as learning algorithms generally only require query access to their environment and
the actions taken by other agents in different states. In particular, a suitably formulated RL leader
together with a RL-based followers (i.e., an “outer loop - inner loop” approach) can be interpreted
within our framework as an oracle querying the leader policy.
An algorithm that operates directly on a description of the leader policy,
πL
, such as a parametriza-
tion
θ
of
πL
, is not a query oracle. For instance, if
M
has a small number of discrete states and
actions,
πL
could be directly parametrized, with
θij
denoting the probability of taking action
j
in
state
i
. An optimization approach could then compute a best response directly from knowledge of
θ
,
but these parameters are not available through query access. Similarly,
θ
could be the weights of a
neural network.
5