HSVI CAN SOLVE
ZERO-SUM PARTIALLY OBSERVABLE STOCHASTIC GAMES
A PREPRINT
Aurélien Delage
CITI
INSA Lyon
Villeurbanne
aurelien.delage@insa-lyon.fr
Olivier Buffet
INRIA - CNRS
Université de Lorraine
Villers-lès-Nancy
olivier.buffet@inria.fr
Jilles S. Dibangoye
CITI
INSA Lyon
Villeurbanne
jilles-steeve.dibangoye@inria.fr
Abdallah Saffidine
University of New South Wales
Sydney
abdallahs@cse.unsw.edu.au
October 27, 2022
ABSTRACT
State-of-the-art methods for solving 2-player zero-sum imperfect information games rely on linear
programming or regret minimization, though not on dynamic programming (DP) or heuristic
search (HS), while the latter are often at the core of state-of-the-art solvers for other sequential
decision-making problems. In partially observable or collaborative settings (e.g., POMDPs and Dec-
POMDPs), DP and HS require introducing an appropriate statistic that induces a fully observable
problem as well as bounding (convex) approximators of the optimal value function. This approach
has succeeded in some subclasses of 2-player zero-sum partially observable stochastic games (zs-
POSGs) as well, but how to apply it in the general case still remains an open question. We answer
it by (i) rigorously defining an equivalent game to work with, (ii) proving mathematical properties
of the optimal value function that allow deriving bounds that come with solution strategies, (iii)
proposing for the first time an HSVI-like solver that provably converges to an $\epsilon$-optimal solution in
finite time, and (iv) empirically analyzing it. This opens the door to a novel family of promising
approaches complementing those relying on linear programming or iterative methods.
1 Introduction
Solving imperfect information sequential games is a challenging field with many applications from playing Poker
[Kuhn, 1950] to security games [Basilico et al., 2016]. We focus on finite-horizon 2-player zero-sum partially
observable stochastic games (zs-POSGs), an important class of games that is equivalent to that of zero-sum extensive-form games (zs-EFGs) [Oliehoek and Vlassis, 2006].¹ From the viewpoint of (maximizing) player 1, we aim
at finding a strategy with a worst-case expected return (i.e., whatever player 2's strategy) within $\epsilon$ of the Nash
equilibrium value (NEV).
A first approach to solving a zs-POSG is to turn it into a zs-EFG addressed as a sequence form linear program
(SFLP) [Koller et al., 1996, von Stengel, 1996, Bošanský et al., 2014], giving rise to an exact algorithm. A second
approach is to use an iterative game solver, i.e., either a counterfactual-regret-based method (CFR) [Zinkevich et al.,
2007, Brown and Sandholm, 2018], or a first-order method [Hoda et al., 2010, Kroer et al., 2020], both coming with
asymptotic convergence properties. CFR-based approaches now incorporate deep reinforcement learning and search,
some of them winning against top human players at heads-up no-limit hold'em poker [Moravčík et al., 2017, Brown
and Sandholm, 2018, Brown et al., 2020]. A third approach, proposed by Wiggers [2015], is to use two parallel
searches in strategy space, one per player, so that the gap between both strategies’ security levels (i.e., the values of
their opponent’s best responses) bounds the distance to the NEV.
¹Note: POSGs are equivalent to the large class of "well-behaved" EFGs as defined by Kovařík et al. [2019].
arXiv:2210.14640v1 [cs.AI] 26 Oct 2022
In contrast, dynamic programming and heuristic search have not been applied to general zs-POSGs, while often at the
core of state-of-the-art solvers in other problem classes that involve Markovian dynamics, partial observability and
multiple agents (POMDP [Åström, 1965, Smith, 2007], Dec-POMDP [Szer et al., 2005, Dibangoye et al., 2016], or
subclasses of zs-POSGs with simplifying observability assumptions [Ghosh et al., 2004, Chatterjee and Doyen, 2014,
Basu and Stettner, 2015, Horák et al., 2017, Cole and Kocherlakota, 2001, Horák and Bošanský, 2019]). They all rely
on some statistic that induces a fully observable problem whose value function ($V$) exhibits continuity properties
that allow deriving bounding approximations. Wiggers et al. [2016b,a] progress in this direction for zs-POSGs by
demonstrating an important continuity property of the optimal value function, and proposing a reformulation as a
particular equivalent game. We work in a similar direction, 1. using a game with different observability hypotheses,
2. proving theoretical results they implicitly rely on, and 3. building on some of their results to derive an HSVI-like
algorithm solving the zs-POSG.
Section 2 presents some necessary background, including the concept of occupancy state [Dibangoye et al., 2016,
Wiggers et al., 2016a] (i.e., the probability distribution over the players’ past action-observation histories), and
properties that rely on it. Then, Section 3 describes theoretical contributions. First, we rigorously reformulate
the problem as a non-observable game, and demonstrate that the Nash equilibrium value can be expressed with a
recursive formula, which is a required tool for DP and HS (Sec. 3.1). Second, we exhibit novel continuity properties
of optimal value functions and derive bounding approximators, a second tool made necessary due to the continuous
state space of the new game, before showing that these approximators come with valid solution strategies for the
zs-POSG (Sec. 3.2). Third, we adapt the algorithmic scheme of Smith and Simmons' HSVI [Smith and Simmons, 2005]
to $\epsilon$-optimally solve the problem in finitely many iterations (Sec. 3.3). Section 4 presents an empirical analysis of the
approach. Section 5 discusses similarities and differences of our work with CFR-based continual resolving methods
before concluding.
2 Background
Here, we first give basic definitions about zs-POSGs, including the solution concept at hand. Then we introduce an
equivalent game where a state corresponds to a statistic summarizing past behaviors, which leads to some important
properties of the game’s optimal value.
2.1 zs-POSGs
Definition 2.1 (zs-POSG). As illustrated through a dynamic influence diagram in Figure 1, a (2-player) zero-sum
partially observable stochastic game (zs-POSG) is defined by a tuple $\langle \mathcal{S}, \mathcal{A}^1, \mathcal{A}^2, \mathcal{Z}^1, \mathcal{Z}^2, P, r, H, \gamma, b_0 \rangle$, where:
- $\mathcal{S}$ is a finite set of states;
- $\mathcal{A}^i$ is (player) $i$'s finite set of actions;
- $\mathcal{Z}^i$ is $i$'s finite set of observations;
- $P^{z^1,z^2}_{a^1,a^2}(s'|s)$ is the probability of transitioning to state $s'$ and receiving observations $z^1$ and $z^2$ when actions $a^1$ and $a^2$ are performed in state $s$;
- $r(s, a^1, a^2)$ is a (scalar) reward function;
- $H \in \mathbb{N}$ is a (finite) temporal horizon;
- $\gamma \in [0,1]$ is a discount factor; and
- $b_0$ is the initial belief state, i.e., a probability distribution over states at $t = 0$.
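To make the tuple concrete, the definition above can be mirrored by a small container type. The following is a minimal sketch in Python; the class name, field encodings, and the dictionary-based representation of $P$ are illustrative assumptions, not taken from the paper.

```python
from typing import NamedTuple, Callable, Dict, List, Tuple

class ZsPOSG(NamedTuple):
    """Hypothetical container mirroring <S, A1, A2, Z1, Z2, P, r, H, gamma, b0>."""
    S: List[str]     # finite state set
    A1: List[str]    # player 1's finite action set
    A2: List[str]    # player 2's finite action set
    Z1: List[str]    # player 1's finite observation set
    Z2: List[str]    # player 2's finite observation set
    # P[(z1, z2, a1, a2)][(s, s_next)] = Pr(s_next, z1, z2 | s, a1, a2)
    P: Dict[Tuple[str, str, str, str], Dict[Tuple[str, str], float]]
    r: Callable[[str, str, str], float]  # scalar reward r(s, a1, a2)
    H: int           # finite temporal horizon
    gamma: float     # discount factor in [0, 1]
    b0: Dict[str, float]  # initial belief: distribution over states at t = 0
```

A trivial one-state instance can then be built by filling in each field, which makes the finiteness assumptions of the definition explicit in code.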
From the Dec-POMDP, POSG and EFG literature, we use the following concepts and definitions:
- $\theta^i_\tau = (a^i_0, z^i_1, \ldots, a^i_{\tau-1}, z^i_\tau)$ is a length-$\tau$ action-observation history (AOH) for $i$. We note $\Theta^i_\tau$ the set of all AOHs for player $i$ at horizon $\tau$, such that any AOH $\theta^i_\tau$ is in $\Theta^i = \bigcup_{t=0}^{H-1} \Theta^i_t$.
- $\beta^i_\tau$ is a (behavioral) decision rule (DR) at $\tau$ for $i$, i.e., a mapping from private AOHs in $\Theta^i_\tau$ to distributions over private actions. $\beta^i_\tau(\theta^i_\tau, a^i)$ is the probability to pick $a^i$ when facing $\theta^i_\tau$.
- $\beta^i_{\tau:\tau'} = (\beta^i_\tau, \ldots, \beta^i_{\tau'})$ is a behavioral strategy for $i$ from time step $\tau$ to $\tau'$ (included).
When considering both players, the last 3 concepts become:
- $\theta_\tau = (\theta^1_\tau, \theta^2_\tau)$ ($\in \Theta = \bigcup_{t=0}^{H-1} \Theta_t$), a joint AOH at $\tau$;
- $\beta_\tau = \langle \beta^1_\tau, \beta^2_\tau \rangle$ ($\in \mathcal{B} = \bigcup_{t=0}^{H-1} \mathcal{B}_t$), a decision rule profile; and
- $\beta_{\tau:\tau'} = \langle \beta^1_{\tau:\tau'}, \beta^2_{\tau:\tau'} \rangle$, a behavioral strategy profile.
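These behavioral objects are straightforward to represent directly. A minimal sketch, under the illustrative assumptions that an AOH is encoded as a hashable tuple $(a_0, z_1, \ldots)$ and a decision rule as a dictionary from AOHs to action distributions:

```python
import random

# A behavioral decision rule beta_t maps a private AOH to a distribution
# over private actions: {aoh_tuple: {action: probability}}.
def sample_action(beta_t: dict, aoh: tuple, rng: random.Random) -> str:
    """Draw an action from the distribution beta_t(aoh)."""
    dist = beta_t[aoh]
    actions, probs = zip(*dist.items())
    return rng.choices(actions, weights=probs, k=1)[0]

# Example: a uniform rule over "head"/"tail" at the empty history.
beta = {(): {"a_h": 0.5, "a_t": 0.5}}
```

A behavioral strategy $\beta^i_{\tau:\tau'}$ would then simply be a list of such dictionaries, one per time step.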
[Figure 1 (diagram omitted): states $S_t$, private observations $Z^1_t, Z^2_t$, actions $A^1_t, A^2_t$, transitions $P^{z^1,z^2}_{a^1,a^2}(s'|s)$ and rewards $r(s, a^1, a^2)$; the state is hidden, each player only seeing its own actions and observations.]
Figure 1: Dynamic influence diagram representing the evolution of a zs-POSG
Nash Equilibria
Here, player 1 (respectively 2) wants to maximize (resp. minimize) the expected return, or value, of strategy profile $\beta_{0:H-1}$, defined as the discounted sum of future rewards, i.e.,
$$V_0(\beta_{0:H-1}) = \mathbb{E}\left[ \sum_{t=0}^{H-1} \gamma^t R_t \,\middle|\, \beta_{0:H-1} \right],$$
where $R_t$ is the random variable associated with the instant reward at $t$. This leads to the solution concept of Nash equilibrium strategy (NES).
Definition 2.2 (Nash Equilibrium). The strategy profile $\beta^*_{0:H-1} = \langle \beta^{1*}_{0:H-1}, \beta^{2*}_{0:H-1} \rangle$ is a NES if no player has an incentive to deviate, which can be written:
$$\forall \beta^1_{0:H-1}, \quad V_0(\beta^1_{0:H-1}, \beta^{2*}_{0:H-1}) \leq V_0(\beta^{1*}_{0:H-1}, \beta^{2*}_{0:H-1}), \quad \text{and}$$
$$\forall \beta^2_{0:H-1}, \quad V_0(\beta^{1*}_{0:H-1}, \beta^2_{0:H-1}) \geq V_0(\beta^{1*}_{0:H-1}, \beta^{2*}_{0:H-1}).$$
In such a game, all NESs have the same Nash equilibrium value (NEV), $V^*_0 \overset{\text{def}}{=} V_0(\beta^{1*}_{0:H-1}, \beta^{2*}_{0:H-1})$. Our specific objective is to find an $\epsilon$-NES, i.e., a behavioral strategy profile such that any player would gain at most $\epsilon$ by deviating.
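For intuition, in the one-shot (normal-form) special case this deviation test reduces to comparing each player's best-response value against the current value of the profile. A toy sketch, with the function name and matrix encoding being illustrative choices rather than anything from the paper:

```python
# Normal-form view: M[i][j] is the payoff to the maximizing player when
# player 1 plays row i and player 2 plays column j; x and y are mixed
# strategies. nes_gap returns the largest gain any player obtains by
# deviating, so a profile is an epsilon-NES iff nes_gap(M, x, y) <= epsilon.
def nes_gap(M, x, y):
    value = sum(x[i] * y[j] * M[i][j]
                for i in range(len(x)) for j in range(len(y)))
    # Best-response value of player 1 (max over rows) and player 2 (min over columns).
    br1 = max(sum(y[j] * M[i][j] for j in range(len(y))) for i in range(len(M)))
    br2 = min(sum(x[i] * M[i][j] for i in range(len(x))) for j in range(len(M[0])))
    return max(br1 - value, value - br2)

# Matching Pennies: uniform strategies form an exact NES (gap 0, NEV 0).
M = [[+1, -1], [-1, +1]]
assert abs(nes_gap(M, [0.5, 0.5], [0.5, 0.5])) < 1e-12
```

This is also the quantity behind Wiggers' parallel-search approach mentioned in the introduction: the gap between the two players' security levels bounds the distance to the NEV.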
Why writing a Bellman Optimality Equation is Hard
Our approach requires writing Bellman optimality equations. The main obstacle to achieving this is finding an
appropriate characterization of a subproblem that allows
1. predicting both the immediate reward and the next possible subproblems given an immediate decision;
2. connecting a subproblem’s solution with solutions of its own (lower-level) subproblems; and
[Figure 2 (tree omitted): game tree over states $s_i, s_h, s_t$ and actions $a_h, a_t$, with the nodes reached after player 1's $t=0$ move indistinguishable to player 2.]
Figure 2: Simplified tree representation of the sequentialized Matching Pennies game. Irrelevant actions, noted $*$, allow merging edges with the same action for (i) player 2 at $t = 0$, and (ii) player 1 at $t = 1$. Notes: (a) Due to irrelevant actions, this game can be seen as an extensive-form game, despite players acting simultaneously. (b) Players only know about their past action history (in this observation-free game).
3. prescribing a solution strategy for the subproblem built on solutions of lower-level subproblems.
In our setting, a player's AOH does not characterize a subproblem, since the opponent's strategy is also required to
predict the expected reward and the next AOHs. For their part, joint AOHs allow predicting next joint AOHs given
both players' immediate decision rules, but would not be appropriate either, since player $i$ cannot decide how to act
when facing some individual AOH $\theta^i_\tau$ without considering all possible AOHs of her opponent $\neg i$.
Partial behavioral strategy profiles (sequences of behavioral decision rule profiles from
t= 0
to some
τ
) contain
enough information to completely describe the situation at
τ
, and are thus necessarily predictive. We still need to
demonstrate that they are connected, despite decision rules not being public, and prescriptive, despite the need to
address global-consistency issues illustrated in the following example.
Example 1. Matching Pennies is a well-known 2-player zero-sum game in which each player has a penny and secretly
chooses one side (head or tail). Then, both pennies' sides are revealed, and player 1 wins (payoff of $+1$) if both
chosen sides match and loses (payoff of $-1$) if not.
We here formalize this game as a zs-POSG (as illustrated in Figure 2) where player 1 actually picks his action at
$t = 0$, and player 2 at $t = 1$. Hence the tuple $\langle \mathcal{S}, \mathcal{A}^1, \mathcal{A}^2, \mathcal{Z}^1, \mathcal{Z}^2, P, r, H, \gamma, b_0 \rangle$ where:
- $\mathcal{S} = \{s_i, s_h, s_t\}$, where $s_i$ is the initial state, and $s_h$ and $s_t$ represent a memory of 1's last move: respectively "head" or "tail";
- $\mathcal{A}^1 = \mathcal{A}^2 = \{a_h, a_t\}$ for playing "head" ($a_h$) or "tail" ($a_t$);
- $\mathcal{Z}^1 = \mathcal{Z}^2 = \{z_n\}$, a trivial "none" observation;
- $P^{z}_{a}(s'|s) = T(s, a, s') \cdot O(a, s', z)$, using the next two definitions;
- $T$ is deterministic, the next state memorizing the last relevant move: playing $a_h$ leads to $s_h$ and playing $a_t$ leads to $s_t$;
- $O$ is deterministic and always returns $z_n$;
- $r$ is such that $r(s_i, \cdot, \cdot) = 0$, $r(s_t, \cdot, a_t) = r(s_h, \cdot, a_h) = +1$, and $r(s_t, \cdot, a_h) = r(s_h, \cdot, a_t) = -1$ ("$\cdot$" is used to denote "for all");
- $H = 2$;
- $\gamma = 1$; and
- $b_0$ is such that the initial state is $s_i$ with probability 1.
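This example is small enough to transcribe directly. A sketch of its data in Python, where the state and action names follow the example but the encoding itself is an assumed choice:

```python
# Matching Pennies as a zs-POSG (Example 1), transcribed as plain data.
S = ["s_i", "s_h", "s_t"]
A = ["a_h", "a_t"]          # A1 = A2
Z = ["z_n"]                 # single trivial "none" observation

def T(s, a1, a2):
    """Deterministic transition: the next state memorizes player 1's move."""
    return "s_h" if a1 == "a_h" else "s_t"

def r(s, a1, a2):
    """Reward: 0 in the initial state, +1 on a match, -1 otherwise."""
    if s == "s_i":
        return 0.0
    match = (s == "s_h" and a2 == "a_h") or (s == "s_t" and a2 == "a_t")
    return 1.0 if match else -1.0

H, gamma = 2, 1.0
b0 = {"s_i": 1.0}

# Sanity check: with beta1_0 uniform, player 1's expected payoff is 0
# whatever deterministic action player 2 picks at t = 1.
for a2 in A:
    v = sum(0.5 * r(T("s_i", a1, "*"), "*", a2) for a1 in A)
    assert abs(v) < 1e-12
```

The final loop illustrates why the uniform $\beta^1_0$ induces a NES: every response of player 2 yields the same value 0.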
Let us then assume that both players' DRs at $t = 0$ are fixed, with $\beta^1_0$ randomly picking $a_t$ or $a_h$ (i.e., it induces a NES whatever his DR at $t = 1$). Then, we face a "subgame" at $t = 1$ where any strategy profile $\langle \beta^1_{1:1}, \beta^2_{1:1} \rangle$ is a NES
profile with Nash equilibrium value 0. In particular, 2 can pick a deterministic strategy $\beta^2_{1:1}$, which will be said to be locally consistent. Yet, for 2, such a NES in the subgame at $\tau = 1$ is not necessarily globally consistent, i.e., it may not be part of a NES for the original game (i.e., starting from $\tau = 0$). Intuitively, in such global-consistency issues [Schmid, 2021] (also called safety issues [Burch et al., 2014]), the choices made at later time steps do not account for possible deviations from the opponent at earlier time steps.
As detailed in the next section, we will characterize a subproblem not with the raw data of partial strategy profiles,
but with a sufficient statistic, and this characterization will be used as the state of a new dynamic game equivalent to
the zs-POSG.
2.2 Occupancy State and Occupancy Markov Game
We now introduce an equivalent game, in which trajectories correspond to behavioral strategy profiles, and which we
will be able to decompose temporally (and recursively), a first key tool for DP and HS.
To cope with the necessarily continuous nature of its state space, we will set this game in occupancy space, i.e., a
statistic that sums up past DR profiles. This will let us derive continuity properties on which to build point-based
approximators.
Following Wiggers et al. [2016a], let us formally define an occupancy state (OS) $\sigma_{\beta_{0:\tau-1}}$ as the probability distribution over joint AOHs $\theta_\tau$ given partial strategy profile $\beta_{0:\tau-1}$. This statistic exhibits the usual Markov and sufficiency properties:
Proposition 2.3 (Adapted from Dibangoye et al. [2016, Thm. 1] – Proof in App. B.1). $\sigma_{\beta_{0:\tau-1}}$, together with $\beta_\tau$, is a sufficient statistic to compute (i) the next OS, $T(\sigma_{\beta_{0:\tau-1}}, \beta_\tau) \overset{\text{def}}{=} \sigma_{\beta_{0:\tau}}$, and (ii) the expected reward at $\tau$: $r(\sigma_{\beta_{0:\tau-1}}, \beta_\tau) \overset{\text{def}}{=} \mathbb{E}[R_\tau \mid \beta_{0:\tau-1} \oplus \beta_\tau]$, where $\oplus$ denotes a concatenation.
Writing from now on $\sigma_\tau$, as short for $\sigma_{\beta_{0:\tau-1}}$, the OS associated with some prefix strategy profile $\beta_{0:\tau-1}$, the proof essentially relies on deriving the following formulas: $\forall \theta^1_\tau, a^1, z^1, \theta^2_\tau, a^2, z^2$,
$$T(\sigma_\tau, \beta_\tau)((\theta^1_\tau, a^1, z^1), (\theta^2_\tau, a^2, z^2)) \overset{\text{def}}{=} Pr((\theta^1_\tau, a^1, z^1), (\theta^2_\tau, a^2, z^2) \mid \sigma_\tau, \beta_\tau) = \beta^1_\tau(\theta^1_\tau, a^1)\, \beta^2_\tau(\theta^2_\tau, a^2)\, \sigma_\tau(\theta_\tau) \sum_{s,s'} P^{z}_{a}(s'|s)\, b(s|\theta_\tau), \quad (1)$$
where $b(s|\theta_\tau)$ is a belief state obtained by Hidden Markov Model filtering; and
$$r(\sigma_\tau, \beta_\tau) \overset{\text{def}}{=} \mathbb{E}[r(S, A^1, A^2) \mid \sigma_\tau, \beta_\tau] = \sum_{s, \theta_\tau, a} \sigma_\tau(\theta_\tau)\, b(s|\theta_\tau)\, \beta^1_\tau(a^1 \mid \theta^1_\tau)\, \beta^2_\tau(a^2 \mid \theta^2_\tau)\, r(s, a). \quad (2)$$
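Equations (1) and (2) translate almost verbatim into code. The following is a sketch over dictionary-based distributions; all names are illustrative, and the beliefs $b(s|\theta_\tau)$ are assumed to have been precomputed by HMM filtering:

```python
# sigma maps joint AOHs (th1, th2) to probabilities; beta1/beta2 map
# (private_aoh, action) to probabilities; belief maps (state, joint_aoh)
# to Pr(s | theta); P(z1, z2, a1, a2, s_next, s) = Pr(s_next, z1, z2 | s, a1, a2).
def next_occupancy(sigma, beta1, beta2, belief, P, states, A1, A2, Z1, Z2):
    """Eq. (1): deterministic occupancy-state transition T(sigma, beta)."""
    nxt = {}
    for (th1, th2), p in sigma.items():
        for a1 in A1:
            for a2 in A2:
                w = p * beta1.get((th1, a1), 0.0) * beta2.get((th2, a2), 0.0)
                if w == 0.0:
                    continue
                for z1 in Z1:
                    for z2 in Z2:
                        # sum_{s, s'} P^{z1,z2}_{a1,a2}(s'|s) b(s|theta)
                        q = sum(P(z1, z2, a1, a2, s2, s) * belief[(s, (th1, th2))]
                                for s in states for s2 in states)
                        key = (th1 + (a1, z1), th2 + (a2, z2))
                        nxt[key] = nxt.get(key, 0.0) + w * q
    return nxt

def expected_reward(sigma, beta1, beta2, belief, r, states, A1, A2):
    """Eq. (2): expected immediate reward r(sigma, beta)."""
    return sum(p * belief[(s, th)]
               * beta1.get((th[0], a1), 0.0) * beta2.get((th[1], a2), 0.0)
               * r(s, a1, a2)
               for th, p in sigma.items()
               for s in states for a1 in A1 for a2 in A2)
```

Note how the transition needs only $\sigma_\tau$ and $\beta_\tau$ (plus the filtered beliefs), which is exactly the sufficiency claim of Proposition 2.3.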
We can then derive, from a zs-POSG, a non-observable zero-sum game similar to Wiggers et al.'s plan-time NOSG [Wiggers et al., 2016a, Definition 4], but without assuming that the players' past strategies are public.
Definition 2.4 (zero-sum occupancy Markov Game (zs-oMG)). A zero-sum occupancy Markov game (zs-oMG)² is defined by the tuple $\langle \mathcal{O}^\sigma, \mathcal{B}, T, r, H, \gamma \rangle$, where:
- $\mathcal{O}^\sigma$ ($= \bigcup_{t=0}^{H-1} \mathcal{O}^\sigma_t$) is the set of OSs induced by the zs-POSG;
- $\mathcal{B}$ is the set of DR profiles of the zs-POSG;
- $T$ is the deterministic transition function in Eq. (1);
- $r$ is the reward function in Eq. (2); and
- $H$ and $\gamma$ are as in the zs-POSG ($b_0$ is not in the tuple but serves to define $T$ and $r$).
In this game, as in the zs-POSG, a player's solution is a behavioral strategy. Besides, the value of a strategy profile $\beta_{0:H-1}$ is the same for both the zs-oMG and the zs-POSG, so that they share the same $\epsilon$-NEV and $\epsilon$-NESs. We can thus work with zs-oMGs as a means to solve zs-POSGs.
The following aims at deriving a recursive expression of $V^*_0$, as well as continuity properties.
²We use (i) "Markov game" instead of "stochastic game" because the dynamics are not stochastic, and (ii) "partially observable stochastic game" to stick with the literature.