HSVI CAN SOLVE
ZERO-SUM PARTIALLY OBSERVABLE STOCHASTIC GAMES
A PREPRINT
Aurélien Delage
CITI
INSA Lyon
Villeurbanne
aurelien.delage@insa-lyon.fr
Olivier Buffet
INRIA - CNRS
Université de Lorraine
Villers-lès-Nancy
olivier.buffet@inria.fr
Jilles S. Dibangoye
CITI
INSA Lyon
Villeurbanne
jilles-steeve.dibangoye@inria.fr
Abdallah Saffidine
University of New South Wales
Sydney
abdallahs@cse.unsw.edu.au
October 27, 2022
ABSTRACT
State-of-the-art methods for solving 2-player zero-sum imperfect information games rely on linear
programming or regret minimization, though not on dynamic programming (DP) or heuristic
search (HS), while the latter are often at the core of state-of-the-art solvers for other sequential
decision-making problems. In partially observable or collaborative settings (e.g., POMDPs and Dec-
POMDPs), DP and HS require introducing an appropriate statistic that induces a fully observable
problem as well as bounding (convex) approximators of the optimal value function. This approach
has succeeded in some subclasses of 2-player zero-sum partially observable stochastic games (zs-
POSGs) as well, but how to apply it in the general case still remains an open question. We answer
it by (i) rigorously defining an equivalent game to work with, (ii) proving mathematical properties
of the optimal value function that allow deriving bounds that come with solution strategies, (iii)
proposing for the first time an HSVI-like solver that provably converges to an $\epsilon$-optimal solution in
finite time, and (iv) empirically analyzing it. This opens the door to a novel family of promising
approaches complementing those relying on linear programming or iterative methods.
1 Introduction
Solving imperfect information sequential games is a challenging field with many applications from playing Poker
[Kuhn, 1950] to security games [Basilico et al., 2016]. We focus on finite-horizon 2-player zero-sum partially
observable stochastic games (zs-POSGs), an important class of games that is equivalent to that of zero-sum extensive-form games (zs-EFGs) [Oliehoek and Vlassis, 2006].¹ From the viewpoint of (maximizing) player 1, we aim
at finding a strategy with a worst-case expected return (i.e., whatever player 2's strategy) within $\epsilon$ of the Nash
equilibrium value (NEV).
A first approach to solving a zs-POSG is to turn it into a zs-EFG addressed as a sequence form linear program
(SFLP) [Koller et al., 1996, von Stengel, 1996, Bošanský et al., 2014], giving rise to an exact algorithm. A second
approach is to use an iterative game solver, i.e., either a counterfactual-regret-based method (CFR) [Zinkevich et al.,
2007, Brown and Sandholm, 2018], or a first-order method [Hoda et al., 2010, Kroer et al., 2020], both coming with
asymptotic convergence properties. CFR-based approaches now incorporate deep reinforcement learning and search,
some of them winning against top human players at heads-up no-limit hold'em poker [Moravčík et al., 2017, Brown
and Sandholm, 2018, Brown et al., 2020]. A third approach, proposed by Wiggers [2015], is to use two parallel
searches in strategy space, one per player, so that the gap between both strategies’ security levels (i.e., the values of
their opponent’s best responses) bounds the distance to the NEV.
¹Note: POSGs are equivalent to the large class of "well-behaved" EFGs as defined by Kovařík et al. [2019].
arXiv:2210.14640v1 [cs.AI] 26 Oct 2022
In contrast, dynamic programming and heuristic search have not been applied to general zs-POSGs, while often at the
core of state-of-the-art solvers in other problem classes that involve Markovian dynamics, partial observability and
multiple agents (POMDP [Åström, 1965, Smith, 2007], Dec-POMDP [Szer et al., 2005, Dibangoye et al., 2016], or
subclasses of zs-POSGs with simplifying observability assumptions [Ghosh et al., 2004, Chatterjee and Doyen, 2014,
Basu and Stettner, 2015, Horák et al., 2017, Cole and Kocherlakota, 2001, Horák and Bošanský, 2019]). They all rely
on some statistic that induces a fully observable problem whose value function ($V$) exhibits continuity properties
that allow deriving bounding approximations. Wiggers et al. [2016b,a] progress in this direction for zs-POSGs by
demonstrating an important continuity property of the optimal value function, and proposing a reformulation as a
particular equivalent game. We work in a similar direction, 1. using a game with different observability hypotheses,
2. proving theoretical results they implicitly rely on, and 3. building on some of their results to derive an HSVI-like
algorithm solving the zs-POSG.
Section 2 presents some necessary background, including the concept of occupancy state [Dibangoye et al., 2016,
Wiggers et al., 2016a] (i.e., the probability distribution over the players’ past action-observation histories), and
properties that rely on it. Then, Section 3 describes theoretical contributions. First, we rigorously reformulate
the problem as a non-observable game, and demonstrate that the Nash equilibrium value can be expressed with a
recursive formula, which is a required tool for DP and HS (Sec. 3.1). Second, we exhibit novel continuity properties
of optimal value functions and derive bounding approximators, a second tool made necessary due to the continuous
state space of the new game, before showing that these approximators come with valid solution strategies for the
zs-POSG (Sec. 3.2). Third, we adapt the algorithmic scheme of Smith and Simmons' HSVI [Smith and Simmons, 2005]
to $\epsilon$-optimally solve the problem in finitely many iterations (Sec. 3.3). Section 4 presents an empirical analysis of the
approach. Section 5 discusses similarities and differences of our work with CFR-based continual resolving methods
before concluding.
2 Background
Here, we first give basic definitions about zs-POSGs, including the solution concept at hand. Then we introduce an
equivalent game where a state corresponds to a statistic summarizing past behaviors, which leads to some important
properties of the game’s optimal value.
2.1 zs-POSGs
Definition 2.1 (zs-POSG). As illustrated through a dynamic influence diagram in Figure 1, a (2-player) zero-sum
partially observable stochastic game (zs-POSG) is defined by a tuple $\langle \mathcal{S}, \mathcal{A}^1, \mathcal{A}^2, \mathcal{Z}^1, \mathcal{Z}^2, P, r, H, \gamma, b_0 \rangle$, where:
- $\mathcal{S}$ is a finite set of states;
- $\mathcal{A}^i$ is (player) $i$'s finite set of actions;
- $\mathcal{Z}^i$ is $i$'s finite set of observations;
- $P^{z^1,z^2}_{a^1,a^2}(s'|s)$ is the probability of transitioning to state $s'$ and receiving observations $z^1$ and $z^2$ when actions $a^1$ and $a^2$ are performed in state $s$;
- $r(s, a^1, a^2)$ is a (scalar) reward function;
- $H \in \mathbb{N}$ is a (finite) temporal horizon;
- $\gamma \in [0,1]$ is a discount factor; and
- $b_0$ is the initial belief state, i.e., a probability distribution over states at $t = 0$.
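To make the tuple concrete, the definition above can be mirrored by a small container type. The following is a minimal sketch in Python; the class name, field encodings, and the dictionary-based representation of $P$ are illustrative assumptions, not taken from the paper.

```python
from typing import NamedTuple, Callable, Dict, List, Tuple

class ZsPOSG(NamedTuple):
    """Hypothetical container mirroring <S, A1, A2, Z1, Z2, P, r, H, gamma, b0>."""
    S: List[str]     # finite state set
    A1: List[str]    # player 1's finite action set
    A2: List[str]    # player 2's finite action set
    Z1: List[str]    # player 1's finite observation set
    Z2: List[str]    # player 2's finite observation set
    # P[(z1, z2, a1, a2)][(s, s_next)] = Pr(s_next, z1, z2 | s, a1, a2)
    P: Dict[Tuple[str, str, str, str], Dict[Tuple[str, str], float]]
    r: Callable[[str, str, str], float]  # scalar reward r(s, a1, a2)
    H: int           # finite temporal horizon
    gamma: float     # discount factor in [0, 1]
    b0: Dict[str, float]  # initial belief: distribution over states at t = 0
```

A trivial one-state instance can then be built by filling in each field, which makes the finiteness assumptions of the definition explicit in code.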
From the Dec-POMDP, POSG and EFG literature, we use the following concepts and definitions:
- $\theta^i_\tau = (a^i_0, z^i_1, \ldots, a^i_{\tau-1}, z^i_\tau)$ is a length-$\tau$ action-observation history (AOH) for $i$. We note $\Theta^i_\tau$ the set of all AOHs for player $i$ at horizon $\tau$, such that any AOH $\theta^i_\tau$ is in $\Theta^i = \bigcup_{t=0}^{H-1} \Theta^i_t$.
- $\beta^i_\tau$ is a (behavioral) decision rule (DR) at $\tau$ for $i$, i.e., a mapping from private AOHs in $\Theta^i_\tau$ to distributions over private actions. $\beta^i_\tau(\theta^i_\tau, a^i)$ is the probability to pick $a^i$ when facing $\theta^i_\tau$.
- $\beta^i_{\tau:\tau'} = (\beta^i_\tau, \ldots, \beta^i_{\tau'})$ is a behavioral strategy for $i$ from time step $\tau$ to $\tau'$ (included).
When considering both players, the last 3 concepts become:
- $\theta_\tau = (\theta^1_\tau, \theta^2_\tau)$ ($\in \Theta = \bigcup_{t=0}^{H-1} \Theta_t$), a joint AOH at $\tau$;
- $\beta_\tau = \langle \beta^1_\tau, \beta^2_\tau \rangle$ ($\in \mathcal{B} = \bigcup_{t=0}^{H-1} \mathcal{B}_t$), a decision rule profile; and
- $\beta_{\tau:\tau'} = \langle \beta^1_{\tau:\tau'}, \beta^2_{\tau:\tau'} \rangle$, a behavioral strategy profile.
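These behavioral objects are straightforward to represent directly. A minimal sketch, under the illustrative assumptions that an AOH is encoded as a hashable tuple $(a_0, z_1, \ldots)$ and a decision rule as a dictionary from AOHs to action distributions:

```python
import random

# A behavioral decision rule beta_t maps a private AOH to a distribution
# over private actions: {aoh_tuple: {action: probability}}.
def sample_action(beta_t: dict, aoh: tuple, rng: random.Random) -> str:
    """Draw an action from the distribution beta_t(aoh)."""
    dist = beta_t[aoh]
    actions, probs = zip(*dist.items())
    return rng.choices(actions, weights=probs, k=1)[0]

# Example: a uniform rule over "head"/"tail" at the empty history.
beta = {(): {"a_h": 0.5, "a_t": 0.5}}
```

A behavioral strategy $\beta^i_{\tau:\tau'}$ would then simply be a list of such dictionaries, one per time step.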
[Figure 1 (diagram omitted): states $S_t$, private observations $Z^1_t, Z^2_t$, actions $A^1_t, A^2_t$, transitions $P^{z^1,z^2}_{a^1,a^2}(s'|s)$ and rewards $r(s, a^1, a^2)$; the state is hidden, each player only seeing its own actions and observations.]
Figure 1: Dynamic influence diagram representing the evolution of a zs-POSG
Nash Equilibria
Here, player 1 (respectively 2) wants to maximize (resp. minimize) the expected return, or value, of strategy profile $\beta_{0:H-1}$, defined as the discounted sum of future rewards, i.e.,
$$V_0(\beta_{0:H-1}) = \mathbb{E}\left[ \sum_{t=0}^{H-1} \gamma^t R_t \,\middle|\, \beta_{0:H-1} \right],$$
where $R_t$ is the random variable associated with the instant reward at $t$. This leads to the solution concept of Nash equilibrium strategy (NES).
Definition 2.2 (Nash Equilibrium). The strategy profile $\beta^*_{0:H-1} = \langle \beta^{1*}_{0:H-1}, \beta^{2*}_{0:H-1} \rangle$ is a NES if no player has an incentive to deviate, which can be written:
$$\forall \beta^1_{0:H-1}, \quad V_0(\beta^1_{0:H-1}, \beta^{2*}_{0:H-1}) \leq V_0(\beta^{1*}_{0:H-1}, \beta^{2*}_{0:H-1}), \quad \text{and}$$
$$\forall \beta^2_{0:H-1}, \quad V_0(\beta^{1*}_{0:H-1}, \beta^2_{0:H-1}) \geq V_0(\beta^{1*}_{0:H-1}, \beta^{2*}_{0:H-1}).$$
In such a game, all NESs have the same Nash equilibrium value (NEV), $V^*_0 \overset{\text{def}}{=} V_0(\beta^{1*}_{0:H-1}, \beta^{2*}_{0:H-1})$. Our specific objective is to find an $\epsilon$-NES, i.e., a behavioral strategy profile such that any player would gain at most $\epsilon$ by deviating.
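For intuition, in the one-shot (normal-form) special case this deviation test reduces to comparing each player's best-response value against the current value of the profile. A toy sketch, with the function name and matrix encoding being illustrative choices rather than anything from the paper:

```python
# Normal-form view: M[i][j] is the payoff to the maximizing player when
# player 1 plays row i and player 2 plays column j; x and y are mixed
# strategies. nes_gap returns the largest gain any player obtains by
# deviating, so a profile is an epsilon-NES iff nes_gap(M, x, y) <= epsilon.
def nes_gap(M, x, y):
    value = sum(x[i] * y[j] * M[i][j]
                for i in range(len(x)) for j in range(len(y)))
    # Best-response value of player 1 (max over rows) and player 2 (min over columns).
    br1 = max(sum(y[j] * M[i][j] for j in range(len(y))) for i in range(len(M)))
    br2 = min(sum(x[i] * M[i][j] for i in range(len(x))) for j in range(len(M[0])))
    return max(br1 - value, value - br2)

# Matching Pennies: uniform strategies form an exact NES (gap 0, NEV 0).
M = [[+1, -1], [-1, +1]]
assert abs(nes_gap(M, [0.5, 0.5], [0.5, 0.5])) < 1e-12
```

This is also the quantity behind Wiggers' parallel-search approach mentioned in the introduction: the gap between the two players' security levels bounds the distance to the NEV.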
Why writing a Bellman Optimality Equation is Hard
Our approach requires writing Bellman optimality equations. The main obstacle to achieving this is finding an
appropriate characterization of a subproblem that allows
1. predicting both the immediate reward and the next possible subproblems given an immediate decision;
2. connecting a subproblem’s solution with solutions of its own (lower-level) subproblems; and
[Figure 2 (tree omitted): game tree over states $s_i, s_h, s_t$ and actions $a_h, a_t$, with the nodes reached after player 1's $t=0$ move indistinguishable to player 2.]
Figure 2: Simplified tree representation of the sequentialized Matching Pennies game. Irrelevant actions, noted $*$, allow merging edges with the same action for (i) player 2 at $t = 0$, and (ii) player 1 at $t = 1$. Notes: (a) Due to irrelevant actions, this game can be seen as an extensive-form game, despite players acting simultaneously. (b) Players only know about their past action history (in this observation-free game).
3. prescribing a solution strategy for the subproblem built on solutions of lower-level subproblems.
In our setting, a player's AOH does not characterize a subproblem, since the opponent's strategy is also required to
predict the expected reward and the next AOHs. For their part, joint AOHs allow predicting next joint AOHs given
both players' immediate decision rules, but would not be appropriate either, since player $i$ cannot decide how to act
when facing some individual AOH $\theta^i_\tau$ without considering all possible AOHs of her opponent $\neg i$.
Partial behavioral strategy profiles (sequences of behavioral decision rule profiles from
t= 0
to some
τ
) contain
enough information to completely describe the situation at
τ
, and are thus necessarily predictive. We still need to
demonstrate that they are connected, despite decision rules not being public, and prescriptive, despite the need to
address global-consistency issues illustrated in the following example.
Example 1. Matching Pennies is a well-known 2-player zero-sum game in which each player has a penny and secretly
chooses one side (head or tail). Then, both pennies' sides are revealed, and player 1 wins (payoff of $+1$) if both
chosen sides match and loses (payoff of $-1$) if not.
We here formalize this game as a zs-POSG (as illustrated in Figure 2) where player 1 actually picks his action at
$t = 0$, and player 2 at $t = 1$. Hence the tuple $\langle \mathcal{S}, \mathcal{A}^1, \mathcal{A}^2, \mathcal{Z}^1, \mathcal{Z}^2, P, r, H, \gamma, b_0 \rangle$ where:
- $\mathcal{S} = \{s_i, s_h, s_t\}$, where $s_i$ is the initial state, and $s_h$ and $s_t$ represent a memory of 1's last move: respectively "head" or "tail";
- $\mathcal{A}^1 = \mathcal{A}^2 = \{a_h, a_t\}$ for playing "head" ($a_h$) or "tail" ($a_t$);
- $\mathcal{Z}^1 = \mathcal{Z}^2 = \{z_n\}$, a trivial "none" observation;
- $P^{z}_{a}(s'|s) = T(s, a, s') \cdot O(a, s', z)$, using the next two definitions;
- $T$ is deterministic, the next state memorizing the last relevant move: playing $a_h$ leads to $s_h$ and playing $a_t$ leads to $s_t$;
- $O$ is deterministic and always returns $z_n$;
- $r$ is such that $r(s_i, \cdot, \cdot) = 0$, $r(s_t, \cdot, a_t) = r(s_h, \cdot, a_h) = +1$, and $r(s_t, \cdot, a_h) = r(s_h, \cdot, a_t) = -1$ ("$\cdot$" is used to denote "for all");
- $H = 2$;
- $\gamma = 1$; and
- $b_0$ is such that the initial state is $s_i$ with probability 1.
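This example is small enough to transcribe directly. A sketch of its data in Python, where the state and action names follow the example but the encoding itself is an assumed choice:

```python
# Matching Pennies as a zs-POSG (Example 1), transcribed as plain data.
S = ["s_i", "s_h", "s_t"]
A = ["a_h", "a_t"]          # A1 = A2
Z = ["z_n"]                 # single trivial "none" observation

def T(s, a1, a2):
    """Deterministic transition: the next state memorizes player 1's move."""
    return "s_h" if a1 == "a_h" else "s_t"

def r(s, a1, a2):
    """Reward: 0 in the initial state, +1 on a match, -1 otherwise."""
    if s == "s_i":
        return 0.0
    match = (s == "s_h" and a2 == "a_h") or (s == "s_t" and a2 == "a_t")
    return 1.0 if match else -1.0

H, gamma = 2, 1.0
b0 = {"s_i": 1.0}

# Sanity check: with beta1_0 uniform, player 1's expected payoff is 0
# whatever deterministic action player 2 picks at t = 1.
for a2 in A:
    v = sum(0.5 * r(T("s_i", a1, "*"), "*", a2) for a1 in A)
    assert abs(v) < 1e-12
```

The final loop illustrates why the uniform $\beta^1_0$ induces a NES: every response of player 2 yields the same value 0.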
Let us then assume that both players' DRs at $t = 0$ are fixed, with $\beta^1_0$ randomly picking $a_t$ or $a_h$ (i.e., it induces a NES whatever his DR at $t = 1$). Then, we face a "subgame" at $t = 1$ where any strategy profile $\langle \beta^1_{1:1}, \beta^2_{1:1} \rangle$ is a NES
profile with Nash equilibrium value 0. In particular, 2 can pick a deterministic strategy $\beta^2_{1:1}$, which will be said to be locally consistent. Yet, for 2, such a NES in the subgame at $\tau = 1$ is not necessarily globally consistent, i.e., it may not be part of a NES for the original game (i.e., starting from $\tau = 0$). Intuitively, in such global-consistency issues [Schmid, 2021] (also called safety issues [Burch et al., 2014]), the choices made at later time steps do not account for possible deviations from the opponent at earlier time steps.
As detailed in the next section, we will characterize a subproblem not with the raw data of partial strategy profiles,
but with a sufficient statistic, and this characterization will be used as the state of a new dynamic game equivalent to
the zs-POSG.
2.2 Occupancy State and Occupancy Markov Game
We now introduce an equivalent game, in which trajectories correspond to behavioral strategy profiles, and which we
will be able to decompose temporally (and recursively), a first key tool for DP and HS.
To cope with the necessarily continuous nature of its state space, we will set this game in occupancy space, i.e., a
statistic that sums up past DR profiles. This will let us derive continuity properties on which to build point-based
approximators.
Following Wiggers et al. [2016a], let us formally define an occupancy state (OS) $\sigma_{\beta_{0:\tau-1}}$ as the probability distribution over joint AOHs $\theta_\tau$ given partial strategy profile $\beta_{0:\tau-1}$. This statistic exhibits the usual Markov and sufficiency properties:
Proposition 2.3 (Adapted from Dibangoye et al. [2016, Thm. 1] – Proof in App. B.1). $\sigma_{\beta_{0:\tau-1}}$, together with $\beta_\tau$, is a sufficient statistic to compute (i) the next OS, $T(\sigma_{\beta_{0:\tau-1}}, \beta_\tau) \overset{\text{def}}{=} \sigma_{\beta_{0:\tau}}$, and (ii) the expected reward at $\tau$: $r(\sigma_{\beta_{0:\tau-1}}, \beta_\tau) \overset{\text{def}}{=} \mathbb{E}[R_\tau \mid \beta_{0:\tau-1} \oplus \beta_\tau]$, where $\oplus$ denotes a concatenation.
Writing from now on $\sigma_\tau$, as short for $\sigma_{\beta_{0:\tau-1}}$, the OS associated with some prefix strategy profile $\beta_{0:\tau-1}$, the proof essentially relies on deriving the following formulas: $\forall \theta^1_\tau, a^1, z^1, \theta^2_\tau, a^2, z^2$,
$$T(\sigma_\tau, \beta_\tau)((\theta^1_\tau, a^1, z^1), (\theta^2_\tau, a^2, z^2)) \overset{\text{def}}{=} Pr((\theta^1_\tau, a^1, z^1), (\theta^2_\tau, a^2, z^2) \mid \sigma_\tau, \beta_\tau) = \beta^1_\tau(\theta^1_\tau, a^1)\, \beta^2_\tau(\theta^2_\tau, a^2)\, \sigma_\tau(\theta_\tau) \sum_{s,s'} P^{z}_{a}(s'|s)\, b(s|\theta_\tau), \quad (1)$$
where $b(s|\theta_\tau)$ is a belief state obtained by Hidden Markov Model filtering; and
$$r(\sigma_\tau, \beta_\tau) \overset{\text{def}}{=} \mathbb{E}[r(S, A^1, A^2) \mid \sigma_\tau, \beta_\tau] = \sum_{s, \theta_\tau, a} \sigma_\tau(\theta_\tau)\, b(s|\theta_\tau)\, \beta^1_\tau(a^1 \mid \theta^1_\tau)\, \beta^2_\tau(a^2 \mid \theta^2_\tau)\, r(s, a). \quad (2)$$
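Equations (1) and (2) translate almost verbatim into code. The following is a sketch over dictionary-based distributions; all names are illustrative, and the beliefs $b(s|\theta_\tau)$ are assumed to have been precomputed by HMM filtering:

```python
# sigma maps joint AOHs (th1, th2) to probabilities; beta1/beta2 map
# (private_aoh, action) to probabilities; belief maps (state, joint_aoh)
# to Pr(s | theta); P(z1, z2, a1, a2, s_next, s) = Pr(s_next, z1, z2 | s, a1, a2).
def next_occupancy(sigma, beta1, beta2, belief, P, states, A1, A2, Z1, Z2):
    """Eq. (1): deterministic occupancy-state transition T(sigma, beta)."""
    nxt = {}
    for (th1, th2), p in sigma.items():
        for a1 in A1:
            for a2 in A2:
                w = p * beta1.get((th1, a1), 0.0) * beta2.get((th2, a2), 0.0)
                if w == 0.0:
                    continue
                for z1 in Z1:
                    for z2 in Z2:
                        # sum_{s, s'} P^{z1,z2}_{a1,a2}(s'|s) b(s|theta)
                        q = sum(P(z1, z2, a1, a2, s2, s) * belief[(s, (th1, th2))]
                                for s in states for s2 in states)
                        key = (th1 + (a1, z1), th2 + (a2, z2))
                        nxt[key] = nxt.get(key, 0.0) + w * q
    return nxt

def expected_reward(sigma, beta1, beta2, belief, r, states, A1, A2):
    """Eq. (2): expected immediate reward r(sigma, beta)."""
    return sum(p * belief[(s, th)]
               * beta1.get((th[0], a1), 0.0) * beta2.get((th[1], a2), 0.0)
               * r(s, a1, a2)
               for th, p in sigma.items()
               for s in states for a1 in A1 for a2 in A2)
```

Note how the transition needs only $\sigma_\tau$ and $\beta_\tau$ (plus the filtered beliefs), which is exactly the sufficiency claim of Proposition 2.3.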
We can then derive, from a zs-POSG, a non-observable zero-sum game similar to Wiggers et al.'s plan-time NOSG [Wiggers et al., 2016a, Definition 4], but without assuming that the players' past strategies are public.
Definition 2.4 (zero-sum occupancy Markov Game (zs-oMG)). A zero-sum occupancy Markov game (zs-oMG)² is defined by the tuple $\langle \mathcal{O}^\sigma, \mathcal{B}, T, r, H, \gamma \rangle$, where:
- $\mathcal{O}^\sigma$ ($= \bigcup_{t=0}^{H-1} \mathcal{O}^\sigma_t$) is the set of OSs induced by the zs-POSG;
- $\mathcal{B}$ is the set of DR profiles of the zs-POSG;
- $T$ is the deterministic transition function in Eq. (1);
- $r$ is the reward function in Eq. (2); and
- $H$ and $\gamma$ are as in the zs-POSG ($b_0$ is not in the tuple but serves to define $T$ and $r$).
In this game, as in the zs-POSG, a player's solution is a behavioral strategy. Besides, the value of a strategy profile $\beta_{0:H-1}$ is the same for both the zs-oMG and the zs-POSG, so that they share the same $\epsilon$-NEV and $\epsilon$-NESs. We can thus work with zs-oMGs as a means to solve zs-POSGs.
The following aims at deriving a recursive expression of $V^*_0$, as well as continuity properties.
²We use (i) "Markov game" instead of "stochastic game" because the dynamics are not stochastic, and (ii) "partially observable stochastic game" to stick with the literature.