Efficient Global Planning in Large MDPs via Stochastic Primal-Dual Optimization

arXiv:2210.12057v2 [cs.LG] 31 Jan 2023
Proceedings of Machine Learning Research vol 201:1–23, 2023. 34th International Conference on Algorithmic Learning Theory
Efficient Global Planning in Large MDPs
via Stochastic Primal-Dual Optimization
Gergely Neu GERGELY.NEU@GMAIL.COM
Universitat Pompeu Fabra, Barcelona, Spain
Nneka Okolo NNEKAMAUREEN.OKOLO@UPF.EDU
Universitat Pompeu Fabra, Barcelona, Spain
Editors: Shipra Agrawal and Francesco Orabona
Abstract
We propose a new stochastic primal-dual optimization algorithm for planning in a large discounted
Markov decision process with a generative model and linear function approximation. Assuming
that the feature map approximately satisfies standard realizability and Bellman-closedness condi-
tions and also that the feature vectors of all state-action pairs are representable as convex combina-
tions of a small core set of state-action pairs, we show that our method outputs a near-optimal policy
after a polynomial number of queries to the generative model. Our method is computationally effi-
cient and comes with the major advantage that it outputs a single softmax policy that is compactly
represented by a low-dimensional parameter vector, and does not need to execute computationally
expensive local planning subroutines in runtime.
Keywords: Markov decision processes, Linear Programming, Linear function approximation,
Planning with a generative model.
1. Introduction
Finding near-optimal policies in large Markov decision processes (MDPs) is one of the most im-
portant tasks encountered in model-based reinforcement learning (Sutton and Barto, 2018). This
problem (commonly referred to as planning) presents both computational and statistical challenges
when having access only to a generative model of the environment: an efficient planning method
should use little computation and few samples drawn from the model. While the complexity of plan-
ning in small Markov decision processes is already well-understood by now (cf. Azar et al., 2013;
Sidford et al., 2018; Agarwal et al., 2020b), extending the insights to large state spaces with func-
tion approximation has remained challenging. One particular challenge that remains unaddressed
in the present literature is going beyond local planning methods that require costly online compu-
tations for producing each action during execution time. In this work, we advance the state of the
art in planning for large MDPs by designing and analyzing a new global planning algorithm that
outputs a simple, compactly represented policy after performing all of its computations in an offline
fashion.
Our method is rooted in the classic linear programming framework for sequential decision mak-
ing first proposed by Manne (1960); d’Epenoux (1963); Denardo (1970). This approach formulates
the problem of finding an optimal behavior policy as a linear program (LP) with a number of vari-
ables and constraints that is proportional to the size of the state-action space of the MDP. Over the
last several decades, numerous attempts have been made to derive computationally tractable algo-
rithms from the LP framework, most notably via reducing the number of variables using function
approximation techniques. This idea has been first explored by Schweitzer and Seidmann (1985),
whose ideas were further developed and popularized by the seminal work of De Farias and Van Roy
(2003). Further approximations were later proposed by De Farias and Van Roy (2004); Petrik and Zilberstein
(2009); Lakshminarayanan et al. (2017), whose main focus was reducing the complexity of the LPs
while keeping the quality of optimal solutions under control.
More recently, the LP framework has started to inspire practical methods for reinforcement
learning and, more specifically, planning with a generative model. The most relevant to the present
paper is the line of work initiated by Wang and Chen (2016); Wang (2017), who proposed plan-
ning algorithms based on primal-dual optimization of the Lagrangian associated with the classic
LP. Over the last several years, this computational recipe has been perfected for small MDPs by
works like Cheng et al. (2020); Jin and Sidford (2020); Tiapkin and Gasnikov (2022), and even
some extensions using linear function approximation have been proposed by Chen et al. (2018);
Bas-Serrano and Neu (2020). While these methods have successfully reduced the number of primal
and dual variables, they all require stringent conditions on the function approximator being used and
their overall sample complexity scales poorly with the size of the MDP. For instance, the bounds
of Chen et al. (2018) scale with a so-called “concentrability coefficient” that can be as large as the
size of the state space, thus failing to yield meaningful guarantees for large MDPs. Furthermore,
these methods require parametrizing the space of so-called state-action occupancy measures, which
is undesirable given the complexity of said space.
In the present work, we develop a new stochastic primal-dual method based on a relaxed
version of the classical LP. This relaxed LP (inspired by the works of Mehta and Meyn, 2009;
Neu and Pike-Burke, 2020; and Bas-Serrano et al., 2021) features a linearly parametrized action-
value function and is guaranteed to produce optimal policies as solutions under well-studied conditions
like the linear MDP assumption of Yang and Wang (2019); Jin et al. (2020). Our method iteratively
updates the primal and dual variables via a combination of stochastic mirror descent steps and a
set of implicit update rules tailor-made for the relaxed LP. Under a so-called core state-action-pair
assumption, we show that the method produces a near-optimal policy with sample and computa-
tion complexity that is polynomial in the relevant problem parameters: the size of the core set,
the dimensionality of the feature map, the effective horizon, and the desired accuracy. Additional
assumptions required by our analysis are the near-realizability of the Q-functions and closedness
under the Bellman operators of all policies. The main merit of our algorithm is that it produces a
compactly represented softmax policy which requires no access to the generative model in runtime.
The works most directly comparable to ours are Wang et al. (2021), Shariff and Szepesvári
(2020), and Yin et al. (2022)—their common feature being the use of a core set of states or state-
action pairs for planning. Wang et al. (2021) provide a sample complexity bound for finding a
near-optimal policy in linear MDPs, but their algorithm is computationally intractable¹ due to featuring
a planning subroutine in a linear MDP. Shariff and Szepesvári (2020) and Yin et al. (2022)
both propose local planning methods that require extensive computation in runtime, but on the other
hand some of their assumptions are weaker than ours. In particular, they only require realizability
of the value functions but continue to operate without the function class being closed under all Bell-
man operators. For a detailed discussion on the role and necessity of such assumptions, we refer to
the excellent discussion provided by Yin et al. (2022) on the subject.
1. At least we are not aware of a computationally efficient global planning method that works for the discounted case
they consider.
Notation. In the $n$-dimensional Euclidean space $\mathbb{R}^n$, we denote the vector of all ones by $\mathbf{1}$, the zero vector by $\mathbf{0}$, and the $i$-th coordinate vector by $e_i$. For vectors $a, b \in \mathbb{R}^m$, we use $a \le b$ to denote elementwise comparison, meaning that $a_i \le b_i$ is satisfied for all entries $i$. For any finite set $\mathcal{D}$, $\Delta_{\mathcal{D}} = \{ p \in \mathbb{R}_+^{|\mathcal{D}|} \,:\, \|p\|_1 = 1 \}$ denotes the set of all probability distributions over its entries. We define the relative entropy between two distributions $p, p' \in \Delta_{\mathcal{D}}$ as $\mathcal{D}(p \,\|\, p') = \sum_{i=1}^{|\mathcal{D}|} p_i \log \frac{p_i}{p'_i}$. In the context of iterative algorithms, we will use the notation $\mathcal{F}_{t-1}$ to refer to the sigma-algebra generated by all events up to the end of iteration $t-1$, and use the shorthand notation $\mathbb{E}_t[\cdot] = \mathbb{E}[\cdot \,|\, \mathcal{F}_{t-1}]$ to denote expectation conditional on the past history encountered by the method.
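For concreteness, the relative entropy above can be evaluated as in the following minimal sketch (our illustration of the definition only, using the usual convention $0 \log 0 = 0$):

```python
import numpy as np

def relative_entropy(p, q):
    """D(p || q) = sum_i p_i log(p_i / q_i) for two distributions on the same
    finite set, with the convention 0 * log(0 / q_i) = 0."""
    p, q = np.asarray(p, dtype=float), np.asarray(q, dtype=float)
    mask = p > 0
    return float(np.sum(p[mask] * np.log(p[mask] / q[mask])))

print(relative_entropy([0.5, 0.3, 0.2], [0.4, 0.4, 0.2]))  # nonnegative, zero iff p == q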
2. Preliminaries
We study a discounted Markov decision process (Puterman, 2014) denoted by the quintuple $(\mathcal{X}, \mathcal{A}, r, P, \gamma)$, with $\mathcal{X}$ and $\mathcal{A}$ representing finite (but potentially very large) state and action spaces of cardinality $X = |\mathcal{X}|$ and $A = |\mathcal{A}|$, respectively. The reward function is denoted by $r: \mathcal{X} \times \mathcal{A} \to \mathbb{R}$, and the transition function by $P: \mathcal{X} \times \mathcal{A} \to \Delta_{\mathcal{X}}$. We will often represent the reward function by a vector in $\mathbb{R}^{XA}$ and the transition function by the operator $P \in \mathbb{R}^{XA \times X}$, which acts on functions $v \in \mathbb{R}^{X}$ by assigning $(Pv)(x,a) = \sum_{x'} P(x'|x,a)\, v(x')$ for all $x, a$. Its adjoint $P^{\mathsf{T}} \in \mathbb{R}^{X \times XA}$ is similarly defined on functions $u \in \mathbb{R}^{XA}$ via the assignment $(P^{\mathsf{T}} u)(x') = \sum_{x,a} P(x'|x,a)\, u(x,a)$ for all $x'$. We also define the operator $E \in \mathbb{R}^{XA \times X}$ and its adjoint $E^{\mathsf{T}} \in \mathbb{R}^{X \times XA}$, acting on respective vectors $v \in \mathbb{R}^{X}$ and $u \in \mathbb{R}^{XA}$ through the assignments $(Ev)(x,a) = v(x)$ and $(E^{\mathsf{T}} u)(x) = \sum_a u(x,a)$. For simplicity, we assume the rewards are bounded in $[0,1]$ and let $\mathcal{Z} = \{(x,a) \,|\, x \in \mathcal{X}, a \in \mathcal{A}\}$ denote the set of all possible state-action pairs, with cardinality $Z = |\mathcal{Z}|$, to be used when necessary.
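To make the operator notation concrete, the following minimal NumPy sketch (our illustration, with hypothetical sizes $X = 4$ and $A = 2$ and a state-major ordering of the pairs $(x,a)$) instantiates $P$ and $E$ as matrices for a randomly generated MDP and checks the defining assignments and the adjointness numerically.

```python
import numpy as np

rng = np.random.default_rng(0)
X, A = 4, 2                              # toy state and action space sizes
Z = X * A                                # number of state-action pairs, state-major order

# Transition operator P as an XA-by-X matrix: row (x, a) holds P(. | x, a).
P = rng.dirichlet(np.ones(X), size=Z)
# Operator E with (E v)(x, a) = v(x): copies a state function to every action.
E = np.kron(np.eye(X), np.ones((A, 1)))

v = rng.uniform(size=X)                  # a test function of the state
u = rng.uniform(size=Z)                  # a test function of the state-action pair

Pv = P @ v                               # (P v)(x, a) = sum_x' P(x' | x, a) v(x')
PTu = P.T @ u                            # (P^T u)(x') = sum_{x, a} P(x' | x, a) u(x, a)
ETu = E.T @ u                            # (E^T u)(x)  = sum_a u(x, a)

# Adjointness: <P v, u> = <v, P^T u> and <E v, u> = <v, E^T u>.
assert np.isclose(Pv @ u, v @ PTu)
assert np.isclose((E @ v) @ u, v @ ETu)
```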
The Markov decision process describes a sequential decision-making process where in each round $t = 0, 1, \ldots$, the agent observes the state of the environment $x_t$, takes an action $a_t$, and earns a potentially random reward with expectation $r(x_t, a_t)$. The state of the environment in the next round $t+1$ is generated randomly according to the transition dynamics as $x_{t+1} \sim P(\cdot|x_t, a_t)$. The initial state of the process $x_0$ is drawn from a fixed distribution $\nu_0 \in \Delta_{\mathcal{X}}$. In the setting we consider, the agent's goal is to maximize its normalized discounted return $(1-\gamma)\,\mathbb{E}\!\left[\sum_{t=0}^{\infty} \gamma^t r(x_t, a_t)\right]$, where $\gamma \in (0,1)$ is the discount factor, and the expectation is taken over the random transitions generated by the environment, the random initial state, and the potential randomness injected by the agent. It is well-known that maximizing the discounted return can be achieved by following a memoryless, time-independent decision-making rule mapping states to actions. Thus, we restrict our attention to stationary stochastic policies $\pi: \mathcal{X} \to \Delta_{\mathcal{A}}$, with $\pi(a|x)$ denoting the probability of the policy taking action $a$ in state $x$. We define the mean operator $M^{\pi}: \mathbb{R}^{XA} \to \mathbb{R}^{X}$ with respect to policy $\pi$ that acts on functions $Q \in \mathbb{R}^{X \times A}$ as $(M^{\pi} Q)(x) = \sum_a \pi(a|x)\, Q(x,a)$, and the (non-linear) max operator $M: \mathbb{R}^{XA} \to \mathbb{R}^{X}$ that acts as $(MQ)(x) = \max_{a \in \mathcal{A}} Q(x,a)$. The value function and action-value function of policy $\pi$ are respectively defined as $V^{\pi}(x) = \mathbb{E}_{\pi}\!\left[\sum_{t=0}^{\infty} \gamma^t r(x_t, a_t) \,\middle|\, x_0 = x\right]$ and $Q^{\pi}(x,a) = \mathbb{E}_{\pi}\!\left[\sum_{t=0}^{\infty} \gamma^t r(x_t, a_t) \,\middle|\, x_0 = x, a_0 = a\right]$, where the notation $\mathbb{E}_{\pi}[\cdot]$ signifies that each action is selected by following the policy as $a_t \sim \pi(\cdot|x_t)$. The value functions of a policy $\pi$ are known to satisfy the Bellman equations (Bellman, 1966), and the value functions of an optimal policy $\pi^{*}$ satisfy the Bellman optimality equations, which can be conveniently written in our notation as
\[
Q^{\pi} = r + \gamma P V^{\pi} \qquad \text{and} \qquad Q^{*} = r + \gamma P V^{*}, \tag{1}
\]
with $V^{\pi} = M^{\pi} Q^{\pi}$ and $V^{*} = M Q^{*}$. Any optimal policy $\pi^{*}$ satisfies $\operatorname{supp}(\pi^{*}(\cdot|x)) \subseteq \arg\max_a Q^{*}(x,a)$.
We refer the reader to Puterman (2014) for the standard proofs of these fundamental results.
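For a small MDP, the Bellman equations (1) for a fixed policy can be solved directly as a linear system. The following sketch (a toy illustration with randomly generated transitions and rewards, not part of the paper's method) computes $Q^{\pi}$ from $(I - \gamma P M^{\pi}) Q^{\pi} = r$ under a state-major ordering of the pairs $(x,a)$ and verifies the fixed point.

```python
import numpy as np

rng = np.random.default_rng(0)
X, A, gamma = 4, 2, 0.9
Z = X * A
P = rng.dirichlet(np.ones(X), size=Z)        # P(. | x, a), rows in state-major order
r = rng.uniform(size=Z)                      # rewards in [0, 1]
pi = rng.dirichlet(np.ones(A), size=X)       # a fixed stochastic policy pi(a | x)

# Mean operator M^pi as an X-by-XA matrix: (M^pi Q)(x) = sum_a pi(a | x) Q(x, a).
M_pi = np.zeros((X, Z))
for x in range(X):
    M_pi[x, x * A:(x + 1) * A] = pi[x]

# Bellman equation (1): Q^pi = r + gamma * P V^pi with V^pi = M^pi Q^pi,
# i.e. (I - gamma * P M^pi) Q^pi = r, a linear system of size XA.
Q_pi = np.linalg.solve(np.eye(Z) - gamma * P @ M_pi, r)
V_pi = M_pi @ Q_pi
assert np.allclose(Q_pi, r + gamma * P @ V_pi)
```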
Our approach taken in this paper will primarily make use of an alternative formulation of the
MDP optimization based on linear programming, due to Manne (1960) (see also d'Epenoux, 1963,
Denardo, 1970, and Section 6.9 in Puterman, 2014). This formulation phrases the optimal control
problem as a search for an occupancy measure with maximal return. The state-action occupancy
measure of a policy $\pi$ is defined as
\[
\mu^{\pi}(x,a) = (1-\gamma)\, \mathbb{E}_{\pi}\!\left[ \sum_{t=0}^{\infty} \gamma^{t}\, \mathbb{I}_{\{(x_t, a_t) = (x,a)\}} \right],
\]
which allows rewriting the expected normalized return of $\pi$ as $R^{\pi}_{\gamma} = \langle \mu^{\pi}, r \rangle$. The corresponding state-occupancy measure is defined as the state distribution $\nu^{\pi}(x) = \sum_a \mu^{\pi}(x,a)$. Denoting the direct product of a state distribution $\nu$ and the policy $\pi$ by $\nu \circ \pi$, with its entries being $(\nu \circ \pi)(x,a) = \nu(x)\pi(a|x)$, we can notice that the state-action occupancy measure induced by $\pi$ satisfies $\mu^{\pi} = \nu^{\pi} \circ \pi$. The set of all occupancy measures can be fully characterized by a set of linear constraints, which allows rewriting the optimal control problem as the following linear program (LP):
\[
\begin{aligned}
\max_{\mu \in \mathbb{R}^{XA}_{+}} \quad & \langle \mu, r \rangle \\
\text{subject to} \quad & E^{\mathsf{T}} \mu = (1-\gamma)\nu_0 + \gamma P^{\mathsf{T}} \mu.
\end{aligned} \tag{2}
\]
Any optimal solution $\mu^{*}$ of this LP can be shown to correspond to the occupancy measure of an optimal policy $\pi^{*}$, and generally policies can be extracted from feasible points $\mu$ via the rule $\pi_{\mu}(a|x) = \mu(x,a) / \sum_{a'} \mu(x,a')$ (subject to the denominator being nonzero). The dual of this linear program takes the following form:
\[
\begin{aligned}
\min_{V \in \mathbb{R}^{X}} \quad & (1-\gamma) \langle \nu_0, V \rangle \\
\text{subject to} \quad & E V \ge r + \gamma P V.
\end{aligned} \tag{3}
\]
The optimal value function $V^{*}$ is known to be an optimal solution for this LP, and is the unique optimal solution provided that $\nu_0$ has full support over the state space.
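To make the primal LP (2) concrete, the sketch below solves it for a tiny randomly generated MDP with scipy.optimize.linprog and extracts a policy via $\pi_{\mu}(a|x) = \mu(x,a)/\sum_{a'} \mu(x,a')$. Such a brute-force solve is of course only possible for very small state spaces; it illustrates the LP itself, not the method developed in this paper.

```python
import numpy as np
from scipy.optimize import linprog

rng = np.random.default_rng(0)
X, A, gamma = 4, 2, 0.9
Z = X * A
P = rng.dirichlet(np.ones(X), size=Z)        # P(. | x, a), state-major ordering
r = rng.uniform(size=Z)
E = np.kron(np.eye(X), np.ones((A, 1)))      # (E v)(x, a) = v(x)
nu0 = np.full(X, 1.0 / X)                    # uniform initial state distribution

# LP (2): max <mu, r>  s.t.  (E^T - gamma * P^T) mu = (1 - gamma) * nu0,  mu >= 0.
res = linprog(c=-r,                          # linprog minimizes, so negate the reward
              A_eq=E.T - gamma * P.T,
              b_eq=(1 - gamma) * nu0,
              bounds=(0, None))
mu_star = res.x                              # (near-)optimal occupancy measure

# Return of the solution and policy extraction via
# pi_mu(a | x) = mu(x, a) / sum_a mu(x, a).
print("optimal normalized return:", mu_star @ r)
mu_xa = mu_star.reshape(X, A)
pi_star = mu_xa / np.maximum(mu_xa.sum(axis=1, keepdims=True), 1e-12)
```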
The above linear programs are not directly suitable as solution tools for large MDPs, given that
they have a large number of variables and constraints. Numerous adjustments have been proposed
over the last decades to address this issue (Schweitzer and Seidmann, 1985; De Farias and Van Roy,
2003, 2004; Lakshminarayanan et al., 2017; Bas-Serrano and Neu, 2020). One recurring theme in
these alternative formulations is relaxing some of the constraints by the introduction of a feature
map. In this work, we use as starting point an alternative proposed by Bas-Serrano et al. (2021),
who use a feature map $\varphi: \mathcal{X} \times \mathcal{A} \to \mathbb{R}^{d}$, represented by the $XA \times d$ feature matrix $\Phi$, and rephrase
the original optimization problem as the following LP:
\[
\begin{aligned}
\max_{\mu, u \in \mathbb{R}^{XA}_{+}} \quad & \langle \mu, r \rangle \\
\text{subject to} \quad & E^{\mathsf{T}} u = (1-\gamma)\nu_0 + \gamma P^{\mathsf{T}} \mu, \\
& \Phi^{\mathsf{T}} \mu = \Phi^{\mathsf{T}} u.
\end{aligned} \tag{4}
\]
The dual of this LP is given as
\[
\begin{aligned}
\min_{\theta \in \mathbb{R}^{d},\, V \in \mathbb{R}^{X}} \quad & (1-\gamma) \langle \nu_0, V \rangle \\
\text{subject to} \quad & \Phi \theta \ge r + \gamma P V, \\
& E V \ge \Phi \theta.
\end{aligned}
\]
As shown by Bas-Serrano et al. (2021), optimal solutions of the above relaxed LPs correspond to optimal occupancy measures and value functions under the popular linear MDP assumption of Yang and Wang (2019); Jin et al. (2020). The key merit of these LPs is that the dual features a linearly parametrized action-value function $Q_{\theta} = \Phi \theta$, which allows the extraction of a simple greedy policy $\pi_{\theta}(a|x) = \mathbb{I}_{\{a = \arg\max_{a'} Q_{\theta}(x,a')\}}$ from any dual solution $\theta$, and in particular an optimal policy can be extracted from its optimal solution.
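To illustrate how compactly such a dual solution represents a policy, the following sketch (with a hypothetical random feature matrix $\Phi$ and parameter vector $\theta$) materializes $Q_{\theta} = \Phi\theta$ and the induced greedy policy.

```python
import numpy as np

rng = np.random.default_rng(0)
X, A, d = 4, 2, 3
Phi = rng.uniform(size=(X * A, d))           # hypothetical feature matrix (state-major rows)
theta = rng.normal(size=d)                   # some dual solution theta

# Q_theta = Phi theta, and the greedy policy
# pi_theta(a | x) = I{a = argmax_a' Q_theta(x, a')}.
Q_theta = (Phi @ theta).reshape(X, A)
pi_theta = np.zeros((X, A))
pi_theta[np.arange(X), Q_theta.argmax(axis=1)] = 1.0
```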
3. A tractable linear program for large MDPs with linear function approximation
The downside of the classical linear programs defined in the previous sections is that they all feature
a large number of variables and constraints, even after successive relaxations. In what follows, we
offer a new version of the LP (4) that gets around this issue and allows the development of a tractable
primal-dual algorithm with strong performance guarantees. In particular, we will reduce the number
of variables in the primal LP (4) by considering only sparsely supported state-action distributions
instead of full occupancy measures, which will be justified by the following technical assumption
made on the MDP structure:
Assumption 1 (Approximate Core State-Action Assumption) The feature vector of any state-action pair $(x,a) \in \mathcal{Z}$ can be approximately expressed as a convex combination of features evaluated at a set of $m$ core state-action pairs $(\tilde{x}, \tilde{a}) \in \widetilde{\mathcal{Z}}$, up to an error $\Delta_{\mathrm{core}} \in \mathbb{R}^{XA \times d}$. That is, for each $(x,a) \in \mathcal{Z}$, there exists a set of coefficients satisfying $b(\tilde{x}, \tilde{a} \,|\, x, a) \ge 0$ and $\sum_{\tilde{x}, \tilde{a}} b(\tilde{x}, \tilde{a} \,|\, x, a) = 1$ such that $\varphi(x,a) = \sum_{\tilde{x}, \tilde{a}} b(\tilde{x}, \tilde{a} \,|\, x, a)\, \varphi(\tilde{x}, \tilde{a}) + \Delta_{\mathrm{core}}(x,a)$. Furthermore, for every $(x,a)$, the misspecification error satisfies $\|\Delta_{\mathrm{core}}(x,a)\|_{2} = \varepsilon_{\mathrm{core}}(x,a)$.
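One way to compute such coefficients (not prescribed by the paper) is to project $\varphi(x,a)$ onto the convex hull of the core features; the sketch below does this with a generic SLSQP solver on hypothetical random features and also reports the resulting error $\varepsilon_{\mathrm{core}}(x,a)$.

```python
import numpy as np
from scipy.optimize import minimize

def core_coefficients(phi_xa, Phi_core):
    """Best convex combination of the core features reproducing phi(x, a):
    returns the weights b(. | x, a) and the error ||Delta_core(x, a)||_2."""
    m = Phi_core.shape[0]
    res = minimize(
        lambda b: np.sum((Phi_core.T @ b - phi_xa) ** 2),
        x0=np.full(m, 1.0 / m),                                # start from uniform weights
        bounds=[(0.0, None)] * m,                              # b >= 0
        constraints=[{"type": "eq", "fun": lambda b: b.sum() - 1.0}],  # sum_i b_i = 1
        method="SLSQP",
    )
    return res.x, float(np.linalg.norm(Phi_core.T @ res.x - phi_xa))

# Hypothetical core feature matrix (m core pairs, d-dimensional features).
rng = np.random.default_rng(1)
m, d = 5, 3
Phi_core = rng.uniform(size=(m, d))
phi_xa = Phi_core.T @ rng.dirichlet(np.ones(m))  # in the convex hull by construction
b, eps_core = core_coefficients(phi_xa, Phi_core)
print(b.round(3), eps_core)                      # eps_core is numerically close to zero here
```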
It will be useful to rephrase this assumption using the following handy notation. Let $U \in \mathbb{R}^{m \times Z}_{+}$ denote a selection matrix such that $\widetilde{\Phi} = U \Phi \in \mathbb{R}^{m \times d}$ is the core feature matrix, with rows corresponding to $\Phi$ evaluated at the core state-action pairs. Furthermore, the interpolation coefficients from Assumption 1 can be organized into a stochastic matrix $B \in \mathbb{R}^{Z \times m}$ with rows $B(x,a) = \{ b(\tilde{x}, \tilde{a} \,|\, x, a) \}_{(\tilde{x}, \tilde{a}) \in \widetilde{\mathcal{Z}}} \in \mathbb{R}^{m}_{+}$ for $(x,a) \in \mathcal{Z}$. Then, the assumption can be simply rephrased as requiring the condition that $\Phi = B U \Phi + \Delta_{\mathrm{core}}$. Note that both $U$ and $B$ are stochastic matrices satisfying $U \mathbf{1} = \mathbf{1}$ and $B \mathbf{1} = \mathbf{1}$, and the same holds for their product: $B U \mathbf{1} = \mathbf{1}$. Note however that $BU$ is a rank-$m$ matrix, which implies that the assumption can only be satisfied with zero error whenever $m \ge \operatorname{rank}(\Phi)$, which in general can be as large as the feature dimensionality $d$. Whether or not it is possible to find a set of core state-action pairs in a given MDP is a nontrivial question that we discuss in Section 6.
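The matrix form of the assumption is straightforward to check numerically. The sketch below (with a hypothetical feature matrix, core index set, and an arbitrary row-stochastic $B$) assembles $U$ and $B$, evaluates $\Delta_{\mathrm{core}} = \Phi - BU\Phi$ together with the per-pair errors, and confirms that $U$, $B$ and $BU$ are row-stochastic.

```python
import numpy as np

rng = np.random.default_rng(0)
X, A, d = 4, 2, 3
Z = X * A
Phi = rng.uniform(size=(Z, d))                  # hypothetical feature matrix

# Selection matrix U: row i picks the i-th core pair, so Phi_tilde = U Phi
# stacks the core features.
core_idx = [0, 3, 5, 6]                         # hypothetical core state-action pairs
m = len(core_idx)
U = np.zeros((m, Z))
U[np.arange(m), core_idx] = 1.0
Phi_tilde = U @ Phi

# Interpolation matrix B (here arbitrary row-stochastic weights; in practice its
# rows would come from solving the projection problem in the previous snippet).
B = rng.dirichlet(np.ones(m), size=Z)

# The assumption in matrix form: Phi = B U Phi + Delta_core.
Delta_core = Phi - B @ (U @ Phi)
eps_core = np.linalg.norm(Delta_core, axis=1)   # misspecification error per (x, a)

# U, B and their product BU are row-stochastic: they map the all-ones vector to itself.
for S in (U, B, B @ U):
    assert np.allclose(S @ np.ones(S.shape[1]), 1.0)
```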
With the notation introduced above, we are ready to state our relaxation of the LP (4) that serves
as the foundation for our algorithm design:
\[
\begin{aligned}
\max_{\lambda \in \mathbb{R}^{m}_{+},\, u \in \mathbb{R}^{XA}_{+}} \quad & \langle \lambda, U r \rangle \\
\text{subject to} \quad & E^{\mathsf{T}} u = (1-\gamma)\nu_0 + \gamma P^{\mathsf{T}} U^{\mathsf{T}} \lambda, \\
& \Phi^{\mathsf{T}} U^{\mathsf{T}} \lambda = \Phi^{\mathsf{T}} u.
\end{aligned} \tag{5}
\]
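To expose the structure of LP (5), the sketch below assembles and solves it with scipy.optimize.linprog on a deliberately simple toy instance in which two states duplicate two others, so that a core set of size $m = 4$ reproduces all features exactly. This is purely illustrative: the algorithm developed in this paper optimizes the Lagrangian of this LP with stochastic primal-dual updates rather than handing the full LP, which still involves the high-dimensional variable $u$, to a solver.

```python
import numpy as np
from scipy.optimize import linprog

# Toy instance in which Assumption 1 holds exactly: states 2 and 3 are copies of
# states 0 and 1 (same transitions, rewards, features), so the four state-action
# pairs of states 0 and 1 form a core set (m = 4). Features in [0.5, 1] together
# with gamma < 0.5 keep this toy LP bounded.
rng = np.random.default_rng(0)
X, A, d, m, gamma = 4, 2, 3, 4, 0.4
Z = X * A
P_half = rng.dirichlet(np.ones(X), size=Z // 2)     # transitions of states 0 and 1
r_half = rng.uniform(size=Z // 2)
Phi_half = rng.uniform(0.5, 1.0, size=(Z // 2, d))
P = np.vstack([P_half, P_half])                     # states 2, 3 duplicate states 0, 1
r = np.concatenate([r_half, r_half])
Phi = np.vstack([Phi_half, Phi_half])
E = np.kron(np.eye(X), np.ones((A, 1)))             # (E v)(x, a) = v(x)
nu0 = np.full(X, 1.0 / X)
U = np.eye(m, Z)                                    # selects pairs 0..3 as the core set

# LP (5) over z = [lambda (m); u (XA)], both nonnegative:
#   max <lambda, U r>   s.t.   E^T u = (1 - gamma) nu0 + gamma P^T U^T lambda,
#                              Phi^T U^T lambda = Phi^T u.
c = -np.concatenate([U @ r, np.zeros(Z)])           # linprog minimizes, so negate
A_eq = np.block([[-gamma * P.T @ U.T, E.T],
                 [Phi.T @ U.T, -Phi.T]])
b_eq = np.concatenate([(1 - gamma) * nu0, np.zeros(d)])
res = linprog(c=c, A_eq=A_eq, b_eq=b_eq, bounds=(0, None))
lam_star, u_star = res.x[:m], res.x[m:]
print("status:", res.status, " relaxed optimal value:", lam_star @ (U @ r))
```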