Efficient Global Planning in Large MDPs via Stochastic Primal-Dual Optimization

arXiv:2210.12057v2 [cs.LG] 31 Jan 2023
Proceedings of Machine Learning Research vol 201:1–23, 2023. 34th International Conference on Algorithmic Learning Theory
Efficient Global Planning in Large MDPs
via Stochastic Primal-Dual Optimization
Gergely Neu GERGELY.NEU@GMAIL.COM
Universitat Pompeu Fabra, Barcelona, Spain
Nneka Okolo NNEKAMAUREEN.OKOLO@UPF.EDU
Universitat Pompeu Fabra, Barcelona, Spain
Editors: Shipra Agrawal and Francesco Orabona
Abstract
We propose a new stochastic primal-dual optimization algorithm for planning in a large discounted
Markov decision process with a generative model and linear function approximation. Assuming
that the feature map approximately satisfies standard realizability and Bellman-closedness condi-
tions and also that the feature vectors of all state-action pairs are representable as convex combina-
tions of a small core set of state-action pairs, we show that our method outputs a near-optimal policy
after a polynomial number of queries to the generative model. Our method is computationally effi-
cient and comes with the major advantage that it outputs a single softmax policy that is compactly
represented by a low-dimensional parameter vector, and does not need to execute computationally
expensive local planning subroutines in runtime.
Keywords: Markov decision processes, Linear Programming, Linear function approximation,
Planning with a generative model.
1. Introduction
Finding near-optimal policies in large Markov decision processes (MDPs) is one of the most im-
portant tasks encountered in model-based reinforcement learning (Sutton and Barto, 2018). This
problem (commonly referred to as planning) presents both computational and statistical challenges
when having access only to a generative model of the environment: an efficient planning method
should use little computation and few samples drawn from the model. While the complexity of plan-
ning in small Markov decision processes is already well-understood by now (cf. Azar et al., 2013;
Sidford et al., 2018; Agarwal et al., 2020b), extending the insights to large state spaces with func-
tion approximation has remained challenging. One particular challenge that remains unaddressed
in the present literature is going beyond local planning methods that require costly online compu-
tations for producing each action during execution time. In this work, we advance the state of the
art in planning for large MDPs by designing and analyzing a new global planning algorithm that
outputs a simple, compactly represented policy after performing all of its computations in an offline
fashion.
Our method is rooted in the classic linear programming framework for sequential decision mak-
ing first proposed by Manne (1960); d’Epenoux (1963); Denardo (1970). This approach formulates
the problem of finding an optimal behavior policy as a linear program (LP) with a number of vari-
ables and constraints that is proportional to the size of the state-action space of the MDP. Over the
last several decades, numerous attempts have been made to derive computationally tractable algo-
rithms from the LP framework, most notably via reducing the number of variables using function
approximation techniques. This idea has been first explored by Schweitzer and Seidmann (1985),
whose ideas were further developed and popularized by the seminal work of De Farias and Van Roy
(2003). Further approximations were later proposed by De Farias and Van Roy (2004); Petrik and Zilberstein
(2009); Lakshminarayanan et al. (2017), whose main focus was reducing the complexity of the LPs
while keeping the quality of optimal solutions under control.
More recently, the LP framework has started to inspire practical methods for reinforcement
learning and, more specifically, planning with a generative model. The most relevant to the present
paper is the line of work initiated by Wang and Chen (2016); Wang (2017), who proposed plan-
ning algorithms based on primal-dual optimization of the Lagrangian associated with the classic
LP. Over the last several years, this computational recipe has been perfected for small MDPs by
works like Cheng et al. (2020); Jin and Sidford (2020); Tiapkin and Gasnikov (2022), and even
some extensions using linear function approximation have been proposed by Chen et al. (2018);
Bas-Serrano and Neu (2020). While these methods have successfully reduced the number of primal
and dual variables, they all require stringent conditions on the function approximator being used and
their overall sample complexity scales poorly with the size of the MDP. For instance, the bounds
of Chen et al. (2018) scale with a so-called “concentrability coefficient” that can be as large as the
size of the state space, thus failing to yield meaningful guarantees for large MDPs. Furthermore,
these methods require parametrizing the space of so-called state-action occupancy measures, which
is undesirable given the complexity of said space.
In the present work, we develop a new stochastic primal-dual method based on a relaxed
version of the classical LP. This relaxed LP (inspired by the works of Mehta and Meyn, 2009;
Neu and Pike-Burke, 2020; and Bas-Serrano et al., 2021) features a linearly parametrized action-
value function and is guaranteed to produce optimal policies as solutions under well-studied conditions
like the linear MDP assumption of Yang and Wang (2019); Jin et al. (2020). Our method iteratively
updates the primal and dual variables via a combination of stochastic mirror descent steps and a
set of implicit update rules tailor-made for the relaxed LP. Under a so-called core state-action-pair
assumption, we show that the method produces a near-optimal policy with sample and computa-
tion complexity that is polynomial in the relevant problem parameters: the size of the core set,
the dimensionality of the feature map, the effective horizon, and the desired accuracy. Additional
assumptions required by our analysis are the near-realizability of the Q-functions and closedness
under the Bellman operators of all policies. The main merit of our algorithm is that it produces a
compactly represented softmax policy which requires no access to the generative model in runtime.
The works most directly comparable to ours are Wang et al. (2021), Shariff and Szepesvári
(2020), and Yin et al. (2022)—their common feature being the use of a core set of states or state-
action pairs for planning. Wang et al. (2021) provide a sample complexity bound for finding a
near-optimal policy in linear MDPs, but their algorithm is computationally intractable¹ due to featuring
a planning subroutine in a linear MDP. Shariff and Szepesvári (2020) and Yin et al. (2022)
both propose local planning methods that require extensive computation in runtime, but on the other
hand some of their assumptions are weaker than ours. In particular, they only require realizability
of the value functions but continue to operate without the function class being closed under all Bell-
man operators. For a detailed discussion on the role and necessity of such assumptions, we refer to
the excellent discussion provided by Yin et al. (2022) on the subject.
1. At least we are not aware of a computationally efficient global planning method that works for the discounted case
they consider.
Notation. In the $n$-dimensional Euclidean space $\mathbb{R}^n$, we denote the vector of all ones by $\mathbf{1}$, the zero vector by $\mathbf{0}$, and the $i$-th coordinate vector by $e_i$. For vectors $a, b \in \mathbb{R}^m$, we use $a \le b$ to denote elementwise comparison, meaning that $a_i \le b_i$ is satisfied for all entries $i$. For any finite set $\mathcal{D}$, $\Delta_{\mathcal{D}} = \{ p \in \mathbb{R}_+^{|\mathcal{D}|} \,:\, \|p\|_1 = 1 \}$ denotes the set of all probability distributions over its entries. We define the relative entropy between two distributions $p, p' \in \Delta_{\mathcal{D}}$ as $\mathcal{D}(p \,\|\, p') = \sum_{i=1}^{|\mathcal{D}|} p_i \log \frac{p_i}{p'_i}$. In the context of iterative algorithms, we will use the notation $\mathcal{F}_{t-1}$ to refer to the sigma-algebra generated by all events up to the end of iteration $t-1$, and use the shorthand notation $\mathbb{E}_t[\cdot] = \mathbb{E}[\cdot \,|\, \mathcal{F}_{t-1}]$ to denote expectation conditional on the past history encountered by the method.
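For concreteness, the relative entropy above can be evaluated as in the following minimal sketch (our illustration of the definition only, using the usual convention $0 \log 0 = 0$):

```python
import numpy as np

def relative_entropy(p, q):
    """D(p || q) = sum_i p_i log(p_i / q_i) for two distributions on the same
    finite set, with the convention 0 * log(0 / q_i) = 0."""
    p, q = np.asarray(p, dtype=float), np.asarray(q, dtype=float)
    mask = p > 0
    return float(np.sum(p[mask] * np.log(p[mask] / q[mask])))

print(relative_entropy([0.5, 0.3, 0.2], [0.4, 0.4, 0.2]))  # nonnegative, zero iff p == q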
2. Preliminaries
We study a discounted Markov decision process (Puterman, 2014) denoted by the quintuple $(\mathcal{X}, \mathcal{A}, r, P, \gamma)$, with $\mathcal{X}$ and $\mathcal{A}$ representing finite (but potentially very large) state and action spaces of cardinality $X = |\mathcal{X}|$ and $A = |\mathcal{A}|$, respectively. The reward function is denoted by $r: \mathcal{X} \times \mathcal{A} \to \mathbb{R}$, and the transition function by $P: \mathcal{X} \times \mathcal{A} \to \Delta_{\mathcal{X}}$. We will often represent the reward function by a vector in $\mathbb{R}^{XA}$ and the transition function by the operator $P \in \mathbb{R}^{XA \times X}$, which acts on functions $v \in \mathbb{R}^{X}$ by assigning $(Pv)(x,a) = \sum_{x'} P(x'|x,a)\, v(x')$ for all $x, a$. Its adjoint $P^{\mathsf{T}} \in \mathbb{R}^{X \times XA}$ is similarly defined on functions $u \in \mathbb{R}^{XA}$ via the assignment $(P^{\mathsf{T}} u)(x') = \sum_{x,a} P(x'|x,a)\, u(x,a)$ for all $x'$. We also define the operator $E \in \mathbb{R}^{XA \times X}$ and its adjoint $E^{\mathsf{T}} \in \mathbb{R}^{X \times XA}$, acting on respective vectors $v \in \mathbb{R}^{X}$ and $u \in \mathbb{R}^{XA}$ through the assignments $(Ev)(x,a) = v(x)$ and $(E^{\mathsf{T}} u)(x) = \sum_a u(x,a)$. For simplicity, we assume the rewards are bounded in $[0,1]$ and let $\mathcal{Z} = \{(x,a) \,|\, x \in \mathcal{X}, a \in \mathcal{A}\}$ denote the set of all possible state-action pairs, with cardinality $Z = |\mathcal{Z}|$, to be used when necessary.
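To make the operator notation concrete, the following minimal NumPy sketch (our illustration, with hypothetical sizes $X = 4$ and $A = 2$ and a state-major ordering of the pairs $(x,a)$) instantiates $P$ and $E$ as matrices for a randomly generated MDP and checks the defining assignments and the adjointness numerically.

```python
import numpy as np

rng = np.random.default_rng(0)
X, A = 4, 2                              # toy state and action space sizes
Z = X * A                                # number of state-action pairs, state-major order

# Transition operator P as an XA-by-X matrix: row (x, a) holds P(. | x, a).
P = rng.dirichlet(np.ones(X), size=Z)
# Operator E with (E v)(x, a) = v(x): copies a state function to every action.
E = np.kron(np.eye(X), np.ones((A, 1)))

v = rng.uniform(size=X)                  # a test function of the state
u = rng.uniform(size=Z)                  # a test function of the state-action pair

Pv = P @ v                               # (P v)(x, a) = sum_x' P(x' | x, a) v(x')
PTu = P.T @ u                            # (P^T u)(x') = sum_{x, a} P(x' | x, a) u(x, a)
ETu = E.T @ u                            # (E^T u)(x)  = sum_a u(x, a)

# Adjointness: <P v, u> = <v, P^T u> and <E v, u> = <v, E^T u>.
assert np.isclose(Pv @ u, v @ PTu)
assert np.isclose((E @ v) @ u, v @ ETu)
```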
The Markov decision process describes a sequential decision-making process where in each round $t = 0, 1, \ldots$, the agent observes the state of the environment $x_t$, takes an action $a_t$, and earns a potentially random reward with expectation $r(x_t, a_t)$. The state of the environment in the next round $t+1$ is generated randomly according to the transition dynamics as $x_{t+1} \sim P(\cdot|x_t, a_t)$. The initial state of the process $x_0$ is drawn from a fixed distribution $\nu_0 \in \Delta_{\mathcal{X}}$. In the setting we consider, the agent's goal is to maximize its normalized discounted return $(1-\gamma)\,\mathbb{E}\!\left[\sum_{t=0}^{\infty} \gamma^t r(x_t, a_t)\right]$, where $\gamma \in (0,1)$ is the discount factor, and the expectation is taken over the random transitions generated by the environment, the random initial state, and the potential randomness injected by the agent. It is well-known that maximizing the discounted return can be achieved by following a memoryless, time-independent decision-making rule mapping states to actions. Thus, we restrict our attention to stationary stochastic policies $\pi: \mathcal{X} \to \Delta_{\mathcal{A}}$, with $\pi(a|x)$ denoting the probability of the policy taking action $a$ in state $x$. We define the mean operator $M^{\pi}: \mathbb{R}^{XA} \to \mathbb{R}^{X}$ with respect to policy $\pi$ that acts on functions $Q \in \mathbb{R}^{X \times A}$ as $(M^{\pi} Q)(x) = \sum_a \pi(a|x)\, Q(x,a)$, and the (non-linear) max operator $M: \mathbb{R}^{XA} \to \mathbb{R}^{X}$ that acts as $(MQ)(x) = \max_{a \in \mathcal{A}} Q(x,a)$. The value function and action-value function of policy $\pi$ are respectively defined as $V^{\pi}(x) = \mathbb{E}_{\pi}\!\left[\sum_{t=0}^{\infty} \gamma^t r(x_t, a_t) \,\middle|\, x_0 = x\right]$ and $Q^{\pi}(x,a) = \mathbb{E}_{\pi}\!\left[\sum_{t=0}^{\infty} \gamma^t r(x_t, a_t) \,\middle|\, x_0 = x, a_0 = a\right]$, where the notation $\mathbb{E}_{\pi}[\cdot]$ signifies that each action is selected by following the policy as $a_t \sim \pi(\cdot|x_t)$. The value functions of a policy $\pi$ are known to satisfy the Bellman equations (Bellman, 1966), and the value functions of an optimal policy $\pi^{*}$ satisfy the Bellman optimality equations, which can be conveniently written in our notation as
\[
Q^{\pi} = r + \gamma P V^{\pi} \qquad \text{and} \qquad Q^{*} = r + \gamma P V^{*}, \tag{1}
\]
with $V^{\pi} = M^{\pi} Q^{\pi}$ and $V^{*} = M Q^{*}$. Any optimal policy $\pi^{*}$ satisfies $\operatorname{supp}(\pi^{*}(\cdot|x)) \subseteq \arg\max_a Q^{*}(x,a)$.
We refer the reader to Puterman (2014) for the standard proofs of these fundamental results.
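For a small MDP, the Bellman equations (1) for a fixed policy can be solved directly as a linear system. The following sketch (a toy illustration with randomly generated transitions and rewards, not part of the paper's method) computes $Q^{\pi}$ from $(I - \gamma P M^{\pi}) Q^{\pi} = r$ under a state-major ordering of the pairs $(x,a)$ and verifies the fixed point.

```python
import numpy as np

rng = np.random.default_rng(0)
X, A, gamma = 4, 2, 0.9
Z = X * A
P = rng.dirichlet(np.ones(X), size=Z)        # P(. | x, a), rows in state-major order
r = rng.uniform(size=Z)                      # rewards in [0, 1]
pi = rng.dirichlet(np.ones(A), size=X)       # a fixed stochastic policy pi(a | x)

# Mean operator M^pi as an X-by-XA matrix: (M^pi Q)(x) = sum_a pi(a | x) Q(x, a).
M_pi = np.zeros((X, Z))
for x in range(X):
    M_pi[x, x * A:(x + 1) * A] = pi[x]

# Bellman equation (1): Q^pi = r + gamma * P V^pi with V^pi = M^pi Q^pi,
# i.e. (I - gamma * P M^pi) Q^pi = r, a linear system of size XA.
Q_pi = np.linalg.solve(np.eye(Z) - gamma * P @ M_pi, r)
V_pi = M_pi @ Q_pi
assert np.allclose(Q_pi, r + gamma * P @ V_pi)
```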
Our approach taken in this paper will primarily make use of an alternative formulation of the
MDP optimization based on linear programming, due to Manne (1960) (see also d'Epenoux, 1963,
Denardo, 1970, and Section 6.9 in Puterman, 2014). This formulation phrases the optimal control
problem as a search for an occupancy measure with maximal return. The state-action occupancy
measure of a policy $\pi$ is defined as
\[
\mu^{\pi}(x,a) = (1-\gamma)\, \mathbb{E}_{\pi}\!\left[ \sum_{t=0}^{\infty} \gamma^{t}\, \mathbb{I}_{\{(x_t, a_t) = (x,a)\}} \right],
\]
which allows rewriting the expected normalized return of $\pi$ as $R^{\pi}_{\gamma} = \langle \mu^{\pi}, r \rangle$. The corresponding state-occupancy measure is defined as the state distribution $\nu^{\pi}(x) = \sum_a \mu^{\pi}(x,a)$. Denoting the direct product of a state distribution $\nu$ and the policy $\pi$ by $\nu \circ \pi$, with its entries being $(\nu \circ \pi)(x,a) = \nu(x)\pi(a|x)$, we can notice that the state-action occupancy measure induced by $\pi$ satisfies $\mu^{\pi} = \nu^{\pi} \circ \pi$. The set of all occupancy measures can be fully characterized by a set of linear constraints, which allows rewriting the optimal control problem as the following linear program (LP):
\[
\begin{aligned}
\max_{\mu \in \mathbb{R}^{XA}_{+}} \quad & \langle \mu, r \rangle \\
\text{subject to} \quad & E^{\mathsf{T}} \mu = (1-\gamma)\nu_0 + \gamma P^{\mathsf{T}} \mu.
\end{aligned} \tag{2}
\]
Any optimal solution $\mu^{*}$ of this LP can be shown to correspond to the occupancy measure of an optimal policy $\pi^{*}$, and generally policies can be extracted from feasible points $\mu$ via the rule $\pi_{\mu}(a|x) = \mu(x,a) / \sum_{a'} \mu(x,a')$ (subject to the denominator being nonzero). The dual of this linear program takes the following form:
\[
\begin{aligned}
\min_{V \in \mathbb{R}^{X}} \quad & (1-\gamma) \langle \nu_0, V \rangle \\
\text{subject to} \quad & E V \ge r + \gamma P V.
\end{aligned} \tag{3}
\]
The optimal value function $V^{*}$ is known to be an optimal solution for this LP, and is the unique optimal solution provided that $\nu_0$ has full support over the state space.
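To make the primal LP (2) concrete, the sketch below solves it for a tiny randomly generated MDP with scipy.optimize.linprog and extracts a policy via $\pi_{\mu}(a|x) = \mu(x,a)/\sum_{a'} \mu(x,a')$. Such a brute-force solve is of course only possible for very small state spaces; it illustrates the LP itself, not the method developed in this paper.

```python
import numpy as np
from scipy.optimize import linprog

rng = np.random.default_rng(0)
X, A, gamma = 4, 2, 0.9
Z = X * A
P = rng.dirichlet(np.ones(X), size=Z)        # P(. | x, a), state-major ordering
r = rng.uniform(size=Z)
E = np.kron(np.eye(X), np.ones((A, 1)))      # (E v)(x, a) = v(x)
nu0 = np.full(X, 1.0 / X)                    # uniform initial state distribution

# LP (2): max <mu, r>  s.t.  (E^T - gamma * P^T) mu = (1 - gamma) * nu0,  mu >= 0.
res = linprog(c=-r,                          # linprog minimizes, so negate the reward
              A_eq=E.T - gamma * P.T,
              b_eq=(1 - gamma) * nu0,
              bounds=(0, None))
mu_star = res.x                              # (near-)optimal occupancy measure

# Return of the solution and policy extraction via
# pi_mu(a | x) = mu(x, a) / sum_a mu(x, a).
print("optimal normalized return:", mu_star @ r)
mu_xa = mu_star.reshape(X, A)
pi_star = mu_xa / np.maximum(mu_xa.sum(axis=1, keepdims=True), 1e-12)
```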
The above linear programs are not directly suitable as solution tools for large MDPs, given that
they have a large number of variables and constraints. Numerous adjustments have been proposed
over the last decades to address this issue (Schweitzer and Seidmann, 1985; De Farias and Van Roy,
2003, 2004; Lakshminarayanan et al., 2017; Bas-Serrano and Neu, 2020). One recurring theme in
these alternative formulations is relaxing some of the constraints by the introduction of a feature
map. In this work, we use as starting point an alternative proposed by Bas-Serrano et al. (2021),
who use a feature map $\varphi: \mathcal{X} \times \mathcal{A} \to \mathbb{R}^{d}$, represented by the $XA \times d$ feature matrix $\Phi$, and rephrase
the original optimization problem as the following LP:
\[
\begin{aligned}
\max_{\mu, u \in \mathbb{R}^{XA}_{+}} \quad & \langle \mu, r \rangle \\
\text{subject to} \quad & E^{\mathsf{T}} u = (1-\gamma)\nu_0 + \gamma P^{\mathsf{T}} \mu, \\
& \Phi^{\mathsf{T}} \mu = \Phi^{\mathsf{T}} u.
\end{aligned} \tag{4}
\]
The dual of this LP is given as
\[
\begin{aligned}
\min_{\theta \in \mathbb{R}^{d},\, V \in \mathbb{R}^{X}} \quad & (1-\gamma) \langle \nu_0, V \rangle \\
\text{subject to} \quad & \Phi \theta \ge r + \gamma P V, \\
& E V \ge \Phi \theta.
\end{aligned}
\]
As shown by Bas-Serrano et al. (2021), optimal solutions of the above relaxed LPs correspond to optimal occupancy measures and value functions under the popular linear MDP assumption of Yang and Wang (2019); Jin et al. (2020). The key merit of these LPs is that the dual features a linearly parametrized action-value function $Q_{\theta} = \Phi \theta$, which allows the extraction of a simple greedy policy $\pi_{\theta}(a|x) = \mathbb{I}_{\{a = \arg\max_{a'} Q_{\theta}(x,a')\}}$ from any dual solution $\theta$, and in particular an optimal policy can be extracted from its optimal solution.
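To illustrate how compactly such a dual solution represents a policy, the following sketch (with a hypothetical random feature matrix $\Phi$ and parameter vector $\theta$) materializes $Q_{\theta} = \Phi\theta$ and the induced greedy policy.

```python
import numpy as np

rng = np.random.default_rng(0)
X, A, d = 4, 2, 3
Phi = rng.uniform(size=(X * A, d))           # hypothetical feature matrix (state-major rows)
theta = rng.normal(size=d)                   # some dual solution theta

# Q_theta = Phi theta, and the greedy policy
# pi_theta(a | x) = I{a = argmax_a' Q_theta(x, a')}.
Q_theta = (Phi @ theta).reshape(X, A)
pi_theta = np.zeros((X, A))
pi_theta[np.arange(X), Q_theta.argmax(axis=1)] = 1.0
```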
3. A tractable linear program for large MDPs with linear function approximation
The downside of the classical linear programs defined in the previous sections is that they all feature
a large number of variables and constraints, even after successive relaxations. In what follows, we
offer a new version of the LP (4) that gets around this issue and allows the development of a tractable
primal-dual algorithm with strong performance guarantees. In particular, we will reduce the number
of variables in the primal LP (4) by considering only sparsely supported state-action distributions
instead of full occupancy measures, which will be justified by the following technical assumption
made on the MDP structure:
Assumption 1 (Approximate Core State-Action Assumption) The feature vector of any state-action pair $(x,a) \in \mathcal{Z}$ can be approximately expressed as a convex combination of features evaluated at a set of $m$ core state-action pairs $(\tilde{x}, \tilde{a}) \in \widetilde{\mathcal{Z}}$, up to an error $\Delta_{\mathrm{core}} \in \mathbb{R}^{XA \times d}$. That is, for each $(x,a) \in \mathcal{Z}$, there exists a set of coefficients satisfying $b(\tilde{x}, \tilde{a} \,|\, x, a) \ge 0$ and $\sum_{\tilde{x}, \tilde{a}} b(\tilde{x}, \tilde{a} \,|\, x, a) = 1$ such that $\varphi(x,a) = \sum_{\tilde{x}, \tilde{a}} b(\tilde{x}, \tilde{a} \,|\, x, a)\, \varphi(\tilde{x}, \tilde{a}) + \Delta_{\mathrm{core}}(x,a)$. Furthermore, for every $(x,a)$, the misspecification error satisfies $\|\Delta_{\mathrm{core}}(x,a)\|_{2} = \varepsilon_{\mathrm{core}}(x,a)$.
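One way to compute such coefficients (not prescribed by the paper) is to project $\varphi(x,a)$ onto the convex hull of the core features; the sketch below does this with a generic SLSQP solver on hypothetical random features and also reports the resulting error $\varepsilon_{\mathrm{core}}(x,a)$.

```python
import numpy as np
from scipy.optimize import minimize

def core_coefficients(phi_xa, Phi_core):
    """Best convex combination of the core features reproducing phi(x, a):
    returns the weights b(. | x, a) and the error ||Delta_core(x, a)||_2."""
    m = Phi_core.shape[0]
    res = minimize(
        lambda b: np.sum((Phi_core.T @ b - phi_xa) ** 2),
        x0=np.full(m, 1.0 / m),                                # start from uniform weights
        bounds=[(0.0, None)] * m,                              # b >= 0
        constraints=[{"type": "eq", "fun": lambda b: b.sum() - 1.0}],  # sum_i b_i = 1
        method="SLSQP",
    )
    return res.x, float(np.linalg.norm(Phi_core.T @ res.x - phi_xa))

# Hypothetical core feature matrix (m core pairs, d-dimensional features).
rng = np.random.default_rng(1)
m, d = 5, 3
Phi_core = rng.uniform(size=(m, d))
phi_xa = Phi_core.T @ rng.dirichlet(np.ones(m))  # in the convex hull by construction
b, eps_core = core_coefficients(phi_xa, Phi_core)
print(b.round(3), eps_core)                      # eps_core is numerically close to zero here
```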
It will be useful to rephrase this assumption using the following handy notation. Let $U \in \mathbb{R}^{m \times Z}_{+}$ denote a selection matrix such that $\widetilde{\Phi} = U \Phi \in \mathbb{R}^{m \times d}$ is the core feature matrix, with rows corresponding to $\Phi$ evaluated at the core state-action pairs. Furthermore, the interpolation coefficients from Assumption 1 can be organized into a stochastic matrix $B \in \mathbb{R}^{Z \times m}$ with rows $B(x,a) = \{ b(\tilde{x}, \tilde{a} \,|\, x, a) \}_{(\tilde{x}, \tilde{a}) \in \widetilde{\mathcal{Z}}} \in \mathbb{R}^{m}_{+}$ for $(x,a) \in \mathcal{Z}$. Then, the assumption can be simply rephrased as requiring the condition that $\Phi = B U \Phi + \Delta_{\mathrm{core}}$. Note that both $U$ and $B$ are stochastic matrices satisfying $U \mathbf{1} = \mathbf{1}$ and $B \mathbf{1} = \mathbf{1}$, and the same holds for their product: $B U \mathbf{1} = \mathbf{1}$. Note however that $BU$ is a rank-$m$ matrix, which implies that the assumption can only be satisfied with zero error whenever $m \ge \operatorname{rank}(\Phi)$, which in general can be as large as the feature dimensionality $d$. Whether or not it is possible to find a set of core state-action pairs in a given MDP is a nontrivial question that we discuss in Section 6.
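The matrix form of the assumption is straightforward to check numerically. The sketch below (with a hypothetical feature matrix, core index set, and an arbitrary row-stochastic $B$) assembles $U$ and $B$, evaluates $\Delta_{\mathrm{core}} = \Phi - BU\Phi$ together with the per-pair errors, and confirms that $U$, $B$ and $BU$ are row-stochastic.

```python
import numpy as np

rng = np.random.default_rng(0)
X, A, d = 4, 2, 3
Z = X * A
Phi = rng.uniform(size=(Z, d))                  # hypothetical feature matrix

# Selection matrix U: row i picks the i-th core pair, so Phi_tilde = U Phi
# stacks the core features.
core_idx = [0, 3, 5, 6]                         # hypothetical core state-action pairs
m = len(core_idx)
U = np.zeros((m, Z))
U[np.arange(m), core_idx] = 1.0
Phi_tilde = U @ Phi

# Interpolation matrix B (here arbitrary row-stochastic weights; in practice its
# rows would come from solving the projection problem in the previous snippet).
B = rng.dirichlet(np.ones(m), size=Z)

# The assumption in matrix form: Phi = B U Phi + Delta_core.
Delta_core = Phi - B @ (U @ Phi)
eps_core = np.linalg.norm(Delta_core, axis=1)   # misspecification error per (x, a)

# U, B and their product BU are row-stochastic: they map the all-ones vector to itself.
for S in (U, B, B @ U):
    assert np.allclose(S @ np.ones(S.shape[1]), 1.0)
```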
With the notation introduced above, we are ready to state our relaxation of the LP (4) that serves
as the foundation for our algorithm design:
\[
\begin{aligned}
\max_{\lambda \in \mathbb{R}^{m}_{+},\, u \in \mathbb{R}^{XA}_{+}} \quad & \langle \lambda, U r \rangle \\
\text{subject to} \quad & E^{\mathsf{T}} u = (1-\gamma)\nu_0 + \gamma P^{\mathsf{T}} U^{\mathsf{T}} \lambda, \\
& \Phi^{\mathsf{T}} U^{\mathsf{T}} \lambda = \Phi^{\mathsf{T}} u.
\end{aligned} \tag{5}
\]
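To expose the structure of LP (5), the sketch below assembles and solves it with scipy.optimize.linprog on a deliberately simple toy instance in which two states duplicate two others, so that a core set of size $m = 4$ reproduces all features exactly. This is purely illustrative: the algorithm developed in this paper optimizes the Lagrangian of this LP with stochastic primal-dual updates rather than handing the full LP, which still involves the high-dimensional variable $u$, to a solver.

```python
import numpy as np
from scipy.optimize import linprog

# Toy instance in which Assumption 1 holds exactly: states 2 and 3 are copies of
# states 0 and 1 (same transitions, rewards, features), so the four state-action
# pairs of states 0 and 1 form a core set (m = 4). Features in [0.5, 1] together
# with gamma < 0.5 keep this toy LP bounded.
rng = np.random.default_rng(0)
X, A, d, m, gamma = 4, 2, 3, 4, 0.4
Z = X * A
P_half = rng.dirichlet(np.ones(X), size=Z // 2)     # transitions of states 0 and 1
r_half = rng.uniform(size=Z // 2)
Phi_half = rng.uniform(0.5, 1.0, size=(Z // 2, d))
P = np.vstack([P_half, P_half])                     # states 2, 3 duplicate states 0, 1
r = np.concatenate([r_half, r_half])
Phi = np.vstack([Phi_half, Phi_half])
E = np.kron(np.eye(X), np.ones((A, 1)))             # (E v)(x, a) = v(x)
nu0 = np.full(X, 1.0 / X)
U = np.eye(m, Z)                                    # selects pairs 0..3 as the core set

# LP (5) over z = [lambda (m); u (XA)], both nonnegative:
#   max <lambda, U r>   s.t.   E^T u = (1 - gamma) nu0 + gamma P^T U^T lambda,
#                              Phi^T U^T lambda = Phi^T u.
c = -np.concatenate([U @ r, np.zeros(Z)])           # linprog minimizes, so negate
A_eq = np.block([[-gamma * P.T @ U.T, E.T],
                 [Phi.T @ U.T, -Phi.T]])
b_eq = np.concatenate([(1 - gamma) * nu0, np.zeros(d)])
res = linprog(c=c, A_eq=A_eq, b_eq=b_eq, bounds=(0, None))
lam_star, u_star = res.x[:m], res.x[m:]
print("status:", res.status, " relaxed optimal value:", lam_star @ (U @ r))
```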