Krylov–Bellman boosting: Super-linear policy
evaluation in general state spaces
Eric Xia and Martin J. Wainwright
Department of Statistics and
Department of Electrical Engineering and Computer Sciences,
UC Berkeley, Berkeley, CA
Department of Electrical Engineering and Computer Sciences and
Department of Mathematics,
Massachusetts Institute of Technology, Cambridge, MA
October 21, 2022
Abstract
We present and analyze the Krylov–Bellman Boosting (KBB) algorithm for policy
evaluation in general state spaces. It alternates between fitting the Bellman residual us-
ing non-parametric regression (as in boosting), and estimating the value function via the
least-squares temporal difference (LSTD) procedure applied with a feature set that grows
adaptively over time. By exploiting the connection to Krylov methods, we equip this
method with two attractive guarantees. First, we provide a general convergence bound
that allows for separate estimation errors in residual fitting and LSTD computation. Con-
sistent with our numerical experiments, this bound shows that convergence rates depend
on the restricted spectral structure, and are typically super-linear. Second, by combin-
ing this meta-result with sample-size dependent guarantees for residual fitting and LSTD
computation, we obtain concrete statistical guarantees that depend on the sample size
along with the complexity of the function class used to fit the residuals. We illustrate
the behavior of the KBB algorithm for various types of policy evaluation problems, and
typically find large reductions in sample complexity relative to the standard approach of
fitted value iteration.
1 Introduction
The past decade has witnessed tremendous success with applications of reinforcement learning
(e.g., [TFR+17, LFDA16, SHM+16]), but many challenges remain. Notably, many procedures
in reinforcement learning are very “data hungry”, meaning that large sample sizes—which may
or may not be available in a given application—are often required. Given a data-dependent
procedure, its sample complexity characterizes the number of samples required, as a function
of the problem parameters, to achieve a desired error tolerance. Given
the cost associated with sampling (or lack of availability of large sample sizes), an important
problem is to develop RL algorithms with the lowest possible sample complexity.
In this paper, we focus on the problem of policy evaluation for Markov decision pro-
cesses defined over general state spaces. Policy evaluation occupies a central role in rein-
forcement learning. It is of interest both in its own right and as a key subroutine in many
algorithms. For instance, classical algorithms for computing optimal policies, such as pol-
icy iteration [How60], require repeated rounds of policy evaluation; iterative schemes like
policy gradient methods [Wil92, SLH+14, SMSM99, Kak01] and actor-critic schemes [KT01,
MBM+16, WMG+17, BSGL09] involve maintaining estimates of the value function as a part
of the algorithm.
Policy evaluation in discrete state spaces, known as the tabular setting, is very well-
understood (e.g., [Ber17a,Bel57,PB79]); notably, recent work has exhibited various pro-
cedures, for both batch and on-line settings, that achieve fundamental lower bounds in a
strong instance-optimal sense [KPR+20, PW21]. Far more challenging are Markov decision
processes involving continuous state spaces. In this setting, there are at least two broad
classes of approaches. One approach is based on using a suitably “rich” non-parametric func-
tion class, such as regression trees, reproducing kernel Hilbert spaces, or neural networks, to
repeatedly refit an estimate of the value function; procedures of this type are known collec-
tively as instantiations of fitted value iteration (FVI) [EGW05]. The FVI algorithm along
with its close relative fitted Q-iteration (FQI), has been the focus of a long line of work
(e.g., [ASM07, MS08, SGG+15, CJ19, FWXY20, DW20]).
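For concreteness, a minimal sketch of a single FVI round is given below; the function name `fvi_round`, the use of gradient-boosted trees as the regressor, and the array format of the samples are illustrative assumptions rather than the specific instantiation studied in this paper.

```python
from sklearn.ensemble import GradientBoostingRegressor

def fvi_round(V_hat, states, rewards, next_states, gamma):
    """One round of fitted value iteration (FVI), sketched.

    Regress the empirical Bellman targets r_i + gamma * V_hat(x_i') on the
    states x_i (arrays of shape (n, d)), returning the refitted estimate.
    """
    targets = rewards + gamma * V_hat(next_states)
    reg = GradientBoostingRegressor().fit(states, targets)
    return reg.predict  # new value-function estimate, applied to a batch of states
```

In the idealized, error-free case, iterating this update contracts toward the true value function at the geometric rate $\gamma$; the statistical error incurred by each regression is what produces the error floors visible in our experiments.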
Another classical approach to approximate policy evaluation, even simpler than fitted value
iteration, is via the least-squares temporal difference method, or LSTD for short [BB96,SB18,
Boy02]. Given a finite collection of features or basis functions, it computes the value function
within the associated linear span that provides the “best fit”, as formalized in terms of a
projected fixed point (e.g., [BY09, YB10]). There has been a line of work analyzing algorithms
for computing the LSTD solution [TVR97,SMS08,BRS18,LS18], as well as more general
settings involving projected fixed-point equations [MPW20,Wan22,DWW21]. This line of
analysis, for the most part, takes as input a fixed set of features. In practice, the quality of
the LSTD fit depends strongly on the given basis, and finding a good basis is often a difficult
task. Often, features are chosen based on knowledge of the specific problem domain; there is
also a line of past work [KMP06,MMS03] on automated procedures for feature construction.
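For reference, a minimal sample-based version of the LSTD computation might look as follows; the feature map `phi`, the triple format $(x_i, r_i, x_i')$, and the small ridge term are illustrative assumptions rather than the exact estimator analyzed later in the paper.

```python
import numpy as np

def lstd(phi, states, rewards, next_states, gamma, ridge=1e-8):
    """Sample-based LSTD: solve A w = b for the weights of the linear
    approximation V(x) ~ phi(x) @ w, where

        A = (1/n) * sum_i phi(x_i) (phi(x_i) - gamma * phi(x_i'))^T
        b = (1/n) * sum_i phi(x_i) r_i.
    """
    F, Fp = phi(states), phi(next_states)   # (n, d) feature matrices
    n, d = F.shape
    A = F.T @ (F - gamma * Fp) / n
    b = F.T @ rewards / n
    # A small ridge term guards against a nearly singular A at finite n.
    return np.linalg.solve(A + ridge * np.eye(d), b)
```

The quality of the resulting fit is determined entirely by the span of `phi`, which is precisely the feature-construction issue that motivates the approach taken in this paper.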
1.1 Our contributions
In this paper, we propose and analyze a procedure for approximate policy evaluation that
combines the LSTD framework with the boosting approach of fitting residuals. The method
estimates the value function by computing a sequence of LSTD solutions, where the basis
used at each round is augmented with a non-parametric estimate of the Bellman residual
from the previous round. The LSTD steps can be interpreted as computing a sequence of
Krylov subspaces, and accordingly we refer to our method as the Krylov–Bellman boosting
algorithm, or KBB for short.
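In rough pseudocode, the main loop can be sketched as follows. This is an illustrative sketch only, not the precise algorithm specified in Section 3; it reuses the hypothetical `lstd` helper from the previous sketch, and the constant initial feature, the gradient-boosted regressor, and the fixed round budget are assumptions made for concreteness.

```python
import numpy as np
from sklearn.ensemble import GradientBoostingRegressor

def kbb(states, rewards, next_states, gamma, num_rounds):
    """Sketch of the Krylov-Bellman boosting (KBB) loop.

    Alternates between (i) an LSTD solve within the span of the current
    basis and (ii) a non-parametric fit of the Bellman residual, which is
    then appended to the basis for the next round.
    """
    basis = []  # fitted residual regressors, one per completed round

    def phi(x):
        # Feature map: a constant feature plus one column per fitted residual.
        return np.column_stack([np.ones(len(x))] + [g.predict(x) for g in basis])

    V = None
    for _ in range(num_rounds):
        d = len(basis) + 1                            # current number of features
        w = lstd(phi, states, rewards, next_states, gamma)
        V = lambda x, w=w, d=d: phi(x)[:, :d] @ w     # current LSTD estimate
        # Bellman residual targets r_i + gamma * V(x_i') - V(x_i), fit by boosting.
        resid = rewards + gamma * V(next_states) - V(states)
        basis.append(GradientBoostingRegressor().fit(states, resid))
    return V
```

Each fitted residual plays the role of a new (approximate) Krylov direction, which is what drives the super-linear convergence discussed below.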
We note that boosting methods—that is, using non-parametric procedures to fit the Bell-
man residual—have been studied in past work on reinforcement learning tasks [AAD+16]. In
our method, however, the combination with LSTD is equally important. As shown in our
theoretical analysis, it is the combination of boosting with LSTD that allows us to establish
convergence guarantees for KBB that are provably superior to those of fitted value iteration
(FVI). Consistent with this theory, we also see the gains of KBB over FVI in our experi-
mental studies. In the tabular setting, an idealized form of the KBB algorithm with known
transition dynamics—and hence no error in fitting the residuals—has been studied in past
work [PPWLL07, PLT+08], in which context it is called the method of Bellman error basis
functions (BEBF). However, for general state spaces, it is unrealistic to assume that residu-
als can be computed without error; for continuous state spaces this task can be challenging
even if the model dynamics are known, since it involves computing the integral that defines
the Bellman update. Even more challenging—and the focus of this paper—is the setting of
unknown dynamics, in which case any estimate of the residual is subject to statistical error
due to a finite sample size. It has been noted [Pet07] that the BEBF method is equivalent
to a Krylov subspace method [Saa81,Saa03] for solving a finite-dimensional linear system.
This connection to Krylov methods plays an important role in our paper; as opposed to the
classical use of Krylov methods in solving finite-dimensional linear systems, in this paper
we tackle the more general problem of solving a linear operator equation, which (for general
continuous state spaces) corresponds to an infinite-dimensional problem.
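To make this contrast concrete, recall the standard tabular picture underlying the Krylov connection (stated here as a sketch in conventional notation, with $\mathbf{P}$ the transition matrix, $r$ the reward vector, and $\gamma \in (0,1)$ the discount factor): the value function solves the linear system
\[
(I - \gamma \mathbf{P})\, V = r,
\]
and a Krylov method at round $k$ returns the best approximation to $V$ within the subspace
\[
\mathcal{K}_k := \mathrm{span}\bigl\{ r,\; (I - \gamma \mathbf{P})\, r,\; \ldots,\; (I - \gamma \mathbf{P})^{k-1} r \bigr\}.
\]
For general state spaces, $\mathbf{P}$ becomes a linear operator on $L^2(X, \mu)$, and each application of $I - \gamma \mathbf{P}$ must be estimated from data rather than computed exactly.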
The primary contributions of this paper take the form of two theorems. Our first main
result, stated as Theorem 1, is a general convergence guarantee for the KBB algorithm. It
involves a sequence of so-called restricted spectral values (cf. equation (14)) that track how
the effective conditioning of the residual problem improves as the algorithm proceeds. This
guarantee should be viewed as a “meta-result”, since it applies to arbitrary procedures used to
estimate the Bellman residual and to approximate the LSTD solutions, with the convergence
guarantee depending upon these two associated errors. In particular, Theorem 1 applies both
when the transition dynamics are known and numerical integration is used to approximate
the Bellman residual at each round, and in the reinforcement learning setting where the
dynamics are unknown but samples from the Markov chain are available. Our second set of
results, stated as Theorem 2 and Corollary 1, provides guarantees for this latter RL setting,
and in particular, for independent samples of state, next-state and reward triples. Under
this set-up, Theorem 2 provides an upper bound on the regression error for a general function
class, whereas Corollary 1 combines this guarantee with Theorem 1 to provide an “end-to-end”
guarantee for a particular instantiation of the KBB algorithm.
1.2 A preview
In order to whet the reader’s appetite, let us provide an illustrative preview of the results
to come. In Figure 1, we compare the performance of three different methods for policy
evaluation: the classical value iteration algorithm (green); a version of fitted value iteration
Figure 1. Plots of log error in $\|\cdot\|_\mu$-norm versus iteration number for three algorithms for policy
evaluation: exact value iteration with known dynamics (red), fitted value iteration (green), and
Krylov–Bellman boosting (blue). (a) “Circular” corresponds to an MRP arising from a random
walk on the circle (cf. Section 4.1) with discount factor $\gamma = 0.90$. (b) A non-linear system
(cf. Section 4.3) with discount factor $\gamma = 0.99$.
(FVI) (red), and the KBB algorithm analyzed in this paper (blue). In our implementations of
both KBB and FVI, we used a gradient boosting procedure to fit the residual at each round
(KBB), or the value function at each round (FVI). In each case, we used the same number
of samples for each fit, so that for a fixed iteration number, both procedures use the same
sample size. In contrast, classical value iteration is predicated upon exact knowledge of the
dynamics, and computes Bellman updates exactly.
In each panel, we plot the log error in evaluating a given policy versus the number of
iterations. Panel (a) corresponds to a simple tabular problem that arises from a random walk
on the circle (see Section 4.1 for details), whereas panel (b) corresponds to a Markov reward
process arising from a non-linear dynamical system with a continuous state space (see Section 4.3
for details). Consistent with extant theory, both classical and fitted value iteration converge
at a geometric rate, as shown by the linear nature of these plots of log error versus iteration
number. In contrast, the KBB algorithm exhibits super-linear convergence up to an error
floor, which arises from the statistical errors in evaluating residuals, and is consistent with
our theory (cf. Corollary 1). In the non-linear setting, we have also included a plot of the KBB
algorithm run with fewer samples; it still exhibits super-linear convergence, albeit with a higher
error floor.
In practice, of primary importance in choosing between the FVI and KBB algorithms are
their respective sample complexities—that is, how many samples are required to achieve a
specified accuracy? Figure 2 provides a comparison of the sample complexities of these two
procedures for five different types of Markov reward processes. Panels (a) and (b) compare
the sample complexity required to reduce the error by factors of $1/2$ and $1/10$, respectively.
For each model within each panel, we rescale the KBB sample size so that it remains constant
Figure 2. Comparison of the sample complexities of Krylov–Bellman boosting (KBB) to fitted
value iteration (FVI), as measured by the (relative) sample size required to reduce the error
by a factor of $1/2$ (panel (a)) or $1/10$ (panel (b)). Both algorithms are applied to five different
types of Markov reward processes, as indicated on the horizontal axis. For each MRP, the bar
plots compare the relative number of samples used, with the KBB bars rescaled to remain at
a constant level across problem types.
across models; we apply this same scaling to the FVI sample size, so the relative heights of
the bars are the relevant comparison. In panel (a), we see that the KBB procedure requires
roughly 4–5 times fewer samples than the FVI procedure. As shown in panel (b), the savings
are even more dramatic when a smaller error tolerance is required.
1.3 Paper organization and notation
The remainder of the paper is organized as follows. Section 2 is dedicated to the background
necessary to understand our algorithm and results, including Markov reward processes, linear
function approximation and the least-squares temporal difference (LSTD) estimate, along
with some background on Krylov subspace methods for solving linear systems. In Section 3,
we describe the Krylov–Bellman boosting (KBB) algorithm, and then state and interpret our
two main guarantees (Theorem 1 and Theorem 2) on its behavior. In Section 4, we provide
numerical results that illustrate the behavior of the KBB algorithm in various settings, along
with comparisons to fitted value iteration. Section 5 is devoted to the proofs of our main
results, with more technical details deferred to the appendices. We conclude with a discussion
in Section 6.
Notation: Let us summarize some notation used throughout the paper. For a probability
distribution $\mu$, let $L^2(X, \mu)$ denote the space of real-valued functions with domain $X$, square
integrable with respect to $\mu$. This space is equipped with the inner product
\[
\langle f, g \rangle_\mu := \int_X f(x)\, g(x)\, d\mu(x) \;\equiv\; \mathbb{E}_\mu\bigl[ f(X)\, g(X) \bigr],
\]
along with the norm $\|f\|_\mu := \sqrt{\langle f, f \rangle_\mu}$. Given a subspace $S \subset L^2(X, \mu)$, we write $v \perp_\mu S$ to
indicate that $v$ is orthogonal to $S$ with respect to the inner product $\langle \cdot, \cdot \rangle_\mu$, and $S^{\perp}$ to
denote the set of all such vectors $v$. For an operator $Q$, we write $Q \succeq 0$ to mean $\langle f, Qf \rangle_\mu \geq 0$
for all $f \in L^2(X, \mu)$, and we say that $Q$ is a positive operator for short. Given two operators
$P$ and $Q$, we write $P \succeq Q$ to indicate that $P - Q \succeq 0$. Given samples $\{x_i\}_{i=1}^n$, we define
the empirical norm $\|f\|_n := \sqrt{\tfrac{1}{n} \sum_{i=1}^n f^2(x_i)}$; note that it is a random quantity due to its
dependence on the underlying samples. Throughout the paper, we use $(c, c', c_1, c_2)$, etc., to
denote universal constants whose values may change from line to line.
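As a small numerical illustration of the empirical norm (a hypothetical snippet for intuition only, not part of any algorithm in the paper), one can check that for samples drawn from $\mu$, the empirical norm $\|f\|_n$ approximates the population norm $\|f\|_\mu$:

```python
import numpy as np

def empirical_norm(f, samples):
    # ||f||_n = sqrt( (1/n) * sum_i f(x_i)^2 ) over the given samples.
    vals = f(samples)
    return np.sqrt(np.mean(vals ** 2))

# For f(x) = x and x_i drawn i.i.d. from N(0, 1), ||f||_mu = 1, and the
# empirical norm concentrates around that value as n grows.
x = np.random.default_rng(0).normal(size=10_000)
print(empirical_norm(lambda s: s, x))   # approximately 1.0
```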
2 Background
Here we describe some background necessary to explain and motivate our results. In Sec-
tion 2.1 we describe the basic setup of the problem of interest, solving for the value function
of a Markov reward process, as well as the sampling model to which we have access. Sec-
tion 2.2 then describes an approximate method for solving for the value function as a linear
combination of given basis functions. Finally, in Section 2.3 we describe Krylov subspace
methods for solving linear equations.
2.1 Markov decision and reward processes
A Markov decision process consists of a state space X, an action space A, a reward function R,
along with a collection of probability transition functions. In this paper, we focus exclusively
on the policy evaluation procedure, in which case a given policy is fixed up front. For a fixed
policy, a Markov decision process reduces to a Markov reward process (MRP), which can be
characterized more simply in terms of the state space X, a probability transition kernel $\mathbf{P}$,
and a reward function.
In a Markov reward process, the states evolve over time according to the transition kernel
$\mathbf{P}$. At each time $t = 1, 2, \ldots$, when in state $x_t$, the next state $x_{t+1}$ is generated by
sampling from the distribution $\mathbf{P}(\cdot \mid x_t)$.
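To fix ideas, the object that policy evaluation targets is the discounted value function of the MRP. In one standard convention (notation assumed here for illustration, consistent with the discount factors $\gamma$ quoted earlier), it is defined by
\[
V^{*}(x) := \mathbb{E}\Bigl[\, \sum_{t=1}^{\infty} \gamma^{t-1}\, R(x_t) \;\Big|\; x_1 = x \Bigr],
\qquad \text{which satisfies} \qquad
V^{*} = R + \gamma\, \mathbf{P} V^{*},
\]
where the expectation is over trajectories of the Markov reward process, $(\mathbf{P} V)(x) := \mathbb{E}[V(x_{t+1}) \mid x_t = x]$, and the second display is the Bellman fixed-point equation that the KBB algorithm is designed to solve approximately.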