Krylov–Bellman boosting: Super-linear policy
evaluation in general state spaces
Eric Xia and Martin J. Wainwright
Department of Statistics and
Department of Electrical Engineering and Computer Sciences,
UC Berkeley, Berkeley, CA
Department of Electrical Engineering and Computer Sciences and
Department of Mathematics,
Massachusetts Institute of Technology, Cambridge, MA
October 21, 2022
Abstract
We present and analyze the Krylov–Bellman Boosting (KBB) algorithm for policy
evaluation in general state spaces. It alternates between fitting the Bellman residual us-
ing non-parametric regression (as in boosting), and estimating the value function via the
least-squares temporal difference (LSTD) procedure applied with a feature set that grows
adaptively over time. By exploiting the connection to Krylov methods, we equip this
method with two attractive guarantees. First, we provide a general convergence bound
that allows for separate estimation errors in residual fitting and LSTD computation. Con-
sistent with our numerical experiments, this bound shows that convergence rates depend
on the restricted spectral structure, and are typically super-linear. Second, by combin-
ing this meta-result with sample-size dependent guarantees for residual fitting and LSTD
computation, we obtain concrete statistical guarantees that depend on the sample size
along with the complexity of the function class used to fit the residuals. We illustrate
the behavior of the KBB algorithm for various types of policy evaluation problems, and
typically find large reductions in sample complexity relative to the standard approach of
fitted value iteration.
1 Introduction
The past decade has witnessed tremendous success with applications of reinforcement learning
(e.g., [TFR+17, LFDA16, SHM+16]), but many challenges remain. Notably, many procedures
in reinforcement learning are very “data hungry”, meaning that large sample sizes—which may
or may not be available in a given application—are often required. Given a data-dependent
procedure, its sample complexity characterizes the number of samples required, as a function
of the problem parameters, to achieve a desired error tolerance. Given
the cost associated with sampling (or lack of availability of large sample sizes), an important
problem is to develop RL algorithms with the lowest possible sample complexity.
In this paper, we focus on the problem of policy evaluation for Markov decision pro-
cesses defined over general state spaces. Policy evaluation occupies a central role in rein-
forcement learning. It is of interest both in its own right and as a key subroutine in many
algorithms. For instance, classical algorithms for computing optimal policies, such as pol-
icy iteration [How60], require repeated rounds of policy evaluation; iterative schemes like
policy gradient methods [Wil92, SLH+14, SMSM99, Kak01] and actor-critic schemes [KT01,
MBM+16, WMG+17, BSGL09] involve maintaining estimates of the value function as a part
of the algorithm.
Policy evaluation in discrete state spaces, known as the tabular setting, is very well-
understood (e.g., [Ber17a,Bel57,PB79]); notably, recent work has exhibited various pro-
cedures, for both batch and on-line settings, that achieve fundamental lower bounds in a
strong instance-optimal sense [KPR+20, PW21]. Far more challenging are Markov decision
processes involving continuous state spaces. In this setting, there are at least two broad
classes of approaches. One approach is based on using a suitably “rich” non-parametric func-
tion class, such as regression trees, reproducing kernel Hilbert spaces, or neural networks, to
repeatedly refit an estimate of the value function; procedures of this type are known collec-
tively as instantiations of fitted value iteration (FVI) [EGW05]. The FVI algorithm along
with its close relative fitted Q-iteration (FQI), has been the focus of a long line of work
(e.g., [ASM07, MS08, SGG+15, CJ19, FWXY20, DW20]).
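For concreteness, a minimal sketch of a single FVI round is given below; the function name `fvi_round`, the use of gradient-boosted trees as the regressor, and the array format of the samples are illustrative assumptions rather than the specific instantiation studied in this paper.

```python
from sklearn.ensemble import GradientBoostingRegressor

def fvi_round(V_hat, states, rewards, next_states, gamma):
    """One round of fitted value iteration (FVI), sketched.

    Regress the empirical Bellman targets r_i + gamma * V_hat(x_i') on the
    states x_i (arrays of shape (n, d)), returning the refitted estimate.
    """
    targets = rewards + gamma * V_hat(next_states)
    reg = GradientBoostingRegressor().fit(states, targets)
    return reg.predict  # new value-function estimate, applied to a batch of states
```

In the idealized, error-free case, iterating this update contracts toward the true value function at the geometric rate $\gamma$; the statistical error incurred by each regression is what produces the error floors visible in our experiments.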
Another classical approach to approximate policy evaluation, even simpler than fitted value
iteration, is via the least-squares temporal difference method, or LSTD for short [BB96,SB18,
Boy02]. Given a finite collection of features or basis functions, it computes the value function
within the associated linear span that provides the “best fit”, as formalized in terms of a
projected fixed point (e.g., [BY09, YB10]). There has been a line of work analyzing algorithms
for computing the LSTD solution [TVR97,SMS08,BRS18,LS18], as well as more general
settings involving projected fixed-point equations [MPW20,Wan22,DWW21]. This line of
analysis, for the most part, takes as input a fixed set of features. In practice, the quality of
the LSTD fit depends strongly on the given basis, and finding a good basis is often a difficult
task. Often, features are chosen based on knowledge of the specific problem domain; there is
also a line of past work [KMP06,MMS03] on automated procedures for feature construction.
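For reference, a minimal sample-based version of the LSTD computation might look as follows; the feature map `phi`, the triple format $(x_i, r_i, x_i')$, and the small ridge term are illustrative assumptions rather than the exact estimator analyzed later in the paper.

```python
import numpy as np

def lstd(phi, states, rewards, next_states, gamma, ridge=1e-8):
    """Sample-based LSTD: solve A w = b for the weights of the linear
    approximation V(x) ~ phi(x) @ w, where

        A = (1/n) * sum_i phi(x_i) (phi(x_i) - gamma * phi(x_i'))^T
        b = (1/n) * sum_i phi(x_i) r_i.
    """
    F, Fp = phi(states), phi(next_states)   # (n, d) feature matrices
    n, d = F.shape
    A = F.T @ (F - gamma * Fp) / n
    b = F.T @ rewards / n
    # A small ridge term guards against a nearly singular A at finite n.
    return np.linalg.solve(A + ridge * np.eye(d), b)
```

The quality of the resulting fit is determined entirely by the span of `phi`, which is precisely the feature-construction issue that motivates the approach taken in this paper.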
1.1 Our contributions
In this paper, we propose and analyze a procedure for approximate policy evaluation that
combines the LSTD framework with the boosting approach of fitting residuals. The method
estimates the value function by computing a sequence of LSTD solutions, where the basis
used at each round is augmented with a non-parametric estimate of the Bellman residual
from the previous round. The LSTD steps can be interpreted as computing a sequence of
Krylov subspaces, and accordingly we refer to our method as the Krylov–Bellman boosting
algorithm, or KBB for short.
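In rough pseudocode, the main loop can be sketched as follows. This is an illustrative sketch only, not the precise algorithm specified in Section 3; it reuses the hypothetical `lstd` helper from the previous sketch, and the constant initial feature, the gradient-boosted regressor, and the fixed round budget are assumptions made for concreteness.

```python
import numpy as np
from sklearn.ensemble import GradientBoostingRegressor

def kbb(states, rewards, next_states, gamma, num_rounds):
    """Sketch of the Krylov-Bellman boosting (KBB) loop.

    Alternates between (i) an LSTD solve within the span of the current
    basis and (ii) a non-parametric fit of the Bellman residual, which is
    then appended to the basis for the next round.
    """
    basis = []  # fitted residual regressors, one per completed round

    def phi(x):
        # Feature map: a constant feature plus one column per fitted residual.
        return np.column_stack([np.ones(len(x))] + [g.predict(x) for g in basis])

    V = None
    for _ in range(num_rounds):
        d = len(basis) + 1                            # current number of features
        w = lstd(phi, states, rewards, next_states, gamma)
        V = lambda x, w=w, d=d: phi(x)[:, :d] @ w     # current LSTD estimate
        # Bellman residual targets r_i + gamma * V(x_i') - V(x_i), fit by boosting.
        resid = rewards + gamma * V(next_states) - V(states)
        basis.append(GradientBoostingRegressor().fit(states, resid))
    return V
```

Each fitted residual plays the role of a new (approximate) Krylov direction, which is what drives the super-linear convergence discussed below.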
We note that boosting methods—that is, using non-parametric procedures to fit the Bell-
man residual—have been studied in past work on reinforcement learning tasks [AAD+16]. In
our method, however, the combination with LSTD is equally important. As shown in our
theoretical analysis, it is the combination of boosting with LSTD that allows us to establish
convergence guarantees for KBB that are provably superior to those of fitted value iteration
(FVI). Consistent with this theory, we also see the gains of KBB over FVI in our experi-
mental studies. In the tabular setting, an idealized form of the KBB algorithm with known
transition dynamics—and hence no error in fitting the residuals—has been studied in past
work [PPWLL07, PLT+08], in which context it is called the method of Bellman error basis
functions (BEBF). However, for general state spaces, it is unrealistic to assume that residu-
als can be computed without error; for continuous state spaces this task can be challenging
even if the model dynamics are known, since it involves computing the integral that defines
the Bellman update. Even more challenging—and the focus of this paper—is the setting of
unknown dynamics, in which case any estimate of the residual is subject to statistical error
due to a finite sample size. It has been noted [Pet07] that the BEBF method is equivalent
to a Krylov subspace method [Saa81,Saa03] for solving a finite-dimensional linear system.
This connection to Krylov methods plays an important role in our paper; as opposed to the
classical use of Krylov methods in solving finite-dimensional linear systems, in this paper
we tackle the more general problem of solving a linear operator equation, which (for general
continuous state spaces) corresponds to an infinite-dimensional problem.
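To make this contrast concrete, recall the standard tabular picture underlying the Krylov connection (stated here as a sketch in conventional notation, with $\mathbf{P}$ the transition matrix, $r$ the reward vector, and $\gamma \in (0,1)$ the discount factor): the value function solves the linear system
\[
(I - \gamma \mathbf{P})\, V = r,
\]
and a Krylov method at round $k$ returns the best approximation to $V$ within the subspace
\[
\mathcal{K}_k := \mathrm{span}\bigl\{ r,\; (I - \gamma \mathbf{P})\, r,\; \ldots,\; (I - \gamma \mathbf{P})^{k-1} r \bigr\}.
\]
For general state spaces, $\mathbf{P}$ becomes a linear operator on $L^2(X, \mu)$, and each application of $I - \gamma \mathbf{P}$ must be estimated from data rather than computed exactly.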
The primary contributions of this paper take the form of two theorems. Our first main
result, stated as Theorem 1, is a general convergence guarantee for the KBB algorithm. It
involves a sequence of so-called restricted spectral values (cf. equation (14)) that track how
the effective conditioning of the residual problem improves as the algorithm proceeds. This
guarantee should be viewed as a “meta-result”, since it applies to arbitrary procedures used to
estimate the Bellman residual and to approximate the LSTD solutions, with the convergence
guarantee depending upon these two associated errors. In particular, Theorem 1 applies both
when the transition dynamics are known and numerical integration is used to approximate
the Bellman residual at each round, and in the reinforcement learning setting where the
dynamics are unknown but samples from the Markov chain are available. Our second set of
results, stated as Theorem 2 and Corollary 1, provides guarantees for this latter RL setting,
and in particular, for independent samples of state, next-state and reward triples. Under
this set-up, Theorem 2 provides an upper bound on the regression error for a general function
class, whereas Corollary 1 combines this guarantee with Theorem 1 to provide an “end-to-end”
guarantee for a particular instantiation of the KBB algorithm.
1.2 A preview
In order to whet the reader’s appetite, let us provide an illustrative preview of the results
to come. In Figure 1, we compare the performance of three different methods for policy
evaluation: the classical value iteration algorithm (green); a version of fitted value iteration
Figure 1. Plots of log error in $\|\cdot\|_\mu$-norm versus iteration number for three algorithms for policy
evaluation: exact value iteration with known dynamics (red), fitted value iteration (green), and
Krylov–Bellman boosting (blue). (a) “Circular” corresponds to an MRP arising from a random
walk on the circle (cf. Section 4.1) with discount factor $\gamma = 0.90$. (b) A non-linear system
(cf. Section 4.3) with discount factor $\gamma = 0.99$.
(FVI) (red), and the KBB algorithm analyzed in this paper (blue). In our implementations of
both KBB and FVI, we used a gradient boosting procedure to fit the residual at each round
(KBB), or the value function at each round (FVI). In each case, we used the same number
of samples for each fit, so that for a fixed iteration number, both procedures use the same
sample size. In contrast, classical value iteration is predicated upon exact knowledge of the
dynamics, and computes Bellman updates exactly.
In each panel, we plot the log error in evaluating a given policy versus the number of
iterations. Panel (a) corresponds to a simple tabular problem that arises from a random walk
on the circle (see Section 4.1 for details), whereas panel (b) corresponds to a Markov reward
process arising from a non-linear dynamical system with a continuous state space (see Section 4.3
for details). Consistent with extant theory, both classical and fitted value iteration converge
at a geometric rate, as shown by the linear nature of these plots of log error versus iteration
number. In contrast, the KBB algorithm exhibits super-linear convergence up to an error
floor, which arises from the statistical errors in evaluating residuals, and is consistent with
our theory (cf. Corollary 1). In the non-linear setting, we have also included a plot of the KBB
algorithm run with fewer samples; it still exhibits super-linear convergence, albeit with a higher
error floor.
In practice, of primary importance in choosing between the FVI and KBB algorithms are
their respective sample complexities—that is, how many samples are required to achieve a
specified accuracy? Figure 2 provides a comparison of the sample complexities of these two
procedures for five different types of Markov reward processes. Panels (a) and (b) compare
the sample complexity required to reduce the error by factors of $1/2$ and $1/10$, respectively.
For each model within each panel, we rescale the KBB sample size so that it remains constant
Figure 2. Comparison of the sample complexities of Krylov–Bellman boosting (KBB) to fitted
value iteration (FVI), as measured by the (relative) sample size required to reduce the error
by a factor of $1/2$ (panel (a)) or $1/10$ (panel (b)). Both algorithms are applied to five different
types of Markov reward processes, as indicated on the horizontal axis. For each MRP, the bar
plots compare the relative number of samples used, with the KBB bars rescaled to remain at
a constant level across problem types.
across models; we apply this same scaling to the FVI sample size, so the relative heights of
the bars are the relevant comparison. In panel (a), we see that the KBB procedure requires
roughly 4–5 times fewer samples than the FVI procedure. As shown in panel (b), the savings
are even more dramatic when a smaller error tolerance is required.
1.3 Paper organization and notation
The remainder of the paper is organized as follows. Section 2 is dedicated to the background
necessary to understand our algorithm and results, including Markov reward processes, linear
function approximation and the least-squares temporal difference (LSTD) estimate, along
with some background on Krylov subspace methods for solving linear systems. In Section 3,
we describe the Krylov–Bellman boosting (KBB) algorithm, and then state and interpret our
two main guarantees (Theorem 1 and Theorem 2) on its behavior. In Section 4, we provide
numerical results that illustrate the behavior of the KBB algorithm in various settings, along
with comparisons to fitted value iteration. Section 5 is devoted to the proofs of our main
results, with more technical details deferred to the appendices. We conclude with a discussion
in Section 6.
Notation: Let us summarize some notation used throughout the paper. For a probability
distribution $\mu$, let $L^2(X, \mu)$ denote the space of real-valued functions with domain $X$, square
integrable with respect to $\mu$. This space is equipped with the inner product
\[
\langle f, g \rangle_\mu := \int_X f(x)\, g(x)\, d\mu(x) \;\equiv\; \mathbb{E}_\mu\bigl[ f(X)\, g(X) \bigr],
\]
along with the norm $\|f\|_\mu := \sqrt{\langle f, f \rangle_\mu}$. Given a subspace $S \subset L^2(X, \mu)$, we write $v \perp_\mu S$ to
indicate that $v$ is orthogonal to $S$ with respect to the inner product $\langle \cdot, \cdot \rangle_\mu$, and $S^{\perp}$ to
denote the set of all such vectors $v$. For an operator $Q$, we write $Q \succeq 0$ to mean $\langle f, Qf \rangle_\mu \geq 0$
for all $f \in L^2(X, \mu)$, and we say that $Q$ is a positive operator for short. Given two operators
$P$ and $Q$, we write $P \succeq Q$ to indicate that $P - Q \succeq 0$. Given samples $\{x_i\}_{i=1}^n$, we define
the empirical norm $\|f\|_n := \sqrt{\tfrac{1}{n} \sum_{i=1}^n f^2(x_i)}$; note that it is a random quantity due to its
dependence on the underlying samples. Throughout the paper, we use $(c, c', c_1, c_2)$, etc., to
denote universal constants whose values may change from line to line.
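As a small numerical illustration of the empirical norm (a hypothetical snippet for intuition only, not part of any algorithm in the paper), one can check that for samples drawn from $\mu$, the empirical norm $\|f\|_n$ approximates the population norm $\|f\|_\mu$:

```python
import numpy as np

def empirical_norm(f, samples):
    # ||f||_n = sqrt( (1/n) * sum_i f(x_i)^2 ) over the given samples.
    vals = f(samples)
    return np.sqrt(np.mean(vals ** 2))

# For f(x) = x and x_i drawn i.i.d. from N(0, 1), ||f||_mu = 1, and the
# empirical norm concentrates around that value as n grows.
x = np.random.default_rng(0).normal(size=10_000)
print(empirical_norm(lambda s: s, x))   # approximately 1.0
```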
2 Background
Here we describe some background necessary to explain and motivate our results. In Sec-
tion 2.1 we describe the basic setup of the problem of interest, solving for the value function
of a Markov reward process, as well as the sampling model to which we have access. Sec-
tion 2.2 then describes an approximate method for solving for the value function as a linear
combination of given basis functions. Finally, in Section 2.3 we describe Krylov subspace
methods for solving linear equations.
2.1 Markov decision and reward processes
A Markov decision process consists of a state space X, an action space A, a reward function R,
along with a collection of probability transition functions. In this paper, we focus exclusively
on the policy evaluation procedure, in which case a given policy is fixed up front. For a fixed
policy, a Markov decision process reduces to a Markov reward process (MRP), which can be
characterized more simply in terms of the state space X, a probability transition kernel $\mathbf{P}$,
and a reward function.
In a Markov reward process, the states evolve over time according to the transition kernel
$\mathbf{P}$. At each time $t = 1, 2, \ldots$, when in state $x_t$, the next state $x_{t+1}$ is generated by
sampling from the distribution $\mathbf{P}(\cdot \mid x_t)$.
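To fix ideas, the object that policy evaluation targets is the discounted value function of the MRP. In one standard convention (notation assumed here for illustration, consistent with the discount factors $\gamma$ quoted earlier), it is defined by
\[
V^{*}(x) := \mathbb{E}\Bigl[\, \sum_{t=1}^{\infty} \gamma^{t-1}\, R(x_t) \;\Big|\; x_1 = x \Bigr],
\qquad \text{which satisfies} \qquad
V^{*} = R + \gamma\, \mathbf{P} V^{*},
\]
where the expectation is over trajectories of the Markov reward process, $(\mathbf{P} V)(x) := \mathbb{E}[V(x_{t+1}) \mid x_t = x]$, and the second display is the Bellman fixed-point equation that the KBB algorithm is designed to solve approximately.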