MBM+16,WMG+17,BSGL09] involve maintaining estimates of the value function as a part
of the algorithm.
Policy evaluation in discrete state spaces, known as the tabular setting, is very well-
understood (e.g., [Ber17a,Bel57,PB79]); notably, recent work has exhibited various pro-
cedures, for both batch and on-line settings, that achieve fundamental lower bounds in a
strong instance-optimal sense [KPR+20,PW21]. Far more challenging are Markov decision
processes involving continuous state spaces. In this setting, there are at least two broad
classes of approaches. One approach is based on using a suitably “rich” non-parametric func-
tion class, such as regression trees, reproducing kernel Hilbert spaces, or neural networks, to
repeatedly refit an estimate of the value function; procedures of this type are known collec-
tively as instantiations of fitted value iteration (FVI) [EGW05]. The FVI algorithm, along
with its close relative fitted Q-iteration (FQI), has been the focus of a long line of work
(e.g., [ASM07,MS08,SGG+15,CJ19,FWXY20,DW20]).
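To fix ideas, the following is a minimal Python sketch of a generic FVI loop for policy evaluation; the choice of regressor, the sample format, and the discount factor are illustrative assumptions rather than details taken from the works cited above.

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor  # stand-in for any "rich" regressor

def fitted_value_iteration(transitions, gamma=0.9, num_rounds=50):
    """Generic FVI sketch for policy evaluation from sampled transitions
    (state, reward, next_state) collected under the policy being evaluated;
    states are assumed to be numpy feature vectors."""
    states = np.array([s for (s, _, _) in transitions])
    rewards = np.array([r for (_, r, _) in transitions])
    next_states = np.array([s2 for (_, _, s2) in transitions])

    value_fn = lambda X: np.zeros(len(X))        # start from the zero function
    for _ in range(num_rounds):
        # Regression targets approximate the Bellman update r(s) + gamma * V(s').
        targets = rewards + gamma * value_fn(next_states)
        model = DecisionTreeRegressor(max_depth=6).fit(states, targets)
        value_fn = model.predict                 # refit estimate replaces the old one
    return value_fn
```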
Another classical approach to approximate policy evaluation, even simpler than fitted value
iteration, is via the least-squares temporal difference method, or LSTD for short [BB96,SB18,
Boy02]. Given a finite collection of features or basis functions, it computes the value function
within the associated linear span that provides the “best fit”, as formalized in terms of a
projected fixed point (e.g., [BY09,YB10]). There has been a line of work analyzing algorithms
for computing the LSTD solution [TVR97,SMS08,BRS18,LS18], as well as more general
settings involving projected fixed-point equations [MPW20,Wan22,DWW21]. This line of
analysis, for the most part, takes as input a fixed set of features. In practice, the quality of
the LSTD fit depends strongly on the given basis, and finding a good basis is often a difficult
task. Typically, features are chosen based on knowledge of the specific problem domain; there is
also a line of past work [KMP06,MMS03] on automated procedures for feature construction.
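For concreteness, a minimal sketch of the batch LSTD computation is given below, assuming access to a user-supplied feature map and a set of sampled transitions; the feature-map interface and the small ridge term are assumptions made for illustration.

```python
import numpy as np

def lstd(phi, transitions, gamma=0.9, ridge=1e-6):
    """LSTD sketch: solve A w = b, where
    A = sum_t phi(s_t) (phi(s_t) - gamma * phi(s_t'))^T  and  b = sum_t r_t phi(s_t),
    yielding the projected-fixed-point solution within the span of the features."""
    d = len(phi(transitions[0][0]))
    A = ridge * np.eye(d)            # small ridge term for numerical stability
    b = np.zeros(d)
    for (s, r, s_next) in transitions:
        f, f_next = phi(s), phi(s_next)
        A += np.outer(f, f - gamma * f_next)
        b += r * f
    w = np.linalg.solve(A, b)
    return lambda s: phi(s) @ w      # value estimate within the linear span
```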
1.1 Our contributions
In this paper, we propose and analyze a procedure for approximate policy evaluation that
combines the LSTD framework with the boosting approach of fitting residuals. The method
estimates the value function by computing a sequence of LSTD solutions, where the basis
used at each round is augmented with a non-parametric estimate of the Bellman residual
from the previous round. The LSTD steps can be interpreted as computing a sequence of
Krylov subspaces, and accordingly we refer to our method as the Krylov–Bellman boosting
algorithm, or KBB for short.
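A schematic rendering of this loop is sketched below, reusing the hypothetical lstd routine from above; the gradient-boosted residual regressor and the sample-based residual targets are illustrative assumptions, not the exact estimator analyzed in this paper.

```python
import numpy as np
from sklearn.ensemble import GradientBoostingRegressor

def krylov_bellman_boosting(transitions, num_rounds=10, gamma=0.9):
    """KBB sketch: at each round, fit the Bellman residual of the current
    estimate non-parametrically and add it to the basis used by LSTD.
    States are assumed to be numpy feature vectors."""
    states = np.array([s for (s, _, _) in transitions])
    rewards = np.array([r for (_, r, _) in transitions])
    next_states = np.array([s2 for (_, _, s2) in transitions])

    basis = [lambda s: 1.0]                  # trivial initial basis function
    value_fn = lambda s: 0.0                 # initial value estimate
    for _ in range(num_rounds):
        # Sample-based residual targets:  T(V)(s) - V(s) ~ r + gamma * V(s') - V(s).
        v_s = np.array([value_fn(s) for s in states])
        v_next = np.array([value_fn(s) for s in next_states])
        residuals = rewards + gamma * v_next - v_s
        g = GradientBoostingRegressor().fit(states, residuals)
        basis.append(lambda s, g=g: g.predict(s.reshape(1, -1))[0])
        # Recompute the LSTD solution over the augmented basis (see lstd above).
        phi = lambda s, B=tuple(basis): np.array([f(s) for f in B])
        value_fn = lstd(phi, transitions, gamma=gamma)
    return value_fn
```

Each round enlarges the basis by one fitted residual, so the successive LSTD fits are computed over a growing sequence of (approximate) Krylov subspaces.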
We note that boosting methods—that is, using non-parametric procedures to fit the Bell-
man residual—have been studied in past work on reinforcement learning tasks [AAD+16]. In
our method, however, the combination with LSTD is equally important. As shown in our
theoretical analysis, it is the combination of boosting with LSTD that allows us to establish
convergence guarantees for KBB that are provably superior to those of fitted value iteration
(FVI). Consistent with this theory, we also see the gains of KBB over FVI in our experi-
mental studies. In the tabular setting, an idealized form of the KBB algorithm with known
transition dynamics—and hence no error in fitting the residuals—has been studied in past
work [PPWLL07,PLT+08], in which context it is called the method of Bellman error basis
functions (BEBF). However, for general state spaces it is unrealistic to assume that residu-
als can be computed without error; for continuous state spaces this task can be challenging
even if the model dynamics are known, since it involves computing the integral that defines
the Bellman update. Even more challenging—and the focus of this paper—is the setting of
unknown dynamics, in which case any estimate of the residual is subject to statistical error
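Concretely, writing r for the reward function, P for the transition kernel of the policy, and gamma in (0, 1) for the discount factor (generic notation, stated here only for illustration), the Bellman update at a state s takes the form
\[
(\mathcal{T} V)(s) \;=\; r(s) + \gamma \int V(s') \, \mathbf{P}(ds' \mid s),
\]
so that computing even a single exact value of the residual \(\mathcal{T} V - V\) requires integrating over the next-state distribution.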