Bias and Extrapolation in Markovian Linear Stochastic
Approximation with Constant Stepsizes
Dongyan (Lucy) Huo,§ Yudong Chen,† Qiaomin Xie,‡
§School of Operations Research and Information Engineering, Cornell University
†Department of Computer Sciences, University of Wisconsin-Madison
‡Department of Industrial and Systems Engineering, University of Wisconsin-Madison
Abstract
We consider Linear Stochastic Approximation (LSA) with constant stepsize and Markovian data.
Viewing the joint process of the data and LSA iterate as a time-homogeneous Markov chain, we prove
its convergence to a unique limiting and stationary distribution in Wasserstein distance and establish
non-asymptotic, geometric convergence rates. Furthermore, we show that the bias vector of this limit
admits an infinite series expansion with respect to the stepsize. Consequently, the bias is proportional
to the stepsize up to higher order terms. This result stands in contrast with LSA under i.i.d. data, for
which the bias vanishes. In the reversible chain setting, we provide a general characterization of the
relationship between the bias and the mixing time of the Markovian data, establishing that they are
roughly proportional to each other.
While Polyak-Ruppert tail-averaging reduces the variance of the LSA iterates, it does not affect the bias.
The above characterization allows us to show that the bias can be reduced using Richardson-Romberg
extrapolation with $m \ge 2$ stepsizes, which eliminates the $m - 1$ leading terms in the bias expansion.
This extrapolation scheme leads to an exponentially smaller bias and an improved mean squared error,
both theoretically and empirically. Our results immediately apply to the Temporal Difference learning
algorithm with linear function approximation, and to stochastic gradient descent applied to quadratic
functions.
1 Introduction
We consider the following Linear Stochastic Approximation (LSA) iteration driven by Markovian data:
$$\theta_{k+1} = \theta_k + \alpha\bigl(A(x_k)\theta_k + b(x_k)\bigr), \qquad k = 0, 1, 2, \ldots,$$
where $(x_k)_{k\ge 0}$ is a Markov chain representing the underlying data stream, $A$ and $b$ are deterministic functions,
and $\alpha > 0$ is a constant stepsize. LSA is an iterative data-driven procedure for approximating the solution
$\theta^*$ to the linear fixed point equation $\bar{A}\theta^* + \bar{b} = 0$, where $\bar{A} := \sum_i \pi_i A(i)$, $\bar{b} := \sum_i \pi_i b(i)$, and $\pi$ is the unique
stationary distribution of the chain $(x_k)_{k\ge 0}$.
Stochastic Approximation (SA), which uses stochastic updates to solve fixed-point equations, is a fun-
damental algorithmic paradigm in many areas including stochastic control and filtering [KY03,Bor08],
approximate dynamic programming and reinforcement learning (RL) [Ber19,SB18]. For example, the cele-
brated Temporal Difference (TD) learning algorithm [Sut88] in RL, potentially equipped with linear function
approximation, is a special case of LSA and an important algorithmic primitive in RL. Variants of the TD
algorithm such as TD(λ) and Gradient TD can also be cast as LSA [LS18]. Another important special case
of SA is stochastic gradient descent (SGD). When applied to a quadratic objective function, SGD can be
viewed as an LSA procedure.
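To make the iteration concrete, here is a minimal sketch of LSA driven by a small Markov chain. The two-state kernel `P` and the maps `A(·)`, `b(·)` below are illustrative assumptions of ours (chosen so that the averaged matrix $\bar{A}$ is Hurwitz), not taken from the paper:

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical two-state Markov chain with transition kernel P.
P = np.array([[0.9, 0.1],
              [0.2, 0.8]])

# Deterministic maps A(x), b(x) per state x; values chosen so the
# averaged matrix A_bar is Hurwitz, making the iteration stable.
A = {0: np.array([[-1.0, 0.2], [0.0, -0.5]]),
     1: np.array([[-0.8, 0.0], [0.1, -1.2]])}
b = {0: np.array([1.0, 0.0]),
     1: np.array([0.0, 1.0])}

def lsa(alpha, n_steps, x0=0, theta0=None):
    """Run theta_{k+1} = theta_k + alpha*(A(x_k) theta_k + b(x_k))."""
    theta = np.zeros(2) if theta0 is None else theta0.copy()
    x = x0
    for _ in range(n_steps):
        theta = theta + alpha * (A[x] @ theta + b[x])
        x = rng.choice(2, p=P[x])  # advance the underlying Markov chain
    return theta
```

With a small constant stepsize, a long run lands in a neighborhood of the target $\theta^* = -\bar{A}^{-1}\bar{b}$, with fluctuations and a stepsize-dependent bias as characterized below.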
Classical work on SA focuses on settings with diminishing stepsizes, which allow for asymptotic convergence of $\theta_k$ to $\theta^*$ [RM51, Blu54, BM00]. More recently, SA with constant stepsizes has attracted
Emails: dh622@cornell.edu, yudong.chen@wisc.edu, qiaomin.xie@wisc.edu
arXiv:2210.00953v3 [stat.ML] 21 Aug 2023
attention due to its simplicity, fast convergence, and good empirical performance. A growing line of
work has been devoted to this setting and established non-asymptotic results valid for finite values of $k$
[LS18, SY19, CMSS21b, BRS21]. These results provide upper bounds on the mean-squared error (MSE)
$\mathbb{E}\|\theta_k - \theta^*\|^2$; such bounds typically consist of the sum of two terms: a finite-time term¹ that decays with
$k$, and a steady-state MSE bound that is independent of $k$.
In this work, we study LSA with constant stepsizes through the lens of Markov chain theory. This
perspective allows us to delineate the convergence and distributional properties of LSA viewed as a stochastic
process. Consequently, we provide a more precise characterization of the MSE in terms of the decomposition
$$\mathbb{E}\|\theta_k - \theta^*\|^2 \;\asymp\; \underbrace{\|\mathbb{E}\theta_k - \mathbb{E}\theta_\infty^{(\alpha)}\|^2}_{\text{optimization error}} \;+\; \underbrace{\operatorname{Var}(\theta_k)}_{\text{variance}} \;+\; \underbrace{\|\mathbb{E}\theta_\infty^{(\alpha)} - \theta^*\|^2}_{(\text{asymptotic bias})^2},$$
where the random variable $\theta_\infty^{(\alpha)}$ denotes the limit (as $k \to \infty$) of the LSA iterate $\theta_k$ with stepsize $\alpha$. Our
main results, summarized below, characterize the behavior of the above three terms.
Convergence and optimization error. With a constant stepsize $\alpha$, the process $(x_k, \theta_k)_{k\ge 0}$ is a time-homogeneous Markov chain. Under appropriate conditions, we show that the sequence $(x_k, \theta_k)$ converges
to a unique limiting random variable $(x_\infty, \theta_\infty^{(\alpha)})$ in distribution and in $W_2$, the Wasserstein distance of order
2. Moreover, the distribution of $(x_\infty, \theta_\infty^{(\alpha)})$ is the unique stationary distribution of the chain $(x_k, \theta_k)_{k\ge 0}$.
We provide non-asymptotic bounds on the distributional distance between $\theta_k$ and $\theta_\infty^{(\alpha)}$ in $W_2$, which in turn
upper bounds the optimization error $\|\mathbb{E}\theta_k - \mathbb{E}\theta_\infty^{(\alpha)}\|$. Both bounds decay exponentially in $k$ thanks to the use
of a constant stepsize. We emphasize that the existence of the limit $\theta_\infty^{(\alpha)}$ and the convergence rate cannot
be deduced from existing upper bounds on the MSE $\mathbb{E}\|\theta_k - \theta^*\|^2$, which does not vanish as $k \to \infty$.
Variance and asymptotic bias. The variance $\operatorname{Var}(\theta_k)$ is of order $O(1)$ as $k$ grows. The variance can
be made vanishing by averaging the LSA iterates. For example, the Polyak-Ruppert tail-averaged iterate
$\bar{\theta}_k := \frac{1}{k/2}\sum_{t=k/2}^{k-1} \theta_t$ has variance of order $O(1/k)$. Consequently, for large $k$, the MSE of $\bar{\theta}_k$ is dominated by
the asymptotic bias, i.e., $\mathbb{E}\|\bar{\theta}_k - \theta^*\|^2 \approx \|\mathbb{E}\bar{\theta}_\infty - \theta^*\|^2 = \|\mathbb{E}\theta_\infty^{(\alpha)} - \theta^*\|^2$. Our second main result establishes
that the asymptotic bias is proportional to the stepsize $\alpha$ (up to a second order term):
$$\mathbb{E}\theta_\infty^{(\alpha)} - \theta^* = \alpha B^{(1)} + O(\alpha^2), \tag{1.1}$$
where $B^{(1)}$ is a vector independent of $\alpha$ that admits an explicit expression in terms of $A$, $b$ and the transition
kernel $P$ of the underlying Markov chain $(x_k)_{k\ge 0}$. Crucially, the first order term in equation (1.1) is an
equality rather than an upper bound. This equality implies that the asymptotic bias is not affected by
averaging the LSA iterates.
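As a concrete illustration of the tail-averaging step (a minimal sketch; the helper name is ours), the Polyak-Ruppert tail average simply discards the first half of a trajectory and averages the rest:

```python
import numpy as np

def tail_average(iterates):
    """Polyak-Ruppert tail average: bar_theta_k = (1/(k/2)) * sum of the last
    half of the iterates. Discarding the first half removes most of the
    transient (optimization error); averaging shrinks the variance to O(1/k),
    but, as discussed above, leaves the asymptotic bias untouched."""
    iterates = np.asarray(iterates)
    k = len(iterates)
    return iterates[k // 2:].mean(axis=0)
```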
Bias expansion and extrapolation. While the asymptotic bias persists under Polyak-Ruppert averaging,
the equality (1.1) implies that the bias can be reduced using a simple technique called Richardson-Romberg (RR)
extrapolation: run LSA with two stepsizes $\alpha$ and $2\alpha$, compute the respective averaged iterates $\bar{\theta}_k^{(\alpha)}$ and $\bar{\theta}_k^{(2\alpha)}$,
and output their linear combination $\widetilde{\theta}_k^{(\alpha)} := 2\bar{\theta}_k^{(\alpha)} - \bar{\theta}_k^{(2\alpha)}$. Doing so cancels out the leading term in the bias
characterization (1.1), resulting in an order-wise smaller bias $\mathbb{E}\widetilde{\theta}_\infty^{(\alpha)} - \theta^* = O(\alpha^2)$.
In fact, the bias characterization (1.1) can be generalized to higher orders. We establish that the bias
admits the following infinite series expansion:
$$\mathbb{E}\theta_\infty^{(\alpha)} - \theta^* = \alpha B^{(1)} + \alpha^2 B^{(2)} + \alpha^3 B^{(3)} + \cdots, \tag{1.2}$$
where the vectors $\{B^{(i)}\}$ are independent of $\alpha$. Consequently, RR extrapolation can be executed with $m \ge 2$
stepsizes to eliminate the $m - 1$ leading terms in equation (1.2), reducing the bias to the order $O(\alpha^m)$.
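A sketch of how the $m$-stepsize extrapolation weights can be obtained (the function name is ours): given estimates at stepsizes $\alpha, 2\alpha, \ldots, m\alpha$, weights $c_1, \ldots, c_m$ that cancel the first $m-1$ terms of an expansion like (1.2) solve a small Vandermonde system, $\sum_j c_j = 1$ and $\sum_j c_j\, j^i = 0$ for $i = 1, \ldots, m-1$:

```python
import numpy as np

def rr_coefficients(m):
    """Richardson-Romberg weights c_1..c_m for estimates at stepsizes
    alpha, 2*alpha, ..., m*alpha: the combination sum_j c_j * est(j*alpha)
    cancels the alpha^1..alpha^{m-1} bias terms. Rows of V are the moments
    j^0, j^1, ..., j^{m-1}; we solve V c = (1, 0, ..., 0)."""
    V = np.vander(np.arange(1, m + 1), m, increasing=True).T
    e = np.zeros(m)
    e[0] = 1.0
    return np.linalg.solve(V, e)
```

For $m = 2$ this recovers the familiar combination $2\bar{\theta}^{(\alpha)} - \bar{\theta}^{(2\alpha)}$, i.e., weights $(2, -1)$; for $m = 3$ it gives $(3, -3, 1)$.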
¹This term is sometimes called “(finite-time) bias” in prior work. It should not be confused with the asymptotic bias discussed below.
Put together, the above results show that the combination of constant stepsize, averaging, and extrapolation allows one to approach the best of three worlds: (a) using a constant stepsize leads to fast,
exponential-in-$k$ convergence of the optimization error, (b) tail-averaging eliminates the variance at an (optimal) $1/k$ rate, and (c) RR extrapolation order-wise reduces the asymptotic bias. We highlight that the
$m$ iterate sequences used in RR extrapolation can be computed in parallel, using the same data stream
$(x_k)_{k\ge 0}$. Compared with standard LSA, the above combined procedure is data efficient (in terms of the
sample complexity $k$ for achieving a given MSE), does not require sophisticated tuning of the stepsize, and
incurs a minimal increase in the computational cost.
The results above should be contrasted with the setting of LSA with i.i.d. data, where the $x_k$'s are
sampled independently from some distribution $\pi$. In this setting, it has been shown (sometimes implicitly)
in existing work that the asymptotic bias is zero [LS18, MLW+20]. Similar results are known for SGD
applied to a quadratic function with gradients contaminated by independent noise [BM13, DDB20]. It is
perhaps surprising that using Markovian data leads to a non-zero bias, even when the LSA iteration is linear
in $\theta_k$. To provide intuition, we plot the dependency graphs for LSA with i.i.d. data and Markovian data
in Figure 1. In the Markovian setting, the correlation between the $x_k$'s induces additional
correlation among the $\theta_k$'s; in particular, the iterate sequence $(\theta_k)_{k\ge 0}$ is no longer a Markov chain by itself. As
such, $\theta_{k+1}$ has an implicit, nonlinear dependence on $\theta_k$ through $(x_{k-1}, x_k)$. This non-linearity is the source
of the asymptotic bias.
Figure 1: Dependency graphs of LSA. Left: i.i.d. data. Right: Markovian data.
Bias and mixing time. We quantify the observations above by relating the magnitude of the asymptotic
bias to the mixing time of the underlying Markov chain $(x_k)_{k\ge 0}$ and the absolute spectral gap of the transition kernel $P$, denoted by $\gamma(P)$. We show that the leading coefficient $B^{(1)}$ in the expansion (1.2) has a norm
upper bounded by $O\bigl(\frac{1-\gamma(P)}{\gamma(P)}\bigr)$. It is well known that the mixing time of $(x_k)_{k\ge 0}$ can be tightly upper and
lower bounded by functions of $\gamma(P)$ [Pau15, LP17]. Consequently, the faster the underlying chain $(x_k)_{k\ge 0}$
mixes, the smaller the asymptotic bias is. As a special case, LSA with i.i.d. data has zero mixing time and
$1 - \gamma(P) = 0$, hence zero bias.
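These claims can be checked on a small example. The following sketch (our own construction, for a scalar iterate on a finite chain; not the paper's machinery) computes the asymptotic bias exactly by solving the stationarity equations for $u_j = \mathbb{E}[\theta_\infty \mathbf{1}\{x_\infty = j\}]$, which close under one step of the joint chain because the update is affine in $\theta$. With state-independent coefficients the mean recursion is deterministic and the bias vanishes, matching the effectively i.i.d. case; state-dependent coefficients generically produce a nonzero bias:

```python
import numpy as np

def exact_stationary_bias(P, a, b, alpha):
    """Exact asymptotic bias E[theta_inf] - theta* for the scalar LSA
    theta' = theta + alpha*(a(x)*theta + b(x)) on a finite chain with
    kernel P. Solves the stationarity equations
      u_j = sum_i P[i, j] * ((1 + alpha*a_i) * u_i + alpha * b_i * pi_i)
    for u_j = E[theta * 1{x = j}] at stationarity of the joint chain."""
    n = len(a)
    # stationary distribution pi: left Perron eigenvector of P
    w, v = np.linalg.eig(P.T)
    pi = np.real(v[:, np.argmax(np.real(w))])
    pi = pi / pi.sum()
    theta_star = -(pi @ b) / (pi @ a)
    M = P.T * (1.0 + alpha * a)       # M[j, i] = P[i, j] * (1 + alpha*a_i)
    c = P.T @ (alpha * b * pi)
    u = np.linalg.solve(np.eye(n) - M, c)
    return u.sum() - theta_star

P = np.array([[0.9, 0.1],
              [0.2, 0.8]])
# State-independent coefficients: bias is zero up to floating-point error.
print(exact_stationary_bias(P, np.array([-1.0, -1.0]), np.array([1.5, 1.5]), 0.05))
```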
Our results hold for LSA driven by a Markov chain $(x_k)_{k\ge 0}$ on a general (possibly continuous) state
space. These results immediately apply to Markovian settings of the TD algorithm with linear function
approximation and of SGD for quadratic functions. Furthermore, we provide numerical results for LSA, TD and
SGD, which corroborate our theory and demonstrate the benefits of using constant stepsizes, tail averaging,
and RR extrapolation.
Paper Organization: In Section 2, we review existing results related to our work. After setting up the
problem and assumptions in Section 3, we present our main results in Section 4. In Section 5, we provide
numerical results for LSA, TD and SGD. We outline the proofs of the main results in Section 6. The paper
is concluded in Section 7 with a discussion of future directions. The proofs of our theoretical results are
provided in the Appendix.
2 Related Work
In this section, we review existing results that are most related to our work. The literature on SA and SGD
is vast. Here we mainly discuss prior works in the non-asymptotic and constant stepsize regime, with a focus
on the Markovian noise setting.
2.1 Classical Results on SA and SGD
The study of stochastic approximation and stochastic gradient descent dates back to the seminal work
of Robbins and Monro [RM51]. Convergence results in classical works typically assume that the stepsize
sequence $(\alpha_k)_{k\ge 1}$ satisfies $\sum_{k=1}^{\infty} \alpha_k = \infty$ and $\sum_{k=1}^{\infty} \alpha_k^2 < \infty$. This assumption implies that the stepsize
sequence is diminishing. Under suitable conditions, Robbins and Monro prove that SA and SGD algorithms
asymptotically converge in the $L^2$ sense [RM51], and Blum shows that the convergence holds almost surely
[Blu54]. Subsequent works [Rup88, Pol90] propose the technique now known as Polyak-Ruppert (PR)
averaging. A Central Limit Theorem (CLT) for the asymptotic normality of the averaged iterates is established in [PJ92]. Borkar and Meyn [BM00] introduce the Ordinary Differential Equation (ODE) technique for
analyzing SA algorithms. Utilizing the ODE technique, the recent work [BCD+21] establishes a functional
CLT for SA driven by Markovian noise.
The asymptotic theory of SA and SGD is well-developed and covered in several excellent textbooks [KY03,
Bor08,BMP12,WR22]. Our work, in comparison, focuses on the setting of constant stepsizes and provides
non-asymptotic bounds.
2.2 SA and SGD with Constant Stepsizes
A growing body of recent works considers the constant stepsize setting of SA and the closely related SGD
algorithm. A majority of works in this line assume i.i.d. noise and establish finite-time results. The
work in [LS18] analyzes LSA and establishes finite-time upper and lower bounds on the MSE. The work
[MLW+20] provides refined results with the optimal dependence on problem-specific constants, as well as a
CLT for the averaged iterates with an exact characterization of the asymptotic covariance matrix. Using
new results on random matrix products, the work [DMN+21] establishes tight concentration bounds of LSA,
which are extended to LSA with iterate averaging in [DMNS22].
Closely related to our work is [DDB20], which studies constant stepsize SGD for strongly convex and
smooth functions. By connecting SGD to time-homogeneous Markov chains, they establish that the iterates
converge to a unique stationary distribution. This result is generalized to non-convex and non-smooth
functions with quadratic growth in the work [YBVE21], which further establishes asymptotic normality of
the averaged SGD iterates. Subsequent work [CMM22] studies the limit of the stationary distribution as
stepsize goes to zero. Note that these aforementioned results are established under the i.i.d. noise setting.
More recent works study constant-stepsize SA and SGD under Markovian noise. The work [SY19] provides
finite-time bounds on the MSE of LSA. The work [MPWB21] considers LSA with averaging and establishes
instance-dependent MSE upper bounds with tight dimension dependence. The papers [SY19,DMNS22]
establish bounds on higher moments of LSA iterates. Going beyond linear SA, the work [CMSS20] considers
general SA with contractive mapping and provides finite-time convergence results. The work [BJN+20]
studies SGD for linear regression problems with Markovian data and constant stepsizes. Most of these
results focus on upper bounds of the MSE and do not decouple the effect of the asymptotic bias.
A portion of our results is similar in spirit to [DDB20, Proposition 2] and [DMN+21, Theorem 3], in
that we both study LSA and SGD with constant stepsizes through the lens of Markov chain analysis. A crucial
difference is that we consider the Markovian data setting whereas they consider i.i.d. data. Arising naturally
in stochastic control and RL problems, the Markovian setting leads to non-zero asymptotic bias and new
analytical challenges, which are not present in the i.i.d. setting. In particular, our analysis involves more
delicate coupling arguments and builds on the Lyapunov function techniques from [SY19]. Along the way,
we obtain a refinement of the MSE bounds from the work [SY19]. We discuss these analytical challenges
and improvements in greater detail after stating our theorems; see Sections 4 and 6.
2.3 Applications in Reinforcement Learning and TD Learning
Many iterative algorithms in RL solve for the fixed point of Bellman equations and can be viewed as special
cases of SA [SB18,Ber19]. The TD algorithms [Sut88] with linear function approximation, including TD(0)
and more generally TD(λ), are LSA procedures. Our results can be specialized to TD learning and hence
are related to existing works in this line.
Classical results on TD Learning, similarly to those on SA, focus on asymptotic convergence under
diminishing stepsizes [Sut88,Day92,DS94,TVR97]. More recent works provide finite-time results. The
work [DSTM18] is among the first to provide MSE and concentration bounds for linear TD learning in its
original form, and their analysis assumes diminishing stepsize and i.i.d. noise. The work [BRS21] presents
finite-time analysis of TD(0) under both i.i.d. and Markovian noise, with both diminishing and constant
stepsizes. Their results require a projection step in TD(0) to ensure boundedness. The Lyapunov analysis
in [SY19] on LSA, when specialized to TD(0), removes this projection step and proves upper bounds on the
MSE. The recent works [CMSS21a, CMSS21b] use Lyapunov theory to study tabular TD and obtain
finite-sample convergence guarantees. The paper [KPR+21] provides sharp, instance-dependent error
bounds for the tabular TD algorithm with i.i.d. data.
We mention in passing that Q-learning [WD92], another standard algorithm in RL, can be viewed as a
nonlinear SA procedure with contractive mappings. Q-learning has been studied in both classical and recent
work, e.g., [Tsi94,Sze97,EDMB03] and [CMSS21b,CBD22]. Generalizing our results to nonlinear SA and
Q-learning is an interesting future direction.
3 Set-up and Assumptions
We formally set up the problem and introduce the assumptions and notations used in the sequel.
3.1 Problem Set-up
Let $(x_k)_{k\ge 0}$ be a Markov chain on a general state space $\mathcal{X}$. Consider the following linear stochastic approximation iteration:
$$\theta_{k+1}^{(\alpha)} = \theta_k^{(\alpha)} + \alpha\bigl(A(x_k)\theta_k^{(\alpha)} + b(x_k)\bigr), \qquad k = 0, 1, \ldots, \tag{3.1}$$
where $A: \mathcal{X} \to \mathbb{R}^{d\times d}$ and $b: \mathcal{X} \to \mathbb{R}^d$ are deterministic functions, and $\alpha > 0$ is a constant stepsize. In what
follows, we omit the superscript in $\theta_k^{(\alpha)}$ when the dependence on $\alpha$ is clear from the context. The initial
distribution of $\theta_0$ may depend on $x_0$ but is independent of $(x_k)_{k\ge 1}$ given $x_0$.
Let $\pi$ be the stationary distribution of the Markov chain $(x_k)$ and define the shorthands
$$\bar{A} := \mathbb{E}_\pi[A(x)] \in \mathbb{R}^{d\times d} \quad \text{and} \quad \bar{b} := \mathbb{E}_\pi[b(x)] \in \mathbb{R}^d, \tag{3.2}$$
where $\mathbb{E}_\pi[\cdot]$ denotes the expectation with respect to $x \sim \pi$. The iterative procedure (3.1) computes an
approximation of the target vector $\theta^*$, defined as the solution to the steady-state equation
$$\bar{A}\theta^* + \bar{b} = 0. \tag{3.3}$$
Our general goal is to characterize the relationship between the iterate $\theta_k$ and the target solution $\theta^*$.
The stochastic process of the LSA iterates, $(\theta_k)_{k\ge 0}$, is not a Markov chain itself: given $\theta_k$, the random
variables $\theta_{k+1}$ and $\theta_{k-1}$ are correlated through the underlying Markov process $(x_0, x_1, \ldots, x_k)$. However,
direct calculation verifies that the joint process $(x_k, \theta_k)_{k\ge 0}$ is a Markov chain on the state space $\mathcal{X} \times \mathbb{R}^d$.
This chain is time-homogeneous because the stepsize $\alpha$ is independent of $k$.
Remark 1. The LSA iteration (3.1) covers as a special case the SGD algorithm applied to minimizing a
quadratic function $f(\theta) = -\frac{1}{2}\theta^\top \bar{A}\theta - \bar{b}^\top\theta = \mathbb{E}_\pi\bigl[-\frac{1}{2}\theta^\top A(x)\theta - b(x)^\top\theta\bigr]$, where $-\bar{A}$ is the symmetric expected
Hessian matrix. Note that LSA is more general than SGD for quadratic minimization, as $A(x)$ need not be
symmetric, in which case $\theta \mapsto -(\bar{A}\theta + \bar{b})$ is not a gradient field.
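The correspondence in Remark 1 can be checked numerically: for symmetric $\bar{A}$, the gradient of $f(\theta) = -\frac{1}{2}\theta^\top\bar{A}\theta - \bar{b}^\top\theta$ is $-(\bar{A}\theta + \bar{b})$, so the SGD step $\theta - \alpha\nabla f(\theta)$ coincides with the LSA step $\theta + \alpha(\bar{A}\theta + \bar{b})$. A minimal finite-difference check, with illustrative numbers of our own:

```python
import numpy as np

A_bar = np.array([[-2.0, 0.5],
                  [0.5, -1.0]])  # symmetric and Hurwitz, so -A_bar is PD and f is convex
b_bar = np.array([1.0, -0.5])

def f(theta):
    """Quadratic objective from Remark 1: f(theta) = -1/2 th^T A th - b^T th."""
    return -0.5 * theta @ A_bar @ theta - b_bar @ theta

def numerical_grad(fun, theta, eps=1e-6):
    """Central finite differences; exact up to roundoff for a quadratic."""
    g = np.zeros_like(theta)
    for i in range(len(theta)):
        e = np.zeros_like(theta)
        e[i] = eps
        g[i] = (fun(theta + e) - fun(theta - e)) / (2 * eps)
    return g

theta = np.array([0.3, -0.7])
# grad f(theta) = -(A_bar @ theta + b_bar), so SGD's theta - alpha*grad
# equals the LSA update theta + alpha*(A_bar @ theta + b_bar).
assert np.allclose(numerical_grad(f, theta), -(A_bar @ theta + b_bar), atol=1e-5)
```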
Remark 2. One primary advantage of LSA (and SGD) is the low computational cost of the iteration (3.1),
particularly in forming the matrix-vector product $A(x_k)\theta_k$. For example, in TD(0) the matrix $A(x_k)$ is rank-one (and in addition 2-sparse in the tabular case); see Section 4.4. In comparison, the expected matrix $\bar{A}$,
as well as its running empirical estimate $\hat{A}_k := \frac{1}{k}\sum_{t=0}^{k-1} A(x_t)$, are typically dense and full-rank.
3.1.1 General State Space Markov Chains
As the underlying Markov chain $(x_k)_{k\ge 0}$ is on a general state space $\mathcal{X}$, we review some relevant concepts
and notations. We assume throughout the paper that $\mathcal{X}$ is Borel, i.e., the $\sigma$-algebra $\mathcal{B}(\mathcal{X})$ is Borel.