Bias and Extrapolation in Markovian Linear Stochastic
Approximation with Constant Stepsizes
Dongyan (Lucy) Huo,§ Yudong Chen,† Qiaomin Xie,‡
§School of Operations Research and Information Engineering, Cornell University
†Department of Computer Sciences, University of Wisconsin-Madison
‡Department of Industrial and Systems Engineering, University of Wisconsin-Madison
Abstract
We consider Linear Stochastic Approximation (LSA) with constant stepsize and Markovian data.
Viewing the joint process of the data and LSA iterate as a time-homogeneous Markov chain, we prove
its convergence to a unique limiting and stationary distribution in Wasserstein distance and establish
non-asymptotic, geometric convergence rates. Furthermore, we show that the bias vector of this limit
admits an infinite series expansion with respect to the stepsize. Consequently, the bias is proportional
to the stepsize up to higher order terms. This result stands in contrast with LSA under i.i.d. data, for
which the bias vanishes. In the reversible chain setting, we provide a general characterization of the
relationship between the bias and the mixing time of the Markovian data, establishing that they are
roughly proportional to each other.
While Polyak-Ruppert tail-averaging reduces the variance of the LSA iterates, it does not affect the bias.
The above characterization allows us to show that the bias can be reduced using Richardson-Romberg
extrapolation with $m \ge 2$ stepsizes, which eliminates the $m - 1$ leading terms in the bias expansion.
This extrapolation scheme leads to an exponentially smaller bias and an improved mean squared error,
both theoretically and empirically. Our results immediately apply to the Temporal Difference learning
algorithm with linear function approximation, and to stochastic gradient descent applied to quadratic
functions.
1 Introduction
We consider the following Linear Stochastic Approximation (LSA) iteration driven by Markovian data:
$$\theta_{k+1} = \theta_k + \alpha\bigl(A(x_k)\theta_k + b(x_k)\bigr), \qquad k = 0, 1, 2, \ldots,$$
where $(x_k)_{k\ge 0}$ is a Markov chain representing the underlying data stream, $A$ and $b$ are deterministic functions,
and $\alpha > 0$ is a constant stepsize. LSA is an iterative data-driven procedure for approximating the solution
$\theta^*$ to the linear fixed point equation $\bar{A}\theta^* + \bar{b} = 0$, where $\bar{A} := \sum_i \pi_i A(i)$, $\bar{b} := \sum_i \pi_i b(i)$, and $\pi$ is the unique
stationary distribution of the chain $(x_k)_{k\ge 0}$.
Stochastic Approximation (SA), which uses stochastic updates to solve fixed-point equations, is a fun-
damental algorithmic paradigm in many areas including stochastic control and filtering [KY03,Bor08],
approximate dynamic programming and reinforcement learning (RL) [Ber19,SB18]. For example, the cele-
brated Temporal Difference (TD) learning algorithm [Sut88] in RL, potentially equipped with linear function
approximation, is a special case of LSA and an important algorithmic primitive in RL. Variants of the TD
algorithm such as TD(λ) and Gradient TD can also be cast as LSA [LS18]. Another important special case
of SA is stochastic gradient descent (SGD). When applied to a quadratic objective function, SGD can be
viewed as an LSA procedure.
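To make the iteration concrete, here is a minimal sketch of LSA driven by a small Markov chain. The two-state kernel `P` and the maps `A(·)`, `b(·)` below are illustrative assumptions of ours (chosen so that the averaged matrix $\bar{A}$ is Hurwitz), not taken from the paper:

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical two-state Markov chain with transition kernel P.
P = np.array([[0.9, 0.1],
              [0.2, 0.8]])

# Deterministic maps A(x), b(x) per state x; values chosen so the
# averaged matrix A_bar is Hurwitz, making the iteration stable.
A = {0: np.array([[-1.0, 0.2], [0.0, -0.5]]),
     1: np.array([[-0.8, 0.0], [0.1, -1.2]])}
b = {0: np.array([1.0, 0.0]),
     1: np.array([0.0, 1.0])}

def lsa(alpha, n_steps, x0=0, theta0=None):
    """Run theta_{k+1} = theta_k + alpha*(A(x_k) theta_k + b(x_k))."""
    theta = np.zeros(2) if theta0 is None else theta0.copy()
    x = x0
    for _ in range(n_steps):
        theta = theta + alpha * (A[x] @ theta + b[x])
        x = rng.choice(2, p=P[x])  # advance the underlying Markov chain
    return theta
```

With a small constant stepsize, a long run lands in a neighborhood of the target $\theta^* = -\bar{A}^{-1}\bar{b}$, with fluctuations and a stepsize-dependent bias as characterized below.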
Classical work on SA focuses on settings with diminishing stepsizes, which allow for asymptotic convergence of $\theta_k$ to $\theta^*$ [RM51, Blu54, BM00]. More recently, SA with constant stepsizes has attracted
Emails: dh622@cornell.edu, yudong.chen@wisc.edu, qiaomin.xie@wisc.edu
arXiv:2210.00953v3 [stat.ML] 21 Aug 2023
attention due to its simplicity, fast convergence, and good empirical performance. A growing line of
work has been devoted to this setting and established non-asymptotic results valid for finite values of $k$
[LS18, SY19, CMSS21b, BRS21]. These results provide upper bounds on the mean-squared error (MSE)
$\mathbb{E}\|\theta_k - \theta^*\|^2$; such bounds typically consist of the sum of two terms: a finite-time term¹ that decays with
$k$, and a steady-state MSE bound that is independent of $k$.
In this work, we study LSA with constant stepsizes through the lens of Markov chain theory. This
perspective allows us to delineate the convergence and distributional properties of LSA viewed as a stochastic
process. Consequently, we provide a more precise characterization of the MSE in terms of the decomposition
$$\mathbb{E}\|\theta_k - \theta^*\|^2 \;\asymp\; \underbrace{\|\mathbb{E}\theta_k - \mathbb{E}\theta_\infty^{(\alpha)}\|^2}_{\text{optimization error}} \;+\; \underbrace{\operatorname{Var}(\theta_k)}_{\text{variance}} \;+\; \underbrace{\|\mathbb{E}\theta_\infty^{(\alpha)} - \theta^*\|^2}_{(\text{asymptotic bias})^2},$$
where the random variable $\theta_\infty^{(\alpha)}$ denotes the limit (as $k \to \infty$) of the LSA iterate $\theta_k$ with stepsize $\alpha$. Our
main results, summarized below, characterize the behavior of the above three terms.
Convergence and optimization error. With a constant stepsize $\alpha$, the process $(x_k, \theta_k)_{k\ge 0}$ is a time-homogeneous Markov chain. Under appropriate conditions, we show that the sequence $(x_k, \theta_k)$ converges
to a unique limiting random variable $(x_\infty, \theta_\infty^{(\alpha)})$ in distribution and in $W_2$, the Wasserstein distance of order
2. Moreover, the distribution of $(x_\infty, \theta_\infty^{(\alpha)})$ is the unique stationary distribution of the chain $(x_k, \theta_k)_{k\ge 0}$.
We provide non-asymptotic bounds on the distributional distance between $\theta_k$ and $\theta_\infty^{(\alpha)}$ in $W_2$, which in turn
upper bounds the optimization error $\|\mathbb{E}\theta_k - \mathbb{E}\theta_\infty^{(\alpha)}\|$. Both bounds decay exponentially in $k$ thanks to the use
of a constant stepsize. We emphasize that the existence of the limit $\theta_\infty^{(\alpha)}$ and the convergence rate cannot
be deduced from existing upper bounds on the MSE $\mathbb{E}\|\theta_k - \theta^*\|^2$, which does not vanish as $k \to \infty$.
Variance and asymptotic bias. The variance $\operatorname{Var}(\theta_k)$ is of order $O(1)$ as $k$ grows. The variance can
be made vanishing by averaging the LSA iterates. For example, the Polyak-Ruppert tail-averaged iterate
$\bar{\theta}_k := \frac{1}{k/2}\sum_{t=k/2}^{k-1} \theta_t$ has variance of order $O(1/k)$. Consequently, for large $k$, the MSE of $\bar{\theta}_k$ is dominated by
the asymptotic bias, i.e., $\mathbb{E}\|\bar{\theta}_k - \theta^*\|^2 \approx \|\mathbb{E}\bar{\theta}_\infty - \theta^*\|^2 = \|\mathbb{E}\theta_\infty^{(\alpha)} - \theta^*\|^2$. Our second main result establishes
that the asymptotic bias is proportional to the stepsize $\alpha$ (up to a second order term):
$$\mathbb{E}\theta_\infty^{(\alpha)} - \theta^* = \alpha B^{(1)} + O(\alpha^2), \tag{1.1}$$
where $B^{(1)}$ is a vector independent of $\alpha$ that admits an explicit expression in terms of $A$, $b$ and the transition
kernel $P$ of the underlying Markov chain $(x_k)_{k\ge 0}$. Crucially, the first order term in equation (1.1) is an
equality rather than an upper bound. This equality implies that the asymptotic bias is not affected by
averaging the LSA iterates.
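As a concrete illustration of the tail-averaging step (a minimal sketch; the helper name is ours), the Polyak-Ruppert tail average simply discards the first half of a trajectory and averages the rest:

```python
import numpy as np

def tail_average(iterates):
    """Polyak-Ruppert tail average: bar_theta_k = (1/(k/2)) * sum of the last
    half of the iterates. Discarding the first half removes most of the
    transient (optimization error); averaging shrinks the variance to O(1/k),
    but, as discussed above, leaves the asymptotic bias untouched."""
    iterates = np.asarray(iterates)
    k = len(iterates)
    return iterates[k // 2:].mean(axis=0)
```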
Bias expansion and extrapolation. While the asymptotic bias persists under Polyak-Ruppert averaging,
the equality (1.1) implies that the bias can be reduced using a simple technique called Richardson-Romberg (RR)
extrapolation: run LSA with two stepsizes $\alpha$ and $2\alpha$, compute the respective averaged iterates $\bar{\theta}_k^{(\alpha)}$ and $\bar{\theta}_k^{(2\alpha)}$,
and output their linear combination $\widetilde{\theta}_k^{(\alpha)} := 2\bar{\theta}_k^{(\alpha)} - \bar{\theta}_k^{(2\alpha)}$. Doing so cancels out the leading term in the bias
characterization (1.1), resulting in an order-wise smaller bias $\mathbb{E}\widetilde{\theta}_\infty^{(\alpha)} - \theta^* = O(\alpha^2)$.
In fact, the bias characterization (1.1) can be generalized to higher orders. We establish that the bias
admits the following infinite series expansion:
$$\mathbb{E}\theta_\infty^{(\alpha)} - \theta^* = \alpha B^{(1)} + \alpha^2 B^{(2)} + \alpha^3 B^{(3)} + \cdots, \tag{1.2}$$
where the vectors $\{B^{(i)}\}$ are independent of $\alpha$. Consequently, RR extrapolation can be executed with $m \ge 2$
stepsizes to eliminate the $m - 1$ leading terms in equation (1.2), reducing the bias to the order $O(\alpha^m)$.
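A sketch of how the $m$-stepsize extrapolation weights can be obtained (the function name is ours): given estimates at stepsizes $\alpha, 2\alpha, \ldots, m\alpha$, weights $c_1, \ldots, c_m$ that cancel the first $m-1$ terms of an expansion like (1.2) solve a small Vandermonde system, $\sum_j c_j = 1$ and $\sum_j c_j\, j^i = 0$ for $i = 1, \ldots, m-1$:

```python
import numpy as np

def rr_coefficients(m):
    """Richardson-Romberg weights c_1..c_m for estimates at stepsizes
    alpha, 2*alpha, ..., m*alpha: the combination sum_j c_j * est(j*alpha)
    cancels the alpha^1..alpha^{m-1} bias terms. Rows of V are the moments
    j^0, j^1, ..., j^{m-1}; we solve V c = (1, 0, ..., 0)."""
    V = np.vander(np.arange(1, m + 1), m, increasing=True).T
    e = np.zeros(m)
    e[0] = 1.0
    return np.linalg.solve(V, e)
```

For $m = 2$ this recovers the familiar combination $2\bar{\theta}^{(\alpha)} - \bar{\theta}^{(2\alpha)}$, i.e., weights $(2, -1)$; for $m = 3$ it gives $(3, -3, 1)$.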
¹This term is sometimes called “(finite-time) bias” in prior work. It should not be confused with the asymptotic bias discussed below.
Put together, the above results show that the combination of constant stepsize, averaging, and extrapolation allows one to approach the best of three worlds: (a) using a constant stepsize leads to fast,
exponential-in-$k$ convergence of the optimization error, (b) tail-averaging eliminates the variance at an (optimal) $1/k$ rate, and (c) RR extrapolation order-wise reduces the asymptotic bias. We highlight that the
$m$ iterate sequences used in RR extrapolation can be computed in parallel, using the same data stream
$(x_k)_{k\ge 0}$. Compared with standard LSA, the above combined procedure is data efficient (in terms of the
sample complexity $k$ for achieving a given MSE), does not require sophisticated tuning of the stepsize, and
incurs a minimal increase in the computational cost.
The results above should be contrasted with the setting of LSA with i.i.d. data, where the $x_k$'s are
sampled independently from some distribution $\pi$. In this setting, it has been shown (sometimes implicitly)
in existing work that the asymptotic bias is zero [LS18, MLW+20]. Similar results are known for SGD
applied to a quadratic function with gradients contaminated by independent noise [BM13, DDB20]. It is
perhaps surprising that using Markovian data leads to a non-zero bias, even when the LSA iteration is linear
in $\theta_k$. To provide intuition, we plot the dependency graphs for LSA with i.i.d. data and Markovian data
in Figure 1. In the Markovian setting, the correlation between the $x_k$'s induces additional
correlation among the $\theta_k$'s; in particular, the iterate sequence $(\theta_k)_{k\ge 0}$ is no longer a Markov chain by itself. As
such, $\theta_{k+1}$ has an implicit, nonlinear dependence on $\theta_k$ through $(x_{k-1}, x_k)$. This non-linearity is the source
of the asymptotic bias.
Figure 1: Dependency graphs of LSA. Left: i.i.d. data. Right: Markovian data.
Bias and mixing time. We quantify the observations above by relating the magnitude of the asymptotic
bias to the mixing time of the underlying Markov chain $(x_k)_{k\ge 0}$ and the absolute spectral gap of the transition kernel $P$, denoted by $\gamma(P)$. We show that the leading coefficient $B^{(1)}$ in the expansion (1.2) has a norm
upper bounded by $O\bigl(\frac{1-\gamma(P)}{\gamma(P)}\bigr)$. It is well known that the mixing time of $(x_k)_{k\ge 0}$ can be tightly upper and
lower bounded by functions of $\gamma(P)$ [Pau15, LP17]. Consequently, the faster the underlying chain $(x_k)_{k\ge 0}$
mixes, the smaller the asymptotic bias is. As a special case, LSA with i.i.d. data has zero mixing time and
$1 - \gamma(P) = 0$, hence zero bias.
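These claims can be checked on a small example. The following sketch (our own construction, for a scalar iterate on a finite chain; not the paper's machinery) computes the asymptotic bias exactly by solving the stationarity equations for $u_j = \mathbb{E}[\theta_\infty \mathbf{1}\{x_\infty = j\}]$, which close under one step of the joint chain because the update is affine in $\theta$. With state-independent coefficients the mean recursion is deterministic and the bias vanishes, matching the effectively i.i.d. case; state-dependent coefficients generically produce a nonzero bias:

```python
import numpy as np

def exact_stationary_bias(P, a, b, alpha):
    """Exact asymptotic bias E[theta_inf] - theta* for the scalar LSA
    theta' = theta + alpha*(a(x)*theta + b(x)) on a finite chain with
    kernel P. Solves the stationarity equations
      u_j = sum_i P[i, j] * ((1 + alpha*a_i) * u_i + alpha * b_i * pi_i)
    for u_j = E[theta * 1{x = j}] at stationarity of the joint chain."""
    n = len(a)
    # stationary distribution pi: left Perron eigenvector of P
    w, v = np.linalg.eig(P.T)
    pi = np.real(v[:, np.argmax(np.real(w))])
    pi = pi / pi.sum()
    theta_star = -(pi @ b) / (pi @ a)
    M = P.T * (1.0 + alpha * a)       # M[j, i] = P[i, j] * (1 + alpha*a_i)
    c = P.T @ (alpha * b * pi)
    u = np.linalg.solve(np.eye(n) - M, c)
    return u.sum() - theta_star

P = np.array([[0.9, 0.1],
              [0.2, 0.8]])
# State-independent coefficients: bias is zero up to floating-point error.
print(exact_stationary_bias(P, np.array([-1.0, -1.0]), np.array([1.5, 1.5]), 0.05))
```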
Our results hold for LSA driven by a Markov chain $(x_k)_{k\ge 0}$ on a general (possibly continuous) state
space. These results immediately apply to Markovian settings of the TD algorithm with linear function
approximation and of SGD for quadratic functions. Furthermore, we provide numerical results for LSA, TD and
SGD, which corroborate our theory and demonstrate the benefits of using constant stepsizes, tail averaging,
and RR extrapolation.
Paper Organization: In Section 2, we review existing results related to our work. After setting up the
problem and assumptions in Section 3, we present our main results in Section 4. In Section 5, we provide
numerical results for LSA, TD and SGD. We outline the proofs of the main results in Section 6. The paper
is concluded in Section 7 with a discussion of future directions. The proofs of our theoretical results are
provided in the Appendix.
2 Related Work
In this section, we review existing results that are most related to our work. The literature on SA and SGD
is vast. Here we mainly discuss prior works in the non-asymptotic and constant stepsize regime, with a focus
on the Markovian noise setting.
2.1 Classical Results on SA and SGD
The study of stochastic approximation and stochastic gradient descent dates back to the seminal work
of Robbins and Monro [RM51]. Convergence results in classical works typically assume that the stepsize
sequence $(\alpha_k)_{k\ge 1}$ satisfies $\sum_{k=1}^{\infty} \alpha_k = \infty$ and $\sum_{k=1}^{\infty} \alpha_k^2 < \infty$. This assumption implies that the stepsize
sequence is diminishing. Under suitable conditions, Robbins and Monro prove that SA and SGD algorithms
asymptotically converge in the $L^2$ sense [RM51], and Blum shows that the convergence holds almost surely
[Blu54]. Subsequent works [Rup88, Pol90] propose the technique now known as Polyak-Ruppert (PR)
averaging. A Central Limit Theorem (CLT) for the asymptotic normality of the averaged iterates is established in [PJ92]. Borkar and Meyn [BM00] introduce the Ordinary Differential Equation (ODE) technique for
analyzing SA algorithms. Utilizing the ODE technique, the recent work [BCD+21] establishes a functional
CLT for SA driven by Markovian noise.
The asymptotic theory of SA and SGD is well-developed and covered in several excellent textbooks [KY03,
Bor08,BMP12,WR22]. Our work, in comparison, focuses on the setting of constant stepsizes and provides
non-asymptotic bounds.
2.2 SA and SGD with Constant Stepsizes
A growing body of recent works considers the constant stepsize setting of SA and the closely related SGD
algorithm. A majority of works in this line assume i.i.d. noise and establish finite-time results. The
work in [LS18] analyzes LSA and establishes finite-time upper and lower bounds on the MSE. The work
[MLW+20] provides refined results with the optimal dependence on problem-specific constants, as well as a
CLT for the averaged iterates with an exact characterization of the asymptotic covariance matrix. Using
new results on random matrix products, the work [DMN+21] establishes tight concentration bounds of LSA,
which are extended to LSA with iterate averaging in [DMNS22].
Closely related to our work is [DDB20], which studies constant stepsize SGD for strongly convex and
smooth functions. By connecting SGD to time-homogeneous Markov chains, they establish that the iterates
converge to a unique stationary distribution. This result is generalized to non-convex and non-smooth
functions with quadratic growth in the work [YBVE21], which further establishes asymptotic normality of
the averaged SGD iterates. Subsequent work [CMM22] studies the limit of the stationary distribution as
stepsize goes to zero. Note that these aforementioned results are established under the i.i.d. noise setting.
More recent works study constant-stepsize SA and SGD under Markovian noise. The work [SY19] provides
finite-time bounds on the MSE of LSA. The work [MPWB21] considers LSA with averaging and establishes
instance-dependent MSE upper bounds with tight dimension dependence. The papers [SY19,DMNS22]
establish bounds on higher moments of LSA iterates. Going beyond linear SA, the work [CMSS20] considers
general SA with contractive mapping and provides finite-time convergence results. The work [BJN+20]
studies SGD for linear regression problems with Markovian data and constant stepsizes. Most of these
results focus on upper bounds of the MSE and do not decouple the effect of the asymptotic bias.
A portion of our results is similar in spirit to [DDB20, Proposition 2] and [DMN+21, Theorem 3], in
that we both study LSA and SGD with constant stepsizes through the lens of Markov chain analysis. A crucial
difference is that we consider the Markovian data setting whereas they consider i.i.d. data. Arising naturally
in stochastic control and RL problems, the Markovian setting leads to non-zero asymptotic bias and new
analytical challenges, which are not present in the i.i.d. setting. In particular, our analysis involves more
delicate coupling arguments and builds on the Lyapunov function techniques from [SY19]. Along the way,
we obtain a refinement of the MSE bounds from the work [SY19]. We discuss these analytical challenges
and improvements in greater detail after stating our theorems; see Sections 4 and 6.
2.3 Applications in Reinforcement Learning and TD Learning
Many iterative algorithms in RL solve for the fixed point of Bellman equations and can be viewed as special
cases of SA [SB18,Ber19]. The TD algorithms [Sut88] with linear function approximation, including TD(0)
and more generally TD(λ), are LSA procedures. Our results can be specialized to TD learning and hence
are related to existing works in this line.
Classical results on TD Learning, similarly to those on SA, focus on asymptotic convergence under
diminishing stepsizes [Sut88,Day92,DS94,TVR97]. More recent works provide finite-time results. The
work [DSTM18] is among the first to provide MSE and concentration bounds for linear TD learning in its
original form, and their analysis assumes diminishing stepsize and i.i.d. noise. The work [BRS21] presents
finite-time analysis of TD(0) under both i.i.d. and Markovian noise, with both diminishing and constant
stepsizes. Their results require a projection step in TD(0) to ensure boundedness. The Lyapunov analysis
in [SY19] on LSA, when specialized to TD(0), removes this projection step and proves upper bounds on the
MSE. The recent works [CMSS21a, CMSS21b] use Lyapunov theory to study tabular TD and obtain
finite-sample convergence guarantees. The paper [KPR+21] provides sharp, instance-dependent error
bounds for the tabular TD algorithm with i.i.d. data.
We mention in passing that Q-learning [WD92], another standard algorithm in RL, can be viewed as a
nonlinear SA procedure with contractive mappings. Q-learning has been studied in both classical and recent
work, e.g., [Tsi94,Sze97,EDMB03] and [CMSS21b,CBD22]. Generalizing our results to nonlinear SA and
Q-learning is an interesting future direction.
3 Set-up and Assumptions
We formally set up the problem and introduce the assumptions and notations used in the sequel.
3.1 Problem Set-up
Let $(x_k)_{k\ge 0}$ be a Markov chain on a general state space $\mathcal{X}$. Consider the following linear stochastic approximation iteration:
$$\theta_{k+1}^{(\alpha)} = \theta_k^{(\alpha)} + \alpha\bigl(A(x_k)\theta_k^{(\alpha)} + b(x_k)\bigr), \qquad k = 0, 1, \ldots, \tag{3.1}$$
where $A: \mathcal{X} \to \mathbb{R}^{d\times d}$ and $b: \mathcal{X} \to \mathbb{R}^d$ are deterministic functions, and $\alpha > 0$ is a constant stepsize. In what
follows, we omit the superscript in $\theta_k^{(\alpha)}$ when the dependence on $\alpha$ is clear from the context. The initial
distribution of $\theta_0$ may depend on $x_0$ but is independent of $(x_k)_{k\ge 1}$ given $x_0$.
Let $\pi$ be the stationary distribution of the Markov chain $(x_k)$ and define the shorthands
$$\bar{A} := \mathbb{E}_\pi[A(x)] \in \mathbb{R}^{d\times d} \quad \text{and} \quad \bar{b} := \mathbb{E}_\pi[b(x)] \in \mathbb{R}^d, \tag{3.2}$$
where $\mathbb{E}_\pi[\cdot]$ denotes the expectation with respect to $x \sim \pi$. The iterative procedure (3.1) computes an
approximation of the target vector $\theta^*$, defined as the solution to the steady-state equation
$$\bar{A}\theta^* + \bar{b} = 0. \tag{3.3}$$
Our general goal is to characterize the relationship between the iterate $\theta_k$ and the target solution $\theta^*$.
The stochastic process of the LSA iterates, $(\theta_k)_{k\ge 0}$, is not a Markov chain itself: given $\theta_k$, the random
variables $\theta_{k+1}$ and $\theta_{k-1}$ are correlated through the underlying Markov process $(x_0, x_1, \ldots, x_k)$. However,
direct calculation verifies that the joint process $(x_k, \theta_k)_{k\ge 0}$ is a Markov chain on the state space $\mathcal{X} \times \mathbb{R}^d$.
This chain is time-homogeneous because the stepsize $\alpha$ is independent of $k$.
Remark 1. The LSA iteration (3.1) covers as a special case the SGD algorithm applied to minimizing a
quadratic function $f(\theta) = -\frac{1}{2}\theta^\top \bar{A}\theta - \bar{b}^\top\theta = \mathbb{E}_\pi\bigl[-\frac{1}{2}\theta^\top A(x)\theta - b(x)^\top\theta\bigr]$, where $-\bar{A}$ is the symmetric expected
Hessian matrix. Note that LSA is more general than SGD for quadratic minimization, as $A(x)$ need not be
symmetric, in which case $\theta \mapsto -(\bar{A}\theta + \bar{b})$ is not a gradient field.
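The correspondence in Remark 1 can be checked numerically: for symmetric $\bar{A}$, the gradient of $f(\theta) = -\frac{1}{2}\theta^\top\bar{A}\theta - \bar{b}^\top\theta$ is $-(\bar{A}\theta + \bar{b})$, so the SGD step $\theta - \alpha\nabla f(\theta)$ coincides with the LSA step $\theta + \alpha(\bar{A}\theta + \bar{b})$. A minimal finite-difference check, with illustrative numbers of our own:

```python
import numpy as np

A_bar = np.array([[-2.0, 0.5],
                  [0.5, -1.0]])  # symmetric and Hurwitz, so -A_bar is PD and f is convex
b_bar = np.array([1.0, -0.5])

def f(theta):
    """Quadratic objective from Remark 1: f(theta) = -1/2 th^T A th - b^T th."""
    return -0.5 * theta @ A_bar @ theta - b_bar @ theta

def numerical_grad(fun, theta, eps=1e-6):
    """Central finite differences; exact up to roundoff for a quadratic."""
    g = np.zeros_like(theta)
    for i in range(len(theta)):
        e = np.zeros_like(theta)
        e[i] = eps
        g[i] = (fun(theta + e) - fun(theta - e)) / (2 * eps)
    return g

theta = np.array([0.3, -0.7])
# grad f(theta) = -(A_bar @ theta + b_bar), so SGD's theta - alpha*grad
# equals the LSA update theta + alpha*(A_bar @ theta + b_bar).
assert np.allclose(numerical_grad(f, theta), -(A_bar @ theta + b_bar), atol=1e-5)
```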
Remark 2. One primary advantage of LSA (and SGD) is the low computational cost of the iteration (3.1),
particularly in forming the matrix-vector product $A(x_k)\theta_k$. For example, in TD(0) the matrix $A(x_k)$ is rank-one (and in addition 2-sparse in the tabular case); see Section 4.4. In comparison, the expected matrix $\bar{A}$,
as well as its running empirical estimate $\hat{A}_k := \frac{1}{k}\sum_{t=0}^{k-1} A(x_t)$, are typically dense and full-rank.
3.1.1 General State Space Markov Chains
As the underlying Markov chain $(x_k)_{k\ge 0}$ is on a general state space $\mathcal{X}$, we review some relevant concepts
and notations. We assume throughout the paper that $\mathcal{X}$ is Borel, i.e., the $\sigma$-algebra $\mathcal{B}(\mathcal{X})$ is Borel.