2.1 Classical Results on SA and SGD
The study of stochastic approximation and stochastic gradient descent dates back to the seminal work
of Robbins and Monro [RM51]. Convergence results in classical works typically assume that the stepsize
sequence $(\alpha_k)_{k \geq 1}$ satisfies $\sum_{k=1}^{\infty} \alpha_k = \infty$ and $\sum_{k=1}^{\infty} \alpha_k^2 < \infty$. This assumption implies that the stepsize
sequence is diminishing. Under suitable conditions, Robbins and Monro prove that SA and SGD algorithms
asymptotically converge in the $L^2$ sense [RM51], and Blum shows that the convergence holds almost surely
[Blu54]. Subsequent works [Rup88, Pol90] propose the technique now known as Polyak-Ruppert (PR)
averaging. A Central Limit Theorem (CLT) for the asymptotic normality of the averaged iterates is
established in [PJ92]. Borkar and Meyn [BM00] introduce the Ordinary Differential Equation (ODE) technique for
analyzing SA algorithms. Utilizing the ODE technique, the recent work [BCD+21] establishes a functional
CLT for SA driven by Markovian noise.
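To make the classical recipe concrete, the following minimal sketch runs SGD with stepsizes $\alpha_k = 1/k$, which satisfy the Robbins-Monro conditions above, together with PR averaging; the quadratic objective and the Gaussian noise model are illustrative assumptions on our part, not taken from the cited works.

```python
import numpy as np

rng = np.random.default_rng(0)
theta, theta_bar = 5.0, 0.0
for k in range(1, 10_001):
    alpha_k = 1.0 / k                     # diminishing stepsizes: sum alpha_k = inf, sum alpha_k^2 < inf
    grad = theta + rng.normal()           # noisy gradient of f(x) = x^2 / 2 with i.i.d. Gaussian noise
    theta -= alpha_k * grad               # Robbins-Monro / SGD update
    theta_bar += (theta - theta_bar) / k  # running Polyak-Ruppert average of the iterates
print(theta, theta_bar)                   # both approach the minimizer 0; the average is less noisy
```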
The asymptotic theory of SA and SGD is well-developed and covered in several excellent textbooks [KY03,
Bor08, BMP12, WR22]. Our work, in comparison, focuses on the setting of constant stepsizes and provides
non-asymptotic bounds.
2.2 SA and SGD with Constant Stepsizes
A growing body of recent works considers the constant stepsize setting of SA and the closely related SGD
algorithm. A majority of works in this line assume i.i.d. noise and provide a number of finite-time results. The
work [LS18] analyzes LSA and establishes finite-time upper and lower bounds on the MSE. The work
[MLW+20] provides refined results with the optimal dependence on problem-specific constants, as well as a
CLT for the averaged iterates with an exact characterization of the asymptotic covariance matrix. Using
new results on random matrix products, the work [DMN+21] establishes tight concentration bounds for LSA,
which are extended to LSA with iterate averaging in [DMNS22].
Closely related to our work is [DDB20], which studies constant stepsize SGD for strongly convex and
smooth functions. By connecting SGD to time-homogeneous Markov chains, they establish that the iterates
converge to a unique stationary distribution. This result is generalized to non-convex and non-smooth
functions with quadratic growth in the work [YBVE21], which further establishes asymptotic normality of
the averaged SGD iterates. Subsequent work [CMM22] studies the limit of the stationary distribution as
the stepsize goes to zero. Note that the aforementioned results are all established under the i.i.d. noise setting.
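The Markov-chain viewpoint underlying these results is easy to visualize numerically; below is a minimal sketch (with a toy quadratic objective and Gaussian noise of our own choosing) in which constant-stepsize SGD does not converge but instead settles into a stationary distribution around the minimizer.

```python
import numpy as np

rng = np.random.default_rng(0)
alpha, theta, tail = 0.1, 5.0, []   # constant stepsize
for k in range(20_000):
    grad = theta + rng.normal()     # noisy gradient of f(x) = x^2 / 2, i.i.d. noise
    theta -= alpha * grad           # the iterates form a time-homogeneous Markov chain
    if k >= 10_000:                 # discard burn-in, sample the stationary phase
        tail.append(theta)
print(np.mean(tail), np.std(tail))  # mean ~ 0: no asymptotic bias under i.i.d. noise;
                                    # std ~ sqrt(alpha / (2 - alpha)) ~ 0.23, shrinking with alpha
```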
More recent works study constant-stepsize SA and SGD under Markovian noise. The work [SY19] provides
finite-time bounds on the MSE of LSA. The work [MPWB21] considers LSA with averaging and establishes
instance-dependent MSE upper bounds with tight dimension dependence. The papers [SY19, DMNS22]
establish bounds on higher moments of LSA iterates. Going beyond linear SA, the work [CMSS20] considers
general SA with contractive mapping and provides finite-time convergence results. The work [BJN+20]
studies SGD for linear regression problems with Markovian data and constant stepsizes. Most of these
results focus on upper bounds on the MSE and do not decouple the effect of the asymptotic bias.
A portion of our results is similar in spirit to [DDB20, Proposition 2] and [DMN+21, Theorem 3], in
that we both study LSA and SGD with constant stepsizes through the lens of Markov chain analysis. A crucial
difference is that we consider the Markovian data setting whereas they consider i.i.d. data. Arising naturally
in stochastic control and RL problems, the Markovian setting leads to non-zero asymptotic bias and new
analytical challenges, which are not present in the i.i.d. setting. In particular, our analysis involves more
delicate coupling arguments and builds on the Lyapunov function techniques from [SY19]. Along the way,
we obtain a refinement of the MSE bounds from the work [SY19]. We discuss these analytical challenges
and improvements in greater detail after stating our theorems; see Sections 4 and 6.
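A toy example of the asymptotic bias under Markovian data (our own construction, for illustration only): drive a scalar linear SA recursion with a sticky two-state Markov chain, so that the iterate becomes correlated with the current state and the stationary mean shifts away from the target $\theta^* = \bar{b}/\bar{A}$.

```python
import numpy as np

rng = np.random.default_rng(0)
alpha, p_stay = 0.1, 0.9                    # constant stepsize; sticky two-state chain
a, b = {0: 0.5, 1: 1.5}, {0: 0.0, 1: 1.0}   # stationary averages: A_bar = 1, b_bar = 0.5,
x, theta, tail = 0, 0.0, []                 # so the target is theta* = b_bar / A_bar = 0.5
for k in range(200_000):
    theta += alpha * (b[x] - a[x] * theta)  # linear SA step driven by the current state x_k
    if k >= 100_000:                        # discard burn-in, sample the stationary phase
        tail.append(theta)
    if rng.random() > p_stay:               # Markovian transition: consecutive states are correlated
        x = 1 - x
print(np.mean(tail) - 0.5)                  # nonzero asymptotic bias; it vanishes as alpha -> 0
```

Setting p_stay = 0.5 makes the states i.i.d., in which case the stationary mean is exactly $\theta^*$ and the bias disappears, matching the contrast drawn above.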
2.3 Applications in Reinforcement Learning and TD Learning
Many iterative algorithms in RL solve for the fixed point of Bellman equations and can be viewed as special
cases of SA [SB18, Ber19]. The TD algorithms [Sut88] with linear function approximation, including TD(0)
and more generally TD(λ), are LSA procedures. Our results can be specialized to TD learning and hence
are related to existing works in this line.
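For instance, with features $\phi$, discount factor $\gamma$, and an observed transition $(s_k, r_k, s_{k+1})$, the TD(0) update with linear function approximation reads (in standard notation; the display is ours)
\[
\theta_{k+1} = \theta_k + \alpha_k \bigl( r_k + \gamma\, \phi(s_{k+1})^\top \theta_k - \phi(s_k)^\top \theta_k \bigr) \phi(s_k)
            = \theta_k + \alpha_k \bigl( b(x_k) - A(x_k)\, \theta_k \bigr),
\]
where $x_k = (s_k, r_k, s_{k+1})$, $A(x_k) = \phi(s_k)\bigl(\phi(s_k) - \gamma\, \phi(s_{k+1})\bigr)^\top$, and $b(x_k) = r_k\, \phi(s_k)$; this is exactly an LSA recursion driven by the Markov chain of transitions.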
Classical results on TD learning, like those on SA, focus on asymptotic convergence under
diminishing stepsizes [Sut88, Day92, DS94, TVR97]. More recent works provide finite-time results. The