Local SGD in Overparameterized Linear Regression
Mike Nguyen
Technical University of Braunschweig
mike.nguyen@tu-braunschweig.de
Charly Kirst
Technical University of Braunschweig
c.kirst@tu-braunschweig.de
Nicole Mücke
Technical University of Braunschweig
nicole.muecke@tu-braunschweig.de
October 24, 2022
Abstract
We consider distributed learning using constant stepsize SGD (DSGD) over several devices,
each sending a final model update to a central server. In a final step, the local estimates
are aggregated. In the setting of overparameterized linear regression, we prove general upper
bounds with matching lower bounds and derive learning rates for specific data generating
distributions. We show that the excess risk is of the order of the variance, provided the number of
local nodes does not grow too large with the global sample size. We further compare the sample
complexity of DSGD with that of distributed ridge regression (DRR)
and show that the excess SGD-risk is smaller than the excess RR-risk, while both sample
complexities are of the same order.
1 INTRODUCTION
Deep neural networks possess powerful generalization properties in various machine learning
applications, despite being overparameterized. It is generally believed that the optimization
algorithm itself, e.g., stochastic gradient descent (SGD), implicitly regularizes such overparam-
eterized models. This regularizing effect due to the choice of the optimization algorithm is often
referred to as implicit regularization. A refined understanding of this phenomenon was recently
gained in the setting of linear regression (to be considered as a reasonable approximation of
neural network learning) for different variants of SGD. Constant stepsize SGD (with last iterate
or tail-averaging) is investigated in [JKK+18], in [DB16] in an RKHS framework, and also in
[MNR19] with additional mini-batching, see also [MR20] for a more general analysis in Hilbert
scales. In [ZWB+21b, ZWB+21a] it is shown that benign overfitting also occurs for SGD. Multi-
pass SGD is analyzed in [LCR16, JKK+16, LR17, ZWB+22] while last iterate bounds can be
found in [JNN19, WZB+22, VPVF21].
Despite the attractive statistical properties of all these SGD variants, the complexity of computing
regression estimates prevents them from being routinely used in large-scale problems. More
precisely, the time complexity and space complexity of SGD and other regularization methods
in a standard implementation scale as $O(n^{\alpha})$, $\alpha \in [2,3]$. Such scalings are prohibitive when the
sample size $n$ is large.
Distributed learning (DL) based on a divide-and-conquer approach is an effective way to analyze
large-scale data that cannot be handled by a single machine. In this paper we study a distributed
learning strategy in linear regression (including both underparameterized and overparameterized
regimes) via (tail-) averaged stochastic gradient descent with constant stepsize (DSGD). The
approach is quite simple and communication efficient: The training data is distributed across
several computing nodes where on each a local SGD is run. In a final step, these local estimates
are aggregated (a.k.a. one-shot SGD). Local SGD has become state of the art in large scale
distributed learning, showing a linear speed-up in the number of workers for convex problems,
see e.g. [MMS+09, ZWLS10, DP19, Sti18, SOP21] and references therein.
The field of DL has gained increasing attention in statistical learning theory with the aim of
deriving conditions under which minimax optimal rates of convergence can be guaranteed, see
e.g. [CX14], [MTJ11], [XSC19], [FWWZ19], [SLS18], [BFL+18], [FGW21], [BX21]. Indeed, the
learning properties of DL in regression settings over Hilbert spaces are by now well understood.
The authors in [ZDW15] analyze distributed (kernel) ridge regression and show optimal learning
rates with appropriate regularization, provided the number of machines increases sufficiently
slowly with the sample size, though under restrictive assumptions on the eigenfunctions of the
kernel integral operator. This has been alleviated in [LGZ17]. However, in these works the
number of machines saturates if the target is very smooth, meaning that large parallelization
seems not possible in this regime.
An extension of these works to more general spectral regularization algorithms for nonparamet-
ric least square regression in (reproducing kernel) Hilbert spaces is given in [GLZ17], [MB18],
including gradient descent ([LZ18]) and stochastic gradient descent ([LC18]). The recent work
[Ton21] studies DL for functional linear regression.
We finally mention the work of [MRRK22], where distributed ordinary least squares (DOLS) in
overparameterized linear regression is studied, i.e. one-shot OLS without any explicit or implicit
regularization. It is shown that the number of workers acts as a regularization parameter itself.
Contributions. We analyze the performance of DSGD with constant stepsize in overparam-
eterized linear regression and provide upper bounds with matching lower bounds for the excess
risk under suitable noise assumptions. Our results show that optimal rates of convergence can
be achieved if the number of local nodes grows sufficiently slowly with the sample size. The ex-
cess risk as a function of data splits remains constant until a certain threshold is reached. This
threshold depends on the structural assumptions imposed on the problem, i.e. on the eigenvalue
decay of the Hessian and the coefficients of the true regression parameter.
We additionally perform a comparison between DSGD and DRR, showing that the excess risk
of DSGD is upper bounded by the excess risk of DRR under an assumption on the sample
complexity (SC) of DSGD, depending on the same structural assumptions. We show that the
SC of DSGD remains within constant factors of the SC of DRR.
Our analysis extends known results in this direction from [ZWB+21b, ZWB+21a] for the single
machine case to the distributed learning setting and from DOLS in [MRRK22] to SGD with
implicit regularization.
Organization. In Section 2 we define the mathematical framework needed to present our
main results in Section 3, where we provide a theoretical analysis of DSGD with a discussion of
our results. In Section 4 we compare DSGD with DRR while Section 5 is devoted to showing
some numerical illustrations. The proofs are deferred to the Appendix.
Notation. By $\mathcal{L}(\mathcal{H}_1, \mathcal{H}_2)$ we denote the space of bounded linear operators between real Hilbert
spaces $\mathcal{H}_1, \mathcal{H}_2$. We write $\mathcal{L}(\mathcal{H}, \mathcal{H}) = \mathcal{L}(\mathcal{H})$. For $A \in \mathcal{L}(\mathcal{H})$ we denote by $A^T$ the adjoint operator.
By $A^{\dagger}$ we denote the pseudoinverse of $A$, and for $w \in \mathcal{H}$ we write $\|w\|_A^2 := \|A^{1/2} w\|^2$ for a PSD
operator $A$.
We let $[n] = \{1, \ldots, n\}$ for every $n \in \mathbb{N}$. For two positive sequences $(a_n)_n$, $(b_n)_n$ we write $a_n \lesssim b_n$
if $a_n \leq c\, b_n$ for some $c > 0$, and $a_n \simeq b_n$ if both $a_n \lesssim b_n$ and $b_n \lesssim a_n$.
2 SETUP
In this section we provide the mathematical framework for our analysis. More specifically, we
introduce distributed SGD and state the main assumptions on our model.
2.1 SGD and linear regression
We consider a linear regression model over a real separable Hilbert space $\mathcal{H}$ in random design.
More precisely, we are given a random covariate vector $x \in \mathcal{H}$ and a random output $y \in \mathbb{R}$
following the model
$$y = \langle w^*, x \rangle + \varepsilon, \qquad (2.1)$$
where $\varepsilon \in \mathbb{R}$ is a noise variable. We will impose some assumptions on the noise model in Section
3. The true regression parameter $w^* \in \mathcal{H}$ minimizes the least squares test risk, i.e.
$$L(w^*) = \min_{w \in \mathcal{H}} L(w), \qquad L(w) := \frac{1}{2}\, \mathbb{E}\big[(y - \langle w, x \rangle)^2\big],$$
where the expectation is taken with respect to the joint distribution $P$ of the pair $(x, y) \in \mathcal{H} \times \mathbb{R}$.
More specifically, we let $w^*$ be the minimum norm element in the set of all minimizers of $L$.
To derive an estimator $\hat{w} \in \mathcal{H}$ for $w^*$ we are given an i.i.d. dataset
$$D := \{(x_1, y_1), \ldots, (x_n, y_n)\} \subset \mathcal{H} \times \mathbb{R},$$
following the above model (2.1), i.e.,
$$Y = X w^* + \varepsilon,$$
with i.i.d. noise $\varepsilon = (\varepsilon_1, \ldots, \varepsilon_n) \in \mathbb{R}^n$. The corresponding random vector of outputs is denoted
as $Y = (y_1, \ldots, y_n)^T \in \mathbb{R}^n$ and we arrange the data $x_j \in \mathcal{H}$ into a data matrix $X \in \mathcal{L}(\mathcal{H}, \mathbb{R}^n)$
by setting $(Xv)_j = \langle x_j, v \rangle$ for $v \in \mathcal{H}$, $1 \leq j \leq n$. If $\mathcal{H} = \mathbb{R}^d$, then $X$ is an $n \times d$ matrix
(with row vectors $x_j$). We are particularly interested in the overparameterized regime, i.e. where
$\dim(\mathcal{H}) > n$.
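For illustration only, model (2.1) can be simulated in a finite-dimensional surrogate $\mathcal{H} = \mathbb{R}^d$ with $d > n$. The following sketch is not taken from the paper; the sample size, the polynomially decaying spectrum of $H$, the parameter w_star and the noise level sigma_noise are all hypothetical choices.

```python
import numpy as np

# A minimal sketch (hypothetical setup): simulate model (2.1) with H = R^d, d > n,
# a fixed "true" parameter w_star and i.i.d. Gaussian noise.
rng = np.random.default_rng(0)

n, d = 200, 1000                                # overparameterized regime: dim(H) = d > n
sigma_noise = 0.5                               # illustrative noise standard deviation

eigvals = 1.0 / np.arange(1, d + 1) ** 2        # illustrative spectrum of H (polynomial decay)
w_star = 1.0 / np.arange(1, d + 1)              # illustrative true parameter

X = rng.normal(size=(n, d)) * np.sqrt(eigvals)  # rows x_j with E[x x^T] = diag(eigvals)
eps = sigma_noise * rng.normal(size=n)
Y = X @ w_star + eps                            # Y = X w_star + eps
```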
In the classical setting of stochastic approximation with constant stepsize, the SGD iterates are
computed by the recursion
$$w_{t+1} = w_t - \gamma\, (\langle w_t, x_t \rangle - y_t)\, x_t, \qquad t = 1, \ldots, n,$$
with some initialization $w_1 \in \mathcal{H}$ and where $\gamma > 0$ is the stepsize. The tail average of the iterates
is denoted by
$$\bar{w}_{n/2:n} := \frac{1}{n - n/2} \sum_{t = n/2 + 1}^{n} w_t, \qquad (2.2)$$
where we denote by $\bar{w}_n := \bar{w}_{0:n}$ the full (uniform) average.
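The recursion and the tail average (2.2) translate directly into a few lines of code. The sketch below is a minimal single-pass illustration under the hypothetical setup of the previous snippet: it reuses the simulated X and Y, and the stepsize gamma is an arbitrary choice, not a tuned value from the paper.

```python
import numpy as np

def sgd_tail_average(X, Y, gamma, w_init=None):
    """Single pass of constant-stepsize SGD; returns the tail average over the last n/2 iterates."""
    n, d = X.shape
    w = np.zeros(d) if w_init is None else w_init.copy()
    tail_sum = np.zeros(d)
    for t in range(n):
        x_t, y_t = X[t], Y[t]
        w = w - gamma * (x_t @ w - y_t) * x_t   # w_{t+1} = w_t - gamma (<w_t, x_t> - y_t) x_t
        if t >= n // 2:                         # accumulate only the tail iterates
            tail_sum += w
    return tail_sum / (n - n // 2)

w_bar = sgd_tail_average(X, Y, gamma=0.01)      # X, Y from the simulation sketch; stepsize is arbitrary
```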
Various forms of SGD (with iterate averaging, tail averaging, multiple passes) in the setting
of overparameterized linear regression have been analyzed recently in [ZWB+21b], [WZB+22],
[ZWB+22], respectively. In particular, the phenomenon of benign overfitting is theoretically
investigated in these works. It was shown that benign overfitting occurs in this setting, i.e.
the SGD estimator fits the training data very well and still generalizes.
We are interested in this phenomenon for localized SGD, i.e. when our training data is dis-
tributed over several computing devices.
2.2 Local SGD
In the distributed setting, our data are evenly divided into $M \in \mathbb{N}$ local disjoint subsets
$$D = D_1 \cup \ldots \cup D_M$$
of size $|D_m| = \frac{n}{M}$, for $m = 1, \ldots, M$. To each local dataset we associate a local design matrix
$X_m \in \mathcal{L}(\mathcal{H}, \mathbb{R}^{n/M})$ (built from the local row vectors $x_j^{(m)}$), a local output vector $Y_m \in \mathbb{R}^{n/M}$, and a
local noise vector $\varepsilon_m \in \mathbb{R}^{n/M}$.
The local SGD iterates are defined as
$$w_{t+1}^{(m)} = w_t^{(m)} - \gamma\, \Big(\big\langle w_t^{(m)}, x_t^{(m)} \big\rangle - y_t^{(m)}\Big)\, x_t^{(m)},$$
for $t = 1, \ldots, \frac{n}{M}$ and $m = 1, \ldots, M$. The averaged local iterates $\bar{w}_{n/M}^{(m)}$ are computed according
to (2.2). We are finally interested in the uniform average of the local SGD iterates, building a
global estimator:
$$w_M := \frac{1}{M} \sum_{m=1}^{M} \bar{w}_{n/M}^{(m)}.$$
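A one-shot DSGD estimator can then be sketched as follows: split the data into $M$ equal shards, run the local tail-averaged SGD from the previous snippet on each shard, and average the $M$ local estimates uniformly. Again this is only an illustration; the function distributed_sgd, the number of nodes M = 4 and the stepsize are hypothetical choices.

```python
import numpy as np

def distributed_sgd(X, Y, M, gamma):
    """One-shot DSGD: local tail-averaged SGD on each of M shards, then uniform averaging."""
    n, _ = X.shape
    shard = n // M                                 # local sample size n/M (assumes M divides n)
    local_estimates = [
        sgd_tail_average(X[m * shard:(m + 1) * shard],
                         Y[m * shard:(m + 1) * shard],
                         gamma)
        for m in range(M)
    ]
    return np.mean(local_estimates, axis=0)        # w_M = (1/M) * sum of local tail averages

w_M = distributed_sgd(X, Y, M=4, gamma=0.01)       # M and gamma are illustrative choices
```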
Distributed learning in overparameterized linear regression is studied in [MRRK22] for the ordi-
nary least squares estimator (OLS), i.e. without any implicit or explicit regularization and with
local interpolation. It is shown that local overfitting is harmless and regularization is done by
the number of data splits.
We aim at finding optimal bounds for the excess risk
$$\mathbb{E}\big[L(w_M)\big] - L(w^*)$$
of distributed SGD (DSGD) with potential local overparameterization, as a function of the
number of local nodes $M$ and under various model assumptions, to be given in the next section.
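Since $w^*$ minimizes $L$, the first-order condition $\mathbb{E}[(y - \langle w^*, x \rangle)\, x] = 0$ holds, and a standard computation gives
$$L(w) - L(w^*) = \frac{1}{2}\, \mathbb{E}\big[\langle w - w^*, x \rangle^2\big] = \frac{1}{2}\, \|w - w^*\|_H^2,$$
so the excess risk of $w_M$ is half the squared $H$-norm distance of $w_M$ to $w^*$.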
3 MAIN RESULTS
In this section we present our main results. To do so, we first impose some model assumptions.
Definition 3.1.
1. We define the second moment of $x \sim P_x$ to be the operator $H : \mathcal{H} \to \mathcal{H}$, given by
$$H := \mathbb{E}[x \otimes x] = \mathbb{E}[\langle \cdot, x \rangle\, x].$$
2. The fourth moment operator $\mathcal{M} : \mathcal{L}(\mathcal{H}) \to \mathcal{L}(\mathcal{H})$ is defined by
$$\mathcal{M} := \mathbb{E}[x \otimes x \otimes x \otimes x],$$
with $\mathcal{M}(A)(w) = \mathbb{E}[\langle x, Ax \rangle \langle w, x \rangle\, x]$, for all $w \in \mathcal{H}$.
3. The covariance operator of the gradient noise at $w^*$ is defined as $\Sigma : \mathcal{H} \to \mathcal{H}$,
$$\Sigma := \mathbb{E}\big[(\langle w^*, x \rangle - y)^2\, x \otimes x\big].$$
Assumption 3.2 (Second Moment Condition). We assume that $\mathbb{E}[y^2 \,|\, x] < \infty$ almost surely.
Moreover, we assume that the trace of $H$ is finite, i.e., $\mathrm{Tr}[H] < \infty$.
Assumption 3.3 (Fourth Moment Condition). We assume there exists a positive constant
$\tau > 0$ such that for any PSD operator $A$, we have
$$\mathcal{M}(A) \preceq \tau\, \mathrm{Tr}[HA]\, H.$$
Note that this assumption holds if $H^{-1/2} x$ is sub-Gaussian, which is a standard assumption in least
squares regression, see e.g. [BLLT20], [ZWB+21b], [TB20].
Assumption 3.4 (Noise Condition). Assume that
$$\sigma^2 := \|H^{-1/2}\, \Sigma\, H^{-1/2}\| < \infty.$$
This assumption on the noise is standard in the literature about averaged SGD, see e.g. [ZWB+21b],
[DB16].
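As a concrete sanity check (tied to the hypothetical simulation of Section 2.1, not to the paper's experiments), in the homoscedastic case with noise independent of $x$ we have $\Sigma = \sigma_{\mathrm{noise}}^2 H$, so the whitened noise level in Assumption 3.4 reduces to the noise variance:

```python
import numpy as np

# Hypothetical check of Assumption 3.4 for the simulated model of Section 2.1:
# there H = diag(eigvals) and, since the noise is independent of x,
# Sigma = sigma_noise^2 * H, so the whitened noise level equals sigma_noise^2.
eigvals = 1.0 / np.arange(1, 1001) ** 2
sigma_noise = 0.5

H = np.diag(eigvals)
Sigma = sigma_noise ** 2 * H
H_inv_sqrt = np.diag(1.0 / np.sqrt(eigvals))

sigma_sq = np.linalg.norm(H_inv_sqrt @ Sigma @ H_inv_sqrt, ord=2)  # operator (spectral) norm
print(sigma_sq)  # approximately 0.25 = sigma_noise**2
```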
We introduce some further notation involving the second moment operator $H$: We denote its
eigendecomposition by
$$H = \sum_{j=1}^{\infty} \lambda_j\, v_j \otimes v_j,$$
where $\lambda_1 \geq \lambda_2 \geq \ldots$ are the eigenvalues of $H$ and the $v_j$'s are the corresponding eigenvectors.
For $k \geq 1$, we let
$$H_{0:k} := \sum_{j=1}^{k} \lambda_j\, v_j \otimes v_j, \qquad H_{k:\infty} := \sum_{j=k+1}^{\infty} \lambda_j\, v_j \otimes v_j.$$
Similarly,
$$I_{0:k} := \sum_{j=1}^{k} v_j \otimes v_j, \qquad I_{k:\infty} := \sum_{j=k+1}^{\infty} v_j \otimes v_j.$$
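When $H$ is diagonal, as in the illustrative simulation of Section 2.1, the head and tail parts can be formed explicitly. The sketch below (with an arbitrary cut-off k, a hypothetical choice) builds $H_{0:k}$ and $H_{k:\infty}$ and reports their traces.

```python
import numpy as np

# Hypothetical illustration of the head/tail decomposition for a diagonal second
# moment operator H = diag(eigvals), using an arbitrary cut-off k.
eigvals = 1.0 / np.arange(1, 1001) ** 2
k = 20

idx = np.arange(len(eigvals))
H_head = np.diag(np.where(idx < k, eigvals, 0.0))   # H_{0:k}
H_tail = np.diag(np.where(idx >= k, eigvals, 0.0))  # H_{k:infty}

print(np.trace(H_head), np.trace(H_tail))           # Tr[H_{0:k}], Tr[H_{k:infty}]
```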