Local SGD in Overparameterized Linear Regression
Mike Nguyen
Technical University of Braunschweig
mike.nguyen@tu-braunschweig.de
Charly Kirst
Technical University of Braunschweig
c.kirst@tu-braunschweig.de
Nicole Mücke
Technical University of Braunschweig
nicole.muecke@tu-braunschweig.de
October 24, 2022
Abstract
We consider distributed learning using constant stepsize SGD (DSGD) over several devices,
each sending a final model update to a central server. In a final step, the local estimates
are aggregated. In the setting of overparameterized linear regression, we prove general upper
bounds with matching lower bounds and derive learning rates for specific data generating
distributions. We show that the excess risk is of the order of the variance, provided the number of
local nodes does not grow too large with the global sample size. We further compare the sample
complexity of DSGD with that of distributed ridge regression (DRR)
and show that the excess SGD-risk is smaller than the excess RR-risk, while both sample
complexities are of the same order.
1 INTRODUCTION
Deep neural networks possess powerful generalization properties in various machine learning
applications, despite being overparameterized. It is generally believed that the optimization
algorithm itself, e.g., stochastic gradient descent (SGD), implicitly regularizes such overparam-
eterized models. This regularizing effect due to the choice of the optimization algorithm is often
referred to as implicit regularization. A refined understanding of this phenomenon was recently
gained in the setting of linear regression (to be considered as a reasonable approximation of
neural network learning) for different variants of SGD. Constant stepsize SGD (with last iterate
or tail-averaging) is investigated in [JKK+18], in [DB16] in an RKHS framework, and also in
[MNR19] with additional mini-batching, see also [MR20] for a more general analysis in Hilbert
scales. In [ZWB+21b, ZWB+21a] it is shown that benign overfitting also occurs for SGD. Multi-
pass SGD is analyzed in [LCR16, JKK+16, LR17, ZWB+22] while last iterate bounds can be
found in [JNN19, WZB+22, VPVF21].
Despite the attractive statistical properties of all these SGD variants, the complexity of computing
regression estimates prevents them from being routinely used in large-scale problems. More
precisely, the time complexity and space complexity of SGD and other regularization methods
in a standard implementation scale as $O(n^{\alpha})$, $\alpha \in [2,3]$. Such scalings are prohibitive when the
sample size $n$ is large.
Distributed learning (DL) based on a divide-and-conquer approach is an effective way to analyze
large-scale data that cannot be handled by a single machine. In this paper we study a distributed
learning strategy in linear regression (including both underparameterized and overparameterized
regimes) via (tail-) averaged stochastic gradient descent with constant stepsize (DSGD). The
approach is quite simple and communication efficient: The training data is distributed across
several computing nodes where on each a local SGD is run. In a final step, these local estimates
are aggregated (a.k.a. one-shot SGD). Local SGD has become state of the art in large scale
distributed learning, showing a linear speed-up in the number of workers for convex problems,
see e.g. [MMS+09, ZWLS10, DP19, Sti18, SOP21] and references therein.
The field of DL has gained increasing attention in statistical learning theory with the aim of
deriving conditions under which minimax optimal rates of convergence can be guaranteed, see
e.g. [CX14], [MTJ11], [XSC19], [FWWZ19], [SLS18], [BFL+18], [FGW21], [BX21]. Indeed, the
learning properties of DL in regression settings over Hilbert spaces are by now well understood.
The authors in [ZDW15] analyze distributed (kernel) ridge regression and show optimal learning
rates with appropriate regularization, provided the number of machines increases sufficiently
slowly with the sample size, though under restrictive assumptions on the eigenfunctions of the
kernel integral operator. This has been alleviated in [LGZ17]. However, in these works the
number of machines saturates if the target is very smooth, meaning that large parallelization
seems not possible in this regime.
An extension of these works to more general spectral regularization algorithms for nonparamet-
ric least square regression in (reproducing kernel) Hilbert spaces is given in [GLZ17], [MB18],
including gradient descent ([LZ18]) and stochastic gradient descent ([LC18]). The recent work
[Ton21] studies DL for functional linear regression.
We finally mention the work of [MRRK22], where distributed ordinary least squares (DOLS) in
overparameterized linear regression is studied, i.e. one-shot OLS without any explicit or implicit
regularization. It is shown that the number of workers acts as a regularization parameter itself.
Contributions. We analyze the performance of DSGD with constant stepsize in overparam-
eterized linear regression and provide upper bounds with matching lower bounds for the excess
risk under suitable noise assumptions. Our results show that optimal rates of convergence can
be achieved if the number of local nodes grows sufficiently slowly with the sample size. The ex-
cess risk as a function of data splits remains constant until a certain threshold is reached. This
threshold depends on the structural assumptions imposed on the problem, i.e. on the eigenvalue
decay of the Hessian and the coefficients of the true regression parameter.
We additionally perform a comparison between DSGD and DRR, showing that the excess risk
of DSGD is upper bounded by the excess risk of DRR under an assumption on the sample
complexity (SC) of DSGD, depending on the same structural assumptions. We show that the
SC of DSGD remains within constant factors of the SC of DRR.
Our analysis extends known results in this direction from [ZWB+21b, ZWB+21a] for the single
machine case to the distributed learning setting and from DOLS in [MRRK22] to SGD with
implicit regularization.
Organization. In Section 2 we define the mathematical framework needed to present our
main results in Section 3, where we provide a theoretical analysis of DSGD with a discussion of
our results. In Section 4 we compare DSGD with DRR while Section 5 is devoted to showing
some numerical illustrations. The proofs are deferred to the Appendix.
Notation. By $\mathcal{L}(\mathcal{H}_1, \mathcal{H}_2)$ we denote the space of bounded linear operators between real Hilbert
spaces $\mathcal{H}_1, \mathcal{H}_2$. We write $\mathcal{L}(\mathcal{H}, \mathcal{H}) = \mathcal{L}(\mathcal{H})$. For $A \in \mathcal{L}(\mathcal{H})$ we denote by $A^T$ the adjoint operator.
By $A^{\dagger}$ we denote the pseudoinverse of $A$, and for $w \in \mathcal{H}$ we write $\|w\|_A^2 := \|A^{1/2} w\|^2$ for a PSD
operator $A$.
We let $[n] = \{1, \ldots, n\}$ for every $n \in \mathbb{N}$. For two positive sequences $(a_n)_n$, $(b_n)_n$ we write $a_n \lesssim b_n$
if $a_n \leq c\, b_n$ for some $c > 0$, and $a_n \simeq b_n$ if both $a_n \lesssim b_n$ and $b_n \lesssim a_n$.
2 SETUP
In this section we provide the mathematical framework for our analysis. More specifically, we
introduce distributed SGD and state the main assumptions on our model.
2.1 SGD and linear regression
We consider a linear regression model over a real separable Hilbert space $\mathcal{H}$ in random design.
More precisely, we are given a random covariate vector $x \in \mathcal{H}$ and a random output $y \in \mathbb{R}$
following the model
$$y = \langle w^*, x \rangle + \varepsilon, \qquad (2.1)$$
where $\varepsilon \in \mathbb{R}$ is a noise variable. We will impose some assumptions on the noise model in Section
3. The true regression parameter $w^* \in \mathcal{H}$ minimizes the least squares test risk, i.e.
$$L(w^*) = \min_{w \in \mathcal{H}} L(w), \qquad L(w) := \frac{1}{2}\, \mathbb{E}\big[(y - \langle w, x \rangle)^2\big],$$
where the expectation is taken with respect to the joint distribution $P$ of the pair $(x, y) \in \mathcal{H} \times \mathbb{R}$.
More specifically, we let $w^*$ be the minimum norm element in the set of all minimizers of $L$.
To derive an estimator $\hat{w} \in \mathcal{H}$ for $w^*$ we are given an i.i.d. dataset
$$D := \{(x_1, y_1), \ldots, (x_n, y_n)\} \subset \mathcal{H} \times \mathbb{R},$$
following the above model (2.1), i.e.,
$$Y = X w^* + \varepsilon,$$
with i.i.d. noise $\varepsilon = (\varepsilon_1, \ldots, \varepsilon_n) \in \mathbb{R}^n$. The corresponding random vector of outputs is denoted
as $Y = (y_1, \ldots, y_n)^T \in \mathbb{R}^n$ and we arrange the data $x_j \in \mathcal{H}$ into a data matrix $X \in \mathcal{L}(\mathcal{H}, \mathbb{R}^n)$
by setting $(Xv)_j = \langle x_j, v \rangle$ for $v \in \mathcal{H}$, $1 \leq j \leq n$. If $\mathcal{H} = \mathbb{R}^d$, then $X$ is an $n \times d$ matrix
(with row vectors $x_j$). We are particularly interested in the overparameterized regime, i.e. where
$\dim(\mathcal{H}) > n$.
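For illustration only, model (2.1) can be simulated in a finite-dimensional surrogate $\mathcal{H} = \mathbb{R}^d$ with $d > n$. The following sketch is not taken from the paper; the sample size, the polynomially decaying spectrum of $H$, the parameter w_star and the noise level sigma_noise are all hypothetical choices.

```python
import numpy as np

# A minimal sketch (hypothetical setup): simulate model (2.1) with H = R^d, d > n,
# a fixed "true" parameter w_star and i.i.d. Gaussian noise.
rng = np.random.default_rng(0)

n, d = 200, 1000                                # overparameterized regime: dim(H) = d > n
sigma_noise = 0.5                               # illustrative noise standard deviation

eigvals = 1.0 / np.arange(1, d + 1) ** 2        # illustrative spectrum of H (polynomial decay)
w_star = 1.0 / np.arange(1, d + 1)              # illustrative true parameter

X = rng.normal(size=(n, d)) * np.sqrt(eigvals)  # rows x_j with E[x x^T] = diag(eigvals)
eps = sigma_noise * rng.normal(size=n)
Y = X @ w_star + eps                            # Y = X w_star + eps
```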
In the classical setting of stochastic approximation with constant stepsize, the SGD iterates are
computed by the recursion
$$w_{t+1} = w_t - \gamma\, (\langle w_t, x_t \rangle - y_t)\, x_t, \qquad t = 1, \ldots, n,$$
with some initialization $w_1 \in \mathcal{H}$ and where $\gamma > 0$ is the stepsize. The tail average of the iterates
is denoted by
$$\bar{w}_{n/2:n} := \frac{1}{n - n/2} \sum_{t = n/2 + 1}^{n} w_t, \qquad (2.2)$$
where we denote by $\bar{w}_n := \bar{w}_{0:n}$ the full (uniform) average.
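The recursion and the tail average (2.2) translate directly into a few lines of code. The sketch below is a minimal single-pass illustration under the hypothetical setup of the previous snippet: it reuses the simulated X and Y, and the stepsize gamma is an arbitrary choice, not a tuned value from the paper.

```python
import numpy as np

def sgd_tail_average(X, Y, gamma, w_init=None):
    """Single pass of constant-stepsize SGD; returns the tail average over the last n/2 iterates."""
    n, d = X.shape
    w = np.zeros(d) if w_init is None else w_init.copy()
    tail_sum = np.zeros(d)
    for t in range(n):
        x_t, y_t = X[t], Y[t]
        w = w - gamma * (x_t @ w - y_t) * x_t   # w_{t+1} = w_t - gamma (<w_t, x_t> - y_t) x_t
        if t >= n // 2:                         # accumulate only the tail iterates
            tail_sum += w
    return tail_sum / (n - n // 2)

w_bar = sgd_tail_average(X, Y, gamma=0.01)      # X, Y from the simulation sketch; stepsize is arbitrary
```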
Various forms of SGD (with iterate averaging, tail averaging, multiple passes) in the setting
of overparameterized linear regression have been analyzed recently in [ZWB+21b], [WZB+22],
[ZWB+22], respectively. In particular, the phenomenon of benign overfitting is theoretically
investigated in these works. It was shown that benign overfitting occurs in this setting, i.e.
the SGD estimator fits the training data very well and still generalizes.
We are interested in this phenomenon for localized SGD, i.e. when our training data is dis-
tributed over several computing devices.
2.2 Local SGD
In the distributed setting, our data are evenly divided into $M \in \mathbb{N}$ local disjoint subsets
$$D = D_1 \cup \ldots \cup D_M$$
of size $|D_m| = \frac{n}{M}$, for $m = 1, \ldots, M$. To each local dataset we associate a local design matrix
$X_m \in \mathcal{L}(\mathcal{H}, \mathbb{R}^{n/M})$ (built from the local row vectors $x_j^{(m)}$), a local output vector $Y_m \in \mathbb{R}^{n/M}$, and a
local noise vector $\varepsilon_m \in \mathbb{R}^{n/M}$.
The local SGD iterates are defined as
$$w_{t+1}^{(m)} = w_t^{(m)} - \gamma\, \Big(\big\langle w_t^{(m)}, x_t^{(m)} \big\rangle - y_t^{(m)}\Big)\, x_t^{(m)},$$
for $t = 1, \ldots, \frac{n}{M}$ and $m = 1, \ldots, M$. The averaged local iterates $\bar{w}_{n/M}^{(m)}$ are computed according
to (2.2). We are finally interested in the uniform average of the local SGD iterates, building a
global estimator:
$$w_M := \frac{1}{M} \sum_{m=1}^{M} \bar{w}_{n/M}^{(m)}.$$
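A one-shot DSGD estimator can then be sketched as follows: split the data into $M$ equal shards, run the local tail-averaged SGD from the previous snippet on each shard, and average the $M$ local estimates uniformly. Again this is only an illustration; the function distributed_sgd, the number of nodes M = 4 and the stepsize are hypothetical choices.

```python
import numpy as np

def distributed_sgd(X, Y, M, gamma):
    """One-shot DSGD: local tail-averaged SGD on each of M shards, then uniform averaging."""
    n, _ = X.shape
    shard = n // M                                 # local sample size n/M (assumes M divides n)
    local_estimates = [
        sgd_tail_average(X[m * shard:(m + 1) * shard],
                         Y[m * shard:(m + 1) * shard],
                         gamma)
        for m in range(M)
    ]
    return np.mean(local_estimates, axis=0)        # w_M = (1/M) * sum of local tail averages

w_M = distributed_sgd(X, Y, M=4, gamma=0.01)       # M and gamma are illustrative choices
```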
Distributed learning in overparameterized linear regression is studied in [MRRK22] for the ordi-
nary least squares estimator (OLS), i.e. without any implicit or explicit regularization and with
local interpolation. It is shown that local overfitting is harmless and regularization is done by
the number of data splits.
We aim at finding optimal bounds for the excess risk
$$\mathbb{E}\big[L(w_M)\big] - L(w^*)$$
of distributed SGD (DSGD) with potential local overparameterization, as a function of the
number of local nodes $M$ and under various model assumptions, to be given in the next section.
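Since $w^*$ minimizes $L$, the first-order condition $\mathbb{E}[(y - \langle w^*, x \rangle)\, x] = 0$ holds, and a standard computation gives
$$L(w) - L(w^*) = \frac{1}{2}\, \mathbb{E}\big[\langle w - w^*, x \rangle^2\big] = \frac{1}{2}\, \|w - w^*\|_H^2,$$
so the excess risk of $w_M$ is half the squared $H$-norm distance of $w_M$ to $w^*$.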
3 MAIN RESULTS
In this section we present our main results. To do so, we first impose some model assumptions.
Definition 3.1.
1. We define the second moment of $x \sim P_x$ to be the operator $H : \mathcal{H} \to \mathcal{H}$, given by
$$H := \mathbb{E}[x \otimes x] = \mathbb{E}[\langle \cdot, x \rangle\, x].$$
2. The fourth moment operator $\mathcal{M} : \mathcal{L}(\mathcal{H}) \to \mathcal{L}(\mathcal{H})$ is defined by
$$\mathcal{M} := \mathbb{E}[x \otimes x \otimes x \otimes x],$$
with $\mathcal{M}(A)(w) = \mathbb{E}[\langle x, Ax \rangle \langle w, x \rangle\, x]$, for all $w \in \mathcal{H}$.
3. The covariance operator of the gradient noise at $w^*$ is defined as $\Sigma : \mathcal{H} \to \mathcal{H}$,
$$\Sigma := \mathbb{E}\big[(\langle w^*, x \rangle - y)^2\, x \otimes x\big].$$
Assumption 3.2 (Second Moment Condition). We assume that $\mathbb{E}[y^2 \,|\, x] < \infty$ almost surely.
Moreover, we assume that the trace of $H$ is finite, i.e., $\mathrm{Tr}[H] < \infty$.
Assumption 3.3 (Fourth Moment Condition). We assume there exists a positive constant
$\tau > 0$ such that for any PSD operator $A$, we have
$$\mathcal{M}(A) \preceq \tau\, \mathrm{Tr}[HA]\, H.$$
Note that this assumption holds if $H^{-1/2} x$ is sub-Gaussian, which is a standard assumption in least
squares regression, see e.g. [BLLT20], [ZWB+21b], [TB20].
Assumption 3.4 (Noise Condition). Assume that
$$\sigma^2 := \|H^{-1/2}\, \Sigma\, H^{-1/2}\| < \infty.$$
This assumption on the noise is standard in the literature about averaged SGD, see e.g. [ZWB+21b],
[DB16].
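As a concrete sanity check (tied to the hypothetical simulation of Section 2.1, not to the paper's experiments), in the homoscedastic case with noise independent of $x$ we have $\Sigma = \sigma_{\mathrm{noise}}^2 H$, so the whitened noise level in Assumption 3.4 reduces to the noise variance:

```python
import numpy as np

# Hypothetical check of Assumption 3.4 for the simulated model of Section 2.1:
# there H = diag(eigvals) and, since the noise is independent of x,
# Sigma = sigma_noise^2 * H, so the whitened noise level equals sigma_noise^2.
eigvals = 1.0 / np.arange(1, 1001) ** 2
sigma_noise = 0.5

H = np.diag(eigvals)
Sigma = sigma_noise ** 2 * H
H_inv_sqrt = np.diag(1.0 / np.sqrt(eigvals))

sigma_sq = np.linalg.norm(H_inv_sqrt @ Sigma @ H_inv_sqrt, ord=2)  # operator (spectral) norm
print(sigma_sq)  # approximately 0.25 = sigma_noise**2
```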
We introduce some further notation involving the second moment operator $H$: We denote its
eigendecomposition by
$$H = \sum_{j=1}^{\infty} \lambda_j\, v_j \otimes v_j,$$
where $\lambda_1 \geq \lambda_2 \geq \ldots$ are the eigenvalues of $H$ and the $v_j$'s are the corresponding eigenvectors.
For $k \geq 1$, we let
$$H_{0:k} := \sum_{j=1}^{k} \lambda_j\, v_j \otimes v_j, \qquad H_{k:\infty} := \sum_{j=k+1}^{\infty} \lambda_j\, v_j \otimes v_j.$$
Similarly,
$$I_{0:k} := \sum_{j=1}^{k} v_j \otimes v_j, \qquad I_{k:\infty} := \sum_{j=k+1}^{\infty} v_j \otimes v_j.$$
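When $H$ is diagonal, as in the illustrative simulation of Section 2.1, the head and tail parts can be formed explicitly. The sketch below (with an arbitrary cut-off k, a hypothetical choice) builds $H_{0:k}$ and $H_{k:\infty}$ and reports their traces.

```python
import numpy as np

# Hypothetical illustration of the head/tail decomposition for a diagonal second
# moment operator H = diag(eigvals), using an arbitrary cut-off k.
eigvals = 1.0 / np.arange(1, 1001) ** 2
k = 20

idx = np.arange(len(eigvals))
H_head = np.diag(np.where(idx < k, eigvals, 0.0))   # H_{0:k}
H_tail = np.diag(np.where(idx >= k, eigvals, 0.0))  # H_{k:infty}

print(np.trace(H_head), np.trace(H_tail))           # Tr[H_{0:k}], Tr[H_{k:infty}]
```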