in a standard implementation scale as $O(n^{\alpha})$, $\alpha \in [2,3]$. Such scalings are prohibitive when the sample size $n$ is large.
Distributed learning (DL) based on a divide-and-conquer approach is an effective way to analyze large-scale data that cannot be handled by a single machine. In this paper we study a distributed
learning strategy in linear regression (including both underparameterized and overparameterized
regimes) via (tail-) averaged stochastic gradient descent with constant stepsize (DSGD). The
approach is quite simple and communication-efficient: the training data is distributed across several computing nodes, on each of which a local SGD is run. In a final step, these local estimates
are aggregated (a.k.a. one-shot SGD). Local SGD has become state of the art in large scale
distributed learning, showing a linear speed-up in the number of workers for convex problems,
see e.g. [MMS+09, ZWLS10, DP19, Sti18, SOP21] and references therein.
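As an illustration, the following is a minimal numerical sketch of the one-shot scheme described above. The function names, the single-pass sampling order, the tail fraction, and the synthetic data are illustrative choices made here for concreteness, not the precise protocol analyzed in the paper.

```python
import numpy as np

def local_tail_averaged_sgd(X, y, stepsize, tail_start):
    """Constant-stepsize SGD on one local data shard; returns the tail average
    of the iterates (single pass over the shard, illustrative sampling order)."""
    n, d = X.shape
    w = np.zeros(d)
    tail_sum = np.zeros(d)
    for t in range(n):
        x_t, y_t = X[t], y[t]
        # stochastic gradient of the squared loss at the current iterate
        grad = (x_t @ w - y_t) * x_t
        w = w - stepsize * grad
        if t >= tail_start:
            tail_sum += w
    return tail_sum / (n - tail_start)

def one_shot_dsgd(X, y, num_workers, stepsize, tail_fraction=0.5):
    """One-shot DSGD: split the data across workers, run local SGD on each
    shard, and average the local estimates once at the end."""
    shards_X = np.array_split(X, num_workers)
    shards_y = np.array_split(y, num_workers)
    local_estimates = [
        local_tail_averaged_sgd(Xm, ym, stepsize,
                                tail_start=int(tail_fraction * len(ym)))
        for Xm, ym in zip(shards_X, shards_y)
    ]
    return np.mean(local_estimates, axis=0)

# Toy usage on synthetic data (dimensions and stepsize chosen arbitrarily)
rng = np.random.default_rng(0)
n, d = 2000, 50
X = rng.standard_normal((n, d))
w_star = rng.standard_normal(d) / np.sqrt(d)
y = X @ w_star + 0.1 * rng.standard_normal(n)
w_hat = one_shot_dsgd(X, y, num_workers=4, stepsize=0.01)
print("excess risk proxy:", np.mean((X @ (w_hat - w_star)) ** 2))
```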
The field of DL has gained increasing attention in statistical learning theory with the aim of
deriving conditions under which minimax optimal rates of convergence can be guaranteed, see
e.g. [CX14], [MTJ11], [XSC19], [FWWZ19], [SLS18], [BFL+18], [FGW21], [BX21]. Indeed, the
learning properties of DL in regression settings over Hilbert spaces are by now well understood.
The authors in [ZDW15] analyze distributed (kernel) ridge regression and show optimal learning
rates with appropriate regularization, provided the number of machines increases sufficiently
slowly with the sample size, though under restrictive assumptions on the eigenfunctions of the
kernel integral operator. This has been alleviated in [LGZ17]. However, in these works the
number of machines saturates if the target is very smooth, meaning that a high degree of parallelization does not seem possible in this regime.
An extension of these works to more general spectral regularization algorithms for nonparametric least squares regression in (reproducing kernel) Hilbert spaces is given in [GLZ17], [MB18],
including gradient descent ([LZ18]) and stochastic gradient descent ([LC18]). The recent work
[Ton21] studies DL for functional linear regression.
We finally mention the work of [MRRK22], where distributed ordinary least squares (DOLS) in
overparameterized linear regression is studied, i.e. one-shot OLS without any explicit or implicit
regularization. It is shown that the number of workers acts as a regularization parameter itself.
Contributions. We analyze the performance of DSGD with constant stepsize in overparam-
eterized linear regression and provide upper bounds with matching lower bounds for the excess
risk under suitable noise assumptions. Our results show that optimal rates of convergence can
be achieved if the number of local nodes grows sufficiently slowly with the sample size. The excess risk, viewed as a function of the number of data splits, remains constant until a certain threshold is reached. This
threshold depends on the structural assumptions imposed on the problem, i.e. on the eigenvalue
decay of the Hessian and the coefficients of the true regression parameter.
We additionally compare DSGD with distributed ridge regression (DRR), showing that the excess risk of DSGD is upper bounded by the excess risk of DRR under an assumption on the sample complexity (SC) of DSGD that depends on the same structural assumptions. We show that the SC of DSGD remains within constant factors of the SC of DRR.
Our analysis extends known results in this direction from [ZWB+21b, ZWB+21a] for the single
machine case to the distributed learning setting and from DOLS in [MRRK22] to SGD with
implicit regularization.