STSyn: Speeding Up Local SGD with
Straggler-Tolerant Synchronization
Feng Zhu, Jingjing Zhang, Member, IEEE and Xin Wang, Fellow, IEEE
Abstract—Synchronous local stochastic gradient descent (local
SGD) suffers from some workers being idle and random delays
due to slow and straggling workers, as it waits for the workers to
complete the same number of local updates. To address this issue,
a novel local SGD strategy called STSyn is proposed in this paper.
The key idea is to wait for the K fastest workers while keeping
all the workers computing continually at each synchronization
round, and to make full use of every effective (completed) local
update of each worker, regardless of stragglers. To evaluate the
performance of STSyn, an analysis of the average wall-clock
time, average number of local updates, and average number of
uploading workers per round is provided. The convergence of
STSyn is also rigorously established even when the objective
function is nonconvex for both homogeneous and heterogeneous
data distributions. Experimental results highlight the superiority
of STSyn over state-of-the-art schemes, thanks to its straggler-
tolerant technique and the inclusion of additional effective local
updates at each worker. Furthermore, the impact of system
parameters is investigated. By waiting for faster workers and
allowing heterogeneous synchronization with different numbers
of local updates across workers, STSyn provides substantial
improvements both in time and communication efficiency.
Index Terms—Local SGD, heterogeneous synchronization,
straggler tolerance, distributed learning
I. INTRODUCTION
AS the size of datasets grows exponentially, it is no longer
economical or feasible to use all the data to update the model
parameters in large-scale machine learning, which magnifies the
computational-complexity disadvantage of gradient descent (GD)
methods. Instead, stochastic gradient descent (SGD), which uses
just one sample or a mini-batch of samples per update, has been
proved to converge faster with significantly reduced computation
load [1]–[3]. Therefore, SGD has become the backbone of
large-scale machine learning tasks.
To exploit parallelism and further accelerate the training
process, the distributed implementation over multiple worker nodes,
i.e., distributed learning, has been gaining popularity [4]–
[6]. The parameter server (PS) setting is one of the most
common scenarios of SGD-based distributed learning frame-
works, where in each iteration the PS aggregates the uploaded
gradients/models from all the worker nodes to update the
global parameter [2], [4], [5], [7]–[9]. In such a setting,
communication cost has become the performance bottleneck of the
system due to the two-way communication and the large number of
parameters to be updated at each iteration [1], [5], [10]–[12].
Accordingly, a scheme aptly named local SGD has been proposed to
reduce the communication frequency between the PS and the workers
by allowing each worker to perform multiple local updates.

The authors are with the Key Laboratory for Information Science of
Electromagnetic Waves (MoE), Department of Communication Science and
Engineering, School of Information Science and Technology, Fudan
University, Shanghai 200433, China (e-mail: {20210720072, jingjingzhang,
xwang11}@fudan.edu.cn). (Corresponding author: Jingjing Zhang.)
This work was supported by the National Natural Science Foundation
of China Grants No. 62101134 and No. 62071126, and the Innovation
Program of Shanghai Municipal Science and Technology Commission Grant
20JC1416400.
The idea of local SGD is first introduced in [6] where each
worker performs a certain number of local updates instead of
one, and the PS aggregates the latest model of each worker as
the final output. However, the scheme in [6] has been proved
to be no better than single-worker algorithms in the worst-
case scenario due to the divergence of local optima across
workers [13]. To address this issue, [14] proposes the federated
averaging (FedAvg) scheme, where each sampled worker runs
multiple local updates before uploading the local model to
the central server for aggregation and then the server re-sends
the aggregated model back to the workers to repeat the above
process. By increasing the communication frequency, FedAvg
has been proved to be communication-efficient and enjoy
remarkable performance under homogeneous data distribution.
The convergence analysis of FedAvg in this setting is widely
explored in [15]–[18], where FedAvg is also proved to achieve a
linear speedup with respect to the number of workers. Furthermore,
when data is heterogeneously distributed, the convergence of FedAvg
is also established in [19]–[23]. Local SGD has also been proved
to achieve better time and communication performance than
large mini-batch training whose mini-batch size equals the total
batch size used in one training round of local SGD [25].
Adaptive Communication Strategies for Local SGD: It
has been proved by [15], [21] that it is theoretically reason-
able and advantageous to vary the communication frequency
as the training proceeds. The authors of [26] present the
convergence analysis of periodic averaging SGD (PASGD),
which is FedAvg with no user sampling, and propose an
adaptive scheme that gradually decreases the communication
frequency between workers and the server to reduce the total
number of communication rounds. Meanwhile, [27] proposes an
adaptive strategy based on PASGD where the communication
frequency is increased as training proceeds to reduce the true
wall-clock convergence time. In addition, [28] proposes
STagewise Local SGD (STL-SGD), which increases the communication
period along with a decreasing learning rate to achieve further
improvement.
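As a rough illustration of such strategies, the sketch below implements a hypothetical geometric schedule for the number of local updates (i.e., the communication period) per round; the function name, the initial period, and the decay factor are illustrative choices, not the actual rules of [26]–[28].

```python
import numpy as np

def communication_period(round_idx, tau0=16, factor=0.7, tau_min=1, tau_max=64):
    """Hypothetical geometric schedule for the communication period.

    factor < 1 shrinks the period over rounds (more frequent communication,
    in the spirit of [27]); factor > 1 grows it (less frequent communication,
    in the spirit of [26]). Purely illustrative, not the exact rule of either.
    """
    tau = round(tau0 * factor ** round_idx)
    return int(np.clip(tau, tau_min, tau_max))

# Example: local-update budget for the first 10 rounds.
print([communication_period(j) for j in range(10)])
```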
Local SGD offers a way to reduce the communication cost
of the system; yet, besides communication
efficiency, the wall-clock time required for convergence is also
an important issue. Due to the discrepancy among computing
capabilities of machines, there is considerable variability in
the computing time of different machines, and the machines
that consume far more time per round than others are
normally known as stragglers [29]. Numerous studies have been
devoted to developing straggler-tolerant techniques, which can
be generally categorized into synchronous
schemes and asynchronous ones.
Dealing with Stragglers with Synchronous SGD: In
synchronous parallel SGD, the PS always waits for all the
workers to finish their computation. As a result, the whole
process could be severely slowed down due to the existence
of potential stragglers. One way to mitigate the effect of
straggling is to exploit redundancy, as in [33]–[37], which allow
the PS to wait for only a subset of workers by utilizing storage
and computation redundancy. Beyond redundancy, [38]
develops the FlexRR system that attempts to detect and avoid
stragglers. Another well-known fact is that the convergence
analysis of M-worker synchronous parallel SGD is the same as
that of mini-batch SGD, only with the mini-batch size enlarged M
times. Inspired by this idea, [9] and [39] propose K-sync
SGD, where the PS only waits for the first K workers to finish
their computation. In addition, [39] extends K-sync SGD
to K-batch-sync SGD, in which the PS updates the parameter
using the first K mini-batches that finish, such that no worker
is idle in each training round.
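To make the K-sync rule concrete, here is a minimal simulation sketch (not the implementation of [9], [39]): each of M workers draws a random computation time, modeled as exponential for illustration (the same model adopted later in Section II-C), and the PS averages placeholder gradients from only the K earliest finishers.

```python
import numpy as np

def k_sync_round(M=10, K=4, mu=1.0, dim=5, rng=np.random.default_rng(0)):
    """One simulated K-sync round: the PS waits only for the K fastest of M workers."""
    finish_times = rng.exponential(scale=mu, size=M)   # per-worker compute time
    gradients = rng.normal(size=(M, dim))              # placeholder gradients
    fastest = np.argsort(finish_times)[:K]             # indices of the K earliest finishers
    round_time = finish_times[fastest].max()           # PS waits until the K-th finisher
    avg_grad = gradients[fastest].mean(axis=0)         # aggregate only these K gradients
    return round_time, avg_grad

time_used, g = k_sync_round()
print(f"round time: {time_used:.3f}, aggregated gradient shape: {g.shape}")
```

Increasing K trades a longer per-round wait for a larger effective mini-batch, which is the error-runtime trade-off studied in [39].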
In addition to the approaches mentioned above, there is a
growing interest in allowing heterogeneous synchronization to
deal with stragglers, where the number of local updates across
workers varies in a synchronization round. Heterogeneous syn-
chronization has been studied in several works, including [30]–
[32]. In [30], dynamic communication patterns are explored,
where workers are manually designated to perform different
numbers of local updates under a convex global objective function.
FedNova, proposed in [31], balances the heterogeneity of
local updates across workers by imposing, on the local models,
weights inversely proportional to the number of local updates
performed by the workers. Meanwhile, [32] investigates
the impact of incomplete local updates under heterogeneous
data distribution by comparing the performances of three
schemes. In that work, workers are required to perform a given
number of local updates, and by the time of synchronization,
those who have finished the task are considered to have
“complete” local updates, while the rest are considered to
have “incomplete” local updates. Scheme A only utilizes the
complete local updates of the workers; Scheme B aggregates
both complete and incomplete updates, and Scheme C is the
canonical FedAvg scheme. However, none of [30], [31], or [32]
addresses the issue of fast workers being idle.
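For illustration, the snippet below contrasts Scheme A and Scheme B style aggregation, assuming each worker reports its local model together with the number of local updates it completed; it is a sketch of the idea, not the code of [32], and the example values are placeholders.

```python
import numpy as np

def aggregate(models, updates_done, target_U, scheme="A"):
    """Scheme A: average only models from workers with complete (>= target_U) updates.
    Scheme B: average every uploaded model, complete or not."""
    models = np.asarray(models, dtype=float)
    updates_done = np.asarray(updates_done)
    if scheme == "A":
        mask = updates_done >= target_U
        if not mask.any():          # fall back if no worker finished in time
            mask[:] = True
        return models[mask].mean(axis=0)
    return models.mean(axis=0)      # Scheme B

local_models = [[1.0, 2.0], [1.2, 1.8], [0.5, 0.5]]
print(aggregate(local_models, updates_done=[5, 5, 2], target_U=5, scheme="A"))
print(aggregate(local_models, updates_done=[5, 5, 2], target_U=5, scheme="B"))
```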
Dealing with Stragglers with Asynchronous SGD: Apart
from confronting stragglers in the synchronous schemes, asyn-
chronous SGD (ASGD) is also proposed to address this issue
[40]. In ASGD, the PS updates the parameter immediately
after any worker finishes its computation and uploads it to the
server. Clearly, the time used per round in ASGD schemes
is much less than that in synchronous SGD. Nevertheless,
since the parameter used to compute a gradient may not be
the one to which that gradient is eventually applied, the trade-off
between wall-clock time and gradient staleness becomes the main
issue in ASGD, for which a number of ASGD variants have been
proposed. In [41], the authors develop the Hogwild! scheme,
a lock-free implementation of ASGD, and prove its
effectiveness both theoretically and experimentally; [42]
improves the performance of Hogwild!-based ASGD by avoiding
conflicts in parallel execution. In [39], the authors review
the K-async SGD and K-batch-async SGD first introduced in
[9] and [43] respectively, and propose the AdaSync scheme
to achieve the best error-runtime trade-off. Another trend of
research is to process the delayed gradients in ASGD. To this
end, [44] proposes the AdaDelay algorithm that imposes a
penalty on delayed gradients, and [45] introduces the DC-
ASGD scheme that compensates for the delayed gradients to
make full use of all the gradients.
A. Our Contributions
This paper investigates the efficiency of straggler mitigation
with heterogeneous synchronization under both homogeneous
and heterogeneous data distributions in a distributed PS-
based architecture. The main contributions of our work are
as follows:
First, we propose a novel local SGD scheme that allows
different numbers of local updates across workers while
remaining robust to stragglers. Our approach builds on
previous works that use heterogeneous synchronization,
but we enhance it with straggler-tolerant techniques and
the inclusion of all effective local updates from every
worker to improve both time and communication effi-
ciency.
Second, we provide an analysis of the average runtime,
average number of uploading workers and average num-
ber of local updates per round, to justify the improvement
in time and communication efficiency of our proposed
scheme, named STSyn. We also rigorously establish the
convergence of STSyn and prove that it can achieve a
sublinear convergence rate, similar to other local SGD
variants, but with a reduced convergence upper bound.
Finally, we present experimental results to corroborate the
superiority of STSyn against state-of-the-art schemes for
both homogeneous and heterogeneous data distributions
in terms of wall-clock time and communication efficiency.
We also study the influence of system hyper-parameters.
The rest of the paper is organized as follows. Section II
describes the system model. The development of the proposed
STSyn scheme is delineated in Section III. Section IV presents
the analysis of STSyn. Numerical results are provided in
Section V. Section VI concludes the work.
Notations: Boldface lowercase letters and calligraphic letters
represent vectors and sets, respectively; $\mathbb{R}$ denotes the
real number field; $\mathbb{E}[\cdot]$ denotes the statistical
expectation; $\bigcup$ denotes the union of sets; $\mathcal{A} \subseteq \mathcal{B}$
means that set $\mathcal{A}$ is a subset of set $\mathcal{B}$;
$|\mathcal{A}|$ denotes the number of elements in set $\mathcal{A}$;
$\nabla F$ represents the gradient of the function $F$;
$\|\mathbf{x}\|$ denotes the 2-norm of vector $\mathbf{x}$;
$\langle \mathbf{x}, \mathbf{y} \rangle$ denotes the inner product of
vectors $\mathbf{x}$ and $\mathbf{y}$.
II. PROBLEM FORMULATION
A. System Model
We first introduce the general problem framework of dis-
tributed SGD with local updates in order to reduce communi-
cation overhead. To implement the local-SGD-based training
process, a parameter-server setting with a set of $M$ workers in
set $\mathcal{M} \triangleq \{1, \ldots, M\}$ is considered. Each worker $m$ maintains
a local dataset $\mathcal{D}_m$, which is equally allocated from the global
training dataset, and we have $\mathcal{D} = \bigcup_{m \in \mathcal{M}} \mathcal{D}_m$. Our objective
is to minimize the following empirical risk function given as
$$F(\mathbf{w}) = \frac{1}{M} \sum_{m=1}^{M} F_m(\mathbf{w}), \qquad (1)$$
where the variable $\mathbf{w} \in \mathbb{R}^d$ is the $d$-dimensional parameter to be
optimized; and $F_m(\mathbf{w})$ is the local loss function of worker $m$
with $F_m(\mathbf{w}) \triangleq \mathbb{E}[F_m(\mathbf{w}; \xi_m)]$, where $\xi_m$ is drawn randomly
from the local dataset $\mathcal{D}_m$.
To elaborate on the communication and updating protocol,
we define $\mathbf{w}^j$ as the global parameter, and $\mathbf{w}^{j,0}_m$ as the local
parameter that worker $m \in \mathcal{M}$ has available at training round
$j$ prior to computation.
In local SGD, at each synchronization round $j$, the PS
first broadcasts the global model parameter $\mathbf{w}^j$ to a subset
of workers $\mathcal{M}^j \subseteq \mathcal{M}$. Then, by setting the local model as
$\mathbf{w}^{j,0}_m = \mathbf{w}^j$, each worker $m$ in $\mathcal{M}^j$ starts to perform local
updates. At the end of round $j$, each worker $m$ is assumed to
complete $U^j_m \geq 0$ local updates, with each given as
$$\mathbf{w}^{j,u+1}_m = \mathbf{w}^{j,u}_m - \frac{\alpha}{B} \sum_{b=1}^{B} \nabla F_m(\mathbf{w}^{j,u}_m; \xi^{j,u}_{m,b}), \qquad (2)$$
for any $u = 0, \ldots, U^j_m - 1$ if $U^j_m \geq 1$. Here $\alpha$ is the stepsize;
the mini-batch $\{\xi^{j,u}_{m,b}\}_{b=1}^{B}$ of size $B$ is generated randomly
for each local update from dataset $\mathcal{D}_m$; and $\nabla F_m(\mathbf{w}^{j,u}_m; \xi^{j,u}_{m,b})$
is the local gradient of the loss function $F_m$ computed with local
parameter $\mathbf{w}^{j,u}_m$ and data sample $\xi^{j,u}_{m,b}$. After the computation,
the PS then selects a subset $\mathcal{S}^j \subseteq \mathcal{M}^j$ of $S^j$ workers to
upload their local parameters, yielding the global parameter
$\mathbf{w}^{j+1}$ given as
$$\mathbf{w}^{j+1} = \frac{1}{S^j} \sum_{m \in \mathcal{S}^j} \mathbf{w}^{j, U^j_m}_m. \qquad (3)$$
The next round $j+1$ then starts and the training continues
until a convergence criterion is satisfied, i.e., the trained model
reaches a target test accuracy, with the total number of rounds
defined as $J$. The whole process is illustrated in Fig. 1.
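As a concrete illustration of the round structure in (2)–(3), the following sketch runs local SGD on a toy least-squares problem; the loss, the synthetic data, and the simplifying choice $\mathcal{S}^j = \mathcal{M}^j = \mathcal{M}$ (all workers upload) are placeholder assumptions made only to keep the example self-contained.

```python
import numpy as np

rng = np.random.default_rng(0)

def local_grad(w, batch):
    """Mini-batch gradient of a toy least-squares local loss."""
    X, y = batch
    return X.T @ (X @ w - y) / len(y)

def local_sgd_round(w_global, datasets, U_m, alpha=0.1, B=8):
    """One synchronization round following (2)-(3): worker m runs U_m[m]
    local updates from the broadcast model, then the PS averages the uploads."""
    uploads = []
    for m, (X, y) in enumerate(datasets):
        w = w_global.copy()                          # w_m^{j,0} = w^j
        for _ in range(U_m[m]):                      # U_m^j local updates, eq. (2)
            idx = rng.choice(len(y), size=B, replace=False)
            w = w - alpha * local_grad(w, (X[idx], y[idx]))
        uploads.append(w)
    return np.mean(uploads, axis=0)                  # eq. (3) with all workers uploading

d, n, M = 3, 64, 4
datasets = [(rng.normal(size=(n, d)), rng.normal(size=n)) for _ in range(M)]
w = np.zeros(d)
for j in range(5):
    w = local_sgd_round(w, datasets, U_m=[3, 3, 2, 4])
print(w)
```

Heterogeneous synchronization, as studied in this paper, corresponds to letting the entries of U_m differ across workers and rounds.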
B. Preliminaries
Based on the distributed PS architecture under considera-
tion, our proposed scheme stems from a couple of local SGD
variants, which we briefly review next.
Federated Averaging (FedAvg). One pioneering work on
local updating is the so-called FedAvg. More precisely, at each
round, the PS first randomly selects a subset $\mathcal{M}^j \subseteq \mathcal{M}$
of workers to broadcast the global parameter to. Then each
selected worker $m$ performs the same number $U^j_m = U$
of local updates and sends the local parameter back to the
PS, i.e., we have $\mathcal{S}^j = \mathcal{M}^j$. In particular, when all the
workers participate in the computation in each round, i.e.,
with $\mathcal{S}^j = \mathcal{M}^j = \mathcal{M}$, we arrive at periodic-averaging
SGD (PASGD) [26].
Fig. 1. Illustration of the PS setting with $M = 3$, $\mathcal{M}^j = \{1, 2, 3\}$ and
$\mathcal{S}^j = \{1, 2\}$ at the $j$th round.
Adaptive Communication Strategy (AdaComm). Build-
ing upon PASGD, [27] proposes an adaptive communication
strategy called AdaComm in order to achieve a faster conver-
gence speed in terms of wall-clock runtime. This is achieved
by adaptively increasing the communication frequency be-
tween the PS and the workers. Note here that each worker still
performs the same number of local updates in each training
round.
FedNova. This is one of the pioneering works to investigate
heterogeneous synchronization due to the discrepancy among
the computing capabilities of machines [31], i.e., the numbers
$\{U^j_m\}_{m \in \mathcal{M}}$ of local updates across workers are not necessarily
the same. FedNova aggregates the local models in a weighted
manner to balance the heterogeneity of local updates, with the
weight inversely proportional to the number of local updates
the worker has completed.
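As an illustration of this weighting idea, the sketch below averages the local model changes with weights inversely proportional to each worker's number of local updates; it follows the description above rather than the exact normalization rule of FedNova [31], and the numerical values are placeholders.

```python
import numpy as np

def fednova_style_aggregate(w_global, local_models, local_updates):
    """Weighted aggregation with weights inversely proportional to the number of
    local updates (a simplified sketch of the idea described above, not the
    exact FedNova normalization)."""
    w_global = np.asarray(w_global, dtype=float)
    deltas = [np.asarray(w) - w_global for w in local_models]   # per-worker model change
    inv = 1.0 / np.asarray(local_updates, dtype=float)
    weights = inv / inv.sum()                                   # normalized inverse-update weights
    update = sum(wt * d for wt, d in zip(weights, deltas))
    return w_global + update

w_next = fednova_style_aggregate(
    w_global=[0.0, 0.0],
    local_models=[[0.4, 0.2], [1.0, 0.6], [0.1, 0.1]],
    local_updates=[2, 5, 1],
)
print(w_next)
```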
C. Performance Metrics
In addition to training/learning accuracy, the time con-
sumption and communication cost with the distributed learn-
ing scheme are also important performance metrics. In-
deed, the training accuracy should be evaluated as a func-
tion of time and/or communication cost to fairly gauge the
time/communication efficiency of a learning scheme. To this
end, we specifically define wall-clock runtime and the com-
munication cost as follows.
Firstly, the wall-clock runtime for each worker $m$ at the
local iteration $u$ ($u < U^j_m$) of round $j$ is defined as
the time elapsed while computing the mini-batch gradient
$\frac{1}{B} \sum_{b=1}^{B} \nabla F_m(\mathbf{w}^{j,u}_m; \xi^{j,u}_{m,b})$. It is denoted by $T^{j,u}_m$, and is
assumed to follow an exponential distribution with mean $\mu$.
We also assume that the random variable $T^{j,u}_m$ is independent
across all workers, rounds and local iterations, which is a
common assumption since workers are similar machines working