STSyn: Speeding Up Local SGD with
Straggler-Tolerant Synchronization
Feng Zhu, Jingjing Zhang, Member, IEEE and Xin Wang, Fellow, IEEE
Abstract—Synchronous local stochastic gradient descent (local
SGD) suffers from some workers being idle and random delays
due to slow and straggling workers, as it waits for the workers to
complete the same number of local updates. To address this issue,
a novel local SGD strategy called STSyn is proposed in this paper.
The key idea is to wait for the K fastest workers while keeping
all the workers computing continually at each synchronization
round, and to make full use of every effective (completed) local
update of each worker, regardless of stragglers. To evaluate the
performance of STSyn, an analysis of the average wall-clock
time, average number of local updates, and average number of
uploading workers per round is provided. The convergence of
STSyn is also rigorously established even when the objective
function is nonconvex for both homogeneous and heterogeneous
data distributions. Experimental results highlight the superiority
of STSyn over state-of-the-art schemes, thanks to its straggler-
tolerant technique and the inclusion of additional effective local
updates at each worker. Furthermore, the impact of system
parameters is investigated. By waiting for faster workers and
allowing heterogeneous synchronization with different numbers
of local updates across workers, STSyn provides substantial
improvements both in time and communication efficiency.
Index Terms—Local SGD, heterogeneous synchronization,
straggler tolerance, distributed learning
I. INTRODUCTION
AS the size of datasets grows exponentially, it is no longer
economical or feasible to use all the data to update the model
parameters in large-scale machine learning, which magnifies the
computational-complexity disadvantage of gradient descent (GD)
methods. Instead, stochastic gradient descent (SGD), which uses
just one sample or a mini-batch of samples per update, has been
proved to converge faster with significantly reduced computation
load [1]–[3]. Therefore, SGD has become the backbone of
large-scale machine learning tasks.
To exploit parallelism and further accelerate the training
process, the distributed implementation over multiple worker nodes,
i.e., distributed learning, has been gaining popularity [4]–
[6]. The parameter server (PS) setting is one of the most
common scenarios of SGD-based distributed learning frame-
works, where in each iteration the PS aggregates the uploaded
gradients/models from all the worker nodes to update the
global parameter [2], [4], [5], [7]–[9]. In such a setting,
communication cost has become the performance bottleneck of the
system due to the two-way communication and the large number of
parameters to be updated at each iteration [1], [5], [10]–[12].
Accordingly, a scheme aptly named local SGD has been proposed to
reduce the communication frequency between the PS and the workers
by allowing each worker to perform multiple local updates.

The authors are with the Key Laboratory for Information Science of
Electromagnetic Waves (MoE), Department of Communication Science and
Engineering, School of Information Science and Technology, Fudan
University, Shanghai 200433, China (e-mail: {20210720072, jingjingzhang,
xwang11}@fudan.edu.cn). (Corresponding author: Jingjing Zhang.)
This work was supported by the National Natural Science Foundation
of China Grants No. 62101134 and No. 62071126, and the Innovation
Program of Shanghai Municipal Science and Technology Commission Grant
20JC1416400.
The idea of local SGD is first introduced in [6] where each
worker performs a certain number of local updates instead of
one, and the PS aggregates the latest model of each worker as
the final output. However, the scheme in [6] has been proved
to be no better than single-worker algorithms in the worst-
case scenario due to the divergence of local optima across
workers [13]. To address this issue, [14] proposes the federated
averaging (FedAvg) scheme, where each sampled worker runs
multiple local updates before uploading the local model to
the central server for aggregation and then the server re-sends
the aggregated model back to the workers to repeat the above
process. By increasing the communication frequency, FedAvg
has been proved to be communication-efficient and enjoy
remarkable performance under homogeneous data distribution.
The convergence analysis of FedAvg in this setting is widely
explored in [15]–[18], where FedAvg is also proved to achieve a
linear speedup with respect to the number of workers. Furthermore,
when data is heterogeneously distributed, the convergence of FedAvg
is also established in [19]–[23]. Local SGD has also been proved
to achieve better time and communication performance than
large mini-batch training whose mini-batch size equals the total
batch size used in one training round of local SGD [25].
Adaptive Communication Strategies for Local SGD: It
has been proved by [15], [21] that it is theoretically reason-
able and advantageous to vary the communication frequency
as the training proceeds. The authors of [26] present the
convergence analysis of periodic averaging SGD (PASGD),
which is FedAvg with no user sampling, and propose an
adaptive scheme that gradually decreases the communication
frequency between workers and the server to reduce the total
number of communication rounds. Meanwhile, [27] proposes an
adaptive strategy based on PASGD where the communication
frequency is increased as training proceeds to reduce the true
wall-clock convergence time. In addition, [28] proposes
STagewise Local SGD (STL-SGD), which increases the communication
period along with a decreasing learning rate to achieve further
improvement.
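As a rough illustration of such strategies, the sketch below implements a hypothetical geometric schedule for the number of local updates (i.e., the communication period) per round; the function name, the initial period, and the decay factor are illustrative choices, not the actual rules of [26]–[28].

```python
import numpy as np

def communication_period(round_idx, tau0=16, factor=0.7, tau_min=1, tau_max=64):
    """Hypothetical geometric schedule for the communication period.

    factor < 1 shrinks the period over rounds (more frequent communication,
    in the spirit of [27]); factor > 1 grows it (less frequent communication,
    in the spirit of [26]). Purely illustrative, not the exact rule of either.
    """
    tau = round(tau0 * factor ** round_idx)
    return int(np.clip(tau, tau_min, tau_max))

# Example: local-update budget for the first 10 rounds.
print([communication_period(j) for j in range(10)])
```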
Local SGD offers a way to reduce the communication cost
of the system; yet, besides communication
efficiency, the wall-clock time required for convergence is also
an important issue. Due to the discrepancy among computing
capabilities of machines, there is considerable variability in
the computing time of different machines, and the machines
that consume far more time per round than others are
normally known as stragglers [29]. Numerous studies have been
devoted to developing straggler-tolerant techniques, which can
be generally categorized into synchronous
schemes and asynchronous ones.
Dealing with Stragglers with Synchronous SGD: In
synchronous parallel SGD, the PS always waits for all the
workers to finish their computation. As a result, the whole
process could be severely slowed down due to the existence
of potential stragglers. One way to mitigate the effect of
straggling is to exploit redundancy, as in [33]–[37], which allow
the PS to wait for only a subset of workers by utilizing storage
and computation redundancy. Beyond redundancy, [38]
develops the FlexRR system that attempts to detect and avoid
stragglers. Another well-known fact is that the convergence
analysis of M-worker synchronous parallel SGD is the same as
that of mini-batch SGD, only with the mini-batch size enlarged M
times. Inspired by this idea, [9] and [39] propose K-sync
SGD, where the PS only waits for the first K workers to finish
their computation. In addition, [39] extends K-sync SGD
to K-batch-sync SGD, in which the PS updates the parameter
using the first K mini-batches that finish, such that no worker
is idle in each training round.
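To make the K-sync rule concrete, here is a minimal simulation sketch (not the implementation of [9], [39]): each of M workers draws a random computation time, modeled as exponential for illustration (the same model adopted later in Section II-C), and the PS averages placeholder gradients from only the K earliest finishers.

```python
import numpy as np

def k_sync_round(M=10, K=4, mu=1.0, dim=5, rng=np.random.default_rng(0)):
    """One simulated K-sync round: the PS waits only for the K fastest of M workers."""
    finish_times = rng.exponential(scale=mu, size=M)   # per-worker compute time
    gradients = rng.normal(size=(M, dim))              # placeholder gradients
    fastest = np.argsort(finish_times)[:K]             # indices of the K earliest finishers
    round_time = finish_times[fastest].max()           # PS waits until the K-th finisher
    avg_grad = gradients[fastest].mean(axis=0)         # aggregate only these K gradients
    return round_time, avg_grad

time_used, g = k_sync_round()
print(f"round time: {time_used:.3f}, aggregated gradient shape: {g.shape}")
```

Increasing K trades a longer per-round wait for a larger effective mini-batch, which is the error-runtime trade-off studied in [39].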
In addition to the approaches mentioned above, there is a
growing interest in allowing heterogeneous synchronization to
deal with stragglers, where the number of local updates across
workers varies in a synchronization round. Heterogeneous syn-
chronization has been studied in several works, including [30]–
[32]. In [30], dynamic communication patterns are explored,
where workers are manually designated to perform different
numbers of local updates under a convex global objective function.
FedNova, proposed in [31], balances the heterogeneity of
local updates across workers by imposing, on the local models,
weights inversely proportional to the number of local updates
performed by the workers. Meanwhile, [32] investigates
the impact of incomplete local updates under heterogeneous
data distribution by comparing the performances of three
schemes. In that work, workers are required to perform a given
number of local updates, and by the time of synchronization,
those who have finished the task are considered to have
“complete” local updates, while the rest are considered to
have “incomplete” local updates. Scheme A only utilizes the
complete local updates of the workers; Scheme B aggregates
both complete and incomplete updates, and Scheme C is the
canonical FedAvg scheme. However, none of [30], [31], or [32]
addresses the issue of fast workers being idle.
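For illustration, the snippet below contrasts Scheme A and Scheme B style aggregation, assuming each worker reports its local model together with the number of local updates it completed; it is a sketch of the idea, not the code of [32], and the example values are placeholders.

```python
import numpy as np

def aggregate(models, updates_done, target_U, scheme="A"):
    """Scheme A: average only models from workers with complete (>= target_U) updates.
    Scheme B: average every uploaded model, complete or not."""
    models = np.asarray(models, dtype=float)
    updates_done = np.asarray(updates_done)
    if scheme == "A":
        mask = updates_done >= target_U
        if not mask.any():          # fall back if no worker finished in time
            mask[:] = True
        return models[mask].mean(axis=0)
    return models.mean(axis=0)      # Scheme B

local_models = [[1.0, 2.0], [1.2, 1.8], [0.5, 0.5]]
print(aggregate(local_models, updates_done=[5, 5, 2], target_U=5, scheme="A"))
print(aggregate(local_models, updates_done=[5, 5, 2], target_U=5, scheme="B"))
```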
Dealing with Stragglers with Asynchronous SGD: Apart
from confronting stragglers in the synchronous schemes, asyn-
chronous SGD (ASGD) is also proposed to address this issue
[40]. In ASGD, the PS updates the parameter immediately
after any worker finishes its computation and uploads it to the
server. Clearly, the time used per round in ASGD schemes
is much less than that in synchronous SGD. Nevertheless,
since the parameter used to compute a gradient may not be
the one to which that gradient is eventually applied, the trade-off
between wall-clock time and gradient staleness becomes the main
issue in ASGD, for which a number of ASGD variants have been
proposed. In [41], the authors develop the Hogwild! scheme,
a lock-free implementation of ASGD, and prove its
effectiveness both theoretically and experimentally; [42]
improves the performance of Hogwild!-based ASGD by avoiding
conflicts in parallel execution. In [39], the authors review
the K-async SGD and K-batch-async SGD first introduced in
[9] and [43] respectively, and propose the AdaSync scheme
to achieve the best error-runtime trade-off. Another trend of
research is to process the delayed gradients in ASGD. To this
end, [44] proposes the AdaDelay algorithm that imposes a
penalty on delayed gradients, and [45] introduces the DC-
ASGD scheme that compensates for the delayed gradients to
make full use of all the gradients.
A. Our Contributions
This paper investigates the efficiency of straggler mitigation
with heterogeneous synchronization under both homogeneous
and heterogeneous data distributions in a distributed PS-
based architecture. The main contributions of our work are
as follows:
First, we propose a novel local SGD scheme that allows
different numbers of local updates across workers while
remaining robust to stragglers. Our approach builds on
previous works that use heterogeneous synchronization,
but we enhance it with straggler-tolerant techniques and
the inclusion of all effective local updates from every
worker to improve both time and communication effi-
ciency.
Second, we provide an analysis of the average runtime,
average number of uploading workers and average num-
ber of local updates per round, to justify the improvement
in time and communication efficiency of our proposed
scheme, named STSyn. We also rigorously establish the
convergence of STSyn and prove that it can achieve a
sublinear convergence rate, similar to other local SGD
variants, but with a reduced convergence upper bound.
Finally, we present experimental results to corroborate the
superiority of STSyn against state-of-the-art schemes for
both homogeneous and heterogeneous data distributions
in terms of wall-clock time and communication efficiency.
We also study the influence of system hyper-parameters.
The rest of the paper is organized as follows. Section II
describes the system model. The development of the proposed
STSyn scheme is delineated in Section III. Section IV presents
the analysis of STSyn. Numerical results are provided in
Section V. Section VI concludes the work.
Notations: Boldface lowercase letters and calligraphic letters
represent vectors and sets, respectively; $\mathbb{R}$ denotes the
real number field; $\mathbb{E}[\cdot]$ denotes the statistical
expectation; $\bigcup$ denotes the union of sets; $\mathcal{A} \subseteq \mathcal{B}$
means that set $\mathcal{A}$ is a subset of set $\mathcal{B}$;
$|\mathcal{A}|$ denotes the number of elements in set $\mathcal{A}$;
$\nabla F$ represents the gradient of the function $F$;
$\|\mathbf{x}\|$ denotes the 2-norm of vector $\mathbf{x}$;
$\langle \mathbf{x}, \mathbf{y} \rangle$ denotes the inner product of
vectors $\mathbf{x}$ and $\mathbf{y}$.
II. PROBLEM FORMULATION
A. System Model
We first introduce the general problem framework of dis-
tributed SGD with local updates in order to reduce communi-
cation overhead. To implement the local-SGD-based training
process, a parameter-server setting with a set of $M$ workers in
set $\mathcal{M} \triangleq \{1, \ldots, M\}$ is considered. Each worker $m$ maintains
a local dataset $\mathcal{D}_m$, which is equally allocated from the global
training dataset, and we have $\mathcal{D} = \bigcup_{m \in \mathcal{M}} \mathcal{D}_m$. Our objective
is to minimize the following empirical risk function given as
$$F(\mathbf{w}) = \frac{1}{M} \sum_{m=1}^{M} F_m(\mathbf{w}), \qquad (1)$$
where the variable $\mathbf{w} \in \mathbb{R}^d$ is the $d$-dimensional parameter to be
optimized; and $F_m(\mathbf{w})$ is the local loss function of worker $m$
with $F_m(\mathbf{w}) \triangleq \mathbb{E}[F_m(\mathbf{w}; \xi_m)]$, where $\xi_m$ is drawn randomly
from the local dataset $\mathcal{D}_m$.
To elaborate on the communication and updating protocol,
we define $\mathbf{w}^j$ as the global parameter, and $\mathbf{w}^{j,0}_m$ as the local
parameter that worker $m \in \mathcal{M}$ has available at training round
$j$ prior to computation.
In local SGD, at each synchronization round $j$, the PS
first broadcasts the global model parameter $\mathbf{w}^j$ to a subset
of workers $\mathcal{M}^j \subseteq \mathcal{M}$. Then, by setting the local model as
$\mathbf{w}^{j,0}_m = \mathbf{w}^j$, each worker $m$ in $\mathcal{M}^j$ starts to perform local
updates. At the end of round $j$, each worker $m$ is assumed to
complete $U^j_m \geq 0$ local updates, with each given as
$$\mathbf{w}^{j,u+1}_m = \mathbf{w}^{j,u}_m - \frac{\alpha}{B} \sum_{b=1}^{B} \nabla F_m(\mathbf{w}^{j,u}_m; \xi^{j,u}_{m,b}), \qquad (2)$$
for any $u = 0, \ldots, U^j_m - 1$ if $U^j_m \geq 1$. Here $\alpha$ is the stepsize;
the mini-batch $\{\xi^{j,u}_{m,b}\}_{b=1}^{B}$ of size $B$ is generated randomly
for each local update from dataset $\mathcal{D}_m$; and $\nabla F_m(\mathbf{w}^{j,u}_m; \xi^{j,u}_{m,b})$
is the local gradient of the loss function $F_m$ computed with local
parameter $\mathbf{w}^{j,u}_m$ and data sample $\xi^{j,u}_{m,b}$. After the computation,
the PS then selects a subset $\mathcal{S}^j \subseteq \mathcal{M}^j$ of $S^j$ workers to
upload their local parameters, yielding the global parameter
$\mathbf{w}^{j+1}$ given as
$$\mathbf{w}^{j+1} = \frac{1}{S^j} \sum_{m \in \mathcal{S}^j} \mathbf{w}^{j, U^j_m}_m. \qquad (3)$$
The next round $j+1$ then starts and the training continues
until a convergence criterion is satisfied, i.e., the trained model
reaches a target test accuracy, with the total number of rounds
defined as $J$. The whole process is illustrated in Fig. 1.
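As a concrete illustration of the round structure in (2)–(3), the following sketch runs local SGD on a toy least-squares problem; the loss, the synthetic data, and the simplifying choice $\mathcal{S}^j = \mathcal{M}^j = \mathcal{M}$ (all workers upload) are placeholder assumptions made only to keep the example self-contained.

```python
import numpy as np

rng = np.random.default_rng(0)

def local_grad(w, batch):
    """Mini-batch gradient of a toy least-squares local loss."""
    X, y = batch
    return X.T @ (X @ w - y) / len(y)

def local_sgd_round(w_global, datasets, U_m, alpha=0.1, B=8):
    """One synchronization round following (2)-(3): worker m runs U_m[m]
    local updates from the broadcast model, then the PS averages the uploads."""
    uploads = []
    for m, (X, y) in enumerate(datasets):
        w = w_global.copy()                          # w_m^{j,0} = w^j
        for _ in range(U_m[m]):                      # U_m^j local updates, eq. (2)
            idx = rng.choice(len(y), size=B, replace=False)
            w = w - alpha * local_grad(w, (X[idx], y[idx]))
        uploads.append(w)
    return np.mean(uploads, axis=0)                  # eq. (3) with all workers uploading

d, n, M = 3, 64, 4
datasets = [(rng.normal(size=(n, d)), rng.normal(size=n)) for _ in range(M)]
w = np.zeros(d)
for j in range(5):
    w = local_sgd_round(w, datasets, U_m=[3, 3, 2, 4])
print(w)
```

Heterogeneous synchronization, as studied in this paper, corresponds to letting the entries of U_m differ across workers and rounds.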
B. Preliminaries
Based on the distributed PS architecture under considera-
tion, our proposed scheme stems from a couple of local SGD
variants, which we briefly review next.
Federated Averaging (FedAvg). One pioneering work on
local updating is the so-called FedAvg. More precisely, at each
round, the PS first randomly selects a subset $\mathcal{M}^j \subseteq \mathcal{M}$
of workers to broadcast the global parameter to. Then each
selected worker $m$ performs the same number $U^j_m = U$
of local updates and sends the local parameter back to the
PS, i.e., we have $\mathcal{S}^j = \mathcal{M}^j$. In particular, when all the
workers participate in the computation in each round, i.e.,
with $\mathcal{S}^j = \mathcal{M}^j = \mathcal{M}$, we arrive at periodic-averaging
SGD (PASGD) [26].
Fig. 1. Illustration of the PS setting with $M = 3$, $\mathcal{M}^j = \{1, 2, 3\}$ and
$\mathcal{S}^j = \{1, 2\}$ at the $j$th round.
Adaptive Communication Strategy (AdaComm). Build-
ing upon PASGD, [27] proposes an adaptive communication
strategy called AdaComm in order to achieve a faster conver-
gence speed in terms of wall-clock runtime. This is achieved
by adaptively increasing the communication frequency be-
tween the PS and the workers. Note here that each worker still
performs the same number of local updates in each training
round.
FedNova. This is one of the pioneering works to investigate
heterogeneous synchronization due to the discrepancy among
the computing capabilities of machines [31], i.e., the numbers
$\{U^j_m\}_{m \in \mathcal{M}}$ of local updates across workers are not necessarily
the same. FedNova aggregates the local models in a weighted
manner to balance the heterogeneity of local updates, with the
weight inversely proportional to the number of local updates
the worker has completed.
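As an illustration of this weighting idea, the sketch below averages the local model changes with weights inversely proportional to each worker's number of local updates; it follows the description above rather than the exact normalization rule of FedNova [31], and the numerical values are placeholders.

```python
import numpy as np

def fednova_style_aggregate(w_global, local_models, local_updates):
    """Weighted aggregation with weights inversely proportional to the number of
    local updates (a simplified sketch of the idea described above, not the
    exact FedNova normalization)."""
    w_global = np.asarray(w_global, dtype=float)
    deltas = [np.asarray(w) - w_global for w in local_models]   # per-worker model change
    inv = 1.0 / np.asarray(local_updates, dtype=float)
    weights = inv / inv.sum()                                   # normalized inverse-update weights
    update = sum(wt * d for wt, d in zip(weights, deltas))
    return w_global + update

w_next = fednova_style_aggregate(
    w_global=[0.0, 0.0],
    local_models=[[0.4, 0.2], [1.0, 0.6], [0.1, 0.1]],
    local_updates=[2, 5, 1],
)
print(w_next)
```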
C. Performance Metrics
In addition to training/learning accuracy, the time con-
sumption and communication cost with the distributed learn-
ing scheme are also important performance metrics. In-
deed, the training accuracy should be evaluated as a func-
tion of time and/or communication cost to fairly gauge the
time/communication efficiency of a learning scheme. To this
end, we specifically define wall-clock runtime and the com-
munication cost as follows.
Firstly, the wall-clock runtime for each worker $m$ at the
local iteration $u$ ($u < U^j_m$) of round $j$ is defined as
the time elapsed while computing the mini-batch gradient
$\frac{1}{B} \sum_{b=1}^{B} \nabla F_m(\mathbf{w}^{j,u}_m; \xi^{j,u}_{m,b})$. It is denoted by $T^{j,u}_m$, and is
assumed to follow an exponential distribution with mean $\mu$.
We also assume that the random variable $T^{j,u}_m$ is independent
across all workers, rounds and local iterations, which is a
common assumption since workers are similar machines working