efficiency, the wall-clock time required for convergence is also
an important issue. Due to the disparity in computing capabilities across machines, their per-round computing times vary considerably, and machines that take excessively longer per round than others are commonly known as stragglers [29]. Extensive research has been devoted to the development of straggler-tolerant techniques, which can be broadly categorized into synchronous schemes and asynchronous ones.
Dealing with Stragglers with Synchronous SGD: In
synchronous parallel SGD, the PS always waits for all the
workers to finish their computation. As a result, the whole training process can be severely slowed down by potential stragglers. One way to mitigate the effect of straggling is to exploit redundancy, as in [33]–[37], where storage and computation redundancy allow the PS to wait for only a subset of workers. Beyond redundancy, [38] develops the FlexRR system, which attempts to detect and avoid stragglers. It is also well known that the convergence analysis of M-worker synchronous parallel SGD coincides with that of mini-batch SGD with the mini-batch size enlarged M times. Inspired by this observation, [9] and [39] propose K-sync SGD, where the PS waits only for the first K workers to finish their computation. In addition, [39] extends K-sync SGD to K-batch-sync SGD, in which the PS updates the parameter using the first K mini-batches that finish, so that no worker is idle in each training round.
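To make the K-sync idea concrete, the following minimal Python sketch simulates one round in which the server keeps only the gradients of the K workers that finish first. The Worker class, the scalar quadratic local losses, and the exponential compute-time model are illustrative assumptions of ours, not constructions from [9] or [39].

import numpy as np

class Worker:
    """Toy worker holding a scalar quadratic local loss 0.5 * (theta - b)^2."""
    def __init__(self, b):
        self.b = b
    def gradient(self, theta):
        return theta - self.b          # gradient of the local loss at theta

def k_sync_round(theta, workers, K, lr, rng):
    """One K-sync SGD round (sketch): the server waits only for the K
    workers that finish first and averages their gradients."""
    finish_times = rng.exponential(scale=1.0, size=len(workers))  # simulated compute times
    fastest = np.argsort(finish_times)[:K]                        # indices of the first K finishers
    grads = [workers[i].gradient(theta) for i in fastest]
    return theta - lr * np.mean(grads, axis=0)

rng = np.random.default_rng(0)
workers = [Worker(b=float(i)) for i in range(8)]   # heterogeneous local optima 0..7
theta = np.zeros(1)
for _ in range(200):
    theta = k_sync_round(theta, workers, K=3, lr=0.1, rng=rng)
print(theta)   # hovers around the mean of the b's (3.5) up to subsampling noise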
In addition to the approaches mentioned above, there is a
growing interest in allowing heterogeneous synchronization to
deal with stragglers, where the number of local updates across
workers varies in a synchronization round. Heterogeneous syn-
chronization has been studied in several works, including [30]–
[32]. In [30], dynamic communication patterns are explored, where workers are manually designated to perform different numbers of local updates under a convex global objective function. FedNova, proposed in [31], balances the heterogeneity of local updates across workers by weighting the local models with weights inversely proportional to the number of local updates each worker performs. Meanwhile, [32] investigates the impact of incomplete local updates under heterogeneous data distribution by comparing the performance of three schemes. In that work, workers are required to perform a given number of local updates; by the time of synchronization, those who have finished the task are considered to have "complete" local updates, while the rest are considered to have "incomplete" local updates. Scheme A utilizes only the complete local updates; Scheme B aggregates both complete and incomplete updates; and Scheme C is the canonical FedAvg scheme. However, none of [30]–[32] addresses the issue of fast workers being idle.
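For concreteness, the sketch below illustrates the inverse-weighting idea behind this kind of normalized aggregation: each worker's accumulated local update is divided by its own number of local steps before a data-weighted average is taken, so that workers performing many local updates do not dominate the global model. The function name, the rescaling by the effective step count, and the toy inputs are our illustrative assumptions; the exact FedNova normalization in [31] additionally covers momentum and proximal local solvers.

import numpy as np

def normalized_aggregate(theta_global, local_models, local_steps, data_fracs):
    """Sketch of aggregation with weights inversely proportional to the
    number of local updates: each worker's accumulated update is first
    normalized by its own step count, then averaged with data weights."""
    per_step_updates = [
        (theta_i - theta_global) / tau_i                 # average update per local step
        for theta_i, tau_i in zip(local_models, local_steps)
    ]
    avg_update = sum(p * u for p, u in zip(data_fracs, per_step_updates))
    tau_eff = sum(p * t for p, t in zip(data_fracs, local_steps))  # effective step count
    return theta_global + tau_eff * avg_update

# Toy usage: three workers that completed 5, 2, and 8 local steps, respectively.
theta_global = np.zeros(2)
local_models = [np.array([0.5, 0.5]), np.array([0.1, 0.3]), np.array([0.9, 0.7])]
print(normalized_aggregate(theta_global, local_models,
                           local_steps=[5, 2, 8],
                           data_fracs=[1 / 3, 1 / 3, 1 / 3]))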
Dealing with Stragglers with Asynchronous SGD: Apart
from confronting stragglers within synchronous schemes, asynchronous SGD (ASGD) has also been proposed to address this issue [40]. In ASGD, the PS updates the parameter immediately after any worker finishes its computation and uploads the result to the server. Evidently, the time spent per round in ASGD is much shorter than in synchronous SGD. Nevertheless, since the parameter version used to compute a gradient may differ from the one that gradient is eventually applied to, the trade-off between wall-clock time and gradient staleness becomes the main issue in ASGD, and a number of ASGD variants have been proposed to address it. In [41], the authors develop the Hogwild! scheme, a lock-free implementation of ASGD, and prove its effectiveness both theoretically and experimentally; [42] further improves ASGD performance by building on Hogwild! to eliminate conflicts in parallel execution. In [39], the authors revisit K-async SGD and K-batch-async SGD, first introduced in [9] and [43], respectively, and propose the AdaSync scheme to achieve the best error-runtime trade-off. Another line of research processes delayed gradients in ASGD. To this end, [44] proposes the AdaDelay algorithm, which imposes a penalty on delayed gradients, and [45] introduces the DC-ASGD scheme that compensates for the delayed gradients to make full use of all the gradients.
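As a rough illustration of how delayed gradients can be handled, the sketch below damps the step size of each asynchronously arriving gradient according to its staleness. The specific 1/(1 + staleness) penalty and the function name are assumptions for illustration only, not the exact rules of AdaDelay [44] or DC-ASGD [45].

import numpy as np

def async_update(theta, grad, t_computed, t_now, base_lr=0.1):
    """Staleness-aware asynchronous update (sketch): the PS applies each
    incoming gradient immediately, with a step size that shrinks as the
    gradient becomes more stale."""
    staleness = t_now - t_computed          # global updates since the gradient was computed
    lr = base_lr / (1.0 + staleness)        # assumed penalty: larger delay -> smaller step
    return theta - lr * grad

# Toy usage: a fresh gradient versus the same gradient arriving 4 updates late.
theta = np.ones(3)
g = np.array([0.2, -0.1, 0.4])
theta = async_update(theta, g, t_computed=10, t_now=10)   # full step
theta = async_update(theta, g, t_computed=10, t_now=14)   # heavily damped step
print(theta)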
A. Our Contributions
This paper investigates the efficiency of straggler mitigation
with heterogeneous synchronization under both homogeneous
and heterogeneous data distributions in a distributed PS-
based architecture. The main contributions of our work are
as follows:
• First, we propose a novel local SGD scheme that allows different numbers of local updates across workers while remaining robust to stragglers. Our approach builds on previous works on heterogeneous synchronization, enhancing them with straggler-tolerant techniques and with the inclusion of all effective local updates from every worker to improve both time and communication efficiency.
• Second, we provide an analysis of the average runtime, the average number of uploading workers, and the average number of local updates per round, to justify the improvement
in time and communication efficiency of our proposed
scheme, named STSyn. We also rigorously establish the
convergence of STSyn and prove that it can achieve a
sublinear convergence rate, similar to other local SGD
variants, but with a reduced convergence upper bound.
• Finally, we present experimental results to corroborate the superiority of STSyn over state-of-the-art schemes for
both homogeneous and heterogeneous data distributions
in terms of wall-clock time and communication efficiency.
We also study the influence of system hyper-parameters.
The rest of the paper is organized as follows. Section II
describes the system model. The development of the proposed
STSyn scheme is delineated in Section III. Section IV presents
the analysis of STSyn. Numerical results are provided in
Section V. Section VI concludes the work.
Notations: Boldface lowercase letters and calligraphic letters represent vectors and sets, respectively; R denotes the field of real numbers; E[·] denotes the statistical expectation; ∪ denotes the union of sets; A ⊆ B means that set A is a subset of set B; |A| denotes the number of elements in set A; ∇F represents the gradient of the function F; ∥x∥ denotes the ℓ2-norm of vector x; ⟨x, y⟩ denotes the inner product of vectors x and y.