
before updating the global model, this might result in inconsistent update schedules of the global model, compared to those of faster devices. This has two ramifications: i) for FL on i.i.d. data, it causes the gradient staleness problem and slows down the convergence rate; and, ii) on non-i.i.d. data, it results in a significant drop in the final accuracy of the global model.
As solutions, novel approaches on asynchronous FL propose weighted global aggregation techniques that take into consideration the heterogeneity of the devices [49, 5, 38]; yet, these methods often impose a heavy computation/communication burden, as they rely on broadcasting full model updates to all the clients and/or the server. Other works monitor client speed to guide the training assignments [26, 3]. Finally, recent efforts propose semi-asynchronous methods, where participating devices are selected and buffered in order to complete a semi-synchronous global update periodically [17, 48]. A thorough discussion of existing asynchronous methods in FL can be found in [50].
What is different in this work? As most algorithms stem from adapting asynchrony to synchronous FL, one still needs to broadcast the full model to all devices, following a data-parallel distributed protocol [11, 35], regardless of device heterogeneity. This inspires us to ask a key question:
"Can we select submodels out of the global model and send these instead to each device, taking into account the device heterogeneity?"
We answer this question affirmatively by proposing a novel distributed dropout method for FL, which we dub AsyncDrop. Our approach assigns a different submodel to each device (we consider both random assignments and structured assignments based on the computational power of the devices); empirically, such a strategy decreases the time required to converge to a given accuracy level, while preserving favorable final accuracy. This work attempts to reopen the discussion between synchrony and asynchrony in heterogeneous distributed scenarios, as in FL. Our idea builds on HogWild! [33, 29], in terms of sparse submodels, and on Independent Subnetwork Training (IST) [52, 10, 27, 47], where submodels are deliberately created for distribution in order to decrease both computational and communication requirements.
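To make the submodel idea concrete, the following is a minimal sketch of IST-style submodel extraction: the hidden units of a two-layer MLP are partitioned across devices, and each device receives only the corresponding slices of the weight matrices. The function names and the proportional split by device compute power are our own illustrative assumptions, not AsyncDrop's actual implementation.

```python
import numpy as np

def split_hidden_units(num_hidden, device_shares, rng):
    """Randomly partition hidden-unit indices across devices.

    device_shares: relative compute power of each device; faster devices
    receive proportionally larger submodels (an illustrative heuristic,
    not necessarily the paper's exact assignment rule).
    """
    perm = rng.permutation(num_hidden)
    shares = np.array(device_shares, dtype=float)
    counts = np.floor(shares / shares.sum() * num_hidden).astype(int)
    counts[-1] = num_hidden - counts[:-1].sum()  # give the remainder to the last device
    return np.split(perm, np.cumsum(counts)[:-1])

def extract_submodel(W1, W2, hidden_idx):
    """Slice a two-layer MLP (x -> W1 -> hidden -> W2 -> out) so that only
    the selected hidden units need to be communicated to a device."""
    return W1[hidden_idx, :].copy(), W2[:, hidden_idx].copy()

rng = np.random.default_rng(0)
d_in, d_hidden, d_out = 32, 128, 10
W1 = rng.standard_normal((d_hidden, d_in))
W2 = rng.standard_normal((d_out, d_hidden))

# Three devices with heterogeneous compute budgets.
partitions = split_hidden_units(d_hidden, device_shares=[1.0, 2.0, 4.0], rng=rng)
for i, idx in enumerate(partitions):
    sub_W1, sub_W2 = extract_submodel(W1, W2, idx)
    print(f"device {i}: W1 slice {sub_W1.shape}, W2 slice {sub_W2.shape}")
```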
Yet, we deviate from these works: i) To the best of our knowledge, the combination of HogWild! and IST ideas has not been stated and tested before this work. ii) While the HogWild! line of work provides optimization guarantees, we consider the non-trivial, non-convex neural network setting and provide theoretical guarantees for convergence; such a result combines tools from asynchronous optimization [33, 29], Neural Tangent Kernel assumptions [19], and dropout theory analysis [27], and it focuses on convolutional neural networks [24], deviating from simplified fully-connected-layer scenarios. Finally, iii) we provide system-level algorithmic solutions for our approach, mirroring best practices found during our experiments. Overall, the contributions of this work can be summarized as follows:
• We consider and propose asynchronous distributed dropout (AsyncDrop) for efficient large-scale FL. Our focus is on non-trivial, non-convex ML models, as in neural network training, and our framework provides specific engineering solutions for these cases in practice.
• We theoretically characterize and support our proposal with rigorous and non-trivial convergence rate guarantees. Currently, our theory assumes bounded delays; our future goal is to exploit recent developments that drop such assumptions [23]. Yet, our theory already considers the harder case of neural network training, which is often omitted in existing theory results.
• We provide specific implementation instances and share "best practices" for faster distributed FL in practice. As a side product, our preliminary results include baseline asynchronous implementations of many synchronous methods (such as FedAvg, FedProx, and more), which, to the best of our knowledge, do not currently exist.
2 Problem Setup and Challenges
Optimization in neural network training. We consider FL scenarios over supervised neural network training: i.e., we optimize a loss function $\ell(\cdot, \cdot)$ over a dataset, such that the model maps unseen data to their true labels, unless otherwise stated. For clarity, the loss $\ell(W, \cdot)$ encodes both the loss metric and the neural architecture, with parameters $W$. Formally, given a data distribution $\mathcal{D}$ and $\{x_i, y_i\} \sim \mathcal{D}$, where $x_i$ is a data sample and $y_i$ is its corresponding label, classical deep learning aims at finding $W^\star$ as in:
$$W^\star = \operatorname*{argmin}_{W \in \mathcal{H}} \Big( \mathcal{L}(W) := \tfrac{1}{n} \sum_{i=1}^{n} \ell(W, \{x_i, y_i\}) \Big),$$
where $\mathcal{H}$ denotes the model hypothesis class that "molds" the trainable parameters $W$.
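As a concrete instance of this objective, the following minimal sketch evaluates the empirical risk $\mathcal{L}(W)$ by averaging per-sample losses; the linear model and squared loss are our own illustrative stand-ins for the generic $\ell$ and hypothesis class $\mathcal{H}$.

```python
import numpy as np

def per_sample_loss(W, x, y):
    """Squared loss of a linear model: an illustrative choice for ell(W, {x, y})."""
    return 0.5 * (x @ W - y) ** 2

def empirical_risk(W, X, Y):
    """L(W) = (1/n) * sum_i ell(W, {x_i, y_i})."""
    n = X.shape[0]
    return sum(per_sample_loss(W, X[i], Y[i]) for i in range(n)) / n

rng = np.random.default_rng(0)
n, d = 200, 16
X = rng.standard_normal((n, d))
W_true = rng.standard_normal(d)
Y = X @ W_true + 0.1 * rng.standard_normal(n)

print("empirical risk at W = 0:", empirical_risk(np.zeros(d), X, Y))
```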
The minimization above can be achieved using different approaches, but almost all training is accomplished via a variation of stochastic gradient descent (SGD) [37]. SGD modifies the current guess $W_t$ using stochastic directions $\nabla \ell_{i_t}(W_t) := \nabla \ell(W_t, \{x_{i_t}, y_{i_t}\})$, i.e., $W_{t+1} \leftarrow W_t - \eta \nabla \ell_{i_t}(W_t)$. Here, $\eta > 0$ is the learning rate, and $i_t$ indexes a single example or a mini-batch of examples. Most FL algorithms are based on these basic stochastic motions, such as FedAvg [30], FedProx [25], FedNova [45], and SCAFFOLD [21].
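A minimal sketch of the update $W_{t+1} \leftarrow W_t - \eta \nabla \ell_{i_t}(W_t)$ is given below, again with a linear model and squared loss as our own illustrative stand-ins; the mini-batch $i_t$ is drawn uniformly at each step.

```python
import numpy as np

def minibatch_grad(W, X_batch, Y_batch):
    """Gradient of the mean squared loss of a linear model over a mini-batch."""
    residual = X_batch @ W - Y_batch
    return X_batch.T @ residual / X_batch.shape[0]

rng = np.random.default_rng(0)
n, d, eta, batch_size = 200, 16, 0.1, 32
X = rng.standard_normal((n, d))
W_true = rng.standard_normal(d)
Y = X @ W_true + 0.1 * rng.standard_normal(n)

W = np.zeros(d)
for t in range(500):
    idx = rng.choice(n, size=batch_size, replace=False)  # i_t: a mini-batch of examples
    W = W - eta * minibatch_grad(W, X[idx], Y[idx])       # W_{t+1} <- W_t - eta * grad

print("distance to W_true:", np.linalg.norm(W - W_true))
```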
FL formulation. Let $S$ be the total number of clients in a distributed FL scenario. Each client $i$ has its own local data $\mathcal{D}_i$, such that the whole dataset satisfies $\mathcal{D} = \cup_i \mathcal{D}_i$ and, usually, $\mathcal{D}_i \cap \mathcal{D}_j = \emptyset$ for all $i \neq j$. The goal of FL is to find a