Efficient and Light-Weight Federated Learning via Asynchronous Distributed Dropout

Chen Dun (Rice University), Mirian Hipolito (Microsoft Research), Chris Jermaine (Rice University), Dimitrios Dimitriadis (Microsoft Research), Anastasios Kyrillidis (Rice University)
Abstract

Asynchronous learning protocols have regained attention lately, especially in the Federated Learning (FL) setup, where slower clients can severely impede the learning process. Herein, we propose AsyncDrop, a novel asynchronous FL framework that utilizes dropout regularization to handle device heterogeneity in distributed settings. Overall, AsyncDrop achieves better performance compared to state-of-the-art asynchronous methodologies, while resulting in less communication and training time overheads. The key idea revolves around creating "submodels" out of the global model, and distributing their training to workers, based on device heterogeneity. We rigorously justify that such an approach can be theoretically characterized. We implement our approach and compare it against other asynchronous baselines, both by design and by adapting existing synchronous FL algorithms to asynchronous scenarios. Empirically, AsyncDrop reduces the communication cost and training time, while matching or improving the final test accuracy in diverse non-i.i.d. FL scenarios.
1 Introduction

Background on Federated Learning. Federated Learning (FL) [30, 25, 21] is a distributed learning protocol that has witnessed rapid development over the past half-decade. FL deviates from traditional distributed learning paradigms and allows the integration of edge devices —such as smartphones [41], drones [34], and IoT devices [31]— in the learning procedure. Yet, such real-life edge devices are extremely heterogeneous [44]: they have drastically different specifications in terms of compute power, device memory and achievable communication bandwidth. Directly applying common synchronized FL algorithms –such as FedAvg and FedProx [25, 30], which require full model broadcasting and global synchronization– often results in a "stragglers" effect [32, 17, 42]; i.e., computationally powerful edge devices wait for slower ones during the synchronization step.
The ubiquitous synchronous training. One way to handle such issues is by utilizing asynchrony instead of synchrony in the learning process. To explain the main differences, let us first set up the background. In a synchronous distributed algorithm, a global model is usually stored at a central server and is broadcast periodically to all the participating devices. Then, each device performs local training steps on its own model copy, before sending the updated model back to the central server. Finally, the central server updates the global model by aggregating the received model copies. This protocol is followed in most FL algorithms, including the well-established FedAvg [30], FedProx [25], FedNova [45] and SCAFFOLD [21]. The main criticism against synchronous learning is that it often results in heavy communication/computation overheads and long idle/waiting times for workers.
Asynchrony and its challenges. The deployment of an asynchronous learning method is often convoluted. In the past decade, HogWild! [33, 29] has emerged as a general asynchronous distributed methodology, and was initially applied to basic ML problems like sparse linear/logistic regression [54, 53, 15]. Ideally, based on sparsity arguments, each edge device can independently update parts of the global model –that overlap only slightly with the updates of other workers– in a lock-free fashion [33, 29]. This way, faster, more powerful edge workers do not suffer from idle waiting due to slower stragglers. Yet, the use of asynchrony has been a topic of dispute in distributed neural network training [7, 4]. Asynchronous training often suffers from lower accuracy compared to synchronized SGD, which has led to the dominance of synchronized SGD in neural network training [4].
Resurgence in asynchrony. Recently, asynchronous methods have regained popularity, mainly due to the interest in applying asynchronous motions within FL on edge devices: the heterogeneity of edge networks, the ephemeral nature of the workers, and the computational, communication and energy restrictions of mobile devices are some impediments towards applying synchronous algorithms in realistic environments. Yet, traditional off-the-shelf asynchronous distributed algorithms still have issues, which might be exacerbated in the FL setting. As slower devices take longer local training time before updating the global model, they may update the global model on an inconsistent schedule, compared to that of faster devices. This might have ramifications: i) for FL on i.i.d. data, this causes the gradient staleness problem and results in a slower convergence rate; and, ii) on non-i.i.d. data, this results in a significant drop in the global model's final accuracy.
As solutions, novel approaches on asynchronous FL propose weighted global aggregation techniques that take into consideration the heterogeneity of the devices [49, 5, 38]; yet, these methods often place a heavy computation/communication burden, as they rely on broadcasting full model updates to all the clients and/or the server. Other works monitor client speed to guide the training assignments [26, 3]. Finally, recent efforts propose semi-asynchronous methods, where participating devices are selected and buffered in order to complete a semi-synchronous global update periodically [17, 48]. A thorough discussion on existing asynchronous methods in FL can be found in [50].
What is different in this work? As most algorithms stem from adapting asynchrony to synchronous FL, one still needs to broadcast the full model to all devices, following a data-parallel distributed protocol [11, 35], regardless of device heterogeneity. This inspires us to ask a key question:

"Can we select submodels out of the global model and send these instead to each device, taking into account the device heterogeneity?"

We answer this question affirmatively, by proposing a novel distributed dropout method for FL. We dub our method AsyncDrop. Our approach assigns different submodels to each device¹; empirically, such a strategy decreases the time required to converge to a given accuracy level, while preserving favorable final accuracy. This work attempts to reinstitute the discussion between synchrony and asynchrony in heterogeneous distributed scenarios, as in FL. Our idea is based on the ideas of HogWild! [33, 29] –in terms of sparse submodels– and Independent Subnetwork Training (IST) [52, 10, 27, 47] –where submodels are deliberately created for distribution, in order to decrease both computational and communication requirements.

¹ We consider both random assignment, as well as structured assignments, based on the computation power of the devices.
Yet, we deviate from these works: i) The combination of HogWild! and IST ideas has not been stated and tested before this work, to the best of our knowledge. ii) While the HogWild! line of work provides optimization guarantees, we consider the non-trivial, non-convex neural network setting and provide theoretical guarantees for convergence; such a result combines tools from asynchronous optimization [33, 29], Neural Tangent Kernel assumptions [19] and dropout theory analysis [27], and focuses on convolutional neural networks [24], deviating from simplified fully-connected-layer scenarios. Finally, iii) we provide system-level algorithmic solutions for our approach, mirroring best practices found during our experiments. Overall, the contributions of this work can be summarized as follows:
• We consider and propose asynchronous distributed dropout (AsyncDrop) for efficient large-scale FL. Our focus is on non-trivial, non-convex ML models –as in neural network training– and our framework provides specific engineering solutions for these cases in practice.

• We theoretically characterize and support our proposal with rigorous and non-trivial convergence rate guarantees. Currently, our theory assumes bounded delays; our future goal is to exploit recent developments that drop such assumptions [23]. Yet, our theory already considers the harder case of neural network training, which is often omitted in existing theoretical results.

• We provide specific implementation instances and share "best practices" for faster distributed FL in practice. As a side product, our preliminary results include baseline asynchronous implementations of many synchronous methods (such as FedAvg, FedProx, and more), which do not currently exist, to the best of our knowledge.
2 Problem Setup and Challenges

Optimization in neural network training. We consider FL scenarios over supervised neural network training: i.e., we optimize a loss function ℓ(·,·) over a dataset, such that the model maps unseen data to their true labels, unless otherwise stated. For clarity, the loss ℓ(W, ·) encodes both the loss metric and the neural architecture, with parameters W. Formally, given a data distribution D and {x_i, y_i} ∼ D, where x_i is a data sample and y_i is its corresponding label, classical deep learning aims at finding W⋆ as in:

$$W^\star = \operatorname*{argmin}_{W \in \mathcal{H}} \left\{ \mathcal{L}(W) := \frac{1}{n} \sum_{i=1}^{n} \ell\big(W, \{x_i, y_i\}\big) \right\},$$

where H denotes the model hypothesis class that "molds" the trainable parameters W.
The minimization above can be achieved using different approaches, but almost all training is accomplished via a variation of stochastic gradient descent (SGD) [37]. SGD modifies the current guess W_t using stochastic directions ∇ℓ_{i_t}(W_t), where ℓ_{i_t}(W) := ℓ(W, {x_{i_t}, y_{i_t}}); i.e.,

$$W_{t+1} \leftarrow W_t - \eta \nabla \ell_{i_t}(W_t).$$

Here, η > 0 is the learning rate, and i_t indexes a single example or a mini-batch of examples. Most FL algorithms are based on these basic stochastic motions, including FedAvg [30], FedProx [25], FedNova [45] and SCAFFOLD [21].
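To make the update above concrete, here is a minimal NumPy sketch of mini-batch SGD on a toy least-squares problem; the data, model, step size and batch size are illustrative placeholders of our choosing, not the setups used in the paper.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy data: n samples, d features, with a linear ground truth.
n, d = 256, 10
X = rng.normal(size=(n, d))
y = X @ rng.normal(size=d) + 0.1 * rng.normal(size=n)

W = np.zeros(d)           # current guess W_t
eta, batch = 0.01, 32     # learning rate and mini-batch size

for t in range(200):
    idx = rng.choice(n, size=batch, replace=False)            # i_t: a mini-batch of examples
    grad = 2 * X[idx].T @ (X[idx] @ W - y[idx]) / batch       # stochastic gradient of the loss
    W = W - eta * grad                                        # W_{t+1} <- W_t - eta * grad
```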
FL formulation. Let S be the total number of clients in a distributed FL scenario. Each client i has its own local data D_i, such that the whole dataset satisfies D = ∪_i D_i, and usually D_i ∩ D_j = ∅ for i ≠ j. The goal of FL is to find a global model W that achieves good accuracy on all data D, by minimizing the following optimization problem:

$$W^\star = \operatorname*{argmin}_{W \in \mathcal{H}} \left\{ \mathcal{L}(W) := \frac{1}{S} \sum_{i=1}^{S} \ell\big(W, \mathcal{D}_i\big) \right\}, \quad \text{where} \quad \ell(W, \mathcal{D}_i) = \frac{1}{|\mathcal{D}_i|} \sum_{\{x_j, y_j\} \in \mathcal{D}_i} \ell\big(W, \{x_j, y_j\}\big).$$

With a slight abuse of notation, ℓ(W, D_i) denotes the local loss function for user i, associated with a local model W_i (not indicated above), which gets aggregated with the models of other users. Herein, we consider both i.i.d. and non-i.i.d. cases, since the local data distribution D_i can be heterogeneous and follow a non-i.i.d. distribution.
Algorithm 1 Meta Asynchronous FL

Parameters: T iterations, S clients, l local iterations, W as current global model, W_i as local model for the i-th client, α ∈ (0, 1), η step size.
W ← randomly initialized global model.
// On each client i, asynchronously:
for t = 0, ..., T − 1 do
    W_{i,t} ← W
    // Train W_{i,t} for l iterations via SGD
    for j = 1, ..., l do
        W_{i,t} ← W_{i,t} − η ∇_{W_{i,t}} L
    end for
    // Write local to global model
    W ← (1 − α) · W + α · W_{i,t}
end for
Details of asynchronous training. An abstract description of how asynchronous FL operates is provided in Algorithm 1. In particular, given a number of server iterations T, each client i gets the updated global model W_t from the server, and further trains it locally on D_i for a number of local iterations l.² Asynchronous FL assumes each client has different computation power and communication bandwidth; this can be abstracted by different wall-clock times required to finish a number of local training iterations. Thus, when client i has completed its round, the updated model is shared with the server to be aggregated, before the next round of communication and computation starts for client i. This is different from classical synchronous FL, where the global model is updated only when all participating clients finish (or time out) certain local training iterations.

² Details on the use of the optimizer, and how it is tuned with respect to step size, mini-batch size, etc., are intentionally hidden at this point of the discussion.
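As a concrete, purely illustrative picture of Algorithm 1's control flow, the following sketch simulates S asynchronous clients with Python threads and a lock-protected global model; the least-squares local objective, the mixing weight alpha, and the sleep-based heterogeneity are assumptions made for this sketch, not the paper's implementation.

```python
import threading
import time
import numpy as np

d, S, T, l = 20, 4, 10, 5        # model size, clients, rounds per client, local iterations
alpha, eta = 0.5, 0.05           # aggregation weight and step size
W = np.zeros(d)                  # global model held by the "server"
lock = threading.Lock()

# Each client holds its own local data (here: a random least-squares task).
data_rng = np.random.default_rng(0)
data = [(data_rng.normal(size=(64, d)), data_rng.normal(size=64)) for _ in range(S)]

def client(i):
    """One client running Algorithm 1's loop asynchronously."""
    global W
    rng = np.random.default_rng(100 + i)            # per-client RNG
    X, y = data[i]
    for _ in range(T):
        with lock:
            w = W.copy()                            # W_{i,t} <- W (pull global model)
        for _ in range(l):                          # l local SGD steps on local data
            idx = rng.choice(len(y), size=16, replace=False)
            w -= eta * 2 * X[idx].T @ (X[idx] @ w - y[idx]) / 16
        time.sleep(rng.uniform(0.0, 0.02))          # emulate heterogeneous device speed
        with lock:                                  # asynchronous write-back, no global barrier
            W = (1 - alpha) * W + alpha * w

threads = [threading.Thread(target=client, args=(i,)) for i in range(S)]
for th in threads:
    th.start()
for th in threads:
    th.join()
print("global model after asynchronous training:", np.round(W, 3))
```

Because writes are serialized only by the lock, fast clients may complete several rounds while a slow client is still training, which is exactly the behavior Algorithm 1 permits and the source of gradient staleness discussed next.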
3 Challenges in Asynchronous FL and Related Work

Challenges. Asynchronous steps often lead to inconsistent update schedules of the global model and are characterized by gradient staleness and drifting. Real-life FL applications include edge devices with limited communication and computation capabilities (e.g., how often and how fast they can connect with the central server, and how powerful the devices are). For instance, edge devices such as IoT devices or mobile phones [31] might only be able to communicate with the server within short time windows, due to network conditions or user permission policies.
Figure 1: Potential issues in asynchronous FL.
Consider the toy setting in Figure 1. The two clients (Clients A and B) have significantly different update schedules on the global model: here, Client A has higher computational power or communication bandwidth –compared to Client B– potentially leading to model drifting, lack of fair training and more severe gradient staleness. On top of that, consider these two clients having different local (non-i.i.d.) data distributions.
Related Work. The issue of model drifting due to data "non-iidness" is a central piece of FL research. Algorithms such as FedProx [25] utilize regularization to constrain local parameters from "drifting" away from previous global models. The gradient staleness problem has been widely studied in asynchronous FL, as in [49, 5, 38, 26, 3]. These approaches can be summarized as weighted asynchronous FedAvg protocols, in which the weight of each local client update is proportional to the "capabilities" of the client. This should decrease the negative impact of stale gradients from slower clients. Semi-asynchronous methods have been proposed [17, 48]; yet, they require fast clients to wait until all other clients' updates are completed, in order to receive the updated model for the next round of local training.

Finally, numerous quantization [2, 51] and sparsification [1, 20] techniques have been proposed for reducing computation and communication costs in FL.
4 Asynchronous Distributed Dropout

(Distributed) Dropout. Dropout [43, 39, 12, 6] is a widely-accepted regularization technique in deep learning. The procedure of Dropout is as follows: per training round, a random mask over the parameters is generated; this mask is used to nullify part of the neurons in the neural network for this particular iteration. Variants of dropout include drop-connect [43], multi-sample dropout [18], Gaussian dropout [46], and variational dropout [22].
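For reference, a minimal sketch of the masking step just described: a Bernoulli "keep" mask nullifies a random subset of units for one iteration, with the standard inverted-dropout rescaling. The keep probability and tensor shapes are arbitrary choices for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
keep_prob = 0.75                                   # i.e., a 25% dropout rate

h = rng.normal(size=(8, 16))                       # activations: batch of 8, 16 units
mask = rng.binomial(1, keep_prob, size=16)         # one Bernoulli "keep" flag per unit
h_dropped = h * mask / keep_prob                   # nullify dropped units; rescale survivors
```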
The idea of dropout has also been used in efficient distributed and/or FL scenarios. [13] introduces FjORD and Ordered Dropout, a synchronous distributed dropout technique that leads to an ordered, nested representation of knowledge in models, and enables the extraction of lower-footprint submodels without the need for retraining. Such submodels are better suited to client heterogeneity, as the submodel's width adapts to the client's capabilities. See also Nested Dropout [36] and HeteroFL [8].
Algorithm 2 Asynchronous dropout (AsyncDrop)

Parameters: T iterations, S clients, l local iterations, W as current global model, W_i as local model for the i-th client, α ∈ (0, 1), η step size.
W ← randomly initialized global model.
// On each client i, asynchronously:
for t = 0, ..., T − 1 do
    Generate mask M_{i,t}
    W_{i,t} ← W_t ⊙ M_{i,t}
    // Train W_{i,t} for l iterations via SGD
    for j = 1, ..., l do
        W_{i,t} ← W_{i,t} − η ∇_{W_{i,t}} L
    end for
    // Write local to global model
    W_{t+1} ← W_t ⊙ (M_{i,t})^c + ((1 − α) · W_t + α · W_{i,t}) ⊙ M_{i,t}
end for
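Below is a minimal sketch of one AsyncDrop round from Algorithm 2, written over a flat parameter vector for brevity (the actual method masks neurons/filters per layer); the toy least-squares objective and the function name async_drop_round are assumptions of this sketch, not the paper's code.

```python
import numpy as np

rng = np.random.default_rng(0)
d, xi, alpha, eta, l = 32, 0.75, 0.5, 0.05, 5   # xi: fraction of coordinates kept active
W = rng.normal(size=d)                          # global model (flattened for brevity)

def async_drop_round(W, X, y, rng):
    # 1) Generate a mask and extract the submodel for this client/round.
    M = rng.binomial(1, xi, size=d).astype(float)
    W_i = W * M
    # 2) Train the submodel locally for l SGD steps (squared loss on local data).
    for _ in range(l):
        idx = rng.choice(len(y), size=16, replace=False)
        grad = 2 * X[idx].T @ (X[idx] @ W_i - y[idx]) / 16
        W_i -= eta * grad * M                   # only masked-in coordinates are updated
    # 3) Masked write-back: coordinates outside the mask keep their old values.
    return W * (1 - M) + ((1 - alpha) * W + alpha * W_i) * M

X, y = rng.normal(size=(128, d)), rng.normal(size=128)
for _ in range(20):
    W = async_drop_round(W, X, y, rng)
```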
Our proposal and main hypothesis. We focus on the asynchronous version of distributed dropout. We study theoretically whether asynchrony provably works in non-trivial, non-convex scenarios –as in training neural networks– with random masks that generate submodels for each worker. The algorithm, dubbed AsyncDrop, is described in Algorithm 2 and is based on recent distributed protocols [52, 10, 27, 47]; key features are highlighted in teal-colored text. The main difference from Algorithm 1 is that Algorithm 2 splits the model vertically per iteration, where each submodel contains all layers of the neural network, but with only a (non-overlapping) subset of neurons being active in each layer. Multiple local SGD steps can be performed without the need for the workers to communicate. See Figure 2 for a schematic representation of asynchronous distributed dropout for training a CNN, and the sketch below for one way such submodels can be formed.
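To illustrate what a "vertical" split can look like, the sketch below partitions the hidden neurons of a two-layer MLP into S disjoint groups and builds one submodel per group; the layer sizes and the dictionary representation are illustrative choices of this sketch, not the paper's data structures.

```python
import numpy as np

rng = np.random.default_rng(0)
S, d_in, d_hidden, d_out = 3, 8, 12, 4

# Global two-layer MLP parameters.
W1 = rng.normal(size=(d_hidden, d_in))
W2 = rng.normal(size=(d_out, d_hidden))

# Partition the hidden neurons into S disjoint groups ("vertical" split).
perm = rng.permutation(d_hidden)
groups = np.array_split(perm, S)

submodels = []
for g in groups:
    # Each submodel keeps every layer, but only the rows/columns touching its neurons.
    submodels.append({"W1": W1[g, :].copy(),    # incoming weights of the active neurons
                      "W2": W2[:, g].copy(),    # outgoing weights of the active neurons
                      "neurons": g})

# Each worker trains its submodel independently; its updates map back to disjoint
# slices of (W1, W2), so no two workers write to the same coordinates.
```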
Theoretical Results. We are interested in understanding whether such a combination of asynchronous computing and dropout techniques leads to convergence and favorable results: given the variance introduced by both asynchronous updates and the training of submodels, it is not obvious whether –and under which conditions– such a protocol would work.

For ease of presentation and clarity of results, we analyse a one-hidden-layer CNN and show convergence with random filter dropout. Consider a training dataset $(X, y) = \{(x_i, y_i)\}_{i=1}^{n}$, where each $x_i \in \mathbb{R}^{\hat{d} \times p}$ is an image and $y_i$ is its label. Here, $\hat{d}$ is the number of input channels and $p$ the number of pixels. Let $q$ denote the size of the filter, and let $m$ be the number of filters in the first layer. Based on previous work [9], we let $\hat{\phi}(\cdot)$ denote the patching operator with $\hat{\phi}(x) \in \mathbb{R}^{q\hat{d} \times p}$. Consider the first-layer weights $W \in \mathbb{R}^{m \times q\hat{d}}$ and the second-layer (aggregation) weights $a \in \mathbb{R}^{m \times p}$. We assume that only the first-layer weights $W$ are trainable. The CNN, trained on the mean squared error, has the form:

$$f(x, W) = \left\langle a, \sigma\big(W \hat{\phi}(x)\big) \right\rangle, \qquad \mathcal{L}(W) = \|f(X, W) - y\|_2^2,$$

where $f(x, \cdot)$ denotes the output of the one-layer CNN for input $x$, and $\mathcal{L}(\cdot)$ is the loss function. We use the $\ell_2$-norm loss for simplicity.
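The following NumPy sketch instantiates this one-hidden-layer CNN with ReLU playing the role of σ, a fixed second layer a, and a simple circular patching operator standing in for φ̂; the padding scheme and all concrete dimensions are assumptions made for illustration, not the choices of the analysis.

```python
import numpy as np

rng = np.random.default_rng(0)
d_hat, p, q, m = 3, 16, 5, 32    # channels, pixels, filter size, number of filters

def patch(x):
    # phi_hat: for every pixel, stack the length-q window of each channel
    # (circular padding keeps the sketch simple); output shape (q*d_hat, p).
    cols = [np.roll(x, -j, axis=1) for j in range(q)]
    return np.concatenate(cols, axis=0)

x = rng.normal(size=(d_hat, p))                          # one "image": d_hat channels, p pixels
W = 0.1 * rng.normal(size=(m, q * d_hat))                # trainable first-layer filters
a = rng.choice([-1.0, 1.0], size=(m, p)) / np.sqrt(m)    # fixed second-layer weights

def f(x, W):
    # <a, sigma(W phi_hat(x))>: inner product of a with the ReLU feature map.
    return np.sum(a * np.maximum(W @ patch(x), 0.0))

y = 0.3                                                  # a scalar label
loss = (f(x, W) - y) ** 2                                # squared-error loss on one sample
```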
We make the following assumptions on the training data and the CNN weight initialization.

Assumption 4.1 (Training Data). Assume that for all $i \in [n]$, we have $\|x_i\|_F = \sqrt{1/2}$ and $|y_i| \le C$ for some constant $C$. Moreover, for all $i, i' \in [n]$ with $i \neq i'$, we have $x_i \neq x_{i'}$.

Note that this can be satisfied by normalizing the data. For simplicity of the analysis, let $d := q\hat{d}$.

Assumption 4.2 (Initialization). $w_{0,i} \sim \mathcal{N}(0, \kappa^2 I)$ and $a_{i,i'} \in \left\{ \pm \tfrac{1}{\sqrt{m}} \right\}$ for $i \in [m]$ and $i' \in [p]$.
In an asynchronous scenario, the neural network weights are updated with stale gradients due to the asynchronous updates; let $\delta_t$ denote the delay at training step $t$. We assume $\delta_t$ is bounded by a constant $E$. Then, a simple version of gradient descent under these assumptions looks like:

$$W_{t+1} = W_t - \eta \nabla_W \mathcal{L}\big(W_{t - \delta_t}\big), \qquad \delta_t \le E,$$

where $W_{t-\delta_t}$ indicates that the gradient is evaluated at an earlier version of the model parameters.
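As a toy illustration of this bounded-delay update, the sketch below keeps a short buffer of past iterates and applies, at each step, a gradient evaluated at a copy that is at most E steps old; the quadratic objective and the random delay model are placeholders chosen for this sketch.

```python
import numpy as np

rng = np.random.default_rng(0)
d, E, eta, T = 10, 3, 0.05, 200
A = rng.normal(size=(50, d))                 # toy least-squares problem
b = rng.normal(size=50)

def grad(W):
    return 2 * A.T @ (A @ W - b) / len(b)

W = np.zeros(d)
history = [W.copy()]                         # recent iterates, newest last
for t in range(T):
    delta = rng.integers(0, min(E, t) + 1)   # staleness delta_t <= E
    g = grad(history[-1 - delta])            # gradient at W_{t - delta_t}
    W = W - eta * g                          # W_{t+1} = W_t - eta * grad(W_{t - delta_t})
    history.append(W.copy())
    if len(history) > E + 1:                 # only the last E+1 iterates are ever needed
        history.pop(0)
```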
Figure 2: Schematic representation of AsyncDrop. a) A simple representation of a CNN model; our algorithm applies to CNNs of arbitrary depth (ResNets) as well as other architectures (MLPs, LSTMs, etc.), and we restrict to a shallow CNN here for illustration purposes. b) Per request, random sub-sampled CNN models are created, resulting in different subnetworks. c) These submodels are distributed to devices with different computational capabilities (here a GPU, a CPU, or a smartphone). d) Without loss of generality, we assume that all devices train the submodel locally for l iterations. e) However, each device finishes local training at a different timestamp (shown as differently colored arrows: red: slow; orange: moderate; blue: fast). f) Yet, the global model is asynchronously updated and new submodels are created without global synchronization. g) The above procedure is repeated till convergence.

Given the above, we provide the following guarantees:

Theorem 4.1. Let $f(\cdot, \cdot)$ be a one-hidden-layer CNN with the second-layer weights fixed. Let $u_t$ abstractly represent the output of the model after $t$ iterations, over the random selection of the masks. Let $E$ denote the maximum gradient delay/staleness. Let $\xi$ denote the dropout rate ($\xi = 1$ dictates that all neurons are selected), and denote by $\theta = 1 - (1 - \xi)^S$ the probability that a neuron is active in at least one subnetwork. Assume the number of hidden neurons satisfies $m = \max\left\{ \frac{n^4 K^2}{\lambda_0^4 \delta^2} \max\{n, d\}, \frac{n}{\lambda_0} \right\}$ and the step size satisfies $\eta = O\left(\frac{\lambda_0}{n^2}\right)$. Let $\kappa$ be a proper initialization scaling factor, considered constant, and let $\lambda_0$ denote the smallest eigenvalue of the Neural Tangent Kernel matrix. Let Assumptions 4.1 and 4.2 be satisfied. Then, the following convergence rate guarantee holds:

$$\mathbb{E}_{M_t}\left[\|u_{t+1} - y\|_2^2\right] \le \left(1 - \tfrac{\theta \eta \lambda_0}{4}\right)^{t} \|u_0 - y\|_2^2 + O\Bigg( \frac{\theta \eta \lambda_0^3 \xi^2 \kappa^2 E^2}{n^2} + \frac{\xi^2 (1-\xi)^2 \theta \eta n^3 \kappa^2 d}{\lambda_0} + \frac{2 \eta^2 \theta^2 \lambda_0 \xi^4 E^2}{m^4} + \frac{\xi^2 (1-\xi)^2 \theta^2 \eta^2 n^2 \kappa^2 d}{m^3 \lambda_0} + \frac{\xi^2 (1-\xi)^2 \theta^2 \eta^2 \kappa^2 \lambda_0 E^2}{m^3} + \frac{\xi^2 (1-\xi)^2 \theta^2 \eta^2 n^2 \kappa^2 d}{m^2 \lambda_0} + \frac{2 \theta \xi^2}{S} \Bigg).$$
Remark #1. This theorem states that the expected squared error at the $t$-th iteration (i.e., $\mathbb{E}_{M_t}[\|u_{t+1} - y\|_2^2]$) converges linearly to zero, as dictated by the contraction term $\left(1 - \tfrac{\theta\eta\lambda_0}{4}\right)^t \|u_0 - y\|_2^2$, up to an error neighborhood, denoted by the Big-Oh term on the right-hand side of the expression. Focusing on the latter, there are two types of additive errors: i) the terms carrying a factor $(1 - \xi)^2$ originate from the dropout analysis; the quantity $1 - \xi$ is often called the "dropout rate" (when $\xi = 0$, no neurons are selected and the loss hardly decreases, while when $\xi = 1$, all neurons are selected and these terms disappear); ii) the terms carrying a factor $E^2$ originate from the asynchronous analysis: when $E = 0$ (i.e., we boil down to synchronous computations), these terms also disappear.
Remark #2. Beyond the above extreme cases, we observe that the error-region terms can be controlled by algorithmic and model-design choices: e.g., as the size of the dataset $n$ increases, the first term $\frac{\theta\eta\lambda_0^3\xi^2\kappa^2 E^2}{n^2}$ shrinks; for a sufficiently wide neural network, the terms with $m$ in the denominator can be made arbitrarily small; finally, notice that increasing the number of subnetworks $S$ drives the last term in the bound to zero.
Vanilla asynchronous distributed dropout in practice. We test vanilla AsyncDrop with a 25% dropout rate in a FL setting with 104 heterogeneous clients, based on a non-i.i.d. CIFAR100 data distribution. Beyond the extensions of FL baselines to the asynchronous setting (FedAvg, Fed-Weighted-Avg and FedProx), we further extend the work in [13] into Async FjORD; further details are in Section 6. As shown in Table 1, AsyncDrop indeed shows improvements.