SWIFT: RAPID DECENTRALIZED FEDERATED
LEARNING VIA WAIT-FREE MODEL COMMUNICATION
Marco Bornstein, Tahseen Rabbani, Evan Wang, Amrit Singh Bedi, & Furong Huang
Department of Computer Science, University of Maryland
{marcob, trabbani, ezw, amritbd, furongh}@umd.edu
ABSTRACT
The decentralized Federated Learning (FL) setting avoids the role of a potentially unreliable or untrustworthy central host by utilizing groups of clients to collaboratively train a model via localized training and model/gradient sharing. Most existing decentralized FL algorithms require synchronization of client models, where the speed of synchronization depends upon the slowest client. In this work, we propose SWIFT: a novel wait-free decentralized FL algorithm that allows clients to conduct training at their own speed. Theoretically, we prove that SWIFT matches the gold-standard iteration convergence rate $\mathcal{O}(1/T)$ of parallel stochastic gradient descent for convex and non-convex smooth optimization (total iterations $T$). Furthermore, we provide theoretical results for IID and non-IID settings without any bounded-delay assumption for slow clients, which is required by other asynchronous decentralized FL algorithms. Although SWIFT achieves the same iteration convergence rate with respect to $T$ as other state-of-the-art (SOTA) parallel stochastic algorithms, it converges faster with respect to run-time due to its wait-free structure. Our experimental results demonstrate that SWIFT's run-time is reduced due to a large reduction in communication time per epoch, which falls by an order of magnitude compared to synchronous counterparts. Furthermore, SWIFT produces loss levels for image classification, over IID and non-IID data settings, upwards of 50% faster than existing SOTA algorithms.
1 INTRODUCTION
Federated Learning (FL) is an increasingly popular setting to train powerful deep neural networks with data derived from an assortment of clients. Recent research (Lian et al., 2017; Li et al., 2019; Wang & Joshi, 2018) has focused on constructing decentralized FL algorithms that overcome speed and scalability issues found within classical centralized FL (McMahan et al., 2017; Savazzi et al., 2020). While decentralized algorithms have eliminated a major bottleneck in the distributed setting, the central server, their scalability potential is still largely untapped. Many are plagued by high communication time per round (Wang et al., 2019). Shortening the communication time per round allows more clients to connect and communicate with one another, thereby increasing scalability.

Due to the synchronous nature of current decentralized FL algorithms, communication time per round, and consequently run-time, is amplified by parallelization delays. These delays are caused by the slowest client in the network. To circumvent these issues, asynchronous decentralized FL algorithms have been proposed (Lian et al., 2018; Luo et al., 2020; Liu et al., 2022; Nadiradze et al., 2021). However, these algorithms still suffer from high communication time per round. Furthermore, their communication protocols either do not propagate models well throughout the network (via gossip algorithms) or require partial synchronization. Finally, these asynchronous algorithms rely on a deterministic bounded-delay assumption, which ensures that the slowest client in the network updates at least once every $\tau$ iterations. This assumption is satisfied only under certain conditions (Abbasloo & Chao, 2020), and worsens the convergence rate by adding a sub-optimal reliance on $\tau$.
To remedy these drawbacks, we propose the Shared WaIt-Free Transmission (SWIFT) algorithm: an efficient, scalable, and high-performing decentralized FL algorithm. Unlike other decentralized
Algorithm | Iteration Convergence Rate | Client i Comm-Time Complexity | Neighborhood Avg. | Asynchronous | Private Memory
D-SGD     | O(1/T) | O(T max_{j in N_i} C_j)     | Yes | No  | Yes
PA-SGD    | O(1/T) | O(|C_s| max_{j in N_i} C_j) | Yes | No  | Yes
LD-SGD    | O(1/T) | O(|C_s| max_{j in N_i} C_j) | Yes | No  | Yes
AD-PSGD   | O(τ/T) | O(T C_i)                    | No  | Yes | No
SWIFT     | O(1/T) | O(|C_s| C_i)                | Yes | Yes | Yes

(1) Notation: total iterations $T$, communication set $\mathcal{C}_s$ ($|\mathcal{C}_s| < T$), client $i$'s neighborhood $\mathcal{N}_i$, maximal bounded delay $\tau$, and client $i$'s communication time per round $C_i$. (2) Compared to AD-PSGD, SWIFT does not have a $\tau$ term in its convergence rate because its analysis uses an expected client delay.
Table 1: Rate and complexity comparisons for decentralized FL algorithms.
FL algorithms, SWIFT obtains minimal communication time per round due to its wait-free structure. Furthermore, SWIFT is the first asynchronous decentralized FL algorithm to obtain an optimal $\mathcal{O}(1/T)$ convergence rate (aligning with stochastic gradient descent) without a bounded-delay assumption. Instead, SWIFT leverages the expected delay of each client (detailed in our remarks within Section 6). Experiments validate SWIFT's efficiency, showcasing a reduction in communication time by nearly an order of magnitude and in run-time by upwards of 50%. All the while, SWIFT remains at state-of-the-art (SOTA) global test/train loss for image classification compared to other decentralized FL algorithms. We summarize our main contributions as follows.

(1) Propose a novel wait-free decentralized FL algorithm (called SWIFT) and prove its theoretical convergence without a bounded-delay assumption.
(2) Implement a novel pre-processing algorithm to ensure non-symmetric and non-doubly stochastic communication matrices are symmetric and doubly stochastic under expectation.
(3) Provide the first theoretical client-communication error bound for non-symmetric and non-doubly stochastic communication matrices in the asynchronous setting.
(4) Demonstrate experimentally a significant reduction in communication time and run-time per epoch for CIFAR-10 classification in both IID and non-IID settings compared to synchronous decentralized FL algorithms.
2 RELATED WORKS
Asynchronous Learning. HOGWILD! (Recht et al., 2011), AsySG-Con (Lian et al., 2015), and AD-PSGD (Lian et al., 2018) are seminal examples of asynchronous algorithms that allow clients to proceed at their own pace. However, these methods require a shared memory/oracle from which clients grab the most up-to-date global parameters (e.g., the current graph-averaged gradient). By contrast, SWIFT relies on a message passing interface (MPI) to exchange parameters between neighbors, rather than interfacing with a shared memory structure. Algorithmically, each client relies on local memory to store current neighbor weights. To circumvent local memory overload, common in IoT clusters (Li et al., 2018), clients maintain a mailbox containing neighbor models: each client pulls out neighbor models one at a time to sequentially compute the desired aggregated statistics.
Decentralized Stochastic Gradient Descent (SGD). The predecessor to decentralized FL is gossip learning (Boyd et al., 2006; Hegedűs et al., 2021). Gossip learning was first introduced by the control community to assist with mean estimation of decentrally-hosted data distributions (Aysal et al., 2009; Boyd et al., 2005). Now, SGD-based gossip algorithms are used to solve large-scale machine learning tasks (Lian et al., 2015; 2018; Ghadimi et al., 2016; Nedic & Ozdaglar, 2009; Recht et al., 2011; Agarwal & Duchi, 2011). A key feature of gossip learning is the presence of a globally shared oracle/memory with which clients exchange parameters at the end of training rounds (Boyd et al., 2006). While read/write-accessible shared memory is well-suited for a single-organization ecosystem (i.e., all clients are controllable and trusted), this is unrealistic for more general edge-based paradigms. Neighborhood-based communication and aggregation algorithms, such as D-SGD (Lian et al., 2017) and PA-SGD (Li et al., 2019), can theoretically and empirically outperform their centralized counterparts, especially under heterogeneous client data distributions. Unfortunately, these algorithms suffer from synchronization slowdowns. SWIFT is asynchronous (avoiding slowdowns), utilizes neighborhood averaging, and does not require shared memory.
Communication Under Expectation. Few works in FL center on communication uncertainty. In (Ye et al., 2022), a lightweight, yet unreliable, transmission protocol is constructed in lieu of slow heavyweight protocols. A synchronous algorithm is developed to converge under expectation of an unreliable communication matrix (probabilistic link reliability). SWIFT also converges under expectation of a communication matrix, yet in a different and asynchronous setting. SWIFT is already lightweight and reliable, and our use of expectation does not regard link reliability.

Table 2: Notation Table

Definition | Notation
Global Iteration | $t$
Number of Clients | $n$
Active Client at $t$ | $i_t$
Parameter Dimension | $d$
Client $(i,j)$ Communication Coefficient at $t$ | $w^t_{i,j}$
Data Dimension | $f$
Local Model of Client $i$ at $t$ | $x^t_i \in \mathbb{R}^d$
Step-Size | $\gamma$
Local Model of Active Client at $t$ | $x^t_{i_t} \in \mathbb{R}^d$
Total SWIFT Iterations | $T$
Mini-Batch Data from Active Client at $t$ | $\xi^t_{i_t} \in \mathbb{R}^f$
Mini-Batch Size (Uniform) | $M$
Gradient of the Active Model at $t$ | $g(x^t_{i_t}, \xi^t_{i_t}) \in \mathbb{R}^d$
Communication Set | $\mathcal{C}_s$
Client $i$ Communication Vector at $t$ | $w^t_i \in \mathbb{R}^n$
Global Data Distribution | $\mathcal{D}$
Local Model Matrix at $t$ | $X^t \in \mathbb{R}^{d \times n}$
Global Objective | $f(x)$
Zero-Padded Gradient of the Active Model at $t$ | $G(x^t_{i_t}, \xi^t_{i_t}) \in \mathbb{R}^{d \times n}$
Client Influence Score | $p_i$
Expected Local Model Gradients at $t$ | $\bar{G}(X^t, \xi^t) \in \mathbb{R}^{d \times n}$
Client Influence Vector | $p \in \mathbb{R}^n$
Active Client Communication Matrix at $t$ | $W^t_{i_t} \in \mathbb{R}^{n \times n}$
One-Hot Vector | $e_i \in \mathbb{R}^n$
Expected Client Communication Matrix at $t$ | $\bar{W}^t \in \mathbb{R}^{n \times n}$
Identity Matrix | $I_n \in \mathbb{R}^{n \times n}$
Communication Efficiency. Minimizing each client $i$'s communication time per round $C_i$ is a challenge in FL, as the radius of information exchange can be large (Kairouz et al., 2021). MATCHA (Wang et al., 2019) decomposes the base network into $m$ disjoint matchings. Every epoch, a random sub-graph is generated from a combination of matchings, each having an activation probability $p_k$. Clients then exchange parameters along this sub-graph. This requires a total communication-time complexity of $\mathcal{O}(T\sum_{k=1}^m p_k \max_{j\in\mathcal{N}_i} C_j)$, where $\mathcal{N}_i$ are client $i$'s neighbors. LD-SGD (Li et al., 2019) and PA-SGD (Wang & Joshi, 2018) explore how reducing the number of neighborhood parameter exchanges affects convergence. Both algorithms create a communication set $\mathcal{C}_s$ (defined in Appendix C) that dictates when clients communicate with one another. The communication-time complexities are listed in Table 1. These methods, however, are synchronous, and their communication-time complexities depend upon the slowest neighbor $\max_{j\in\mathcal{N}_i} C_j$. SWIFT improves upon this, achieving a communication-time complexity that depends only on a client's own communication time per round. Unlike AD-PSGD (Lian et al., 2018), which achieves a similar communication-time complexity, SWIFT allows for periodic communication, uses only local memory, and does not require a bounded-delay assumption.
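To make the complexity comparison in Table 1 concrete, the short sketch below plugs hypothetical per-round communication times into the three complexity expressions. The client speeds, neighborhood, and iteration counts are invented for illustration; they are not measurements from the paper.

```python
# Hypothetical numbers only: per-round communication times C_j, a small neighborhood,
# and iteration counts chosen to illustrate the Table 1 complexity expressions.
C = {1: 0.4, 2: 0.6, 3: 2.5}              # seconds of communication per round for clients 1-3
neighbors = {1: [2, 3], 2: [1], 3: [1]}   # one-hop neighborhoods N_i
T, num_comm_rounds = 10_000, 2_500        # total iterations T and |C_s|

i = 1
sync_cost   = num_comm_rounds * max(C[j] for j in neighbors[i])  # PA-SGD / LD-SGD: O(|C_s| max_j C_j)
adpsgd_cost = T * C[i]                                           # AD-PSGD: O(T C_i)
swift_cost  = num_comm_rounds * C[i]                             # SWIFT: O(|C_s| C_i)
print(sync_cost, adpsgd_cost, swift_cost)                        # 6250.0 4000.0 1000.0
```

In this toy setting, the fast client 1 pays for its slow neighbor under synchronous periodic averaging, pays every iteration under AD-PSGD, and pays only its own cost on communication rounds under SWIFT.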
3 PROBLEM FORMULATION
Decentralized FL. In the FL setting, we have $n$ clients represented as vertices of an arbitrary communication graph $\mathcal{G}$ with vertex set $\mathcal{V} = \{1, \ldots, n\}$ and edge set $\mathcal{E} \subseteq \mathcal{V} \times \mathcal{V}$. Each client $i$ communicates with one-hop neighboring clients $j$ such that $(i, j) \in \mathcal{E}$. We denote the neighborhood for client $i$ as $\mathcal{N}_i$, and clients work in tandem to find the global model parameters $x$ by solving:

$$\min_{x\in\mathbb{R}^d} f(x) := \sum_{i=1}^n p_i f_i(x), \qquad f_i(x) := \mathbb{E}_{\xi_i\sim\mathcal{D}_i}\,\ell(x, \xi_i), \qquad \sum_{i=1}^n p_i = 1, \quad p_i \geq 0. \tag{1}$$
The global objective function $f(x)$ is the weighted average of all local objective functions $f_i(x)$. In Equation 1, $p_i$, $i \in [n]$, denotes the client influence score. This term controls the influence of client $i$ on the global consensus model, forming the client influence vector $p = \{p_i\}_{i=1}^n$. These scores also reflect the sampling probability of each client. We note that each local objective function $f_i(x)$ is the expectation of the loss function $\ell$ with respect to potentially different local data $\xi_i = \{\xi_{i,j}\}_{j=1}^M$ drawn from each client $i$'s distribution $\mathcal{D}_i$, i.e., $\xi_{i,j} \sim \mathcal{D}_i$. The total number of iterations is denoted as $T$.
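As a quick illustration of Equation 1, the sketch below evaluates a weighted global objective on a toy problem. The quadratic loss, the synthetic client data, and the influence scores are assumptions made for illustration only; they are not the losses or datasets used in the paper.

```python
import numpy as np

def local_objective(x, client_data):
    # f_i(x): empirical average of a toy quadratic loss ell(x, xi) over client i's samples
    return np.mean([(x @ xi) ** 2 for xi in client_data])

def global_objective(x, clients_data, p):
    # f(x) = sum_i p_i f_i(x), with p a valid client influence vector (Equation 1)
    assert np.isclose(sum(p), 1.0) and all(pi >= 0 for pi in p)
    return sum(pi * local_objective(x, data) for pi, data in zip(p, clients_data))

rng = np.random.default_rng(0)
clients = [rng.normal(size=(20, 5)) for _ in range(3)]  # 3 clients, 20 local samples each
p = [0.5, 0.3, 0.2]                                     # client influence scores p_i
print(global_objective(np.ones(5), clients, p))
```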
Existing Inter-Client Communication in Decentralized FL.
All clients balance their individual training with inter-client communications in order to achieve consensus while operating in a
decentralized manner. The core idea of decentralized FL is that each client communicates with
its neighbors (connected clients) and shares local information. Balancing individual training with
inter-client communication ensures individual client models are well-tailored to personal data while
remaining (i) robust to other client data, and (ii) able to converge to an optimal consensus model.
Figure 1: SWIFT schematic with $\mathcal{C}_s = \mathcal{C}_1$ (i.e., clients communicate every two local update steps). Clients 1-4 alternate local gradient updates with wait-free model communication: at counter values in the communication set ($2, 4 \in \mathcal{C}_1$; $1, 3 \notin \mathcal{C}_1$), each client fetches and averages whichever neighbor models are currently available.
Periodic Averaging. Algorithms such as Periodic Averaging SGD (PA-SGD) (Wang & Joshi, 2018) and Local Decentralized SGD (LD-SGD) reduce communication time by performing multiple local updates before synchronizing. This process is accomplished through the use of a communication set $\mathcal{C}_s$, which defines the set of iterations at which a client must perform synchronization:

$$\mathcal{C}_s = \{t \in \mathbb{N} \mid t \bmod (s+1) = 0,\; t \leq T\}. \tag{2}$$

We adopt this communication set notation, although synchronization is unneeded in our algorithm.
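For concreteness, here is a minimal sketch of the communication set in Equation 2; the values s = 1 and T = 8 are chosen only to mirror the schedule in Figure 1.

```python
def communication_set(s, T):
    # C_s = { t in N : t mod (s + 1) == 0, t <= T }   (Equation 2)
    return {t for t in range(1, T + 1) if t % (s + 1) == 0}

print(sorted(communication_set(s=1, T=8)))  # [2, 4, 6, 8]: average every second local step
print(3 in communication_set(1, 8))         # False: step 3 is a purely local update
```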
4 SHARED WAIT-FREE TRANSMISSION (SWIFT) FEDERATED LEARNING
In this section, we present the Shared WaIt-Free Transmission (SWIFT) algorithm. SWIFT is an asynchronous algorithm that allows clients to work at their own speed. Therefore, it removes the dependency on the slowest client, which is the major drawback of synchronous settings. Moreover, unlike other asynchronous algorithms, SWIFT does not require a bound on the speed of the slowest client in the network, and it allows for neighborhood averaging and periodic communication.
A SWIFT Overview. Each client $i$ runs SWIFT in parallel, first receiving an initial model $x_i$, communication set $\mathcal{C}_s$, and counter $c_i \leftarrow 1$. SWIFT is concisely summarized in the following steps:

(0) Determine client-communication weights $w_i$ via Algorithm 2.
(1) Broadcast the local model to all neighboring clients.
(2) Sample a random local data batch of size $M$.
(3) Compute the gradient update of the loss function $\ell$ with the sampled local data.
(4) Fetch and store neighboring local models, and average them with one's own local model if $c_i \in \mathcal{C}_s$.
(5) Update the local model with the computed gradient update, as well as the counter $c_i \leftarrow c_i + 1$.
(6) Repeat steps (1)-(5) until convergence.

A diagram and an algorithmic block of SWIFT are depicted in Figure 1 and Algorithm 1, respectively.
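A minimal, single-process sketch of these steps is given below. The toy quadratic gradient, the pre-filled mailbox standing in for MPI message delivery, and all parameter values are assumptions for illustration, not the authors' implementation.

```python
import numpy as np

rng = np.random.default_rng(0)
local_data = rng.normal(size=(200, 10))                 # toy local dataset for one client

def sample_batch(M):
    return local_data[rng.integers(0, len(local_data), size=M)]

def grad(x, batch):
    # toy least-squares-style gradient; stands in for the real loss gradient in step (3)
    return batch.T @ (batch @ x) / len(batch)

def fetch_available_neighbor_models(mailbox):
    return mailbox                                      # stand-in for the wait-free mailbox fetch

def swift_client(x, w_self, w, mailbox, Cs, gamma, M, num_steps):
    c = 1                                               # local update counter c_i
    for _ in range(num_steps):
        # (1) broadcast of x to neighbors is omitted in this single-process sketch
        batch = sample_batch(M)                         # (2) draw a random local mini-batch
        g = grad(x, batch)                              # (3) stochastic gradient of the loss
        if c in Cs:                                     # (4) wait-free periodic averaging
            models = fetch_available_neighbor_models(mailbox)
            x = w_self * x + sum(w[k] * m for k, m in models.items())
        x = x - gamma * g                               # (5) local gradient update
        c += 1                                          #     advance the local counter
    return x

mailbox = {2: np.zeros(10), 3: np.ones(10)}             # neighbor models already received
x = swift_client(np.ones(10), w_self=0.5, w={2: 0.25, 3: 0.25}, mailbox=mailbox,
                 Cs={2, 4, 6, 8}, gamma=0.05, M=16, num_steps=8)
print(x[:3])
```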
Active Clients, Asynchronous Iterations, and the Local-Model Matrix. Each time a client finishes a pass through steps (1)-(5), one global iteration is performed. Thus, the global iteration $t$ is increased after the completion of any client's averaging and local gradient update. The client that performs the $t$-th iteration is called the active client, and is designated as $i_t$ (Line 6 of Algorithm 1). There is only one active client per global iteration. All other client models remain unchanged during the $t$-th iteration (Line 16 of Algorithm 1). In synchronous algorithms, the global iteration $t$ increases only after all clients finish an update. SWIFT, which is asynchronous, increases the global iteration $t$ after any client finishes an update. In our analysis, for ease of notation, we define the local-model matrix $X^t \in \mathbb{R}^{d \times n}$ as the concatenation of all local client models at iteration $t$:

$$X^t := [x^t_1, \ldots, x^t_n] \in \mathbb{R}^{d \times n}. \tag{3}$$
Inspired by PA-SGD (Wang & Joshi,2018), SWIFT handles multiple local gradient steps before
averaging models amongst neighboring clients (Line 10 of Algorithm 1). Periodic averaging for
SWIFT, governed by a dynamic client-communication matrix, is detailed below.
Algorithm 1: Shared WaIt-Free Transmission (SWIFT)
Input: Vertex set $\mathcal{V}$, total steps $T$, step-size $\gamma$, client influence vector $p$, distributions of client data $\mathcal{D}_i$, communication set $\mathcal{C}_s$, batch size $M$, loss function $\ell$, and initial model $x^0$
Output: Consensus model $\frac{1}{n}\sum_{i=1}^n x^T_i$
1:  Initialize each client's local update counter $c_i \leftarrow 1$, $\forall i \in \mathcal{V}$
2:  Obtain each client's new communication vector $w^t_i$ using Algorithm 2
3:  for $t = 1, \ldots, T$ do
4:      if network topology changes then
5:          Renew each client's communication vector $w^t_i$ using Algorithm 2
6:      Randomly select an active client $i_t$ according to the client influence probability vector $p$
7:      Broadcast the active client's model $x^t_{i_t}$ to all its neighbors $\{k \mid w^t_{i_t,k} \neq 0,\, k \neq i_t\}$
8:      Sample a batch of the active client's local data $\xi^t_{i_t} := \{\xi^t_{i_t,m}\}_{m=1}^M$ from distribution $\mathcal{D}_{i_t}$
9:      Compute the gradient update: $g(x^t_{i_t}, \xi^t_{i_t}) \leftarrow \frac{1}{M}\sum_{m=1}^M \nabla\ell(x^t_{i_t}, \xi^t_{i_t,m})$
10:     if the current step falls in the predefined communication set, i.e., $c_{i_t} \in \mathcal{C}_s$ then
11:         Fetch and store the latest models $\{x^t_k\}$ from its neighbors $\{k \mid w^t_{i_t,k} \neq 0,\, k \neq i_t\}$
12:         Model average for the active client: $x^{t+1/2}_{i_t} \leftarrow \sum_{k \neq i_t} w^t_{i_t,k} x^t_k + w^t_{i_t,i_t} x^t_{i_t}$
13:     else
14:         Active client model remains the same: $x^{t+1/2}_{i_t} \leftarrow x^t_{i_t}$
15:     Model update for the active client: $x^{t+1}_{i_t} \leftarrow x^{t+1/2}_{i_t} - \gamma\, g(x^t_{i_t}; \xi^t_{i_t})$
16:     Update other clients: $x^{t+1}_j \leftarrow x^t_j$, $\forall j \neq i_t$
17:     Update the active client's counter: $c_{i_t} \leftarrow c_{i_t} + 1$
Wait-Free. The backbone of SWIFT is its wait-free structure. Unlike other decentralized FL algorithms, SWIFT does not require simultaneous averaging between two clients or across a neighborhood of clients. Instead, each client fetches the latest models its neighbors have sent it and performs averaging with those available (Lines 11-12 of Algorithm 1). There is no pause in local training to wait for a neighboring client to finish its computations or to average with it, making SWIFT wait-free.
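As an illustration of this mailbox-style, wait-free exchange, the sketch below uses mpi4py's non-blocking primitives; the message tag, the weight handling, and the helper structure are simplifying assumptions for this sketch, not the paper's actual communication code.

```python
from mpi4py import MPI

comm = MPI.COMM_WORLD
MODEL_TAG = 7  # arbitrary message tag chosen for this sketch

def broadcast_to_neighbors(x, neighbors):
    # Non-blocking sends: the sender keeps training without waiting for delivery.
    return [comm.isend(x, dest=k, tag=MODEL_TAG) for k in neighbors]

def drain_mailbox(mailbox):
    # Pull in whatever neighbor models have already arrived; never block on a slow neighbor.
    while comm.iprobe(source=MPI.ANY_SOURCE, tag=MODEL_TAG):
        status = MPI.Status()
        model = comm.recv(source=MPI.ANY_SOURCE, tag=MODEL_TAG, status=status)
        mailbox[status.Get_source()] = model  # keep only the latest model per neighbor
    return mailbox

def wait_free_average(x, w_self, w, mailbox):
    # Average with the models currently in the mailbox (cf. Lines 11-12 of Algorithm 1).
    return w_self * x + sum(w[k] * m for k, m in mailbox.items())
```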
Update Rule. SWIFT runs in parallel, with all clients performing local gradient updates and model communication simultaneously. Collectively, the update rule can be written in matrix form as

$$X^{t+1} = X^t W^t_{i_t} - \gamma\, G(x^t_{i_t}, \xi^t_{i_t}), \tag{4}$$

where $\gamma$ denotes the step-size parameter and the matrix $G(x^t_{i_t}, \xi^t_{i_t}) \in \mathbb{R}^{d \times n}$ is the zero-padded gradient of the active model $x^t_{i_t}$. The entries of $G(x^t_{i_t}, \xi^t_{i_t})$ are zero except for the $i_t$-th column, which contains the active gradient $g(x^t_{i_t}, \xi^t_{i_t})$. Next, we describe the client-communication matrix $W^t_{i_t}$.
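The matrix form of Equation 4 can be checked with a few lines of numpy; the toy dimensions and random values below are illustrative only.

```python
import numpy as np

# Toy dimensions and random values, used only to illustrate Equation 4.
d, n, gamma, i_t = 4, 3, 0.1, 1
rng = np.random.default_rng(0)
X = rng.normal(size=(d, n))          # local-model matrix X^t (one column per client)
W = np.eye(n)                        # W^t_{i_t} = I_n when c_{i_t} is not in C_s
g = rng.normal(size=d)               # stochastic gradient of the active model
G = np.zeros((d, n))
G[:, i_t] = g                        # zero-padded gradient: only column i_t is nonzero
X_next = X @ W - gamma * G           # Equation 4
# Only the active client's column changes; every other column is carried over unchanged.
assert np.allclose(np.delete(X_next, i_t, axis=1), np.delete(X, i_t, axis=1))
```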
Client-Communication Matrix. The backbone of decentralized FL algorithms is the client-communication matrix $W$ (also known as the weighting matrix). To remove all forms of synchronization and to become wait-free, SWIFT relies upon a novel client-communication matrix $W^t_{i_t}$ that is neither symmetric nor doubly stochastic, unlike other algorithms in FL (Wang & Joshi, 2018; Lian et al., 2018; Li et al., 2019; Koloskova et al., 2020). The result of a non-symmetric and non-doubly stochastic client-communication matrix is that averaging occurs for a single active client $i_t$ and not over a pair or neighborhood of clients. This curbs superfluous communication time.

Within SWIFT, a dynamic client-communication matrix is implemented to allow for periodic averaging. We now define the active client-communication matrix $W^t_{i_t}$ in SWIFT, where $i_t$ is the active client which performs the $t$-th global iteration. $W^t_{i_t}$ takes one of two forms: (1) the identity matrix $W^t_{i_t} = I_n$ if $c_{i_t} \notin \mathcal{C}_s$, or (2) a communication matrix if $c_{i_t} \in \mathcal{C}_s$, with structure

$$W^t_{i_t} := I_n + (w^t_{i_t} - e_{i_t})\, e^\top_{i_t}, \qquad w^t_{i_t} := [w^t_{1,i_t}, \ldots, w^t_{n,i_t}]^\top \in \mathbb{R}^n, \qquad \sum_{j=1}^n w_{j,i} = 1, \quad w_{i,i} \geq 1/n \;\; \forall i. \tag{5}$$
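A small numpy sketch of the communication-matrix structure in Equation 5 is shown below; the weight vector is a made-up example satisfying the column-sum and self-weight constraints.

```python
import numpy as np

n, i_t = 3, 0
w = np.array([0.5, 0.3, 0.2])        # made-up weights: sum_j w_{j,i_t} = 1 and w_{i_t,i_t} >= 1/n
e = np.zeros(n)
e[i_t] = 1.0                         # one-hot vector e_{i_t}
W = np.eye(n) + np.outer(w - e, e)   # W^t_{i_t} = I_n + (w^t_{i_t} - e_{i_t}) e_{i_t}^T
print(W)                             # only column i_t differs from the identity
print(np.allclose(W.sum(axis=0), 1.0))  # every column sums to 1, but W is not symmetric
```

Because only the $i_t$-th column differs from the identity, the product $X^t W^t_{i_t}$ mixes neighbor models into the active client's column alone, matching the single-client averaging in Line 12 of Algorithm 1.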