SWIFT: RAPID DECENTRALIZED FEDERATED
LEARNING VIA WAIT-FREE MODEL COMMUNICATION
Marco Bornstein, Tahseen Rabbani, Evan Wang, Amrit Singh Bedi, & Furong Huang
Department of Computer Science, University of Maryland
{marcob, trabbani, ezw, amritbd, furongh}@umd.edu
ABSTRACT
The decentralized Federated Learning (FL) setting avoids the role of a potentially unreliable or untrustworthy central host by utilizing groups of clients to collaboratively train a model via localized training and model/gradient sharing. Most existing decentralized FL algorithms require synchronization of client models, where the speed of synchronization depends upon the slowest client. In this work, we propose SWIFT: a novel wait-free decentralized FL algorithm that allows clients to conduct training at their own speed. Theoretically, we prove that SWIFT matches the gold-standard iteration convergence rate $\mathcal{O}(1/T)$ of parallel stochastic gradient descent for convex and non-convex smooth optimization (total iterations $T$). Furthermore, we provide theoretical results for IID and non-IID settings without any bounded-delay assumption for slow clients, which is required by other asynchronous decentralized FL algorithms. Although SWIFT achieves the same iteration convergence rate with respect to $T$ as other state-of-the-art (SOTA) parallel stochastic algorithms, it converges faster with respect to run-time due to its wait-free structure. Our experimental results demonstrate that SWIFT's run-time is reduced due to a large reduction in communication time per epoch, which falls by an order of magnitude compared to synchronous counterparts. Furthermore, SWIFT produces loss levels for image classification, over IID and non-IID data settings, upwards of 50% faster than existing SOTA algorithms.
1 INTRODUCTION
Federated Learning (FL) is an increasingly popular setting to train powerful deep neural networks with data derived from an assortment of clients. Recent research (Lian et al., 2017; Li et al., 2019; Wang & Joshi, 2018) has focused on constructing decentralized FL algorithms that overcome speed and scalability issues found within classical centralized FL (McMahan et al., 2017; Savazzi et al., 2020). While decentralized algorithms have eliminated a major bottleneck in the distributed setting, the central server, their scalability potential is still largely untapped. Many are plagued by high communication time per round (Wang et al., 2019). Shortening the communication time per round allows more clients to connect and communicate with one another, thereby increasing scalability.

Due to the synchronous nature of current decentralized FL algorithms, communication time per round, and consequently run-time, is amplified by parallelization delays. These delays are caused by the slowest client in the network. To circumvent these issues, asynchronous decentralized FL algorithms have been proposed (Lian et al., 2018; Luo et al., 2020; Liu et al., 2022; Nadiradze et al., 2021). However, these algorithms still suffer from high communication time per round. Furthermore, their communication protocols either do not propagate models well throughout the network (via gossip algorithms) or require partial synchronization. Finally, these asynchronous algorithms rely on a deterministic bounded-delay assumption, which ensures that the slowest client in the network updates at least once every $\tau$ iterations. This assumption is satisfied only under certain conditions (Abbasloo & Chao, 2020), and worsens the convergence rate by adding a sub-optimal reliance on $\tau$.
To remedy these drawbacks, we propose the Shared WaIt-Free Transmission (SWIFT) algorithm: an efficient, scalable, and high-performing decentralized FL algorithm. Unlike other decentralized
Algorithm | Iteration Convergence Rate | Client i Comm-Time Complexity | Neighborhood Avg. | Asynchronous | Private Memory
D-SGD     | O(1/T) | O(T max_{j in N_i} C_j)     | Yes | No  | Yes
PA-SGD    | O(1/T) | O(|C_s| max_{j in N_i} C_j) | Yes | No  | Yes
LD-SGD    | O(1/T) | O(|C_s| max_{j in N_i} C_j) | Yes | No  | Yes
AD-PSGD   | O(τ/T) | O(T C_i)                    | No  | Yes | No
SWIFT     | O(1/T) | O(|C_s| C_i)                | Yes | Yes | Yes

(1) Notation: total iterations $T$, communication set $\mathcal{C}_s$ ($|\mathcal{C}_s| < T$), client $i$'s neighborhood $\mathcal{N}_i$, maximal bounded delay $\tau$, and client $i$'s communication time per round $C_i$. (2) Compared to AD-PSGD, SWIFT does not have a $\tau$ term in its convergence rate because its analysis uses an expected client delay.
Table 1: Rate and complexity comparisons for decentralized FL algorithms.
FL algorithms, SWIFT obtains minimal communication time per round due to its wait-free structure. Furthermore, SWIFT is the first asynchronous decentralized FL algorithm to obtain an optimal $\mathcal{O}(1/T)$ convergence rate (aligning with stochastic gradient descent) without a bounded-delay assumption. Instead, SWIFT leverages the expected delay of each client (detailed in our remarks within Section 6). Experiments validate SWIFT's efficiency, showcasing a reduction in communication time by nearly an order of magnitude and in run-time by upwards of 50%. All the while, SWIFT remains at state-of-the-art (SOTA) global test/train loss for image classification compared to other decentralized FL algorithms. We summarize our main contributions as follows.

(1) Propose a novel wait-free decentralized FL algorithm (called SWIFT) and prove its theoretical convergence without a bounded-delay assumption.
(2) Implement a novel pre-processing algorithm to ensure non-symmetric and non-doubly stochastic communication matrices are symmetric and doubly stochastic under expectation.
(3) Provide the first theoretical client-communication error bound for non-symmetric and non-doubly stochastic communication matrices in the asynchronous setting.
(4) Demonstrate experimentally a significant reduction in communication time and run-time per epoch for CIFAR-10 classification in both IID and non-IID settings compared to synchronous decentralized FL algorithms.
2 RELATED WORKS
Asynchronous Learning. HOGWILD! (Recht et al., 2011), AsySG-Con (Lian et al., 2015), and AD-PSGD (Lian et al., 2018) are seminal examples of asynchronous algorithms that allow clients to proceed at their own pace. However, these methods require a shared memory/oracle from which clients grab the most up-to-date global parameters (e.g., the current graph-averaged gradient). By contrast, SWIFT relies on a message passing interface (MPI) to exchange parameters between neighbors, rather than interfacing with a shared memory structure. Algorithmically, each client relies on local memory to store current neighbor weights. To circumvent local memory overload, common in IoT clusters (Li et al., 2018), clients maintain a mailbox containing neighbor models: each client pulls out neighbor models one at a time to sequentially compute the desired aggregated statistics.
Decentralized Stochastic Gradient Descent (SGD). The predecessor to decentralized FL is gossip learning (Boyd et al., 2006; Hegedűs et al., 2021). Gossip learning was first introduced by the control community to assist with mean estimation of decentrally-hosted data distributions (Aysal et al., 2009; Boyd et al., 2005). Now, SGD-based gossip algorithms are used to solve large-scale machine learning tasks (Lian et al., 2015; 2018; Ghadimi et al., 2016; Nedic & Ozdaglar, 2009; Recht et al., 2011; Agarwal & Duchi, 2011). A key feature of gossip learning is the presence of a globally shared oracle/memory with which clients exchange parameters at the end of training rounds (Boyd et al., 2006). While read/write-accessible shared memory is well-suited for a single-organization ecosystem (i.e., all clients are controllable and trusted), this is unrealistic for more general edge-based paradigms. Neighborhood-based communication and aggregation algorithms, such as D-SGD (Lian et al., 2017) and PA-SGD (Li et al., 2019), can theoretically and empirically outperform their centralized counterparts, especially under heterogeneous client data distributions. Unfortunately, these algorithms suffer from synchronization slowdowns. SWIFT is asynchronous (avoiding slowdowns), utilizes neighborhood averaging, and does not require shared memory.
Communication Under Expectation. Few works in FL center on communication uncertainty. In (Ye et al., 2022), a lightweight, yet unreliable, transmission protocol is constructed in lieu of slow heavyweight protocols. A synchronous algorithm is developed to converge under expectation of an unreliable communication matrix (probabilistic link reliability). SWIFT also converges under expectation of a communication matrix, yet in a different and asynchronous setting. SWIFT is already lightweight and reliable, and our use of expectation does not regard link reliability.

Table 2: Notation Table

Definition | Notation
Global Iteration | $t$
Number of Clients | $n$
Active Client at $t$ | $i_t$
Parameter Dimension | $d$
Client $(i,j)$ Communication Coefficient at $t$ | $w^t_{i,j}$
Data Dimension | $f$
Local Model of Client $i$ at $t$ | $x^t_i \in \mathbb{R}^d$
Step-Size | $\gamma$
Local Model of Active Client at $t$ | $x^t_{i_t} \in \mathbb{R}^d$
Total SWIFT Iterations | $T$
Mini-Batch Data from Active Client at $t$ | $\xi^t_{i_t} \in \mathbb{R}^f$
Mini-Batch Size (Uniform) | $M$
Gradient of the Active Model at $t$ | $g(x^t_{i_t}, \xi^t_{i_t}) \in \mathbb{R}^d$
Communication Set | $\mathcal{C}_s$
Client $i$ Communication Vector at $t$ | $w^t_i \in \mathbb{R}^n$
Global Data Distribution | $\mathcal{D}$
Local Model Matrix at $t$ | $X^t \in \mathbb{R}^{d \times n}$
Global Objective | $f(x)$
Zero-Padded Gradient of the Active Model at $t$ | $G(x^t_{i_t}, \xi^t_{i_t}) \in \mathbb{R}^{d \times n}$
Client Influence Score | $p_i$
Expected Local Model Gradients at $t$ | $\bar{G}(X^t, \xi^t) \in \mathbb{R}^{d \times n}$
Client Influence Vector | $p \in \mathbb{R}^n$
Active Client Communication Matrix at $t$ | $W^t_{i_t} \in \mathbb{R}^{n \times n}$
One-Hot Vector | $e_i \in \mathbb{R}^n$
Expected Client Communication Matrix at $t$ | $\bar{W}^t \in \mathbb{R}^{n \times n}$
Identity Matrix | $I_n \in \mathbb{R}^{n \times n}$
Communication Efficiency. Minimizing each client $i$'s communication time per round $C_i$ is a challenge in FL, as the radius of information exchange can be large (Kairouz et al., 2021). MATCHA (Wang et al., 2019) decomposes the base network into $m$ disjoint matchings. Every epoch, a random sub-graph is generated from a combination of matchings, each having an activation probability $p_k$. Clients then exchange parameters along this sub-graph. This requires a total communication-time complexity of $\mathcal{O}(T\sum_{k=1}^m p_k \max_{j\in\mathcal{N}_i} C_j)$, where $\mathcal{N}_i$ are client $i$'s neighbors. LD-SGD (Li et al., 2019) and PA-SGD (Wang & Joshi, 2018) explore how reducing the number of neighborhood parameter exchanges affects convergence. Both algorithms create a communication set $\mathcal{C}_s$ (defined in Appendix C) that dictates when clients communicate with one another. The communication-time complexities are listed in Table 1. These methods, however, are synchronous, and their communication-time complexities depend upon the slowest neighbor $\max_{j\in\mathcal{N}_i} C_j$. SWIFT improves upon this, achieving a communication-time complexity that depends only on a client's own communication time per round. Unlike AD-PSGD (Lian et al., 2018), which achieves a similar communication-time complexity, SWIFT allows for periodic communication, uses only local memory, and does not require a bounded-delay assumption.
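To make the complexity comparison in Table 1 concrete, the short sketch below plugs hypothetical per-round communication times into the three complexity expressions. The client speeds, neighborhood, and iteration counts are invented for illustration; they are not measurements from the paper.

```python
# Hypothetical numbers only: per-round communication times C_j, a small neighborhood,
# and iteration counts chosen to illustrate the Table 1 complexity expressions.
C = {1: 0.4, 2: 0.6, 3: 2.5}              # seconds of communication per round for clients 1-3
neighbors = {1: [2, 3], 2: [1], 3: [1]}   # one-hop neighborhoods N_i
T, num_comm_rounds = 10_000, 2_500        # total iterations T and |C_s|

i = 1
sync_cost   = num_comm_rounds * max(C[j] for j in neighbors[i])  # PA-SGD / LD-SGD: O(|C_s| max_j C_j)
adpsgd_cost = T * C[i]                                           # AD-PSGD: O(T C_i)
swift_cost  = num_comm_rounds * C[i]                             # SWIFT: O(|C_s| C_i)
print(sync_cost, adpsgd_cost, swift_cost)                        # 6250.0 4000.0 1000.0
```

In this toy setting, the fast client 1 pays for its slow neighbor under synchronous periodic averaging, pays every iteration under AD-PSGD, and pays only its own cost on communication rounds under SWIFT.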
3 PROBLEM FORMULATION
Decentralized FL. In the FL setting, we have $n$ clients represented as vertices of an arbitrary communication graph $\mathcal{G}$ with vertex set $\mathcal{V} = \{1, \ldots, n\}$ and edge set $\mathcal{E} \subseteq \mathcal{V} \times \mathcal{V}$. Each client $i$ communicates with one-hop neighboring clients $j$ such that $(i, j) \in \mathcal{E}$. We denote the neighborhood for client $i$ as $\mathcal{N}_i$, and clients work in tandem to find the global model parameters $x$ by solving:

$$\min_{x\in\mathbb{R}^d} f(x) := \sum_{i=1}^n p_i f_i(x), \qquad f_i(x) := \mathbb{E}_{\xi_i\sim\mathcal{D}_i}\,\ell(x, \xi_i), \qquad \sum_{i=1}^n p_i = 1, \quad p_i \geq 0. \tag{1}$$
The global objective function $f(x)$ is the weighted average of all local objective functions $f_i(x)$. In Equation 1, $p_i$, $i \in [n]$, denotes the client influence score. This term controls the influence of client $i$ on the global consensus model, forming the client influence vector $p = \{p_i\}_{i=1}^n$. These scores also reflect the sampling probability of each client. We note that each local objective function $f_i(x)$ is the expectation of the loss function $\ell$ with respect to potentially different local data $\xi_i = \{\xi_{i,j}\}_{j=1}^M$ drawn from each client $i$'s distribution $\mathcal{D}_i$, i.e., $\xi_{i,j} \sim \mathcal{D}_i$. The total number of iterations is denoted as $T$.
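As a quick illustration of Equation 1, the sketch below evaluates a weighted global objective on a toy problem. The quadratic loss, the synthetic client data, and the influence scores are assumptions made for illustration only; they are not the losses or datasets used in the paper.

```python
import numpy as np

def local_objective(x, client_data):
    # f_i(x): empirical average of a toy quadratic loss ell(x, xi) over client i's samples
    return np.mean([(x @ xi) ** 2 for xi in client_data])

def global_objective(x, clients_data, p):
    # f(x) = sum_i p_i f_i(x), with p a valid client influence vector (Equation 1)
    assert np.isclose(sum(p), 1.0) and all(pi >= 0 for pi in p)
    return sum(pi * local_objective(x, data) for pi, data in zip(p, clients_data))

rng = np.random.default_rng(0)
clients = [rng.normal(size=(20, 5)) for _ in range(3)]  # 3 clients, 20 local samples each
p = [0.5, 0.3, 0.2]                                     # client influence scores p_i
print(global_objective(np.ones(5), clients, p))
```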
Existing Inter-Client Communication in Decentralized FL.
All clients balance their individual training with inter-client communications in order to achieve consensus while operating in a
decentralized manner. The core idea of decentralized FL is that each client communicates with
its neighbors (connected clients) and shares local information. Balancing individual training with
inter-client communication ensures individual client models are well-tailored to personal data while
remaining (i) robust to other client data, and (ii) able to converge to an optimal consensus model.
Figure 1: SWIFT schematic with $\mathcal{C}_s = \mathcal{C}_1$ (i.e., clients communicate every two local update steps). Clients 1-4 alternate local gradient updates with wait-free model communication: at counter values in the communication set ($2, 4 \in \mathcal{C}_1$; $1, 3 \notin \mathcal{C}_1$), each client fetches and averages whichever neighbor models are currently available.
Periodic Averaging. Algorithms such as Periodic Averaging SGD (PA-SGD) (Wang & Joshi, 2018) and Local Decentralized SGD (LD-SGD) reduce communication time by performing multiple local updates before synchronizing. This process is accomplished through the use of a communication set $\mathcal{C}_s$, which defines the set of iterations at which a client must perform synchronization:

$$\mathcal{C}_s = \{t \in \mathbb{N} \mid t \bmod (s+1) = 0,\; t \leq T\}. \tag{2}$$

We adopt this communication set notation, although synchronization is unneeded in our algorithm.
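For concreteness, here is a minimal sketch of the communication set in Equation 2; the values s = 1 and T = 8 are chosen only to mirror the schedule in Figure 1.

```python
def communication_set(s, T):
    # C_s = { t in N : t mod (s + 1) == 0, t <= T }   (Equation 2)
    return {t for t in range(1, T + 1) if t % (s + 1) == 0}

print(sorted(communication_set(s=1, T=8)))  # [2, 4, 6, 8]: average every second local step
print(3 in communication_set(1, 8))         # False: step 3 is a purely local update
```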
4 SHARED WAIT-FREE TRANSMISSION (SWIFT) FEDERATED LEARNING
In this section, we present the Shared WaIt-Free Transmission (SWIFT) algorithm. SWIFT is an asynchronous algorithm that allows clients to work at their own speed. Therefore, it removes the dependency on the slowest client, which is the major drawback of synchronous settings. Moreover, unlike other asynchronous algorithms, SWIFT does not require a bound on the speed of the slowest client in the network, and it allows for neighborhood averaging and periodic communication.
A SWIFT Overview. Each client $i$ runs SWIFT in parallel, first receiving an initial model $x_i$, communication set $\mathcal{C}_s$, and counter $c_i \leftarrow 1$. SWIFT is concisely summarized in the following steps:

(0) Determine client-communication weights $w_i$ via Algorithm 2.
(1) Broadcast the local model to all neighboring clients.
(2) Sample a random local data batch of size $M$.
(3) Compute the gradient update of the loss function $\ell$ with the sampled local data.
(4) Fetch and store neighboring local models, and average them with one's own local model if $c_i \in \mathcal{C}_s$.
(5) Update the local model with the computed gradient update, as well as the counter $c_i \leftarrow c_i + 1$.
(6) Repeat steps (1)-(5) until convergence.

A diagram and an algorithmic block of SWIFT are depicted in Figure 1 and Algorithm 1, respectively.
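A minimal, single-process sketch of these steps is given below. The toy quadratic gradient, the pre-filled mailbox standing in for MPI message delivery, and all parameter values are assumptions for illustration, not the authors' implementation.

```python
import numpy as np

rng = np.random.default_rng(0)
local_data = rng.normal(size=(200, 10))                 # toy local dataset for one client

def sample_batch(M):
    return local_data[rng.integers(0, len(local_data), size=M)]

def grad(x, batch):
    # toy least-squares-style gradient; stands in for the real loss gradient in step (3)
    return batch.T @ (batch @ x) / len(batch)

def fetch_available_neighbor_models(mailbox):
    return mailbox                                      # stand-in for the wait-free mailbox fetch

def swift_client(x, w_self, w, mailbox, Cs, gamma, M, num_steps):
    c = 1                                               # local update counter c_i
    for _ in range(num_steps):
        # (1) broadcast of x to neighbors is omitted in this single-process sketch
        batch = sample_batch(M)                         # (2) draw a random local mini-batch
        g = grad(x, batch)                              # (3) stochastic gradient of the loss
        if c in Cs:                                     # (4) wait-free periodic averaging
            models = fetch_available_neighbor_models(mailbox)
            x = w_self * x + sum(w[k] * m for k, m in models.items())
        x = x - gamma * g                               # (5) local gradient update
        c += 1                                          #     advance the local counter
    return x

mailbox = {2: np.zeros(10), 3: np.ones(10)}             # neighbor models already received
x = swift_client(np.ones(10), w_self=0.5, w={2: 0.25, 3: 0.25}, mailbox=mailbox,
                 Cs={2, 4, 6, 8}, gamma=0.05, M=16, num_steps=8)
print(x[:3])
```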
Active Clients, Asynchronous Iterations, and the Local-Model Matrix. Each time a client finishes a pass through steps (1)-(5), one global iteration is performed. Thus, the global iteration $t$ is increased after the completion of any client's averaging and local gradient update. The client that performs the $t$-th iteration is called the active client, and is designated as $i_t$ (Line 6 of Algorithm 1). There is only one active client per global iteration. All other client models remain unchanged during the $t$-th iteration (Line 16 of Algorithm 1). In synchronous algorithms, the global iteration $t$ increases only after all clients finish an update. SWIFT, which is asynchronous, increases the global iteration $t$ after any client finishes an update. In our analysis, for ease of notation, we define the local-model matrix $X^t \in \mathbb{R}^{d \times n}$ as the concatenation of all local client models at iteration $t$:

$$X^t := [x^t_1, \ldots, x^t_n] \in \mathbb{R}^{d \times n}. \tag{3}$$
Inspired by PA-SGD (Wang & Joshi,2018), SWIFT handles multiple local gradient steps before
averaging models amongst neighboring clients (Line 10 of Algorithm 1). Periodic averaging for
SWIFT, governed by a dynamic client-communication matrix, is detailed below.
Algorithm 1: Shared WaIt-Free Transmission (SWIFT)
Input: Vertex set $\mathcal{V}$, total steps $T$, step-size $\gamma$, client influence vector $p$, distributions of client data $\mathcal{D}_i$, communication set $\mathcal{C}_s$, batch size $M$, loss function $\ell$, and initial model $x^0$
Output: Consensus model $\frac{1}{n}\sum_{i=1}^n x^T_i$
1:  Initialize each client's local update counter $c_i \leftarrow 1$, $\forall i \in \mathcal{V}$
2:  Obtain each client's new communication vector $w^t_i$ using Algorithm 2
3:  for $t = 1, \ldots, T$ do
4:      if network topology changes then
5:          Renew each client's communication vector $w^t_i$ using Algorithm 2
6:      Randomly select an active client $i_t$ according to the client influence probability vector $p$
7:      Broadcast the active client's model $x^t_{i_t}$ to all its neighbors $\{k \mid w^t_{i_t,k} \neq 0,\, k \neq i_t\}$
8:      Sample a batch of the active client's local data $\xi^t_{i_t} := \{\xi^t_{i_t,m}\}_{m=1}^M$ from distribution $\mathcal{D}_{i_t}$
9:      Compute the gradient update: $g(x^t_{i_t}, \xi^t_{i_t}) \leftarrow \frac{1}{M}\sum_{m=1}^M \nabla\ell(x^t_{i_t}, \xi^t_{i_t,m})$
10:     if the current step falls in the predefined communication set, i.e., $c_{i_t} \in \mathcal{C}_s$ then
11:         Fetch and store the latest models $\{x^t_k\}$ from its neighbors $\{k \mid w^t_{i_t,k} \neq 0,\, k \neq i_t\}$
12:         Model average for the active client: $x^{t+1/2}_{i_t} \leftarrow \sum_{k \neq i_t} w^t_{i_t,k} x^t_k + w^t_{i_t,i_t} x^t_{i_t}$
13:     else
14:         Active client model remains the same: $x^{t+1/2}_{i_t} \leftarrow x^t_{i_t}$
15:     Model update for the active client: $x^{t+1}_{i_t} \leftarrow x^{t+1/2}_{i_t} - \gamma\, g(x^t_{i_t}; \xi^t_{i_t})$
16:     Update other clients: $x^{t+1}_j \leftarrow x^t_j$, $\forall j \neq i_t$
17:     Update the active client's counter: $c_{i_t} \leftarrow c_{i_t} + 1$
Wait-Free. The backbone of SWIFT is its wait-free structure. Unlike other decentralized FL algorithms, SWIFT does not require simultaneous averaging between two clients or across a neighborhood of clients. Instead, each client fetches the latest models its neighbors have sent it and performs averaging with those available (Lines 11-12 of Algorithm 1). There is no pause in local training to wait for a neighboring client to finish its computations or to average with it, making SWIFT wait-free.
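As an illustration of this mailbox-style, wait-free exchange, the sketch below uses mpi4py's non-blocking primitives; the message tag, the weight handling, and the helper structure are simplifying assumptions for this sketch, not the paper's actual communication code.

```python
from mpi4py import MPI

comm = MPI.COMM_WORLD
MODEL_TAG = 7  # arbitrary message tag chosen for this sketch

def broadcast_to_neighbors(x, neighbors):
    # Non-blocking sends: the sender keeps training without waiting for delivery.
    return [comm.isend(x, dest=k, tag=MODEL_TAG) for k in neighbors]

def drain_mailbox(mailbox):
    # Pull in whatever neighbor models have already arrived; never block on a slow neighbor.
    while comm.iprobe(source=MPI.ANY_SOURCE, tag=MODEL_TAG):
        status = MPI.Status()
        model = comm.recv(source=MPI.ANY_SOURCE, tag=MODEL_TAG, status=status)
        mailbox[status.Get_source()] = model  # keep only the latest model per neighbor
    return mailbox

def wait_free_average(x, w_self, w, mailbox):
    # Average with the models currently in the mailbox (cf. Lines 11-12 of Algorithm 1).
    return w_self * x + sum(w[k] * m for k, m in mailbox.items())
```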
Update Rule. SWIFT runs in parallel, with all clients performing local gradient updates and model communication simultaneously. Collectively, the update rule can be written in matrix form as

$$X^{t+1} = X^t W^t_{i_t} - \gamma\, G(x^t_{i_t}, \xi^t_{i_t}), \tag{4}$$

where $\gamma$ denotes the step-size parameter and the matrix $G(x^t_{i_t}, \xi^t_{i_t}) \in \mathbb{R}^{d \times n}$ is the zero-padded gradient of the active model $x^t_{i_t}$. The entries of $G(x^t_{i_t}, \xi^t_{i_t})$ are zero except for the $i_t$-th column, which contains the active gradient $g(x^t_{i_t}, \xi^t_{i_t})$. Next, we describe the client-communication matrix $W^t_{i_t}$.
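The matrix form of Equation 4 can be checked with a few lines of numpy; the toy dimensions and random values below are illustrative only.

```python
import numpy as np

# Toy dimensions and random values, used only to illustrate Equation 4.
d, n, gamma, i_t = 4, 3, 0.1, 1
rng = np.random.default_rng(0)
X = rng.normal(size=(d, n))          # local-model matrix X^t (one column per client)
W = np.eye(n)                        # W^t_{i_t} = I_n when c_{i_t} is not in C_s
g = rng.normal(size=d)               # stochastic gradient of the active model
G = np.zeros((d, n))
G[:, i_t] = g                        # zero-padded gradient: only column i_t is nonzero
X_next = X @ W - gamma * G           # Equation 4
# Only the active client's column changes; every other column is carried over unchanged.
assert np.allclose(np.delete(X_next, i_t, axis=1), np.delete(X, i_t, axis=1))
```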
Client-Communication Matrix. The backbone of decentralized FL algorithms is the client-communication matrix $W$ (also known as the weighting matrix). To remove all forms of synchronization and to become wait-free, SWIFT relies upon a novel client-communication matrix $W^t_{i_t}$ that is neither symmetric nor doubly stochastic, unlike other algorithms in FL (Wang & Joshi, 2018; Lian et al., 2018; Li et al., 2019; Koloskova et al., 2020). The result of a non-symmetric and non-doubly stochastic client-communication matrix is that averaging occurs for a single active client $i_t$ and not over a pair or neighborhood of clients. This curbs superfluous communication time.

Within SWIFT, a dynamic client-communication matrix is implemented to allow for periodic averaging. We now define the active client-communication matrix $W^t_{i_t}$ in SWIFT, where $i_t$ is the active client which performs the $t$-th global iteration. $W^t_{i_t}$ takes one of two forms: (1) the identity matrix $W^t_{i_t} = I_n$ if $c_{i_t} \notin \mathcal{C}_s$, or (2) a communication matrix if $c_{i_t} \in \mathcal{C}_s$, with structure

$$W^t_{i_t} := I_n + (w^t_{i_t} - e_{i_t})\, e^\top_{i_t}, \qquad w^t_{i_t} := [w^t_{1,i_t}, \ldots, w^t_{n,i_t}]^\top \in \mathbb{R}^n, \qquad \sum_{j=1}^n w_{j,i} = 1, \quad w_{i,i} \geq 1/n \;\; \forall i. \tag{5}$$
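A small numpy sketch of the communication-matrix structure in Equation 5 is shown below; the weight vector is a made-up example satisfying the column-sum and self-weight constraints.

```python
import numpy as np

n, i_t = 3, 0
w = np.array([0.5, 0.3, 0.2])        # made-up weights: sum_j w_{j,i_t} = 1 and w_{i_t,i_t} >= 1/n
e = np.zeros(n)
e[i_t] = 1.0                         # one-hot vector e_{i_t}
W = np.eye(n) + np.outer(w - e, e)   # W^t_{i_t} = I_n + (w^t_{i_t} - e_{i_t}) e_{i_t}^T
print(W)                             # only column i_t differs from the identity
print(np.allclose(W.sum(axis=0), 1.0))  # every column sums to 1, but W is not symmetric
```

Because only the $i_t$-th column differs from the identity, the product $X^t W^t_{i_t}$ mixes neighbor models into the active client's column alone, matching the single-client averaging in Line 12 of Algorithm 1.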