Efficient and Light-Weight Federated Learning via Asynchronous Distributed Dropout

Chen Dun (Rice University), Mirian Hipolito (Microsoft Research), Chris Jermaine (Rice University), Dimitrios Dimitriadis (Microsoft Research), Anastasios Kyrillidis (Rice University)
Abstract

Asynchronous learning protocols have regained attention lately, especially in the Federated Learning (FL) setup, where slower clients can severely impede the learning process. Herein, we propose AsyncDrop, a novel asynchronous FL framework that utilizes dropout regularization to handle device heterogeneity in distributed settings. Overall, AsyncDrop achieves better performance compared to state-of-the-art asynchronous methodologies, while resulting in less communication and training time overheads. The key idea revolves around creating "submodels" out of the global model, and distributing their training to workers, based on device heterogeneity. We rigorously justify that such an approach can be theoretically characterized. We implement our approach and compare it against other asynchronous baselines, both by design and by adapting existing synchronous FL algorithms to asynchronous scenarios. Empirically, AsyncDrop reduces the communication cost and training time, while matching or improving the final test accuracy in diverse non-i.i.d. FL scenarios.
1 Introduction

Background on Federated Learning. Federated Learning (FL) [30, 25, 21] is a distributed learning protocol that has witnessed rapid development over the past half-decade. FL deviates from traditional distributed learning paradigms and allows the integration of edge devices —such as smartphones [41], drones [34], and IoT devices [31]— in the learning procedure. Yet, such real-life edge devices are extremely heterogeneous [44]: they have drastically different specifications in terms of compute power, device memory and achievable communication bandwidth. Directly applying common synchronized FL algorithms –such as FedAvg and FedProx [25, 30], which require full model broadcasting and global synchronization– often results in a "stragglers" effect [32, 17, 42]; i.e., computationally powerful edge devices wait for slower ones during the synchronization step.
The ubiquitous synchronous training. One way to handle such issues is by utilizing asynchrony instead of synchrony in the learning process. To explain the main differences, let us first set up the background. In a synchronous distributed algorithm, a global model is usually stored at a central server and is broadcast periodically to all the participating devices. Then, each device performs local training steps on its own model copy, before sending the updated model back to the central server. Finally, the central server updates the global model by aggregating the received model copies. This protocol is followed in most FL algorithms, including the well-established FedAvg [30], FedProx [25], FedNova [45] and SCAFFOLD [21]. The main criticism against synchronous learning is that it often results in heavy communication/computation overheads and long idle/waiting times for workers.
Asynchrony and its challenges. The deployment of an asynchronous learning method is often convoluted. In the past decade, HogWild! [33, 29] has emerged as a general asynchronous distributed methodology, and was initially applied to basic ML problems like sparse linear/logistic regression [54, 53, 15]. Ideally, based on sparsity arguments, each edge device can independently update parts of the global model –that overlap only slightly with the updates of other workers– in a lock-free fashion [33, 29]. This way, faster, more powerful edge workers do not suffer from idle waiting due to slower stragglers. Yet, the use of asynchrony has been a topic of dispute in distributed neural network training [7, 4]. Asynchronous training often suffers from lower accuracy compared to synchronized SGD, which has led to the dominance of synchronized SGD in neural network training [4].
Resurgence in asynchrony. Recently, asynchronous methods have regained popularity, mainly due to the interest in applying asynchronous motions within FL on edge devices: the heterogeneity of edge networks, the ephemeral nature of the workers, and the computational, communication and energy restrictions of mobile devices are some impediments towards applying synchronous algorithms in realistic environments. Yet, traditional off-the-shelf asynchronous distributed algorithms still have issues, which might be exacerbated in the FL setting. As slower devices take longer local training time before updating the global model, they may update the global model on an inconsistent schedule, compared to that of faster devices. This might have ramifications: i) for FL on i.i.d. data, this causes the gradient staleness problem and results in a slower convergence rate; and, ii) on non-i.i.d. data, this results in a significant drop in the global model's final accuracy.
As solutions, novel approaches on asynchronous FL propose weighted global aggregation techniques that take into consideration the heterogeneity of the devices [49, 5, 38]; yet, these methods often place a heavy computation/communication burden, as they rely on broadcasting full model updates to all the clients and/or the server. Other works monitor client speed to guide the training assignments [26, 3]. Finally, recent efforts propose semi-asynchronous methods, where participating devices are selected and buffered in order to complete a semi-synchronous global update periodically [17, 48]. A thorough discussion on existing asynchronous methods in FL can be found in [50].
What is different in this work? As most algorithms stem from adapting asynchrony to synchronous FL, one still needs to broadcast the full model to all devices, following a data-parallel distributed protocol [11, 35], regardless of device heterogeneity. This inspires us to ask a key question:

"Can we select submodels out of the global model and send these instead to each device, taking into account the device heterogeneity?"

We answer this question affirmatively, by proposing a novel distributed dropout method for FL. We dub our method AsyncDrop. Our approach assigns different submodels to each device¹; empirically, such a strategy decreases the time required to converge to a given accuracy level, while preserving favorable final accuracy. This work attempts to reinstitute the discussion between synchrony and asynchrony in heterogeneous distributed scenarios, as in FL. Our idea is based on the ideas of HogWild! [33, 29] –in terms of sparse submodels– and Independent Subnetwork Training (IST) [52, 10, 27, 47] –where submodels are deliberately created for distribution, in order to decrease both computational and communication requirements.

¹ We consider both random assignment, as well as structured assignments, based on the computation power of the devices.
Yet, we deviate from these works: i) The combination of HogWild! and IST ideas has not been stated and tested before this work, to the best of our knowledge. ii) While the HogWild! line of work provides optimization guarantees, we consider the non-trivial, non-convex neural network setting and provide theoretical guarantees for convergence; such a result combines tools from asynchronous optimization [33, 29], Neural Tangent Kernel assumptions [19] and dropout theory analysis [27], and focuses on convolutional neural networks [24], deviating from simplified fully-connected-layer scenarios. Finally, iii) we provide system-level algorithmic solutions for our approach, mirroring best practices found during our experiments. Overall, the contributions of this work can be summarized as follows:
• We consider and propose asynchronous distributed dropout (AsyncDrop) for efficient large-scale FL. Our focus is on non-trivial, non-convex ML models –as in neural network training– and our framework provides specific engineering solutions for these cases in practice.

• We theoretically characterize and support our proposal with rigorous and non-trivial convergence rate guarantees. Currently, our theory assumes bounded delays; our future goal is to exploit recent developments that drop such assumptions [23]. Yet, our theory already considers the harder case of neural network training, which is often omitted in existing theoretical results.

• We provide specific implementation instances and share "best practices" for faster distributed FL in practice. As a side product, our preliminary results include baseline asynchronous implementations of many synchronous methods (such as FedAvg, FedProx, and more), which do not currently exist, to the best of our knowledge.
2 Problem Setup and Challenges

Optimization in neural network training. We consider FL scenarios over supervised neural network training: i.e., we optimize a loss function ℓ(·,·) over a dataset, such that the model maps unseen data to their true labels, unless otherwise stated. For clarity, the loss ℓ(W, ·) encodes both the loss metric and the neural architecture, with parameters W. Formally, given a data distribution D and {x_i, y_i} ∼ D, where x_i is a data sample and y_i is its corresponding label, classical deep learning aims at finding W⋆ as in:

$$W^\star = \operatorname*{argmin}_{W \in \mathcal{H}} \left\{ \mathcal{L}(W) := \frac{1}{n} \sum_{i=1}^{n} \ell\big(W, \{x_i, y_i\}\big) \right\},$$

where H denotes the model hypothesis class that "molds" the trainable parameters W.
The minimization above can be achieved using different approaches, but almost all training is accomplished via a variation of stochastic gradient descent (SGD) [37]. SGD modifies the current guess W_t using stochastic directions ∇ℓ_{i_t}(W_t), where ℓ_{i_t}(W) := ℓ(W, {x_{i_t}, y_{i_t}}); i.e.,

$$W_{t+1} \leftarrow W_t - \eta \nabla \ell_{i_t}(W_t).$$

Here, η > 0 is the learning rate, and i_t indexes a single example or a mini-batch of examples. Most FL algorithms are based on these basic stochastic motions, including FedAvg [30], FedProx [25], FedNova [45] and SCAFFOLD [21].
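To make the update above concrete, here is a minimal NumPy sketch of mini-batch SGD on a toy least-squares problem; the data, model, step size and batch size are illustrative placeholders of our choosing, not the setups used in the paper.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy data: n samples, d features, with a linear ground truth.
n, d = 256, 10
X = rng.normal(size=(n, d))
y = X @ rng.normal(size=d) + 0.1 * rng.normal(size=n)

W = np.zeros(d)           # current guess W_t
eta, batch = 0.01, 32     # learning rate and mini-batch size

for t in range(200):
    idx = rng.choice(n, size=batch, replace=False)            # i_t: a mini-batch of examples
    grad = 2 * X[idx].T @ (X[idx] @ W - y[idx]) / batch       # stochastic gradient of the loss
    W = W - eta * grad                                        # W_{t+1} <- W_t - eta * grad
```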
FL formulation. Let S be the total number of clients in a distributed FL scenario. Each client i has its own local data D_i, such that the whole dataset satisfies D = ∪_i D_i, and usually D_i ∩ D_j = ∅ for i ≠ j. The goal of FL is to find a global model W that achieves good accuracy on all data D, by minimizing the following optimization problem:

$$W^\star = \operatorname*{argmin}_{W \in \mathcal{H}} \left\{ \mathcal{L}(W) := \frac{1}{S} \sum_{i=1}^{S} \ell\big(W, \mathcal{D}_i\big) \right\}, \quad \text{where} \quad \ell(W, \mathcal{D}_i) = \frac{1}{|\mathcal{D}_i|} \sum_{\{x_j, y_j\} \in \mathcal{D}_i} \ell\big(W, \{x_j, y_j\}\big).$$

With a slight abuse of notation, ℓ(W, D_i) denotes the local loss function for user i, associated with a local model W_i (not indicated above), which gets aggregated with the models of other users. Herein, we consider both i.i.d. and non-i.i.d. cases, since the local data distribution D_i can be heterogeneous and follow a non-i.i.d. distribution.
Algorithm 1 Meta Asynchronous FL

Parameters: T iterations, S clients, l local iterations, W as current global model, W_i as local model for the i-th client, α ∈ (0, 1), η step size.
W ← randomly initialized global model.
// On each client i, asynchronously:
for t = 0, ..., T − 1 do
    W_{i,t} ← W
    // Train W_{i,t} for l iterations via SGD
    for j = 1, ..., l do
        W_{i,t} ← W_{i,t} − η ∇_{W_{i,t}} L
    end for
    // Write local to global model
    W ← (1 − α) · W + α · W_{i,t}
end for
Details of asynchronous training. An abstract description of how asynchronous FL operates is provided in Algorithm 1. In particular, given a number of server iterations T, each client i gets the updated global model W_t from the server, and further trains it locally on D_i for a number of local iterations l.² Asynchronous FL assumes each client has different computation power and communication bandwidth; this can be abstracted by different wall-clock times required to finish a number of local training iterations. Thus, when client i has completed its round, the updated model is shared with the server to be aggregated, before the next round of communication and computation starts for client i. This is different from classical synchronous FL, where the global model is updated only when all participating clients finish (or time out) certain local training iterations.

² Details on the use of the optimizer, and how it is tuned with respect to step size, mini-batch size, etc., are intentionally hidden at this point of the discussion.
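As a concrete, purely illustrative picture of Algorithm 1's control flow, the following sketch simulates S asynchronous clients with Python threads and a lock-protected global model; the least-squares local objective, the mixing weight alpha, and the sleep-based heterogeneity are assumptions made for this sketch, not the paper's implementation.

```python
import threading
import time
import numpy as np

d, S, T, l = 20, 4, 10, 5        # model size, clients, rounds per client, local iterations
alpha, eta = 0.5, 0.05           # aggregation weight and step size
W = np.zeros(d)                  # global model held by the "server"
lock = threading.Lock()

# Each client holds its own local data (here: a random least-squares task).
data_rng = np.random.default_rng(0)
data = [(data_rng.normal(size=(64, d)), data_rng.normal(size=64)) for _ in range(S)]

def client(i):
    """One client running Algorithm 1's loop asynchronously."""
    global W
    rng = np.random.default_rng(100 + i)            # per-client RNG
    X, y = data[i]
    for _ in range(T):
        with lock:
            w = W.copy()                            # W_{i,t} <- W (pull global model)
        for _ in range(l):                          # l local SGD steps on local data
            idx = rng.choice(len(y), size=16, replace=False)
            w -= eta * 2 * X[idx].T @ (X[idx] @ w - y[idx]) / 16
        time.sleep(rng.uniform(0.0, 0.02))          # emulate heterogeneous device speed
        with lock:                                  # asynchronous write-back, no global barrier
            W = (1 - alpha) * W + alpha * w

threads = [threading.Thread(target=client, args=(i,)) for i in range(S)]
for th in threads:
    th.start()
for th in threads:
    th.join()
print("global model after asynchronous training:", np.round(W, 3))
```

Because writes are serialized only by the lock, fast clients may complete several rounds while a slow client is still training, which is exactly the behavior Algorithm 1 permits and the source of gradient staleness discussed next.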
3 Challenges in Asynchronous FL and Related Work

Challenges. Asynchronous steps often lead to inconsistent update schedules of the global model and are characterized by gradient staleness and drifting. Real-life FL applications include edge devices with limited communication and computation capabilities (e.g., how often and how fast they can connect with the central server, and how powerful the devices are). For instance, edge devices such as IoT devices or mobile phones [31] might only be able to communicate with the server within short time windows, due to network conditions or user permission policies.
Figure 1: Potential issues in asynchronous FL.
Consider the toy setting in Figure 1. The two clients (Clients A and B) have significantly different update schedules on the global model: here, Client A has higher computational power or communication bandwidth –compared to Client B– potentially leading to model drifting, lack of fair training and more severe gradient staleness. On top of that, consider these two clients having different local (non-i.i.d.) data distributions.
Related Work. The issue of model drifting due to data "non-iidness" is a central piece of FL research. Algorithms such as FedProx [25] utilize regularization to constrain local parameters from "drifting" away from previous global models. The gradient staleness problem has been widely studied in asynchronous FL, as in [49, 5, 38, 26, 3]. These approaches can be summarized as weighted asynchronous FedAvg protocols, in which the weight of each local client update is proportional to the "capabilities" of the client. This should decrease the negative impact of stale gradients from slower clients. Semi-asynchronous methods have been proposed [17, 48]; yet, they require fast clients to wait until all other clients' updates are completed, in order to receive the updated model for the next round of local training.

Finally, numerous quantization [2, 51] and sparsification [1, 20] techniques have been proposed for reducing computation and communication costs in FL.
4 Asynchronous Distributed Dropout

(Distributed) Dropout. Dropout [43, 39, 12, 6] is a widely-accepted regularization technique in deep learning. The procedure of Dropout is as follows: per training round, a random mask over the parameters is generated; this mask is used to nullify part of the neurons in the neural network for this particular iteration. Variants of dropout include drop-connect [43], multi-sample dropout [18], Gaussian dropout [46], and variational dropout [22].
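For reference, a minimal sketch of the masking step just described: a Bernoulli "keep" mask nullifies a random subset of units for one iteration, with the standard inverted-dropout rescaling. The keep probability and tensor shapes are arbitrary choices for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
keep_prob = 0.75                                   # i.e., a 25% dropout rate

h = rng.normal(size=(8, 16))                       # activations: batch of 8, 16 units
mask = rng.binomial(1, keep_prob, size=16)         # one Bernoulli "keep" flag per unit
h_dropped = h * mask / keep_prob                   # nullify dropped units; rescale survivors
```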
The idea of dropout has also been used in efficient distributed and/or FL scenarios. [13] introduces FjORD and Ordered Dropout, a synchronous distributed dropout technique that leads to an ordered, nested representation of knowledge in models, and enables the extraction of lower-footprint submodels without the need for retraining. Such submodels are better suited to client heterogeneity, as the submodel's width adapts to the client's capabilities. See also Nested Dropout [36] and HeteroFL [8].
Algorithm 2 Asynchronous dropout (AsyncDrop)

Parameters: T iterations, S clients, l local iterations, W as current global model, W_i as local model for the i-th client, α ∈ (0, 1), η step size.
W ← randomly initialized global model.
// On each client i, asynchronously:
for t = 0, ..., T − 1 do
    Generate mask M_{i,t}
    W_{i,t} ← W_t ⊙ M_{i,t}
    // Train W_{i,t} for l iterations via SGD
    for j = 1, ..., l do
        W_{i,t} ← W_{i,t} − η ∇_{W_{i,t}} L
    end for
    // Write local to global model
    W_{t+1} ← W_t ⊙ (M_{i,t})^c + ((1 − α) · W_t + α · W_{i,t}) ⊙ M_{i,t}
end for
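Below is a minimal sketch of one AsyncDrop round from Algorithm 2, written over a flat parameter vector for brevity (the actual method masks neurons/filters per layer); the toy least-squares objective and the function name async_drop_round are assumptions of this sketch, not the paper's code.

```python
import numpy as np

rng = np.random.default_rng(0)
d, xi, alpha, eta, l = 32, 0.75, 0.5, 0.05, 5   # xi: fraction of coordinates kept active
W = rng.normal(size=d)                          # global model (flattened for brevity)

def async_drop_round(W, X, y, rng):
    # 1) Generate a mask and extract the submodel for this client/round.
    M = rng.binomial(1, xi, size=d).astype(float)
    W_i = W * M
    # 2) Train the submodel locally for l SGD steps (squared loss on local data).
    for _ in range(l):
        idx = rng.choice(len(y), size=16, replace=False)
        grad = 2 * X[idx].T @ (X[idx] @ W_i - y[idx]) / 16
        W_i -= eta * grad * M                   # only masked-in coordinates are updated
    # 3) Masked write-back: coordinates outside the mask keep their old values.
    return W * (1 - M) + ((1 - alpha) * W + alpha * W_i) * M

X, y = rng.normal(size=(128, d)), rng.normal(size=128)
for _ in range(20):
    W = async_drop_round(W, X, y, rng)
```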
Our proposal and main hypothesis. We focus on the asynchronous version of distributed dropout. We study theoretically whether asynchrony provably works in non-trivial, non-convex scenarios –as in training neural networks– with random masks that generate submodels for each worker. The algorithm, dubbed AsyncDrop, is described in Algorithm 2 and is based on recent distributed protocols [52, 10, 27, 47]; key features are highlighted in teal-colored text. The main difference from Algorithm 1 is that Algorithm 2 splits the model vertically per iteration, where each submodel contains all layers of the neural network, but with only a (non-overlapping) subset of neurons being active in each layer. Multiple local SGD steps can be performed without the need for the workers to communicate. See Figure 2 for a schematic representation of asynchronous distributed dropout for training a CNN, and the sketch below for one way such submodels can be formed.
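To illustrate what a "vertical" split can look like, the sketch below partitions the hidden neurons of a two-layer MLP into S disjoint groups and builds one submodel per group; the layer sizes and the dictionary representation are illustrative choices of this sketch, not the paper's data structures.

```python
import numpy as np

rng = np.random.default_rng(0)
S, d_in, d_hidden, d_out = 3, 8, 12, 4

# Global two-layer MLP parameters.
W1 = rng.normal(size=(d_hidden, d_in))
W2 = rng.normal(size=(d_out, d_hidden))

# Partition the hidden neurons into S disjoint groups ("vertical" split).
perm = rng.permutation(d_hidden)
groups = np.array_split(perm, S)

submodels = []
for g in groups:
    # Each submodel keeps every layer, but only the rows/columns touching its neurons.
    submodels.append({"W1": W1[g, :].copy(),    # incoming weights of the active neurons
                      "W2": W2[:, g].copy(),    # outgoing weights of the active neurons
                      "neurons": g})

# Each worker trains its submodel independently; its updates map back to disjoint
# slices of (W1, W2), so no two workers write to the same coordinates.
```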
Theoretical Results. We are interested in understanding whether such a combination of asynchronous computing and dropout techniques leads to convergence and favorable results: given the variance introduced by both asynchronous updates and the training of submodels, it is not obvious whether –and under which conditions– such a protocol would work.

For ease of presentation and clarity of results, we analyse a one-hidden-layer CNN and show convergence with random filter dropout. Consider a training dataset $(X, y) = \{(x_i, y_i)\}_{i=1}^{n}$, where each $x_i \in \mathbb{R}^{\hat{d} \times p}$ is an image and $y_i$ is its label. Here, $\hat{d}$ is the number of input channels and $p$ the number of pixels. Let $q$ denote the size of the filter, and let $m$ be the number of filters in the first layer. Based on previous work [9], we let $\hat{\phi}(\cdot)$ denote the patching operator with $\hat{\phi}(x) \in \mathbb{R}^{q\hat{d} \times p}$. Consider the first-layer weights $W \in \mathbb{R}^{m \times q\hat{d}}$ and the second-layer (aggregation) weights $a \in \mathbb{R}^{m \times p}$. We assume that only the first-layer weights $W$ are trainable. The CNN, trained on the mean squared error, has the form:

$$f(x, W) = \left\langle a, \sigma\big(W \hat{\phi}(x)\big) \right\rangle, \qquad \mathcal{L}(W) = \|f(X, W) - y\|_2^2,$$

where $f(x, \cdot)$ denotes the output of the one-layer CNN for input $x$, and $\mathcal{L}(\cdot)$ is the loss function. We use the $\ell_2$-norm loss for simplicity.
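The following NumPy sketch instantiates this one-hidden-layer CNN with ReLU playing the role of σ, a fixed second layer a, and a simple circular patching operator standing in for φ̂; the padding scheme and all concrete dimensions are assumptions made for illustration, not the choices of the analysis.

```python
import numpy as np

rng = np.random.default_rng(0)
d_hat, p, q, m = 3, 16, 5, 32    # channels, pixels, filter size, number of filters

def patch(x):
    # phi_hat: for every pixel, stack the length-q window of each channel
    # (circular padding keeps the sketch simple); output shape (q*d_hat, p).
    cols = [np.roll(x, -j, axis=1) for j in range(q)]
    return np.concatenate(cols, axis=0)

x = rng.normal(size=(d_hat, p))                          # one "image": d_hat channels, p pixels
W = 0.1 * rng.normal(size=(m, q * d_hat))                # trainable first-layer filters
a = rng.choice([-1.0, 1.0], size=(m, p)) / np.sqrt(m)    # fixed second-layer weights

def f(x, W):
    # <a, sigma(W phi_hat(x))>: inner product of a with the ReLU feature map.
    return np.sum(a * np.maximum(W @ patch(x), 0.0))

y = 0.3                                                  # a scalar label
loss = (f(x, W) - y) ** 2                                # squared-error loss on one sample
```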
We make the following assumptions on the training data and the CNN weight initialization.

Assumption 4.1 (Training Data). Assume that for all $i \in [n]$, we have $\|x_i\|_F = \sqrt{1/2}$ and $|y_i| \le C$ for some constant $C$. Moreover, for all $i, i' \in [n]$ with $i \neq i'$, we have $x_i \neq x_{i'}$.

Note that this can be satisfied by normalizing the data. For simplicity of the analysis, let $d := q\hat{d}$.

Assumption 4.2 (Initialization). $w_{0,i} \sim \mathcal{N}(0, \kappa^2 I)$ and $a_{i,i'} \in \left\{ \pm \tfrac{1}{\sqrt{m}} \right\}$ for $i \in [m]$ and $i' \in [p]$.
In an asynchronous scenario, the neural network weights are updated with stale gradients due to the asynchronous updates; let $\delta_t$ denote the delay at training step $t$. We assume $\delta_t$ is bounded by a constant $E$. Then, a simple version of gradient descent under these assumptions looks like:

$$W_{t+1} = W_t - \eta \nabla_W \mathcal{L}\big(W_{t - \delta_t}\big), \qquad \delta_t \le E,$$

where $W_{t-\delta_t}$ indicates that the gradient is evaluated at an earlier version of the model parameters.
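As a toy illustration of this bounded-delay update, the sketch below keeps a short buffer of past iterates and applies, at each step, a gradient evaluated at a copy that is at most E steps old; the quadratic objective and the random delay model are placeholders chosen for this sketch.

```python
import numpy as np

rng = np.random.default_rng(0)
d, E, eta, T = 10, 3, 0.05, 200
A = rng.normal(size=(50, d))                 # toy least-squares problem
b = rng.normal(size=50)

def grad(W):
    return 2 * A.T @ (A @ W - b) / len(b)

W = np.zeros(d)
history = [W.copy()]                         # recent iterates, newest last
for t in range(T):
    delta = rng.integers(0, min(E, t) + 1)   # staleness delta_t <= E
    g = grad(history[-1 - delta])            # gradient at W_{t - delta_t}
    W = W - eta * g                          # W_{t+1} = W_t - eta * grad(W_{t - delta_t})
    history.append(W.copy())
    if len(history) > E + 1:                 # only the last E+1 iterates are ever needed
        history.pop(0)
```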
Figure 2: Schematic representation of AsyncDrop. a) A simple representation of a CNN model; our algorithm applies to CNNs of arbitrary depth (ResNets) as well as other architectures (MLPs, LSTMs, etc.), and we restrict to a shallow CNN here for illustration purposes. b) Per request, random sub-sampled CNN models are created, resulting in different subnetworks. c) These submodels are distributed to devices with different computational capabilities (here a GPU, a CPU, or a smartphone). d) Without loss of generality, we assume that all devices train the submodel locally for l iterations. e) However, each device finishes local training at a different timestamp (shown as differently colored arrows: red: slow; orange: moderate; blue: fast). f) Yet, the global model is asynchronously updated and new submodels are created without global synchronization. g) The above procedure is repeated till convergence.

Given the above, we provide the following guarantees:

Theorem 4.1. Let $f(\cdot, \cdot)$ be a one-hidden-layer CNN with the second-layer weights fixed. Let $u_t$ abstractly represent the output of the model after $t$ iterations, over the random selection of the masks. Let $E$ denote the maximum gradient delay/staleness. Let $\xi$ denote the dropout rate ($\xi = 1$ dictates that all neurons are selected), and denote by $\theta = 1 - (1 - \xi)^S$ the probability that a neuron is active in at least one subnetwork. Assume the number of hidden neurons satisfies $m = \max\left\{ \frac{n^4 K^2}{\lambda_0^4 \delta^2} \max\{n, d\}, \frac{n}{\lambda_0} \right\}$ and the step size satisfies $\eta = O\left(\frac{\lambda_0}{n^2}\right)$. Let $\kappa$ be a proper initialization scaling factor, considered constant, and let $\lambda_0$ denote the smallest eigenvalue of the Neural Tangent Kernel matrix. Let Assumptions 4.1 and 4.2 be satisfied. Then, the following convergence rate guarantee holds:

$$\mathbb{E}_{M_t}\left[\|u_{t+1} - y\|_2^2\right] \le \left(1 - \tfrac{\theta \eta \lambda_0}{4}\right)^{t} \|u_0 - y\|_2^2 + O\Bigg( \frac{\theta \eta \lambda_0^3 \xi^2 \kappa^2 E^2}{n^2} + \frac{\xi^2 (1-\xi)^2 \theta \eta n^3 \kappa^2 d}{\lambda_0} + \frac{2 \eta^2 \theta^2 \lambda_0 \xi^4 E^2}{m^4} + \frac{\xi^2 (1-\xi)^2 \theta^2 \eta^2 n^2 \kappa^2 d}{m^3 \lambda_0} + \frac{\xi^2 (1-\xi)^2 \theta^2 \eta^2 \kappa^2 \lambda_0 E^2}{m^3} + \frac{\xi^2 (1-\xi)^2 \theta^2 \eta^2 n^2 \kappa^2 d}{m^2 \lambda_0} + \frac{2 \theta \xi^2}{S} \Bigg).$$
Remark #1. This theorem states that the expected squared error at the $t$-th iteration (i.e., $\mathbb{E}_{M_t}[\|u_{t+1} - y\|_2^2]$) converges linearly to zero, as dictated by the contraction term $\left(1 - \tfrac{\theta\eta\lambda_0}{4}\right)^t \|u_0 - y\|_2^2$, up to an error neighborhood, denoted by the Big-Oh term on the right-hand side of the expression. Focusing on the latter, there are two types of additive errors: i) the terms carrying a factor $(1 - \xi)^2$ originate from the dropout analysis; the quantity $1 - \xi$ is often called the "dropout rate" (when $\xi = 0$, no neurons are selected and the loss hardly decreases, while when $\xi = 1$, all neurons are selected and these terms disappear); ii) the terms carrying a factor $E^2$ originate from the asynchronous analysis: when $E = 0$ (i.e., we boil down to synchronous computations), these terms also disappear.
Remark #2. Beyond the above extreme cases, we observe that the error-region terms can be controlled by algorithmic and model-design choices: e.g., as the size of the dataset $n$ increases, the first term $\frac{\theta\eta\lambda_0^3\xi^2\kappa^2 E^2}{n^2}$ shrinks; for a sufficiently wide neural network, the terms with $m$ in the denominator can be made arbitrarily small; finally, notice that increasing the number of subnetworks $S$ drives the last term in the bound to zero.
Vanilla asynchronous distributed dropout in practice. We test vanilla AsyncDrop with a 25% dropout rate in a FL setting with 104 heterogeneous clients, based on a non-i.i.d. CIFAR100 data distribution. Beyond the extensions of FL baselines to the asynchronous setting (FedAvg, Fed-Weighted-Avg and FedProx), we further extend the work in [13] into Async FjORD; further details are in Section 6. As shown in Table 1, AsyncDrop indeed shows improvements.