Taming Fat-Tailed (“Heavier-Tailed” with Potentially Infinite Variance) Noise in Federated Learning

Haibo Yang
Dept. of ECE, The Ohio State University, Columbus, OH 43210
yang.5952@osu.edu

Peiwen Qiu
Dept. of ECE, The Ohio State University, Columbus, OH 43210
qiu.617@osu.edu

Jia Liu
Dept. of ECE, The Ohio State University, Columbus, OH 43210
liu@ece.osu.edu
Abstract

In recent years, federated learning (FL) has emerged as an important distributed machine learning paradigm to collaboratively learn a global model with multiple clients, while keeping data local and private. However, a key assumption in most existing works on FL algorithms’ convergence analysis is that the noise in stochastic first-order information has a finite variance. Although this assumption covers all light-tailed (i.e., sub-exponential) and some heavy-tailed noise distributions (e.g., log-normal, Weibull, and some Pareto distributions), it fails for many fat-tailed noise distributions (i.e., “heavier-tailed” with potentially infinite variance) that have been empirically observed in the FL literature. To date, it remains unclear whether one can design convergent algorithms for FL systems that experience fat-tailed noise. This motivates us to fill this gap in this paper by proposing an algorithmic framework called FAT-Clipping (federated averaging with two-sided learning rates and clipping), which contains two variants: FAT-Clipping per-round (FAT-Clipping-PR) and FAT-Clipping per-iteration (FAT-Clipping-PI). Specifically, for the largest tail-index α ∈ (1, 2] such that the fat-tailed noise in FL still has a bounded α-moment, we show that both variants achieve O((mT)^{(2−2α)/α}) and O((mT)^{(1−α)/(3α−2)}) convergence rates in the strongly-convex and general non-convex settings, respectively, where m and T are the numbers of clients and communication rounds. Moreover, with more clipping operations compared to FAT-Clipping-PR, FAT-Clipping-PI further enjoys a linear speedup effect with respect to the number of local updates at each client and is lower-bound-matching (i.e., order-optimal). Collectively, our results advance the understanding of designing efficient algorithms for FL systems that exhibit fat-tailed first-order oracle information.
1 Introduction
In recent years, federated learning (FL) has emerged as an important distributed machine learning
paradigm, where, coordinated by a server, a set of clients collaboratively learn a global model, while
keeping their training data local and private. With intensive research in recent years, researchers
have developed many FL algorithms (e.g., FedAvg [1] and many follow-ups [2–12]) that have been
theoretically shown to achieve fast convergence rates in the presence of various types of randomness
and heterogeneity resulting from training data, network environments, computing resources at clients,
etc. Moreover, many of these algorithms enjoy the so-called “linear speedup” effect, i.e., the
convergence time to a first-order stationary point is inversely proportional to the number of workers
and local update steps.
However, despite the recent advances in FL algorithm design and theoretical understanding, a “cloud
that still obscures the sky of FL” is a common assumption that can be found in almost all works
on performance analysis of FL algorithms, which states that the random noise in stochastic first-order
oracles (e.g., stochastic gradients or associated estimators) has a finite variance. Although this
assumption is not too restrictive and can cover all light-tailed (i.e., sub-exponential) and some heavy-
tailed noise distributions (e.g., log-normal, Weibull, and some Pareto distributions), it fails for many
“fat-tailed” distributions (i.e., “heavier-tailed” with potentially infinite variance¹). In fact, fat-tailed
distributions have already been empirically observed under centralized learning settings [15–19], let
alone in the more heterogeneous FL environments. Later in Section 3, we will also provide empirical
evidence that shows that fat-tailed noise distributions can be easily induced by FL systems with
non-i.i.d. datasets and heterogeneous local updates across clients.
The presence of fat-tailed noise poses two major challenges in FL algorithm design and analysis:
i) Experimentally, it has been shown in [20] that many existing FL algorithms suffer severely from
fat-tailed noise and frequently exhibit the so-called “catastrophic failure of model performance” (i.e.,
sudden and dramatic drops of learning accuracy during the training phase); ii) Theoretically, the
infinite variance of the random noise in the stochastic first-order oracles renders most of the proof
techniques in existing FL algorithmic convergence analysis inapplicable, which necessitates new
algorithmic ideas and proof strategies. In light of these empirical and theoretical challenges, two
foundational questions naturally emerge in FL algorithm design and analysis: 1) Can we develop FL
algorithms with convergence guarantee under fat-tailed noise? 2) If the answer to 1) is “yes,” could
we characterize their finite-time convergence rates? In this paper, we provide affirmative answers to
the above questions. Our major contributions in this paper are highlighted as follows:
To address the challenges of fat-tailed noise in FL algorithm design, we propose an algorithmic framework called FAT-Clipping (federated averaging with two-sided learning rates and clipping), which leverages a clipping technique to mitigate the impact of fat-tailed noise and uses a two-sided learning rate mechanism to lower the communication complexity. Our FAT-Clipping framework contains two variants: FAT-Clipping per-round (FAT-Clipping-PR) and FAT-Clipping per-iteration (FAT-Clipping-PI). We show that, for the largest tail-index α ∈ (1, 2] such that the fat-tailed noise in FL still has a bounded α-moment, both FAT-Clipping variants achieve O((mT)^{(2−2α)/α}) and O((mT)^{(1−α)/(3α−2)}) convergence rates in the strongly-convex and general non-convex settings, respectively, where m and T are the numbers of clients and communication rounds.
Between the proposed FAT-Clipping variants, FAT-Clipping-PR only performs one clipping operation in each communication round before each client communicates with the server, while FAT-Clipping-PI performs clipping in each iteration of the local model update. We show that, at the expense of more clipping operations compared to FAT-Clipping-PR, FAT-Clipping-PI further achieves a linear speedup effect with respect to the number of local model updates at each client and is lower-bound-matching in terms of convergence rate.
In addition to the theoretical analysis, we also conduct extensive numerical experiments to study the fat-tailed phenomenon in FL systems and to verify the efficacy of our proposed FAT-Clipping algorithms for FL systems with fat-tailed noise. We first provide concrete empirical evidence that fat-tailed noise distributions are not uncommon in FL systems with non-i.i.d. datasets and heterogeneous local updates. We then show that our FAT-Clipping algorithms render a much smoother FL training process, which effectively prevents the “catastrophic failure” in various FL settings.
For quick reference and easy comparisons, we summarize all convergence rate results in Table 1. The
rest of the paper is organized as follows. In Section 2, we review the literature to put our work in
comparative perspectives. In Section 3, we provide empirical fat-tailed evidence for FL to further
motivate this work. Section 4 presents our FAT-Clipping algorithms and their convergence analyses.
Section 5 presents numerical results and Section 6 concludes this paper. Due to space limitation, all
proof details and some experiments are provided in the supplementary material.
1 In the literature, the terminologies “heavy-tailed” and “fat-tailed” are not universally defined and are sometimes used interchangeably. In this paper, we follow the convention of those authors who reserve the term “fat-tailed” for the subclass of heavy-tailed distributions that exhibit power-law decay behavior as well as infinite variance (see, e.g., [13, 14]). Thus, every fat-tailed distribution is heavy-tailed, but the reverse is not true.
Table 1: Convergence rate comparisons under fat-tailed noise distributions (shaded parts are our results; metrics: f(x) − f(x*) ≤ ε and ‖∇f(x)‖ ≤ ε for strongly-convex and non-convex functions, respectively): α = 2 and α ∈ (1, 2) correspond to non-fat-tailed and fat-tailed noises, respectively. Here, R is the total number of iterations for centralized algorithms (SGD and GClip); K and T are the numbers of local update steps and communication rounds in the FL setting, respectively; m is the number of clients. N/A means no theoretical guarantee for convergence. Note that the total number of iterations R in FL can be computed as R = KT, which relates it to that in the centralized setting.

Methods          | Strongly Convex, Fat-Tailed             | Strongly Convex, Non-Fat-Tailed | Nonconvex, Fat-Tailed                      | Nonconvex, Non-Fat-Tailed
SGD [21]         | N/A                                     | O(R^{−1})                       | N/A                                        | O(R^{−1/4})
GClip [22]       | O(R^{(2−2α)/α})                         | O(R^{−1})                       | O(R^{(1−α)/(3α−2)})                        | O(R^{−1/4})
FedAvg [3, 7]    | N/A                                     | Õ((mKT)^{−1})                   | N/A                                        | O((mKT)^{−1/4})
FAT-Clipping-PR  | O((mT)^{(2−2α)/α} K^{2/α})              | Õ((mKT)^{−1})                   | O((mT)^{(1−α)/(3α−2)} K^{2α/(3α−2)})       | O((mKT)^{−1/4})
FAT-Clipping-PI  | Õ((mKT)^{(2−2α)/α})                     | Õ((mKT)^{−1})                   | O((mKT)^{(1−α)/(3α−2)})                    | O((mKT)^{−1/4})
Lower Bound      | Ω((mKT)^{(2−2α)/α})                     | Ω((mKT)^{−1})                   | Ω((mKT)^{(1−α)/(3α−2)})                    | Ω((mKT)^{−1/4})
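To make these exponents concrete, the short snippet below (our own illustration, not part of the paper's experiments) evaluates the fat-tailed exponents of FAT-Clipping-PI (which match the lower bounds) as the tail-index α varies; at α = 2 they recover the familiar −1 and −1/4 exponents of the non-fat-tailed column.

# Exponents of the fat-tailed FAT-Clipping-PI rates in Table 1 as functions of the tail-index alpha.
for alpha in (1.2, 1.5, 1.8, 2.0):
    sc = (2 - 2 * alpha) / alpha          # strongly convex: (mKT)^sc
    nc = (1 - alpha) / (3 * alpha - 2)    # non-convex:      (mKT)^nc
    print(f"alpha={alpha}: strongly-convex exponent {sc:+.3f}, non-convex exponent {nc:+.3f}")
# alpha = 2.0 gives -1 and -0.25; smaller alpha (fatter tails) pushes both exponents
# toward 0, i.e., slower convergence.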
2 Related work
In this section, we provide a quick overview of three related topics in the literature: i) federated
learning, ii) heavy-tailed noise in learning, and iii) clipping techniques, thus putting our work into
comparative perspective to highlight our novelty and differences.
1) Federated Learning:
As mentioned earlier, FL has recently emerged as an important distributed
learning paradigm. The first and perhaps the most popular FL method, the federated averaging
(FedAvg) algorithm [1], was initially proposed as a heuristic to improve communication efficiency
and data privacy. Since then, FedAvg has sparked many follow-ups that further address the challenges of data/system heterogeneity and reduce iteration and communication complexities. Notable approaches include adding regularization to the local loss function [2, 5, 6], using variance reduction techniques [3], adopting adaptive learning rate strategies [8] or adaptive communication strategies [23, 24], and many momentum variants [4, 9, 10]. Empirically, these algorithms are shown to be communication-
efficient [1] and enjoy better generalization performance [25]. Moreover, many state-of-the-art
algorithms enjoy the “linear speedup” effect in terms of the numbers of clients and local update steps
in different FL settings [3, 7, 24, 26]. We note, however, that all these theoretical results are built
upon the finite variance assumption of stochastic gradient noise. Unfortunately, when the stochastic
gradient noise is fat-tailed, the finite variance assumption no longer holds, and hence the associated
theoretical analysis is also invalid. This motivates us to fill this gap in this paper and conduct the first
theoretical analysis for FL systems that experience fat-tailed noise.
2) Heavy-Tailed Noise in Learning:
Recently, heavy-tailed noise has been empirically observed in modern machine learning systems and theoretically analyzed [15, 16, 18, 22, 27–30]. Heavy-tailed noise significantly affects the learning dynamics and computational complexity, such as the first exit time from saddle points [27] and the iteration complexity [22]. This is dramatically different from the classic dynamics analysis, which is often based on a sub-Gaussian noise assumption [31, 32], and from algorithmic convergence analysis under the bounded variance assumption [21, 33]. However, for FL, there are few investigations of heavy-tailed behaviors. In this paper, we first demonstrate through extensive experiments that fat-tailed (i.e., heavier-tailed) noise in FL can be easily induced by data heterogeneity and local update steps. We then propose efficient algorithms to mitigate the impacts of fat tails.
3) The Clipping Technique:
Since our FAT-Clipping algorithms are based on the idea of clipping, here we provide an overview of this technique. As far as we know, dating back to at least 1985 [34], gradient clipping has been an effective technique to ensure convergence for optimization problems with fast-growing objective functions. In deep learning, clipping is a widely adopted technique to address the exploding gradient problem. Recently, gradient clipping was theoretically shown to accelerate the training of centralized learning [17, 35–37]. Clipping is also an effective approach to mitigate heavy-tailed noise [17, 18] in centralized learning. In FL, clipping has been used as a preconditioning step for preserving differential privacy (DP) [38–40]. Unlike these works, in this paper we utilize clipping to address the algorithmic divergence caused by fat-tailed noise in FL.
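To fix ideas, the following is a minimal NumPy sketch of the standard norm-clipping operator g ↦ min{1, λ/‖g‖} · g; the threshold λ and the Student-t noise in the example are illustrative assumptions rather than values from our analysis.

import numpy as np

def clip_by_norm(g, lam):
    """Scale g so that its Euclidean norm is at most lam."""
    norm = np.linalg.norm(g)
    return g if norm <= lam else g * (lam / norm)

# A fat-tailed sample occasionally produces a huge (pseudo-)gradient;
# clipping caps its influence on the model update.
rng = np.random.default_rng(0)
g = rng.standard_t(df=1.5, size=10)   # heavy-tailed noise with infinite variance
print(np.linalg.norm(g), np.linalg.norm(clip_by_norm(g, lam=1.0)))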
Algorithm 1 Generalized FedAvg Algorithm (GFedAvg).
1: Initialize x_1.
2: for t = 1, · · · , T (communication round) do
3:   for each client i ∈ [m] in parallel do
4:     Update local model: x^1_{t,i} = x_t.
5:     for k = 1, · · · , K (local update step) do
6:       Compute an unbiased estimate ∇f_i(x^k_{t,i}, ξ^k_{t,i}) of ∇f_i(x^k_{t,i}).
7:       Local update: x^{k+1}_{t,i} = x^k_{t,i} − η_L ∇f_i(x^k_{t,i}, ξ^k_{t,i}).
8:     end for
9:     Send Δ^i_t = Σ_{k∈[K]} ∇f_i(x^k_{t,i}, ξ^k_{t,i}) to the server.
10:  end for
11:  Global Aggregation At Server:
12:    Receive Δ^i_t, i ∈ [m].
13:    Server Update: x_{t+1} = x_t − (η η_L / m) Σ_{i∈[m]} Δ^i_t.
14:  Broadcast x_{t+1} to the clients.
15: end for
3 Fat-tailed noise phenomenon in federated learning
In this section, we first introduce the basic FL problem statement and the standard FedAvg algorithm
for FL. Then, we provide some necessary background of fat-tailed distributions and provide empirical
evidence to show that fat-tailed noise can be easily induced by heterogeneity of data and local updates
in FL, which further motivates this work. Lastly, we demonstrate the algorithmic divergence and
frequent catastrophic model failures under fat-tailed noise.
1) Problem Statement of Federated Learning and the FedAvg Algorithm:
The goal of FL is to
solve the following optimization problem:
    min_{x∈R^d} f(x) := (1/m) Σ_{i=1}^{m} f_i(x),    (1)
where m is the number of clients and f_i(x) := E_{ξ_i∼D_i}[f(x, ξ_i)] is the local loss function associated with a local data distribution D_i. A key challenge in FL stems from data heterogeneity, i.e., D_i ≠ D_j for i ≠ j. In FL, the standard and perhaps the most popular algorithm is the federated averaging (FedAvg) method. Here in Algorithm 1, we illustrate a more generalized version of the original FedAvg (GFedAvg) with separate learning rates on the client and server sides [3, 7, 8]. Note that when η = 1, GFedAvg reduces to the original FedAvg [1]. In each communication round of GFedAvg, each client performs local update steps and returns the update difference Δ^i_t. The server then aggregates these results and updates the global model², and the updated model parameters will then be retrieved by the clients to start the next round of local updates.
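As a concrete illustration of Algorithm 1, below is a minimal NumPy sketch of one GFedAvg communication round; the quadratic toy losses, the Student-t gradient noise, and the learning rates are illustrative assumptions, not the paper's experimental setup.

import numpy as np

def gfedavg_round(x_t, client_grads, rng, eta=1.0, eta_L=0.01, K=5):
    """One GFedAvg round: client_grads[i](x, rng) returns a stochastic gradient of f_i at x."""
    deltas = []
    for grad_i in client_grads:
        x = x_t.copy()
        delta = np.zeros_like(x_t)
        for _ in range(K):          # K local SGD steps with client learning rate eta_L
            g = grad_i(x, rng)
            x = x - eta_L * g
            delta += g              # accumulate Delta_t^i = sum_k grad f_i(x_{t,i}^k, xi_{t,i}^k)
        deltas.append(delta)
    # Server update: x_{t+1} = x_t - (eta * eta_L / m) * sum_i Delta_t^i
    return x_t - eta * eta_L * np.mean(deltas, axis=0)

# Toy usage: two clients with losses f_i(x) = 0.5 * ||x - c_i||^2 and heavy-tailed
# (infinite-variance) Student-t gradient noise.
rng = np.random.default_rng(0)
centers = [np.array([1.0, 1.0]), np.array([-1.0, 2.0])]
clients = [lambda x, rng, c=c: (x - c) + 0.1 * rng.standard_t(df=1.5, size=x.shape)
           for c in centers]
x = np.zeros(2)
for _ in range(200):
    x = gfedavg_round(x, clients, rng)
print(x)  # roughly the average of the centers, (0.0, 1.5); heavy-tailed spikes may perturb it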
2) Empirical Evidence of Fat-Tailed Noise Phenomenon in Federated Learning:
With the basics
of FL and the FedAvg algorithm, we are now in a position to demonstrate the empirical evidence
of the existence of fat-tailed noise in FL systems. As mentioned earlier, in most performance
analyses of FL algorithms, a common assumption is the bounded variance assumption of the local
stochastic gradients:
E[‖∇f_i(x, ξ) − ∇f_i(x)‖²] ≤ σ². This assumption holds for all light-tailed noise
distributions (i.e., the sub-exponential family) and some heavy-tailed distributions (e.g., log-normal,
Weibull, and some Pareto distributions).
However, the finite-variance assumption fails to hold for many fat-tailed noise distributions. For instance, for a random variable X, if its density p(x) has a power-law tail decreasing as 1/|x|^{α+1} with α ∈ (0, 2), then this noise only has finite moments of order r < α, where α < 2. To more precisely characterize fat-tailed distributions, in this paper we adopt the notion of tail-index α [15] to parameterize fat-tailed and heavy-tailed distributions. More specifically, if the density of a random variable X's distribution decays with a power-law tail as 1/|x|^{α+1}, where α ∈ (0, 2], then α is called the tail-index. This α-parameter determines the behavior of the distribution: the smaller the α-value, the heavier the tail of the distribution.
2 We assume that all clients participate in the training at each communication round, but the results can be extended to the case with a (uniformly randomly sampled) subset of clients in each communication round [3, 7].
[Figure 1: Distributions of the norms of the pseudo-gradient noises (density vs. noise norm) computed with a CNN on the CIFAR-10 dataset in the i.i.d. case (top) and non-i.i.d. case (bottom). m = 100 clients participate in the training.]

[Figure 2: Estimation of α for the CIFAR-10 dataset (curves: Single SGD, Local Epoch = 1, 2, 5; x-axis: non-IID index p). The non-IID index p represents the data heterogeneity level, and p = 10 is the IID case. The smaller the p, the more heterogeneous the data across clients.]

[Figure 3: Catastrophic training failures happen when applying GFedAvg on the CIFAR-10 dataset, where the test accuracy experiences a sudden and dramatic drop and the pseudo-gradient norm increases substantially (test accuracy and gradient norm vs. communication round).]
Moreover, the α-parameter also determines the moments: E[|X|^r] < ∞ if and only if r < α, which implies that X has infinite variance when α < 2, i.e., it is fat-tailed.
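As a quick numerical illustration of this moment condition (a sketch on synthetic Pareto samples; the parameters are illustrative assumptions):

import numpy as np

rng = np.random.default_rng(0)
for alpha in (1.5, 2.5):                           # tail-index below vs. above 2
    x = rng.pareto(alpha, size=1_000_000) + 1.0    # Pareto tail, density ~ 1/x^(alpha+1)
    for n in (10**3, 10**4, 10**5, 10**6):
        print(alpha, n, x[:n].var())
# For alpha = 2.5 the running sample variance settles near its finite limit as n grows;
# for alpha = 1.5 it keeps drifting upward, reflecting an infinite variance.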
Next, we investigate the tail property of the model updates returned by clients in the GFedAvg algorithm. Due to the multiple local steps in the GFedAvg algorithm, we view the whole update vector Δ^i_t returned by each client, which we call the “pseudo-gradient,” as a random vector and then analyze its statistical properties. Note that in the special case with the number of local updates K = 1, Δ^i_t coincides with a single stochastic gradient of a random sample (i.e., Δ^i_t = ∇f_i(x_t, ξ_t)).
We study the mismatch between the “non-fat-tailed” condition (α = 2) and the empirical behavior of the stochastic pseudo-gradient noise. In Fig. 1, we illustrate the distributions of the norms of the stochastic pseudo-gradient noises computed with a convolutional neural network (CNN) on the CIFAR-10 dataset in both i.i.d. and non-i.i.d. client dataset settings. We can clearly observe that the non-i.i.d. case exhibits a rather fat-tailed behavior, where the pseudo-gradient norm can be as large as 1.6. Although the i.i.d. case appears to have a much lighter tail, our detailed analysis shows that it still exhibits a fat-tailed behavior. To see this, in Fig. 2, we estimate the α-value for the CIFAR-10 dataset in different scenarios: 1) different local update steps, and 2) different data heterogeneity. We use a parameter p to characterize the data heterogeneity level, with p = 10 corresponding to the i.i.d. case. The smaller the p, the more heterogeneous the data among clients. Fig. 2 shows that the α-value is smaller than 1.15 in all scenarios, and α increases as the non-i.i.d. index p increases (i.e., closer to the i.i.d. case). This implies that the stochastic pseudo-gradient noise is fat-tailed and the “fatness” increases as the clients’ data become more heterogeneous.
3) The Impacts of Fat-Tailed Noise on Federated Learning:
Next, we show that fat-tailed noise can lead to a “catastrophic model failure” (i.e., a sudden and dramatic drop of learning accuracy), consistent with previous observations in the FL literature [20]. To demonstrate this, we apply GFedAvg on the CIFAR-10 dataset and randomly sample five clients among m = 10 clients in each communication round. In Fig. 3, we illustrate a trial where a catastrophic training failure occurred. Correspondingly, we can observe in Fig. 3 a spike in the norm of the pseudo-gradient. This exceedingly large pseudo-gradient norm motivates us to apply the clipping technique to curtail the gradient updates. It is also worth noting that even if the squared norm of the stochastic gradient is not infinitely large in practice (i.e., it has bounded support empirically), it can still be large enough to cause catastrophic model failures. In fact, under fat-tailed noise, the FedAvg algorithm could diverge, which follows from the fact that there exists a function on which SGD diverges under heavy-tailed noise (see Remark 1 in [22]). As a result, the value returned by a single client might be exceedingly large, leading to divergence of FedAvg-type algorithms.
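To preview how clipping curtails such spikes, here is a per-round-clipped variant of the toy GFedAvg round sketched earlier (the clipping threshold lam is an illustrative assumption; this is only in the spirit of clipping each client's pseudo-gradient once per round, not the exact FAT-Clipping-PR specification given in Section 4).

import numpy as np

def gfedavg_round_clipped(x_t, client_grads, rng, lam=1.0, eta=1.0, eta_L=0.01, K=5):
    """One round in which each client's pseudo-gradient Delta_t^i is norm-clipped to lam before being sent."""
    deltas = []
    for grad_i in client_grads:
        x = x_t.copy()
        delta = np.zeros_like(x_t)
        for _ in range(K):
            g = grad_i(x, rng)
            x = x - eta_L * g
            delta += g
        norm = np.linalg.norm(delta)
        if norm > lam:                   # one clipping operation per client per round
            delta = delta * (lam / norm)
        deltas.append(delta)
    return x_t - eta * eta_L * np.mean(deltas, axis=0)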
It is worth pointing out that, although we have empirically shown heavy/fat-tailed noise in FL for the first time in this paper, we are by no means the only ones to have observed the heavy/fat-tailed noise phenomenon in learning. Previous works have also found the heavy/fat-tailed noise phenomenon in centralized training with SGD-type algorithms. For example, the work in [15] showed the heavy-tailed noise phenomenon while (centralized) training AlexNet on CIFAR-10. Here, we adopt a procedure similar to that in [15] to evaluate the tail index α of the noise norm distribution in FL.
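For completeness, the following is a sketch of a block-sum tail-index estimator in the spirit of estimators used for α-stable noise (it relies on the fact that a sum of K samples of an α-stable variable scales like K^{1/α}); the block size, the synthetic sanity check, and the specific estimator form are our illustrative assumptions, and the exact procedure in [15] may differ.

import numpy as np

def estimate_alpha(x, K=100):
    """Estimate the tail index: the gap between the mean log-magnitude of block sums
    and of single samples, divided by log K, estimates 1/alpha."""
    x = np.asarray(x, dtype=float)
    n = len(x) // K
    blocks = x[: n * K].reshape(n, K).sum(axis=1)
    inv_alpha = (np.log(np.abs(blocks)).mean() - np.log(np.abs(x[: n * K])).mean()) / np.log(K)
    return 1.0 / inv_alpha

# Sanity check on synthetic data with known tail indices.
rng = np.random.default_rng(0)
print(estimate_alpha(rng.standard_cauchy(1_000_000)))   # alpha-stable with alpha = 1, estimate ~ 1.0
print(estimate_alpha(rng.standard_normal(1_000_000)))   # Gaussian (alpha = 2), estimate ~ 2.0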