Taming Fat-Tailed (“Heavier-Tailed” with Potentially Infinite Variance) Noise in Federated Learning

Haibo Yang
Dept. of ECE, The Ohio State University, Columbus, OH 43210
yang.5952@osu.edu

Peiwen Qiu
Dept. of ECE, The Ohio State University, Columbus, OH 43210
qiu.617@osu.edu

Jia Liu
Dept. of ECE, The Ohio State University, Columbus, OH 43210
liu@ece.osu.edu
Abstract

In recent years, federated learning (FL) has emerged as an important distributed machine learning paradigm to collaboratively learn a global model with multiple clients, while keeping data local and private. However, a key assumption in most existing works on FL algorithms’ convergence analysis is that the noise in stochastic first-order information has a finite variance. Although this assumption covers all light-tailed (i.e., sub-exponential) and some heavy-tailed noise distributions (e.g., log-normal, Weibull, and some Pareto distributions), it fails for many fat-tailed noise distributions (i.e., “heavier-tailed” with potentially infinite variance) that have been empirically observed in the FL literature. To date, it remains unclear whether one can design convergent algorithms for FL systems that experience fat-tailed noise. This motivates us to fill this gap in this paper by proposing an algorithmic framework called FAT-Clipping (federated averaging with two-sided learning rates and clipping), which contains two variants: FAT-Clipping per-round (FAT-Clipping-PR) and FAT-Clipping per-iteration (FAT-Clipping-PI). Specifically, for the largest tail-index α ∈ (1, 2] such that the fat-tailed noise in FL still has a bounded α-moment, we show that both variants achieve O((mT)^{(2−2α)/α}) and O((mT)^{(1−α)/(3α−2)}) convergence rates in the strongly-convex and general non-convex settings, respectively, where m and T are the numbers of clients and communication rounds. Moreover, with more clipping operations compared to FAT-Clipping-PR, FAT-Clipping-PI further enjoys a linear speedup effect with respect to the number of local updates at each client and is lower-bound-matching (i.e., order-optimal). Collectively, our results advance the understanding of designing efficient algorithms for FL systems that exhibit fat-tailed first-order oracle information.
1 Introduction
In recent years, federated learning (FL) has emerged as an important distributed machine learning
paradigm, where, coordinated by a server, a set of clients collaboratively learn a global model, while
keeping their training data local and private. With intensive research in recent years, researchers
have developed many FL algorithms (e.g., FedAvg [1] and many follow-ups [2–12]) that have been
theoretically shown to achieve fast convergence rates in the presence of various types of randomness
and heterogeneity resulting from training data, network environments, computing resources at clients,
etc. Moreover, many of these algorithms enjoy the so-called “linear speedup” effect, i.e., the
convergence time to a first-order stationary point is inversely proportional to the number of workers
and local update steps.
However, despite the recent advances in FL algorithm design and theoretical understanding, a “cloud
that still obscures the sky of FL” is a common assumption that can be found in almost all works
on performance analysis of FL algorithms, which states that the random noise in stochastic first-order
oracles (e.g., stochastic gradients or associated estimators) has a finite variance. Although this
assumption is not too restrictive and can cover all light-tailed (i.e., sub-exponential) and some heavy-
tailed noise distributions (e.g., log-normal, Weibull, and some Pareto distributions), it fails for many
“fat-tailed” distributions (i.e., “heavier-tailed” with potentially infinite variance¹). In fact, fat-tailed
distributions have already been empirically observed under centralized learning settings [15–19], let
alone in the more heterogeneous FL environments. Later in Section 3, we will also provide empirical
evidence that shows that fat-tailed noise distributions can be easily induced by FL systems with
non-i.i.d. datasets and heterogeneous local updates across clients.
The presence of fat-tailed noise poses two major challenges in FL algorithm design and analysis:
i) Experimentally, it has been shown in [20] that many existing FL algorithms suffer severely from
fat-tailed noise and frequently exhibit the so-called “catastrophic failure of model performance” (i.e.,
sudden and dramatic drops of learning accuracy during the training phase); ii) Theoretically, the
infinite variance of the random noise in the stochastic first-order oracles renders most of the proof
techniques in existing FL algorithmic convergence analysis inapplicable, which necessitates new
algorithmic ideas and proof strategies. In light of these empirical and theoretical challenges, two
foundational questions naturally emerge in FL algorithm design and analysis: 1) Can we develop FL
algorithms with convergence guarantee under fat-tailed noise? 2) If the answer to 1) is “yes,” could
we characterize their finite-time convergence rates? In this paper, we provide affirmative answers to
the above questions. Our major contributions in this paper are highlighted as follows:
To address the challenges of fat-tailed noise in FL algorithm design, we propose an algorithmic framework called FAT-Clipping (federated averaging with two-sided learning rates and clipping), which leverages a clipping technique to mitigate the impact of fat-tailed noise and uses a two-sided learning rate mechanism to lower the communication complexity. Our FAT-Clipping framework contains two variants: FAT-Clipping per-round (FAT-Clipping-PR) and FAT-Clipping per-iteration (FAT-Clipping-PI). We show that, for the largest tail-index α ∈ (1, 2] such that the fat-tailed noise in FL still has a bounded α-moment, both FAT-Clipping variants achieve O((mT)^{(2−2α)/α}) and O((mT)^{(1−α)/(3α−2)}) convergence rates in the strongly-convex and general non-convex settings, respectively, where m and T are the numbers of clients and communication rounds.
Between the proposed FAT-Clipping variants, FAT-Clipping-PR only performs one clipping operation in each communication round before each client communicates with the server, while FAT-Clipping-PI performs clipping in each iteration of the local model update. We show that, at the expense of more clipping operations compared to FAT-Clipping-PR, FAT-Clipping-PI further achieves a linear speedup effect with respect to the number of local model updates at each client and is lower-bound-matching in terms of convergence rate.
In addition to the theoretical analysis, we also conduct extensive numerical experiments to study the fat-tailed phenomenon in FL systems and to verify the efficacy of our proposed FAT-Clipping algorithms for FL systems with fat-tailed noise. We first provide concrete empirical evidence that fat-tailed noise distributions are not uncommon in FL systems with non-i.i.d. datasets and heterogeneous local updates. We then show that our FAT-Clipping algorithms render a much smoother FL training process, which effectively prevents the “catastrophic failure” in various FL settings.
For quick reference and easy comparisons, we summarize all convergence rate results in Table 1. The
rest of the paper is organized as follows. In Section 2, we review the literature to put our work in
comparative perspectives. In Section 3, we provide empirical fat-tailed evidence for FL to further
motivate this work. Section 4 presents our FAT-Clipping algorithms and their convergence analyses.
Section 5 presents numerical results and Section 6 concludes this paper. Due to space limitation, all
proof details and some experiments are provided in the supplementary material.
1 In the literature, the terminologies “heavy-tailed” and “fat-tailed” are not universally defined and are sometimes used interchangeably. In this paper, we follow the convention of those authors who reserve the term “fat-tailed” for the subclass of heavy-tailed distributions that exhibit power-law decay behavior as well as infinite variance (see, e.g., [13, 14]). Thus, every fat-tailed distribution is heavy-tailed, but the reverse is not true.
Table 1: Convergence rate comparisons under fat-tailed noise distributions (shaded parts are our results; metrics: f(x) − f(x*) ≤ ε and ‖∇f(x)‖ ≤ ε for strongly-convex and non-convex functions, respectively): α = 2 and α ∈ (1, 2) correspond to non-fat-tailed and fat-tailed noises, respectively. Here, R is the total number of iterations for centralized algorithms (SGD and GClip); K and T are the numbers of local update steps and communication rounds in the FL setting, respectively; m is the number of clients. N/A means no theoretical guarantee for convergence. Note that the total number of iterations R in FL can be computed as R = KT, which relates it to that in the centralized setting.

Methods          | Strongly Convex, Fat-Tailed             | Strongly Convex, Non-Fat-Tailed | Nonconvex, Fat-Tailed                      | Nonconvex, Non-Fat-Tailed
SGD [21]         | N/A                                     | O(R^{−1})                       | N/A                                        | O(R^{−1/4})
GClip [22]       | O(R^{(2−2α)/α})                         | O(R^{−1})                       | O(R^{(1−α)/(3α−2)})                        | O(R^{−1/4})
FedAvg [3, 7]    | N/A                                     | Õ((mKT)^{−1})                   | N/A                                        | O((mKT)^{−1/4})
FAT-Clipping-PR  | O((mT)^{(2−2α)/α} K^{2/α})              | Õ((mKT)^{−1})                   | O((mT)^{(1−α)/(3α−2)} K^{2α/(3α−2)})       | O((mKT)^{−1/4})
FAT-Clipping-PI  | Õ((mKT)^{(2−2α)/α})                     | Õ((mKT)^{−1})                   | O((mKT)^{(1−α)/(3α−2)})                    | O((mKT)^{−1/4})
Lower Bound      | Ω((mKT)^{(2−2α)/α})                     | Ω((mKT)^{−1})                   | Ω((mKT)^{(1−α)/(3α−2)})                    | Ω((mKT)^{−1/4})
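To make these exponents concrete, the short snippet below (our own illustration, not part of the paper's experiments) evaluates the fat-tailed exponents of FAT-Clipping-PI (which match the lower bounds) as the tail-index α varies; at α = 2 they recover the familiar −1 and −1/4 exponents of the non-fat-tailed column.

# Exponents of the fat-tailed FAT-Clipping-PI rates in Table 1 as functions of the tail-index alpha.
for alpha in (1.2, 1.5, 1.8, 2.0):
    sc = (2 - 2 * alpha) / alpha          # strongly convex: (mKT)^sc
    nc = (1 - alpha) / (3 * alpha - 2)    # non-convex:      (mKT)^nc
    print(f"alpha={alpha}: strongly-convex exponent {sc:+.3f}, non-convex exponent {nc:+.3f}")
# alpha = 2.0 gives -1 and -0.25; smaller alpha (fatter tails) pushes both exponents
# toward 0, i.e., slower convergence.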
2 Related work
In this section, we provide a quick overview of three related topics in the literature: i) federated
learning, ii) heavy-tailed noise in learning, and iii) clipping techniques, thus putting our work into
comparative perspective to highlight our novelty and differences.
1) Federated Learning:
As mentioned earlier, FL has recently emerged as an important distributed
learning paradigm. The first and perhaps the most popular FL method, the federated averaging
(FedAvg) algorithm [1], was initially proposed as a heuristic to improve communication efficiency
and data privacy. Since then, FedAvg has sparked many follow-ups that further address the challenges of data/system heterogeneity and reduce iteration and communication complexities. Notable approaches include adding regularization to the local loss function [2, 5, 6], using variance reduction techniques [3], adopting adaptive learning rate strategies [8] or adaptive communication strategies [23, 24], and many momentum variants [4, 9, 10]. Empirically, these algorithms are shown to be communication-
efficient [1] and enjoy better generalization performance [25]. Moreover, many state-of-the-art
algorithms enjoy the “linear speedup” effect in terms of the numbers of clients and local update steps
in different FL settings [3, 7, 24, 26]. We note, however, that all these theoretical results are built
upon the finite variance assumption of stochastic gradient noise. Unfortunately, when the stochastic
gradient noise is fat-tailed, the finite variance assumption no longer holds, and hence the associated
theoretical analysis is also invalid. This motivates us to fill this gap in this paper and conduct the first
theoretical analysis for FL systems that experience fat-tailed noise.
2) Heavy-Tailed Noise in Learning:
Recently, heavy-tailed noise has been empirically observed in modern machine learning systems and theoretically analyzed [15, 16, 18, 22, 27–30]. Heavy-tailed noise significantly affects the learning dynamics and computational complexity, such as the first exit time from saddle points [27] and the iteration complexity [22]. This is dramatically different from the classic dynamics analysis, which is often based on a sub-Gaussian noise assumption [31, 32], and from algorithmic convergence analysis under the bounded variance assumption [21, 33]. However, for FL, there are few investigations of heavy-tailed behaviors. In this paper, we first demonstrate through extensive experiments that fat-tailed (i.e., heavier-tailed) noise in FL can be easily induced by data heterogeneity and local update steps. We then propose efficient algorithms to mitigate the impacts of fat tails.
3) The Clipping Technique:
Since our FAT-Clipping algorithms are based on the idea of clipping, here we provide an overview of this technique. As far as we know, dating back to at least 1985 [34], gradient clipping has been an effective technique to ensure convergence for optimization problems with fast-growing objective functions. In deep learning, clipping is a widely adopted technique to address the exploding gradient problem. Recently, gradient clipping was theoretically shown to accelerate the training of centralized learning [17, 35–37]. Clipping is also an effective approach to mitigate heavy-tailed noise [17, 18] in centralized learning. In FL, clipping has been used as a preconditioning step for preserving differential privacy (DP) [38–40]. Unlike these works, in this paper we utilize clipping to address the algorithmic divergence caused by fat-tailed noise in FL.
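To fix ideas, the following is a minimal NumPy sketch of the standard norm-clipping operator g ↦ min{1, λ/‖g‖} · g; the threshold λ and the Student-t noise in the example are illustrative assumptions rather than values from our analysis.

import numpy as np

def clip_by_norm(g, lam):
    """Scale g so that its Euclidean norm is at most lam."""
    norm = np.linalg.norm(g)
    return g if norm <= lam else g * (lam / norm)

# A fat-tailed sample occasionally produces a huge (pseudo-)gradient;
# clipping caps its influence on the model update.
rng = np.random.default_rng(0)
g = rng.standard_t(df=1.5, size=10)   # heavy-tailed noise with infinite variance
print(np.linalg.norm(g), np.linalg.norm(clip_by_norm(g, lam=1.0)))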
Algorithm 1 Generalized FedAvg Algorithm (GFedAvg).
1: Initialize x_1.
2: for t = 1, · · · , T (communication round) do
3:   for each client i ∈ [m] in parallel do
4:     Update local model: x^1_{t,i} = x_t.
5:     for k = 1, · · · , K (local update step) do
6:       Compute an unbiased estimate ∇f_i(x^k_{t,i}, ξ^k_{t,i}) of ∇f_i(x^k_{t,i}).
7:       Local update: x^{k+1}_{t,i} = x^k_{t,i} − η_L ∇f_i(x^k_{t,i}, ξ^k_{t,i}).
8:     end for
9:     Send Δ^i_t = Σ_{k∈[K]} ∇f_i(x^k_{t,i}, ξ^k_{t,i}) to the server.
10:  end for
11:  Global Aggregation At Server:
12:    Receive Δ^i_t, i ∈ [m].
13:    Server Update: x_{t+1} = x_t − (η η_L / m) Σ_{i∈[m]} Δ^i_t.
14:  Broadcast x_{t+1} to the clients.
15: end for
3 Fat-tailed noise phenomenon in federated learning
In this section, we first introduce the basic FL problem statement and the standard FedAvg algorithm
for FL. Then, we provide some necessary background of fat-tailed distributions and provide empirical
evidence to show that fat-tailed noise can be easily induced by heterogeneity of data and local updates
in FL, which further motivates this work. Lastly, we demonstrate the algorithmic divergence and
frequent catastrophic model failures under fat-tailed noise.
1) Problem Statement of Federated Learning and the FedAvg Algorithm:
The goal of FL is to
solve the following optimization problem:
    min_{x∈R^d} f(x) := (1/m) Σ_{i=1}^{m} f_i(x),    (1)
where m is the number of clients and f_i(x) := E_{ξ_i∼D_i}[f(x, ξ_i)] is the local loss function associated with a local data distribution D_i. A key challenge in FL stems from data heterogeneity, i.e., D_i ≠ D_j for i ≠ j. In FL, the standard and perhaps the most popular algorithm is the federated averaging (FedAvg) method. Here in Algorithm 1, we illustrate a more generalized version of the original FedAvg (GFedAvg) with separate learning rates on the client and server sides [3, 7, 8]. Note that when η = 1, GFedAvg reduces to the original FedAvg [1]. In each communication round of GFedAvg, each client performs local update steps and returns the update difference Δ^i_t. The server then aggregates these results and updates the global model², and the updated model parameters will then be retrieved by the clients to start the next round of local updates.
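As a concrete illustration of Algorithm 1, below is a minimal NumPy sketch of one GFedAvg communication round; the quadratic toy losses, the Student-t gradient noise, and the learning rates are illustrative assumptions, not the paper's experimental setup.

import numpy as np

def gfedavg_round(x_t, client_grads, rng, eta=1.0, eta_L=0.01, K=5):
    """One GFedAvg round: client_grads[i](x, rng) returns a stochastic gradient of f_i at x."""
    deltas = []
    for grad_i in client_grads:
        x = x_t.copy()
        delta = np.zeros_like(x_t)
        for _ in range(K):          # K local SGD steps with client learning rate eta_L
            g = grad_i(x, rng)
            x = x - eta_L * g
            delta += g              # accumulate Delta_t^i = sum_k grad f_i(x_{t,i}^k, xi_{t,i}^k)
        deltas.append(delta)
    # Server update: x_{t+1} = x_t - (eta * eta_L / m) * sum_i Delta_t^i
    return x_t - eta * eta_L * np.mean(deltas, axis=0)

# Toy usage: two clients with losses f_i(x) = 0.5 * ||x - c_i||^2 and heavy-tailed
# (infinite-variance) Student-t gradient noise.
rng = np.random.default_rng(0)
centers = [np.array([1.0, 1.0]), np.array([-1.0, 2.0])]
clients = [lambda x, rng, c=c: (x - c) + 0.1 * rng.standard_t(df=1.5, size=x.shape)
           for c in centers]
x = np.zeros(2)
for _ in range(200):
    x = gfedavg_round(x, clients, rng)
print(x)  # roughly the average of the centers, (0.0, 1.5); heavy-tailed spikes may perturb it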
2) Empirical Evidence of Fat-Tailed Noise Phenomenon in Federated Learning:
With the basics
of FL and the FedAvg algorithm, we are now in a position to demonstrate the empirical evidence
of the existence of fat-tailed noise in FL systems. As mentioned earlier, in most performance
analyses of FL algorithms, a common assumption is the bounded variance assumption of the local
stochastic gradients:
E[‖∇f_i(x, ξ) − ∇f_i(x)‖²] ≤ σ². This assumption holds for all light-tailed noise
distributions (i.e., the sub-exponential family) and some heavy-tailed distributions (e.g., log-normal,
Weibull, and some Pareto distributions).
However, the finite-variance assumption fails to hold for many fat-tailed noise distributions. For instance, for a random variable X, if its density p(x) has a power-law tail decreasing as 1/|x|^{α+1} with α ∈ (0, 2), then this noise only has finite moments of order r < α, where α < 2. To more precisely characterize fat-tailed distributions, in this paper we adopt the notion of tail-index α [15] to parameterize fat-tailed and heavy-tailed distributions. More specifically, if the density of a random variable X's distribution decays with a power-law tail as 1/|x|^{α+1}, where α ∈ (0, 2], then α is called the tail-index. This α-parameter determines the behavior of the distribution: the smaller the α-value, the heavier the tail of the distribution.
2 We assume that all clients participate in the training at each communication round, but the results can be extended to the case with a (uniformly randomly sampled) subset of clients in each communication round [3, 7].
[Figure 1: Distributions of the norms of the pseudo-gradient noises (density vs. noise norm) computed with a CNN on the CIFAR-10 dataset in the i.i.d. case (top) and non-i.i.d. case (bottom). m = 100 clients participate in the training.]

[Figure 2: Estimation of α for the CIFAR-10 dataset (curves: Single SGD, Local Epoch = 1, 2, 5; x-axis: non-IID index p). The non-IID index p represents the data heterogeneity level, and p = 10 is the IID case. The smaller the p, the more heterogeneous the data across clients.]

[Figure 3: Catastrophic training failures happen when applying GFedAvg on the CIFAR-10 dataset, where the test accuracy experiences a sudden and dramatic drop and the pseudo-gradient norm increases substantially (test accuracy and gradient norm vs. communication round).]
Moreover, the α-parameter also determines the moments: E[|X|^r] < ∞ if and only if r < α, which implies that X has infinite variance when α < 2, i.e., it is fat-tailed.
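As a quick numerical illustration of this moment condition (a sketch on synthetic Pareto samples; the parameters are illustrative assumptions):

import numpy as np

rng = np.random.default_rng(0)
for alpha in (1.5, 2.5):                           # tail-index below vs. above 2
    x = rng.pareto(alpha, size=1_000_000) + 1.0    # Pareto tail, density ~ 1/x^(alpha+1)
    for n in (10**3, 10**4, 10**5, 10**6):
        print(alpha, n, x[:n].var())
# For alpha = 2.5 the running sample variance settles near its finite limit as n grows;
# for alpha = 1.5 it keeps drifting upward, reflecting an infinite variance.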
Next, we investigate the tail property of the model updates returned by clients in the GFedAvg algorithm. Due to the multiple local steps in the GFedAvg algorithm, we view the whole update vector Δ^i_t returned by each client, which we call the “pseudo-gradient,” as a random vector and then analyze its statistical properties. Note that in the special case with the number of local updates K = 1, Δ^i_t coincides with a single stochastic gradient of a random sample (i.e., Δ^i_t = ∇f_i(x_t, ξ_t)).
We study the mismatch between the “non-fat-tailed” condition (α = 2) and the empirical behavior of the stochastic pseudo-gradient noise. In Fig. 1, we illustrate the distributions of the norms of the stochastic pseudo-gradient noises computed with a convolutional neural network (CNN) on the CIFAR-10 dataset in both i.i.d. and non-i.i.d. client dataset settings. We can clearly observe that the non-i.i.d. case exhibits a rather fat-tailed behavior, where the pseudo-gradient norm can be as large as 1.6. Although the i.i.d. case appears to have a much lighter tail, our detailed analysis shows that it still exhibits a fat-tailed behavior. To see this, in Fig. 2, we estimate the α-value for the CIFAR-10 dataset in different scenarios: 1) different local update steps, and 2) different data heterogeneity. We use a parameter p to characterize the data heterogeneity level, with p = 10 corresponding to the i.i.d. case. The smaller the p, the more heterogeneous the data among clients. Fig. 2 shows that the α-value is smaller than 1.15 in all scenarios, and α increases as the non-i.i.d. index p increases (i.e., closer to the i.i.d. case). This implies that the stochastic pseudo-gradient noise is fat-tailed and the “fatness” increases as the clients’ data become more heterogeneous.
3) The Impacts of Fat-Tailed Noise on Federated Learning:
Next, we show that fat-tailed noise can lead to a “catastrophic model failure” (i.e., a sudden and dramatic drop of learning accuracy), consistent with previous observations in the FL literature [20]. To demonstrate this, we apply GFedAvg on the CIFAR-10 dataset and randomly sample five clients among m = 10 clients in each communication round. In Fig. 3, we illustrate a trial where a catastrophic training failure occurred. Correspondingly, we can observe in Fig. 3 a spike in the norm of the pseudo-gradient. This exceedingly large pseudo-gradient norm motivates us to apply the clipping technique to curtail the gradient updates. It is also worth noting that even if the squared norm of the stochastic gradient is not infinitely large in practice (i.e., it has bounded support empirically), it can still be large enough to cause catastrophic model failures. In fact, under fat-tailed noise, the FedAvg algorithm could diverge, which follows from the fact that there exists a function on which SGD diverges under heavy-tailed noise (see Remark 1 in [22]). As a result, the value returned by a single client might be exceedingly large, leading to divergence of FedAvg-type algorithms.
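To preview how clipping curtails such spikes, here is a per-round-clipped variant of the toy GFedAvg round sketched earlier (the clipping threshold lam is an illustrative assumption; this is only in the spirit of clipping each client's pseudo-gradient once per round, not the exact FAT-Clipping-PR specification given in Section 4).

import numpy as np

def gfedavg_round_clipped(x_t, client_grads, rng, lam=1.0, eta=1.0, eta_L=0.01, K=5):
    """One round in which each client's pseudo-gradient Delta_t^i is norm-clipped to lam before being sent."""
    deltas = []
    for grad_i in client_grads:
        x = x_t.copy()
        delta = np.zeros_like(x_t)
        for _ in range(K):
            g = grad_i(x, rng)
            x = x - eta_L * g
            delta += g
        norm = np.linalg.norm(delta)
        if norm > lam:                   # one clipping operation per client per round
            delta = delta * (lam / norm)
        deltas.append(delta)
    return x_t - eta * eta_L * np.mean(deltas, axis=0)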
It is worth pointing out that, although we have empirically shown heavy/fat-tailed noise in FL for the first time in this paper, we are by no means the only ones to have observed the heavy/fat-tailed noise phenomenon in learning. Previous works have also found the heavy/fat-tailed noise phenomenon in centralized training with SGD-type algorithms. For example, the work in [15] showed the heavy-tailed noise phenomenon while (centralized) training AlexNet on CIFAR-10. Here, we adopt a procedure similar to that in [15] to evaluate the tail index α of the noise norm distribution in FL.
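For completeness, the following is a sketch of a block-sum tail-index estimator in the spirit of estimators used for α-stable noise (it relies on the fact that a sum of K samples of an α-stable variable scales like K^{1/α}); the block size, the synthetic sanity check, and the specific estimator form are our illustrative assumptions, and the exact procedure in [15] may differ.

import numpy as np

def estimate_alpha(x, K=100):
    """Estimate the tail index: the gap between the mean log-magnitude of block sums
    and of single samples, divided by log K, estimates 1/alpha."""
    x = np.asarray(x, dtype=float)
    n = len(x) // K
    blocks = x[: n * K].reshape(n, K).sum(axis=1)
    inv_alpha = (np.log(np.abs(blocks)).mean() - np.log(np.abs(x[: n * K])).mean()) / np.log(K)
    return 1.0 / inv_alpha

# Sanity check on synthetic data with known tail indices.
rng = np.random.default_rng(0)
print(estimate_alpha(rng.standard_cauchy(1_000_000)))   # alpha-stable with alpha = 1, estimate ~ 1.0
print(estimate_alpha(rng.standard_normal(1_000_000)))   # Gaussian (alpha = 2), estimate ~ 2.0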