
[Figure 1: Distributions of the norms of the pseudo-gradient noises computed with a CNN on the CIFAR-10 dataset in the i.i.d. case (top) and the non-i.i.d. case (bottom); x-axis: noise norm, y-axis: density. m = 100 clients participate in the training.]
[Figure 2: Estimation of α for the CIFAR-10 dataset (x-axis: non-IID index p; y-axis: α; curves: Single SGD, Local Epoch = 1, 2, 5). The non-IID index p represents the data heterogeneity level, and p = 10 is the IID case. The smaller the p, the more heterogeneous the data across clients.]
[Figure 3: Catastrophic training failures happen when applying GFedAvg on the CIFAR-10 dataset: the test accuracy experiences a sudden and dramatic drop and the pseudo-gradient norm increases substantially (x-axis: communication round; y-axes: test accuracy and pseudo-gradient norm).]
of the distribution. Also, the $\alpha$-parameter determines the moments: $\mathbb{E}[|X|^r] < \infty$ if and only if $r < \alpha$, which implies that $X$ has infinite variance when $\alpha < 2$, i.e., it is fat-tailed.
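For intuition, this moment condition is easy to probe numerically: samples from a symmetric $\alpha$-stable law with $\alpha < 2$ have an empirical second moment that keeps growing with the sample size, unlike the Gaussian case $\alpha = 2$. The following is a minimal sketch (using SciPy's levy_stable sampler; the specific $\alpha$ values and sample sizes are chosen purely for illustration):

```python
import numpy as np
from scipy.stats import levy_stable

rng = np.random.default_rng(0)

# alpha = 2.0 is the Gaussian case; alpha < 2 is fat-tailed with E[X^2] = infinity.
for alpha in [2.0, 1.5, 1.1]:
    for n in [10_000, 200_000]:
        x = levy_stable.rvs(alpha, 0.0, size=n, random_state=rng)  # symmetric (beta = 0)
        # For alpha < 2 the empirical second moment does not stabilize as n grows.
        print(f"alpha={alpha:.1f}, n={n:>7}: mean(X^2) = {np.mean(x**2):.2f}")
```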
Next, we investigate the tail property of model updates returned by clients in the GFedAvg algorithm.
Due to multiple local steps in the GFedAvg algorithm, we view the whole update vector $\Delta_t^i$ returned by each client, which we call the "pseudo-gradient," as a random vector and then analyze its statistical properties. Note that in the special case where the number of local updates $K = 1$, $\Delta_t^i$ coincides with a single stochastic gradient of a random sample (i.e., $\Delta_t^i = \nabla f_i(x_t, \xi_t)$).
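To make this definition concrete, the sketch below shows how a client-side pseudo-gradient could be formed in a GFedAvg-style loop: the client runs $K$ local SGD steps starting from the global model $x_t$ and returns the resulting model difference. The model, loss, and data-loader objects are placeholders, and the exact scaling convention (e.g., whether the difference is divided by the local learning rate) is an assumption rather than the algorithm's specification:

```python
import copy
import torch

def client_pseudo_gradient(global_model, data_loader, loss_fn, K=5, lr=0.01):
    """Run K local SGD steps and return the pseudo-gradient Delta_t^i as a flat vector."""
    local_model = copy.deepcopy(global_model)
    opt = torch.optim.SGD(local_model.parameters(), lr=lr)
    batches = iter(data_loader)
    for _ in range(K):
        inputs, targets = next(batches)
        opt.zero_grad()
        loss_fn(local_model(inputs), targets).backward()
        opt.step()
    # Pseudo-gradient: difference between the global model and the locally updated model.
    # With K = 1 this equals lr * grad f_i(x_t, xi_t); dividing by lr recovers the
    # single stochastic gradient mentioned in the text.
    delta = [g.detach() - l.detach()
             for g, l in zip(global_model.parameters(), local_model.parameters())]
    return torch.cat([d.reshape(-1) for d in delta])
```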
We study the mismatch between the "non-fat-tailed" condition ($\alpha = 2$) and the empirical behavior of the stochastic pseudo-gradient noise. In Fig. 1, we illustrate the distributions of the norms of the stochastic pseudo-gradient noises computed with a convolutional neural network (CNN) on the CIFAR-10 dataset in both i.i.d. and non-i.i.d. client dataset settings. We can clearly observe that the non-i.i.d. case exhibits a rather fat-tailed behavior, where the pseudo-gradient noise norm can be as large as $1.6$. Although the i.i.d. case appears to have a much lighter tail, our detailed analysis shows that it still exhibits a fat-tailed behavior. To see this, in Fig. 2, we estimate the $\alpha$-value for the CIFAR-10 dataset in different scenarios: 1) different numbers of local update steps, and 2) different levels of data heterogeneity. We use a parameter $p$ to characterize the data heterogeneity level, with $p = 10$ corresponding to the i.i.d. case. The smaller the $p$, the more heterogeneous the data among clients. Fig. 2 shows that the $\alpha$-value is smaller than $1.15$ in all scenarios, and $\alpha$ increases as the non-i.i.d. index $p$ increases (i.e., closer to the i.i.d. case). This implies that the stochastic pseudo-gradient noise is fat-tailed and the "fatness" increases as the clients' data become more heterogeneous.
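The $\alpha$-values reported in Fig. 2 are obtained by fitting a tail index to the observed pseudo-gradient noise. As a minimal sketch, the function below implements one standard estimator for the tail index of roughly symmetric $\alpha$-stable samples (the block-sum log-moment estimator); the block size and the preprocessing of the noise samples are illustrative assumptions and may differ from the exact procedure used for the figure:

```python
import numpy as np

def estimate_alpha(samples, block_size=10):
    """Estimate the tail index alpha, assuming roughly symmetric alpha-stable samples.

    Block-sum (log-moment) estimator: 1/alpha is the gap between the average
    log-magnitude of block sums and that of individual samples, divided by log(block_size).
    """
    x = np.asarray(samples, dtype=float)
    n = (len(x) // block_size) * block_size
    x = x[:n]
    block_sums = x.reshape(-1, block_size).sum(axis=1)
    eps = 1e-12  # guard against log(0)
    inv_alpha = (np.mean(np.log(np.abs(block_sums) + eps))
                 - np.mean(np.log(np.abs(x) + eps))) / np.log(block_size)
    return 1.0 / inv_alpha
```

As a sanity check, feeding Gaussian samples into this estimator returns a value close to $2$, while samples drawn with a smaller $\alpha$ (e.g., from the earlier levy_stable snippet) yield estimates close to the generating $\alpha$.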
3) The Impacts of Fat-Tailed Noise on Federated Learning:
Next, we show that the fat-tailed
noise could lead to a “catastrophic model failure” (i.e., a sudden and dramatic drop of learning
accuracy), consistent with previous observations in the FL literature [20]. To demonstrate this, we
apply GFedAvg on the CIFAR-10 dataset and randomly sample five clients among
m= 10
clients
in each communication round. In Fig. 3, we illustrate a trial where a catastrophic training failure
occurred. Correspondingly, we can observe in Fig. 3 a spike in the norm of the pseudo-gradient. This
exceedingly large pseudo-gradient norm motivates us to apply the clipping technique to curtail the
gradient updates. It is also worth noting that even though the squared norm of the stochastic gradient may not be infinite in practice (i.e., it has bounded support empirically), it can still be large enough to cause catastrophic model failures. In fact, under fat-tailed noise, the FedAvg algorithm could diverge, which follows from the fact that there exists a function on which SGD diverges under heavy-tailed noise (see Remark 1 in [22]). As a result, the update returned by one client might be exceedingly large, leading to divergence of FedAvg-type algorithms.
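The clipping operation alluded to above is conceptually simple: before aggregation, each client's pseudo-gradient is rescaled so that its norm does not exceed a threshold, which bounds the influence of any single fat-tailed update. A minimal sketch (the threshold value and the per-client placement of the clipping step are illustrative assumptions, not the specific scheme analyzed in this paper):

```python
import torch

def clip_update(delta, threshold=1.0):
    """Rescale a pseudo-gradient so that its L2 norm is at most `threshold`."""
    scale = torch.clamp(threshold / (torch.linalg.norm(delta) + 1e-12), max=1.0)
    return delta * scale

def aggregate_clipped(client_deltas, threshold=1.0):
    """Server step: average the clipped pseudo-gradients of the sampled clients."""
    return torch.stack([clip_update(d, threshold) for d in client_deltas]).mean(dim=0)
```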
It is worth pointing out that, although we have empirically shown heavy/fat-tailed noise in FL for the first time in this paper, we are by no means the only ones to have observed the heavy/fat-tailed noise phenomenon in learning. Previous works have also found heavy/fat-tailed noise in centralized training with SGD-type algorithms. For example, the work in [15] showed the heavy-tailed noise phenomenon in (centralized) training of AlexNet on CIFAR-10. Here, we adopt a procedure similar to that in [15] to evaluate the tail index $\alpha$ of the noise norm distribution in