
Figure 1: Internal and external covariate shift (centralized training with BatchNorm vs. local training and aggregation across Device 1, Device 2, ..., Device N).
Internal covariate shift has been well studied in centralized learning scenarios, and an effective approach to mitigating it is batch normalization. Furthermore, batch normalization has many desirable properties that stabilize the training process, which have been exploited by previous work. In FL systems, participating devices perform several batches of local training in each communication round; thus, internal covariate shift is also a concern for local training. In FL, the updates of model parameters vary across devices during local training. Without any constraints, the internal covariate shift therefore also varies across devices, leading to gaps in the statistics of the same channel among different devices. We name this phenomenon, which is unique to FL, external covariate shift. Due to external covariate shift, the model neurons of a given channel on one device need to adapt to the feature distribution of the same channel on other devices, which slows down the convergence of global model training. Furthermore, external covariate shift may lead to large discrepancies in the norms of weights and may obliterate the contributions from devices whose weights have small norms.
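As a minimal illustration of this gap (a sketch with purely hypothetical distributions and device counts, not taken from our experiments), the following NumPy snippet simulates how the per-channel batch statistics used by batch normalization can drift apart across two devices with different local data:

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical example: two devices observe activations of the *same*
# channel, but non-IID local data and independent local updates push their
# feature distributions apart.
device_1_acts = rng.normal(loc=0.0, scale=1.0, size=256)   # channel c on device 1
device_2_acts = rng.normal(loc=1.5, scale=2.0, size=256)   # same channel c on device 2

mu_1, var_1 = device_1_acts.mean(), device_1_acts.var()
mu_2, var_2 = device_2_acts.mean(), device_2_acts.var()

# Batch normalization on each device normalizes with its *local* statistics,
# so the same channel is aligned to a different reference on each device.
# The gap between (mu_1, var_1) and (mu_2, var_2) is the external covariate
# shift that the aggregated global model has to reconcile.
print(f"device 1: mean={mu_1:.2f}, var={var_1:.2f}")
print(f"device 2: mean={mu_2:.2f}, var={var_2:.2f}")
```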
We show in this paper that the desirable properties of normalization shed light on solving external covariate shift. However, existing works Li et al. (2021); Hsieh et al. (2020) show that batch normalization incurs an accuracy drop of the global model in FL. These works simply attribute the failure of batch normalization in FL to the discrepancies of local data distributions across devices. In this work, our key observation is that the ineffectiveness of batch normalization in FL is caused not only by the data distribution discrepancies, but also by the diverging internal covariate shift among different devices due to the stochastic training process. Batch normalization drops the accuracy of the global model when applied to solve external covariate shift because the feature distribution of the global model after aggregation is not predictable. Furthermore, we show that layer normalization does not suffer from this problem and can serve as a replacement for batch normalization in FL.
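The snippet below is a minimal PyTorch sketch of such a replacement (the conv_block helper, channel sizes, and input shape are illustrative assumptions, not the exact architectures used in our experiments). It uses GroupNorm with a single group, the common way to apply layer-style normalization to CNN feature maps, so that normalization depends only on each sample rather than on per-device batch statistics:

```python
import torch
import torch.nn as nn

def conv_block(in_ch: int, out_ch: int, use_layer_norm: bool = True) -> nn.Sequential:
    """Hypothetical convolutional block for local FL training.

    With use_layer_norm=True, BatchNorm2d is replaced by GroupNorm(1, out_ch),
    which normalizes over the channel and spatial dimensions of each sample
    (i.e., layer-style normalization for CNN features) and therefore does not
    depend on per-device batch statistics.
    """
    norm = nn.GroupNorm(1, out_ch) if use_layer_norm else nn.BatchNorm2d(out_ch)
    return nn.Sequential(
        nn.Conv2d(in_ch, out_ch, kernel_size=3, padding=1, bias=False),
        norm,
        nn.ReLU(inplace=True),
    )

# Per-sample normalization keeps the normalized statistics of a given channel
# independent of which device's mini-batch produced the activations.
x = torch.randn(8, 3, 32, 32)   # one local mini-batch
y = conv_block(3, 16)(x)
print(y.shape)                  # torch.Size([8, 16, 32, 32])
```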
The experimental results demonstrate that layer normalization effectively mitigates external covariate shift and speeds up the convergence of global model training. In particular, layer normalization achieves the fastest convergence and the best or comparable accuracy upon convergence across three different model architectures.
Our key contributions are summarized as follows:
• To the best of our knowledge, this is the first work to explicitly reveal external covariate shift in FL, an important issue that affects the convergence of FL training.
• We propose a simple yet effective replacement for batch normalization in Federated Learning, i.e., layer normalization, which effectively mitigates external covariate shift and speeds up the convergence of FL training.
2 Preliminaries
2.1 Internal covariate shift and Activation normalization
In the training of deep neural networks, each layer’s input distribution keeps changing due to updates of parameters in
the preceding layer. Consequently, layers are forced to keep adapting to the varying input distributions, leading to slow