
Normalization challenge. The unsuitability of BatchNorm for
federated and differentially private learning has posed a
real challenge in these environments. Unlike the
other challenges (i.e., utility, network communication, and
privacy), the normalization issue has remained understudied
in the context of FL and DP. Previous works [9], [22]
illustrate that GroupNorm outperforms BatchNorm in terms
of accuracy in federated settings. Likewise, GroupNorm also
delivers higher accuracy than LayerNorm in differentially
private learning [23]–[25]. Additionally, KernelNorm achieves
significantly higher accuracy and a faster convergence rate
compared to LayerNorm and GroupNorm in both FL and DP
settings according to the original study [21].
However, the prior studies did not compare the different
normalization layers against the NoNorm (no normalization
layer) baseline in the first place. Moreover, the experimental
evaluation of FL and DP environments in the original
KernelNorm study [21] is limited, focusing on a cross-silo
federated setting (few clients with relatively large datasets) [26]
and a shallow model in DP. Finally, the performance
comparisons in the previous works do not consider
differentially private federated learning (DP-FL) settings. Given
that, two fundamental questions arise: (1) Do LayerNorm,
GroupNorm, and KernelNorm also deliver higher performance
than NoNorm in FL, DP, and DP-FL environments? (2)
Does KernelNorm still outperform the other normalization layers
in cross-device FL (many clients with small datasets), in DP-FL,
and with deeper models in DP?
Key findings. We conduct extensive experiments using
the VGG-6 [27], ResNet-8 [21], PreactResNet-18 [28], and
DenseNet20×16 [18] models trained on the CIFAR-10/100
[29] and Imagenette [30] datasets in FL, DP, and DP-FL
settings to address those questions. The findings are as follows:
1) LayerNorm and GroupNorm do not necessarily outperform
the NoNorm case for shallow models in FL
and DP settings. For instance, LayerNorm and GroupNorm
provide slightly lower accuracy and communication
efficiency than NoNorm in the cross-silo federated
setting, where the shallow VGG-6 model is trained
on CIFAR-10. Similarly, LayerNorm and GroupNorm
achieve lower accuracy than NoNorm using the shallow
ResNet-8 model on CIFAR-10 in DP (Section III).
2) KernelNorm significantly outperforms NoNorm, LayerNorm,
and GroupNorm in terms of communication
efficiency (convergence rate) and accuracy in both cross-silo
and cross-device FL, with both shallow and deeper
models in DP, and using shallow models in DP-FL
environments (Section III).
Solution. Based on our findings, we advocate employing
KernelNorm as the effective normalization layer for FL, DP,
and DP-FL settings. Accordingly, we propose a KernelNorm-based
ResNet architecture called KNResNet-13, and show that it
delivers considerably higher accuracy than the state-of-the-art
GroupNorm-based architectures on CIFAR-10 and Imagenette
in differentially private learning environments (Section IV).
Contributions. We make the following contributions: (I)
we show that LayerNorm and GroupNorm do not deliver higher
accuracy than NoNorm with shallow models in FL and DP
settings, (II) we illustrate that the recently proposed KernelNorm
layer has great potential to become the de facto normalization
layer in privacy-enhancing/preserving machine learning, and
(III) we propose the KNResNet-13 architecture and provide
new state-of-the-art (SOTA) accuracy values on CIFAR-10 and
Imagenette using the proposed architecture in DP environments,
when trained from scratch.
II. PRELIMINARIES
Federated learning (FL). A federated environment con-
sists of multiple clients as data holders and a central server
as coordinator. FL is a privacy-enhancing technique, which
enables the clients to train a global model without sharing their
private data with a third party. In FL, or more precisely in the
FederatedAveraging (FedAvg) algorithm [7], the server
randomly chooses K clients, and sends them the global model
parameters $W^g_i$ in each communication round $i$. Next, each
selected client $j$ trains the global model on its local dataset
using mini-batch gradient descent, and shares the local model
parameters $W^l_{i,j}$ with the server. Finally, the server takes the
weighted average over the local parameters from the clients
to update the global model:

$$W^g_{i+1} = \frac{\sum_{j=1}^{K} N_j \cdot W^l_{i,j}}{\sum_{j=1}^{K} N_j},$$

where $N_j$ is the number of samples in client $j$.
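The server-side weighted update above can be sketched as follows. This is a minimal illustration, not the authors' implementation; `fedavg_update`, `local_params`, and `num_samples` are hypothetical names for the aggregation step, the per-client parameters $W^l_{i,j}$, and the dataset sizes $N_j$, and parameters are flattened to plain lists for clarity (real systems use per-layer tensors):

```python
def fedavg_update(local_params, num_samples):
    """FedAvg server update: sum_j N_j * W_j / sum_j N_j.

    local_params: list of per-client parameter vectors (lists of floats).
    num_samples:  list of per-client dataset sizes N_j.
    """
    total = sum(num_samples)
    dim = len(local_params[0])
    global_params = [0.0] * dim
    for params, n in zip(local_params, num_samples):
        for k in range(dim):
            # Each client's contribution is weighted by its share of the data.
            global_params[k] += (n / total) * params[k]
    return global_params

# Two clients with 10 and 30 samples: the larger client dominates the average.
w = fedavg_update([[1.0, 0.0], [2.0, 4.0]], [10, 30])
print(w)  # [1.75, 3.0]
```

Weighting by $N_j$ rather than averaging uniformly keeps the update unbiased when clients hold datasets of very different sizes.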
A cross-device federated setting contains a large number of
clients, such as mobile devices, with small datasets [26]. The
server selects a fraction of the clients in each round. Moreover,
the underlying assumption is that the communication between
clients and server is unstable, and clients might drop out
during training. A cross-silo setting, on the other hand, consists
of few clients, such as hospitals or research institutions, with
relatively large datasets and stable network connections [26].
All clients participate in model training in all communication
rounds. For more details on federated learning, readers are
referred to [7] and [26].
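The practical difference between the two settings shows up in the server's client-selection step, which can be sketched as below. The function name and the fraction value are illustrative assumptions, not taken from the paper:

```python
import random

def select_clients(client_ids, fraction, seed=None):
    """Sample the clients participating in one communication round.

    Cross-device: fraction < 1 (a subset of many clients per round).
    Cross-silo:   fraction = 1.0 (all clients, every round).
    """
    rng = random.Random(seed)
    k = max(1, int(fraction * len(client_ids)))
    return rng.sample(client_ids, k)

clients = list(range(100))  # e.g. 100 mobile devices (cross-device)
round_clients = select_clients(clients, fraction=0.1, seed=42)
print(len(round_clients))   # 10 clients participate in this round
```

In a real cross-device deployment the sampled clients may additionally drop out mid-round, so the server typically aggregates over whichever subset actually reports back.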
Differential privacy (DP). The differential privacy ap-
proach provides a theoretical framework and collection of
techniques for privacy-preserving data processing and release
[13]. Its guarantees are formulated in an information-theoretic
fashion and describe the upper bound on the multiplicative
information gain of an adversary observing the output of a
computation over a sensitive database. This definition endows
DP with a robust theoretical underpinning and ascertains that
its guarantees hold in the presence of adversaries with un-
bounded prior knowledge and under infinite post-processing.
Moreover, DP guarantees are compositional, meaning that they
degrade predictably when a DP system is executed repeatedly
on the same database. Formally, a randomised mechanism $M$
is said to preserve $(\varepsilon, \delta)$-DP if, for all databases $D$ and $D'$