Kernel Normalized Convolutional Networks for
Privacy-Preserving Machine Learning
Reza Nasirigerdeh
Technical University of Munich
Klinikum rechts der Isar
Munich, Germany
Javad Torkzadehmahani
Azad University of Kerman
Kerman, Iran
Daniel Rueckert
Technical University of Munich
Klinikum rechts der Isar
Munich, Germany
Imperial College London
London, United Kingdom
Georgios Kaissis
Technical University of Munich
Helmholtz Zentrum Munich
Munich, Germany
Abstract—Normalization is an important but understudied
challenge in privacy-related application domains such as fed-
erated learning (FL), differential privacy (DP), and differentially
private federated learning (DP-FL). While the unsuitability of
batch normalization for these domains has already been shown,
the impact of other normalization methods on the performance
of federated or differentially private models is not well-known.
To address this, we draw a performance comparison among layer
normalization (LayerNorm), group normalization (GroupNorm),
and the recently proposed kernel normalization (KernelNorm)
in FL, DP, and DP-FL settings. Our results indicate LayerNorm
and GroupNorm provide no performance gain compared to the
baseline (i.e. no normalization) for shallow models in FL and DP.
They, on the other hand, considerably enhance the performance
of shallow models in DP-FL and deeper models in FL and DP.
KernelNorm, moreover, significantly outperforms its competitors
in terms of accuracy and convergence rate (or communication
efficiency) for both shallow and deeper models in all considered
learning environments. Given these key observations, we propose
a kernel normalized ResNet architecture called KNResNet-13 for
differentially private learning. Using the proposed architecture,
we provide new state-of-the-art accuracy values on the CIFAR-10
and Imagenette datasets, when trained from scratch.
Index Terms—Differential Privacy, Federated Learning, Kernel
Normalization, Group Normalization, Batch Normalization
I. INTRODUCTION
Deep convolutional neural networks (CNNs) are popular
in a diverse range of computer vision tasks, including image
classification [1]. Deep CNNs rely on large-scale datasets to
effectively train the model, which might be difficult to provide
in a centralized manner [2]. This is because datasets are often
distributed across different sites such as hospitals, and contain
sensitive data which cannot be transferred to a centralized
location due to privacy regulations [3]. Even if such datasets
become available, training algorithms can pose privacy risks
to the individuals participating in the dataset, leaking privacy-
sensitive information through the trained model [4]–[6].
To appear in the IEEE Conference on Secure and Trustworthy Machine
Learning (SaTML), February 2023.
Federated learning (FL) [7] addresses the large-scale data
availability challenge by enabling clients to jointly train a
global model under the coordination of a central server without
sharing their private data. Network communication, on the
other hand, emerges as a new challenge in federated environ-
ments, requiring a large number of communication rounds for
model convergence, and exchanging a large amount of traffic
in each round [8]. FL also reduces model utility (e.g. accuracy)
due to the non-IID (not independent and identically distributed)
nature of the data across the clients [9]. Finally, although FL
eliminates the requirement of data
sharing, it might still lead to privacy leakage, where the private
data of the clients can be reconstructed from the model updates
shared with the server [10]–[12].
Differential privacy (DP) [13] copes with the privacy chal-
lenge in both centralized and federated environments by in-
jecting random noise into the model gradients to limit the
information learnt about a particular sample in the dataset [14].
DP, however, adversely affects the model utility similar to FL
because of the injected noise. In general, there is a trade-off
between privacy and utility in DP, where stronger privacy leads
to lower utility [15].
Batch normalization (BatchNorm) [16] is the de facto nor-
malization layer in popular deep CNNs such as ResNets [17]
and DenseNets [18], which remarkably improves the model
convergence rate and accuracy in centralized training. Batch-
Norm, however, is not suitable for FL and DP settings. This
is because BatchNorm relies on the IID distribution of feature
values in the batch [16], which is not the case in federated
settings. Moreover, DP requires per-sample gradients to be
computed, which is impossible for batch-normalized CNNs
[14]. Batch-independent layers such as layer normalization
(LayerNorm) [19], group normalization (GroupNorm) [20],
and the recently proposed kernel normalization (KernelNorm)
[21] do not suffer from the BatchNorm’s limitations, and
therefore, are applicable to FL and DP.
arXiv:2210.00053v2 [cs.LG] 23 Nov 2022
Normalization challenge. The unsuitability of BatchNorm for
federated and differentially private learning poses a real
challenge in these environments. Unlike the
other challenges (i.e. utility, network communication, and
privacy), the normalization issue has remained understudied
in the context of FL and DP. Previous works [9], [22]
illustrate that GroupNorm outperforms BatchNorm in terms
of accuracy in federated settings. Likewise, GroupNorm also
delivers higher accuracy than LayerNorm in differentially
private learning [23]–[25]. Additionally, KernelNorm achieves
significantly higher accuracy and faster convergence rate com-
pared to LayerNorm and GroupNorm in both FL and DP
settings according to the original study [21].
However, the prior studies have not made a comparison
between different normalization layers and the NoNorm (no
normalization layer) case in the first place. Moreover, the
experimental evaluation regarding FL and DP environments
is limited in the original KernelNorm study [21], focusing on
a cross-silo federated setting (few clients with relatively large
datasets) [26] and a shallow model in DP. Finally, the perfor-
mance comparisons in the previous works do not consider dif-
ferentially private federated learning (DP-FL) settings. Given
that, two fundamental questions arise: (1) Do LayerNorm,
GroupNorm, and KernelNorm also deliver higher performance
than NoNorm in FL, DP, and DP-FL environments?, and (2)
Does KernelNorm still outperform other normalization layers
in cross-device FL (many clients with small datasets), in DP-
FL, and using deeper models in DP?
Key findings. We conduct extensive experiments using
the VGG-6 [27], ResNet-8 [21], PreactResNet-18 [28], and
DenseNet20×16 [18] models trained on the CIFAR-10/100
[29] and Imagenette [30] datasets in FL, DP, and DP-FL
settings to address those questions. The findings are as follows:
1) LayerNorm and GroupNorm do not necessarily out-
perform the NoNorm case for shallow models in FL
and DP settings. For instance, LayerNorm and Group-
Norm provide slightly lower accuracy and communica-
tion efficiency than NoNorm in the cross-silo federated
setting, where the shallow VGG-6 model is trained
on CIFAR-10. Similarly, LayerNorm and GroupNorm
achieve lower accuracy than NoNorm using the shallow
ResNet-8 model on CIFAR-10 in DP (Section III).
2) KernelNorm significantly outperforms NoNorm, Lay-
erNorm, and GroupNorm in terms of communication
efficiency (convergence rate) and accuracy in both cross-
silo and cross-device FL, with both shallow and deeper
models in DP, and using shallow models in DP-FL
environments (Section III).
Solution. Based on our findings, we advocate employing
KernelNorm as the effective normalization layer for FL, DP,
and DP-FL settings. Given that, we propose a KernelNorm-
based ResNet architecture called KNResNet-13, and show it
delivers considerably higher accuracy than the state-of-the-art
GroupNorm-based architectures on CIFAR-10 and Imagenette
in differentially private learning environments (Section IV).
Contributions. We make the following contributions: (I)
we show LayerNorm and GroupNorm do not deliver higher
accuracy than NoNorm with shallow models in FL and DP
settings, (II) we illustrate that the recently proposed KernelNorm
layer has great potential to become the de facto normalization
layer in privacy-preserving machine learning, and
(III) we propose the KNResNet-13 architecture, and provide
new state-of-the-art (SOTA) accuracy values on CIFAR-10 and
Imagenette using the proposed architecture in DP environ-
ments, when trained from scratch.
II. PRELIMINARIES
Federated learning (FL). A federated environment con-
sists of multiple clients as data holders and a central server
as the coordinator. FL is a privacy-enhancing technique that
enables the clients to train a global model without sharing their
private data with a third party. In FL, or more precisely in the
FederatedAveraging (FedAvg) algorithm [7], the server
randomly chooses $K$ clients, and sends them the global model
parameters $W^g_i$ in each communication round $i$. Next, each
selected client $j$ trains the global model on its local dataset
using mini-batch gradient descent, and shares the local model
parameters $W^l_{i,j}$ with the server. Finally, the server takes the
weighted average over the local parameters from the clients
to update the global model:

$$W^g_{i+1} = \frac{\sum_{j=1}^{K} N_j \cdot W^l_{i,j}}{\sum_{j=1}^{K} N_j},$$

where $N_j$ is the number of samples in client $j$.
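The weighted-average update above can be sketched in a few lines of NumPy. This is a minimal illustration, not the implementation used in the experiments, and all function and variable names are hypothetical:

```python
import numpy as np

def fedavg_aggregate(local_params, num_samples):
    """Weighted average of client parameter vectors (FedAvg update).

    local_params: list of 1-D arrays, one per selected client.
    num_samples:  list of sample counts N_j for the same clients.
    """
    weights = np.asarray(num_samples, dtype=float)
    stacked = np.stack(local_params)  # shape (K, num_params)
    # W^g_{i+1} = sum_j N_j * W^l_{i,j} / sum_j N_j
    return (weights[:, None] * stacked).sum(axis=0) / weights.sum()

# Example: two clients holding 100 and 300 samples contribute
# with weights 0.25 and 0.75, respectively.
w_global = fedavg_aggregate([np.array([1.0, 0.0]), np.array([0.0, 1.0])],
                            [100, 300])
# -> array([0.25, 0.75])
```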
A cross-device federated setting contains a large number of
clients such as mobile devices with small datasets [26]. The
server selects a fraction of clients in each round. Moreover,
the underlying assumption is that the communication between
clients and server is unstable, and the clients might drop out
during training. A cross-silo setting, on the other hand, consists
of a few clients, such as hospitals or research institutions, with
relatively large datasets and stable network connections [26].
All clients participate in model training in all communication
rounds. For more details on federated learning, the readers are
referred to [7] and [26].
Differential privacy (DP). The differential privacy ap-
proach provides a theoretical framework and collection of
techniques for privacy-preserving data processing and release
[13]. Its guarantees are formulated in an information-theoretic
fashion and describe the upper bound on the multiplicative
information gain of an adversary observing the output of a
computation over a sensitive database. This definition endows
DP with a robust theoretical underpinning and ascertains that
its guarantees hold in the presence of adversaries with un-
bounded prior knowledge and under infinite post-processing.
Moreover, DP guarantees are compositional, meaning that they
degrade predictably when a DP system is executed repeatedly
on the same database. Formally, a randomised mechanism M
is said to preserve (ε, δ)-DP if, for all databases Dand D0
differing in the data of one individual and all measurable
subsets Sof the range of M, the following inequality holds:
P(M(D)S)eεP(M(D0)S) + δ,
where Pis the probability of an event, ε0and 0δ < 1.
Of note, this inequality must hold also if Dand D0are
swapped. The guarantee is given over the randomness of M.
Intuitively, this characterisation implies that the output of the
mechanism should not change too much when one individual’s
data is added or removed from a database, or equivalently,
the influence of one individual’s data on the result of the
computation should be small.
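As a concrete illustration of this definition (a standard textbook example, not part of this work), the classical Laplace mechanism satisfies $\varepsilon$-DP (with $\delta = 0$) for a query of bounded sensitivity:

```python
import numpy as np

def laplace_mechanism(true_count, sensitivity, epsilon, rng):
    """Release a count with epsilon-DP (delta = 0) via Laplace noise.

    The noise scale sensitivity/epsilon follows from the definition:
    one individual's data moves the count by at most `sensitivity`,
    and the Laplace density ratio is then bounded by e^epsilon.
    """
    scale = sensitivity / epsilon
    return true_count + rng.laplace(loc=0.0, scale=scale)

rng = np.random.default_rng(0)
# A counting query has sensitivity 1: adding or removing one
# individual changes the count by at most 1.
noisy = laplace_mechanism(true_count=42, sensitivity=1.0,
                          epsilon=0.5, rng=rng)
```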
The application of DP to the training of neural networks is
usually (and in our work) based on the differentially private
stochastic gradient descent (DP-SGD) algorithm [14]. Here,
the role of the database is played by the individual (per-
sample) gradients of the loss function with respect to the
parameters. For the DP guarantee to be well-defined, the inter-
mediate layer outputs (activations), leading to the computation
of a per-sample gradient, are not allowed to be influenced by
more than one sample. Hence, layers like BatchNorm, which
normalize the activations of a layer by considering either
other samples in the batch or the statistics of previously seen
batches, cannot be employed in DP. We refer the readers to
[13], [14], [31] for more information on differential privacy.
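A minimal sketch of one DP-SGD step, assuming the per-sample gradients are already available as plain arrays (names are hypothetical; real implementations compute per-sample gradients inside an autodiff framework):

```python
import numpy as np

def dp_sgd_step(params, per_sample_grads, clip_norm,
                noise_multiplier, lr, rng):
    """One DP-SGD update: clip each per-sample gradient to norm
    `clip_norm`, sum the clipped gradients, add Gaussian noise with
    std noise_multiplier * clip_norm, average over the batch, and
    take a gradient step."""
    clipped = []
    for g in per_sample_grads:
        norm = np.linalg.norm(g)
        clipped.append(g * min(1.0, clip_norm / max(norm, 1e-12)))
    total = np.sum(clipped, axis=0)
    noise = rng.normal(0.0, noise_multiplier * clip_norm,
                       size=total.shape)
    noisy_mean = (total + noise) / len(per_sample_grads)
    return params - lr * noisy_mean

rng = np.random.default_rng(0)
params = np.zeros(3)
grads = [np.array([3.0, 4.0, 0.0]), np.array([0.0, 0.0, 10.0])]
# With noise_multiplier=0 (for illustration only): the gradients are
# clipped to [0.6, 0.8, 0] and [0, 0, 1], their mean is
# [0.3, 0.4, 0.5], so params move to [-0.3, -0.4, -0.5].
new_params = dp_sgd_step(params, grads, clip_norm=1.0,
                         noise_multiplier=0.0, lr=1.0, rng=rng)
```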
Differentially private federated learning (DP-FL). Al-
though FL enhances data privacy by eliminating the require-
ment of data sharing, the model parameters shared with
the server can still cause privacy leakage. To overcome this
problem, the clients can rely on DP to train the global model
on their local data, and share differentially private models with
the server. This way, the clients can benefit from the guarantees
of DP in federated environments.
Normalization. The normalization layers play a crucial role
in deep CNNs. They can smooth the optimization landscape
[32] and effectively address the problem of vanishing gradients
[33], leading to improved model performance. The normalization
layers differ from each other in their normalization unit: the
subset of elements from the original input that are normalized
together using the mean and variance of the unit [21]. Assume
that the input is a 4-dimensional tensor with
batch, channel, height, and width as dimensions. BatchNorm
[16] considers all elements in the batch, height, and width
dimensions as its normalization unit. LayerNorm [19], on the
other hand, performs normalization across all elements in the
channel, height, and width dimensions but separately for each
sample in the batch. The normalization unit of GroupNorm
[20] contains all elements in the height and width dimensions
similar to LayerNorm, but a subset of elements (specified by
the group size) in the channel dimension.
BatchNorm, LayerNorm, and GroupNorm are referred to as
global normalization layers because they consider all elements
in the height and width dimensions during normalization [34].
There is also a one-to-one correspondence between the input
and output elements in the aforementioned layers, implying
that they do not modify the input shape [21]. These layers
also have learnable scale and shift parameters to ensure that
the distributions of the input and output elements remain
similar [16]. In contrast to BatchNorm, LayerNorm and GroupNorm
are batch-independent because they perform normalization
separately for each sample in the batch.
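The differing normalization units can be summarized by which axes the statistics are computed over. The following NumPy sketch operates on an (N, C, H, W) tensor; it is illustrative only, omitting the learnable scale and shift parameters:

```python
import numpy as np

def normalize(x, axes, eps=1e-5):
    """Standardize x over the given axes (the normalization unit)."""
    mean = x.mean(axis=axes, keepdims=True)
    var = x.var(axis=axes, keepdims=True)
    return (x - mean) / np.sqrt(var + eps)

x = np.random.default_rng(0).normal(size=(8, 16, 32, 32))  # (N, C, H, W)

# BatchNorm: per channel, across batch, height, and width.
batch_norm = normalize(x, axes=(0, 2, 3))
# LayerNorm: per sample, across channel, height, and width.
layer_norm = normalize(x, axes=(1, 2, 3))
# GroupNorm: per sample, per group of channels (here 4 groups of 4).
groups = x.reshape(8, 4, 4, 32, 32)
group_norm = normalize(groups, axes=(2, 3, 4)).reshape(x.shape)
```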
KernelNorm [21] performs normalization along the chan-
nel, height, and width dimensions but independently of the
batch dimension akin to LayerNorm and GroupNorm. The
normalization unit of KernelNorm, however, is a tensor of
shape $(c, k_h, k_w)$, where $c$ is the number of input channels,
and $(k_h, k_w)$ is the kernel size. Thus, KernelNorm considers
all elements in the channel dimension but only a subset of
elements, specified by the kernel size, from the height and width
dimensions during normalization. In simple terms, KernelNorm is
similar to the pooling layers, except that it normalizes the
elements instead of computing their average or maximum, and
operates over all channels rather than a single channel.
Formally, KernelNorm (1) applies dropout to the original
normalization unit $U$ to obtain the dropped-out unit $U'$, (2)
calculates the mean and variance of $U'$, and (3) employs the
computed mean and variance to normalize $U$:

$$U' = D_p(U), \quad (1)$$

$$\mu_{u'} = \frac{1}{c \cdot k_h \cdot k_w} \sum_{i_c=1}^{c} \sum_{i_h=1}^{k_h} \sum_{i_w=1}^{k_w} U'(i_c, i_h, i_w),$$

$$\sigma^2_{u'} = \frac{1}{c \cdot k_h \cdot k_w} \sum_{i_c=1}^{c} \sum_{i_h=1}^{k_h} \sum_{i_w=1}^{k_w} \left( U'(i_c, i_h, i_w) - \mu_{u'} \right)^2, \quad (2)$$

$$\hat{U} = \frac{U - \mu_{u'}}{\sqrt{\sigma^2_{u'} + \epsilon}}, \quad (3)$$

where $p$ is the dropout [35] probability, $\mu_{u'}$ and $\sigma^2_{u'}$ are the
mean and variance of $U'$, respectively, $\epsilon$ is a small constant for
numerical stability, and $\hat{U}$ is the normalized unit. Partially
inspired by BatchNorm, KernelNorm introduces a regularizing
effect during training by normalizing the elements of the original
unit $U$ via the statistics calculated over the dropped-out unit $U'$.
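Steps (1)-(3) can be sketched for a single normalization unit as follows. This is an illustrative NumPy version under simplifying assumptions: it handles one unit rather than striding over the full input, and it uses a plain zero-mask for dropout (whether inverted-dropout rescaling of $U'$ is applied follows the original implementation and is not specified here):

```python
import numpy as np

def kernel_norm_unit(u, p, rng, eps=1e-5):
    """Normalize one KernelNorm unit of shape (c, kh, kw).

    (1) apply dropout with probability p to obtain u',
    (2) compute the mean and variance of u',
    (3) normalize the ORIGINAL unit u with those statistics.
    """
    mask = rng.random(u.shape) >= p      # keep each element w.p. 1 - p
    u_dropped = np.where(mask, u, 0.0)   # dropped-out unit u'
    mu = u_dropped.mean()
    var = u_dropped.var()
    return (u - mu) / np.sqrt(var + eps)

rng = np.random.default_rng(0)
unit = rng.normal(size=(16, 3, 3))       # c = 16, 3x3 kernel window
out = kernel_norm_unit(unit, p=0.1, rng=rng)
```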
KernelNorm is a local normalization layer. Moreover, it
has no learnable parameters, and its output might have a very
different shape from the input. Similar to LayerNorm and
GroupNorm, KernelNorm is batch-independent because it per-
forms normalization separately for each sample in the batch.
The kernel normalized convolutional (KNConv) layer [21] is
the combination of the KernelNorm and convolutional layer,
where the output of the former is given as input to the latter.
Modern CNNs are typically batch-normalized, leveraging the
BatchNorm and convolutional layers in their architectures. The
corresponding layer/group-normalized networks are obtained
by simply replacing BatchNorm with LayerNorm/GroupNorm.
The kernel-normalized counterparts [21], on the other hand,
employ the KernelNorm and KNConv layers as the main
building blocks, while forgoing the BatchNorm layers. For
more details on the normalization layers, the readers can see
[16], [19]–[21].