
Normalization challenge. The unsuitability of BatchNorm for
federated and differentially private learning has posed a
real challenge in these environments. Unlike the
other challenges (i.e., utility, network communication, and
privacy), the normalization issue has remained understudied
in the context of FL and DP. Previous works [9], [22]
illustrate that GroupNorm outperforms BatchNorm in terms
of accuracy in federated settings. Likewise, GroupNorm also
delivers higher accuracy than LayerNorm in differentially
private learning [23]–[25]. Additionally, KernelNorm achieves
significantly higher accuracy and a faster convergence rate
compared to LayerNorm and GroupNorm in both FL and DP
settings according to the original study [21].
However, the prior studies did not compare the different
normalization layers against the NoNorm (no normalization
layer) baseline in the first place. Moreover, the experimental
evaluation of FL and DP environments in the original
KernelNorm study [21] is limited, focusing on a cross-silo
federated setting (few clients with relatively large datasets) [26]
and a shallow model in DP. Finally, the performance
comparisons in the previous works do not consider
differentially private federated learning (DP-FL) settings. Given
that, two fundamental questions arise: (1) Do LayerNorm,
GroupNorm, and KernelNorm also deliver higher performance
than NoNorm in FL, DP, and DP-FL environments? (2)
Does KernelNorm still outperform the other normalization layers
in cross-device FL (many clients with small datasets), in DP-FL,
and with deeper models in DP?
Key findings. We conduct extensive experiments using
the VGG-6 [27], ResNet-8 [21], PreactResNet-18 [28], and
DenseNet20×16 [18] models trained on the CIFAR-10/100
[29] and Imagenette [30] datasets in FL, DP, and DP-FL
settings to address those questions. The findings are as follows:
1) LayerNorm and GroupNorm do not necessarily outperform
the NoNorm case for shallow models in FL
and DP settings. For instance, LayerNorm and GroupNorm
provide slightly lower accuracy and communication
efficiency than NoNorm in the cross-silo federated
setting, where the shallow VGG-6 model is trained
on CIFAR-10. Similarly, LayerNorm and GroupNorm
achieve lower accuracy than NoNorm using the shallow
ResNet-8 model on CIFAR-10 in DP (Section III).
2) KernelNorm significantly outperforms NoNorm, LayerNorm,
and GroupNorm in terms of communication
efficiency (convergence rate) and accuracy in both cross-silo
and cross-device FL, with both shallow and deeper
models in DP, and using shallow models in DP-FL
environments (Section III).
Solution. Based on our findings, we advocate employing
KernelNorm as the effective normalization layer for FL, DP,
and DP-FL settings. Accordingly, we propose a KernelNorm-based
ResNet architecture called KNResNet-13, and show that it
delivers considerably higher accuracy than the state-of-the-art
GroupNorm-based architectures on CIFAR-10 and Imagenette
in differentially private learning environments (Section IV).
Contributions. We make the following contributions: (I)
we show that LayerNorm and GroupNorm do not deliver higher
accuracy than NoNorm with shallow models in FL and DP
settings, (II) we illustrate that the recently proposed KernelNorm
layer has great potential to become the de facto normalization
layer in privacy-enhancing/preserving machine learning, and
(III) we propose the KNResNet-13 architecture and provide
new state-of-the-art (SOTA) accuracy values on CIFAR-10 and
Imagenette using the proposed architecture in DP environments,
when trained from scratch.
II. PRELIMINARIES
Federated learning (FL). A federated environment con-
sists of multiple clients as data holders and a central server
as coordinator. FL is a privacy-enhancing technique, which
enables the clients to train a global model without sharing their
private data with a third party. In FL, or more precisely in the
FederatedAveraging (FedAvg) algorithm [7], the server
randomly chooses K clients, and sends them the global model
parameters $W^g_i$ in each communication round $i$. Next, each
selected client $j$ trains the global model on its local dataset
using mini-batch gradient descent, and shares the local model
parameters $W^l_{i,j}$ with the server. Finally, the server takes the
weighted average over the local parameters from the clients
to update the global model:

$$W^g_{i+1} = \frac{\sum_{j=1}^{K} N_j \cdot W^l_{i,j}}{\sum_{j=1}^{K} N_j},$$

where $N_j$ is the number of samples in client $j$.
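The server-side weighted update above can be sketched as follows. This is a minimal illustration, not the authors' implementation; `fedavg_update`, `local_params`, and `num_samples` are hypothetical names for the aggregation step, the per-client parameters $W^l_{i,j}$, and the dataset sizes $N_j$, and parameters are flattened to plain lists for clarity (real systems use per-layer tensors):

```python
def fedavg_update(local_params, num_samples):
    """FedAvg server update: sum_j N_j * W_j / sum_j N_j.

    local_params: list of per-client parameter vectors (lists of floats).
    num_samples:  list of per-client dataset sizes N_j.
    """
    total = sum(num_samples)
    dim = len(local_params[0])
    global_params = [0.0] * dim
    for params, n in zip(local_params, num_samples):
        for k in range(dim):
            # Each client's contribution is weighted by its share of the data.
            global_params[k] += (n / total) * params[k]
    return global_params

# Two clients with 10 and 30 samples: the larger client dominates the average.
w = fedavg_update([[1.0, 0.0], [2.0, 4.0]], [10, 30])
print(w)  # [1.75, 3.0]
```

Weighting by $N_j$ rather than averaging uniformly keeps the update unbiased when clients hold datasets of very different sizes.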
A cross-device federated setting contains a large number of
clients, such as mobile devices, with small datasets [26]. The
server selects a fraction of the clients in each round. Moreover,
the underlying assumption is that the communication between
clients and server is unstable, and clients might drop out
during training. A cross-silo setting, on the other hand, consists
of few clients, such as hospitals or research institutions, with
relatively large datasets and stable network connections [26].
All clients participate in model training in all communication
rounds. For more details on federated learning, readers are
referred to [7] and [26].
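The practical difference between the two settings shows up in the server's client-selection step, which can be sketched as below. The function name and the fraction value are illustrative assumptions, not taken from the paper:

```python
import random

def select_clients(client_ids, fraction, seed=None):
    """Sample the clients participating in one communication round.

    Cross-device: fraction < 1 (a subset of many clients per round).
    Cross-silo:   fraction = 1.0 (all clients, every round).
    """
    rng = random.Random(seed)
    k = max(1, int(fraction * len(client_ids)))
    return rng.sample(client_ids, k)

clients = list(range(100))  # e.g. 100 mobile devices (cross-device)
round_clients = select_clients(clients, fraction=0.1, seed=42)
print(len(round_clients))   # 10 clients participate in this round
```

In a real cross-device deployment the sampled clients may additionally drop out mid-round, so the server typically aggregates over whichever subset actually reports back.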
Differential privacy (DP). The differential privacy ap-
proach provides a theoretical framework and collection of
techniques for privacy-preserving data processing and release
[13]. Its guarantees are formulated in an information-theoretic
fashion and describe the upper bound on the multiplicative
information gain of an adversary observing the output of a
computation over a sensitive database. This definition endows
DP with a robust theoretical underpinning and ascertains that
its guarantees hold in the presence of adversaries with un-
bounded prior knowledge and under infinite post-processing.
Moreover, DP guarantees are compositional, meaning that they
degrade predictably when a DP system is executed repeatedly
on the same database. Formally, a randomised mechanism $M$
is said to preserve $(\varepsilon, \delta)$-DP if, for all databases $D$ and $D'$