deep learning. The study of practical implementations of DP-SGD under more realistic and specific assumptions, either asymptotic or non-asymptotic, remains active and demanding [13], [15], [17], [19]. Much research effort has been dedicated to the following two fundamental questions. First, provided that we do not need to publish the intermediate computation results, how conservative is the privacy claim offered by DP-SGD? Second, in a practical implementation, how should one properly select the training model and the hyper-parameters?
Regarding the first question, many prior works [4], [20], [21] have tried to empirically simulate what an adversary can infer from models trained by DP-SGD. In particular, [20] examined the respective power of “black-box” and “white-box” adversaries, and suggested that a substantial gap may exist between the DP-SGD privacy bound and the actual privacy guarantee. Unfortunately, beyond DP-SGD, there are few known ways to produce, let alone improve, a rigorous DP analysis for a training process. Most existing analyses need to assume either access to additional public data [13], [22], [23] or strongly convex loss functions that enable objective perturbation [24]. Thus, for general applications, we still have to adopt the conservative DP-SGD analysis to obtain worst-case DP guarantees.
The second question is of particular interest to practitioners. Implementing DP-SGD is tricky, as its performance is highly sensitive to the choice of training model and hyper-parameters. The lack of theoretical analysis of gradient clipping makes it hard to select good parameters or optimize model architectures in a principled way, though many heuristic observations and optimizations of these choices have been reported. [16], [19], [25] showed how to find proper model architectures that balance learning capability and utility loss for a given dataset.¹
Recent work [27] also demonstrated empirical improvements from selecting the clipping threshold adaptively. However, even with these efforts, there is still a long way to go before large neural networks can be trained in practice with rigorous and usable privacy guarantees. The biggest bottlenecks include
• the huge model dimension, which may even exceed the size of the training dataset, while the magnitude of noise required for gradient perturbation is proportional to the square root of the model dimension (see the sketch after this list), and
• the long convergence time, which implies a massive composition of privacy loss and also forces DP-SGD to add formidable noise, resulting in intolerable accuracy loss.
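To make the first bottleneck concrete, the following minimal numerical sketch (with illustrative values of the clipping threshold C and noise multiplier sigma, not taken from any specific experiment) compares the l2 norm of the per-step Gaussian noise, which concentrates around sigma * C * sqrt(d), against the per-sample signal, whose norm is bounded by C after clipping:

import numpy as np

rng = np.random.default_rng(0)
C = 1.0        # clipping threshold (illustrative)
sigma = 1.0    # noise multiplier (illustrative)

for d in (10**3, 10**5, 10**7):                 # model dimension
    noise = rng.normal(0.0, sigma * C, size=d)  # per-step Gaussian perturbation
    print(f"d={d:>8}: ||noise||_2 ~ {np.linalg.norm(noise):8.1f}"
          f"  vs per-sample gradient norm <= C = {C}")

For d in the millions, the perturbation added in a single step is already thousands of times larger in norm than any individual clipped gradient, and the second bottleneck, composition over many iterations, only compounds this.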
To this end, Tramer and Boneh [19] argued that, within the current framework, for most medium-sized datasets (< 500K datapoints) such as CIFAR10/100 and MNIST, the utility loss caused by DP-SGD offsets the powerful learning capacity offered by deep models. Therefore, simple linear models usually outperform modern Convolutional Neural Networks (CNNs) on these datasets. How to privately train a model while still enjoying the state-of-the-art success of modern machine learning is one of the key problems in the application of DP.

1. Most prior works report the best model and parameter selection obtained by grid search, where the private data is reused multiple times and the selection of parameters is itself sensitive. This additional privacy leakage, partially determined by the prior knowledge of the training data, is in general very hard to quantify, though in practice it might be small [19], [26].
1.1. Our Strategy and Results
In this paper, we set out to provide a systematic study of DP-SGD, from both theoretical and empirical perspectives, to understand its two important but artificial operations: (1) iterative gradient perturbation and (2) gradient clipping. We propose a generic technique, ModelMix, to significantly sharpen the utility-privacy tradeoff. Our theoretical analysis also provides guidance on how to select the clipping parameter and how to quantify the privacy amplification from other sources of practical randomness.
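For reference, the following is a minimal sketch of one step of standard DP-SGD showing these two operations (it is not the ModelMix variant proposed in this paper; the clipping threshold C, noise multiplier sigma, and learning rate eta are placeholder values):

import numpy as np

def dp_sgd_step(w, per_sample_grads, C=1.0, sigma=1.0, eta=0.1, rng=None):
    # w: parameter vector; per_sample_grads: list of gradients with the same shape as w
    rng = rng or np.random.default_rng()
    # (2) gradient clipping: rescale each per-sample gradient to l2 norm at most C
    clipped = [g / max(1.0, np.linalg.norm(g) / C) for g in per_sample_grads]
    # (1) gradient perturbation: add Gaussian noise calibrated to the sensitivity C
    noisy_sum = np.sum(clipped, axis=0) + rng.normal(0.0, sigma * C, size=w.shape)
    return w - eta * noisy_sum / len(per_sample_grads)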
We will stick to the worst-case DP guarantee without any relaxation, but view the private iterative optimization process from a different angle. In most practical deep learning tasks, with proper use of randomness, we have a good chance of finding some reasonable (local) minimum via SGD regardless of the initialization (starting point) and the subsampling [28]. In particular, for convex optimization, we are guaranteed to approach the global optimum with a proper step size. In other words, there are infinitely many potential training trajectories² pointing to some (local) minimum with good generalization, and we are free to use any one of them to find a good model. Thus, even without DP perturbation, the training trajectory carries potential entropy if we are allowed to select it at random.
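As a toy illustration of this point (an assumed example, not one used in the paper), consider minimizing f(w) = (||w||² - 1)², whose global minima form an entire circle; gradient descent from different random initializations follows different trajectories, yet each one reaches a global minimum:

import numpy as np

def grad(w):
    # gradient of f(w) = (||w||^2 - 1)^2
    return 4.0 * (np.dot(w, w) - 1.0) * w

rng = np.random.default_rng(0)
for run in range(3):
    w = rng.normal(size=2)            # random starting point
    for _ in range(300):
        w = w - 0.02 * grad(w)        # plain (noise-free) gradient descent
    print(f"run {run}: converged to {np.round(w, 3)}, loss = {(w @ w - 1.0) ** 2:.1e}")

Each run ends at a different point on the circle of minima with essentially zero loss, so the trajectory itself is already a source of entropy before any DP noise is injected.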
From this standpoint, a slow convergence rate when training a large model is not always bad news for privacy. This might seem counter-intuitive, but, in general, slow convergence means that the intermediate updates wander around a relatively large domain for a longer time before entering a satisfactory neighborhood of a (global/local) optimum. Training a larger model may produce a fuzzier and more complicated convergence process, which could compensate for the larger privacy loss composition incurred by DP-SGD. We stress that our ultimate goal is to privately publish a good model; DP-SGD with exposed updates is merely a tool for finding a trajectory with analyzable privacy leakage.
The above observation inspires a way to obtain a better DP guarantee even under the conservative “white-box” adversary model: can we utilize the potential entropy of the training trajectory, while still bounding the sensitivity, to produce rigorous DP guarantees?
To be specific, unlike standard DP-SGD, which randomizes one particular trajectory with noise, we aim to privately construct an envelope of training trajectories, spanned by the many trajectories converging to some (global/local) minimum, and randomly generate one trajectory from it to amplify privacy. To achieve this, we must carefully consider the tradeoff between (1) controlling the worst-case sensitivity
2. In the following, we use training trajectory to refer to the sequence of intermediate updates produced by SGD.