Differentially Private Deep Learning with ModelMix
Hanshen Xiao, Jun Wan, and Srinivas Devadas
MIT
{hsxiao, junwan, devadas}@mit.edu
Abstract—Training large neural networks with meaningful/usable differential privacy security guarantees is a demanding challenge. In this paper, we tackle this problem by revisiting the two key operations in Differentially Private Stochastic Gradient Descent (DP-SGD): 1) iterative perturbation and 2) gradient clipping. We propose a generic optimization framework, called ModelMix, which performs random aggregation of intermediate model states. It strengthens the composite privacy analysis utilizing the entropy of the training trajectory and improves the $(\epsilon, \delta)$-DP security parameters by an order of magnitude.

We provide rigorous analyses for both the utility guarantees and privacy amplification of ModelMix. In particular, we present a formal study on the effect of gradient clipping in DP-SGD, which provides theoretical instruction on how hyper-parameters should be selected. We also introduce a refined gradient clipping method, which can further sharpen the privacy loss in private learning when combined with ModelMix.

Thorough experiments with significant privacy/utility improvement are presented to support our theory. We train a Resnet-20 network on CIFAR10 with 70.4% accuracy via ModelMix given an $(\epsilon = 8, \delta = 10^{-5})$-DP budget, compared to the same performance but with $(\epsilon = 145.8, \delta = 10^{-5})$ using regular DP-SGD; assisted with additional public low-dimensional gradient embedding, one can further improve the accuracy to 79.1% with an $(\epsilon = 6.1, \delta = 10^{-5})$-DP budget, compared to the same performance but with $(\epsilon = 111.2, \delta = 10^{-5})$ without ModelMix.

Index Terms—Differential Privacy; Rényi Differential Privacy; Clipped Stochastic Gradient Descent; Deep Learning
1. Introduction

Privacy concerns when learning with sensitive data are receiving increasing attention. Many practical attacks have shown that, without proper protection, the model's parameters [1], [2], leakage from gradients during training [3], or just observations of the prediction results [4] may enable an adversary to successfully distinguish and even reconstruct the private samples used for learning. As an emergent canonical definition, Differential Privacy (DP) [5], [6] provides a semantic privacy metric to quantify how hard it is for an adversary to infer the participation of an individual in an aggregate statistic. As one of the most popular approaches, Differentially-Private Stochastic Gradient Descent (DP-SGD) [7], [8] and its variants [9], [10], [11], [12], [13], [14] have been widely studied over the last decade. DP-SGD can be applied to almost all optimization problems in machine learning to produce rigorous DP guarantees without additional assumptions regarding the objective function or dataset. However, despite its broad applicability, DP-SGD also suffers from notoriously large utility loss, especially when training cutting-edge deep models. Its practical implementation is also known to be sensitive to hyper-parameter selection [15], [16], [17]. Indeed, even in theory, the effects of the two artificial privatization operations applied in DP-SGD, iterative gradient perturbation and gradient clipping, are still not fully understood.
To understand why these two artificial modifications are the key to making iterative methods differentially private, we first need to introduce the concept of sensitivity, which plays a central role in DP. The sensitivity captures the maximum impact an individual sample from an input dataset may have on an algorithm's output. It is the foundation of almost all DP mechanisms, including the Laplace/Gaussian and Exponential Mechanisms [18], where the sensitivity determines how much randomization is needed to hide any individual amongst the population with the desired privacy. Unfortunately, in many practical optimization problems, the end-to-end sensitivity is intractable or can only be loosely bounded.
To this end, DP-SGD proposes an alternative solution by assuming a more powerful adversary. In most private (centralized) learning applications, the standard black-box adversary can only observe the final model revealed. DP-SGD, on the other hand, assumes an adversary who can observe the intermediate updates during training. For convenience, we will call such an adversary a white-box adversary. Given such an empowered adversary, DP-SGD clips the gradient evaluated on each individual sample and adds random noise to the updates in each iteration. Clipping guarantees that the sensitivity is bounded within each iteration. The total privacy loss is then upper bounded by a composition of the leakage from all iterations.
For convex optimization with Lipschitz continuity, where the norms of gradients are uniformly bounded by some given constant, DP-SGD is known to produce an asymptotically tight privacy-utility tradeoff [8]. However, it is, in general, impractical to assume Lipschitz continuity in tasks such as deep learning. Either asymptotically or non-asymptotically, the study of practical implementations of DP-SGD under more realistic and specific assumptions remains very active [13], [15], [17], [19] and demanding. Much research effort has been dedicated to tackling the following two fundamental questions. First, provided that we do not need to publish the intermediate computation results, how conservative is the privacy claim offered by DP-SGD? Second, during practical implementation, how should the training model and hyper-parameters be properly selected?
Regarding the first question, many prior works [4], [20], [21] tried to empirically simulate what the adversary can infer from models trained by DP-SGD. In particular, [20] examined the respective power of "black-box" and "white-box" adversaries, and suggested that a substantial gap may exist between the DP-SGD privacy bound and the actual privacy guarantee. Unfortunately, beyond DP-SGD, there are few known ways to produce, let alone improve, rigorous DP analysis for a training process. Most existing analyses need to assume either access to additional public data [13], [22], [23], or strongly-convex loss functions to enable objective perturbation [24]. Thus, for general applications, we still have to adopt the conservative DP-SGD analysis for worst-case DP guarantees.
The second question is of particular interest to practitioners. The implementation of DP-SGD is tricky, as the performance of DP-SGD is highly sensitive to the selection of the training model and hyper-parameters. The lack of theoretical analysis of gradient clipping makes it hard to find good parameters and optimize model architectures in an instructive manner, though many heuristic observations and optimizations on these choices have been reported. [16], [19], [25] showed how to find proper model architectures to balance learning capability and utility loss when the dataset is given.¹ Recent work [27] also demonstrated empirical improvements through selecting the clipping threshold adaptively. However, even with these efforts, there is still a long way to go if we want to practically train large neural networks with rigorous and usable privacy guarantees. The biggest bottlenecks include

• the huge model dimension, which may be even larger than the size of the training dataset, while the magnitude of noise required for gradient perturbation is proportional to the square root of the model dimension, and

• the long convergence time, which implies a massive composition of privacy loss and also forces DP-SGD to add formidable noise, resulting in intolerable accuracy loss.

To this end, Tramer and Boneh [19] argued that, within the current framework, for most medium datasets (< 500K datapoints) such as CIFAR10/100 and MNIST, the utility loss caused by DP-SGD will offset the powerful learning capacity offered by deep models. Therefore, simple linear models usually outperform modern Convolutional Neural Networks (CNNs) on these datasets. How to privately train a model while still enjoying the state-of-the-art success of modern machine learning is one of the key problems in the application of DP.

1. Most prior works report the best model and parameter selection found by grid search, where the private data is reused multiple times and the selection of parameters is itself sensitive. This additional privacy leakage, partially determined by the prior knowledge of the training data, is in general very hard to quantify, though in practice it might be small [19], [26].
1.1. Our Strategy and Results

In this paper, we set out to provide a systematic study of DP-SGD from both theoretical and empirical perspectives to understand the two important but artificial operations: (1) iterative gradient perturbation and (2) gradient clipping. We propose a generic technique, ModelMix, to significantly sharpen the utility-privacy tradeoff. Our theoretical analysis also provides instruction on how to select the clipping parameter and how to quantify privacy amplification from other practical randomness.

We will stick to the worst-case DP guarantee without any relaxation, but view the private iterative optimization process from a different angle. In most practical deep learning tasks, with proper use of randomness, we have a good chance of finding some reasonable (local) minimum via SGD regardless of the initialization (starting point) and the subsampling [28]. In particular, for convex optimization, we are guaranteed to approach the global optimum with a proper step size. In other words, there are infinitely many potential training trajectories² pointing to some (local) minimum of good generalization, and we are free to use any one of them to find a good model. Thus, even without DP perturbation, the training trajectory carries potential entropy if we are allowed to make a random selection.

2. We will use training trajectory in the following to represent the sequence of intermediate updates produced by SGD.

From this standpoint, a slow convergence rate when training a large model is not always bad news for privacy. This might seem counter-intuitive, but, in general, slow convergence means that the intermediate updates wander around a relatively large domain for a longer time before entering a satisfactory neighborhood of a (global/local) optimum. Training a larger model may produce a fuzzier and more complicated convergence process, which could compensate for the larger privacy loss composition incurred in DP-SGD. We have to stress that our ultimate goal is to privately publish a good model, while DP-SGD with exposed updates is merely a tool to find a trajectory with analyzable privacy leakage.

The above observation inspires a way to find a better DP guarantee even under the conservative "white-box" adversary model: can we utilize the potential entropy of the training trajectory while still bounding the sensitivity to produce rigorous DP guarantees?

To be specific, different from standard DP-SGD, which randomizes a particular trajectory with noise, we aim to privately construct an envelope of training trajectories, spanned by the many trajectories converging to some (global/local) minimum, and randomly generate one trajectory to amplify privacy. To achieve this, we must carefully consider the tradeoff between (1) controlling the worst-case sensitivity of the trajectory generation and (2) the learning bias resulting from this approach. We summarize our contributions as follows.
(a) We present a generic optimization framework, called ModelMix, which iteratively builds an envelope of training trajectories through post-processing historical updates, and randomly aggregates those model states before applying gradient descent. We provide rigorous convergence and privacy analyses for ModelMix, which enable us to quantify the $(\epsilon, \delta)$-DP budget of our protocol. The refined privacy analysis framework we propose can also be used to capture the privacy amplification of a large class of training-purpose-oriented operations commonly used in deep learning. This class of operations includes data augmentation [29] and stochastic gradient Langevin dynamics (SGLD) [30], which cannot produce reasonable worst-case DP guarantees by themselves.

(b) We study the influence of gradient clipping in private optimization and present the first generic convergence rate analysis of clipped DP-SGD in deep learning. To the best of our knowledge, this is the first analysis of non-convex optimization via clipped DP-SGD with only mild assumptions on the concentration of the stochastic gradient. We show that the key factor in clipped DP-SGD is the sampling noise³ of the stochastic gradient. We then demonstrate why implementations of DP-SGD that clip individual sample gradients can be unstable and sensitive to the selection of hyper-parameters. These analyses can be used to guide the selection of hyper-parameters and the improvement of network architectures in deep learning with DP-SGD.

3. The noise corresponds to using a minibatch of samples to estimate the true full-batch gradient.
(c) ModelMix is a fundamental improvement to DP-SGD, which can be applied to almost all applications together with other advances in DP-SGD, such as low-rank or low-dimensional gradient embedding [31], [32] and fine-tuning based transfer learning [19], [33] (if additional public data is provided). In our experiments, we focus on computer vision tasks, a canonical domain for private deep learning. We evaluate our methods on the CIFAR-10, FMNIST and SVHN datasets using various neural network models and compare with the state-of-the-art results. Our approach improves the privacy/utility tradeoff significantly. For example, provided a privacy budget $(\epsilon = 8, \delta = 10^{-5})$, we are able to train Resnet20 on CIFAR10 with accuracy 70.4%, compared to 56.1% when applying regular DP-SGD. As for private transfer learning on CIFAR10, we can improve the $(\epsilon = 2, \delta = 10^{-5})$-DP guarantee in [19] to $(\epsilon = 0.64, \delta = 10^{-5})$, producing the same 92.7% accuracy.
The remainder of this paper is organized as follows. In Section 2, we introduce background on statistical learning, differential privacy and DP-SGD. In Section 3, we formally present the ModelMix framework, whose utility in both convex and non-convex optimization is studied in Theorem 3.1 and Theorem 3.2, respectively. In Section 4, we show how to efficiently compute the amplified $(\epsilon, \delta)$-DP security parameters in Theorem 4.1, and a non-asymptotic amplification analysis is given in Theorem 4.2. Further experiments with detailed comparisons to the state-of-the-art works are included in Section 5. Finally, we conclude and discuss future work in Section 6.
1.2. Related Works

Theoretical (Clipped) DP-SGD Analysis: When DP-SGD was first proposed [7], and in most theoretical studies afterwards [8], [10], the objective loss function is assumed to be $L$-Lipschitz continuous, where the $L_2$ norm of the gradient is uniformly bounded by $L$. This enables a straightforward privatization of SGD by simply perturbing the gradients. In particular, for convex optimization on a dataset of $n$ samples, a training loss of $\Theta\big(\sqrt{d\log(1/\delta)}/(\epsilon n)\big)$ is known to be tight under an $(\epsilon, \delta)$-DP guarantee [8].

However, it is hard to get a (tight) Lipschitz bound for general learning tasks. A practical version of DP-SGD was then presented in [9], where the Lipschitz assumption is replaced by gradient clipping to ensure bounded sensitivity. This causes a disparity between practice and theory, as classic results [8], [10] assuming bounded gradients cannot be directly generalized to clipped DP-SGD. Some existing works have tried to narrow this gap by providing new analyses. [34] presented a convergence analysis of smooth optimization with clipped SGD when the sampling noise in the stochastic gradient is bounded. But [34] requires the clipping threshold $c$ to be $\Omega(T)$, where $T$ is the total number of iterations. This can be a strong requirement, as in practice the iteration number $T$ is often much larger than the constant clipping threshold $c$ selected. [35] relaxed this requirement with an assumption that the sampling noise is symmetric. [17] studied the special case where clipped DP-SGD is applied to generalized linear functions. In this paper, we give the first generic analysis of clipped DP-SGD with only mild assumptions on the concentration property of stochastic gradients.
Assistance with Additional Public Data: When additional unrestricted (unlabeled) public data is available, an alternative model-agnostic approach is Private Aggregation of Teacher Ensembles (PATE) [22], [23], [36]. PATE builds a teacher-student framework, where private data is first split into multiple (usually hundreds of) disjoint sets, and a teacher model is trained over each set separately. Then, one can apply those teacher models to privately label public data via a private majority vote. Those privately labeled samples are then used to train a student model, as a post-processing of the labeled samples. Another line of work considers improving the noise added in DP-SGD with public data. For example, in private transfer learning, we can first pretrain a large model with public data and then apply DP-SGD with private data to fine-tune a small fraction of the model parameters [19], [33]. However, both PATE and private transfer learning have to assume a large amount of public data. Another idea considers projecting the private gradient onto a low-rank/low-dimensional subspace, approximated from public samples, to reduce the magnitude of noise [31], [32], [37], [38]. When the public samples are limited, DP-SGD with low-rank gradient representation usually outperforms the former methods.

Except for PATE, our methods can, in general, be used to further enhance those state-of-the-art DP-SGD improvements with public data. For example, using 2K ImageNet public samples, the low-rank embedding method in [32] can train Resnet20 on CIFAR10 with 79.1% accuracy at a cost of $(\epsilon = 111.2, \delta = 10^{-5})$-DP, while ModelMix can improve the DP guarantee to $(\epsilon = 6.1, \delta = 10^{-5})$-DP with the same accuracy, as shown in Section 5.3.
2. Preliminaries

Empirical Risk Minimization: In statistical learning, the model to be trained is commonly represented by a parameterized function $f(w, x): (\mathcal{W}, \mathcal{X}) \to \mathbb{R}$, mapping a feature $x$ from the input domain $\mathcal{X}$ into an output (prediction/classification) domain. In the following, we will always use $d$ to represent the dimensionality of the parameter $w$, i.e., $w \in \mathbb{R}^d$. For example, one may consider $f(w, x)$ to be a neural network with a sequence of linear layers connected by non-linear activation layers, where $w$ represents the weights to be trained. Given a set $\mathcal{D}$ of $n$ samples $\{(x_i, y_i), i = 1, 2, ..., n\}$, we define the problem of Empirical Risk Minimization (ERM) for some loss function $l(\cdot, \cdot)$ as follows,

$$\min_{w} F(w) = \min_{w} \frac{1}{n} \sum_{i=1}^{n} l(f(w, x_i), y_i). \quad (1)$$

For convenience, we simply use $f(w, x_i, y_i)$ to denote the objective loss function $l(f(w, x_i), y_i)$ in the rest of the paper.
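To make Equation (1) concrete, the following minimal sketch evaluates the empirical risk of a linear model under squared loss on synthetic data; the specific model, loss, and data here are illustrative assumptions, since the paper keeps $f$ and $l(\cdot, \cdot)$ generic.

```python
import numpy as np

def f(w, x):
    # A linear model f(w, x) = <w, x>; any parameterized predictor could be used.
    return np.dot(w, x)

def loss(pred, y):
    # Squared loss l(f(w, x), y); the paper leaves l(., .) unspecified.
    return 0.5 * (pred - y) ** 2

def empirical_risk(w, X, Y):
    # F(w) = (1/n) * sum_i l(f(w, x_i), y_i), as in Equation (1).
    return np.mean([loss(f(w, x), y) for x, y in zip(X, Y)])

rng = np.random.default_rng(0)
n, d = 100, 5
X = rng.normal(size=(n, d))
w_true = rng.normal(size=d)
Y = X @ w_true + 0.1 * rng.normal(size=n)

print("F(w_0) =", empirical_risk(np.zeros(d), X, Y))
```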
Below, we formally introduce the definitions of Lipschitz continuity, smoothness and convexity, which are commonly used in optimization research.

Definition 2.1 (Lipschitz Continuity). A function $g$ is $L$-Lipschitz if for all $w, w' \in \mathcal{W}$, $|g(w) - g(w')| \le L\|w - w'\|_2$.

Definition 2.2 (Smoothness). A function $g$ is $\beta$-smooth on $\mathcal{W}$ if for all $w, w' \in \mathcal{W}$, $g(w') \le g(w) + \langle \nabla g(w), w' - w \rangle + \frac{\beta}{2}\|w' - w\|_2^2$.

Definition 2.3 (Convexity). A function $g$ is convex on $\mathcal{W}$ if for all $w, w' \in \mathcal{W}$ and $t \in (0, 1)$, $g(tw + (1-t)w') \le t\, g(w) + (1-t)\, g(w')$.

In the following, we will simply use $\|\cdot\|$ to denote the $\ell_2$ norm unless specified otherwise.
Differential Privacy (DP): We first formally define $(\epsilon, \delta)$-DP and $(\alpha, \epsilon)$-Rényi DP as follows.

Definition 2.4 (Differential Privacy). Given a data universe $\mathcal{X}$, we say that two datasets $\mathcal{D}, \mathcal{D}' \subseteq \mathcal{X}$ are neighbors, denoted as $\mathcal{D} \sim \mathcal{D}'$, if $\mathcal{D} = \mathcal{D}' \cup s$ or $\mathcal{D}' = \mathcal{D} \cup s$ for some additional datapoint $s$. A randomized algorithm $\mathcal{A}$ is said to be $(\epsilon, \delta)$-differentially private (DP) if for any pair of neighboring datasets $\mathcal{D}, \mathcal{D}'$ and any event $S$ in the output space of $\mathcal{A}$, it holds that

$$\Pr(\mathcal{A}(\mathcal{D}) \in S) \le e^{\epsilon} \cdot \Pr(\mathcal{A}(\mathcal{D}') \in S) + \delta.$$

Definition 2.5 (Rényi Differential Privacy [39]). A randomized algorithm $\mathcal{A}$ satisfies $(\alpha, \epsilon)$-Rényi Differential Privacy (RDP), $\alpha > 1$, if for any pair of neighboring datasets $\mathcal{D} \sim \mathcal{D}'$,

$$D_{\alpha}\big(\mathcal{A}(\mathcal{D}) \,\|\, \mathcal{A}(\mathcal{D}')\big) \le \epsilon.$$

Here,

$$D_{\alpha}(P \| Q) = \frac{1}{\alpha - 1} \log \int q(o) \Big(\frac{p(o)}{q(o)}\Big)^{\alpha} do, \quad (2)$$

represents the $\alpha$-Rényi divergence between two distributions $P$ and $Q$ whose density functions are $p$ and $q$, respectively.
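To make Equation (2) concrete, the sketch below numerically evaluates the $\alpha$-Rényi divergence between two one-dimensional Gaussians with equal variance and compares it with the known closed form $\alpha(\mu_1 - \mu_2)^2/(2\sigma^2)$; the Gaussian example and the specific parameters are illustrative choices, not taken from the paper.

```python
import numpy as np
from scipy.integrate import quad
from scipy.stats import norm

def renyi_divergence(p_pdf, q_pdf, alpha, lo=-40.0, hi=40.0):
    # D_alpha(P || Q) = 1/(alpha - 1) * log( integral q(o) * (p(o)/q(o))^alpha do ),
    # evaluated by numerical quadrature, as in Equation (2).
    integrand = lambda o: q_pdf(o) * (p_pdf(o) / q_pdf(o)) ** alpha
    val, _ = quad(integrand, lo, hi)
    return np.log(val) / (alpha - 1)

mu1, mu2, sigma, alpha = 0.0, 1.0, 2.0, 8.0
p = lambda o: norm.pdf(o, loc=mu1, scale=sigma)
q = lambda o: norm.pdf(o, loc=mu2, scale=sigma)

numeric = renyi_divergence(p, q, alpha)
closed_form = alpha * (mu1 - mu2) ** 2 / (2 * sigma ** 2)  # known formula for equal-variance Gaussians
print(numeric, closed_form)  # the two values should agree
```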
In Definitions 2.4 and 2.5, if two neighboring datasets $\mathcal{D}$ and $\mathcal{D}'$ are instead defined such that $\mathcal{D}$ can be obtained by arbitrarily replacing a datapoint in $\mathcal{D}'$, then they become the definitions of bounded DP [5], [6] and bounded RDP, respectively. In this paper, we adopt the unbounded DP version to match existing DP deep learning works [9], [19], [32] for a fair comparison.

In practice, to achieve meaningful privacy guarantees, $\epsilon$ is usually selected to be some small one-digit constant and $\delta$ is asymptotically $O(1/|\mathcal{D}|) = O(1/n)$. To randomize an algorithm, the most common approaches in DP are the Gaussian and Laplace Mechanisms [18], where Gaussian or Laplace noise proportional to the sensitivity is added to perturb the algorithm's output. In many applications, including the DP-SGD analysis, we need to quantify the cumulative privacy loss across sequential queries of some differentially private mechanism on one dataset. The following theorems provide upper bounds on the overall privacy leakage.
Theorem 2.1 (Advanced Composition [40]). For any $\epsilon > 0$ and $\delta \in (0, 1)$, the class of $(\epsilon, \delta)$-differentially private mechanisms satisfies $(\tilde{\epsilon}, T\delta + \tilde{\delta})$-differential privacy under $T$-fold adaptive composition for any $\tilde{\epsilon}$ and $\tilde{\delta}$ such that

$$\tilde{\epsilon} = \sqrt{2T\log(1/\tilde{\delta})} \cdot \epsilon + T\epsilon(e^{\epsilon} - 1).$$

Theorem 2.2 (Advanced Composition via RDP [39]). For any $\alpha > 1$ and $\epsilon > 0$, the class of $(\alpha, \epsilon)$-RDP mechanisms satisfies $(\tilde{\epsilon}, \tilde{\delta})$-differential privacy under $T$-fold adaptive composition for any $\tilde{\epsilon}$ and $\tilde{\delta}$ such that

$$\tilde{\epsilon} = T\epsilon - \log(\tilde{\delta})/(\alpha - 1).$$

Theorem 2.1 provides a good characterization of how the privacy loss increases with composition. For small $(\epsilon, \delta)$, we still have an $\tilde{O}(\sqrt{T}\epsilon, T\delta)$-DP guarantee after a $T$-fold composition. In practice, using RDP, Theorem 2.2 usually produces tighter constants in the privacy bound.
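As a quick numerical illustration of Theorems 2.1 and 2.2, the sketch below evaluates both composed bounds for $T$ adaptive rounds; the per-round budgets and the choice of $\alpha$ are hypothetical values chosen only for the example, not values used in the paper.

```python
import numpy as np

def advanced_composition(eps, delta, T, delta_tilde):
    # Theorem 2.1: T-fold composition of (eps, delta)-DP mechanisms is
    # (eps_tilde, T*delta + delta_tilde)-DP with
    # eps_tilde = sqrt(2*T*log(1/delta_tilde))*eps + T*eps*(e^eps - 1).
    eps_tilde = np.sqrt(2 * T * np.log(1 / delta_tilde)) * eps + T * eps * (np.exp(eps) - 1)
    return eps_tilde, T * delta + delta_tilde

def rdp_composition(alpha, eps_rdp, T, delta_tilde):
    # Theorem 2.2: T-fold composition of (alpha, eps)-RDP mechanisms is
    # (T*eps - log(delta_tilde)/(alpha - 1), delta_tilde)-DP.
    return T * eps_rdp - np.log(delta_tilde) / (alpha - 1), delta_tilde

# Hypothetical per-iteration budgets, purely for illustration.
print(advanced_composition(eps=0.01, delta=1e-7, T=1000, delta_tilde=1e-5))
print(rdp_composition(alpha=32, eps_rdp=0.001, T=1000, delta_tilde=1e-5))
```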
DP-SGD: (Stochastic) Gradient Descent ((S)GD) is a very popular approach for optimizing a function. Suppose we try to solve the ERM problem and minimize some function $F(w) = \frac{1}{n}\sum_{i=1}^{n} f(w, x_i, y_i)$. SGD can be described as the following iterative protocol. In the $(k+1)$-th iteration, we apply Poisson sampling, i.e., each datapoint is i.i.d. sampled with a constant rate $q$, and a minibatch of $B_k$ samples, denoted as $S_k$, is produced from the dataset $\mathcal{D}$. We calculate the stochastic gradient as

$$G_k \leftarrow \sum_{(x_i, y_i) \in S_k} \nabla f(w_k, x_i, y_i). \quad (3)$$

Then, a gradient descent update is applied using

$$w_{k+1} = w_k - \eta \cdot G_k, \quad (4)$$

for some step size $\eta$. In particular, if the minibatch is selected to be the full batch, i.e., $S_k = \mathcal{D}$, then Equation (4) becomes the standard gradient descent procedure.
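A minimal sketch of the update rule in Equations (3) and (4) with Poisson sampling is given below, instantiated on a toy least-squares problem; the dimensions, sampling rate $q$, and step size $\eta$ are arbitrary illustrative choices rather than values from the paper.

```python
import numpy as np

rng = np.random.default_rng(1)
n, d = 200, 10
X = rng.normal(size=(n, d))
w_star = rng.normal(size=d)
Y = X @ w_star                      # noiseless labels, so w_star is the exact optimum

def per_sample_grad(w, x, y):
    # Gradient of the squared loss 0.5 * (<w, x> - y)^2 for a single sample.
    return (np.dot(w, x) - y) * x

def sgd(T=2000, q=0.05, eta=0.005):
    w = np.zeros(d)
    for _ in range(T):
        # Poisson sampling: each datapoint joins the minibatch S_k independently
        # with rate q, so the (random) batch size B_k has mean q * n.
        mask = rng.random(n) < q
        if not mask.any():
            continue
        # Equation (3): sum of per-sample gradients over the minibatch.
        G = np.sum([per_sample_grad(w, X[i], Y[i]) for i in np.where(mask)[0]], axis=0)
        # Equation (4): gradient descent step using the minibatch sum.
        w = w - eta * G
    return w

print(np.linalg.norm(sgd() - w_star))   # distance to the optimum after training
```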
We make the following assumption regarding the sampling noise $\|\nabla f(w, x, y) - \nabla F(w)\|$ when the minibatch size equals 1. This assumption, which will be shown to be necessary in Example 3.1, will be used in Theorem 3.2 when we derive the concrete convergence rate for clipped SGD.

Assumption 2.1 (Stochastic Gradient of Sub-exponential Tail). There exists some constant $\kappa > 0$ such that for any $w$, if we randomly select a datapoint $(x, y)$ from $\mathcal{D}$, then

$$\Pr\big(\|\nabla f(w, x, y) - \nabla F(w)\| \ge t\big) \le e^{-t/\kappa}.$$
In Assumption 2.1, a smaller $\kappa$ implies stronger concentration, i.e., a faster-decaying tail of the stochastic gradient. The modification from GD/SGD to its corresponding DP version is straightforward. When the loss function $f$ is assumed to be $L$-Lipschitz [7], [8], i.e., $\|\nabla f(w, x_i, y_i)\| \le L$ for any $w$, the worst-case sensitivity in Equation (4) is bounded by $\eta L$ in each iteration. One can derive a tighter bound [8] using existing results on the privacy amplification from sampling [41], [42]. Thus, SGD can be made private via iterative perturbation by replacing Equation (4) with the following:

$$w_{k+1} = w_k - \eta \cdot (G_k + \Delta_{k+1}), \quad (5)$$

where $\Delta_k$ is the noise for the $k$-th iteration. For example, if we want to use the Gaussian Mechanism to ensure $(\epsilon, \delta)$-DP when running $T$ iterations, then $\Delta_{k+1}$ can be selected to be i.i.d. generated from

$$\Delta_{k+1} \leftarrow \mathcal{N}\Big(0,\; O\big(\tfrac{L^2 T \log(1/\delta)}{\epsilon^2}\big) \cdot I_d\Big).$$

Here, $I_d$ represents the $d \times d$ identity matrix.
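The perturbed update of Equation (5) can be sketched as follows. The hidden constant in the $O(\cdot)$ noise calibration is set to 1, the per-sample gradients are stand-ins rescaled to unit norm so that the Lipschitz bound $\|\nabla f\| \le L = 1$ holds, and $\epsilon$, $\delta$, $T$ are hypothetical values; all of these are assumptions made only for illustration.

```python
import numpy as np

def dp_sgd_step(w, grads, eta, L, T, eps, delta, rng):
    # One update of Equation (5): w_{k+1} = w_k - eta * (G_k + Delta_{k+1}).
    # grads holds the per-sample gradients of the minibatch, each assumed to
    # satisfy ||grad|| <= L (the Lipschitz case, so no clipping is applied here).
    G = np.sum(grads, axis=0)
    # Gaussian noise std following the O(L * sqrt(T * log(1/delta)) / eps)
    # calibration, with the hidden constant taken as 1.
    sigma = L * np.sqrt(T * np.log(1 / delta)) / eps
    noise = rng.normal(scale=sigma, size=w.shape)
    return w - eta * (G + noise)

rng = np.random.default_rng(2)
d = 10
w = np.zeros(d)
fake_grads = rng.normal(size=(8, d))
fake_grads /= np.linalg.norm(fake_grads, axis=1, keepdims=True)  # unit norm, so L = 1
w = dp_sgd_step(w, fake_grads, eta=0.1, L=1.0, T=100, eps=8.0, delta=1e-5, rng=rng)
print(w)
```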
However, when we do not have the Lipschitz assumption, an alternative is to enforce limited sensitivity through gradient clipping. Following the same notation as before, we describe DP-SGD with per-sample gradient clipping [9] as follows,

$$G_k \leftarrow \sum_{(x_i, y_i) \in S_k} \mathsf{CP}\big(\nabla f(w_k, x_i, y_i), c\big); \qquad w_{k+1} = w_k - \eta \cdot (G_k + \Delta_{k+1}). \quad (6)$$

Here, $\mathsf{CP}(\cdot, c)$ represents a clipping function with threshold $c$,

$$\mathsf{CP}\big(\nabla f(w, x, y), c\big) = \nabla f(w, x, y) \cdot \min\Big\{1, \frac{c}{\|\nabla f(w, x, y)\|}\Big\}.$$

With clipping, the $\ell_2$ norm of each per-sample gradient is bounded by $c$. Thus, the clipping threshold $c$ virtually plays the role of the Lipschitz constant $L$ in clipped SGD for the purpose of privacy analysis.
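Putting the pieces together, here is a minimal sketch of clipped DP-SGD as in Equation (6), on a toy least-squares problem like the one in the earlier sketch; the clipping threshold $c$, noise scale, and other hyper-parameters are arbitrary illustrative choices and are not calibrated to a specific $(\epsilon, \delta)$ budget.

```python
import numpy as np

def clip(g, c):
    # CP(g, c) = g * min{1, c / ||g||}: rescales g so that ||CP(g, c)|| <= c.
    return g * min(1.0, c / max(np.linalg.norm(g), 1e-12))

def clipped_dp_sgd(X, Y, T=500, q=0.05, eta=0.01, c=1.0, sigma=0.5, seed=3):
    # DP-SGD with per-sample gradient clipping, following Equation (6).
    # sigma * c is the per-coordinate noise std; in a real deployment sigma would
    # be calibrated from (eps, delta), T and the sampling rate via the results of Section 2.
    rng = np.random.default_rng(seed)
    n, d = X.shape
    w = np.zeros(d)
    for _ in range(T):
        idx = np.where(rng.random(n) < q)[0]         # Poisson-sampled minibatch S_k
        G = np.zeros(d)
        for i in idx:
            g_i = (np.dot(w, X[i]) - Y[i]) * X[i]    # per-sample gradient
            G += clip(g_i, c)                        # per-sample clipping CP(., c)
        noise = rng.normal(scale=sigma * c, size=d)  # Gaussian perturbation Delta_{k+1}
        w = w - eta * (G + noise)
    return w

rng = np.random.default_rng(4)
n, d = 200, 10
X = rng.normal(size=(n, d))
w_star = rng.normal(size=d)
Y = X @ w_star
print(np.linalg.norm(clipped_dp_sgd(X, Y) - w_star))
```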
3. ModelMix

In this section, we formally introduce ModelMix and explain how it sharpens the utility-privacy tradeoff in DP-SGD.

3.1. Intuition
We begin with the following observation. Suppose we run SGD twice on a least-squares regression $F(w) = \frac{1}{n}\sum_{i=1}^{n} \|\langle w, x_i\rangle - y_i\|^2$ for $T$ iterations and obtain two training trajectories $w = (w_1, w_2, ..., w_T)$ and $w' = (w'_1, w'_2, ..., w'_T)$. Suppose both $w_T$ and $w'_T$ are $\sigma$-close to the optimum $w^* = \arg\min_{w} F(w)$, i.e.,

$$\|w_T - w^*\| \le \sigma \quad \text{and} \quad \|w'_T - w^*\| \le \sigma.$$

Due to the linearity of gradients in least-squares regression, if we mix $w$ and $w'$ to get $w''_k = \alpha w_k + (1 - \alpha) w'_k$ ($k = 1, 2, ..., T$) for some weight $\alpha \in (0, 1)$, then we produce a new SGD trajectory $w''$ where $w''_T$ is also $\sigma$-close to $w^*$.
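This observation is easy to verify numerically. The sketch below (a toy instantiation with arbitrary dimensions, step size, and seeds) runs SGD twice with different randomness on the same least-squares objective and checks that the mixture of the two trajectories also ends up close to $w^*$.

```python
import numpy as np

rng = np.random.default_rng(5)
n, d = 200, 10
X = rng.normal(size=(n, d))
w_star = rng.normal(size=d)
Y = X @ w_star                      # least-squares problem whose optimum is exactly w_star

def run_sgd(seed, T=2000, batch=10, eta=0.01):
    # Plain SGD on F(w) = (1/n) * sum_i 0.5 * (<w, x_i> - y_i)^2, recording the trajectory.
    local_rng = np.random.default_rng(seed)
    w = local_rng.normal(size=d)    # different random initialization per run
    traj = [w.copy()]
    for _ in range(T):
        idx = local_rng.choice(n, size=batch, replace=False)
        grad = (X[idx] @ w - Y[idx]) @ X[idx] / batch
        w = w - eta * grad
        traj.append(w.copy())
    return np.array(traj)

traj1, traj2 = run_sgd(seed=10), run_sgd(seed=20)
alpha = 0.3
mixed = alpha * traj1 + (1 - alpha) * traj2   # mix the two trajectories pointwise

# All three final iterates should be close to w_star.
for name, t in [("run 1", traj1), ("run 2", traj2), ("mixed", mixed)]:
    print(name, np.linalg.norm(t[-1] - w_star))
```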
This simple example gives us two inspirations. First, as mentioned earlier, with different randomness in initialization and subsampling, the training trajectory that reaches an optimum is not unique. Even in DP-SGD, where we need to virtually publish the trajectory for analytical purposes, we are not restricted to exposing one particular trajectory. Second, and more importantly, this means that we have more freedom to randomize the SGD process. In the $k$-th iteration, instead of following the regular SGD rule in Equation (4), where we simply start from the previously updated state $w_{k-1}$, we can randomly mix $w_{k-1}$ with some other $w'_{k-1}$ from another reasonable training trajectory, and then move to a new trajectory to proceed.

However, the above idea of randomly mixing trajectories cannot be directly implemented, as we do not have any prior knowledge of what good trajectories look like. Any training trajectory generated from the private dataset is sensitive and potentially creates privacy leakage. Thus, we need to be careful about how we generate the needed envelope. Recall that we use envelope to describe the space spanned by the mixtures of training trajectories. Given that DP is immune to post-processing, we consider approximating the trajectory envelope using intermediate states that have already been published. This allows us to privately construct the envelope and advance the optimization simultaneously.
3.2. Algorithm and Observations

We start with a straw-man solution where we virtually run DP-SGD to alternately train two models in turns. We initialize two states $\tilde{w}^1_0$ and $\tilde{w}^2_0$ with respect to (w.r.t.) the parameterized function we aim to optimize.

At any odd iteration, i.e., iteration $2k+1$ ($k \ge 0$), we randomly generate $\alpha_{2k+1} \in (0, 1)^d$ whose coordinates are i.i.d.
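The excerpt cuts off here, so the following sketch only illustrates the coordinate-wise mixing step described so far; the uniform distribution for $\alpha_{2k+1}$ and the surrounding bookkeeping are assumptions made for illustration, not the paper's actual specification.

```python
import numpy as np

def model_mix_step(w_a, w_b, rng):
    # Coordinate-wise random mixture of two model states: each coordinate of
    # alpha is drawn i.i.d. from (0, 1); the uniform distribution is an assumed choice.
    alpha = rng.uniform(0.0, 1.0, size=w_a.shape)
    return alpha * w_a + (1.0 - alpha) * w_b

rng = np.random.default_rng(6)
w1 = rng.normal(size=10)   # stand-ins for the two maintained states ~w^1, ~w^2
w2 = rng.normal(size=10)
mixed = model_mix_step(w1, w2, rng)
# The mixed state would then serve as the starting point of the next clipped,
# perturbed gradient step, in the spirit of Equation (6).
print(mixed)
```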