deep learning. The study of practical implementations of DP-SGD under more realistic and specific assumptions, either asymptotic or non-asymptotic, remains active and demanding [13], [15], [17], [19]. Much research effort has been dedicated to the following two fundamental questions. First, provided that we do not need to publish the intermediate computation results, how conservative is the privacy claim offered by DP-SGD? Second, in a practical implementation, how should one properly select the training model and the hyper-parameters?
Regarding the first question, many prior works [4], [20], [21] have tried to empirically simulate what an adversary can infer from models trained by DP-SGD. In particular, [20] examined the respective power of “black-box” and “white-box” adversaries, and suggested that a substantial gap may exist between the DP-SGD privacy bound and the actual privacy guarantee. Unfortunately, beyond DP-SGD, there are few known ways to produce, let alone improve, a rigorous DP analysis for a training process. Most existing analyses need to assume either access to additional public data [13], [22], [23] or strongly convex loss functions that enable objective perturbation [24]. Thus, for general applications, we still have to adopt the conservative DP-SGD analysis to obtain worst-case DP guarantees.
The second question is of particular interest to practitioners. Implementing DP-SGD is tricky, as its performance is highly sensitive to the choice of training model and hyper-parameters. The lack of theoretical analysis of gradient clipping makes it hard to select good parameters or optimize model architectures in a principled way, though many heuristic observations and optimizations of these choices have been reported. [16], [19], [25] showed how to find proper model architectures that balance learning capability and utility loss for a given dataset.¹
Recent work [27] also demonstrated empirical improvements from selecting the clipping threshold adaptively. However, even with these efforts, there is still a long way to go before large neural networks can be trained in practice with rigorous and usable privacy guarantees. The biggest bottlenecks include
• the huge model dimension, which may even exceed the size of the training dataset, while the magnitude of noise required for gradient perturbation is proportional to the square root of the model dimension (see the sketch after this list), and
• the long convergence time, which implies a massive composition of privacy loss and also forces DP-SGD to add formidable noise, resulting in intolerable accuracy loss.
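To make the first bottleneck concrete, the following minimal numerical sketch (with illustrative values of the clipping threshold C and noise multiplier sigma, not taken from any specific experiment) compares the l2 norm of the per-step Gaussian noise, which concentrates around sigma * C * sqrt(d), against the per-sample signal, whose norm is bounded by C after clipping:

import numpy as np

rng = np.random.default_rng(0)
C = 1.0        # clipping threshold (illustrative)
sigma = 1.0    # noise multiplier (illustrative)

for d in (10**3, 10**5, 10**7):                 # model dimension
    noise = rng.normal(0.0, sigma * C, size=d)  # per-step Gaussian perturbation
    print(f"d={d:>8}: ||noise||_2 ~ {np.linalg.norm(noise):8.1f}"
          f"  vs per-sample gradient norm <= C = {C}")

For d in the millions, the perturbation added in a single step is already thousands of times larger in norm than any individual clipped gradient, and the second bottleneck, composition over many iterations, only compounds this.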
To this end, Tramer and Boneh [19] argued that, within the current framework, for most medium-sized datasets (< 500K datapoints) such as CIFAR10/100 and MNIST, the utility loss caused by DP-SGD offsets the powerful learning capacity offered by deep models. Therefore, simple linear models usually outperform modern Convolutional Neural Networks (CNNs) on these datasets. How to privately train a model while still enjoying the state-of-the-art success of modern machine learning is one of the key problems in the application of DP.

1. Most prior works report the best model and parameter selection obtained by grid search, where the private data is reused multiple times and the selection of parameters is itself sensitive. This additional privacy leakage, partially determined by the prior knowledge of the training data, is in general very hard to quantify, though in practice it might be small [19], [26].
1.1. Our Strategy and Results
In this paper, we set out to provide a systematic study of DP-SGD, from both theoretical and empirical perspectives, to understand its two important but artificial operations: (1) iterative gradient perturbation and (2) gradient clipping. We propose a generic technique, ModelMix, to significantly sharpen the utility-privacy tradeoff. Our theoretical analysis also provides guidance on how to select the clipping parameter and how to quantify the privacy amplification from other sources of practical randomness.
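For reference, the following is a minimal sketch of one step of standard DP-SGD showing these two operations (it is not the ModelMix variant proposed in this paper; the clipping threshold C, noise multiplier sigma, and learning rate eta are placeholder values):

import numpy as np

def dp_sgd_step(w, per_sample_grads, C=1.0, sigma=1.0, eta=0.1, rng=None):
    # w: parameter vector; per_sample_grads: list of gradients with the same shape as w
    rng = rng or np.random.default_rng()
    # (2) gradient clipping: rescale each per-sample gradient to l2 norm at most C
    clipped = [g / max(1.0, np.linalg.norm(g) / C) for g in per_sample_grads]
    # (1) gradient perturbation: add Gaussian noise calibrated to the sensitivity C
    noisy_sum = np.sum(clipped, axis=0) + rng.normal(0.0, sigma * C, size=w.shape)
    return w - eta * noisy_sum / len(per_sample_grads)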
We will stick to the worst-case DP guarantee without any relaxation, but view the private iterative optimization process from a different angle. In most practical deep learning tasks, with proper use of randomness, we have a good chance of finding some reasonable (local) minimum via SGD regardless of the initialization (starting point) and the subsampling [28]. In particular, for convex optimization, we are guaranteed to approach the global optimum with a proper step size. In other words, there are infinitely many potential training trajectories² pointing to some (local) minimum with good generalization, and we are free to use any one of them to find a good model. Thus, even without DP perturbation, the training trajectory carries potential entropy if we are allowed to select it at random.
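As a toy illustration of this point (an assumed example, not one used in the paper), consider minimizing f(w) = (||w||² - 1)², whose global minima form an entire circle; gradient descent from different random initializations follows different trajectories, yet each one reaches a global minimum:

import numpy as np

def grad(w):
    # gradient of f(w) = (||w||^2 - 1)^2
    return 4.0 * (np.dot(w, w) - 1.0) * w

rng = np.random.default_rng(0)
for run in range(3):
    w = rng.normal(size=2)            # random starting point
    for _ in range(300):
        w = w - 0.02 * grad(w)        # plain (noise-free) gradient descent
    print(f"run {run}: converged to {np.round(w, 3)}, loss = {(w @ w - 1.0) ** 2:.1e}")

Each run ends at a different point on the circle of minima with essentially zero loss, so the trajectory itself is already a source of entropy before any DP noise is injected.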
From this standpoint, a slow convergence rate when training a large model is not always bad news for privacy. This might seem counter-intuitive, but, in general, slow convergence means that the intermediate updates wander around a relatively large domain for a longer time before entering a satisfactory neighborhood of a (global/local) optimum. Training a larger model may produce a fuzzier and more complicated convergence process, which could compensate for the larger privacy loss composition incurred by DP-SGD. We stress that our ultimate goal is to privately publish a good model; DP-SGD with exposed updates is merely a tool for finding a trajectory with analyzable privacy leakage.
The above observation inspires a way to obtain a better DP guarantee even under the conservative “white-box” adversary model: can we utilize the potential entropy of the training trajectory, while still bounding the sensitivity, to produce rigorous DP guarantees?
To be specific, unlike standard DP-SGD, which randomizes one particular trajectory with noise, we aim to privately construct an envelope of training trajectories, spanned by the many trajectories converging to some (global/local) minimum, and randomly generate one trajectory from it to amplify privacy. To achieve this, we must carefully consider the tradeoff between (1) controlling the worst-case sensitivity
2. In the following, we use training trajectory to refer to the sequence of intermediate updates produced by SGD.