
TAN Without a Burn: Scaling Laws of DP-SGD
Training neural networks with DP-SGD has seen progress recently, due to several factors. The first is the use of pre-trained models, with DP finetuning on downstream tasks (Li et al., 2021; De et al., 2022). This circumvents the traditional limitations of DP, because the model learns meaningful features from public data and can adapt to downstream data with minimal information. In the remainder of this paper, we only consider models trained from scratch, as we focus on obtaining information through the DP channel. Another emerging trend among DP practitioners is to use massive batch sizes for a large number of steps to achieve a better tradeoff between privacy and utility: Anil et al. (2021) have successfully pre-trained BERT with DP-SGD using batch sizes of 2 million. This paradigm makes training models computationally intensive and hyper-parameter (HP) search effectively impractical for realistic datasets and architectures.
In this context, we look at DP-SGD through the lens of the Total Amount of Noise (TAN) injected during training, and use it to decouple two aspects: privacy accounting and the influence of noisy updates on the training dynamics. We first observe a heuristic rule: typically when σ > 2, the privacy budget ε only depends on the total amount of noise. Using the tools of RDP accounting, we approximate ε by a simple closed-form expression. We then analyze the scaling laws of DNNs at constant TAN and show that performance at very large batch sizes (computationally intensive) is predictable from performance at small batch sizes, as illustrated in Figure 1. Our contributions are the following:
• We take a heuristic view of privacy accounting by introducing the Total Amount of Noise (TAN) and show that in a regime where the budget ε is not overcharged, it only depends on TAN;
• We use this result in practice and derive scaling laws that showcase the predictive power of TAN to reduce the computational cost of hyper-parameter tuning with DP-SGD, saving a factor of 128 in compute on ImageNet experiments (Figure 1). We then use TAN to find optimal privacy parameters, leading to a gain of +9 points under ε = 8 compared to the previous SOTA;
• We leverage TAN to quantify the impact of the dataset size on the privacy/utility trade-off and show that with well-chosen privacy parameters, doubling the dataset size halves ε while providing better performance.
2. Background and Related Work
In this section, we review traditional definitions of DP, including Rényi Differential Privacy. We consider a randomized mechanism M that takes as input a dataset D of size N and outputs a machine learning model θ ∼ M(D).
Definition 2.1 (Approximate Differential Privacy). A randomized mechanism M satisfies (ε, δ)-DP (Dwork et al., 2006) if, for any pair of datasets D and D′ that differ by one sample and for all subsets R ⊂ Im(M),

P(M(D) ∈ R) ≤ P(M(D′) ∈ R) exp(ε) + δ.   (1)
DP-SGD (Abadi et al., 2016) is the most popular DP algorithm to train DNNs. It selects samples uniformly at random with probability q = B/N (with B the batch size and N the number of training samples), clips per-sample gradients to a norm C (clipC), aggregates them and adds Gaussian noise. With θ the parameters of the DNN and ℓi(θ) the loss evaluated at sample (xi, yi), it uses the noisy gradient

g := (1/B) Σ_{i∈B} clipC(∇θ ℓi(θ)) + N(0, C²σ²/B²)   (2)

to train the model.
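As a concrete illustration, the update in Equation (2) can be sketched in a few lines of NumPy. This is a toy sketch of ours (function and variable names are illustrative, and per-sample gradients are assumed to be precomputed as the rows of a matrix), not the implementation used in our experiments.

import numpy as np

def dp_sgd_noisy_gradient(per_sample_grads, C, sigma, rng=None):
    # Toy version of Equation (2): clip each per-sample gradient to norm C,
    # average over the batch, and add Gaussian noise of std C*sigma/B per coordinate.
    if rng is None:
        rng = np.random.default_rng()
    B, d = per_sample_grads.shape  # B sampled examples, d parameters
    norms = np.linalg.norm(per_sample_grads, axis=1, keepdims=True)
    clipped = per_sample_grads * np.minimum(1.0, C / np.maximum(norms, 1e-12))  # clip_C
    noise = rng.normal(0.0, C * sigma / B, size=d)
    return clipped.mean(axis=0) + noise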
The traditional privacy analysis of DP-SGD is obtained through Rényi Differential Privacy.
Definition 2.2 (Rényi Divergence). For two probability distributions P and Q defined over R, the Rényi divergence of order α > 1 of P given Q is:

Dα(P∥Q) := (1/(α−1)) log Ex∼Q[(P(x)/Q(x))^α].
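As a point of reference (a standard computation, included here only for intuition), when P and Q are Gaussians with the same variance the divergence has a closed form: Dα(N(μ, σ²) ∥ N(μ′, σ²)) = α(μ − μ′)²/(2σ²). In particular, Dα(N(1, σ²) ∥ N(0, σ²)) = α/(2σ²), which corresponds to the Gaussian mechanism without subsampling.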
Definition 2.3 (Rényi DP). A randomized mechanism M : D → R satisfies (α, dα)-Rényi differential privacy (RDP) if, for any D, D′ ∈ D that differ by one sample:

Dα(M(D) ∥ M(D′)) ≤ dα.
RDP is a convenient notion to track privacy because composition is additive: a sequence of two algorithms satisfying (α, dα)-RDP and (α, d′α)-RDP satisfies (α, dα + d′α)-RDP. In particular, S steps of an (α, dα)-RDP mechanism satisfy (α, Sdα)-RDP. Mironov et al. (2019) show that each step of DP-SGD satisfies (α, gα(σ, q))-RDP with

gα(σ, q) := Dα((1 − q)N(0, σ²) + qN(1, σ²) ∥ N(0, σ²)).
Finally, a mechanism satisfying (α, dα)-RDP also satisfies (ε, δ)-DP (Mironov, 2017) for ε = dα + log(1/δ)/(α − 1). Performing S steps of DP-SGD satisfies (εRDP, δ)-DP with

εRDP := min_α [ S gα(σ, q) + log(1/δ)/(α − 1) ].   (3)
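To make the accounting concrete, gα and the minimization in Equation (3) can be evaluated numerically. The sketch below is our own illustration, using SciPy quadrature over Definition 2.2 and a coarse grid of integer orders α; it is not the accountant of Balle et al. (2020) used in the rest of the paper, and the example parameters are hypothetical.

import numpy as np
from scipy import integrate
from scipy.stats import norm

def g_alpha(sigma, q, alpha):
    # Renyi divergence of order alpha between the subsampled Gaussian mixture
    # (1-q)*N(0, sigma^2) + q*N(1, sigma^2) and N(0, sigma^2) (Definition 2.2).
    def integrand(x):
        p = (1 - q) * norm.pdf(x, 0, sigma) + q * norm.pdf(x, 1, sigma)
        base = norm.pdf(x, 0, sigma)
        return base * (p / base) ** alpha
    # The mass of the integrand lies roughly in [0, alpha]; finite bounds keep quad stable.
    val, _ = integrate.quad(integrand, -10 * sigma, alpha + 10 * sigma, limit=200)
    return np.log(val) / (alpha - 1)

def epsilon_rdp(sigma, q, steps, delta, alphas=range(2, 65)):
    # Equation (3): additive composition over S steps, then RDP-to-DP conversion,
    # minimized over a grid of orders alpha.
    return min(steps * g_alpha(sigma, q, a) + np.log(1 / delta) / (a - 1)
               for a in alphas)

# Example call with hypothetical parameters: sigma = 2, B = 256, N = 50,000,
# S = 5,000 steps, delta = 1e-5.
print(epsilon_rdp(sigma=2.0, q=256 / 50_000, steps=5_000, delta=1e-5))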
RDP is the traditional tool used to analyse DP-SGD, but other accounting tools have been proposed to obtain tighter bounds (Gopi et al., 2021). In this work, we use the accountant of Balle et al. (2020), whose output is referred to as ε and is slightly smaller than εRDP.