TAN Without a Burn: Scaling Laws of DP-SGD
Tom Sander 1 2, Pierre Stock 2, Alexandre Sablayrolles 2
Abstract
Differentially Private methods for training Deep Neural Networks (DNNs) have progressed recently, in particular with the use of massive batches and aggregated data augmentations for a large number of training steps. These techniques require much more computing resources than their non-private counterparts, shifting the traditional privacy-accuracy trade-off to a privacy-accuracy-compute trade-off and making hyper-parameter search virtually impossible for realistic scenarios. In this work, we decouple privacy analysis and experimental behavior of noisy training to explore the trade-off with minimal computational requirements. We first use the tools of Rényi Differential Privacy (RDP) to highlight that the privacy budget, when not overcharged, only depends on the total amount of noise (TAN) injected throughout training. We then derive scaling laws for training models with DP-SGD to optimize hyper-parameters with more than a 100× reduction in computational budget. We apply the proposed method on CIFAR-10 and ImageNet and, in particular, strongly improve the state-of-the-art on ImageNet with a +9 points gain in top-1 accuracy for a privacy budget ε = 8.
1. Introduction
Deep neural networks (DNNs) have become a fundamental tool of modern artificial intelligence, producing cutting-edge performance in many domains such as computer vision (He et al., 2016), natural language processing (Devlin et al., 2018) or speech recognition (Amodei et al., 2016). The performance of these models generally increases with their training data size (Brown et al., 2020; Rae et al., 2021; Ramesh et al., 2022), which encourages the inclusion of more data in the model's training set.
1 CMAP, École polytechnique, Palaiseau, France. 2 Meta AI, Paris, France. Correspondence to: Tom Sander <tomsander@meta.com>.

Proceedings of the 40th International Conference on Machine Learning, Honolulu, Hawaii, USA. PMLR 202, 2023. Copyright 2023 by the author(s).
[Figure 1: accuracy-vs-compute plot; in-figure annotations read "×128 FLOPS", "Low Compute HP search", and "ε = 8".]

Figure 1. Training from scratch with DP-SGD on ImageNet. All points are obtained at a constant number of steps S = 72k and a constant ratio σ/B, with σ_ref = 2.5 and B_ref = 16384. The dashed lines are computed using a linear regression on the crosses, and the dots and stars illustrate the predictive power of TAN. We perform a low-compute hyper-parameter (HP) search at batch size 128 and extrapolate our best setup for a single run at large batch size: stars show our reproduction of the previous SOTA from De et al. (2022) and the improved performance obtained under the privacy budget ε = 8, with a +6 points gain in top-1 accuracy. The shaded blue areas denote 2 standard deviations over three runs.
This phenomenon also introduces a potential privacy risk for data that gets incorporated. Indeed, AI models not only learn about general statistics or trends of their training data distribution (such as grammar for language models), but also remember verbatim information about individual points (e.g., credit card numbers), which compromises their privacy (Carlini et al., 2019; 2021). Access to a trained model thus potentially leaks information about its training data.
The gold standard of disclosure control for individual information is Differential Privacy (DP) (Dwork et al., 2006). Informally, DP ensures that the training does not produce very different models if a sample is added to or removed from the dataset. Motivated by applications in deep learning, DP-SGD (Abadi et al., 2016) is an adaptation of Stochastic Gradient Descent (SGD) that clips individual gradients and adds Gaussian noise to their sum. Its DP guarantees depend on the privacy parameters: the sampling rate q = B/N (where B is the batch size and N is the number of training samples), the number of gradient steps S, and the noise σ².
Training neural networks with DP-SGD has seen progress recently, due to several factors. The first is the use of pre-trained models, with DP finetuning on downstream tasks (Li et al., 2021; De et al., 2022). This circumvents the traditional limitations of DP, because the model learns meaningful features from public data and can adapt to downstream data with minimal information. In the remainder of this paper, we only consider models trained from scratch, as we focus on obtaining information through the DP channel. Another emerging trend among DP practitioners is to use massive batch sizes for a large number of steps to achieve a better trade-off between privacy and utility: Anil et al. (2021) have successfully pre-trained BERT with DP-SGD using batch sizes of 2 million. This paradigm makes training models computationally intensive and hyper-parameter (HP) search effectively impractical for realistic datasets and architectures.
In this context, we look at DP-SGD through the lens of the Total Amount of Noise (TAN) injected during training, and use it to decouple two aspects: privacy accounting and the influence of noisy updates on the training dynamics. We first observe a heuristic rule: typically when σ > 2, the privacy budget ε only depends on the total amount of noise. Using the tools of RDP accounting, we approximate ε by a simple closed-form expression. We then analyze the scaling laws of DNNs at constant TAN and show that performance at very large batch sizes (computationally intensive) is predictable from performance at small batch sizes, as illustrated in Figure 1. Our contributions are the following:
• We take a heuristic view of privacy accounting by introducing the Total Amount of Noise (TAN) and show that in a regime where the budget ε is not overcharged, it only depends on TAN;
• We use this result in practice and derive scaling laws that showcase the predictive power of TAN to reduce the computational cost of hyper-parameter tuning with DP-SGD, saving a factor of 128 in compute on ImageNet experiments (Figure 1). We then use TAN to find optimal privacy parameters, leading to a gain of +9 points under ε = 8 compared to the previous SOTA;
• We leverage TAN to quantify the impact of the dataset size on the privacy/utility trade-off and show that with well-chosen privacy parameters, doubling the dataset size halves ε while providing better performance.
2. Background and Related Work
In this section, we review traditional definitions of DP, including Rényi Differential Privacy. We consider a randomized mechanism M that takes as input a dataset D of size N and outputs a machine learning model θ ∼ M(D).
Definition 2.1 (Approximate Differential Privacy). A randomized mechanism M satisfies (ε, δ)-DP (Dwork et al., 2006) if, for any pair of datasets D and D′ that differ by one sample and for all subsets R ⊆ Im(M),

$$P(\mathcal{M}(D) \in R) \le P(\mathcal{M}(D') \in R)\,\exp(\varepsilon) + \delta. \qquad (1)$$
DP-SGD (Abadi et al., 2016) is the most popular DP algorithm to train DNNs. It selects samples uniformly at random with probability q = B/N (with B the batch size and N the number of training samples), clips per-sample gradients to a norm C (clip_C), aggregates them and adds Gaussian noise. With θ the parameters of the DNN and ℓ_i(θ) the loss evaluated at sample (x_i, y_i), it uses the noisy gradient

$$g := \frac{1}{B} \sum_{i \in B} \mathrm{clip}_C\big(\nabla_\theta \ell_i(\theta)\big) + \mathcal{N}\!\left(0, \frac{C^2\sigma^2}{B^2}\right) \qquad (2)$$

to train the model. The traditional privacy analysis of DP-SGD is obtained through Rényi Differential Privacy.
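As a concrete illustration of Equation 2, here is a minimal sketch of one noisy gradient computation. It assumes per-sample gradients are already materialized as rows of a (B, d) tensor; the function name is ours, and real implementations (e.g., Opacus) compute per-sample gradients and handle the accounting for you.

```python
import torch

def noisy_gradient(per_sample_grads: torch.Tensor, C: float, sigma: float) -> torch.Tensor:
    """Equation 2 sketch: clip each per-sample gradient to norm C, sum, add Gaussian noise."""
    B = per_sample_grads.shape[0]
    # Rescale each row so its norm is at most C.
    norms = per_sample_grads.norm(dim=1, keepdim=True)
    clipped = per_sample_grads * torch.clamp(C / (norms + 1e-12), max=1.0)
    # Noise with std C*sigma on the sum; dividing by B gives variance C^2 sigma^2 / B^2.
    noisy_sum = clipped.sum(dim=0) + C * sigma * torch.randn_like(clipped[0])
    return noisy_sum / B
```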
Definition 2.2 (Rényi Divergence). For two probability distributions P and Q defined over ℝ, the Rényi divergence of order α > 1 of P given Q is:

$$D_\alpha(P \,\|\, Q) := \frac{1}{\alpha - 1} \log \mathbb{E}_{x \sim Q}\!\left[\left(\frac{P(x)}{Q(x)}\right)^{\alpha}\right].$$
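For two Gaussians with equal variance, this divergence has a simple closed form, D_α(N(μ, σ²) ∥ N(0, σ²)) = αμ²/(2σ²), which is handy for sanity checks. The snippet below (our illustration, not part of the paper) verifies it by integrating Definition 2.2 numerically:

```python
import numpy as np
from scipy.integrate import quad
from scipy.stats import norm

def renyi_divergence(p, q, alpha: float) -> float:
    """D_alpha(P || Q) for two 1-D densities, by numerical quadrature."""
    integrand = lambda x: q(x) * (p(x) / q(x)) ** alpha
    val, _ = quad(integrand, -30, 30)
    return np.log(val) / (alpha - 1)

mu, sigma, alpha = 1.0, 2.0, 4.0
p = lambda x: norm.pdf(x, loc=mu, scale=sigma)
q = lambda x: norm.pdf(x, loc=0.0, scale=sigma)
print(renyi_divergence(p, q, alpha))       # numerical value
print(alpha * mu ** 2 / (2 * sigma ** 2))  # closed form: 0.5
```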
Definition 2.3 (Rényi DP). A randomized mechanism M : 𝒟 → ℛ satisfies (α, d_α)-Rényi differential privacy (RDP) if, for any D, D′ ∈ 𝒟 that differ by one sample:

$$D_\alpha\big(\mathcal{M}(D) \,\|\, \mathcal{M}(D')\big) \le d_\alpha.$$
RDP is a convenient notion to track privacy because composition is additive: a sequence of two algorithms satisfying (α, d_α) and (α, d′_α) RDP satisfies (α, d_α + d′_α) RDP. In particular, S steps of an (α, d_α) RDP mechanism satisfy (α, S d_α) RDP. Mironov et al. (2019) show that each step of DP-SGD satisfies (α, g_α(σ, q))-RDP with

$$g_\alpha(\sigma, q) := D_\alpha\big((1-q)\,\mathcal{N}(0, \sigma^2) + q\,\mathcal{N}(1, \sigma^2) \,\big\|\, \mathcal{N}(0, \sigma^2)\big).$$

Finally, a mechanism satisfying (α, d_α)-RDP also satisfies (ε, δ)-DP (Mironov, 2017) for ε = d_α + log(1/δ)/(α − 1). Performing S steps of DP-SGD satisfies (ε_RDP, δ)-DP with

$$\varepsilon_{\mathrm{RDP}} := \min_\alpha \; S\, g_\alpha(\sigma, q) + \frac{\log(1/\delta)}{\alpha - 1}. \qquad (3)$$
RDP is the traditional tool used to analyse DP-SGD, but other accounting tools have been proposed to obtain tighter bounds (Gopi et al., 2021). In this work, we use the accountant of Balle et al. (2020), whose output we refer to as ε; it is slightly smaller than ε_RDP.
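To make Equation 3 concrete, here is a minimal sketch of the RDP accountant: it evaluates g_α by numerical quadrature (in log space, to avoid overflow at large α) and minimizes over a grid of integer orders. The function names and example parameters are ours; production accountants (e.g., Opacus or the numerical accountant of Balle et al. (2020)) use tighter and more careful numerics.

```python
import numpy as np
from scipy.integrate import quad
from scipy.stats import norm

def g_alpha(sigma: float, q: float, alpha: int) -> float:
    """RDP of one subsampled Gaussian step (Mironov et al., 2019), by quadrature."""
    log_p0 = lambda x: norm.logpdf(x, 0, sigma)           # log N(0, sigma^2)
    log_mix = lambda x: np.logaddexp(np.log1p(-q) + log_p0(x),
                                     np.log(q) + norm.logpdf(x, 1, sigma))
    # E_{x~N(0,sigma^2)}[(P/Q)^alpha] = integral of exp(alpha log P + (1-alpha) log Q).
    integrand = lambda x: np.exp(alpha * log_mix(x) + (1 - alpha) * log_p0(x))
    val, _ = quad(integrand, -10 * sigma, alpha + 10 * sigma, limit=500)
    return np.log(val) / (alpha - 1)

def epsilon_rdp(sigma: float, q: float, S: int, delta: float) -> float:
    """Equation 3: compose S steps, then convert RDP to (epsilon, delta)-DP."""
    return min(S * g_alpha(sigma, q, a) + np.log(1 / delta) / (a - 1)
               for a in range(2, 65))

# Illustrative small-scale parameters, chosen by us:
print(epsilon_rdp(sigma=2.5, q=128 / 50_000, S=10_000, delta=1e-5))
```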
Figure 2. Privacy budget ε as a function of the noise level σ with η constant. On both figures, each curve corresponds to a different number of steps S, and each point on the curve is computed at a sampling rate q such that η is constant. On the left, we use η = 0.13 (resulting in ε_TAN = 1 in Equation 4). On the right, we use η = 0.95 (ε_TAN = 8). We observe a "privacy wall" imposing σ ≥ 0.5 for a meaningful level of privacy budget ε, and σ ≥ 2 for constant ε ≈ ε_TAN.
DP variants. Concentrated Differential Privacy (CDP) (Dwork & Rothblum, 2016; Bun & Steinke, 2016) was originally proposed as a relaxation of (ε, δ)-DP with better compositional properties. Truncated CDP (tCDP) (Bun et al., 2018) is an extension of CDP, with improved properties of privacy amplification via sub-sampling, which is crucial for DP-SGD-style algorithms. The canonical noise for tCDP follows a "sinh-normal" distribution, with tails exponentially tighter than a Gaussian. In Sections 3.2 and 3.3, we highlight the practical implications of the privacy amplification by sub-sampling behavior of DP-SGD. We observe that in the large-noise regime, ε_RDP can be approximated by a very simple closed-form expression of the parameters (q, S, σ) through TAN, and relate it to CDP and tCDP.
Training from Scratch with DP. Training ML models with DP-SGD typically incurs a loss of model utility, but using very large batch sizes improves the privacy/utility trade-off (Anil et al., 2021; Li et al., 2021). De et al. (2022) recently introduced Augmentation Multiplicity (AugMult), which averages the gradients from different augmented versions of every sample before clipping and leads to improved performance on CIFAR-10. Computing per-sample gradients with mega batch sizes for a large number of steps and AugMult makes DP-SGD much more computationally intensive than non-private training, typically by dozens of times. For instance, reproducing the previous SOTA on ImageNet of De et al. (2022) under ε = 8 necessitates a 4-day run using 32 A100 GPUs, while the non-private SOTA can be reproduced in a few hours with the same hardware (Goyal et al., 2017). Yu et al. (2021b) propose to use a low-rank reparametrization of the weight matrices to diminish the computational cost of accessing per-sample gradients.
Finetuning with DP-SGD. Tramer & Boneh (2020) show that handcrafted features are very competitive when training from scratch, but fine-tuning deep models outperforms them. Li et al. (2021) and Yu et al. (2021a) fine-tune language models to competitive accuracy on several NLP tasks. De et al. (2022) consider models pre-trained on JFT-300M and transferred to downstream tasks.
3. The TAN approach
We introduce the notion of Total Amount of Noise (TAN) and discuss its connections to DP accounting. We then demonstrate how training with reference privacy parameters (q_ref, σ_ref, S) can be simulated with much lower computational resources using the same TAN with smaller batches.

Definition 3.1. Let the individual signal-to-noise ratio η (and its inverse Σ, the Total Amount of Noise or TAN) be:

$$\eta^2 = \frac{1}{\Sigma^2} := \frac{q^2 S}{2\sigma^2}.$$
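A small helper (hypothetical, ours) makes the definition and the constant-TAN simulation concrete: dividing the batch size and the noise by the same factor leaves η unchanged. We take N ≈ 1.28M, the usual ImageNet-1k training-set size (our assumption), together with the reference parameters of Figure 1.

```python
import math

def eta(q: float, S: int, sigma: float) -> float:
    """Individual signal-to-noise ratio of Definition 3.1: eta^2 = q^2 S / (2 sigma^2)."""
    return math.sqrt(q ** 2 * S / (2 * sigma ** 2))

N, S = 1_281_167, 72_000            # ImageNet-1k size (assumed), steps from Figure 1
B_ref, sigma_ref = 16_384, 2.5      # reference parameters from Figure 1
for k in (1, 8, 128):               # shrink batch and noise together: TAN is unchanged
    B, sigma = B_ref // k, sigma_ref / k
    print(f"B={B:6d}  sigma={sigma:.4f}  eta={eta(B / N, S, sigma):.3f}")
```

All three lines print the same η (≈ 0.97 with these numbers); this is the constant-σ/B family of runs plotted in Figure 1.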
3.1. Motivation

We begin with a simple case to motivate our definition of TAN. We assume a one-dimensional model, where the gradients of all points are clipped to C. Looking at Equation 2, in one batch of size B, the expected signal from each sample is C/B with probability q = B/N and 0 otherwise. Therefore, the expected individual signal of each sample after S steps is SC/N, and its squared norm is S²C²/N². The noise at each step being drawn independently, the variance across S steps adds up to SC²σ²/B². The ratio between the signal and the noise is thus equal (up to a factor 1/2) to

$$\frac{S^2 C^2 / N^2}{2\, S C^2 \sigma^2 / B^2} = \frac{q^2 S}{2\sigma^2} = \eta^2.$$

Denoting η_step := q/(√2 σ), we have η² = S η²_step. The ratio σ/q is noted by Li et al. (2021) as the effective noise.
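This accounting can also be checked empirically. Below is a quick Monte-Carlo sketch (the small-scale parameters are arbitrary and ours): the ratio of the squared accumulated signal to twice the variance of the accumulated noise matches η² from Definition 3.1.

```python
import numpy as np

rng = np.random.default_rng(0)
N, B, S, C, sigma = 50_000, 500, 2_000, 1.0, 2.0
q = B / N

# Each sample contributes C/B with probability q per step, so S*C/N in expectation.
signal_sq = (S * C / N) ** 2
# Accumulate S independent N(0, C^2 sigma^2 / B^2) noise draws, 2000 times over.
acc = (C * sigma / B) * rng.standard_normal((2_000, S)).sum(axis=1)

print(signal_sq / (2 * acc.var()))    # empirical ratio, ~0.025
print(q ** 2 * S / (2 * sigma ** 2))  # eta^2 from Definition 3.1: 0.025
```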