TAN Without a Burn: Scaling Laws of DP-SGD
Tom Sander 1 2, Pierre Stock 2, Alexandre Sablayrolles 2
Abstract
Differentially Private methods for training Deep Neural Networks (DNNs) have progressed recently, in particular with the use of massive batches and aggregated data augmentations for a large number of training steps. These techniques require much more computing resources than their non-private counterparts, shifting the traditional privacy-accuracy trade-off to a privacy-accuracy-compute trade-off and making hyper-parameter search virtually impossible for realistic scenarios. In this work, we decouple privacy analysis and experimental behavior of noisy training to explore the trade-off with minimal computational requirements. We first use the tools of Rényi Differential Privacy (RDP) to highlight that the privacy budget, when not overcharged, only depends on the total amount of noise (TAN) injected throughout training. We then derive scaling laws for training models with DP-SGD to optimize hyper-parameters with more than a 100× reduction in computational budget. We apply the proposed method on CIFAR-10 and ImageNet and, in particular, strongly improve the state-of-the-art on ImageNet with a +9 points gain in top-1 accuracy for a privacy budget ε = 8.
1. Introduction
Deep neural networks (DNNs) have become a fundamental tool of modern artificial intelligence, producing cutting-edge performance in many domains such as computer vision (He et al., 2016), natural language processing (Devlin et al., 2018) or speech recognition (Amodei et al., 2016). The performance of these models generally increases with their training data size (Brown et al., 2020; Rae et al., 2021; Ramesh et al., 2022), which encourages the inclusion of more data in the model's training set.
1 CMAP, École polytechnique, Palaiseau, France. 2 Meta AI, Paris, France. Correspondence to: Tom Sander <tomsander@meta.com>.

Proceedings of the 40th International Conference on Machine Learning, Honolulu, Hawaii, USA. PMLR 202, 2023. Copyright 2023 by the author(s).
[Figure 1: accuracy-vs-compute plot; in-figure annotations read "×128 FLOPS", "Low Compute HP search", and "ε = 8".]

Figure 1. Training from scratch with DP-SGD on ImageNet. All points are obtained at a constant number of steps S = 72k and a constant ratio σ/B, with σ_ref = 2.5 and B_ref = 16384. The dashed lines are computed using a linear regression on the crosses, and the dots and stars illustrate the predictive power of TAN. We perform a low-compute hyper-parameter (HP) search at batch size 128 and extrapolate our best setup for a single run at large batch size: stars show our reproduction of the previous SOTA from De et al. (2022) and the improved performance obtained under the privacy budget ε = 8, with a +6 points gain in top-1 accuracy. The shaded blue areas denote 2 standard deviations over three runs.
This phenomenon also introduces a potential privacy risk for data that gets incorporated. Indeed, AI models not only learn about general statistics or trends of their training data distribution (such as grammar for language models), but also remember verbatim information about individual points (e.g., credit card numbers), which compromises their privacy (Carlini et al., 2019; 2021). Access to a trained model thus potentially leaks information about its training data.
The gold standard of disclosure control for individual information is Differential Privacy (DP) (Dwork et al., 2006). Informally, DP ensures that the training does not produce very different models if a sample is added to or removed from the dataset. Motivated by applications in deep learning, DP-SGD (Abadi et al., 2016) is an adaptation of Stochastic Gradient Descent (SGD) that clips individual gradients and adds Gaussian noise to their sum. Its DP guarantees depend on the privacy parameters: the sampling rate q = B/N (where B is the batch size and N is the number of training samples), the number of gradient steps S, and the noise σ².
Training neural networks with DP-SGD has seen progress recently, due to several factors. The first is the use of pre-trained models, with DP finetuning on downstream tasks (Li et al., 2021; De et al., 2022). This circumvents the traditional limitations of DP, because the model learns meaningful features from public data and can adapt to downstream data with minimal information. In the remainder of this paper, we only consider models trained from scratch, as we focus on obtaining information through the DP channel. Another emerging trend among DP practitioners is to use massive batch sizes for a large number of steps to achieve a better trade-off between privacy and utility: Anil et al. (2021) have successfully pre-trained BERT with DP-SGD using batch sizes of 2 million. This paradigm makes training models computationally intensive and hyper-parameter (HP) search effectively impractical for realistic datasets and architectures.
In this context, we look at DP-SGD through the lens of the Total Amount of Noise (TAN) injected during training, and use it to decouple two aspects: privacy accounting and the influence of noisy updates on the training dynamics. We first observe a heuristic rule: typically when σ > 2, the privacy budget ε only depends on the total amount of noise. Using the tools of RDP accounting, we approximate ε by a simple closed-form expression. We then analyze the scaling laws of DNNs at constant TAN and show that performance at very large batch sizes (computationally intensive) is predictable from performance at small batch sizes, as illustrated in Figure 1. Our contributions are the following:
• We take a heuristic view of privacy accounting by introducing the Total Amount of Noise (TAN) and show that in a regime where the budget ε is not overcharged, it only depends on TAN;
• We use this result in practice and derive scaling laws that showcase the predictive power of TAN to reduce the computational cost of hyper-parameter tuning with DP-SGD, saving a factor of 128 in compute on ImageNet experiments (Figure 1). We then use TAN to find optimal privacy parameters, leading to a gain of +9 points under ε = 8 compared to the previous SOTA;
• We leverage TAN to quantify the impact of the dataset size on the privacy/utility trade-off and show that with well-chosen privacy parameters, doubling the dataset size halves ε while providing better performance.
2. Background and Related Work
In this section, we review traditional definitions of DP, including Rényi Differential Privacy. We consider a randomized mechanism M that takes as input a dataset D of size N and outputs a machine learning model θ ∼ M(D).
Definition 2.1 (Approximate Differential Privacy). A randomized mechanism M satisfies (ε, δ)-DP (Dwork et al., 2006) if, for any pair of datasets D and D′ that differ by one sample and for all subsets R ⊆ Im(M),

$$P(\mathcal{M}(D) \in R) \le P(\mathcal{M}(D') \in R)\,\exp(\varepsilon) + \delta. \qquad (1)$$
DP-SGD (Abadi et al., 2016) is the most popular DP algorithm to train DNNs. It selects samples uniformly at random with probability q = B/N (with B the batch size and N the number of training samples), clips per-sample gradients to a norm C (clip_C), aggregates them and adds Gaussian noise. With θ the parameters of the DNN and ℓ_i(θ) the loss evaluated at sample (x_i, y_i), it uses the noisy gradient

$$g := \frac{1}{B} \sum_{i \in B} \mathrm{clip}_C\big(\nabla_\theta \ell_i(\theta)\big) + \mathcal{N}\!\left(0, \frac{C^2\sigma^2}{B^2}\right) \qquad (2)$$

to train the model. The traditional privacy analysis of DP-SGD is obtained through Rényi Differential Privacy.
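As a concrete illustration of Equation 2, here is a minimal sketch of one noisy gradient computation. It assumes per-sample gradients are already materialized as rows of a (B, d) tensor; the function name is ours, and real implementations (e.g., Opacus) compute per-sample gradients and handle the accounting for you.

```python
import torch

def noisy_gradient(per_sample_grads: torch.Tensor, C: float, sigma: float) -> torch.Tensor:
    """Equation 2 sketch: clip each per-sample gradient to norm C, sum, add Gaussian noise."""
    B = per_sample_grads.shape[0]
    # Rescale each row so its norm is at most C.
    norms = per_sample_grads.norm(dim=1, keepdim=True)
    clipped = per_sample_grads * torch.clamp(C / (norms + 1e-12), max=1.0)
    # Noise with std C*sigma on the sum; dividing by B gives variance C^2 sigma^2 / B^2.
    noisy_sum = clipped.sum(dim=0) + C * sigma * torch.randn_like(clipped[0])
    return noisy_sum / B
```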
Definition 2.2 (Rényi Divergence). For two probability distributions P and Q defined over ℝ, the Rényi divergence of order α > 1 of P given Q is:

$$D_\alpha(P \,\|\, Q) := \frac{1}{\alpha - 1} \log \mathbb{E}_{x \sim Q}\!\left[\left(\frac{P(x)}{Q(x)}\right)^{\alpha}\right].$$
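For two Gaussians with equal variance, this divergence has a simple closed form, D_α(N(μ, σ²) ∥ N(0, σ²)) = αμ²/(2σ²), which is handy for sanity checks. The snippet below (our illustration, not part of the paper) verifies it by integrating Definition 2.2 numerically:

```python
import numpy as np
from scipy.integrate import quad
from scipy.stats import norm

def renyi_divergence(p, q, alpha: float) -> float:
    """D_alpha(P || Q) for two 1-D densities, by numerical quadrature."""
    integrand = lambda x: q(x) * (p(x) / q(x)) ** alpha
    val, _ = quad(integrand, -30, 30)
    return np.log(val) / (alpha - 1)

mu, sigma, alpha = 1.0, 2.0, 4.0
p = lambda x: norm.pdf(x, loc=mu, scale=sigma)
q = lambda x: norm.pdf(x, loc=0.0, scale=sigma)
print(renyi_divergence(p, q, alpha))       # numerical value
print(alpha * mu ** 2 / (2 * sigma ** 2))  # closed form: 0.5
```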
Definition 2.3 (Rényi DP). A randomized mechanism M : 𝒟 → ℛ satisfies (α, d_α)-Rényi differential privacy (RDP) if, for any D, D′ ∈ 𝒟 that differ by one sample:

$$D_\alpha\big(\mathcal{M}(D) \,\|\, \mathcal{M}(D')\big) \le d_\alpha.$$
RDP is a convenient notion to track privacy because composition is additive: a sequence of two algorithms satisfying (α, d_α) and (α, d′_α) RDP satisfies (α, d_α + d′_α) RDP. In particular, S steps of an (α, d_α) RDP mechanism satisfy (α, S d_α) RDP. Mironov et al. (2019) show that each step of DP-SGD satisfies (α, g_α(σ, q))-RDP with

$$g_\alpha(\sigma, q) := D_\alpha\big((1-q)\,\mathcal{N}(0, \sigma^2) + q\,\mathcal{N}(1, \sigma^2) \,\big\|\, \mathcal{N}(0, \sigma^2)\big).$$

Finally, a mechanism satisfying (α, d_α)-RDP also satisfies (ε, δ)-DP (Mironov, 2017) for ε = d_α + log(1/δ)/(α − 1). Performing S steps of DP-SGD satisfies (ε_RDP, δ)-DP with

$$\varepsilon_{\mathrm{RDP}} := \min_\alpha \; S\, g_\alpha(\sigma, q) + \frac{\log(1/\delta)}{\alpha - 1}. \qquad (3)$$
RDP is the traditional tool used to analyse DP-SGD, but other accounting tools have been proposed to obtain tighter bounds (Gopi et al., 2021). In this work, we use the accountant of Balle et al. (2020), whose output we refer to as ε; it is slightly smaller than ε_RDP.
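To make Equation 3 concrete, here is a minimal sketch of the RDP accountant: it evaluates g_α by numerical quadrature (in log space, to avoid overflow at large α) and minimizes over a grid of integer orders. The function names and example parameters are ours; production accountants (e.g., Opacus or the numerical accountant of Balle et al. (2020)) use tighter and more careful numerics.

```python
import numpy as np
from scipy.integrate import quad
from scipy.stats import norm

def g_alpha(sigma: float, q: float, alpha: int) -> float:
    """RDP of one subsampled Gaussian step (Mironov et al., 2019), by quadrature."""
    log_p0 = lambda x: norm.logpdf(x, 0, sigma)           # log N(0, sigma^2)
    log_mix = lambda x: np.logaddexp(np.log1p(-q) + log_p0(x),
                                     np.log(q) + norm.logpdf(x, 1, sigma))
    # E_{x~N(0,sigma^2)}[(P/Q)^alpha] = integral of exp(alpha log P + (1-alpha) log Q).
    integrand = lambda x: np.exp(alpha * log_mix(x) + (1 - alpha) * log_p0(x))
    val, _ = quad(integrand, -10 * sigma, alpha + 10 * sigma, limit=500)
    return np.log(val) / (alpha - 1)

def epsilon_rdp(sigma: float, q: float, S: int, delta: float) -> float:
    """Equation 3: compose S steps, then convert RDP to (epsilon, delta)-DP."""
    return min(S * g_alpha(sigma, q, a) + np.log(1 / delta) / (a - 1)
               for a in range(2, 65))

# Illustrative small-scale parameters, chosen by us:
print(epsilon_rdp(sigma=2.5, q=128 / 50_000, S=10_000, delta=1e-5))
```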
Figure 2. Privacy budget ε as a function of the noise level σ with η constant. On both figures, each curve corresponds to a different number of steps S, and each point on the curve is computed at a sampling rate q such that η is constant. On the left, we use η = 0.13 (resulting in ε_TAN = 1 in Equation 4). On the right, we use η = 0.95 (ε_TAN = 8). We observe a "privacy wall" imposing σ ≥ 0.5 for a meaningful level of privacy budget ε, and σ ≥ 2 for constant ε ≈ ε_TAN.
DP variants. Concentrated Differential Privacy (CDP) (Dwork & Rothblum, 2016; Bun & Steinke, 2016) was originally proposed as a relaxation of (ε, δ)-DP with better compositional properties. Truncated CDP (tCDP) (Bun et al., 2018) is an extension of CDP, with improved properties of privacy amplification via sub-sampling, which is crucial for DP-SGD-style algorithms. The canonical noise for tCDP follows a "sinh-normal" distribution, with tails exponentially tighter than a Gaussian. In Sections 3.2 and 3.3, we highlight the practical implications of the privacy amplification by sub-sampling behavior of DP-SGD. We observe that in the large-noise regime, ε_RDP can be approximated by a very simple closed-form expression of the parameters (q, S, σ) through TAN, and relate it to CDP and tCDP.
Training from Scratch with DP. Training ML models with DP-SGD typically incurs a loss of model utility, but using very large batch sizes improves the privacy/utility trade-off (Anil et al., 2021; Li et al., 2021). De et al. (2022) recently introduced Augmentation Multiplicity (AugMult), which averages the gradients from different augmented versions of every sample before clipping and leads to improved performance on CIFAR-10. Computing per-sample gradients with mega batch sizes for a large number of steps and AugMult makes DP-SGD much more computationally intensive than non-private training, typically by dozens of times. For instance, reproducing the previous SOTA on ImageNet of De et al. (2022) under ε = 8 necessitates a 4-day run using 32 A100 GPUs, while the non-private SOTA can be reproduced in a few hours with the same hardware (Goyal et al., 2017). Yu et al. (2021b) propose to use a low-rank reparametrization of the weight matrices to diminish the computational cost of accessing per-sample gradients.
Finetuning with DP-SGD. Tramer & Boneh (2020) show that handcrafted features are very competitive when training from scratch, but fine-tuning deep models outperforms them. Li et al. (2021) and Yu et al. (2021a) fine-tune language models to competitive accuracy on several NLP tasks. De et al. (2022) consider models pre-trained on JFT-300M and transferred to downstream tasks.
3. The TAN approach
We introduce the notion of Total Amount of Noise (TAN) and discuss its connections to DP accounting. We then demonstrate how training with reference privacy parameters (q_ref, σ_ref, S) can be simulated with much lower computational resources using the same TAN with smaller batches.

Definition 3.1. Let the individual signal-to-noise ratio η (and its inverse Σ, the Total Amount of Noise or TAN) be:

$$\eta^2 = \frac{1}{\Sigma^2} := \frac{q^2 S}{2\sigma^2}.$$
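A small helper (hypothetical, ours) makes the definition and the constant-TAN simulation concrete: dividing the batch size and the noise by the same factor leaves η unchanged. We take N ≈ 1.28M, the usual ImageNet-1k training-set size (our assumption), together with the reference parameters of Figure 1.

```python
import math

def eta(q: float, S: int, sigma: float) -> float:
    """Individual signal-to-noise ratio of Definition 3.1: eta^2 = q^2 S / (2 sigma^2)."""
    return math.sqrt(q ** 2 * S / (2 * sigma ** 2))

N, S = 1_281_167, 72_000            # ImageNet-1k size (assumed), steps from Figure 1
B_ref, sigma_ref = 16_384, 2.5      # reference parameters from Figure 1
for k in (1, 8, 128):               # shrink batch and noise together: TAN is unchanged
    B, sigma = B_ref // k, sigma_ref / k
    print(f"B={B:6d}  sigma={sigma:.4f}  eta={eta(B / N, S, sigma):.3f}")
```

All three lines print the same η (≈ 0.97 with these numbers); this is the constant-σ/B family of runs plotted in Figure 1.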
3.1. Motivation

We begin with a simple case to motivate our definition of TAN. We assume a one-dimensional model, where the gradients of all points are clipped to C. Looking at Equation 2, in one batch of size B, the expected signal from each sample is C/B with probability q = B/N and 0 otherwise. Therefore, the expected individual signal of each sample after S steps is SC/N, and its squared norm is S²C²/N². The noise at each step being drawn independently, the variance across S steps adds up to SC²σ²/B². The ratio between the signal and the noise is thus equal (up to a factor 1/2) to

$$\frac{S^2 C^2 / N^2}{2\, S C^2 \sigma^2 / B^2} = \frac{q^2 S}{2\sigma^2} = \eta^2.$$

Denoting η_step := q/(√2 σ), we have η² = S η²_step. The ratio σ/q is noted by Li et al. (2021) as the effective noise.
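This accounting can also be checked empirically. Below is a quick Monte-Carlo sketch (the small-scale parameters are arbitrary and ours): the ratio of the squared accumulated signal to twice the variance of the accumulated noise matches η² from Definition 3.1.

```python
import numpy as np

rng = np.random.default_rng(0)
N, B, S, C, sigma = 50_000, 500, 2_000, 1.0, 2.0
q = B / N

# Each sample contributes C/B with probability q per step, so S*C/N in expectation.
signal_sq = (S * C / N) ** 2
# Accumulate S independent N(0, C^2 sigma^2 / B^2) noise draws, 2000 times over.
acc = (C * sigma / B) * rng.standard_normal((2_000, S)).sum(axis=1)

print(signal_sq / (2 * acc.var()))    # empirical ratio, ~0.025
print(q ** 2 * S / (2 * sigma ** 2))  # eta^2 from Definition 3.1: 0.025
```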