Differentially Private Optimization on Large Model at Small Cost
Zhiqi Bu¹  Yu-Xiang Wang²  Sheng Zha¹  George Karypis¹

¹Amazon Web Services  ²University of California, Santa Barbara. Correspondence to: Zhiqi Bu <zhiqibu@amazon.com>.

Proceedings of the 40th International Conference on Machine Learning, Honolulu, Hawaii, USA. PMLR 202, 2023. Copyright 2023 by the author(s).
Abstract
Differentially private (DP) optimization is the standard paradigm to learn large neural networks that are accurate and privacy-preserving. The computational cost of DP deep learning, however, is notoriously heavy due to the per-sample gradient clipping. Existing DP implementations are 2∼1000× more costly in time and space complexity than the standard (non-private) training. In this work, we develop a novel Book-Keeping (BK) technique that implements existing DP optimizers (thus achieving the same accuracy) with a substantial improvement in the computational cost. Specifically, BK enables DP training on large models and high dimensional data to be roughly as fast and memory-saving as the standard training, whereas previous DP algorithms can be inefficient or incapable of training due to memory errors. The computational advantage of BK is supported by a complexity analysis as well as extensive experiments on vision and language tasks. Our implementation achieves state-of-the-art (SOTA) accuracy with very small extra cost: on GPT2 and at almost the same memory cost (<1% overhead), BK has 1.03× the time complexity of the standard training (0.83× training speed in practice), and 0.61× the time complexity of the most efficient DP implementation (1.36× training speed in practice). We open-source the codebase for the BK algorithm in the FastDP library at https://github.com/awslabs/fast-differential-privacy.
1 Introduction
Deep learning with differential privacy (DP; Dwork et al., 2006) has shown strong performance while guaranteeing rigorous protection against privacy risks, especially on large models that tend to memorize and leak the training data (Carlini et al., 2021; Haim et al., 2022; Shokri et al., 2017).
For example, recent advances have shed light on the success of DP GPT2 (Li et al., 2021; Bu et al., 2022b; Yu et al., 2021), which achieves a 64.6 BLEU score¹ at a strong privacy guarantee (ϵ = 3) on the text generation task using the E2E restaurant review dataset. This is only marginally below the standard non-private GPT2 (BLEU score 66.8). Similarly, on computer vision tasks (ϵ = 2), DP vision transformers and ResNets have obtained 97.1%/86.2% accuracy on CIFAR10/100 by (Bu et al., 2022a) and over 81% accuracy on ImageNet by (De et al., 2022; Mehta et al., 2022).

¹BLEU (BiLingual Evaluation Understudy) is a metric (0-100) for automatically evaluating translated text. BLEU > 60 is considered "very high quality, adequate, and fluent translations, often better than human".
However, DP training of large neural networks is well-known to be computationally burdensome in comparison to the standard training, in terms of both the training time and the memory cost. For instance, training a small recurrent neural network (0.598M parameters) experiences a 1000× slowdown using DP optimizers in the Tensorflow-Privacy (TF-Privacy) library (Bu et al., 2021a), and training a small convolutional neural network (CNN, 0.605M parameters) on CIFAR10 has a 24× slowdown with Tensorflow 2 and the XLA compiler (Subramani et al., 2021). Even with SOTA efficient implementations, large models such as RoBERTa (Liu et al., 2019), GPT2 (Radford et al., 2019), ResNet (He et al., 2016), VGG (Simonyan & Zisserman, 2014), ViT (Dosovitskiy et al., 2020) and its variants experience about 2∼3× slowdown in Pytorch (Li et al., 2021; Bu et al., 2022a) and 2∼9× slowdown in JAX (Kurakin et al., 2022; De et al., 2022), with possibly 4∼20× memory overhead (Bu et al., 2022a; Li et al., 2021; Subramani et al., 2021), if not running out of memory.
Table 1. Efficiency of the BK algorithm on DP tasks using one A100 GPU (same accuracy). The speed is measured in wall-time (hardware speed) and in complexity (theoretical speed), relative to each baseline under the same memory constraint. More models and tasks can be found in Table 9.

| Dataset | SOTA setting | Model | Time/Epoch | vs. GhostClip | vs. Opacus | vs. non-DP |
|---|---|---|---|---|---|---|
| QQP | (Li et al., 2021) | RoBERTa-large (355M) | 70'04" | 1.36× | 1.96× | 0.77× (0.89×) |
| E2E | (Li et al., 2021) | GPT2-large (774M) | 10'01" | 1.36× | 4.41× | 0.83× (0.97×) |
| CIFAR | (Bu et al., 2022a) | BEiT-large (304M) | 6'35" | 1.33× | 38.3× | 0.76× (0.92×) |

The efficiency bottleneck in DP deep learning lies in the per-sample gradient clipping, which restricts the magnitude of each per-sample gradient in the mini-batch. Applying the clipping jointly with the Gaussian noise addition, one can privately release the gradient to arbitrary optimizers like SGD and Adam, and thus guarantee the privacy of the training as described in Section 1.3:

$$\text{private gradient: } \hat{G} := \textstyle\sum_i g_i \cdot C(\|g_i\|_2) + \sigma_{\mathrm{DP}} \cdot \mathcal{N}(0, \mathbf{I}),$$
$$\text{private optimizer (e.g. SGD): } W_{t+1} = W_t - \eta \hat{G}. \tag{1}$$

Here $W$ is the model parameters, $\mathcal{L}_i$ is the per-sample loss, $g_i = \partial\mathcal{L}_i/\partial W$ is the per-sample gradient, $\eta$ is the learning rate, $\sigma_{\mathrm{DP}}$ is the noise magnitude that defines the privacy loss, and $C(\|g_i\|)$, or simply $C_i$, is the per-sample clipping factor. For example, in (Abadi et al., 2016), $C_i = \min\{R/\|g_i\|, 1\}$ for some clipping threshold $R$; in (Bu et al., 2021b), $C_i = \mathbb{I}(\|g_i\| \le R)$; in (Bu et al., 2022b), $C_i = 1/(\|g_i\| + 0.01)$ or $1/\|g_i\|$ as the gradient normalization.
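To make Equation (1) concrete, the sketch below computes the private gradient from explicitly materialized per-sample gradients; this is the naive route that the rest of the paper avoids, and the function and argument names are ours for illustration only.

```python
import torch

def private_gradient(per_sample_grads, R=1.0, sigma_dp=1.0, clipping="abadi"):
    """Naive Equation (1): per_sample_grads has shape (B, D), one flattened g_i per row."""
    norms = per_sample_grads.norm(dim=1)                        # ||g_i||_2
    if clipping == "abadi":                                     # Abadi et al. (2016)
        C = torch.clamp(R / (norms + 1e-12), max=1.0)           # C_i = min{R/||g_i||, 1}
    else:                                                       # automatic clipping (Bu et al., 2022b)
        C = 1.0 / (norms + 0.01)                                # C_i = 1/(||g_i|| + 0.01)
    clipped_sum = (per_sample_grads * C.unsqueeze(1)).sum(0)    # sum_i C_i * g_i
    return clipped_sum + sigma_dp * torch.randn_like(clipped_sum)  # + sigma_DP * N(0, I)
```

Materializing `per_sample_grads` costs O(B·M) memory for M model parameters, which is precisely the bottleneck that the implementations discussed below try to remove; the optimizer step itself is unchanged, e.g. $W_{t+1} = W_t - \eta\hat{G}$ for SGD.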
At a high level, DP training is a system effort consisting of multiple parts:
I. optimizer: DP-SGD, DP-Adam, DP-LAMB;
II. parameter efficiency: last layer (linear probing), LoRA,
Adapter, BiTFiT;
III. implementation: Opacus, GhostClip, Book-Keeping;
IV. platform: Pytorch, JAX, TensorFlow.
Previous works have tackled the efficiency bottleneck with various approaches. One approach (part II) focuses on parameter efficiency by partially training a neural network, in contrast to fully fine-tuning all model parameters, e.g. only the last output layer (Tramer & Boneh, 2020), the adapter layers (Houlsby et al., 2019; Mahabadi et al., 2021), or the Low-Rank Adaptation (LoRA) (Hu et al., 2021; Yu et al., 2021). For example, (Mehta et al., 2022) accelerate the DP training on ImageNet (Deng et al., 2009) by up to 30× by only training the last layer of ResNet152. Notably, parameter efficient fine-tuning does not improve the efficiency in terms of complexity per parameter; it only reduces the number of parameters. Furthermore, this approach oftentimes leads to some accuracy degradation compared to DP full fine-tuning (Bu et al., 2020; Mehta et al., 2022; Li et al., 2021; Yu et al., 2021).
An orthogonal approach, including this work, focuses on
the computation efficiency (part III), i.e. reducing the time
and space complexity through efficient implementations,
without modifying the DP optimizers (part I) and thus not
affecting their performance. We will elaborate on existing
methods in Section 1.2. Additionally, these methods can be compiled on different platforms (part IV) such as Tensorflow 2 (XLA), JAX and Pytorch (Li et al., 2021; Subramani et al., 2021; De et al., 2022; Kurakin et al., 2022), where a remarkable speed difference has been observed in some cases, even with the same implementation. For example, (Subramani et al., 2021) implemented DP-SGD using JAX and claimed its efficiency advantage over the same algorithm using Tensorflow or Pytorch.
1.1 Contributions
1. [Algorithm] We propose the book-keeping (BK) algorithm that makes existing DP optimizers fast and memory efficient, in particular comparable to non-private optimizers. We demonstrate BK via the computation graph in Figure 1. The highlight is that BK only uses one back-propagation and never instantiates the per-sample gradients $\{\partial\mathcal{L}_i/\partial W\}_{i=1}^B$.

2. [Analysis] We analyze the complexity to show that BK has almost the same time and space complexity as non-DP training, especially when the feature dimension is small (see Table 5).

3. [Extension] We strengthen BK using a layerwise decision to mix with Opacus (see Section 3.2), which proves to be efficient when the feature dimension is large (and difficult for GhostClip). We also extend BK to parameter efficient fine-tuning such as DP LoRA and Adapter.

4. [Codebase] We develop a Pytorch (Paszke et al., 2019) codebase for our BK algorithm, leveraging the auto-differentiation technique on the computation graph and a new trick in Appendix D.2. We highlight that our codebase can automatically switch the standard training of any model to its DP version by adding a single piece of code (a minimal usage sketch follows this list).

5. [Experiments] We demonstrate the remarkable efficiency of BK on training large models, saving the memory by up to 10× and boosting the speed by 30%∼5× over previous DP implementations.
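As a rough illustration of contribution 4, the snippet below follows the Opacus-style `PrivacyEngine` pattern that the FastDP codebase adopts; the class and argument names are written from memory of the repository README and may differ from the actual API, so treat them as assumptions and consult the library documentation.

```python
import torch
import torch.nn as nn
from fastDP import PrivacyEngine   # assumed import path from the FastDP repository

model = nn.Sequential(nn.Flatten(), nn.Linear(784, 10))
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)

# Hypothetical arguments: logical batch size, dataset size, epochs, and privacy budget.
privacy_engine = PrivacyEngine(
    model,
    batch_size=256,
    sample_size=50_000,
    epochs=3,
    target_epsilon=2.0,
)
privacy_engine.attach(optimizer)

# After attaching, the usual loop (loss.backward(); optimizer.step()) applies the
# clipped and noised gradient of Equation (1) instead of the standard gradient.
```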
Figure 1. Forward pass and back-propagation of the $l$-th linear layer (standard training in black; DP training by our book-keeping algorithm added in red). Here $a^{(l)}$ is the activation tensor, $s^{(l)}$ is the layer output, $W^{(l)}, b^{(l)}$ are the weight and bias, and $\mathcal{L}_i, \mathcal{L}$ are the per-sample loss and the summed loss. The dotted arrow is an inter-layer operation such as pooling or normalization.
1.2 Related works
Previous works have developed different implementations of the same DP optimizer in Equation (1). Among these implementations, the tradeoff between time and space complexity has been constantly navigated. TF-Privacy (Tensorflow) back-propagates a vectorized loss $[\mathcal{L}_1, \cdots, \mathcal{L}_B]$ to compute the per-sample gradients, each from one back-propagation, which is memory-efficient but slow. Opacus (Yousefpour et al., 2021) and (Rochette et al., 2019) accelerate the training significantly using the outer product trick in (Goodfellow et al., 2014), though incurring a heavy memory burden to store the per-sample gradients. This memory burden is partially alleviated in FastGradClip (Lee & Kifer, 2020) by sharing the space complexity across two rounds of back-propagation, hence almost doubling the time complexity. In ghost clipping (Goodfellow, 2015; Li et al., 2021; Bu et al., 2022a), the per-sample gradients can be clipped without being instantiated, so both time and space complexity can be further improved when the feature dimension is small. We refer interested readers to Figure 3 and Appendix C for the algorithmic details of these implementations.
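For concreteness, here is a minimal sketch of the outer-product trick on a single linear layer $s = aW$; the tensor names are ours, and the point is that per-sample gradients are formed from quantities already present in back-propagation, at the price of O(Bpd) extra memory.

```python
import torch

B, T, d, p = 32, 10, 64, 128
a  = torch.randn(B, T, d)    # layer input (activation), cached during the forward pass
ds = torch.randn(B, T, p)    # output gradient dL/ds, available during back-propagation

# Outer-product trick: the per-sample gradient of W is a_i^T (dL/ds_i), shape (B, d, p).
per_sample_grads = torch.einsum('btd,btp->bdp', a, ds)
per_sample_norms = per_sample_grads.flatten(start_dim=1).norm(dim=1)   # ||dL_i/dW||_F
```

Storing `per_sample_grads` for every layer is exactly the memory overhead attributed to Opacus in Table 2.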
We now compare BK to different implementations in Table 2 and Figure 2. In what follows, $B$ is the batch size², $T^{(l)}$ is the feature dimension³, and $d^{(l)}, p^{(l)}$ are the input and output dimensions of a layer.
1.3 Preliminaries
We work with the (ϵ, δ)-DP of (Dwork et al., 2006), defined in Appendix A, which makes it difficult for any privacy attacker to distinguish or detect an arbitrary training sample, even with full access to the model. In deep learning, DP is achieved by training on the private gradient in Equation (1) with any optimizer such as SGD, Adam, FedAvg, etc. Essentially, the private gradient is the addition of Gaussian noise to the sum of clipped per-sample gradients, which guarantees the DP protection through the privacy accounting theorems (Abadi et al., 2016; Mironov, 2017; Dong et al., 2019; Zhu et al., 2021; Gopi et al., 2021; Koskela et al., 2020).
2 Book-keeping: Efficient DP training in low dimension
The main computational bottleneck of DP training comes
from the per-sample gradient clipping, or from the computa-
tion of per-sample gradient norms, to be exact. One widely
used approach, adopted by Opacus, TF-Privacy, and FastGradClip, is to instantiate the per-sample gradients and then derive their norms. A straightforward implementation of this approach on a mini-batch of per-sample losses requires $B$ rounds of back-propagation (an unacceptable slowdown) or $B\times$ gradient storage (an unacceptable memory burden; see Opacus in Figure 2). Consequently, these implementations are not suitable for large model training. For instance, (Li et al., 2021) shows that, when training GPT2-large (774M parameters), Opacus (Yousefpour et al., 2021) and JAX (Subramani et al., 2021) cannot fit even a single sample into a 24GB GPU.
²In this work, we report the physical batch size, which affects the efficiency but not the accuracy; the accuracy is only affected by the logical batch size, which can be implemented through gradient accumulation over physical batches.

³For non-sequential data, $T = 1$; for texts, $T$ is the sequence length, which is layer-independent; for images (or videos), $T^{(l)}$ is the height × width (× time) of the hidden feature representation, which is layer-dependent.
Table 2. Summary of different DP implementations on a linear/convolution layer $\mathbb{R}^{B\times T^{(l)}\times d^{(l)}} \to \mathbb{R}^{B\times T^{(l)}\times p^{(l)}}$.

| | Non-DP | TF-Privacy | Opacus | FastGradClip | GhostClip | BK (ours) |
|---|---|---|---|---|---|---|
| Instantiating per-sample grads | ✗ | ✓ | ✓ | ✓ | ✗ | ✗ |
| Storing every layer's grads | ✗ | ✗ | ✓ | ✗ | ✗ | ✗ |
| Instantiating non-DP grads | ✓ | ✗ | ✗ | ✗ | ✓ | ✗ |
| Number of back-propagations | 1 | B | 1 | 2 | 2 | 1 |
| Time complexity | 6BTpd | 6BTpd | 8BTpd | 8BTpd | 10BTpd + O(BT²) | 6BTpd + O(BT²) |
| Memory overhead to non-DP | 0 | 0 | Bpd | Bpd | 2BT² | min{2BT², Bpd} |
| Scalable to large models | ✓ | ✗ | ✗ | ✗ | ✓ | ✓ |
| Scalable to high-dimensional input | ✓ | ✓ | ✓ | ✓ | ✗ | ✓ |
Figure 2. Speed and memory on MLP and CIFAR100 (images are flattened into vectors). Left to right: deep network (50 layers, width 1000, 50M parameters, batch size 128), shallow network (10 layers, width 1000, 10M parameters, batch size 128), and wide network (10 layers, width 5000, 250M parameters, batch size 128 or 1024; Opacus is OOM). See more ablation studies in Appendix F.
An alternative approach, termed ghost clipping (GhostClip), directly computes the per-sample gradient norms without computing the gradients themselves. This is made possible, unfortunately, through two rounds of back-propagation. During the first back-propagation, one uses the regular loss $\sum_i \mathcal{L}_i$ and extracts the activation tensor and the output gradient $(a, \partial\mathcal{L}/\partial s)$. One can use an algebraic trick in Equation (2) to compute the per-sample gradient norms $\{\|\partial\mathcal{L}_i/\partial W\|\}_i$ and the clipping factors $\{C_i\}_i$ in Equation (1). During the second back-propagation, one uses the reweighted loss $\sum_i C_i \mathcal{L}_i$, whose gradient is exactly the weighted gradient $\sum_i C_i g_i$ that constitutes the private gradient we need. Note that this double back-propagation roughly doubles the training time (more precisely, $10/6 \approx 1.67\times$ when $T$ is small; this approach loses its advantage when $T$ is large), as shown in Table 2.
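The second back-propagation of GhostClip can be sketched as follows, assuming the per-sample losses are available as a vector whose autograd graph is still alive (e.g., the first backward pass used retain_graph=True), and that the per-sample norms have already been obtained from the ghost norm trick of Equation (2); the function name is ours.

```python
import torch

def ghostclip_second_pass(model, per_sample_losses, per_sample_norms, R=1.0):
    # Clipping factors from Equation (1), Abadi-style.
    C = torch.clamp(R / (per_sample_norms + 1e-12), max=1.0)
    model.zero_grad()
    # The gradient of the reweighted loss sum_i C_i * L_i is exactly sum_i C_i * g_i.
    (C.detach() * per_sample_losses).sum().backward()
    # Each parameter's .grad now holds the sum of clipped per-sample gradients,
    # ready for the Gaussian noise addition of Equation (1).
```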
To make the DP training as efficient as the standard training, we propose the book-keeping technique (BK) that (1) only requires a single round of back-propagation, like Opacus and the standard training, and (2) does not instantiate the per-sample gradients, like GhostClip.
2.1 Book-keeping algorithms
BK algorithms in their base forms are built on GhostClip, and especially on the ghost norm trick, so as to avoid instantiating the memory-costly per-sample gradients: as can be seen in Algorithm 1 and Figure 3, $\partial\mathcal{L}_i/\partial W = a_i^\top \partial\mathcal{L}/\partial s_i$ is not computed throughout the training. In comparison to GhostClip, our significant improvement is solely on the speed (see Table 2), through two novel tricks: the book-keeping and the ghost differentiation. The entire BK algorithm is built on the understanding of the computation graph in Appendix A. Note that these tricks also offer improved efficiency for existing implementations, to be presented in Section 2.4. We now elaborate on these tricks.

BK (base) = ghost norm (from GhostClip) + book-keeping (ours) + ghost differentiation (ours)
Algorithm 1 Differentially private deep learning with BK

Parameters: $l$-th layer weights $W^{(l)}$, number of layers $L$, noise level $\sigma$, clipping threshold $R$.

1: for layer $l \leftarrow 1, 2, \cdots, L$ do
2:   Get activation tensor $\{a_{(l),i}\}$ by forward hook
3: for layer $l \leftarrow L, L-1, \cdots, 1$ do
4:   Get output gradient $\{\partial\mathcal{L}/\partial s_{(l),i}\}$ by backward hook
5:   Compute per-example gradient norm $\|\partial\mathcal{L}_i/\partial W^{(l)}\|_F^2$ by the ghost norm trick in Equation (2)
6: Aggregate gradient norms across layers: $\|\partial\mathcal{L}_i/\partial W\|_F^2 = \sum_l \|\partial\mathcal{L}_i/\partial W^{(l)}\|_F^2$
7: Compute clipping factor: $C_i = C(\|\partial\mathcal{L}_i/\partial W\|_F; R)$
8: for layer $l \leftarrow L, L-1, \cdots, 1$ do
9:   Compute sum of clipped gradients $G_{(l)} = a_{(l)}^\top \mathrm{diag}(C_1, C_2, \cdots)\, \partial\mathcal{L}/\partial s_{(l)}$
10:  Delete $\{a_{(l),i}\}, \{\partial\mathcal{L}/\partial s_{(l),i}\}$
11: Add Gaussian noise: $\hat{G} = G + \sigma R \cdot \mathcal{N}(0, \mathbf{I})$
12: Apply SGD/Adam/LAMB with the private gradient $\hat{G}$
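A condensed PyTorch rendering of Algorithm 1 for linear layers with inputs of shape (B, T, d) is sketched below; the hook functions, the `book` dictionary, and `private_step` are our own names, and the actual FastDP implementation handles many cases omitted here (bias terms, embeddings, convolutions, mixed clipping styles).

```python
import torch
import torch.nn as nn

book = {}   # book-kept activations and output gradients, keyed by layer

def forward_hook(layer, inputs, output):              # Algorithm 1, line 2
    book[layer] = {"a": inputs[0].detach()}

def backward_hook(layer, grad_input, grad_output):    # lines 4-5
    a, ds = book[layer]["a"], grad_output[0].detach()
    book[layer]["ds"] = ds
    # Ghost norm trick, Equation (2): ||dL_i/dW||_F^2 without forming the gradient.
    book[layer]["sq_norm"] = torch.einsum('bTp,btp,bTd,btd->b', ds, ds, a, a)

def attach(model):
    for m in model.modules():
        if isinstance(m, nn.Linear):
            m.register_forward_hook(forward_hook)
            m.register_full_backward_hook(backward_hook)

def private_step(model, sigma, R):
    layers = [m for m in model.modules() if isinstance(m, nn.Linear)]
    sq_norm = sum(book[l]["sq_norm"] for l in layers)           # line 6
    C = torch.clamp(R / (sq_norm.sqrt() + 1e-12), max=1.0)      # line 7, Abadi-style C(.; R)
    for l in layers:                                            # lines 8-10
        a, ds = book[l]["a"], book[l]["ds"]
        G = torch.einsum('b,btp,btd->pd', C, ds, a)             # sum_i C_i (dL/ds_i)^T a_i
        l.weight.grad = G + sigma * R * torch.randn_like(G)     # line 11
    book.clear()
```

Usage follows the standard loop: call `attach(model)` once, compute the summed loss, call `loss.backward()`, then `private_step(model, sigma, R)` and finally `optimizer.step()` (line 12). Per-sample gradients are never materialized; only the output gradients and activations are book-kept.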
[Ghost norm trick → memory improvement] The ghost norm trick (Goodfellow, 2015) computes the gradient norm without the gradient: while the gradient would be instantiated by the multiplication $a_i^\top \partial\mathcal{L}/\partial s_i$, the gradient norm in Equation (2) can be computed without $a_i$ ever meeting $\partial\mathcal{L}/\partial s_i$. This trick is applicable to generalized linear layers, including the linear, the embedding (Li et al., 2021), and the convolution layers (Bu et al., 2022a). We emphasize that these generalized linear layers represent 99.9% of the trainable parameters in modern neural networks.
Figure 3. Standard (non-DP), Opacus, FastGradClip, GhostClip, and BK implementations, from left to right. Notice that BK directly computes the clipped gradient like Opacus, computes the ghost norm like GhostClip, and uses auto-differentiation like FastGradClip.

We demonstrate this trick using a simple linear layer $s_i = a_i W$, where $W \in \mathbb{R}^{d\times p}$ is the weight matrix, $a \in \mathbb{R}^{B\times T\times d}$ is the mini-batch input of this layer (a.k.a. the activation tensor), and $s \in \mathbb{R}^{B\times T\times p}$ is the output. Given that the output gradient $\partial\mathcal{L}/\partial s$ is readily available in the back-propagation, for DP and standard training, one can directly derive the per-sample gradient norm

$$\Big\|\frac{\partial\mathcal{L}_i}{\partial W}\Big\|_{\mathrm{Frobenius}}^2 = \mathrm{vec}\Big(\frac{\partial\mathcal{L}}{\partial s_i}\frac{\partial\mathcal{L}}{\partial s_i}^\top\Big)\cdot \mathrm{vec}\big(a_i a_i^\top\big) \tag{2}$$

without actually computing $\partial\mathcal{L}_i/\partial W = a_i^\top \partial\mathcal{L}/\partial s_i$. Here 'vec' means flattening the $T\times T$ matrix into a vector. This trick is particularly efficient when $T$ is small, reducing the space complexity from $O(Bpd)$ to $O(BT^2)$ by Table 3.
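A small self-contained check of Equation (2), with arbitrary dimensions of our choosing: the squared Frobenius norm obtained from two $T\times T$ Gram matrices matches the norm of the explicitly instantiated per-sample gradient.

```python
import torch

B, T, d, p = 4, 7, 16, 24
a  = torch.randn(B, T, d)    # activations a_i, one (T, d) slice per sample
ds = torch.randn(B, T, p)    # output gradients dL/ds_i, one (T, p) slice per sample

# Naive route: instantiate per-sample gradients a_i^T (dL/ds_i), O(Bpd) memory.
naive = torch.einsum('btd,btp->bdp', a, ds).flatten(1).square().sum(1)

# Ghost norm, Equation (2): inner product of two vectorized T x T Gram matrices, O(BT^2) memory.
ghost = (torch.einsum('bik,bjk->bij', ds, ds).flatten(1)
         * torch.einsum('bik,bjk->bij', a, a).flatten(1)).sum(1)

assert torch.allclose(naive, ghost, rtol=1e-4, atol=1e-4)
```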
Figure 4. Backward propagation of the BK algorithm. Here $\mathcal{L} := \sum_i \mathcal{L}_i$ and $\hat{\mathcal{L}} := \sum_i C_i \mathcal{L}_i$.
[Book-keeping trick → speed improvement] This trick improves the time complexity by removing the second back-propagation from GhostClip. Our idea is to book-keep and re-use the output gradient $\partial\mathcal{L}/\partial s^{(l)}$, which is deleted after the first back-propagation of GhostClip and must be re-computed during the second back-propagation. The difference between GhostClip and BK is clearly illustrated via a line-by-line comparison in Appendix C.1. In fact, denoting the total number of model parameters as $M = \sum_l p^{(l)} d^{(l)}$, our trick reduces the time complexity from $10BTM + O(BT^2)$ of GhostClip to $8BTM + O(BT^2)$, according to Table 3. In contrast to Opacus, which book-keeps the per-sample gradients $g_i^{(l)}$ using $O(BM) = O(B\sum_l p^{(l)} d^{(l)})$ memory, we instead book-keep the output gradient with the substantially cheaper $O(BT\sum_l p^{(l)})$ memory when the feature dimension $T$ is small.
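The memory claim can be checked with simple arithmetic; the layer shapes below are illustrative transformer-like choices of ours, not taken from a specific model in the paper.

```python
# Floats book-kept per sample: Opacus keeps per-sample gradients, O(sum_l p*d);
# BK keeps output gradients, O(T * sum_l p). BK wins when the feature dimension T is small.
layers = [(768, 768), (768, 3072), (3072, 768)]          # illustrative (d, p) pairs

opacus_floats = sum(d * p for d, p in layers)            # independent of T
for T in (1, 20, 100, 1000):
    bk_floats = T * sum(p for _, p in layers)
    print(f"T={T:4d}: per-sample grads {opacus_floats:>9,} vs. output grads {bk_floats:>9,}")
```

For $T$ in the thousands the two choices become comparable, which is what motivates the layerwise mixing with Opacus mentioned in contribution 3 (Section 3.2).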
[Ghost differentiation trick → speed improvement] This trick improves the time complexity of the first back-propagation in GhostClip, further reducing it from $8BTM + O(BT^2)$ to $6BTM + O(BT^2)$ in Table 2. Our idea is to only compute the output gradient $\partial\mathcal{L}/\partial s^{(l)}$ but not the (non-private) parameter gradient $\partial\mathcal{L}/\partial W$. That is, we break the $4BTM$ time complexity of the full back-propagation into two sub-processes, each of $2BTM$ complexity, and remove the unnecessary one.

To be more specific, during the back-propagation of Opacus and GhostClip, the output gradient $\partial\mathcal{L}/\partial s$ and then the parameter gradient $\partial\mathcal{L}/\partial W = a^\top \partial\mathcal{L}/\partial s$ are computed. However, we can stop once we obtain $\partial\mathcal{L}/\partial s$: we only need the output gradient to compute the clipped parameter gradient $\sum_i C_i\, \partial\mathcal{L}_i/\partial W$ in Line 9 of Algorithm 1. Therefore, the ghost differentiation trick sets all parameters to not require gradients. See the technical details in Appendix D.2, including the origin parameter trick that propagates on a computation graph even when no parameters require gradients.
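An illustrative stand-in for the ghost differentiation trick is sketched below: all parameters are frozen so that back-propagation skips the $2BTM$ parameter-gradient work, while backward hooks still receive $\partial\mathcal{L}/\partial s$ because the input is made to require gradients. This input-based workaround merely plays the role of the origin parameter trick of Appendix D.2, whose actual mechanism we do not reproduce here.

```python
import torch
import torch.nn as nn

model = nn.Sequential(nn.Linear(32, 64), nn.ReLU(), nn.Linear(64, 10))
for param in model.parameters():
    param.requires_grad_(False)              # dL/dW is never computed

captured = {}
def grab_output_grad(layer, grad_input, grad_output):
    captured[layer] = grad_output[0].detach()          # book-keep dL/ds for this layer

for m in model:
    if isinstance(m, nn.Linear):
        m.register_full_backward_hook(grab_output_grad)

x = torch.randn(8, 32, requires_grad=True)   # keeps autograd propagating ("origin" of the backward pass)
loss = model(x).square().sum()
loss.backward()                              # only output/input gradients flow through the graph
print([tuple(g.shape) for g in captured.values()])     # e.g. [(8, 10), (8, 64)]
```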
2.2 Complexity of DP implementations: a modular analysis

In this section, we analyze the complexity of DP implementations from their operation modules. We summarize the time and space complexity in Table 3 and give the derivations in Appendix B. We will refer to these modules by indices, e.g. 2a for the computation of the output gradient.

Now we are ready to decompose each implementation, following the flowcharts in Figure 3. Consequently, we can