Differentially Private Optimization on Large Model at Small Cost
Zhiqi Bu¹  Yu-Xiang Wang²  Sheng Zha¹  George Karypis¹

¹Amazon Web Services  ²University of California, Santa Barbara. Correspondence to: Zhiqi Bu <zhiqibu@amazon.com>.

Proceedings of the 40th International Conference on Machine Learning, Honolulu, Hawaii, USA. PMLR 202, 2023. Copyright 2023 by the author(s).
Abstract
Differentially private (DP) optimization is the standard paradigm to learn large neural networks that are accurate and privacy-preserving. The computational cost of DP deep learning, however, is notoriously heavy due to the per-sample gradient clipping. Existing DP implementations are 2∼1000× more costly in time and space complexity than the standard (non-private) training. In this work, we develop a novel Book-Keeping (BK) technique that implements existing DP optimizers (thus achieving the same accuracy) with a substantial improvement in the computational cost. Specifically, BK enables DP training on large models and high dimensional data to be roughly as fast and memory-saving as the standard training, whereas previous DP algorithms can be inefficient or incapable of training due to memory errors. The computational advantage of BK is supported by a complexity analysis as well as extensive experiments on vision and language tasks. Our implementation achieves state-of-the-art (SOTA) accuracy with very small extra cost: on GPT2 and at almost the same memory cost (<1% overhead), BK has 1.03× the time complexity of the standard training (0.83× training speed in practice), and 0.61× the time complexity of the most efficient DP implementation (1.36× training speed in practice). We open-source the codebase for the BK algorithm in the FastDP library at https://github.com/awslabs/fast-differential-privacy.
1 Introduction
Deep learning with differential privacy (DP; Dwork et al., 2006) has shown strong performance while guaranteeing rigorous protection against privacy risks, especially on large models that tend to memorize and leak the training data (Carlini et al., 2021; Haim et al., 2022; Shokri et al., 2017).
For example, recent advances have shed light on the success of DP GPT2 (Li et al., 2021; Bu et al., 2022b; Yu et al., 2021), which achieves a 64.6 BLEU score¹ at a strong privacy guarantee (ϵ = 3) on the text generation task using the E2E restaurant review dataset. This is only marginally below the standard non-private GPT2 (BLEU score 66.8). Similarly, on computer vision tasks (ϵ = 2), DP vision transformers and ResNets have obtained 97.1%/86.2% accuracy on CIFAR10/100 by (Bu et al., 2022a) and over 81% accuracy on ImageNet by (De et al., 2022; Mehta et al., 2022).

¹BLEU (BiLingual Evaluation Understudy) is a metric (0-100) for automatically evaluating translated text. BLEU > 60 is considered "very high quality, adequate, and fluent translations, often better than human".
However, DP training of large neural networks is well-known to be computationally burdensome in comparison to the standard training, in terms of both the training time and the memory cost. For instance, training a small recurrent neural network (0.598M parameters) experiences a 1000× slowdown using DP optimizers in the Tensorflow-Privacy (TF-Privacy) library (Bu et al., 2021a), and training a small convolutional neural network (CNN, 0.605M parameters) on CIFAR10 has a 24× slowdown with Tensorflow 2 and the XLA compiler (Subramani et al., 2021). Even with SOTA efficient implementations, large models such as RoBERTa (Liu et al., 2019), GPT2 (Radford et al., 2019), ResNet (He et al., 2016), VGG (Simonyan & Zisserman, 2014), ViT (Dosovitskiy et al., 2020) and its variants experience about 2∼3× slowdown in Pytorch (Li et al., 2021; Bu et al., 2022a) and 2∼9× slowdown in JAX (Kurakin et al., 2022; De et al., 2022), with possibly 4∼20× memory overhead (Bu et al., 2022a; Li et al., 2021; Subramani et al., 2021), if not running out of memory.
Table 1. Efficiency of the BK algorithm on DP tasks using one A100 GPU (same accuracy). The speed is measured in wall-time (hardware speed) and in complexity (theoretical speed), relative to each baseline under the same memory constraint. More models and tasks can be found in Table 9.

| Dataset | SOTA setting | Model | Time/Epoch | vs. GhostClip | vs. Opacus | vs. non-DP |
|---|---|---|---|---|---|---|
| QQP | (Li et al., 2021) | RoBERTa-large (355M) | 70'04" | 1.36× | 1.96× | 0.77× (0.89×) |
| E2E | (Li et al., 2021) | GPT2-large (774M) | 10'01" | 1.36× | 4.41× | 0.83× (0.97×) |
| CIFAR | (Bu et al., 2022a) | BEiT-large (304M) | 6'35" | 1.33× | 38.3× | 0.76× (0.92×) |

The efficiency bottleneck in DP deep learning lies in the per-sample gradient clipping, which restricts the magnitude of each per-sample gradient in the mini-batch. Applying the clipping jointly with the Gaussian noise addition, one can privately release the gradient to arbitrary optimizers like SGD and Adam, and thus guarantee the privacy of the training as described in Section 1.3:

$$\text{private gradient: } \hat{G} := \textstyle\sum_i g_i \cdot C(\|g_i\|_2) + \sigma_{\mathrm{DP}} \cdot \mathcal{N}(0, \mathbf{I}),$$
$$\text{private optimizer (e.g. SGD): } W_{t+1} = W_t - \eta \hat{G}. \tag{1}$$

Here $W$ is the model parameters, $\mathcal{L}_i$ is the per-sample loss, $g_i = \partial\mathcal{L}_i/\partial W$ is the per-sample gradient, $\eta$ is the learning rate, $\sigma_{\mathrm{DP}}$ is the noise magnitude that defines the privacy loss, and $C(\|g_i\|)$, or simply $C_i$, is the per-sample clipping factor. For example, in (Abadi et al., 2016), $C_i = \min\{R/\|g_i\|, 1\}$ for some clipping threshold $R$; in (Bu et al., 2021b), $C_i = \mathbb{I}(\|g_i\| \le R)$; in (Bu et al., 2022b), $C_i = 1/(\|g_i\| + 0.01)$ or $1/\|g_i\|$ as the gradient normalization.
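To make Equation (1) concrete, the sketch below computes the private gradient from explicitly materialized per-sample gradients; this is the naive route that the rest of the paper avoids, and the function and argument names are ours for illustration only.

```python
import torch

def private_gradient(per_sample_grads, R=1.0, sigma_dp=1.0, clipping="abadi"):
    """Naive Equation (1): per_sample_grads has shape (B, D), one flattened g_i per row."""
    norms = per_sample_grads.norm(dim=1)                        # ||g_i||_2
    if clipping == "abadi":                                     # Abadi et al. (2016)
        C = torch.clamp(R / (norms + 1e-12), max=1.0)           # C_i = min{R/||g_i||, 1}
    else:                                                       # automatic clipping (Bu et al., 2022b)
        C = 1.0 / (norms + 0.01)                                # C_i = 1/(||g_i|| + 0.01)
    clipped_sum = (per_sample_grads * C.unsqueeze(1)).sum(0)    # sum_i C_i * g_i
    return clipped_sum + sigma_dp * torch.randn_like(clipped_sum)  # + sigma_DP * N(0, I)
```

Materializing `per_sample_grads` costs O(B·M) memory for M model parameters, which is precisely the bottleneck that the implementations discussed below try to remove; the optimizer step itself is unchanged, e.g. $W_{t+1} = W_t - \eta\hat{G}$ for SGD.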
At a high level, DP training is a system effort consisting of multiple parts:
I. optimizer: DP-SGD, DP-Adam, DP-LAMB;
II. parameter efficiency: last layer (linear probing), LoRA,
Adapter, BiTFiT;
III. implementation: Opacus, GhostClip, Book-Keeping;
IV. platform: Pytorch, JAX, TensorFlow.
Previous works have tackled the efficiency bottleneck with various approaches. One approach (part II) focuses on parameter efficiency by partially training a neural network, in contrast to fully fine-tuning all model parameters, e.g. only the last output layer (Tramer & Boneh, 2020), the adapter layers (Houlsby et al., 2019; Mahabadi et al., 2021), or the Low-Rank Adaptation (LoRA) (Hu et al., 2021; Yu et al., 2021). For example, (Mehta et al., 2022) accelerate the DP training on ImageNet (Deng et al., 2009) by up to 30× by only training the last layer of ResNet152. Notably, parameter efficient fine-tuning does not improve the efficiency in terms of complexity per parameter; it only reduces the number of parameters. Furthermore, this approach oftentimes leads to some accuracy degradation compared to DP full fine-tuning (Bu et al., 2020; Mehta et al., 2022; Li et al., 2021; Yu et al., 2021).
An orthogonal approach, including this work, focuses on
the computation efficiency (part III), i.e. reducing the time
and space complexity through efficient implementations,
without modifying the DP optimizers (part I) and thus not
affecting their performance. We will elaborate on existing
methods in Section 1.2. Additionally, these methods can be compiled on different platforms (part IV) such as Tensorflow 2 (XLA), JAX and Pytorch (Li et al., 2021; Subramani et al., 2021; De et al., 2022; Kurakin et al., 2022), where a remarkable speed difference has been observed in some cases, even with the same implementation. For example, (Subramani et al., 2021) implemented DP-SGD using JAX and claimed its efficiency advantage over the same algorithm using Tensorflow or Pytorch.
1.1 Contributions
1. [Algorithm] We propose the book-keeping (BK) algorithm that makes existing DP optimizers fast and memory efficient, in particular comparable to non-private optimizers. We demonstrate BK via the computation graph in Figure 1. The highlight is that BK only uses one back-propagation and never instantiates the per-sample gradients $\{\partial\mathcal{L}_i/\partial W\}_{i=1}^B$.

2. [Analysis] We analyze the complexity to show that BK has almost the same time and space complexity as non-DP training, especially when the feature dimension is small (see Table 5).

3. [Extension] We strengthen BK using a layerwise decision to mix with Opacus (see Section 3.2), which proves to be efficient when the feature dimension is large (and difficult for GhostClip). We also extend BK to parameter efficient fine-tuning such as DP LoRA and Adapter.

4. [Codebase] We develop a Pytorch (Paszke et al., 2019) codebase for our BK algorithm, leveraging the auto-differentiation technique on the computation graph and a new trick in Appendix D.2. We highlight that our codebase can automatically switch the standard training of any model to its DP version by adding a single piece of code (a minimal usage sketch follows this list).

5. [Experiments] We demonstrate the remarkable efficiency of BK on training large models, saving the memory by up to 10× and boosting the speed by 30%∼5× over previous DP implementations.
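As a rough illustration of contribution 4, the snippet below follows the Opacus-style `PrivacyEngine` pattern that the FastDP codebase adopts; the class and argument names are written from memory of the repository README and may differ from the actual API, so treat them as assumptions and consult the library documentation.

```python
import torch
import torch.nn as nn
from fastDP import PrivacyEngine   # assumed import path from the FastDP repository

model = nn.Sequential(nn.Flatten(), nn.Linear(784, 10))
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)

# Hypothetical arguments: logical batch size, dataset size, epochs, and privacy budget.
privacy_engine = PrivacyEngine(
    model,
    batch_size=256,
    sample_size=50_000,
    epochs=3,
    target_epsilon=2.0,
)
privacy_engine.attach(optimizer)

# After attaching, the usual loop (loss.backward(); optimizer.step()) applies the
# clipped and noised gradient of Equation (1) instead of the standard gradient.
```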
Figure 1. Forward pass and back-propagation of the $l$-th linear layer (standard training in black; DP training by our book-keeping algorithm added in red). Here $a^{(l)}$ is the activation tensor, $s^{(l)}$ is the layer output, $W^{(l)}, b^{(l)}$ are the weight and bias, and $\mathcal{L}_i, \mathcal{L}$ are the per-sample loss and the summed loss. The dotted arrow is an inter-layer operation such as pooling or normalization.
1.2 Related works
Previous works have developed different implementations of the same DP optimizer in Equation (1). Among these implementations, the tradeoff between time and space complexity has been constantly navigated. TF-Privacy (Tensorflow) back-propagates a vectorized loss $[\mathcal{L}_1, \cdots, \mathcal{L}_B]$ to compute the per-sample gradients, each from one back-propagation, which is memory-efficient but slow. Opacus (Yousefpour et al., 2021) and (Rochette et al., 2019) accelerate the training significantly using the outer product trick in (Goodfellow et al., 2014), though incurring a heavy memory burden to store the per-sample gradients. This memory burden is partially alleviated in FastGradClip (Lee & Kifer, 2020) by sharing the space complexity across two rounds of back-propagation, hence almost doubling the time complexity. In ghost clipping (Goodfellow, 2015; Li et al., 2021; Bu et al., 2022a), the per-sample gradients can be clipped without being instantiated, so both time and space complexity can be further improved when the feature dimension is small. We refer interested readers to Figure 3 and Appendix C for the algorithmic details of these implementations.
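For concreteness, here is a minimal sketch of the outer-product trick on a single linear layer $s = aW$; the tensor names are ours, and the point is that per-sample gradients are formed from quantities already present in back-propagation, at the price of O(Bpd) extra memory.

```python
import torch

B, T, d, p = 32, 10, 64, 128
a  = torch.randn(B, T, d)    # layer input (activation), cached during the forward pass
ds = torch.randn(B, T, p)    # output gradient dL/ds, available during back-propagation

# Outer-product trick: the per-sample gradient of W is a_i^T (dL/ds_i), shape (B, d, p).
per_sample_grads = torch.einsum('btd,btp->bdp', a, ds)
per_sample_norms = per_sample_grads.flatten(start_dim=1).norm(dim=1)   # ||dL_i/dW||_F
```

Storing `per_sample_grads` for every layer is exactly the memory overhead attributed to Opacus in Table 2.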
We now compare BK to different implementations in Table 2 and Figure 2. In what follows, $B$ is the batch size², $T^{(l)}$ is the feature dimension³, and $d^{(l)}, p^{(l)}$ are the input and output dimensions of a layer.
1.3 Preliminaries
We work with the (ϵ, δ)-DP of (Dwork et al., 2006), defined in Appendix A, which makes it difficult for any privacy attacker to distinguish or detect an arbitrary training sample, even with full access to the model. In deep learning, DP is achieved by training on the private gradient in Equation (1) with any optimizer such as SGD, Adam, FedAvg, etc. Essentially, the private gradient is the addition of Gaussian noise to the sum of clipped per-sample gradients, which guarantees the DP protection through the privacy accounting theorems (Abadi et al., 2016; Mironov, 2017; Dong et al., 2019; Zhu et al., 2021; Gopi et al., 2021; Koskela et al., 2020).
2 Book-keeping: Efficient DP training in low dimension
The main computational bottleneck of DP training comes
from the per-sample gradient clipping, or from the computa-
tion of per-sample gradient norms, to be exact. One widely
used approach, adopted by Opacus, TF-Privacy, and FastGradClip, is to instantiate the per-sample gradients and then derive their norms. A straightforward implementation of this approach on a mini-batch of per-sample losses requires $B$ rounds of back-propagation (an unacceptable slowdown) or $B\times$ gradient storage (an unacceptable memory burden; see Opacus in Figure 2). Consequently, these implementations are not suitable for large model training. For instance, (Li et al., 2021) shows that, when training GPT2-large (774M parameters), Opacus (Yousefpour et al., 2021) and JAX (Subramani et al., 2021) cannot fit even a single sample into a 24GB GPU.
²In this work, we report the physical batch size, which affects the efficiency but not the accuracy; the accuracy is only affected by the logical batch size, which can be implemented through gradient accumulation over physical batches.

³For non-sequential data, $T = 1$; for texts, $T$ is the sequence length, which is layer-independent; for images (or videos), $T^{(l)}$ is the height × width (× time) of the hidden feature representation, which is layer-dependent.
Table 2. Summary of different DP implementations on a linear/convolution layer $\mathbb{R}^{B\times T^{(l)}\times d^{(l)}} \to \mathbb{R}^{B\times T^{(l)}\times p^{(l)}}$.

| | Non-DP | TF-Privacy | Opacus | FastGradClip | GhostClip | BK (ours) |
|---|---|---|---|---|---|---|
| Instantiating per-sample grads | ✗ | ✓ | ✓ | ✓ | ✗ | ✗ |
| Storing every layer's grads | ✗ | ✗ | ✓ | ✗ | ✗ | ✗ |
| Instantiating non-DP grads | ✓ | ✗ | ✗ | ✗ | ✓ | ✗ |
| Number of back-propagations | 1 | B | 1 | 2 | 2 | 1 |
| Time complexity | 6BTpd | 6BTpd | 8BTpd | 8BTpd | 10BTpd + O(BT²) | 6BTpd + O(BT²) |
| Memory overhead to non-DP | 0 | 0 | Bpd | Bpd | 2BT² | min{2BT², Bpd} |
| Scalable to large models | ✓ | ✗ | ✗ | ✗ | ✓ | ✓ |
| Scalable to high-dimensional input | ✓ | ✓ | ✓ | ✓ | ✗ | ✓ |
Figure 2. Speed and memory on MLP and CIFAR100 (images are flattened into vectors). Left to right: deep network (50 layers, width 1000, 50M parameters, batch size 128), shallow network (10 layers, width 1000, 10M parameters, batch size 128), and wide network (10 layers, width 5000, 250M parameters, batch size 128 or 1024; Opacus is OOM). See more ablation studies in Appendix F.
An alternative approach, termed ghost clipping (GhostClip), directly computes the per-sample gradient norms without computing the gradients themselves. This is made possible, unfortunately, through two rounds of back-propagation. During the first back-propagation, one uses the regular loss $\sum_i \mathcal{L}_i$ and extracts the activation tensor and the output gradient $(a, \partial\mathcal{L}/\partial s)$. One can use an algebraic trick in Equation (2) to compute the per-sample gradient norms $\{\|\partial\mathcal{L}_i/\partial W\|\}_i$ and the clipping factors $\{C_i\}_i$ in Equation (1). During the second back-propagation, one uses the reweighted loss $\sum_i C_i \mathcal{L}_i$, whose gradient is exactly the weighted gradient $\sum_i C_i g_i$ that constitutes the private gradient we need. Note that this double back-propagation roughly doubles the training time (more precisely, $10/6 \approx 1.67\times$ when $T$ is small; this approach loses its advantage when $T$ is large), as shown in Table 2.
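The second back-propagation of GhostClip can be sketched as follows, assuming the per-sample losses are available as a vector whose autograd graph is still alive (e.g., the first backward pass used retain_graph=True), and that the per-sample norms have already been obtained from the ghost norm trick of Equation (2); the function name is ours.

```python
import torch

def ghostclip_second_pass(model, per_sample_losses, per_sample_norms, R=1.0):
    # Clipping factors from Equation (1), Abadi-style.
    C = torch.clamp(R / (per_sample_norms + 1e-12), max=1.0)
    model.zero_grad()
    # The gradient of the reweighted loss sum_i C_i * L_i is exactly sum_i C_i * g_i.
    (C.detach() * per_sample_losses).sum().backward()
    # Each parameter's .grad now holds the sum of clipped per-sample gradients,
    # ready for the Gaussian noise addition of Equation (1).
```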
To make the DP training as efficient as the standard training, we propose the book-keeping technique (BK) that (1) only requires a single round of back-propagation, like Opacus and the standard training, and (2) does not instantiate the per-sample gradients, like GhostClip.
2.1 Book-keeping algorithms
BK algorithms in their base forms are built on GhostClip, and especially on the ghost norm trick, so as to avoid instantiating the memory-costly per-sample gradients: as can be seen in Algorithm 1 and Figure 3, $\partial\mathcal{L}_i/\partial W = a_i^\top \partial\mathcal{L}/\partial s_i$ is not computed throughout the training. In comparison to GhostClip, our significant improvement is solely on the speed (see Table 2), through two novel tricks: the book-keeping and the ghost differentiation. The entire BK algorithm is built on the understanding of the computation graph in Appendix A. Note that these tricks also offer improved efficiency for existing implementations, to be presented in Section 2.4. We now elaborate on these tricks.

BK (base) = ghost norm (from GhostClip) + book-keeping (ours) + ghost differentiation (ours)
Algorithm 1 Differentially private deep learning with BK

Parameters: $l$-th layer weights $W^{(l)}$, number of layers $L$, noise level $\sigma$, clipping threshold $R$.

1: for layer $l \leftarrow 1, 2, \cdots, L$ do
2:   Get activation tensor $\{a_{(l),i}\}$ by forward hook
3: for layer $l \leftarrow L, L-1, \cdots, 1$ do
4:   Get output gradient $\{\partial\mathcal{L}/\partial s_{(l),i}\}$ by backward hook
5:   Compute per-example gradient norm $\|\partial\mathcal{L}_i/\partial W^{(l)}\|_F^2$ by the ghost norm trick in Equation (2)
6: Aggregate gradient norms across layers: $\|\partial\mathcal{L}_i/\partial W\|_F^2 = \sum_l \|\partial\mathcal{L}_i/\partial W^{(l)}\|_F^2$
7: Compute clipping factor: $C_i = C(\|\partial\mathcal{L}_i/\partial W\|_F; R)$
8: for layer $l \leftarrow L, L-1, \cdots, 1$ do
9:   Compute sum of clipped gradients $G_{(l)} = a_{(l)}^\top \mathrm{diag}(C_1, C_2, \cdots)\, \partial\mathcal{L}/\partial s_{(l)}$
10:  Delete $\{a_{(l),i}\}, \{\partial\mathcal{L}/\partial s_{(l),i}\}$
11: Add Gaussian noise: $\hat{G} = G + \sigma R \cdot \mathcal{N}(0, \mathbf{I})$
12: Apply SGD/Adam/LAMB with the private gradient $\hat{G}$
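A condensed PyTorch rendering of Algorithm 1 for linear layers with inputs of shape (B, T, d) is sketched below; the hook functions, the `book` dictionary, and `private_step` are our own names, and the actual FastDP implementation handles many cases omitted here (bias terms, embeddings, convolutions, mixed clipping styles).

```python
import torch
import torch.nn as nn

book = {}   # book-kept activations and output gradients, keyed by layer

def forward_hook(layer, inputs, output):              # Algorithm 1, line 2
    book[layer] = {"a": inputs[0].detach()}

def backward_hook(layer, grad_input, grad_output):    # lines 4-5
    a, ds = book[layer]["a"], grad_output[0].detach()
    book[layer]["ds"] = ds
    # Ghost norm trick, Equation (2): ||dL_i/dW||_F^2 without forming the gradient.
    book[layer]["sq_norm"] = torch.einsum('bTp,btp,bTd,btd->b', ds, ds, a, a)

def attach(model):
    for m in model.modules():
        if isinstance(m, nn.Linear):
            m.register_forward_hook(forward_hook)
            m.register_full_backward_hook(backward_hook)

def private_step(model, sigma, R):
    layers = [m for m in model.modules() if isinstance(m, nn.Linear)]
    sq_norm = sum(book[l]["sq_norm"] for l in layers)           # line 6
    C = torch.clamp(R / (sq_norm.sqrt() + 1e-12), max=1.0)      # line 7, Abadi-style C(.; R)
    for l in layers:                                            # lines 8-10
        a, ds = book[l]["a"], book[l]["ds"]
        G = torch.einsum('b,btp,btd->pd', C, ds, a)             # sum_i C_i (dL/ds_i)^T a_i
        l.weight.grad = G + sigma * R * torch.randn_like(G)     # line 11
    book.clear()
```

Usage follows the standard loop: call `attach(model)` once, compute the summed loss, call `loss.backward()`, then `private_step(model, sigma, R)` and finally `optimizer.step()` (line 12). Per-sample gradients are never materialized; only the output gradients and activations are book-kept.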
[Ghost norm trick → memory improvement] The ghost norm trick (Goodfellow, 2015) computes the gradient norm without the gradient: while the gradient would be instantiated by the multiplication $a_i^\top \partial\mathcal{L}/\partial s_i$, the gradient norm in Equation (2) can be computed without $a_i$ ever meeting $\partial\mathcal{L}/\partial s_i$. This trick is applicable to generalized linear layers, including the linear, the embedding (Li et al., 2021), and the convolution layers (Bu et al., 2022a). We emphasize that these generalized linear layers represent 99.9% of the trainable parameters in modern neural networks.
Figure 3. Standard (non-DP), Opacus, FastGradClip, GhostClip, and BK implementations, from left to right. Notice that BK directly computes the clipped gradient like Opacus, computes the ghost norm like GhostClip, and uses auto-differentiation like FastGradClip.

We demonstrate this trick using a simple linear layer $s_i = a_i W$, where $W \in \mathbb{R}^{d\times p}$ is the weight matrix, $a \in \mathbb{R}^{B\times T\times d}$ is the mini-batch input of this layer (a.k.a. the activation tensor), and $s \in \mathbb{R}^{B\times T\times p}$ is the output. Given that the output gradient $\partial\mathcal{L}/\partial s$ is readily available in the back-propagation, for DP and standard training, one can directly derive the per-sample gradient norm

$$\Big\|\frac{\partial\mathcal{L}_i}{\partial W}\Big\|_{\mathrm{Frobenius}}^2 = \mathrm{vec}\Big(\frac{\partial\mathcal{L}}{\partial s_i}\frac{\partial\mathcal{L}}{\partial s_i}^\top\Big)\cdot \mathrm{vec}\big(a_i a_i^\top\big) \tag{2}$$

without actually computing $\partial\mathcal{L}_i/\partial W = a_i^\top \partial\mathcal{L}/\partial s_i$. Here 'vec' means flattening the $T\times T$ matrix into a vector. This trick is particularly efficient when $T$ is small, reducing the space complexity from $O(Bpd)$ to $O(BT^2)$ by Table 3.
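A small self-contained check of Equation (2), with arbitrary dimensions of our choosing: the squared Frobenius norm obtained from two $T\times T$ Gram matrices matches the norm of the explicitly instantiated per-sample gradient.

```python
import torch

B, T, d, p = 4, 7, 16, 24
a  = torch.randn(B, T, d)    # activations a_i, one (T, d) slice per sample
ds = torch.randn(B, T, p)    # output gradients dL/ds_i, one (T, p) slice per sample

# Naive route: instantiate per-sample gradients a_i^T (dL/ds_i), O(Bpd) memory.
naive = torch.einsum('btd,btp->bdp', a, ds).flatten(1).square().sum(1)

# Ghost norm, Equation (2): inner product of two vectorized T x T Gram matrices, O(BT^2) memory.
ghost = (torch.einsum('bik,bjk->bij', ds, ds).flatten(1)
         * torch.einsum('bik,bjk->bij', a, a).flatten(1)).sum(1)

assert torch.allclose(naive, ghost, rtol=1e-4, atol=1e-4)
```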
Figure 4. Backward propagation of the BK algorithm. Here $\mathcal{L} := \sum_i \mathcal{L}_i$ and $\hat{\mathcal{L}} := \sum_i C_i \mathcal{L}_i$.
[Book-keeping trick → speed improvement] This trick improves the time complexity by removing the second back-propagation from GhostClip. Our idea is to book-keep and re-use the output gradient $\partial\mathcal{L}/\partial s^{(l)}$, which is deleted after the first back-propagation of GhostClip and must be re-computed during the second back-propagation. The difference between GhostClip and BK is clearly illustrated via a line-by-line comparison in Appendix C.1. In fact, denoting the total number of model parameters as $M = \sum_l p^{(l)} d^{(l)}$, our trick reduces the time complexity from $10BTM + O(BT^2)$ of GhostClip to $8BTM + O(BT^2)$, according to Table 3. In contrast to Opacus, which book-keeps the per-sample gradients $g_i^{(l)}$ using $O(BM) = O(B\sum_l p^{(l)} d^{(l)})$ memory, we instead book-keep the output gradient with the substantially cheaper $O(BT\sum_l p^{(l)})$ memory when the feature dimension $T$ is small.
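The memory claim can be checked with simple arithmetic; the layer shapes below are illustrative transformer-like choices of ours, not taken from a specific model in the paper.

```python
# Floats book-kept per sample: Opacus keeps per-sample gradients, O(sum_l p*d);
# BK keeps output gradients, O(T * sum_l p). BK wins when the feature dimension T is small.
layers = [(768, 768), (768, 3072), (3072, 768)]          # illustrative (d, p) pairs

opacus_floats = sum(d * p for d, p in layers)            # independent of T
for T in (1, 20, 100, 1000):
    bk_floats = T * sum(p for _, p in layers)
    print(f"T={T:4d}: per-sample grads {opacus_floats:>9,} vs. output grads {bk_floats:>9,}")
```

For $T$ in the thousands the two choices become comparable, which is what motivates the layerwise mixing with Opacus mentioned in contribution 3 (Section 3.2).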
[Ghost differentiation trick → speed improvement] This trick improves the time complexity of the first back-propagation in GhostClip, further reducing it from $8BTM + O(BT^2)$ to $6BTM + O(BT^2)$ in Table 2. Our idea is to only compute the output gradient $\partial\mathcal{L}/\partial s^{(l)}$ but not the (non-private) parameter gradient $\partial\mathcal{L}/\partial W$. That is, we break the $4BTM$ time complexity of the full back-propagation into two sub-processes, each of $2BTM$ complexity, and remove the unnecessary one.

To be more specific, during the back-propagation of Opacus and GhostClip, the output gradient $\partial\mathcal{L}/\partial s$ and then the parameter gradient $\partial\mathcal{L}/\partial W = a^\top \partial\mathcal{L}/\partial s$ are computed. However, we can stop once we obtain $\partial\mathcal{L}/\partial s$: we only need the output gradient to compute the clipped parameter gradient $\sum_i C_i\, \partial\mathcal{L}_i/\partial W$ in Line 9 of Algorithm 1. Therefore, the ghost differentiation trick sets all parameters to not require gradients. See the technical details in Appendix D.2, including the origin parameter trick that propagates on a computation graph even when no parameters require gradients.
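An illustrative stand-in for the ghost differentiation trick is sketched below: all parameters are frozen so that back-propagation skips the $2BTM$ parameter-gradient work, while backward hooks still receive $\partial\mathcal{L}/\partial s$ because the input is made to require gradients. This input-based workaround merely plays the role of the origin parameter trick of Appendix D.2, whose actual mechanism we do not reproduce here.

```python
import torch
import torch.nn as nn

model = nn.Sequential(nn.Linear(32, 64), nn.ReLU(), nn.Linear(64, 10))
for param in model.parameters():
    param.requires_grad_(False)              # dL/dW is never computed

captured = {}
def grab_output_grad(layer, grad_input, grad_output):
    captured[layer] = grad_output[0].detach()          # book-keep dL/ds for this layer

for m in model:
    if isinstance(m, nn.Linear):
        m.register_full_backward_hook(grab_output_grad)

x = torch.randn(8, 32, requires_grad=True)   # keeps autograd propagating ("origin" of the backward pass)
loss = model(x).square().sum()
loss.backward()                              # only output/input gradients flow through the graph
print([tuple(g.shape) for g in captured.values()])     # e.g. [(8, 10), (8, 64)]
```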
2.2 Complexity of DP implementations: a modular analysis

In this section, we analyze the complexity of DP implementations from their operation modules. We summarize the time and space complexity in Table 3 and give the derivations in Appendix B. We will refer to these modules by indices, e.g. 2a for the computation of the output gradient.

Now we are ready to decompose each implementation, following the flowcharts in Figure 3. Consequently, we can