Differentially Private Bias-Term Fine-tuning of Foundation Models
Zhiqi Bu 1, Yu-xiang Wang 1 2, Sheng Zha 1, George Karypis 1
1 Amazon AI  2 University of California, San Diego. Correspondence to: Zhiqi Bu <zhiqibu@amazon.com>.
Proceedings of the 41st International Conference on Machine Learning, Vienna, Austria. PMLR 235, 2024. Copyright 2024 by the author(s).
Abstract

We study the problem of differentially private (DP) fine-tuning of large pre-trained models – a recent privacy-preserving approach suitable for solving downstream tasks with sensitive data. Existing work has demonstrated that high accuracy is possible under strong privacy constraints, yet requires significant computational overhead or modifications to the network architecture. We propose differentially private bias-term fine-tuning (DP-BiTFiT), which matches the state-of-the-art accuracy for DP algorithms and the efficiency of the standard BiTFiT. DP-BiTFiT is model agnostic (not modifying the network architecture), parameter efficient (only training about 0.1% of the parameters), and computation efficient (almost removing the overhead caused by DP, in both the time and space complexity). On a wide range of tasks, DP-BiTFiT is 2∼30× faster and uses 2∼8× less memory than DP full fine-tuning, even faster than the standard full fine-tuning. This amazing efficiency enables us to conduct DP fine-tuning on language and vision tasks with long-sequence texts and high-resolution images, which were computationally difficult using existing methods. We open-source our code at FastDP (https://github.com/awslabs/fast-differential-privacy).
1 Introduction
Fine-tuning large pre-trained neural networks is one of the most critical techniques in deep learning, yielding strong performance in a variety of domains (Pan & Yang, 2009; Kenton & Toutanova, 2019; Goyal et al., 2017). Among different methods, full fine-tuning is the most prevalent one, which trains all the model parameters on the downstream tasks and achieves high accuracy within a small number of training epochs. However, full fine-tuning on large models, from hundreds of millions (He et al., 2016; Chen et al., 2016) to billions of parameters (Brown et al., 2020), can be burdensome in terms of the computation and the deployment, since a full copy of fine-tuned model parameters is needed for each task.
To alleviate this issue, parameter efficient fine-tuning only trains a substantially small portion of the model parameters, in contrast to the full fine-tuning. At a high level, the parameter efficient fine-tuning methods can be divided into two categories. (1) Model-aware methods, meaning a relatively small number of parameters are introduced into the neural network architecture and only the new parameters are optimized. Examples include LoRA (Hu et al., 2021), Adapter (Houlsby et al., 2019), and Compacter (Mahabadi et al., 2021). (2) Model-agnostic methods, meaning that only a subset of existing parameters are trainable. Examples include training only the output linear layer (linear probing, (Kornblith et al., 2019)), only the layer normalization layer (Houlsby et al., 2019), and bias-term fine-tuning (BiTFiT) (Zaken et al., 2022). We illustrate the differences as follows, where $W_0, b_0$ are the pre-trained weights and biases, the hat $\hat{\cdot}$ indicates trainable parameters, and $\theta$ is the additional parameters:

$$f(x; W_0, b_0)\ \text{(pre-trained model)}, \quad f(x; \hat{W}, \hat{b})\ \text{(full fine-tuning)}, \quad f(x; W_0, b_0, \hat{\theta})\ \text{(model-aware)}, \quad f(x; W_0, \hat{b})\ \text{(bias-term only)}.$$
Empirically, these parameter efficient fine-tuning methods have achieved high accuracy that is comparable to full fine-tuning in the standard non-private setting. For instance, linear probing of ResNet (He et al., 2016) and Vision Transformer (ViT, (Dosovitskiy et al., 2020)) achieves 80% accuracy on the ImageNet dataset (Sun et al., 2017; Kornblith et al., 2019); LoRA and BiTFiT of RoBERTa (Liu et al., 2019) and BERT (Kenton & Toutanova, 2019) achieve about 94% on SST2 and on average 85% across the General Language Understanding Evaluation (GLUE) datasets (He et al., 2021; Hu et al., 2021). In addition, parameter efficient methods are faster than full fine-tuning and save the communication cost significantly in distributed learning.

Parallel to these developments, the success of deep learning models relies on the availability of large datasets, which may contain sensitive information to be protected rigorously.
Figure 1: Performance of different fine-tuning methods on the MNLI dataset with RoBERTa-large. DP-BiTFiT is one of the most accurate (marginally below DP LoRA), fastest (only slower than DP Adapter), and most memory-efficient (outperforming the others substantially, by 3×) DP methods.
This privacy issue is well-known, as neural networks can be vulnerable to privacy attacks: membership information can be leaked from the purchase records via Google and Amazon online services (Shokri et al., 2017); sensitive texts can be reconstructed by specifically designed prefixes on GPT2 (Carlini et al., 2021), and so can images in CIFAR10 and MNIST (Haim et al., 2022). To protect against such privacy risks, the standard technique is differential privacy (DP, formally stated in Definition 2.1), which randomizes the standard optimizers via the private gradient in Equation (1).

A recent line of work has extensively studied DP fine-tuning in both computer vision and language tasks, often achieving less than 3% accuracy drop across different settings via full fine-tuning (De et al., 2022; Li et al., 2021; Bu et al., 2022b;a), linear probing (Mehta et al., 2022), LoRA, Adapter, or Compacter (Yu et al., 2021a). In fact, fine-tuning or pre-training from a large dataset is considered necessary in the DP deep learning literature. For example, full fine-tuning DP-GPT2 only achieves 24.2 BLEU score (ϵ = 8) on the E2E dataset if randomly initialized (Li et al., 2021), in stark contrast to 63.2 BLEU if pre-trained; similarly, state-of-the-art (SOTA) DP accuracy on ImageNet is 48% (ϵ = 10) without pre-training (Kurakin et al., 2022) but 86.7% accuracy if pre-trained (De et al., 2022). Specifically, parameter efficient DP fine-tuning has empirically demonstrated strong accuracy (see our Table 3) with 3∼4× memory saving and 2∼3× speedup compared to DP full fine-tuning by Opacus (c.f. Figure 3 and Yu et al., 2021a, Table 3). Although previous works have shed light on various DP fine-tuning methods, we are the first to study DP-BiTFiT specifically and to show two distinctive advantages of it.
Firstly, DP-BiTFiT is model-agnostic and retains its parameter efficiency of around 0.1% across models (see Table 1). While linear probing is also model-agnostic, its parameter efficiency can be as high as 8% in ResNet50. Other methods like LoRA, Adapter and Compacter are architecture-dependent and possibly parameter inefficient, making them difficult to directly apply to arbitrary neural networks: LoRA and Adapter may need to train more than 12% of the parameters on BART-large (Lewis et al., 2020) to achieve high accuracy by He et al. (2021, Figure 1 & 4).

Secondly, DP-BiTFiT is computationally efficient, almost as much as the standard BiTFiT and significantly more efficient than DP full fine-tuning, particularly with large models and high-dimensional input data. For example, for DP full fine-tuning, Li et al. (2021) have reported 2∼4× slowdown on large language models for four advanced private codebases and up to 5× memory overhead, compared to the standard fine-tuning; even on small networks, 11 codebases across TensorFlow, JAX, and Pytorch have demonstrated 0.2∼5× slowdown and 3∼100× reduction in maximum batch size in Subramani et al. (2021). See more discussion in Section 3.3.
Algorithm 1 DP Bias-Term Fine-Tuning (BiTFiT)

Parameters: l-th layer's bias b_l, subsampling probability p, number of iterations T, number of layers L, noise scale σ, clipping threshold R, clipping factor C_i (if no clipping, then C_i = 1).

1:  for iteration t = 1, ..., T do
2:    Subsample a batch B_t ⊆ {1, ..., n} from the training set with probability p
3:    for layer l ∈ L, L−1, ..., 1 do
4:      Get the output gradient ∂L/∂s_l
5:      Compute the per-example gradient and its norm:
6:        ∂L_i/∂b_l = (∂L/∂s_{l,i})^⊤ 1  ⟹  ‖∂L_i/∂b_l‖²_F
7:    Aggregate the gradient norms across layers: ‖∂L_i/∂b‖²_F = Σ_l ‖∂L_i/∂b_l‖²_F
8:    Compute the clipping factor: C_i = C(‖∂L_i/∂b‖_F; R)
9:    Compute the sum of clipped gradients: G = Σ_i C_i ∂L_i/∂b
10:   Add Gaussian noise: G = G + σR · N(0, I)
11:   Descend on the bias terms with the noisy gradient G by SGD/Adam/...
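To make Algorithm 1 concrete, below is a minimal PyTorch sketch of a single DP-BiTFiT step on a toy classifier. It is illustrative only and not the optimized implementation described in this paper: the per-example bias gradients are obtained by a naive loop rather than by reusing the output gradients ∂L/∂s_l, and the model, loss, and hyperparameters are made up for the example.

```python
import torch
import torch.nn as nn

# Toy setup: freeze everything except the bias terms (BiTFiT).
model = nn.Sequential(nn.Linear(16, 32), nn.ReLU(), nn.Linear(32, 2))
for name, p in model.named_parameters():
    p.requires_grad_('bias' in name)
biases = [p for n, p in model.named_parameters() if 'bias' in n]

sigma, R, lr = 1.0, 0.1, 0.5                       # noise scale, clipping threshold, step size
x, y = torch.randn(8, 16), torch.randint(0, 2, (8,))
loss_fn = nn.CrossEntropyLoss()

# Per-example bias gradients (lines 3-6 of Algorithm 1), via a naive loop for clarity.
per_sample_grads = []
for i in range(x.shape[0]):
    loss_i = loss_fn(model(x[i:i + 1]), y[i:i + 1])
    per_sample_grads.append(torch.autograd.grad(loss_i, biases))

# Aggregate norms over layers, clip, and sum (lines 7-9).
summed = [torch.zeros_like(b) for b in biases]
for grads_i in per_sample_grads:
    norm_i = torch.sqrt(sum(g.norm() ** 2 for g in grads_i))
    C_i = torch.clamp(R / (norm_i + 1e-12), max=1.0)          # Abadi-style clipping factor
    for s, g in zip(summed, grads_i):
        s.add_(C_i * g)

# Add Gaussian noise and take one SGD step on the biases only (lines 10-11).
with torch.no_grad():
    for b, s in zip(biases, summed):
        noisy = s + sigma * R * torch.randn_like(s)
        b.add_(-lr * noisy / x.shape[0])
```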
Contributions. We develop DP-BiTFiT, a fine-tuning
method that is model-agnostic, accurate, privacy-preserving,
parameter efficient, and computationally efficient.
1. Algorithmically, we propose the Differentially Private Bias-Term Fine-Tuning (DP-BiTFiT) in Algorithm 1, which is highly accurate under the DP constraint, on par with SOTA in Section 4 and even outperforming fully fine-tuned GPT2-large.

2. DP-BiTFiT is model-agnostic¹ and only optimizes 0.1% of the model parameters on BERT, RoBERTa, GPT2, ViT, ResNet, and so on (see Table 1). Thus DP-BiTFiT is one of the most parameter efficient fine-tuning methods among DP LoRA, Adapter, last-layer, etc.

3. We design a computationally efficient implementation of DP-BiTFiT, whose time and space complexity is almost the same as the standard non-DP BiTFiT, while being faster than non-DP full fine-tuning and other DP fine-tuning (see Figure 1). This advantage is analyzed in Table 2, and demonstrated via the substantial speedup and memory saving in Figure 3 and Figure 4.

4. DP-BiTFiT is a unique algorithm in that its computation overhead is independent of the feature dimension T (see the red text in Table 2).² This is due to the activation-free forward pass that only happens in the no-weight training³, unlike LoRA. In Figure 1, although DP-BiTFiT optimizes a similar number of parameters to DP LoRA or Compacter, its memory efficiency is dominating. Therefore, DP-BiTFiT enjoys a special advantage on long-sequence texts and high-resolution images (see Figure 3).
Novelty. At a glance, our results may appear to be incremental, as we are merely adding differential privacy to an existing method (BiTFiT) through a standard mechanism (DP-SGD). This is not true! Computationally, our implementation of DP-BiTFiT is distinct from and orthogonal to existing DP algorithms such as GhostClip (Li et al., 2021)⁴, in that DP-BiTFiT exploits the special structures in the forward and backward passes (see the simplicity of the computation graph in Figure 2), hence removing the computational and memory overhead in DP-SGD (see the independence of T in Table 2), which is unavoidable in other methods.
¹ In Section 4, DP-BiTFiT is applicable to all model architectures tested, unlike LoRA (which mostly only applies to transformers) and last-layer training (which mostly only works on vision models).

² The computation overhead to get the per-sample weight gradient norm is linear (by instantiating per-sample gradients) or quadratic in T (if using the ghost norm trick (Goodfellow, 2015; Li et al., 2021)), for DP full fine-tuning and any other PEFT.

³ We distinguish the weight training and the bias training in Section 2 using the chain rules. Note that activation-free means memory-saving, which is not leveraged by DP full fine-tuning, LoRA, Adapter, Compacter, etc.

⁴ Ghost clipping (GhostClip) is an algebraic technique that only works on weight gradients because it manipulates the activation tensors at O(BT²) cost. This is too expensive for high-dimension features, hence not applicable to the bias gradients.
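For intuition on the O(BT²) cost mentioned in footnote 4, the following is a small PyTorch check of the ghost norm identity for a linear layer, a sketch of the trick from Goodfellow (2015) and Li et al. (2021) with toy shapes chosen for illustration: the per-sample weight gradient norms can be computed from two T×T Gram matrices without materializing the per-sample gradients, at a cost quadratic in T. This quadratic dependence on T is exactly what DP-BiTFiT avoids, since bias gradients never require the trick.

```python
import torch

B, T, d, p = 4, 16, 8, 8          # toy sizes
a = torch.randn(B, T, d)          # layer input (activation)
g = torch.randn(B, T, p)          # output gradient dL/ds

# Direct: instantiate per-sample weight gradients, O(B T p d) time and memory.
grad_W = torch.einsum('btp,btd->bpd', g, a)
norms_direct = grad_W.flatten(1).norm(dim=1) ** 2

# Ghost norm: never form grad_W; cost is O(B T^2 (p + d)), i.e. quadratic in T.
aa = torch.einsum('btd,bsd->bts', a, a)   # (B, T, T) Gram matrix of inputs
gg = torch.einsum('btp,bsp->bts', g, g)   # (B, T, T) Gram matrix of output grads
norms_ghost = (aa * gg).sum(dim=(1, 2))

print(torch.allclose(norms_direct, norms_ghost, atol=1e-4))  # True
```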
Our main contributions also include:

• The complexity analysis of DP parameter-efficient fine-tuning (PEFT) in Table 2 and Table 7. This was a missing piece in the previous DP and non-DP PEFT literature (including the BiTFiT paper) and is significantly helpful in determining the benefit of applying different PEFT methods. Specifically, we leverage the complexity analysis to rigorously show that the complexity saving of DP-BiTFiT is 50% compared to the full fine-tuning, and to reveal the unique benefit of DP-BiTFiT on high-dimension data.

• The engineering effort: at the time of writing this paper, none of the existing codebases, including GhostClip and Opacus, remove the forward hooks, because no analysis has established that only BiTFiT can be activation-free, not LoRA/Adapter/Compacter or full fine-tuning. Our algorithm enables DP-BiTFiT by one line of code.⁵

⁵ In Pytorch, DP-BiTFiT can be enabled within our codebase by [param.requires_grad_(False) for name, param in model.named_parameters() if 'bias' not in name].
2 Preliminaries
Fine-tuning methods. Fine-tuning, i.e. training a model on a large dataset for a sufficiently long time, and then continuing to train (or transferring) onto the downstream datasets, is the standard paradigm to achieve high accuracy in both the standard and the DP regimes. In DP deep learning, the pre-training takes place on a public dataset using regular optimizers like SGD, and the fine-tuning takes place on a private dataset which requires privacy protection, using DP optimizers like DP-SGD in Section 2.

In a long line of research, various fine-tuning methods have been proposed. One of the most popular methods is the full fine-tuning, which simply runs gradient descent on all trainable weights and biases, and thus can be inefficient when the model is large. To improve the efficiency, Li & Liang (2021) propose the prefix tuning that only optimizes the prompts or the input layer activation (Lester et al., 2021; Liu et al., 2021). However, as pointed out in Hu et al. (2021) and Li et al. (2021), the prefix tuning can be difficult to optimize and thus sub-optimal on large models. Another approach is to reduce the number of trainable parameters. For example, LoRA (Hu et al., 2021), Adapter (Houlsby et al., 2019; Rebuffi et al., 2017; Pfeiffer et al., 2021; Rücklé et al., 2021; Lin et al., 2020) and Compacter (Mahabadi et al., 2021) insert small 'adapter' layers (usually 1-10% of the total parameters) between existing layers, and only the newly added adapters are optimized. We describe the forms and complexity of LoRA and Adapter in Appendix C.
In addition to the aforementioned methods, BiTFiT is a special parameter-efficient method that rivals the full fine-tuning (Zaken et al., 2022; Cai et al., 2020; He et al., 2021). Firstly, BiTFiT optimizes a subset of the original parameters, namely the bias terms, which usually constitute less than 1/1000 of all parameters as demonstrated in Table 1. Therefore, BiTFiT can be readily deployed to any network in a model-agnostic manner. Secondly, BiTFiT is fundamentally different from other parameter efficient methods such as LoRA, since the bias gradients are computed differently than the weight gradients on the computation graph. We will elaborate on this in Equation (3).
Deep learning with differential privacy. We recall the classic (ϵ, δ)-DP, under which we train deep neural networks with provable privacy guarantees.

Definition 2.1 ((Dwork et al., 2006)). A randomized algorithm M is (ε, δ)-differentially private if, for any two neighboring datasets S, S′ that differ by one datapoint, and for any event E, we have P[M(S) ∈ E] ≤ e^ε P[M(S′) ∈ E] + δ.
In deep learning, DP can be achieved by applying an off-the-shelf optimizer (SGD or Adam) with a privately released stochastic gradient in place of the regular ∑_i g_i. The private stochastic gradient is computed by first drawing a minibatch I via Poisson sampling, then computing

$$\text{Private gradient:} \quad \sum_{i \in I} g_i \cdot C(\|g_i\|; R) + \sigma R \cdot \mathcal{N}(0, \mathbf{I}), \tag{1}$$

where C is any function⁶ ℝ⁺ → ℝ subject to C(x) ≤ R/x, g_i is the i-th per-sample gradient, R is the clipping threshold, and σ is the noise multiplier. The private gradient is guaranteed to be DP through the sampled Gaussian mechanism and the associated tight privacy accounting to compose over the iterations (see, e.g., Abadi et al., 2016; Wang et al., 2019; Mironov et al., 2019; Koskela et al., 2020; Bu et al., 2020; Gopi et al., 2021, and the references therein).
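As a reference point, here is a minimal sketch of Equation (1) on flattened per-sample gradients, using Abadi's clipping from footnote 6. The tensor shapes and the helper name private_gradient are illustrative, not part of the paper's codebase.

```python
import torch

def private_gradient(per_sample_grads: torch.Tensor, R: float, sigma: float) -> torch.Tensor:
    """Noisy sum of clipped per-sample gradients, as in Equation (1).

    per_sample_grads: shape (B, D), one flattened gradient g_i per example.
    """
    norms = per_sample_grads.norm(dim=1)                      # ||g_i||
    C = torch.clamp(R / (norms + 1e-12), max=1.0)             # Abadi's clipping min(R/||g_i||, 1)
    clipped_sum = (per_sample_grads * C[:, None]).sum(dim=0)  # sum_i C_i * g_i
    noise = sigma * R * torch.randn_like(clipped_sum)         # sigma * R * N(0, I)
    return clipped_sum + noise

# Example: 8 per-sample gradients of dimension 5.
g = torch.randn(8, 5)
print(private_gradient(g, R=1.0, sigma=1.0))
```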
Backward propagation. We briefly introduce the back-propagation, which reveals a simple yet important difference between the gradients of weights and those of biases. We consider a linear layer, indexed as the l-th layer, with weight W_l ∈ ℝ^{d×p} and bias b_l ∈ ℝ^p. We leave the derivation of other layers, such as normalization and convolution, to Appendix A.1. We denote the mini-batched input of this layer as a_l ∈ ℝ^{B×T×d} and the immediate output as s_l ∈ ℝ^{B×T×p}, where B is the batch size and T is the feature dimension⁷: a_{l+1} = ϕ(s_l), s_l = a_l W_l + b_l. Here ϕ is any non-parametric inter-layer operation, e.g. the non-linear activation (like ReLU), pooling, padding, and so on.

⁶ Examples of gradient clipping include but are not limited to Abadi's clipping min(R/‖g_i‖, 1) (Abadi et al., 2016) and automatic clipping (AUTO-S) R/(‖g_i‖ + 0.01) (Bu et al., 2022b; Yang et al., 2022).

⁷ In sequential data such as text, T is the sequence length; in vision data, T is the product of the input dimensions (e.g. for images, T is the product of height and width). We refer to a high-dimensional input when T is large.
We write $\mathcal{L} = \sum_{i=1}^n \mathcal{L}_i$ as the total loss and $\mathcal{L}_i$ as the per-sample loss of the $i$-th sample. During a standard back-propagation of $L$ layers, the chain rule keeps track of the output gradient at each layer in a just-in-time fashion:

$$\frac{\partial \mathcal{L}}{\partial s_l} = \frac{\partial \mathcal{L}}{\partial a_L}\frac{\partial a_L}{\partial s_{L-1}} \cdot \frac{\partial s_{L-1}}{\partial a_{L-1}} \circ \cdots \circ \frac{\partial a_{l+1}}{\partial s_l} = \frac{\partial \mathcal{L}}{\partial s_{l+1}} W_{l+1}^\top \circ \phi'(s_l). \tag{2}$$
Here $\circ$ is the Hadamard product and $\cdot$ is the matrix product. This output gradient $\partial\mathcal{L}/\partial s_l$ is used to compute the per-sample gradients of the weights and biases:

$$\frac{\partial \mathcal{L}_i}{\partial W_l} = \sum_j \frac{\partial \mathcal{L}_i}{\partial s_{l,j}}\frac{\partial s_{l,j}}{\partial W_l} = \frac{\partial \mathcal{L}}{\partial s_{l,i}}^\top a_{l,i}, \qquad \frac{\partial \mathcal{L}_i}{\partial b_l} = \sum_j \frac{\partial \mathcal{L}_i}{\partial s_{l,j}}\frac{\partial s_{l,j}}{\partial b_l} = \frac{\partial \mathcal{L}}{\partial s_{l,i}}^\top \mathbf{1}. \tag{3}$$
Notably, the weight gradient needs the activation tensor $a_l$ to compute an expensive $O(BTpd)$ tensor multiplication. Memory-wise, $\{a_l\}_l$ across all layers is very costly to store (taking more than 95% of the memory across VGG, ResNet, DenseNet, RoBERTa, etc. by Jain et al. (2020, Figure 3)). In sharp contrast, the computation of the bias gradient does not need $a_l$, and the multiplication with $\mathbf{1}$ in Equation (3) is actually a cheap $O(BTp)$ summation on $\partial\mathcal{L}/\partial s_l : \mathbb{R}^{B\times T\times p} \to \mathbb{R}^{B\times p}$.
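The contrast in Equation (3) can be checked numerically. Below is a small sketch with toy shapes and a made-up per-example loss, verifying against autograd that the weight gradient of a linear layer requires the activation a_l, while the bias gradient is simply the output gradient summed over the T dimension.

```python
import torch

B, T, d, p = 4, 128, 64, 32
a = torch.randn(B, T, d)                         # activation a_l
W = torch.randn(d, p, requires_grad=True)
b = torch.randn(p, requires_grad=True)
s = a @ W + b                                    # s_l = a_l W_l + b_l
loss_per_example = s.square().mean(dim=(1, 2))   # toy per-example losses L_i

# Autograd reference for the first example's weight and bias gradients.
gW, gb = torch.autograd.grad(loss_per_example[0], (W, b), retain_graph=True)

# Output gradient dL_0/ds for the first example, shape (T, p).
g_s = torch.autograd.grad(loss_per_example[0], s, retain_graph=True)[0][0]

# The weight gradient needs the activation a (an O(Tpd) contraction per example) ...
print(torch.allclose(gW, a[0].T @ g_s, atol=1e-5))    # True
# ... while the bias gradient is a sum over T (O(Tp)), with no activation needed.
print(torch.allclose(gb, g_s.sum(dim=0), atol=1e-5))  # True
```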
Forward propagation and the hook. During the forward propagation, all Pytorch-based codebases for DP algorithms, such as Private Transformers, Opacus, FastGradClip, Private-Vision, and others (Yu et al., 2021a; Bu et al., 2023), register forward hooks to extract the activation tensors $\{a_l\}_l$ of all layers from the computation graph, where $a_l$ is computed and stored. Hence, the majority of the memory burden is on the activations, which grow extremely large for huge models like GPT3 (Brown et al., 2020) with 175B parameters: the activation tensors consume more than 3600GB of memory while the parameters and gradients only consume 300GB (Rajbhandari et al., 2020). On one hand, this issue can be alleviated by the activation recomputation or checkpointing technique (Chen et al., 2016; Jain et al., 2020), whose memory cost reduces from $O(L)$ to $O(\sqrt{L})$ with an extra 33% slowdown. Alternatively, we note that the activation tensors are not necessary in the forward propagation, if we only optimize the bias terms.
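To illustrate the point, here is a rough sketch, not the actual Opacus or GhostClip code, of the kind of activation-caching forward hook that weight-gradient DP methods rely on in PyTorch, and of why DP-BiTFiT can simply skip registering it.

```python
import torch
import torch.nn as nn

model = nn.Sequential(nn.Linear(64, 64), nn.ReLU(), nn.Linear(64, 8))

# Roughly what hook-based DP libraries do for *weight* gradients: cache each
# layer's input a_l during the forward pass so it can be reused in the backward pass.
cached_activations = {}

def save_input(module, inputs, output):
    cached_activations[module] = inputs[0].detach()   # a_l is kept in memory

hooks = [m.register_forward_hook(save_input) for m in model if isinstance(m, nn.Linear)]
model(torch.randn(32, 64))
print(sum(v.numel() for v in cached_activations.values()))  # activation memory held

# DP-BiTFiT registers no forward hooks at all: the bias gradient in Equation (3)
# never touches a_l, so this memory is simply not allocated.
for h in hooks:
    h.remove()
cached_activations.clear()
```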
3 Differentially Private Bias-Term Fine-Tuning

Figure 2: Back-propagation for DP (red & black) and non-DP (black) algorithms. Note that the bias gradient uses a much simpler computation graph than the weight gradient, rendering DP-BiTFiT easy to implement and efficient to compute. Left: full fine-tuning with GhostClip (ghost clipping; (Goodfellow, 2015; Li et al., 2021; Bu et al., 2022a)). Upper right: full fine-tuning with Opacus (Yousefpour et al., 2021). Lower right: DP-BiTFiT.

We propose DP-BiTFiT to privately train only the bias terms in a neural network, by combining Equation (3) and Equation (1). We use shaded lines to represent the additional DP operations in Algorithm 1, and add the DP-related variables and operations in red in the computation graph in Figure 2.
Implementation-wise, DP-BiTFiT is different from all existing DP algorithms (including full, LoRA, Adapter, etc.) that optimize weights, since it does not apply a Pytorch forward hook to store the activation $a_l$ for all layers. We provide the implementation details of DP-BiTFiT in Appendix B.

To give a concrete example, we apply DP-BiTFiT to the RoBERTa-large model on the QQP dataset, following the same setting as Li et al. (2021) and using one 40GB A100 GPU. This is the most time-consuming text classification task in our work, taking 119 minutes per epoch for a training batch size of 20 using the fastest DP full fine-tuning implementation, GhostClip (Li et al., 2021). To conduct a simple ablation study: setting all weights to not require gradients (but with the forward hooks still operating) reduces the training time by 50% to 80 minutes; removing the forward hooks further reduces the training time by 30% to 63 minutes; finally, using the maximum batch size allowed by the memory-saving DP-BiTFiT reduces it to 43 minutes.
3.1 Parameter efficiency
DP-BiTFiT enjoys exactly the same parameter efficiency as the standard BiTFiT, training merely about 0.1% of the total parameters in large models. We demonstrate that DP-BiTFiT is one of the most parameter-efficient fine-tuning methods through a list of models in Table 1.
| Dataset  | Model             | # of params | % of bias |
|----------|-------------------|-------------|-----------|
| ImageNet | VGG16             | 138M        | 0.009     |
| ImageNet | ResNet18          | 11.7M       | 0.043     |
| ImageNet | ResNet50          | 25.6M       | 0.113     |
| ImageNet | ViT-small-patch16 | 21.7M       | 0.238     |
| ImageNet | ViT-base-patch16  | 85.8M       | 0.120     |
| ImageNet | ViT-large-patch16 | 303M        | 0.090     |
| E2E      | GPT2-small        | 124M        | 0.082     |
| E2E      | GPT2-medium       | 355M        | 0.076     |
| E2E      | GPT2-large        | 774M        | 0.066     |
| GLUE     | RoBERTa-base      | 125M        | 0.083     |
| GLUE     | RoBERTa-large     | 355M        | 0.077     |

Table 1: Parameter efficiency of (DP) BiTFiT. Extended results on more models are in Table 11.
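The '% of bias' column can be reproduced for any PyTorch model by counting the parameters whose name contains 'bias'. A small sketch is below; the toy transformer encoder is only for demonstration and will not match the exact numbers in Table 1, which require the actual checkpoints.

```python
import torch.nn as nn

def bias_fraction(model: nn.Module) -> float:
    """Percentage of parameters that are bias terms (the '% of bias' column of Table 1)."""
    total = sum(p.numel() for p in model.parameters())
    bias = sum(p.numel() for n, p in model.named_parameters() if 'bias' in n)
    return 100.0 * bias / total

# Toy check on a small transformer encoder.
encoder = nn.TransformerEncoder(nn.TransformerEncoderLayer(d_model=256, nhead=4), num_layers=4)
print(f"{bias_fraction(encoder):.3f}% of parameters are biases")
```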
An advantage of this parameter efficiency is reflected in the computation efficiency, given that most parameters do not require gradients to be computed: we show in Table 2 and Section 3.3 that DP-BiTFiT is much more efficient than full fine-tuning (DP and even non-DP). Additionally, the parameter efficiency also translates to the communication efficiency in distributed learning. For example, the 64-bit communication cost of DP full fine-tuning is 64MD, where M is the number of workers and D is the total number of parameters; this can be reduced by about 1000× with DP-BiTFiT.
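A back-of-the-envelope check of this claim, using the RoBERTa-large row of Table 1 (D = 355M parameters, 0.077% biases) and a hypothetical number of workers M:

```python
M = 8                        # number of workers (hypothetical)
D = 355_000_000              # total parameters (RoBERTa-large, Table 1)
bias_fraction = 0.00077      # 0.077% of parameters are biases (Table 1)

full_cost = 64 * M * D                      # 64-bit gradients, DP full fine-tuning
bitfit_cost = 64 * M * D * bias_fraction    # only bias gradients are communicated
print(full_cost / bitfit_cost)              # roughly 1300x less communication
```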