Differentially Private Bias-Term Fine-tuning of Foundation Models
Zhiqi Bu 1, Yu-xiang Wang 1 2, Sheng Zha 1, George Karypis 1
1 Amazon AI  2 University of California, San Diego. Correspondence to: Zhiqi Bu <zhiqibu@amazon.com>.
Proceedings of the 41st International Conference on Machine Learning, Vienna, Austria. PMLR 235, 2024. Copyright 2024 by the author(s).
Abstract

We study the problem of differentially private (DP) fine-tuning of large pre-trained models – a recent privacy-preserving approach suitable for solving downstream tasks with sensitive data. Existing work has demonstrated that high accuracy is possible under strong privacy constraints, yet requires significant computational overhead or modifications to the network architecture. We propose differentially private bias-term fine-tuning (DP-BiTFiT), which matches the state-of-the-art accuracy for DP algorithms and the efficiency of the standard BiTFiT. DP-BiTFiT is model agnostic (not modifying the network architecture), parameter efficient (only training about 0.1% of the parameters), and computation efficient (almost removing the overhead caused by DP, in both the time and space complexity). On a wide range of tasks, DP-BiTFiT is 2∼30× faster and uses 2∼8× less memory than DP full fine-tuning, even faster than the standard full fine-tuning. This amazing efficiency enables us to conduct DP fine-tuning on language and vision tasks with long-sequence texts and high-resolution images, which were computationally difficult using existing methods. We open-source our code at FastDP (https://github.com/awslabs/fast-differential-privacy).
1 Introduction
Fine-tuning large pre-trained neural networks is one of the most critical techniques in deep learning, yielding strong performance in a variety of domains (Pan & Yang, 2009; Kenton & Toutanova, 2019; Goyal et al., 2017). Among different methods, full fine-tuning is the most prevalent one, which trains all the model parameters on the downstream tasks and achieves high accuracy within a small number of training epochs. However, full fine-tuning on large models, from hundreds of millions (He et al., 2016; Chen et al., 2016) to billions of parameters (Brown et al., 2020), can be burdensome in terms of the computation and the deployment, since a full copy of fine-tuned model parameters is needed for each task.
To alleviate this issue, parameter efficient fine-tuning only trains a substantially small portion of the model parameters, in contrast to the full fine-tuning. At a high level, the parameter efficient fine-tuning methods can be divided into two categories. (1) Model-aware methods, meaning a relatively small number of parameters are introduced into the neural network architecture and only the new parameters are optimized. Examples include LoRA (Hu et al., 2021), Adapter (Houlsby et al., 2019), and Compacter (Mahabadi et al., 2021). (2) Model-agnostic methods, meaning that only a subset of existing parameters are trainable. Examples include training only the output linear layer (linear probing, (Kornblith et al., 2019)), only the layer normalization layer (Houlsby et al., 2019), and bias-term fine-tuning (BiTFiT) (Zaken et al., 2022). We illustrate the differences as follows, where $W_0, b_0$ are the pre-trained weights and biases, the hat $\hat{\cdot}$ indicates trainable parameters, and $\theta$ is the additional parameters:

$$f(x; W_0, b_0)\ \text{(pre-trained model)}, \quad f(x; \hat{W}, \hat{b})\ \text{(full fine-tuning)}, \quad f(x; W_0, b_0, \hat{\theta})\ \text{(model-aware)}, \quad f(x; W_0, \hat{b})\ \text{(bias-term only)}.$$
Empirically, these parameter efficient fine-tuning methods have achieved high accuracy that is comparable to full fine-tuning in the standard non-private setting. For instance, linear probing of ResNet (He et al., 2016) and Vision Transformer (ViT, (Dosovitskiy et al., 2020)) achieves 80% accuracy on the ImageNet dataset (Sun et al., 2017; Kornblith et al., 2019); LoRA and BiTFiT of RoBERTa (Liu et al., 2019) and BERT (Kenton & Toutanova, 2019) achieve about 94% on SST2 and on average 85% across the General Language Understanding Evaluation (GLUE) datasets (He et al., 2021; Hu et al., 2021). In addition, parameter efficient methods are faster than full fine-tuning and save the communication cost significantly in distributed learning.

Parallel to these developments, the success of deep learning models relies on the availability of large datasets, which may contain sensitive information to be protected rigorously.
Figure 1: Performance of different fine-tuning methods on the MNLI dataset with RoBERTa-large. DP-BiTFiT is one of the most accurate (marginally below DP LoRA), fastest (only slower than DP Adapter), and most memory-efficient (outperforming the others substantially, by 3×) DP methods.
This privacy issue is well-known, as neural networks can be vulnerable to privacy attacks: membership information can be leaked from the purchase records via Google and Amazon online services (Shokri et al., 2017); sensitive texts can be reconstructed by specifically designed prefixes on GPT2 (Carlini et al., 2021), and so can images in CIFAR10 and MNIST (Haim et al., 2022). To protect against such privacy risks, the standard technique is differential privacy (DP, formally stated in Definition 2.1), which randomizes the standard optimizers via the private gradient in Equation (1).

A recent line of work has extensively studied DP fine-tuning in both computer vision and language tasks, often achieving less than 3% accuracy drop across different settings via full fine-tuning (De et al., 2022; Li et al., 2021; Bu et al., 2022b;a), linear probing (Mehta et al., 2022), LoRA, Adapter, or Compacter (Yu et al., 2021a). In fact, fine-tuning or pre-training from a large dataset is considered necessary in the DP deep learning literature. For example, full fine-tuning DP-GPT2 only achieves 24.2 BLEU score (ϵ = 8) on the E2E dataset if randomly initialized (Li et al., 2021), in stark contrast to 63.2 BLEU if pre-trained; similarly, state-of-the-art (SOTA) DP accuracy on ImageNet is 48% (ϵ = 10) without pre-training (Kurakin et al., 2022) but 86.7% accuracy if pre-trained (De et al., 2022). Specifically, parameter efficient DP fine-tuning has empirically demonstrated strong accuracy (see our Table 3) with 3∼4× memory saving and 2∼3× speedup compared to DP full fine-tuning by Opacus (c.f. Figure 3 and Yu et al., 2021a, Table 3). Although previous works have shed light on various DP fine-tuning methods, we are the first to study DP-BiTFiT specifically and to show two distinctive advantages of it.
Firstly, DP-BiTFiT is model-agnostic and retains its parameter efficiency of around 0.1% across models (see Table 1). While linear probing is also model-agnostic, its parameter efficiency can be as high as 8% in ResNet50. Other methods like LoRA, Adapter and Compacter are architecture-dependent and possibly parameter inefficient, making them difficult to directly apply to arbitrary neural networks: LoRA and Adapter may need to train more than 12% of the parameters on BART-large (Lewis et al., 2020) to achieve high accuracy by He et al. (2021, Figure 1 & 4).

Secondly, DP-BiTFiT is computationally efficient, almost as much as the standard BiTFiT and significantly more efficient than DP full fine-tuning, particularly with large models and high-dimensional input data. For example, for DP full fine-tuning, Li et al. (2021) have reported 2∼4× slowdown on large language models for four advanced private codebases and up to 5× memory overhead, compared to the standard fine-tuning; even on small networks, 11 codebases across TensorFlow, JAX, and Pytorch have demonstrated 0.2∼5× slowdown and 3∼100× reduction in maximum batch size in Subramani et al. (2021). See more discussion in Section 3.3.
Algorithm 1 DP Bias-Term Fine-Tuning (BiTFiT)

Parameters: l-th layer's bias b_l, subsampling probability p, number of iterations T, number of layers L, noise scale σ, clipping threshold R, clipping factor C_i (if no clipping, then C_i = 1).

1:  for iteration t = 1, ..., T do
2:    Subsample a batch B_t ⊆ {1, ..., n} from the training set with probability p
3:    for layer l ∈ L, L−1, ..., 1 do
4:      Get the output gradient ∂L/∂s_l
5:      Compute the per-example gradient and its norm:
6:        ∂L_i/∂b_l = (∂L/∂s_{l,i})^⊤ 1  ⟹  ‖∂L_i/∂b_l‖²_F
7:    Aggregate the gradient norms across layers: ‖∂L_i/∂b‖²_F = Σ_l ‖∂L_i/∂b_l‖²_F
8:    Compute the clipping factor: C_i = C(‖∂L_i/∂b‖_F; R)
9:    Compute the sum of clipped gradients: G = Σ_i C_i ∂L_i/∂b
10:   Add Gaussian noise: G = G + σR · N(0, I)
11:   Descend on the bias terms with the noisy gradient G by SGD/Adam/...
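To make Algorithm 1 concrete, below is a minimal PyTorch sketch of a single DP-BiTFiT step on a toy classifier. It is illustrative only and not the optimized implementation described in this paper: the per-example bias gradients are obtained by a naive loop rather than by reusing the output gradients ∂L/∂s_l, and the model, loss, and hyperparameters are made up for the example.

```python
import torch
import torch.nn as nn

# Toy setup: freeze everything except the bias terms (BiTFiT).
model = nn.Sequential(nn.Linear(16, 32), nn.ReLU(), nn.Linear(32, 2))
for name, p in model.named_parameters():
    p.requires_grad_('bias' in name)
biases = [p for n, p in model.named_parameters() if 'bias' in n]

sigma, R, lr = 1.0, 0.1, 0.5                       # noise scale, clipping threshold, step size
x, y = torch.randn(8, 16), torch.randint(0, 2, (8,))
loss_fn = nn.CrossEntropyLoss()

# Per-example bias gradients (lines 3-6 of Algorithm 1), via a naive loop for clarity.
per_sample_grads = []
for i in range(x.shape[0]):
    loss_i = loss_fn(model(x[i:i + 1]), y[i:i + 1])
    per_sample_grads.append(torch.autograd.grad(loss_i, biases))

# Aggregate norms over layers, clip, and sum (lines 7-9).
summed = [torch.zeros_like(b) for b in biases]
for grads_i in per_sample_grads:
    norm_i = torch.sqrt(sum(g.norm() ** 2 for g in grads_i))
    C_i = torch.clamp(R / (norm_i + 1e-12), max=1.0)          # Abadi-style clipping factor
    for s, g in zip(summed, grads_i):
        s.add_(C_i * g)

# Add Gaussian noise and take one SGD step on the biases only (lines 10-11).
with torch.no_grad():
    for b, s in zip(biases, summed):
        noisy = s + sigma * R * torch.randn_like(s)
        b.add_(-lr * noisy / x.shape[0])
```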
Contributions. We develop DP-BiTFiT, a fine-tuning
method that is model-agnostic, accurate, privacy-preserving,
parameter efficient, and computationally efficient.
1. Algorithmically, we propose the Differentially Private Bias-Term Fine-Tuning (DP-BiTFiT) in Algorithm 1, which is highly accurate under the DP constraint, on par with SOTA in Section 4 and even outperforming fully fine-tuned GPT2-large.

2. DP-BiTFiT is model-agnostic¹ and only optimizes 0.1% of the model parameters on BERT, RoBERTa, GPT2, ViT, ResNet, and so on (see Table 1). Thus DP-BiTFiT is one of the most parameter efficient fine-tuning methods among DP LoRA, Adapter, last-layer, etc.

3. We design a computationally efficient implementation of DP-BiTFiT, whose time and space complexity is almost the same as the standard non-DP BiTFiT, while being faster than non-DP full fine-tuning and other DP fine-tuning (see Figure 1). This advantage is analyzed in Table 2, and demonstrated via the substantial speedup and memory saving in Figure 3 and Figure 4.

4. DP-BiTFiT is a unique algorithm in that its computation overhead is independent of the feature dimension T (see the red text in Table 2).² This is due to the activation-free forward pass that only happens in the no-weight training³, unlike LoRA. In Figure 1, although DP-BiTFiT optimizes a similar number of parameters to DP LoRA or Compacter, its memory efficiency is dominating. Therefore, DP-BiTFiT enjoys a special advantage on long-sequence texts and high-resolution images (see Figure 3).
Novelty. At a glance, our results may appear to be incremental, as we are merely adding differential privacy to an existing method (BiTFiT) through a standard mechanism (DP-SGD). This is not true! Computationally, our implementation of DP-BiTFiT is distinct from and orthogonal to existing DP algorithms such as GhostClip (Li et al., 2021)⁴, in that DP-BiTFiT exploits the special structures in the forward and backward passes (see the simplicity of the computation graph in Figure 2), hence removing the computational and memory overhead in DP-SGD (see the independence of T in Table 2), which is unavoidable in other methods.
¹ In Section 4, DP-BiTFiT is applicable to all model architectures tested, unlike LoRA (which mostly only applies to transformers) and last-layer training (which mostly only works on vision models).

² The computation overhead to get the per-sample weight gradient norm is linear (by instantiating per-sample gradients) or quadratic in T (if using the ghost norm trick (Goodfellow, 2015; Li et al., 2021)), for DP full fine-tuning and any other PEFT.

³ We distinguish the weight training and the bias training in Section 2 using the chain rules. Note that activation-free means memory-saving, which is not leveraged by DP full fine-tuning, LoRA, Adapter, Compacter, etc.

⁴ Ghost clipping (GhostClip) is an algebraic technique that only works on weight gradients because it manipulates the activation tensors at O(BT²) cost. This is too expensive for high-dimension features, hence not applicable to the bias gradients.
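For intuition on the O(BT²) cost mentioned in footnote 4, the following is a small PyTorch check of the ghost norm identity for a linear layer, a sketch of the trick from Goodfellow (2015) and Li et al. (2021) with toy shapes chosen for illustration: the per-sample weight gradient norms can be computed from two T×T Gram matrices without materializing the per-sample gradients, at a cost quadratic in T. This quadratic dependence on T is exactly what DP-BiTFiT avoids, since bias gradients never require the trick.

```python
import torch

B, T, d, p = 4, 16, 8, 8          # toy sizes
a = torch.randn(B, T, d)          # layer input (activation)
g = torch.randn(B, T, p)          # output gradient dL/ds

# Direct: instantiate per-sample weight gradients, O(B T p d) time and memory.
grad_W = torch.einsum('btp,btd->bpd', g, a)
norms_direct = grad_W.flatten(1).norm(dim=1) ** 2

# Ghost norm: never form grad_W; cost is O(B T^2 (p + d)), i.e. quadratic in T.
aa = torch.einsum('btd,bsd->bts', a, a)   # (B, T, T) Gram matrix of inputs
gg = torch.einsum('btp,bsp->bts', g, g)   # (B, T, T) Gram matrix of output grads
norms_ghost = (aa * gg).sum(dim=(1, 2))

print(torch.allclose(norms_direct, norms_ghost, atol=1e-4))  # True
```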
Our main contributions also include:

• The complexity analysis of DP parameter-efficient fine-tuning (PEFT) in Table 2 and Table 7. This was a missing piece in the previous DP and non-DP PEFT literature (including the BiTFiT paper) and is significantly helpful in determining the benefit of applying different PEFT methods. Specifically, we leverage the complexity analysis to rigorously show that the complexity saving of DP-BiTFiT is 50% compared to the full fine-tuning, and to reveal the unique benefit of DP-BiTFiT on high-dimension data.

• The engineering effort: at the time of writing this paper, none of the existing codebases, including GhostClip and Opacus, remove the forward hooks, because no analysis has established that only BiTFiT can be activation-free, not LoRA/Adapter/Compacter or full fine-tuning. Our algorithm enables DP-BiTFiT by one line of code.⁵

⁵ In Pytorch, DP-BiTFiT can be enabled within our codebase by [param.requires_grad_(False) for name, param in model.named_parameters() if 'bias' not in name].
2 Preliminaries
Fine-tuning methods. Fine-tuning, i.e. training a model on a large dataset for a sufficiently long time, and then continuing to train (or transferring) onto the downstream datasets, is the standard paradigm to achieve high accuracy in both the standard and the DP regimes. In DP deep learning, the pre-training takes place on a public dataset using regular optimizers like SGD, and the fine-tuning takes place on a private dataset which requires privacy protection, using DP optimizers like DP-SGD in Section 2.

In a long line of research, various fine-tuning methods have been proposed. One of the most popular methods is the full fine-tuning, which simply runs gradient descent on all trainable weights and biases, and thus can be inefficient when the model is large. To improve the efficiency, Li & Liang (2021) propose the prefix tuning that only optimizes the prompts or the input layer activation (Lester et al., 2021; Liu et al., 2021). However, as pointed out in Hu et al. (2021) and Li et al. (2021), the prefix tuning can be difficult to optimize and thus sub-optimal on large models. Another approach is to reduce the number of trainable parameters. For example, LoRA (Hu et al., 2021), Adapter (Houlsby et al., 2019; Rebuffi et al., 2017; Pfeiffer et al., 2021; Rücklé et al., 2021; Lin et al., 2020) and Compacter (Mahabadi et al., 2021) insert small 'adapter' layers (usually 1-10% of the total parameters) between existing layers, and only the newly added adapters are optimized. We describe the forms and complexity of LoRA and Adapter in Appendix C.
In addition to the aforementioned methods, BiTFiT is a special parameter-efficient method that rivals the full fine-tuning (Zaken et al., 2022; Cai et al., 2020; He et al., 2021). Firstly, BiTFiT optimizes a subset of the original parameters, namely the bias terms, which usually constitute less than 1/1000 of all parameters as demonstrated in Table 1. Therefore, BiTFiT can be readily deployed to any network in a model-agnostic manner. Secondly, BiTFiT is fundamentally different from other parameter efficient methods such as LoRA, since the bias gradients are computed differently than the weight gradients on the computation graph. We will elaborate on this in Equation (3).
Deep learning with differential privacy. We recall the classic (ϵ, δ)-DP, under which we train deep neural networks with provable privacy guarantees.

Definition 2.1 ((Dwork et al., 2006)). A randomized algorithm M is (ε, δ)-differentially private if, for any two neighboring datasets S, S′ that differ by one datapoint, and for any event E, we have P[M(S) ∈ E] ≤ e^ε P[M(S′) ∈ E] + δ.
In deep learning, DP can be achieved by applying an off-the-shelf optimizer (SGD or Adam) with a privately released stochastic gradient in place of the regular ∑_i g_i. The private stochastic gradient is computed by first drawing a minibatch I via Poisson sampling, then computing

$$\text{Private gradient:} \quad \sum_{i \in I} g_i \cdot C(\|g_i\|; R) + \sigma R \cdot \mathcal{N}(0, \mathbf{I}), \tag{1}$$

where C is any function⁶ ℝ⁺ → ℝ subject to C(x) ≤ R/x, g_i is the i-th per-sample gradient, R is the clipping threshold, and σ is the noise multiplier. The private gradient is guaranteed to be DP through the sampled Gaussian mechanism and the associated tight privacy accounting to compose over the iterations (see, e.g., Abadi et al., 2016; Wang et al., 2019; Mironov et al., 2019; Koskela et al., 2020; Bu et al., 2020; Gopi et al., 2021, and the references therein).
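As a reference point, here is a minimal sketch of Equation (1) on flattened per-sample gradients, using Abadi's clipping from footnote 6. The tensor shapes and the helper name private_gradient are illustrative, not part of the paper's codebase.

```python
import torch

def private_gradient(per_sample_grads: torch.Tensor, R: float, sigma: float) -> torch.Tensor:
    """Noisy sum of clipped per-sample gradients, as in Equation (1).

    per_sample_grads: shape (B, D), one flattened gradient g_i per example.
    """
    norms = per_sample_grads.norm(dim=1)                      # ||g_i||
    C = torch.clamp(R / (norms + 1e-12), max=1.0)             # Abadi's clipping min(R/||g_i||, 1)
    clipped_sum = (per_sample_grads * C[:, None]).sum(dim=0)  # sum_i C_i * g_i
    noise = sigma * R * torch.randn_like(clipped_sum)         # sigma * R * N(0, I)
    return clipped_sum + noise

# Example: 8 per-sample gradients of dimension 5.
g = torch.randn(8, 5)
print(private_gradient(g, R=1.0, sigma=1.0))
```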
Backward propagation. We briefly introduce the back-propagation, which reveals a simple yet important difference between the gradients of weights and those of biases. We consider a linear layer, indexed as the l-th layer, with weight W_l ∈ ℝ^{d×p} and bias b_l ∈ ℝ^p. We leave the derivation of other layers, such as normalization and convolution, to Appendix A.1. We denote the mini-batched input of this layer as a_l ∈ ℝ^{B×T×d} and the immediate output as s_l ∈ ℝ^{B×T×p}, where B is the batch size and T is the feature dimension⁷: a_{l+1} = ϕ(s_l), s_l = a_l W_l + b_l. Here ϕ is any non-parametric inter-layer operation, e.g. the non-linear activation (like ReLU), pooling, padding, and so on.

⁶ Examples of gradient clipping include but are not limited to Abadi's clipping min(R/‖g_i‖, 1) (Abadi et al., 2016) and automatic clipping (AUTO-S) R/(‖g_i‖ + 0.01) (Bu et al., 2022b; Yang et al., 2022).

⁷ In sequential data such as text, T is the sequence length; in vision data, T is the product of the input dimensions (e.g. for images, T is the product of height and width). We refer to a high-dimensional input when T is large.
We write $\mathcal{L} = \sum_{i=1}^n \mathcal{L}_i$ as the total loss and $\mathcal{L}_i$ as the per-sample loss of the $i$-th sample. During a standard back-propagation of $L$ layers, the chain rule keeps track of the output gradient at each layer in a just-in-time fashion:

$$\frac{\partial \mathcal{L}}{\partial s_l} = \frac{\partial \mathcal{L}}{\partial a_L}\frac{\partial a_L}{\partial s_{L-1}} \cdot \frac{\partial s_{L-1}}{\partial a_{L-1}} \circ \cdots \circ \frac{\partial a_{l+1}}{\partial s_l} = \frac{\partial \mathcal{L}}{\partial s_{l+1}} W_{l+1}^\top \circ \phi'(s_l). \tag{2}$$
Here $\circ$ is the Hadamard product and $\cdot$ is the matrix product. This output gradient $\partial\mathcal{L}/\partial s_l$ is used to compute the per-sample gradients of the weights and biases:

$$\frac{\partial \mathcal{L}_i}{\partial W_l} = \sum_j \frac{\partial \mathcal{L}_i}{\partial s_{l,j}}\frac{\partial s_{l,j}}{\partial W_l} = \frac{\partial \mathcal{L}}{\partial s_{l,i}}^\top a_{l,i}, \qquad \frac{\partial \mathcal{L}_i}{\partial b_l} = \sum_j \frac{\partial \mathcal{L}_i}{\partial s_{l,j}}\frac{\partial s_{l,j}}{\partial b_l} = \frac{\partial \mathcal{L}}{\partial s_{l,i}}^\top \mathbf{1}. \tag{3}$$
Notably, the weight gradient needs the activation tensor $a_l$ to compute an expensive $O(BTpd)$ tensor multiplication. Memory-wise, $\{a_l\}_l$ across all layers is very costly to store (taking more than 95% of the memory across VGG, ResNet, DenseNet, RoBERTa, etc. by Jain et al. (2020, Figure 3)). In sharp contrast, the computation of the bias gradient does not need $a_l$, and the multiplication with $\mathbf{1}$ in Equation (3) is actually a cheap $O(BTp)$ summation on $\partial\mathcal{L}/\partial s_l : \mathbb{R}^{B\times T\times p} \to \mathbb{R}^{B\times p}$.
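The contrast in Equation (3) can be checked numerically. Below is a small sketch with toy shapes and a made-up per-example loss, verifying against autograd that the weight gradient of a linear layer requires the activation a_l, while the bias gradient is simply the output gradient summed over the T dimension.

```python
import torch

B, T, d, p = 4, 128, 64, 32
a = torch.randn(B, T, d)                         # activation a_l
W = torch.randn(d, p, requires_grad=True)
b = torch.randn(p, requires_grad=True)
s = a @ W + b                                    # s_l = a_l W_l + b_l
loss_per_example = s.square().mean(dim=(1, 2))   # toy per-example losses L_i

# Autograd reference for the first example's weight and bias gradients.
gW, gb = torch.autograd.grad(loss_per_example[0], (W, b), retain_graph=True)

# Output gradient dL_0/ds for the first example, shape (T, p).
g_s = torch.autograd.grad(loss_per_example[0], s, retain_graph=True)[0][0]

# The weight gradient needs the activation a (an O(Tpd) contraction per example) ...
print(torch.allclose(gW, a[0].T @ g_s, atol=1e-5))    # True
# ... while the bias gradient is a sum over T (O(Tp)), with no activation needed.
print(torch.allclose(gb, g_s.sum(dim=0), atol=1e-5))  # True
```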
Forward propagation and the hook. During the forward propagation, all Pytorch-based codebases for DP algorithms, such as Private Transformers, Opacus, FastGradClip, Private-Vision, and others (Yu et al., 2021a; Bu et al., 2023), register forward hooks to extract the activation tensors $\{a_l\}_l$ of all layers from the computation graph, where $a_l$ is computed and stored. Hence, the majority of the memory burden is on the activations, which grow extremely large for huge models like GPT3 (Brown et al., 2020) with 175B parameters: the activation tensors consume more than 3600GB of memory while the parameters and gradients only consume 300GB (Rajbhandari et al., 2020). On one hand, this issue can be alleviated by the activation recomputation or checkpointing technique (Chen et al., 2016; Jain et al., 2020), whose memory cost reduces from $O(L)$ to $O(\sqrt{L})$ with an extra 33% slowdown. Alternatively, we note that the activation tensors are not necessary in the forward propagation, if we only optimize the bias terms.
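To illustrate the point, here is a rough sketch, not the actual Opacus or GhostClip code, of the kind of activation-caching forward hook that weight-gradient DP methods rely on in PyTorch, and of why DP-BiTFiT can simply skip registering it.

```python
import torch
import torch.nn as nn

model = nn.Sequential(nn.Linear(64, 64), nn.ReLU(), nn.Linear(64, 8))

# Roughly what hook-based DP libraries do for *weight* gradients: cache each
# layer's input a_l during the forward pass so it can be reused in the backward pass.
cached_activations = {}

def save_input(module, inputs, output):
    cached_activations[module] = inputs[0].detach()   # a_l is kept in memory

hooks = [m.register_forward_hook(save_input) for m in model if isinstance(m, nn.Linear)]
model(torch.randn(32, 64))
print(sum(v.numel() for v in cached_activations.values()))  # activation memory held

# DP-BiTFiT registers no forward hooks at all: the bias gradient in Equation (3)
# never touches a_l, so this memory is simply not allocated.
for h in hooks:
    h.remove()
cached_activations.clear()
```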
3 Differentially Private Bias-Term Fine-Tuning

Figure 2: Back-propagation for DP (red & black) and non-DP (black) algorithms. Note that the bias gradient uses a much simpler computation graph than the weight gradient, rendering DP-BiTFiT easy to implement and efficient to compute. Left: full fine-tuning with GhostClip (ghost clipping; (Goodfellow, 2015; Li et al., 2021; Bu et al., 2022a)). Upper right: full fine-tuning with Opacus (Yousefpour et al., 2021). Lower right: DP-BiTFiT.

We propose DP-BiTFiT to privately train only the bias terms in a neural network, by combining Equation (3) and Equation (1). We use shaded lines to represent the additional DP operations in Algorithm 1, and add the DP-related variables and operations in red in the computation graph in Figure 2.
Implementation-wise, DP-BiTFiT is different from all existing DP algorithms (including full, LoRA, Adapter, etc.) that optimize weights, since it does not apply a Pytorch forward hook to store the activation $a_l$ for all layers. We provide the implementation details of DP-BiTFiT in Appendix B.

To give a concrete example, we apply DP-BiTFiT to the RoBERTa-large model on the QQP dataset, following the same setting as Li et al. (2021) and using one 40GB A100 GPU. This is the most time-consuming text classification task in our work, taking 119 minutes per epoch for a training batch size of 20 using the fastest DP full fine-tuning implementation, GhostClip (Li et al., 2021). To conduct a simple ablation study: setting all weights to not require gradients (but with the forward hooks still operating) reduces the training time by 50% to 80 minutes; removing the forward hooks further reduces the training time by 30% to 63 minutes; finally, using the maximum batch size allowed by the memory-saving DP-BiTFiT reduces it to 43 minutes.
3.1 Parameter efficiency
DP-BiTFiT enjoys exactly the same parameter efficiency as the standard BiTFiT, training merely about 0.1% of the total parameters in large models. We demonstrate that DP-BiTFiT is one of the most parameter-efficient fine-tuning methods through a list of models in Table 1.
| Dataset  | Model             | # of params | % of bias |
|----------|-------------------|-------------|-----------|
| ImageNet | VGG16             | 138M        | 0.009     |
| ImageNet | ResNet18          | 11.7M       | 0.043     |
| ImageNet | ResNet50          | 25.6M       | 0.113     |
| ImageNet | ViT-small-patch16 | 21.7M       | 0.238     |
| ImageNet | ViT-base-patch16  | 85.8M       | 0.120     |
| ImageNet | ViT-large-patch16 | 303M        | 0.090     |
| E2E      | GPT2-small        | 124M        | 0.082     |
| E2E      | GPT2-medium       | 355M        | 0.076     |
| E2E      | GPT2-large        | 774M        | 0.066     |
| GLUE     | RoBERTa-base      | 125M        | 0.083     |
| GLUE     | RoBERTa-large     | 355M        | 0.077     |

Table 1: Parameter efficiency of (DP) BiTFiT. Extended results on more models are in Table 11.
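The '% of bias' column can be reproduced for any PyTorch model by counting the parameters whose name contains 'bias'. A small sketch is below; the toy transformer encoder is only for demonstration and will not match the exact numbers in Table 1, which require the actual checkpoints.

```python
import torch.nn as nn

def bias_fraction(model: nn.Module) -> float:
    """Percentage of parameters that are bias terms (the '% of bias' column of Table 1)."""
    total = sum(p.numel() for p in model.parameters())
    bias = sum(p.numel() for n, p in model.named_parameters() if 'bias' in n)
    return 100.0 * bias / total

# Toy check on a small transformer encoder.
encoder = nn.TransformerEncoder(nn.TransformerEncoderLayer(d_model=256, nhead=4), num_layers=4)
print(f"{bias_fraction(encoder):.3f}% of parameters are biases")
```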
An advantage of this parameter efficiency is reflected in the computation efficiency, given that most parameters do not require gradients to be computed: we show in Table 2 and Section 3.3 that DP-BiTFiT is much more efficient than full fine-tuning (DP and even non-DP). Additionally, the parameter efficiency also translates to the communication efficiency in distributed learning. For example, the 64-bit communication cost of DP full fine-tuning is 64MD, where M is the number of workers and D is the total number of parameters; this can be reduced by about 1000× with DP-BiTFiT.
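A back-of-the-envelope check of this claim, using the RoBERTa-large row of Table 1 (D = 355M parameters, 0.077% biases) and a hypothetical number of workers M:

```python
M = 8                        # number of workers (hypothetical)
D = 355_000_000              # total parameters (RoBERTa-large, Table 1)
bias_fraction = 0.00077      # 0.077% of parameters are biases (Table 1)

full_cost = 64 * M * D                      # 64-bit gradients, DP full fine-tuning
bitfit_cost = 64 * M * D * bias_fraction    # only bias gradients are communicated
print(full_cost / bitfit_cost)              # roughly 1300x less communication
```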