
1. Algorithmically, we propose Differentially Private Bias-Term Fine-Tuning (DP-BiTFiT) in Algorithm 1, which is highly accurate under the DP constraint, on par with SOTA in Section 4, and even outperforms fully fine-tuned GPT2-large.
2. DP-BiTFiT is model-agnostic[1] and only optimizes 0.1% of the model parameters on BERT, RoBERTa, GPT2, ViT, ResNet, and so on (see Table 1). Thus DP-BiTFiT is one of the most parameter-efficient fine-tuning methods among DP LoRA, Adapter, last-layer training, etc.
3. We design a computationally efficient implementation of DP-BiTFiT, whose time and space complexity is almost the same as that of standard non-DP BiTFiT, while being faster than non-DP full fine-tuning and other DP fine-tuning methods (see Figure 1). This advantage is analyzed in Table 2 and demonstrated via the substantial speedup and memory saving in Figure 3 and Figure 4.
4. DP-BiTFiT is a unique algorithm in that its computation overhead is independent of the feature dimension T[2] (see the red text in Table 2). This is due to the activation-free forward pass that only happens in the no-weight training[3], unlike LoRA; a minimal sketch illustrating this property follows the list. In Figure 1, although DP-BiTFiT optimizes a similar number of parameters to DP LoRA or Compacter, its memory efficiency is dominant. Therefore, DP-BiTFiT enjoys a special advantage on long-sequence texts and high-resolution images (see Figure 3).
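
To make the activation-free claim in item 4 concrete, the following is a minimal PyTorch sketch (not the paper's implementation; the layer sizes and the toy loss are illustrative assumptions). It shows that the per-sample bias gradient of a linear layer can be recovered from the output gradient alone, via a backward hook and without storing any forward activations:

    import torch
    import torch.nn as nn

    # For a linear layer s = a @ W.T + b, the per-sample gradient w.r.t. b is
    # sum_t dL/ds[i, t, :]; it never touches the input activation a, so no
    # forward hook (and no activation storage) is needed.
    B, T, d = 4, 128, 16                      # batch size, sequence length, width
    layer = nn.Linear(d, d)
    x = torch.randn(B, T, d)

    cached = {}
    def backward_hook(module, grad_input, grad_output):
        cached["dL_ds"] = grad_output[0].detach()       # shape (B, T, d)
    layer.register_full_backward_hook(backward_hook)

    layer(x).square().mean().backward()

    per_sample_bias_grad = cached["dL_ds"].sum(dim=1)   # shape (B, d)
    # Sanity check: the per-sample gradients sum to autograd's aggregated one.
    print(torch.allclose(per_sample_bias_grad.sum(dim=0), layer.bias.grad))

By contrast, the per-sample weight gradient of the same layer also requires the stored input activation (hence a forward hook), which is exactly the overhead quantified in footnote [2].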
Novelty. At a glance, our results may appear to be incremental, as we merely add differential privacy to an existing method (BiTFiT) through a standard mechanism (DP-SGD). This is not true! Computationally, our implementation of DP-BiTFiT is distinct from and orthogonal to existing DP algorithms such as GhostClip (Li et al., 2021)[4], in that DP-BiTFiT exploits the special structures in the forward and backward passes (see the simplicity of the computation graph in Figure 2), hence removing the computational and memory overhead in DP-SGD (see the independence of T in Table 2), which is unavoidable in other methods.
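
For concreteness, here is a minimal sketch of one DP-BiTFiT update, i.e. DP-SGD restricted to the bias terms. It is not the paper's Algorithm 1 nor its efficient implementation, and the clipping threshold R, noise multiplier sigma, and learning rate are illustrative assumptions:

    import torch

    def dp_bitfit_step(model, per_sample_losses, R=1.0, sigma=1.0, lr=1e-3):
        # Only the bias terms are trainable in BiTFiT; all weights stay frozen.
        bias_params = [p for n, p in model.named_parameters() if "bias" in n]
        B = per_sample_losses.shape[0]
        summed = [torch.zeros_like(p) for p in bias_params]

        # Naive per-sample loop, kept for readability only; the efficient
        # implementation instead reads per-sample bias gradients off the output
        # gradients (as in the sketch above), with no loop and no activations.
        for i in range(B):
            grads = torch.autograd.grad(per_sample_losses[i], bias_params,
                                        retain_graph=True)
            norm = torch.sqrt(sum(g.pow(2).sum() for g in grads))
            scale = torch.clamp(R / (norm + 1e-6), max=1.0)   # per-sample clipping
            for s, g in zip(summed, grads):
                s.add_(scale * g)

        with torch.no_grad():
            for p, s in zip(bias_params, summed):
                noisy = s + sigma * R * torch.randn_like(s)   # Gaussian mechanism
                p.add_(-lr * noisy / B)                       # average and descend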
[1] In Section 4, DP-BiTFiT is applicable to all model architectures tested, unlike LoRA (which mostly applies only to transformers) and last-layer training (which mostly works only on vision models).
[2] The computation overhead of obtaining the per-sample weight gradient norm is linear in T (by instantiating per-sample gradients) or quadratic in T (if using the ghost norm trick (Goodfellow, 2015; Li et al., 2021)), for DP full fine-tuning and any other PEFT.
[3] We distinguish the weight training from the bias training in Section 2 using the chain rule. Note that activation-free means memory-saving, which is not leveraged by DP full fine-tuning, LoRA, Adapter, Compacter, etc.
[4] Ghost clipping (GhostClip) is an algebraic technique that only works on weight gradients, because it manipulates the activation tensors at O(BT^2) cost. This is too expensive for high-dimensional features, hence not applicable to the bias gradients.
Our main contributions also include:
• The complexity analysis of DP parameter-efficient fine-tuning (PEFT) in Table 2 and Table 7. This was a missing piece in the previous DP and non-DP PEFT literature (including the BiTFiT paper) and is significantly helpful in determining the benefit of applying different PEFT methods. Specifically, we leverage the complexity analysis to rigorously show that DP-BiTFiT saves 50% of the complexity compared to full fine-tuning, and to reveal the unique benefit of DP-BiTFiT on high-dimensional data.
• The engineering effort: at the time of writing this paper, none of the existing codebases, including GhostClip and Opacus, remove the forward hooks, because no analysis had established that only BiTFiT can be activation-free, not LoRA/Adapter/Compacter or full fine-tuning. Our algorithm enables DP-BiTFiT by one line of code[5].
2 Preliminaries
Fine-tuning methods. Fine-tuning, i.e. training a model on a large dataset for a sufficiently long time and then continuing to train it on (or transferring it to) the downstream datasets, is the standard paradigm to achieve high accuracy in both the standard and the DP regimes. In DP deep learning, the pre-training takes place on a public dataset using regular optimizers like SGD, and the fine-tuning takes place on a private dataset that requires privacy protection, using DP optimizers like the DP-SGD in Section 2.
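
For reference, DP-SGD (defined formally later in Section 2) performs the standard privatized update below; the notation, with clipping threshold R and noise multiplier sigma, is the conventional one and is an assumption of this summary rather than a quotation from the paper:

    w_{t+1} = w_t - \frac{\eta}{B} \Big( \sum_{i \in \mathcal{B}_t} \frac{g_i}{\max\big(1, \|g_i\|_2 / R\big)} + \sigma R \, z_t \Big),
    \qquad g_i = \nabla_w \ell(x_i; w_t), \quad z_t \sim \mathcal{N}(0, I).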
In a long line of research, various fine-tuning methods have been proposed. One of the most popular is full fine-tuning, which simply runs gradient descent on all trainable weights and biases, and thus can be inefficient when the model is large. To improve the efficiency, Li & Liang (2021) propose prefix tuning, which only optimizes the prompts or the input-layer activations (Lester et al., 2021; Liu et al., 2021). However, as pointed out in Hu et al. (2021) and Li et al. (2021), prefix tuning can be difficult to optimize and thus sub-optimal on large models. Another approach is to reduce the number of trainable parameters. For example, LoRA (Hu et al., 2021), Adapter (Houlsby et al., 2019; Rebuffi et al., 2017; Pfeiffer et al., 2021; Rückle et al., 2021; Lin et al., 2020) and Compacter (Mahabadi et al., 2021) insert small 'adapter' layers (usually 1-10% of total parameters) between existing layers, and only the newly added adapters are optimized. We describe the forms and complexity of LoRA and Adapter in Appendix C.
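
To give a rough sense of the two designs named above (the exact forms used for the complexity analysis are those in Appendix C), here is an illustrative PyTorch sketch; the bottleneck width r, the placement, and the initialization are assumptions of the sketch:

    import torch.nn as nn

    class Adapter(nn.Module):
        """Bottleneck adapter inserted after an existing sublayer (Houlsby et al., 2019)."""
        def __init__(self, d, r=16):
            super().__init__()
            self.down = nn.Linear(d, r)    # project to a small bottleneck
            self.act = nn.GELU()
            self.up = nn.Linear(r, d)      # project back; only the adapter is trained

        def forward(self, h):
            return h + self.up(self.act(self.down(h)))    # residual bottleneck

    class LoRALinear(nn.Module):
        """Frozen linear layer plus a trainable low-rank update (Hu et al., 2021)."""
        def __init__(self, base: nn.Linear, r=8):
            super().__init__()
            self.base = base.requires_grad_(False)        # original weights stay frozen
            self.A = nn.Linear(base.in_features, r, bias=False)
            self.B = nn.Linear(r, base.out_features, bias=False)
            nn.init.zeros_(self.B.weight)                 # low-rank update starts as a no-op

        def forward(self, x):
            return self.base(x) + self.B(self.A(x))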
In addition to the aforementioned methods, BiTFiT is a
[5] In PyTorch, DP-BiTFiT can be enabled within our codebase by
    [param.requires_grad_(False) for name, param in model.named_parameters() if 'bias' not in name].
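
As a purely illustrative use of the footnote-5 one-liner (the model choice and the HuggingFace transformers dependency are assumptions, not part of the paper's codebase), one can freeze everything except the biases and verify the trainable fraction reported in Table 1:

    from transformers import AutoModel

    model = AutoModel.from_pretrained("roberta-base")
    [param.requires_grad_(False) for name, param in model.named_parameters()
     if 'bias' not in name]

    trainable = sum(p.numel() for p in model.parameters() if p.requires_grad)
    total = sum(p.numel() for p in model.parameters())
    print(f"trainable fraction: {trainable / total:.4%}")   # roughly 0.1% (cf. Table 1)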