
Differentially Private Optimization on Large Model at Small Cost
Zhiqi Bu 1   Yu-Xiang Wang 2   Sheng Zha 1   George Karypis 1
1 Amazon Web Services   2 University of California, Santa Barbara. Correspondence to: Zhiqi Bu <zhiqibu@amazon.com>.
Abstract
Differentially private (DP) optimization is the standard paradigm to learn large neural networks that are accurate and privacy-preserving. The computational cost for DP deep learning, however, is notoriously heavy due to the per-sample gradient clipping. Existing DP implementations are 2∼1000× more costly in time and space complexity than the standard (non-private) training.
In this work, we develop a novel Book-Keeping (BK) technique that implements existing DP optimizers (thus achieving the same accuracy) at a substantially reduced computational cost. Specifically, BK enables DP training on large models and high-dimensional data to be roughly as fast and memory-efficient as the standard training, whereas previous DP algorithms can be inefficient or incapable of training due to memory errors. The computational advantage of BK is supported by complexity analysis as well as extensive experiments on vision and language tasks. Our implementation achieves state-of-the-art (SOTA) accuracy at very small extra cost: on GPT2, and at almost the same memory cost (<1% overhead), BK has 1.03× the time complexity of the standard training (0.83× training speed in practice) and 0.61× the time complexity of the most efficient DP implementation (1.36× training speed in practice). We open-source the codebase for the BK algorithm in the FastDP library at https://github.com/awslabs/fast-differential-privacy.
1 Introduction
Deep learning with differential privacy (DP; Dwork et al., 2006) has shown strong performance while guaranteeing rigorous protection against privacy risks, especially on large
models that tend to memorize and leak the training data
(Carlini et al., 2021; Haim et al., 2022; Shokri et al., 2017).
For example, recent advances have shed light on the success of DP GPT2 (Li et al., 2021; Bu et al., 2022b; Yu et al., 2021), which achieves a 64.6 BLEU score^1 at a strong privacy guarantee (ϵ = 3) on the text generation task using the E2E restaurant review dataset. This is only marginally below the standard non-private GPT2 (BLEU score 66.8). Similarly, on computer vision tasks (ϵ = 2), DP vision transformers and ResNets have obtained 97.1%/86.2% accuracy on CIFAR10/100 (Bu et al., 2022a) and over 81% accuracy on ImageNet (De et al., 2022; Mehta et al., 2022).
However, DP training of large neural networks is well-known to be computationally burdensome in comparison to the standard training, in terms of both training time and memory cost. For instance, training a small recurrent neural network (0.598M parameters) experiences a 1000× slowdown using DP optimizers in the Tensorflow-Privacy (TF-Privacy) library (Bu et al., 2021a), and training a small convolutional neural network (CNN, 0.605M parameters) on CIFAR10 has a 24× slowdown with Tensorflow 2 and the XLA compiler (Subramani et al., 2021). Even with SOTA efficient implementations, large models such as RoBERTa (Liu et al., 2019), GPT2 (Radford et al., 2019), ResNet (He et al., 2016), VGG (Simonyan & Zisserman, 2014), ViT (Dosovitskiy et al., 2020) and its variants experience about 2∼3× slowdown in Pytorch (Li et al., 2021; Bu et al., 2022a) and 2∼9× slowdown in JAX (Kurakin et al., 2022; De et al., 2022), with possibly 4∼20× memory overhead (Bu et al., 2022a; Li et al., 2021; Subramani et al., 2021), if not running out of memory.
The efficiency bottleneck in DP deep learning lies in the
per-sample gradient clipping, which restricts the magnitude
of each per-sample gradient in the mini-batch. Applying
the clipping jointly with the Gaussian noise addition, one
can privately release the gradient to arbitrary optimizers
like SGD and Adam, and thus guarantee the privacy of the
^1 BLEU (BiLingual Evaluation Understudy) is a metric (0-100) for automatically evaluating translated text. BLEU >60 is considered as "very high quality, adequate, and fluent translations, often better than human".
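To make the per-sample clipping and Gaussian noise addition concrete, the following is a minimal PyTorch sketch of the privately released gradient, written as a naive per-example loop rather than the BK technique of this paper; the function name dp_sgd_gradient and the arguments clip_norm and noise_multiplier are illustrative and are not part of the FastDP API.

import torch

def dp_sgd_gradient(model, loss_fn, xs, ys, clip_norm=1.0, noise_multiplier=1.0):
    # Naive DP gradient: clip each per-sample gradient, sum, add Gaussian noise, average.
    params = [p for p in model.parameters() if p.requires_grad]
    summed = [torch.zeros_like(p) for p in params]
    for x, y in zip(xs, ys):  # one backward pass per sample: the efficiency bottleneck
        model.zero_grad()
        loss_fn(model(x.unsqueeze(0)), y.unsqueeze(0)).backward()
        grads = [p.grad.detach().clone() for p in params]
        norm = torch.sqrt(sum(g.pow(2).sum() for g in grads))
        scale = torch.clamp(clip_norm / (norm + 1e-6), max=1.0)  # clip factor min(1, R / ||g_i||)
        for s, g in zip(summed, grads):
            s.add_(g * scale)
    # Gaussian noise calibrated to the clipping threshold R, then averaged over the batch
    return [(s + noise_multiplier * clip_norm * torch.randn_like(s)) / len(xs)
            for s in summed]

The clipped and noised gradient can then be passed to any optimizer such as SGD or Adam; the per-example backward passes and the stored per-sample gradients in such a loop are exactly the sources of the time and memory overheads discussed above.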