Recycling Scraps: Improving Private Learning by Leveraging Checkpoints

Virat Shejwalkar*, Arun Ganesh†, Rajiv Mathews†, Yarong Mu†, Shuang Song†, Om Thakkar†,
Abhradeep Thakurta†, Xinyi Zheng†

Google
{vshejwalkar, arunganesh, mathews, ymu, shuangsong, omthkkr, athakurta, cazheng}@google.com

* Work done while the author was an intern at Google.
† Listed in alphabetical order.
Abstract
In this work, we focus on improving the accuracy-variance trade-off for state-of-the-art differentially
private machine learning (DP ML) methods. First, we design a general framework that uses aggregates
of intermediate checkpoints during training to increase the accuracy of DP ML techniques. Specifically,
we demonstrate that training over aggregates can provide significant gains in prediction accuracy over
the existing state-of-the-art for StackOverflow, CIFAR10 and CIFAR100 datasets. For instance, we
improve the state-of-the-art DP StackOverflow accuracies to 22.74% (+2.06% relative) for ε = 8.2, and
23.90% (+2.09%) for ε = 18.9. Furthermore, these gains magnify in settings with periodically varying
training data distributions. We also demonstrate that our methods achieve relative improvements of
0.54% and 62.6% in terms of utility and variance, on a proprietary, production-grade pCVR task. Lastly,
we initiate an exploration into estimating the uncertainty (variance) that DP noise adds in the predictions
of DP ML models. We prove that, under standard assumptions on the loss function, the sample variance
from the last few checkpoints provides a good approximation of the variance of the final model of a DP run.
Empirically, we show that the last few checkpoints can provide a reasonable lower bound for the variance
of a converged DP model. Crucially, all the methods proposed in this paper operate on a single training
run of the DP ML technique, thus incurring no additional privacy cost.
1 Introduction
Machine learning models can unintentionally memorize sensitive information about the data they were trained
on, which has led to numerous attacks that extract private information about the training data [Ateniese
et al., 2013, Fredrikson et al., 2014, 2015, Carlini et al., 2019, Shejwalkar et al., 2021, Carlini et al., 2021,
2022]. For instance, membership inference attacks [Shokri et al., 2017] can infer whether a target sample
was used to train a given ML model, while property inference attacks [Melis et al., 2019, Mahloujifar et al.,
2022] can infer certain sensitive properties of the training data. To address such privacy risks, the literature has
introduced various approaches to privacy-preserving ML [Nasr et al., 2018, Shejwalkar and Houmansadr, 2021,
Tang et al., 2022]. In particular, iterative techniques like differentially private stochastic gradient descent
(DP-SGD) [Song et al., 2013, Bassily et al., 2014a, Abadi et al., 2016c, McMahan et al., 2017b] and DP Follow
The Regularized Leader (DP-FTRL) [Kairouz et al., 2021] have become the state-of-the-art for training DP
neural networks.
The accuracy-variance trade-off is a central problem in machine learning. Note that here, we use the
term accuracy to refer to the primary evaluation metric of a model on the training/test data sets, e.g.,
accuracy for datasets like CIFAR10 and StackOverflow, and AUC-loss (i.e., 1 - AUC) for datasets like pCVR.
Techniques like DP-SGD and DP-FTRL involve the operation of per-example gradient clipping and calibrated
Gaussian noise addition in each training step, which makes this trade-off even trickier to understand in DP
ML [Song et al., 2021]. In this work, we focus on both fronts of the problem.
Our contributions at a glance: First, we design a general framework that (adaptively) uses aggregates of
intermediate checkpoints (i.e., the intermediate iterates of model training) to increase the accuracy of DP ML
techniques. Next, we provide a method to estimate the uncertainty (variance) that DP noise adds to DP ML
training. Crucially, we attain both these goals with a single training run of the DP technique, thus incurring
no additional privacy cost. While both goals are intertwined, for ease of presentation, we separate the
exposition into two parts. In the following, we provide the details of our contributions, and place them in the
context of prior works.
Increasing accuracy using checkpoint aggregates (Sections 3 and 4): While the privacy analyses
for state-of-the-art DP ML techniques allow releasing/using all the training checkpoints, prior works in DP
ML [Abadi et al., 2016c, McMahan et al., 2017b, 2018, Erlingsson et al., 2019, Wang et al., 2019b, Zhu and
Wang, 2019, Balle et al., 2020, Erlingsson et al., 2020, Papernot et al., 2020, Tramer and Boneh, 2020, Andrew
et al., 2021, Kairouz et al., 2021, Amid et al., 2022, Feldman et al., 2022] use only the final model output by the
DP algorithm for establishing benchmarks. This is also how DP models are deployed in practice [Ramaswamy
et al., 2020, McMahan et al., 2022]. To our knowledge, De et al. [2022] is the only prior work that re-uses
intermediate checkpoints to increase the accuracy of DP-SGD. They note non-trivial accuracy gains by
post-processing the DP-SGD checkpoints using an exponential moving average (EMA). While [Chen et al.,
2017, Izmailov et al., 2018] explore checkpoint aggregation methods to improve performance in (non-DP) ML
settings, they observe negligible performance gains.
In this work, we propose a general framework that adaptively uses intermediate checkpoints to increase the
accuracy of state-of-the-art DP ML techniques. To our knowledge, this is the first work to re-use intermediate
checkpoints during DP ML training. Empirically, we demonstrate significant performance gains using our
framework for a next word prediction task with user-level DP for StackOverflow, an image classification
task with sample-level DP for CIFAR10, and an ad-click conversion prediction task with sample-level DP
for a proprietary pCVR dataset. It is worth noting that the DP state-of-the-art for benchmark datasets has
repeatedly improved over the years since the foundational techniques from Abadi et al. [2016c] for CIFAR10
and McMahan et al. [2017b] for StackOverflow, hence any consistent improvements are instrumental in
advancing the state of DP ML.
Specifically, we show that training over aggregates of checkpoints achieves state-of-the-art prediction
accuracy of 22.74% at ε = 8.2 for StackOverflow (i.e., 2.09% relative gain over DP-FTRL from Kairouz et al.
[2021])^1, and 57.51% at ε = 1 for CIFAR10 (i.e., 2.7% relative gain over DP-SGD as per De et al. [2022]),
respectively. For the CIFAR100 task, we first improve the DP-SGD baseline of De et al. [2022] even without using
any of our aggregation methods. Similar to De et al. [2022], we warm-start DP training on CIFAR100 from
a checkpoint pre-trained on ImageNet. However, we use the EMA checkpoint of the pre-training pipeline
instead of the last checkpoint as in De et al. [2022], and improve DP-SGD performance by 5% and 3.2%
for ε of 1 and 8, respectively. Next, we show that training over aggregates further improves the accuracy on
CIFAR100 by 0.67% to 76.18% at ε = 1 (i.e., 0.89% relative gain over our improved CIFAR100 DP-SGD
baseline). We then show that these benefits further magnify in more practical settings with periodically
varying training data distributions. For instance, we note relative accuracy gains of 2.64% and 2.82% for
ε of 18.9 and 8.2, respectively, for StackOverflow over the DP-FTRL baseline in such a setting. We also experiment
with a proprietary, production-grade pCVR dataset Denison et al. [2022], Chua et al. [2024] and show that at
ε = 6, training over aggregates of checkpoints improves AUC-loss (i.e., 1 − AUC) by 0.54% (relative) over the
DP-SGD baseline. Note that such an improvement is considered very significant in the context of ads ranking.
Theoretically, we show in Theorem 3.2 that for standard training regimes, the excess empirical risk of the
final checkpoint of DP-SGD is log(n) times larger than that of the weighted average of the past k checkpoints,
where n is the size of the dataset. It would be interesting to theoretically analyze the use of checkpoint
aggregations during training, which we leave as future work.
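For reference, the kind of aggregate Theorem 3.2 compares against is a weighted average of the last k
checkpoints. The generic form below is a sketch; the exact weights analyzed in Theorem 3.2 are specified
in Section 3.

    % Weighted average of the last k checkpoints theta_{T-k+1}, ..., theta_T of a
    % T-step DP-SGD run; the weights are nonnegative and sum to one. The uniform
    % choice w_i = 1/k recovers plain tail averaging.
    \bar{\theta}_{T,k} \;=\; \sum_{i=T-k+1}^{T} w_i\,\theta_i,
    \qquad w_i \ge 0,\quad \sum_{i=T-k+1}^{T} w_i = 1.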
Uncertainty quantification using intermediate checkpoints (Section 5): There are various sources
of randomness in an ML training pipeline [Abdar et al., 2021], e.g., choice of initial parameters, dataset,
batching, etc. This randomness induces uncertainty in the predictions made using such ML models. In
critical domains, e.g., medical diagnosis, self-driving cars and financial market analysis, failing to capture the
^1 These improvements are notable since there are 10k classes in StackOverflow data.
uncertainty in such predictions can have undesirable repercussions. DP learning adds an additional source of
randomness by injecting noise at every training round. Hence, it is paramount to quantify the reliability of
DP models, e.g., by quantifying the uncertainty in their predictions.
In prior work, Karwa and Vadhan [2017] develop finite-sample confidence intervals, but only for the simpler
Gaussian mean estimation problem. Various methods exist for uncertainty quantification in ML-based
systems [Mitchell, 1980, Roy et al., 2018, Begoli et al., 2019, Hubschneider et al., 2019, McDermott and Wikle,
2019, Tagasovska and Lopez-Paz, 2019, Wang et al., 2019a, Nair et al., 2020, Ferrando et al., 2022]. However,
these methods either use specialized (or simpler) model architectures to facilitate uncertainty quantification,
or are not directly applicable to quantify the uncertainty in DP ML due to DP noise. For example, a common
way of uncertainty quantification [Barrientos et al., 2019, Nissim et al., 2007, Brawner and Honaker, 2018,
Evans et al., 2020], which we call the independent runs method, needs k independent (bootstrap) runs of the
ML algorithm. However, repeating a DP ML algorithm multiple times can incur significant privacy and
computation costs.
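For concreteness, a minimal sketch of this independent runs baseline is given below. The training routine,
the naive privacy-budget split, and the statistic being measured are illustrative placeholders, not the exact
experimental setup used in this paper.

    import numpy as np

    def independent_runs_variance(train_fn, k, total_epsilon, delta, score_fn, x):
        """Hypothetical sketch of the independent-runs uncertainty baseline.

        Trains k DP models, each on a 1/k share of the privacy budget (by basic
        composition), and returns the sample variance of a scalar statistic
        (e.g., the score of the predicted label on input x) across the k runs.
        """
        per_run_epsilon = total_epsilon / k   # naive budget split; tighter accounting is possible
        scores = []
        for seed in range(k):
            model = train_fn(epsilon=per_run_epsilon, delta=delta, seed=seed)  # one full DP training run
            scores.append(score_fn(model, x))
        scores = np.asarray(scores)
        return scores.var(ddof=1)             # unbiased sample variance across runs

The k-fold cost in both privacy budget and training compute is exactly what the single-run, checkpoint-based
estimator described next avoids.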
To this end, for the first time, we quantify the uncertainty that DP noise adds to the DP training procedure
using only a single training run. We propose to use the last k checkpoints of a single run of a DP ML
algorithm as a proxy for the k final checkpoints from independent runs. This does not incur any additional
privacy cost to the DP ML algorithm. Furthermore, it is useful in practice as it does not incur additional
training compute, and it can work with any algorithm having intermediate checkpoints. Finally, it does not
require changing the underlying model or algorithm, unlike some other methods for uncertainty estimation
(e.g., the use of Bayesian neural networks Zhang et al. [2021]).
Theoretically, we consider using (a rescaling of) the sample variance of a statistic f(θ) at checkpoints
θ_{t_1}, ..., θ_{t_k} as an estimator of the variance of any convex combination of f(θ_{t_i}), i.e., any
weighted average of the statistics at the checkpoints, and give a bound on the bias of this estimator. As
expected, our bound on the error decreases as the "burn-in" time t_1 and the time t_2 − t_1 between
checkpoints both increase. An upshot of this analysis is that getting k nearly i.i.d. checkpoints requires
fewer iterations than running k independent runs of t_1 iterations each. In turn, under a fixed privacy
constraint, using the sample variance of the checkpoints can provide more samples and thus tighter confidence
intervals than the independent runs method; see the remark in Section 5 for details.
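A minimal sketch of this single-run estimator is shown below. It treats the last k checkpoints as
approximately i.i.d. draws, uses uniform weights w_i = 1/k as an example, and leaves the statistic f as a
placeholder; the precise estimator and its bias bound appear in Section 5.

    import numpy as np

    def checkpoint_variance_estimate(checkpoints, f, weights=None):
        """Hypothetical sketch: estimate the variance of a weighted average of a
        statistic f evaluated at the last k checkpoints of ONE DP training run.

        If the checkpoints behaved like i.i.d. draws, the weighted average
        sum_i w_i * f(theta_i) would have variance (sum_i w_i**2) * Var(f(theta)),
        so the sample variance across checkpoints is rescaled accordingly.
        """
        values = np.asarray([f(theta) for theta in checkpoints])  # f(theta_{t_1}), ..., f(theta_{t_k})
        k = len(values)
        if weights is None:
            weights = np.full(k, 1.0 / k)                         # uniform weighting as an example
        sample_var = values.var(ddof=1)                           # sample variance across checkpoints
        return float(np.sum(weights**2) * sample_var)

    # Usage: f could be the predicted-label score on a fixed query point x, e.g.,
    # variance = checkpoint_variance_estimate(last_k_checkpoints,
    #                                         f=lambda theta: score(theta, x))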
Intuitively, our proof shows that (i) as the burn-in time increases, the marginal distribution of each θ_{t_i}
approaches the distribution of θ_{t_k}, and (ii) as the time between checkpoints increases, any pair
θ_{t_i}, θ_{t_j} approaches pairwise independence. We prove both (i) and (ii) via a mixing-time bound, which
shows that, starting from any point distribution θ_0, the Markov chain given by DP-SGD approaches its
stationary distribution at a certain rate.
Empirically, we show that our method provides reasonable lower bounds on the uncertainty quantified using
the more accurate (but privacy- and computation-intensive) method that uses independent runs. For instance,
we show that for DP-FTRL-trained StackOverflow, the 95% confidence widths for the scores of the predicted
labels computed using the independent runs method (no budget split)^2 are always within a factor of 2 of the
widths provided by our method, for various privacy levels and numbers of bootstrap samples.
While we compute the variance with respect to a fixed prediction function, we believe our estimator can be
used to obtain DP parameter confidence intervals for traditional statistical estimators (e.g., linear regression).
We leave this direction for future exploration.
2 Background and Preliminaries
In this section, we briefly introduce the background on machine learning, privacy leakage in machine learning
models, differential privacy, and deep learning with differential privacy.
2.1 Machine Learning
In this paper, we consider machine learning (ML) models used for image classification and language next-
word-prediction tasks. We use supervised machine learning for both types of tasks and briefly review it
below.
^2 This is thus a superior baseline, as it does not split the privacy budget among the independent runs.
Let f_θ : ℝ^d → ℝ^k be an ML classifier (e.g., a neural network) with d input features and k classes, which is
parameterized by θ. For a given example z = (x, y), f_θ(x) is the classifier's confidence vector over the k
classes, and the predicted label is the class with the largest confidence score, i.e., ŷ = arg max_i f_θ(x)_i.

The goal of supervised machine learning is to learn the relationship between features and labels in given
labeled training data D^l_tr and to generalize this ability to unseen data. The model learns this relationship
using empirical risk minimization (ERM) on the training set D^l_tr, where the risk is measured in terms of a
certain loss function, e.g., the cross-entropy loss:

    min_θ (1 / |D^l_tr|) Σ_{z ∈ D^l_tr} ℓ(f_θ, z)

Here |D^l_tr| is the size of the labeled training set and ℓ(f_θ, z) is the loss function. When clear from the
context, we use f instead of f_θ to denote the target model.
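As a small illustration of these definitions, the following sketch computes the predicted label and the
empirical risk for a generic classifier; the model and data here are placeholders, not the architectures or
datasets used in the experiments.

    import numpy as np

    def predict_label(confidences):
        """Predicted label: the class with the largest confidence score, y_hat = argmax_i f_theta(x)_i."""
        return int(np.argmax(confidences))

    def empirical_risk(model, data):
        """Average cross-entropy loss of `model` over labeled data [(x, y), ...].

        `model(x)` is assumed to return a probability vector over the k classes.
        """
        losses = []
        for x, y in data:
            probs = model(x)
            losses.append(-np.log(probs[y] + 1e-12))  # cross-entropy for the true class y
        return float(np.mean(losses))                 # (1/|D_tr|) * sum of per-example losses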
2.2 Privacy Leakage in ML Models
ML models generally require large amounts of training data to achieve good performance. This data can
be of a sensitive nature, e.g., medical records and personal photographs, and without proper precautions,
ML models may leak sensitive information about their private training data. Multiple previous works have
demonstrated this via various inference attacks, e.g., membership inference, property or attribute inference,
model stealing, and model inversion. Below, we review these attacks.
Consider a target model f_θ trained on D_tr and a target sample (x, y). Membership inference attacks Shokri
et al. [2017], Sankararaman et al. [2009], Ateniese et al. [2015] aim to infer whether the target sample (x, y)
was used to train the target model, i.e., whether (x, y) ∈ D_tr. Property or attribute inference attacks Melis
et al. [2019], Song and Shmatikov [2019] aim to infer certain attributes of (x, y) based on the model's
inference-time representation of (x, y). For instance, even if f_θ is just a gender classifier, f_θ(x) may
reveal the race of the person in x. Model stealing attacks Tramèr et al. [2016], Orekondy et al. [2019] aim to
reconstruct the parameters θ of the original model f_θ based on black-box access to f_θ, i.e., using f_θ(x).
Model inversion attacks Fredrikson et al. [2015] aim to reconstruct the whole training data D_tr based on
white-box access to the model, i.e., using θ, or black-box access, i.e., using f_θ(x).
2.3 Deep Learning with Differential Privacy
Differential privacy Dwork et al. [2006], Dwork [2008], Dwork and Roth [2014] is a notion to quantify the
privacy leakage from the outputs of a data analysis procedure and is the gold standard for data privacy. It is
formally defined as follows:
Definition 2.1 (Differential Privacy). A randomized algorithm M with domain D and range R preserves
(ε, δ)-differential privacy iff for any two neighboring datasets D, D′ ∈ D and for any subset S ⊆ R we have:

    Pr[M(D) ∈ S] ≤ e^ε · Pr[M(D′) ∈ S] + δ    (1)

where ε is the privacy budget and δ is the failure probability.
Rényi Differential Privacy (RDP) is a commonly used relaxation of differential privacy.
Definition 2.2 (Rényi Differential Privacy (RDP) Mironov [2017]). A randomized algorithm M with domain
D is (α, ε)-RDP with order α ∈ (1, ∞) if and only if for any two neighboring datasets D, D′ ∈ D:

    D_α(M(D) || M(D′)) := (1 / (α − 1)) · log E_{δ ∼ M(D′)} [ (Pr[M(D) = δ] / Pr[M(D′) = δ])^α ] ≤ ε    (2)
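As a standard example not worked out in the text above, the Gaussian mechanism illustrates how such RDP
guarantees arise; the following closed form is the well-known bound from Mironov [2017] for a query with
ℓ2-sensitivity Δ perturbed by Gaussian noise of standard deviation σ.

    % Renyi divergence of order \alpha between two Gaussians whose means differ by
    % the sensitivity \Delta; hence adding N(0, \sigma^2 I) noise to a query with
    % \ell_2-sensitivity \Delta gives (\alpha, \alpha\Delta^2 / (2\sigma^2))-RDP
    % simultaneously for all orders \alpha > 1.
    D_\alpha\!\left(\mathcal{N}(\mu, \sigma^2) \,\big\|\, \mathcal{N}(\mu + \Delta, \sigma^2)\right)
      \;=\; \frac{\alpha \Delta^2}{2\sigma^2}.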
There are two key properties of DP algorithms that will be useful in this work: composition and
post-processing. Below we briefly review these two properties specifically for the widely used Rényi-DP
definition, but they apply to all DP algorithms.
4
Lemma 1 (Adaptive Composition of RDP Mironov [2017]). Consider two randomized mechanisms M_1 and M_2 that
provide (α, ε_1)-RDP and (α, ε_2)-RDP, respectively. Composing M_1 and M_2 results in a mechanism with
(α, ε_1 + ε_2)-RDP.

Lemma 2 (Post-processing of RDP Mironov [2017]). Given a randomized mechanism that is (α, ε)-RDP, applying a
randomized mapping function to it does not increase its privacy budget, i.e., it will result in another
(α, ε)-RDP mechanism.
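To make these two properties concrete, the sketch below tracks a simple RDP budget for T adaptive
applications of the Gaussian mechanism at a fixed order α and converts it to an (ε, δ)-DP guarantee using the
standard conversion from Mironov [2017]. This is an illustrative accountant only; the analyses used by
DP-SGD and DP-FTRL are tighter (e.g., subsampling-aware), and the example hyperparameters are hypothetical.

    import math

    def gaussian_rdp(alpha, sensitivity, sigma):
        """RDP of one Gaussian-mechanism release: eps(alpha) = alpha * Delta^2 / (2 * sigma^2)."""
        return alpha * sensitivity**2 / (2 * sigma**2)

    def compose_rdp(per_step_eps, num_steps):
        """Lemma 1 (adaptive composition): RDP budgets at the same order simply add up."""
        return per_step_eps * num_steps

    def rdp_to_dp(alpha, rdp_eps, delta):
        """Convert (alpha, eps)-RDP to (eps', delta)-DP: eps' = eps + log(1/delta) / (alpha - 1)."""
        return rdp_eps + math.log(1.0 / delta) / (alpha - 1)

    # Hypothetical example: 100 noisy steps, sensitivity 1, noise std 5, delta = 1e-5.
    alpha, sigma, sensitivity, steps, delta = 8.0, 5.0, 1.0, 100, 1e-5
    eps_rdp = compose_rdp(gaussian_rdp(alpha, sensitivity, sigma), steps)
    eps_dp = rdp_to_dp(alpha, eps_rdp, delta)
    # Post-processing (Lemma 2): averaging or otherwise transforming the released
    # checkpoints leaves eps_dp unchanged.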
2.3.1 Differentially Private ML Algorithms We Use
Several works have used differential privacy in traditional machine learning to protect the privacy of the
training data Li et al. [2014], Chaudhuri et al. [2011], Feldman et al. [2018], Zhang et al. [2016], Bassily et al.
[2014b]. We use two of the commonly-used algorithms for DP deep learning: DP-SGD Abadi et al. [2016b],
and DP-FTRL Kairouz et al. [2021]. At a high level, to update the model in each training round, DP-SGD
first samples a minibatch of examples uniformly at random, clips the gradient of each example to limit the
sensitivity of a gradient update, and then adds independent Gaussian noise, calibrated to achieve the desired
DP guarantee, to the aggregate of the clipped gradients. In contrast, in each training round, DP-FTRL takes a minibatch of
examples (no requirement of sampling), clips each example’s gradient to limit sensitivity, and adds correlated
Gaussian noise calibrated to achieve the desired DP guarantee.
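As a reference point for the description above, here is a minimal, framework-agnostic sketch of one DP-SGD
step with per-example clipping and Gaussian noise. The per-example gradient oracle and the hyperparameters
are placeholders, and the choice of noise_multiplier must come from a privacy accountant such as the RDP
analysis sketched earlier.

    import numpy as np

    def dp_sgd_step(theta, per_example_grads, clip_norm, noise_multiplier, learning_rate, rng):
        """One DP-SGD update (sketch): clip each example's gradient, sum, add Gaussian noise, step.

        per_example_grads: array of shape (batch_size, num_params), one gradient per example.
        """
        norms = np.linalg.norm(per_example_grads, axis=1, keepdims=True)
        scale = np.minimum(1.0, clip_norm / (norms + 1e-12))       # per-example clipping factors
        clipped = per_example_grads * scale                        # each row now has norm <= clip_norm
        noisy_sum = clipped.sum(axis=0) + rng.normal(
            scale=noise_multiplier * clip_norm, size=theta.shape)  # noise calibrated to the sum's sensitivity
        grad_estimate = noisy_sum / len(per_example_grads)         # average over the (sampled) minibatch
        return theta - learning_rate * grad_estimate

    # DP-FTRL replaces the independent per-step noise above with correlated noise
    # generated by a tree-aggregation mechanism, and does not require sampling.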
3 Using Checkpoint Aggregates to Improve Accuracy of Differentially Private ML
In this section, we first detail our novel and general adaptive aggregation training framework that leverages
past checkpoints (recall that a checkpoint is just an intermediate model iterate θ_t) during training, and
provide two instantiations of it. We also design four checkpoint aggregation methods that can be used for
inference over a given sequence of checkpoints. Finally, we provide a theoretical analysis of the improved
privacy-utility trade-offs due to some of the checkpoint aggregations.
Why can we post-process intermediate DP ML checkpoints? Before delving into the details of our checkpoint
aggregation methods, it is useful to note that the privacy analyses for the DP algorithms we consider in this
paper, i.e., DP-SGD Abadi et al. [2016b] and DP-FTRL Kairouz et al. [2021], use adaptive composition
(Lemma 1) across training rounds. This implies that all the intermediate checkpoints are also DP, which
allows the release of all intermediate checkpoints computed during training. Furthermore, as all checkpoints
are DP, due to the post-processing property of DP (Lemma 2), one can process/use these checkpoints without
incurring additional privacy cost.
3.1 Using Checkpoint Aggregations for Training
Algorithm 1 describes our general adaptive aggregation training framework. Apart from the parameters needed
to run the DP algorithm A, it uses a checkpoint aggregation function f_AGG to compute an aggregate checkpoint
θ^AGG_{t+1} from the checkpoints (θ_{t+1}, θ_t, ..., θ_0) at each step t. Consequently, A uses θ^AGG_{t+1}
for its next training step. Note that Algorithm 1 has two hyperparameters: (1) τ, which decides when to start
training over the aggregate of past checkpoints, and (2) a parameter p specific to f_AGG, which we detail
below along with the f_AGG's. Due to the post-processing property of DP, using f_AGG does not incur any
additional privacy cost. Though our framework can incorporate any custom f_AGG, we present two natural
instantiations of f_AGG and extensively evaluate them.
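A minimal sketch of this training loop is shown below; the paper's Algorithm 1 is the authoritative version,
and the DP update function, the aggregation function, and the parameter names here are illustrative
placeholders. After a burn-in of τ steps, each new checkpoint is folded into an aggregate that the DP
algorithm then continues training from.

    def train_with_checkpoint_aggregation(theta0, dp_update, f_agg, num_steps, tau):
        """Sketch of the adaptive aggregation framework: train with a DP algorithm,
        but after step tau continue training from the aggregate of past checkpoints.

        dp_update(theta, t) -> theta_{t+1}: one step of a DP algorithm (e.g., DP-SGD or DP-FTRL).
        f_agg(checkpoints)  -> aggregate of the checkpoints so far (e.g., EMA or uniform average).
        """
        checkpoints = [theta0]
        theta = theta0
        for t in range(num_steps):
            theta = dp_update(theta, t)        # DP step; its privacy cost is already accounted for
            checkpoints.append(theta)
            if t >= tau:                       # start training over aggregates after a burn-in of tau steps
                theta = f_agg(checkpoints)     # post-processing: no additional privacy cost (Lemma 2)
        return f_agg(checkpoints)              # release an aggregate (or the last iterate) for inference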
Exponential Moving Average (EMA): Our first proposal uses an EMA function to aggregate all the past
checkpoints at training step t. Starting from the latest checkpoint, EMA assigns exponentially decaying
weights to each of the previous checkpoints. At step t, EMA maintains a moving average θ^EMA_t that is a
weighted average of θ^EMA_{t−1} and the latest checkpoint, θ_t. This is formalized as follows:

    θ^EMA_t = (1 − β_t) · θ^EMA_{t−1} + β_t · θ_t    (3)
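A direct implementation of update (3) is shown below as one possible f_AGG. The constant coefficient
β_t = β used here (playing the role of the hyperparameter p) is an illustrative assumption, not necessarily
the schedule used in the experiments; the checkpoints are assumed to be numpy parameter vectors.

    def ema_aggregate(checkpoints, beta=0.1):
        """EMA instantiation of f_AGG, following Eq. (3) with a constant beta_t = beta.

        Later checkpoints receive exponentially larger weights than earlier ones.
        """
        theta_ema = checkpoints[0]
        for theta_t in checkpoints[1:]:
            theta_ema = (1.0 - beta) * theta_ema + beta * theta_t
        return theta_ema

    # Plugging it into the framework sketched above:
    # final_model = train_with_checkpoint_aggregation(theta0, dp_update,
    #                                                 f_agg=ema_aggregate,
    #                                                 num_steps=T, tau=tau)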