Recycling Scraps: Improving Private Learning by Leveraging Checkpoints

Virat Shejwalkar*, Arun Ganesh†, Rajiv Mathews†, Yarong Mu†, Shuang Song†, Om Thakkar†,
Abhradeep Thakurta†, Xinyi Zheng†

Google
{vshejwalkar, arunganesh, mathews, ymu, shuangsong, omthkkr, athakurta, cazheng}@google.com

* Work done while the author was an intern at Google.
† Listed in alphabetical order.
Abstract
In this work, we focus on improving the accuracy-variance trade-off for state-of-the-art differentially
private machine learning (DP ML) methods. First, we design a general framework that uses aggregates
of intermediate checkpoints during training to increase the accuracy of DP ML techniques. Specifically,
we demonstrate that training over aggregates can provide significant gains in prediction accuracy over
the existing state-of-the-art for StackOverflow, CIFAR10 and CIFAR100 datasets. For instance, we
improve the state-of-the-art DP StackOverflow accuracies to 22.74% (+2.06% relative) for ε = 8.2, and
23.90% (+2.09%) for ε = 18.9. Furthermore, these gains magnify in settings with periodically varying
training data distributions. We also demonstrate that our methods achieve relative improvements of
0.54% and 62.6% in terms of utility and variance, on a proprietary, production-grade pCVR task. Lastly,
we initiate an exploration into estimating the uncertainty (variance) that DP noise adds in the predictions
of DP ML models. We prove that, under standard assumptions on the loss function, the sample variance
from the last few checkpoints provides a good approximation of the variance of the final model of a DP run.
Empirically, we show that the last few checkpoints can provide a reasonable lower bound for the variance
of a converged DP model. Crucially, all the methods proposed in this paper operate on a single training
run of the DP ML technique, thus incurring no additional privacy cost.
1 Introduction
Machine learning models can unintentionally memorize sensitive information about the data they were trained
on, which has led to numerous attacks that extract private information about the training data [Ateniese
et al., 2013, Fredrikson et al., 2014, 2015, Carlini et al., 2019, Shejwalkar et al., 2021, Carlini et al., 2021,
2022]. For instance, membership inference attacks [Shokri et al., 2017] can infer whether a target sample
was used to train a given ML model, while property inference attacks [Melis et al., 2019, Mahloujifar et al.,
2022] can infer certain sensitive properties of the training data. To address such privacy risks, the literature has
introduced various approaches to privacy-preserving ML [Nasr et al., 2018, Shejwalkar and Houmansadr, 2021,
Tang et al., 2022]. In particular, iterative techniques like differentially private stochastic gradient descent
(DP-SGD) [Song et al., 2013, Bassily et al., 2014a, Abadi et al., 2016c, McMahan et al., 2017b] and DP Follow
The Regularized Leader (DP-FTRL) [Kairouz et al., 2021] have become the state-of-the-art for training DP
neural networks.
The accuracy-variance trade-off is a central problem in machine learning. Note that here, we use the
term accuracy to refer to the primary evaluation metric of a model on the training/test data sets, e.g.,
accuracy for datasets like CIFAR10 and StackOverflow, and AUC-loss (i.e., 1 - AUC) for datasets like pCVR.
Techniques like DP-SGD and DP-FTRL involve the operation of per-example gradient clipping and calibrated
Gaussian noise addition in each training step, which makes this trade-off even trickier to understand in DP
ML [Song et al., 2021]. In this work, we focus on both fronts of the problem.
Our contributions at a glance: First, we design a general framework that (adaptively) uses aggregates of
intermediate checkpoints (i.e., the intermediate iterates of model training) to increase the accuracy of DP ML
techniques. Next, we provide a method to estimate the uncertainty (variance) that DP noise adds to DP ML
training. Crucially, we attain both these goals with a single training run of the DP technique, thus incurring
no additional privacy cost. While both goals are intertwined, for ease of presentation, we separate the
exposition into two parts. In the following, we provide the details of our contributions, and place them in the
context of prior works.
Increasing accuracy using checkpoint aggregates (Sections 3 and 4): While the privacy analyses
for state-of-the-art DP ML techniques allow releasing/using all the training checkpoints, prior works in DP
ML [Abadi et al., 2016c, McMahan et al., 2017b, 2018, Erlingsson et al., 2019, Wang et al., 2019b, Zhu and
Wang, 2019, Balle et al., 2020, Erlingsson et al., 2020, Papernot et al., 2020, Tramer and Boneh, 2020, Andrew
et al., 2021, Kairouz et al., 2021, Amid et al., 2022, Feldman et al., 2022] use only the final model output by the
DP algorithm for establishing benchmarks. This is also how DP models are deployed in practice [Ramaswamy
et al., 2020, McMahan et al., 2022]. To our knowledge, De et al. [2022] is the only prior work that re-uses
intermediate checkpoints to increase the accuracy of DP-SGD. They note non-trivial accuracy gains by
post-processing the DP-SGD checkpoints using an exponential moving average (EMA). While [Chen et al.,
2017, Izmailov et al., 2018] explore checkpoint aggregation methods to improve performance in (non-DP) ML
settings, they observe negligible performance gains.
In this work, we propose a general framework that adaptively uses intermediate checkpoints to increase the
accuracy of state-of-the-art DP ML techniques. To our knowledge, this is the first work to re-use intermediate
checkpoints during DP ML training. Empirically, we demonstrate significant performance gains using our
framework for a next word prediction task with user-level DP for StackOverflow, an image classification
task with sample-level DP for CIFAR10, and an ad-click conversion prediction task with sample-level DP
for a proprietary pCVR dataset. It is worth noting that the DP state-of-the-art for benchmark datasets has
repeatedly improved over the years since the foundational techniques from Abadi et al. [2016c] for CIFAR10
and McMahan et al. [2017b] for StackOverflow, hence any consistent improvements are instrumental in
advancing the state of DP ML.
Specifically, we show that training over aggregates of checkpoints achieves state-of-the-art prediction
accuracy of 22.74% at ε = 8.2 for StackOverflow (i.e., 2.09% relative gain over DP-FTRL from Kairouz et al.
[2021])^1, and 57.51% at ε = 1 for CIFAR10 (i.e., 2.7% relative gain over DP-SGD as per De et al. [2022]),
respectively. For the CIFAR100 task, we first improve the DP-SGD baseline of De et al. [2022] even without using
any of our aggregation methods. Similar to De et al. [2022], we warm-start DP training on CIFAR100 from
a checkpoint pre-trained on ImageNet. However, we use the EMA checkpoint of the pre-training pipeline
instead of the last checkpoint as in De et al. [2022], and improve DP-SGD performance by 5% and 3.2%
for ε of 1 and 8, respectively. Next, we show that training over aggregates further improves the accuracy on
CIFAR100 by 0.67% to 76.18% at ε = 1 (i.e., 0.89% relative gain over our improved CIFAR100 DP-SGD
baseline). We then show that these benefits further magnify in more practical settings with periodically
varying training data distributions. For instance, we note relative accuracy gains of 2.64% and 2.82% for
ε of 18.9 and 8.2, respectively, for StackOverflow over the DP-FTRL baseline in such a setting. We also experiment
with a proprietary, production-grade pCVR dataset Denison et al. [2022], Chua et al. [2024] and show that at
ε = 6, training over aggregates of checkpoints improves AUC-loss (i.e., 1 − AUC) by 0.54% (relative) over the
DP-SGD baseline. Note that such an improvement is considered very significant in the context of ads ranking.
Theoretically, we show in Theorem 3.2 that for standard training regimes, the excess empirical risk of the
final checkpoint of DP-SGD is log(n) times larger than that of the weighted average of the past k checkpoints,
where n is the size of the dataset. It would be interesting to theoretically analyze the use of checkpoint
aggregations during training, which we leave as future work.
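For reference, the kind of aggregate Theorem 3.2 compares against is a weighted average of the last k
checkpoints. The generic form below is a sketch; the exact weights analyzed in Theorem 3.2 are specified
in Section 3.

    % Weighted average of the last k checkpoints theta_{T-k+1}, ..., theta_T of a
    % T-step DP-SGD run; the weights are nonnegative and sum to one. The uniform
    % choice w_i = 1/k recovers plain tail averaging.
    \bar{\theta}_{T,k} \;=\; \sum_{i=T-k+1}^{T} w_i\,\theta_i,
    \qquad w_i \ge 0,\quad \sum_{i=T-k+1}^{T} w_i = 1.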
Uncertainty quantification using intermediate checkpoints (Section 5): There are various sources
of randomness in an ML training pipeline [Abdar et al., 2021], e.g., choice of initial parameters, dataset,
batching, etc. This randomness induces uncertainty in the predictions made using such ML models. In
critical domains, e.g., medical diagnosis, self-driving cars and financial market analysis, failing to capture the
^1 These improvements are notable since there are 10k classes in StackOverflow data.
uncertainty in such predictions can have undesirable repercussions. DP learning adds an additional source of
randomness by injecting noise at every training round. Hence, it is paramount to quantify the reliability of
DP models, e.g., by quantifying the uncertainty in their predictions.
In prior work, Karwa and Vadhan [2017] develop finite-sample confidence intervals, but only for the simpler
Gaussian mean estimation problem. Various methods exist for uncertainty quantification in ML-based
systems [Mitchell, 1980, Roy et al., 2018, Begoli et al., 2019, Hubschneider et al., 2019, McDermott and Wikle,
2019, Tagasovska and Lopez-Paz, 2019, Wang et al., 2019a, Nair et al., 2020, Ferrando et al., 2022]. However,
these methods either use specialized (or simpler) model architectures to facilitate uncertainty quantification,
or are not directly applicable to quantify the uncertainty in DP ML due to DP noise. For example, a common
way of uncertainty quantification [Barrientos et al., 2019, Nissim et al., 2007, Brawner and Honaker, 2018,
Evans et al., 2020], which we call the independent runs method, needs k independent (bootstrap) runs of the
ML algorithm. However, repeating a DP ML algorithm multiple times can incur significant privacy and
computation costs.
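For concreteness, a minimal sketch of this independent runs baseline is given below. The training routine,
the naive privacy-budget split, and the statistic being measured are illustrative placeholders, not the exact
experimental setup used in this paper.

    import numpy as np

    def independent_runs_variance(train_fn, k, total_epsilon, delta, score_fn, x):
        """Hypothetical sketch of the independent-runs uncertainty baseline.

        Trains k DP models, each on a 1/k share of the privacy budget (by basic
        composition), and returns the sample variance of a scalar statistic
        (e.g., the score of the predicted label on input x) across the k runs.
        """
        per_run_epsilon = total_epsilon / k   # naive budget split; tighter accounting is possible
        scores = []
        for seed in range(k):
            model = train_fn(epsilon=per_run_epsilon, delta=delta, seed=seed)  # one full DP training run
            scores.append(score_fn(model, x))
        scores = np.asarray(scores)
        return scores.var(ddof=1)             # unbiased sample variance across runs

The k-fold cost in both privacy budget and training compute is exactly what the single-run, checkpoint-based
estimator described next avoids.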
To this end, for the first time, we quantify the uncertainty that DP noise adds to the DP training procedure
using only a single training run. We propose to use the last k checkpoints of a single run of a DP ML
algorithm as a proxy for the k final checkpoints from independent runs. This does not incur any additional
privacy cost to the DP ML algorithm. Furthermore, it is useful in practice as it does not incur additional
training compute, and it can work with any algorithm having intermediate checkpoints. Finally, it does not
require changing the underlying model or algorithm, unlike some other methods for uncertainty estimation
(e.g., the use of Bayesian neural networks Zhang et al. [2021]).
Theoretically, we consider using (a rescaling of) the sample variance of a statistic f(θ) at checkpoints
θ_{t_1}, ..., θ_{t_k} as an estimator of the variance of any convex combination of f(θ_{t_i}), i.e., any
weighted average of the statistics at the checkpoints, and give a bound on the bias of this estimator. As
expected, our bound on the error decreases as the "burn-in" time t_1 and the time t_2 − t_1 between
checkpoints both increase. An upshot of this analysis is that getting k nearly i.i.d. checkpoints requires
fewer iterations than running k independent runs of t_1 iterations each. In turn, under a fixed privacy
constraint, using the sample variance of the checkpoints can provide more samples and thus tighter confidence
intervals than the independent runs method; see the remark in Section 5 for details.
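A minimal sketch of this single-run estimator is shown below. It treats the last k checkpoints as
approximately i.i.d. draws, uses uniform weights w_i = 1/k as an example, and leaves the statistic f as a
placeholder; the precise estimator and its bias bound appear in Section 5.

    import numpy as np

    def checkpoint_variance_estimate(checkpoints, f, weights=None):
        """Hypothetical sketch: estimate the variance of a weighted average of a
        statistic f evaluated at the last k checkpoints of ONE DP training run.

        If the checkpoints behaved like i.i.d. draws, the weighted average
        sum_i w_i * f(theta_i) would have variance (sum_i w_i**2) * Var(f(theta)),
        so the sample variance across checkpoints is rescaled accordingly.
        """
        values = np.asarray([f(theta) for theta in checkpoints])  # f(theta_{t_1}), ..., f(theta_{t_k})
        k = len(values)
        if weights is None:
            weights = np.full(k, 1.0 / k)                         # uniform weighting as an example
        sample_var = values.var(ddof=1)                           # sample variance across checkpoints
        return float(np.sum(weights**2) * sample_var)

    # Usage: f could be the predicted-label score on a fixed query point x, e.g.,
    # variance = checkpoint_variance_estimate(last_k_checkpoints,
    #                                         f=lambda theta: score(theta, x))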
Intuitively, our proof shows that (i) as the burn-in time increases, the marginal distribution of each θ_{t_i}
approaches the distribution of θ_{t_k}, and (ii) as the time between checkpoints increases, any pair
θ_{t_i}, θ_{t_j} approaches pairwise independence. We prove both (i) and (ii) via a mixing-time bound, which
shows that, starting from any point distribution θ_0, the Markov chain given by DP-SGD approaches its
stationary distribution at a certain rate.
Empirically, we show that our method provides reasonable lower bounds on the uncertainty quantified using
the more accurate (but privacy- and computation-intensive) method that uses independent runs. For instance,
we show that for DP-FTRL-trained StackOverflow, the 95% confidence widths for the scores of the predicted
labels computed using the independent runs method (no budget split)^2 are always within a factor of 2 of the
widths provided by our method, for various privacy levels and numbers of bootstrap samples.
While we compute the variance with respect to a fixed prediction function, we believe our estimator can be
used to obtain DP parameter confidence intervals for traditional statistical estimators (e.g., linear regression).
We leave this direction for future exploration.
2 Background and Preliminaries
In this section, we briefly introduce the background on machine learning, privacy leakage in machine learning
models, differential privacy, and deep learning with differential privacy.
2.1 Machine Learning
In this paper, we consider machine learning (ML) models used for image classification and language next-
word-prediction tasks. We use supervised machine learning for both types of tasks and briefly review it
below.
^2 This is thus a superior baseline, as it does not split the privacy budget among the independent runs.
Let f_θ : ℝ^d → ℝ^k be an ML classifier (e.g., a neural network) with d input features and k classes, which is
parameterized by θ. For a given example z = (x, y), f_θ(x) is the classifier's confidence vector over the k
classes, and the predicted label is the class with the largest confidence score, i.e., ŷ = arg max_i f_θ(x)_i.

The goal of supervised machine learning is to learn the relationship between features and labels in given
labeled training data D^l_tr and to generalize this ability to unseen data. The model learns this relationship
using empirical risk minimization (ERM) on the training set D^l_tr, where the risk is measured in terms of a
certain loss function, e.g., the cross-entropy loss:

    min_θ (1 / |D^l_tr|) Σ_{z ∈ D^l_tr} ℓ(f_θ, z)

Here |D^l_tr| is the size of the labeled training set and ℓ(f_θ, z) is the loss function. When clear from the
context, we use f instead of f_θ to denote the target model.
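As a small illustration of these definitions, the following sketch computes the predicted label and the
empirical risk for a generic classifier; the model and data here are placeholders, not the architectures or
datasets used in the experiments.

    import numpy as np

    def predict_label(confidences):
        """Predicted label: the class with the largest confidence score, y_hat = argmax_i f_theta(x)_i."""
        return int(np.argmax(confidences))

    def empirical_risk(model, data):
        """Average cross-entropy loss of `model` over labeled data [(x, y), ...].

        `model(x)` is assumed to return a probability vector over the k classes.
        """
        losses = []
        for x, y in data:
            probs = model(x)
            losses.append(-np.log(probs[y] + 1e-12))  # cross-entropy for the true class y
        return float(np.mean(losses))                 # (1/|D_tr|) * sum of per-example losses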
2.2 Privacy Leakage in ML Models
ML models generally require large amounts of training data to achieve good performance. This data can
be of a sensitive nature, e.g., medical records and personal photographs, and without proper precautions,
ML models may leak sensitive information about their private training data. Multiple previous works have
demonstrated this via various inference attacks, e.g., membership inference, property or attribute inference,
model stealing, and model inversion. Below, we review these attacks.
Consider a target model f_θ trained on D_tr and a target sample (x, y). Membership inference attacks Shokri
et al. [2017], Sankararaman et al. [2009], Ateniese et al. [2015] aim to infer whether the target sample (x, y)
was used to train the target model, i.e., whether (x, y) ∈ D_tr. Property or attribute inference attacks Melis
et al. [2019], Song and Shmatikov [2019] aim to infer certain attributes of (x, y) based on the model's
inference-time representation of (x, y). For instance, even if f_θ is just a gender classifier, f_θ(x) may
reveal the race of the person in x. Model stealing attacks Tramèr et al. [2016], Orekondy et al. [2019] aim to
reconstruct the parameters θ of the original model f_θ based on black-box access to f_θ, i.e., using f_θ(x).
Model inversion attacks Fredrikson et al. [2015] aim to reconstruct the whole training data D_tr based on
white-box access to the model, i.e., using θ, or black-box access, i.e., using f_θ(x).
2.3 Deep Learning with Differential Privacy
Differential privacy Dwork et al. [2006], Dwork [2008], Dwork and Roth [2014] is a notion to quantify the
privacy leakage from the outputs of a data analysis procedure and is the gold standard for data privacy. It is
formally defined as follows:
Definition 2.1 (Differential Privacy). A randomized algorithm M with domain D and range R preserves
(ε, δ)-differential privacy iff for any two neighboring datasets D, D′ ∈ D and for any subset S ⊆ R we have:

    Pr[M(D) ∈ S] ≤ e^ε · Pr[M(D′) ∈ S] + δ    (1)

where ε is the privacy budget and δ is the failure probability.
Rényi Differential Privacy (RDP) is a commonly used relaxation of differential privacy.
Definition 2.2 (Rényi Differential Privacy (RDP) Mironov [2017]). A randomized algorithm M with domain
D is (α, ε)-RDP with order α ∈ (1, ∞) if and only if for any two neighboring datasets D, D′ ∈ D:

    D_α(M(D) || M(D′)) := (1 / (α − 1)) · log E_{δ ∼ M(D′)} [ (Pr[M(D) = δ] / Pr[M(D′) = δ])^α ] ≤ ε    (2)
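As a standard example not worked out in the text above, the Gaussian mechanism illustrates how such RDP
guarantees arise; the following closed form is the well-known bound from Mironov [2017] for a query with
ℓ2-sensitivity Δ perturbed by Gaussian noise of standard deviation σ.

    % Renyi divergence of order \alpha between two Gaussians whose means differ by
    % the sensitivity \Delta; hence adding N(0, \sigma^2 I) noise to a query with
    % \ell_2-sensitivity \Delta gives (\alpha, \alpha\Delta^2 / (2\sigma^2))-RDP
    % simultaneously for all orders \alpha > 1.
    D_\alpha\!\left(\mathcal{N}(\mu, \sigma^2) \,\big\|\, \mathcal{N}(\mu + \Delta, \sigma^2)\right)
      \;=\; \frac{\alpha \Delta^2}{2\sigma^2}.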
There are two key properties of DP algorithms that will be useful in this work: composition and
post-processing. Below we briefly review these two properties specifically for the widely used Rényi-DP
definition, but they apply to all DP algorithms.
4
Lemma 1 (Adaptive Composition of RDP Mironov [2017]). Consider two randomized mechanisms M_1 and M_2 that
provide (α, ε_1)-RDP and (α, ε_2)-RDP, respectively. Composing M_1 and M_2 results in a mechanism with
(α, ε_1 + ε_2)-RDP.

Lemma 2 (Post-processing of RDP Mironov [2017]). Given a randomized mechanism that is (α, ε)-RDP, applying a
randomized mapping function to it does not increase its privacy budget, i.e., it will result in another
(α, ε)-RDP mechanism.
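To make these two properties concrete, the sketch below tracks a simple RDP budget for T adaptive
applications of the Gaussian mechanism at a fixed order α and converts it to an (ε, δ)-DP guarantee using the
standard conversion from Mironov [2017]. This is an illustrative accountant only; the analyses used by
DP-SGD and DP-FTRL are tighter (e.g., subsampling-aware), and the example hyperparameters are hypothetical.

    import math

    def gaussian_rdp(alpha, sensitivity, sigma):
        """RDP of one Gaussian-mechanism release: eps(alpha) = alpha * Delta^2 / (2 * sigma^2)."""
        return alpha * sensitivity**2 / (2 * sigma**2)

    def compose_rdp(per_step_eps, num_steps):
        """Lemma 1 (adaptive composition): RDP budgets at the same order simply add up."""
        return per_step_eps * num_steps

    def rdp_to_dp(alpha, rdp_eps, delta):
        """Convert (alpha, eps)-RDP to (eps', delta)-DP: eps' = eps + log(1/delta) / (alpha - 1)."""
        return rdp_eps + math.log(1.0 / delta) / (alpha - 1)

    # Hypothetical example: 100 noisy steps, sensitivity 1, noise std 5, delta = 1e-5.
    alpha, sigma, sensitivity, steps, delta = 8.0, 5.0, 1.0, 100, 1e-5
    eps_rdp = compose_rdp(gaussian_rdp(alpha, sensitivity, sigma), steps)
    eps_dp = rdp_to_dp(alpha, eps_rdp, delta)
    # Post-processing (Lemma 2): averaging or otherwise transforming the released
    # checkpoints leaves eps_dp unchanged.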
2.3.1 Differentially Private ML Algorithms We Use
Several works have used differential privacy in traditional machine learning to protect the privacy of the
training data Li et al. [2014], Chaudhuri et al. [2011], Feldman et al. [2018], Zhang et al. [2016], Bassily et al.
[2014b]. We use two of the commonly-used algorithms for DP deep learning: DP-SGD Abadi et al. [2016b],
and DP-FTRL Kairouz et al. [2021]. At a high level, to update the model in each training round, DP-SGD
first samples a minibatch of examples uniformly at random, clips the gradient of each example to limit the
sensitivity of a gradient update, and then adds independent Gaussian noise, calibrated to achieve the desired
DP guarantee, to the aggregate of the clipped gradients. In contrast, in each training round, DP-FTRL takes a minibatch of
examples (no requirement of sampling), clips each example’s gradient to limit sensitivity, and adds correlated
Gaussian noise calibrated to achieve the desired DP guarantee.
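As a reference point for the description above, here is a minimal, framework-agnostic sketch of one DP-SGD
step with per-example clipping and Gaussian noise. The per-example gradient oracle and the hyperparameters
are placeholders, and the choice of noise_multiplier must come from a privacy accountant such as the RDP
analysis sketched earlier.

    import numpy as np

    def dp_sgd_step(theta, per_example_grads, clip_norm, noise_multiplier, learning_rate, rng):
        """One DP-SGD update (sketch): clip each example's gradient, sum, add Gaussian noise, step.

        per_example_grads: array of shape (batch_size, num_params), one gradient per example.
        """
        norms = np.linalg.norm(per_example_grads, axis=1, keepdims=True)
        scale = np.minimum(1.0, clip_norm / (norms + 1e-12))       # per-example clipping factors
        clipped = per_example_grads * scale                        # each row now has norm <= clip_norm
        noisy_sum = clipped.sum(axis=0) + rng.normal(
            scale=noise_multiplier * clip_norm, size=theta.shape)  # noise calibrated to the sum's sensitivity
        grad_estimate = noisy_sum / len(per_example_grads)         # average over the (sampled) minibatch
        return theta - learning_rate * grad_estimate

    # DP-FTRL replaces the independent per-step noise above with correlated noise
    # generated by a tree-aggregation mechanism, and does not require sampling.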
3 Using Checkpoint Aggregates to Improve Accuracy of Differentially Private ML
In this section, we first detail our novel and general adaptive aggregation training framework that leverages
past checkpoints (recall that a checkpoint is just an intermediate model iterate θ_t) during training, and
provide two instantiations of it. We also design four checkpoint aggregation methods that can be used for
inference over a given sequence of checkpoints. Finally, we provide a theoretical analysis of the improved
privacy-utility trade-offs due to some of the checkpoint aggregations.
Why can we post-process intermediate DP ML checkpoints? Before delving into the details of our checkpoint
aggregation methods, it is useful to note that the privacy analyses for the DP algorithms we consider in this
paper, i.e., DP-SGD Abadi et al. [2016b] and DP-FTRL Kairouz et al. [2021], use adaptive composition
(Lemma 1) across training rounds. This implies that all the intermediate checkpoints are also DP, which
allows the release of all intermediate checkpoints computed during training. Furthermore, as all checkpoints
are DP, due to the post-processing property of DP (Lemma 2), one can process/use these checkpoints without
incurring additional privacy cost.
3.1 Using Checkpoint Aggregations for Training
Algorithm 1 describes our general adaptive aggregation training framework. Apart from the parameters needed
to run the DP algorithm A, it uses a checkpoint aggregation function f_AGG to compute an aggregate checkpoint
θ^AGG_{t+1} from the checkpoints (θ_{t+1}, θ_t, ..., θ_0) at each step t. Consequently, A uses θ^AGG_{t+1}
for its next training step. Note that Algorithm 1 has two hyperparameters: (1) τ, which decides when to start
training over the aggregate of past checkpoints, and (2) a parameter p specific to f_AGG, which we detail
below along with the f_AGG's. Due to the post-processing property of DP, using f_AGG does not incur any
additional privacy cost. Though our framework can incorporate any custom f_AGG, we present two natural
instantiations of f_AGG and extensively evaluate them.
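A minimal sketch of this training loop is shown below; the paper's Algorithm 1 is the authoritative version,
and the DP update function, the aggregation function, and the parameter names here are illustrative
placeholders. After a burn-in of τ steps, each new checkpoint is folded into an aggregate that the DP
algorithm then continues training from.

    def train_with_checkpoint_aggregation(theta0, dp_update, f_agg, num_steps, tau):
        """Sketch of the adaptive aggregation framework: train with a DP algorithm,
        but after step tau continue training from the aggregate of past checkpoints.

        dp_update(theta, t) -> theta_{t+1}: one step of a DP algorithm (e.g., DP-SGD or DP-FTRL).
        f_agg(checkpoints)  -> aggregate of the checkpoints so far (e.g., EMA or uniform average).
        """
        checkpoints = [theta0]
        theta = theta0
        for t in range(num_steps):
            theta = dp_update(theta, t)        # DP step; its privacy cost is already accounted for
            checkpoints.append(theta)
            if t >= tau:                       # start training over aggregates after a burn-in of tau steps
                theta = f_agg(checkpoints)     # post-processing: no additional privacy cost (Lemma 2)
        return f_agg(checkpoints)              # release an aggregate (or the last iterate) for inference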
Exponential Moving Average (EMA): Our first proposal uses an EMA function to aggregate all the past
checkpoints at training step t. Starting from the latest checkpoint, EMA assigns exponentially decaying
weights to each of the previous checkpoints. At step t, EMA maintains a moving average θ^EMA_t that is a
weighted average of θ^EMA_{t−1} and the latest checkpoint, θ_t. This is formalized as follows:

    θ^EMA_t = (1 − β_t) · θ^EMA_{t−1} + β_t · θ_t    (3)
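A direct implementation of update (3) is shown below as one possible f_AGG. The constant coefficient
β_t = β used here (playing the role of the hyperparameter p) is an illustrative assumption, not necessarily
the schedule used in the experiments; the checkpoints are assumed to be numpy parameter vectors.

    def ema_aggregate(checkpoints, beta=0.1):
        """EMA instantiation of f_AGG, following Eq. (3) with a constant beta_t = beta.

        Later checkpoints receive exponentially larger weights than earlier ones.
        """
        theta_ema = checkpoints[0]
        for theta_t in checkpoints[1:]:
            theta_ema = (1.0 - beta) * theta_ema + beta * theta_t
        return theta_ema

    # Plugging it into the framework sketched above:
    # final_model = train_with_checkpoint_aggregation(theta0, dp_update,
    #                                                 f_agg=ema_aggregate,
    #                                                 num_steps=T, tau=tau)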