
Gaussian noise addition in each training step, which makes this trade-off even trickier to understand in DP ML [Song et al., 2021]. In this work, we focus on both fronts of the problem.
Our contributions at a glance: First, we design a general framework that (adaptively) uses aggregates of
intermediate checkpoints (i.e., the intermediate iterates of model training) to increase the accuracy of DP ML
techniques. Next, we provide a method to estimate the uncertainty (variance) that DP noise adds to DP ML
training. Crucially, we attain both these goals with a single training run of the DP technique, thus incurring
no additional privacy cost. While the two goals are intertwined, for ease of presentation we separate the
exposition into two parts. In the following, we detail our contributions and place them in the
context of prior work.
Increasing accuracy using checkpoint aggregates (Sections 3 and 4): While the privacy analyses
for state-of-the-art DP ML techniques allow releasing/using all the training checkpoints, prior works in DP
ML [Abadi et al., 2016c, McMahan et al., 2017b, 2018, Erlingsson et al., 2019, Wang et al., 2019b, Zhu and
Wang, 2019, Balle et al., 2020, Erlingsson et al., 2020, Papernot et al., 2020, Tramer and Boneh, 2020, Andrew
et al., 2021, Kairouz et al., 2021, Amid et al., 2022, Feldman et al., 2022] use only the final model output by the
DP algorithm for establishing benchmarks. This is also how DP models are deployed in practice [Ramaswamy
et al., 2020, McMahan et al., 2022]. To our knowledge, De et al. [2022] is the only prior work that re-uses
intermediate checkpoints to increase the accuracy of DP-SGD. They note non-trivial accuracy gains by
post-processing the DP-SGD checkpoints using an exponential moving average (EMA). While [Chen et al.,
2017, Izmailov et al., 2018] explore checkpoint aggregation methods to improve performance in (non-DP) ML
settings, they observe negligible performance gains.
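The EMA post-processing of De et al. [2022] mentioned above can be sketched as follows. This is a minimal illustration, assuming checkpoints are stored as flattened NumPy parameter arrays; the function name and the decay value are illustrative assumptions, not details from any of the cited works:

```python
import numpy as np

def ema_aggregate(checkpoints, decay=0.999):
    """Exponential moving average over a sequence of checkpoints.

    `checkpoints`: list of flattened parameter arrays, one per saved
    training step (earliest first). `decay` controls how much weight
    recent checkpoints receive; values close to 1 average over a
    longer history.
    """
    ema = np.asarray(checkpoints[0], dtype=float)
    for theta in checkpoints[1:]:
        ema = decay * ema + (1.0 - decay) * np.asarray(theta, dtype=float)
    return ema
```

Because the EMA is computed purely from released checkpoints, it is post-processing in the DP sense and incurs no additional privacy cost.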
In this work, we propose a general framework that adaptively uses intermediate checkpoints to increase the
accuracy of state-of-the-art DP ML techniques. To our knowledge, this is the first work to re-use intermediate
checkpoints during DP ML training. Empirically, we demonstrate significant performance gains using our
framework for a next word prediction task with user-level DP for StackOverflow, an image classification
task with sample-level DP for CIFAR10, and an ad-click conversion prediction task with sample-level DP
for a proprietary pCVR dataset. It is worth noting that DP state-of-the-art for benchmark datasets has
repeatedly improved over the years since the foundational techniques from Abadi et al. [2016c] for CIFAR10
and McMahan et al. [2017b] for StackOverflow, hence any consistent improvements are instrumental in
advancing the state of DP ML.
Specifically, we show that training over aggregates of checkpoints achieves a state-of-the-art prediction
accuracy of 22.74% at ε = 8.2 for StackOverflow (i.e., a 2.09% relative gain over DP-FTRL from Kairouz et al.
[2021])¹ and 57.51% at ε = 1 for CIFAR10 (i.e., a 2.7% relative gain over DP-SGD as per De et al. [2022]).
For the CIFAR100 task, we first improve the DP-SGD baseline of De et al. [2022] even without using
any of our aggregation methods. Similar to De et al. [2022], we warm-start DP training on CIFAR100 from
a checkpoint pre-trained on ImageNet. However, we use the EMA checkpoint of the pre-training pipeline
instead of the last checkpoint as in De et al. [2022], which improves DP-SGD performance by 5% and 3.2%
for ε = 1 and ε = 8, respectively. Next, we show that training over aggregates further improves the accuracy on
CIFAR100 by 0.67% to 76.18% at ε = 1 (i.e., a 0.89% relative gain over our improved CIFAR100 DP-SGD
baseline). These benefits further magnify in more practical settings with periodically
varying training data distributions. For instance, we note relative accuracy gains of 2.64% and 2.82% for
ε of 18.9 and 8.2, respectively, for StackOverflow over the DP-FTRL baseline in such a setting. We also experiment
with a proprietary, production-grade pCVR dataset [Denison et al., 2022, Chua et al., 2024] and show that at
ε = 6, training over aggregates of checkpoints improves AUC-loss (i.e., 1 − AUC) by 0.54% (relative) over the
DP-SGD baseline. Note that such an improvement is considered very significant in the context of ads ranking.
Theoretically, we show in Theorem 3.2 that for standard training regimes, the excess empirical risk of the
final checkpoint of DP-SGD is log(n) times larger than that of the weighted average of the past k checkpoints,
where n is the size of the dataset. It would be interesting to theoretically analyze the use of checkpoint aggregations
during training, which we leave as future work.
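As a toy illustration of the kind of checkpoint aggregation Theorem 3.2 concerns, the following sketch averages the last k checkpoints. The uniform weights here are a simplifying assumption for illustration; the theorem itself concerns a weighted average, whose weights are not reproduced here:

```python
import numpy as np

def tail_average(checkpoints, k):
    """Uniform average of the last k checkpoints.

    `checkpoints`: list of flattened parameter arrays, one per saved
    training step (earliest first). Returns the element-wise mean of
    the final k arrays.
    """
    tail = [np.asarray(c, dtype=float) for c in checkpoints[-k:]]
    return np.mean(np.stack(tail), axis=0)
```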
Uncertainty quantification using intermediate checkpoints (Section 5): There are various sources
of randomness in an ML training pipeline [Abdar et al., 2021], e.g., choice of initial parameters, dataset,
batching, etc. This randomness induces uncertainty in the predictions made using such ML models. In
critical domains, e.g., medical diagnosis, self-driving cars and financial market analysis, failing to capture the
¹These improvements are notable since there are 10k classes in StackOverflow data.