Federated Calibration and Evaluation of Binary Classifiers
Graham Cormode*   Igor Markov†
*Meta AI, gcormode@meta.com   †Meta AI, imarkov@meta.com
Abstract
We address two major obstacles to practical use of supervised classifiers on distributed private data. Whether a classifier was trained by a federation of cooperating clients or trained centrally out of distribution, (1) the output scores must be calibrated, and (2) performance metrics must be evaluated — all without assembling labels in one place. In particular, we show how to perform calibration and compute precision, recall, accuracy and ROC-AUC in the federated setting under three privacy models: (i) secure aggregation, (ii) distributed differential privacy, and (iii) local differential privacy. Our theorems and experiments clarify tradeoffs between privacy, accuracy, and data efficiency. They also help decide whether a given application has sufficient data to support federated calibration and evaluation.
1 Introduction
Modern machine learning (ML) draws insights from large amounts of data, e.g., by training prediction models on
collected labels. Traditional ML workflows assemble all data in one place, but human-generated data — the content
viewed online and reactions to this content, geographic locations, the text typed, sound and images recorded by portable
devices, interactions with friends, interactions with online ads, online purchases, etc. — is subject to privacy constraints
and cannot be shared easily (Cohen & Nissim, 2020). This raises a major challenge: extending traditional ML techniques
to accommodate a federation of cooperating distributed clients with individually private data. In a general framework
to address this challenge, instead of collecting the data for centralized processing, clients evaluate a candidate model
on their labeled data, and send updates to a server for aggregation. For example, federated training performs learning
locally on the data in parallel in a privacy-respecting manner: the updates are gradients that are summed by the server
and then used to revise the model (McMahan et al., 2017).
Federated learning (Kairouz et al., 2021) can offer privacy guarantees to clients. In a baseline level of disclosure limitation, the clients never share raw data but only send out model updates. Formal privacy guarantees are obtained via careful aggregation and by adding noise to updates (Wei et al., 2019). Given the prominence of federated learning, different aspects of the training process (usually for neural networks) have received much attention, aiming to optimize training speed, reduce communication and tighten privacy guarantees. However, deploying trained models usually requires one to (i) evaluate and track their performance on distributed (private) data, (ii) select the best model from available alternatives, and (iii) calibrate a given model to frequent snapshots of non-stationary data (common in industry applications). Strong privacy guarantees for client data during model training must be matched by similar protections for the entire ML pipeline. Otherwise, divulging private information of clients during product use would negate the earlier protections. For example, knowing whether some label agrees with the model prediction can effectively reveal the private label. Concretely, Matthews & Harel (2013) showed that disclosure of an ROC curve allows some recovery of the sensitive input data. In this paper, we develop algorithms for (i) federated evaluation of classifier-quality metrics with privacy guarantees and (ii) performing classifier calibration. See a summary of our results in Section 2.4 and Table 1.
Federated calibration and evaluation of classifiers are two fundamental tasks that arise regardless of whether federated learning or deep learning is used (Russell & Norvig, 2020). They are important whenever a classifier is used "in the wild" by distributed clients on data that cannot be collected centrally due to privacy concerns. Common scenarios that arise in practical deployments include the following.
• A heuristic rule-based model is proposed as a baseline for a task but must be evaluated to understand whether a more complex ML-based solution is even needed;
• A model pre-trained via transfer learning for addressing multiple tasks (e.g., BERT (Devlin et al., 2019)) is to be evaluated for its performance on a particular task;
• A new model has been trained with federated learning on fresh data, and needs to be compared with earlier models on distributed test data before launching;
• A model has been deployed to production, but its predictions must be continuously evaluated against user behavior (e.g., click-through rate) to determine when model retraining is needed.
Federated calibration of a classifier score function remaps raw score values to probabilities, so that examples assigned probability p are positive (approximately) a proportion p of the time. Since classifier decisions are routinely made by comparing the evaluated score to a threshold, calibration ensures the validity of the threshold (especially for nonstationary data) and the transparency/explainability of the classification procedure. As noted by Guo et al. (2017), "modern neural networks are not well calibrated", prompting recent study of calibration techniques (Minderer et al., 2021). Calibration (if done well) does not affect precision-recall tradeoffs.
Federated evaluation of standard binary-classifier metrics is the second challenge we address: precision, recall, accuracy and ROC-AUC (Russell & Norvig, 2020). Accuracy measures the fraction of predictions that are correct, while recall and precision focus only on examples from a particular class, giving the fraction of these examples that are predicted correctly, and the fraction of examples labeled as being in the class that are correct, respectively. ROC-AUC is defined in terms of the area under the curve of a plot of the tradeoff between false positive ratio and true positive ratio. These simple metrics are easy to compute when the test data is held centrally. It is more challenging to compute them in a distributed setting, while providing formal guarantees of privacy and accuracy. For privacy, we leverage secure aggregation and/or differential privacy (Section 2.2). For accuracy, we seek error bounds when estimating metrics, so as to facilitate practical use. We present bounds as a function of the number of participating clients (M) and the privacy parameter (ε). For the evaluation metrics we studied, errors decrease as low-degree polynomials of M — good news for medium-to-large deployments.
The federated computational techniques we use are instrumental for both challenges addressed in this paper. These techniques compile statistical descriptions of the classifier score function as evaluated on the examples held by distributed clients. Then we directly estimate classifier quality, ROC AUC, and the calibration mapping. The description we need is essentially a histogram of the score distribution, whose buckets divide up the examples evenly. Estimation error drops as this description gets refined: for a B-bucket histogram, estimation error due to the histogram scales as 1/B or even 1/B^2 in some cases. Hence, B ≈ 100 leads to very accurate results. Fortunately, the histogram-based approach is compatible with various privacy models that provide strong guarantees under different scenarios by adding noise to data and combining it. The novelty in our work is in showing the reductions of these important problems to histogram computations, and in analyzing the resulting accuracy bounds. The privacy noise adds an error term that depends on the number of clients M, scaling as 1/M^{1/4} in the worst case and as 1/M in the best case. Hence, we obtain good accuracy (under privacy) with upwards of 10,000 participating clients.
The three models of privacy considered in this work are: (1) Federated privacy via Secure Aggregation (Federated for short), where the protocol reveals the true output of the computed aggregate (without noise addition) (Segal et al., 2017). The secrecy of the clients' inputs is achieved by using Secure Aggregation to gather their values, only revealing the aggregate (typically, the sum); (2) Distributed Differential Privacy (DistDP), where each client introduces a small amount of noise, so that the sum of all these noise fragments is equivalent in distribution to a central noise distribution such as Laplace or Gaussian, leading to a guarantee of differential privacy (Dwork & Roth, 2014). Using Secure Aggregation then ensures that only the sum of inputs with privacy noise is revealed; and (3) Local Differential Privacy (LocalDP), where each client adds sufficient noise to their report to ensure differential privacy on their message, so that Secure Aggregation is not needed (Yang et al., 2020b). These models imply different error bounds as we trade accuracy for the level of privacy and trust needed.
Our contributions offer algorithmic techniques and error bounds for federated calibration of classifier scores and key classifier quality metrics. These honor the three privacy models described above. For local and distributed differential privacy we use the standard privacy parameter ε (see Section 2.2), along with the parameters B and M (above) and the parameter h explained in Section 2.3. Our theoretical bounds are summarized in a 4×3 matrix seen in Table 1, and explained in greater detail in Section 2.4. Several key questions we address have largely eluded prior studies in the federated setting, and even modest asymptotic improvements over prior results are significant in practice due to large values of M. To better understand the volume of data needed to achieve acceptable accuracy bounds, we perform experimental studies. Compared to a recent result on ROC AUC estimation under LocalDP (Bell et al., 2020), we asymptotically improve bounds under less restrictive assumptions. Most recently, heuristics have been proposed for AUC estimation based on local noise addition (Sun et al., 2022a,b). These do not provide any accuracy guarantees, and we see that our approach provides better results in our experimental study. Similar questions have also been studied in the centralized model of DP (Stoddard et al., 2014), but these too lack the accuracy guarantees we can provide.
The paper outline is as follows. Section 2 reviews technical background on classifier calibration and evaluation, as well as privacy models. It also states the technical assumptions we use in proofs. Section 2.3 develops technical machinery for federated aggregation using score histograms. These histograms are needed (a) in Section 3 for federated evaluation of precision, recall and accuracy of binary classifiers, (b) in Section 4 for ROC AUC, and (c) in Section 5 for calibration. Empirical evaluation of our techniques is reported in Section 6, and conclusions are drawn in Section 7.
2 Background, Notation and Assumptions
Supervised binary classification supports many practical applications, and its theoretical setting is conducive to formal analyses. It also helps address multiclass classification and subset selection (via indicator classifiers), while lightweight ranking is routinely implemented by sorting binary classifier scores trained to predict which items are more likely to be selected. Standard classifier metrics are often computed approximately in practice (e.g., by Monte Carlo sampling), but this becomes more challenging in the federated setting.
2.1 Classifier Calibration and Evaluation Metrics
Given a trained score function s(·) which takes in examples x and outputs a score s(x) ∈ [0, 1], we define a binary classifier based on a threshold T, via

    pred(x) = 0 if s(x) ≤ T,   and   pred(x) = 1 if s(x) > T.

At T = 0, the false positive ratio (FPR) and true positive ratio (TPR) are 1, and drop to 0 as T → 1.
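To make the threshold rule concrete, here is a minimal Python sketch (ours, not from the paper) of the prediction function and the resulting TPR/FPR at a threshold T; the helper names are illustrative only.

```python
import numpy as np

def pred(scores, T):
    """Thresholded binary prediction: 1 if s(x) > T, else 0."""
    return (np.asarray(scores, float) > T).astype(int)

def tpr_fpr(scores, labels, T):
    """TPR and FPR of the thresholded classifier at threshold T."""
    yhat, y = pred(scores, T), np.asarray(labels)
    tpr = np.sum((yhat == 1) & (y == 1)) / np.sum(y == 1)
    fpr = np.sum((yhat == 1) & (y != 1)) / np.sum(y != 1)
    return tpr, fpr

# At T = 0 both ratios are 1 (assuming all scores are positive);
# they fall to 0 as T approaches 1.
```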
Well-behaved score distributions. For arbitrary score distributions, strong bounds for our problems seem elusive. But empirically significant cases often exhibit some type of smoothness (moderate change), except for point spikes – score values repeated for a large fraction of positive or negative examples: (i) for classifiers with a limited range of possible output scores, (ii) when some inputs repeat verbatim many times, (iii) when a dominant feature value determines the score. We call such score distributions well-behaved. Formally, we define a spike as any point with probability mass > φ, hence at most 1/φ spikes exist.
Definition 1. Let p(s) and n(s) denote the score (mass) functions of the positive and negative examples, respectively. Spikes are those points s for which p(s) > φ or n(s) > φ. We say that the score distribution is (φ, ℓ)-well-behaved if it is ℓ-Lipschitz between spikes:

    |p(s) − p(s + ∆)| ≤ ℓ∆   and   |n(s) − n(s + ∆)| ≤ ℓ∆    (1)

This smoothness condition for a parameter ℓ captures the idea that the amount of positive and negative examples does not change too quickly with s (barring spikes in [s, s + ∆]).
Balanced input assumption. For brevity, we assume that the counts of positive and negative examples, P and N respectively, are bounded fractions (not too skewed) of the total number of examples M = P + N. Our results hold regardless of the balance, but the simplified proofs expose core dependencies between results. In some places, P = N (perfect balance) maximizes our error bounds and illustrates worst-case behavior. Class imbalance occurs in practice, e.g., in recognizing hate speech, where P ≪ N. We assume that classifier accuracy is > 0.5 (otherwise negating classifier output improves accuracy). Hence, for balanced inputs, the fraction of true positives is at least a constant.
Calibration. Given a score function s, calibration defines a transformation to apply to s to obtain an accurate estimate of the probability that the example is positive. That is, for a set of examples and labels (xi, yi), we want a function c(), so that c(s(xi)) ≈ Pr[yi = 1]. There are many approaches to find such a mapping c, such as isotonic regression, or fitting a sigmoid function. A baseline approach is to perform histogram binning on the function, with buckets chosen based on quantile boundaries. More advanced approaches combine information from multiple histograms in parallel (Naeini et al., 2015). To measure the calibration quality, expected calibration error (ECE) arranges the predictions for a set of test examples into a fixed number of bins and computes the expected deviation between the true fraction of positives and predicted probabilities for each bin.¹
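As a concrete (centralized, non-private) illustration of quantile-based histogram binning and the ECE measure described above, the following Python sketch may help; the function names and the choices of B and K are our own illustrative assumptions, not the paper's implementation.

```python
import numpy as np

def fit_histogram_binning(scores, labels, B=20):
    """Histogram-binning calibration: bucket boundaries are score quantiles,
    and each bucket maps to its empirical fraction of positive labels."""
    scores, labels = np.asarray(scores, float), np.asarray(labels)
    edges = np.quantile(scores, np.linspace(0.0, 1.0, B + 1))
    edges[0], edges[-1] = 0.0, 1.0                       # buckets cover [0, 1]
    idx = np.clip(np.searchsorted(edges, scores, side="right") - 1, 0, B - 1)
    pos_rate = np.array([labels[idx == j].mean() if np.any(idx == j) else 0.5
                         for j in range(B)])
    def calibrate(s):                                    # the mapping c(s)
        j = np.clip(np.searchsorted(edges, s, side="right") - 1, 0, B - 1)
        return pos_rate[j]
    return calibrate

def expected_calibration_error(probs, labels, K=10):
    """ECE (Footnote 1): sum over K equal-width bins of P(j) * |o(j) - e(j)|."""
    probs, labels = np.asarray(probs, float), np.asarray(labels)
    bins = np.clip((probs * K).astype(int), 0, K - 1)
    ece = 0.0
    for j in range(K):
        mask = bins == j
        if mask.any():
            ece += mask.mean() * abs(labels[mask].mean() - probs[mask].mean())
    return ece
```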
Precision, Recall, Accuracy. Given a classifier that makes (binary) predictions pred(xi) of the ground truth label yi on M examples xi, standard classifier quality metrics include:
• Accuracy: the fraction of correctly predicted examples, i.e., |{i : pred(xi) = yi}|/M.
• Recall: the fraction of correctly predicted positive examples (a.k.a. true positive ratio), i.e., |{i : pred(xi) = yi = 1}|/|{i : yi = 1}|.
• Precision: the fraction of examples labeled positive that are labeled correctly, i.e., |{i : pred(xi) = yi = 1}|/|{i : pred(xi) = 1}|.
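For reference, these three metrics are straightforward when predictions and labels sit in one place; the sketch below (plain NumPy, names ours) is the centralized baseline that the federated protocols of Section 3 approximate from privately aggregated counts.

```python
import numpy as np

def accuracy(pred, y):
    pred, y = np.asarray(pred), np.asarray(y)
    return np.mean(pred == y)                      # |{i : pred(xi) = yi}| / M

def recall(pred, y):
    pred, y = np.asarray(pred), np.asarray(y)
    return np.sum((pred == 1) & (y == 1)) / np.sum(y == 1)

def precision(pred, y):
    pred, y = np.asarray(pred), np.asarray(y)
    return np.sum((pred == 1) & (y == 1)) / np.sum(pred == 1)
```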
ROC AUC (Receiver Operating Characteristic Area Under the Curve) is often used to capture the quality of a trained ML classifier. Plotting TPR against FPR generates the ROC curve, and the ROC AUC (AUC for short) represents the area under this curve. The AUC can be computed in several equivalent ways. Given a set of labeled examples E = {(x, y)} with ±1 label y, the AUC equals the probability that a uniformly-selected positive example (y = 1) is ranked above a uniformly-selected negative example (y = −1). Let N be the number of negative examples, |{(x, −1) ∈ E}|, and P = |{(x, 1) ∈ E}|. Then (Hand & Till, 2001)

    AUC = (1 / (P·N)) Σ_{(x,1)∈E} Σ_{(z,−1)∈E} I[s(x) > s(z)]    (2)

This expression of AUC simplifies computation, but compares pairs of examples across clients. We avoid such distributed interactions by evaluating metrics via histograms.
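The following sketch (ours, not the paper's code) illustrates both views: the exact pairwise form of Equation (2), and a histogram-based approximation that only needs per-bucket counts of positives and negatives, crediting within-bucket ties as one half. The exact histogram estimator used in Section 4 may differ in details.

```python
import numpy as np

def auc_pairwise(scores, labels):
    """AUC via Eq. (2): fraction of (positive, negative) pairs in which the
    positive example gets the strictly higher score (ties score zero here)."""
    s, y = np.asarray(scores, float), np.asarray(labels)
    pos, neg = s[y == 1], s[y != 1]
    return np.mean(pos[:, None] > neg[None, :])    # quadratic; for illustration

def auc_from_histograms(p, n):
    """Approximate AUC from per-bucket positive (p) and negative (n) counts,
    with buckets ordered by increasing score; within-bucket ties count as 1/2."""
    p, n = np.asarray(p, float), np.asarray(n, float)
    neg_below = np.concatenate(([0.0], np.cumsum(n)[:-1]))
    return float(np.sum(p * (neg_below + 0.5 * n)) / (p.sum() * n.sum()))
```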
2.2 Privacy models
Federated Privacy via Secure Aggregation (Federated) assumes that each client holds one (scalar or vector) value x(i). In practice, clients may hold multiple values, but this can easily be handled in our protocols by working with their sum. In what follows, we assume a single value since this represents the hardest case for ensuring privacy and security. The Secure Aggregation protocol computes the sum X = Σ_{i=1}^{M} x(i) without revealing any intermediate values. Various cryptographic protocols provide distributed implementations of Secure Aggregation (Segal et al., 2017; Bell et al., 2020), or aggregation can be performed by a trusted aggregator, e.g., a server with secure hardware (Zhao et al., 2021). While secure aggregation provides a practical and simple-to-understand privacy primitive, it does not fully protect against a knowledgeable adversary. In particular, knowing the inputs of all clients except for one, the adversary can subtract the values that they already hold. Hence, DP guarantees are sought for additional protection.
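As a toy illustration of the additive-masking idea (and only that), the sketch below has every pair of clients agree on a random mask that one adds and the other subtracts, so the masks cancel in the server's sum. Real Secure Aggregation protocols (Segal et al., 2017; Bell et al., 2020) additionally handle dropouts, derive masks from pairwise key agreement and keyed PRGs, and use sparse communication graphs; none of that is modeled here.

```python
import numpy as np

def secure_aggregate(values, modulus=2**32, seed=0):
    """Toy additive-masking aggregation: clients i < j share a random mask;
    i adds it and j subtracts it, so all masks cancel in the final sum."""
    rng = np.random.default_rng(seed)     # stands in for pairwise-agreed seeds
    masked = np.array(values, dtype=np.int64) % modulus
    M = len(masked)
    for i in range(M):
        for j in range(i + 1, M):
            mask = int(rng.integers(modulus))
            masked[i] = (masked[i] + mask) % modulus
            masked[j] = (masked[j] - mask) % modulus
    # The server only ever sees the masked reports; their sum is the true total.
    return int(np.sum(masked) % modulus)

# e.g. secure_aggregate([3, 5, 7]) == 15
```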
Distributed Differential Privacy (DistDP). The model of differential privacy (DP) requires the output of a computation to be a sample from a random variable, so that for two inputs that are close, their outputs have similar probabilities – in our case, captured as inputs that vary by the addition or removal of one item (event-level privacy) (Dwork & Roth, 2014). DP is often achieved by adding calibrated noise from a suitable statistical distribution to the exact answer: in our setting, this is instantiated by introducing distributed noise at each client which sums to give discrete Laplace noise with variance O(1/ε²). The resulting (expected) absolute error for count queries is O(1/ε), and for means of M values is O(1/(εM)). The distributed noise is achieved by sampling from the difference of two (discrete) Pólya distributions (Balle et al., 2020), which neatly combines with secure aggregation so only the noisy sum is revealed.²
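A sketch of this distributed noise generation, under the assumption that each client perturbs an integer count with the difference of two Pólya (negative binomial) samples of shape 1/M and parameter α = exp(−ε), following Balle et al. (2020); the Pólya samples are drawn via the standard Gamma–Poisson mixture rather than any particular negative-binomial API, and the reports would in practice only be revealed through Secure Aggregation.

```python
import numpy as np

def polya_sample(r, alpha, rng):
    """Pólya / negative binomial NB(r, alpha) via the Gamma-Poisson mixture:
    lam ~ Gamma(shape=r, scale=alpha/(1-alpha)), then k ~ Poisson(lam)."""
    lam = rng.gamma(shape=r, scale=alpha / (1.0 - alpha))
    return int(rng.poisson(lam))

def distdp_noisy_sum(client_values, epsilon, seed=0):
    """Each client adds eta_i = X_i - Y_i with X_i, Y_i ~ NB(1/M, exp(-eps));
    summed over M clients, the noise follows a discrete Laplace (two-sided
    geometric) distribution, giving eps-DP for a sensitivity-1 count."""
    rng = np.random.default_rng(seed)
    M = len(client_values)
    alpha = np.exp(-epsilon)
    reports = [int(v) + polya_sample(1.0 / M, alpha, rng)
                      - polya_sample(1.0 / M, alpha, rng)
               for v in client_values]
    return sum(reports)   # in deployment, only this sum is revealed
```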
Local Differential Privacy (LocalDP). Local Differential Privacy can be understood as applying the DP definition at the level of an individual client: each client creates a message based on their input, so that whatever inputs the client holds, the probability of producing each possible message is similar. For a set of M clients who each hold a binary value bi, we can estimate the sum and the mean of the bi under ε-LDP by applying randomized response (Warner, 1965). For the sum over M clients, the variance is O(M/ε²), and so the absolute error is proportional to √M/ε. After rescaling by M, the variance of the mean is O(1/(Mε²)), and so the absolute error is proportional to 1/(ε√M) (Yang et al., 2020b). These bounds hold for histograms via "frequency oracles", when each client holds one out of B possibilities — we use Optimal Unary Encoding (Wang et al., 2017) and build a (private) view of the frequency histogram, which can be normalized to get an empirical PMF.
In what follows, we treat ε as a fixed parameter close to 1 (for both LocalDP and DistDP cases).
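For intuition, here is a minimal sketch of randomized response for a single bit with the standard debiased sum estimator; the paper's histogram construction instead uses Optimal Unary Encoding (Wang et al., 2017), which applies an analogous perturb-then-debias step per histogram bucket.

```python
import numpy as np

def rr_perturb(bit, epsilon, rng):
    """Report the true bit with probability e^eps / (1 + e^eps), else flip it."""
    p_keep = np.exp(epsilon) / (1.0 + np.exp(epsilon))
    return bit if rng.random() < p_keep else 1 - bit

def rr_estimate_sum(reports, epsilon):
    """Debias the sum: E[report] = (2p - 1) * b + (1 - p), so invert that."""
    p = np.exp(epsilon) / (1.0 + np.exp(epsilon))
    reports = np.asarray(reports, dtype=float)
    return (reports.sum() - len(reports) * (1.0 - p)) / (2.0 * p - 1.0)

# Usage sketch:
# rng = np.random.default_rng(0)
# reports = [rr_perturb(b, 1.0, rng) for b in true_bits]
# est_sum = rr_estimate_sum(reports, 1.0); est_mean = est_sum / len(reports)
```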
¹Formally, the ECE is defined by Naeini et al. (2015) as Σ_{j=1}^{K} P(j) |o(j) − e(j)|, where P(j) is the (empirical) probability that an example falls in the jth bucket (out of K), while o(j) is the true fraction of positive examples that fall in the jth bin, and e(j) is the fraction predicted by the model.
²Other privacy noise is possible via Binomial (Dwork et al., 2006) or Skellam (Agarwal et al., 2021) noise addition.
Table 1: Simplified variants of error bounds derived in this paper for three privacy models.

                                                      Federated    DistDP                 LocalDP
P/R/A (for a given classifier)                        0            1/(εM)                 1/(ε M^{1/2})
P/R/A (for a score function and varying threshold)    1/B          1/(ε^{2/3} M^{2/3})    1/(ε^{2/3} M^{1/3})
ROC AUC                                               1/B^2        (1 + 1/B) · 1/(εM)     h/(ε M^{1/2})
Expected Calibration Error (ECE)                      1/M^{1/3}    1/M^{1/3}              1/(ε^{1/2} M^{1/4})
2.3 Building score histograms
A key step in our algorithms is building an equi-depth histogram of the clients’ data. That is, given
M
samples as
scalar values, we seek a set of boundaries that partition the range into
B
buckets with (approximately) equal number of
samples per bucket. Fortunately, this is a well-studied problem, so in Appendix A we review how such histograms can
be computed under each privacy model and analyze their accuracy as a function of parameters
h
and
. The novelty in
our work is the way that we can use these histograms to provide classifier quality measures with privacy and accuracy
guarantees across a range of federated models.
For federated computation, we want to represent the distribution of scores with histograms. Based on a set of B buckets that partition [0, 1], we obtain two histograms, n and p, which provide the number of examples whose score falls in each bucket. Specifically, pi and ni give the number of positive and negative examples respectively whose score falls in bucket i. In what follows, we set the histogram bucket boundaries based on the (approximate) quantiles of the score function for the given set of examples E, so that pi + ni ≈ (P + N)/B, where P and N denote the total numbers of positive and negative examples respectively.
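A centralized sketch of this construction (ignoring, for the moment, the private quantile estimation reviewed in Appendix A): bucket boundaries are taken as empirical score quantiles, and p and n count positives and negatives per bucket. The function name and defaults are ours.

```python
import numpy as np

def score_histograms(scores, labels, B=100):
    """Equi-depth bucketing of scores, then per-bucket counts of positive (p)
    and negative (n) examples, so that p_i + n_i is roughly (P + N) / B."""
    scores, labels = np.asarray(scores, float), np.asarray(labels)
    edges = np.quantile(scores, np.linspace(0.0, 1.0, B + 1))
    edges[0], edges[-1] = 0.0, 1.0                   # buckets partition [0, 1]
    idx = np.clip(np.searchsorted(edges, scores, side="right") - 1, 0, B - 1)
    p = np.bincount(idx[labels == 1], minlength=B)
    n = np.bincount(idx[labels != 1], minlength=B)
    return edges, p, n
```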
Overcoming heterogeneity.
A common concern when working in the federated model is data heterogeneity: the data
held by clients may be non-iid (i.e., some clients are more likely to have examples of a single class), and some clients
may hold many more examples than others. By working with histogram representations we are able to overcome
these concerns: the histograms we build are insensitive to how the data is distributed to clients, and the noise added
for privacy is similarly indifferent to data heterogeneity. Thus we can state our results solely in terms of a few basic
parameters (number of clients, number of histogram buckets etc.), independent of any other parameters of the data.
2.4 Summary of our results
Table 1 presents simplified versions of our main results. Here and throughout, we express error bounds in terms of the
expected absolute error, which can also be used as a bound that holds with high probability via standard concentration
inequalities. Without requiring any i.i.d. assumptions for distributed clients, our results clarify the expected magnitude
of the error, which should be small in comparison to the quantity being estimated: tightly bounding the error values
ensures accurate results. All the estimated quantities are in the
[0,1]
range, and for most (binary) classifiers of interest
the four quality metrics will be 1
2. For calibration, the expected calibration error is a small value in [0, 1].
To keep the presentation of these bounds simple, we make the balanced input assumption (from Section 2.1), i.e.,
there are
Θ(M)
positive examples and
Θ(M)
negative examples among the
M
clients. Error bounds are presented as a
function of the number of clients,
M
, the privacy parameter
, the number of buckets used to build a score histogram,
B
, and the height of the hierarchy used,
h
(see Section 3.2). Across the various problems the error increases as we
move from Federated to DistDP to LocalDP. This is expected, as the noise added in each case increases to compensate
for the weaker trust model.
Other trends we see are not as easy to guess. Increasing the number of buckets B often helps reduce the error, but this is not always a factor, particularly for the LocalDP results. Increasing the number of examples, M, typically decreases the error, although the rate of improvement as a function of M varies from 1/M^{1/4} in the worst case to 1/M in the best case. Our experimental findings, presented in Appendix C, agree with this analysis and confirm the anticipated impact of increasing M and of varying the parameters B and h. We observe high accuracy in the Federated case and good accuracy when DP noise is added. Calibration error for DistDP is insensitive to ε, as explained after Theorem 9. These results help build full-stack support for practical federated learning, and show the practicality of federated classifier evaluation.