Federated Calibration and Evaluation of Binary Classifiers
Graham Cormode*   Igor Markov†
*Meta AI, gcormode@meta.com   †Meta AI, imarkov@meta.com
Abstract
We address two major obstacles to practical use of supervised classifiers on distributed private data. Whether a classifier was trained by a federation of cooperating clients or trained centrally out of distribution, (1) the output scores must be calibrated, and (2) performance metrics must be evaluated — all without assembling labels in one place. In particular, we show how to perform calibration and compute precision, recall, accuracy and ROC-AUC in the federated setting under three privacy models: (i) secure aggregation, (ii) distributed differential privacy, and (iii) local differential privacy. Our theorems and experiments clarify tradeoffs between privacy, accuracy, and data efficiency. They also help decide whether a given application has sufficient data to support federated calibration and evaluation.
1 Introduction
Modern machine learning (ML) draws insights from large amounts of data, e.g., by training prediction models on
collected labels. Traditional ML workflows assemble all data in one place, but human-generated data — the content
viewed online and reactions to this content, geographic locations, the text typed, sound and images recorded by portable
devices, interactions with friends, interactions with online ads, online purchases, etc. — is subject to privacy constraints
and cannot be shared easily (Cohen & Nissim, 2020). This raises a major challenge: extending traditional ML techniques
to accommodate a federation of cooperating distributed clients with individually private data. In a general framework
to address this challenge, instead of collecting the data for centralized processing, clients evaluate a candidate model
on their labeled data, and send updates to a server for aggregation. For example, federated training performs learning
locally on the data in parallel in a privacy-respecting manner: the updates are gradients that are summed by the server
and then used to revise the model (McMahan et al., 2017).
Federated learning (Kairouz et al., 2021) can offer privacy guarantees to clients. In a baseline level of disclosure limitation, the clients never share raw data but only send out model updates. Formal privacy guarantees are obtained via careful aggregation and by adding noise to updates (Wei et al., 2019). Given the prominence of federated learning, different aspects of the training process (usually for neural networks) have received much attention, aiming to optimize training speed, reduce communication and tighten privacy guarantees. However, deploying trained models usually requires one to (i) evaluate and track their performance on distributed (private) data, (ii) select the best model from available alternatives, and (iii) calibrate a given model to frequent snapshots of non-stationary data (common in industry applications). Strong privacy guarantees for client data during model training must be matched by similar protections for the entire ML pipeline. Otherwise, divulging private information of clients during product use would negate the earlier protections. For example, knowing whether some label agrees with the model prediction can effectively reveal the private label. Concretely, Matthews & Harel (2013) showed that disclosure of an ROC curve allows some recovery of the sensitive input data. In this paper, we develop algorithms for (i) federated evaluation of classifier-quality metrics with privacy guarantees and (ii) performing classifier calibration. See a summary of our results in Section 2.4 and Table 1.
Federated calibration and evaluation of classifiers are two fundamental tasks that arise regardless of whether federated learning or deep learning is used (Russell & Norvig, 2020). They are important whenever a classifier is used "in the wild" by distributed clients on data that cannot be collected centrally due to privacy concerns. Common scenarios that arise in practical deployments include the following.
• A heuristic rule-based model is proposed as a baseline for a task but must be evaluated to understand whether a more complex ML-based solution is even needed;
• A model pre-trained via transfer learning for addressing multiple tasks (e.g., BERT (Devlin et al., 2019)) is to be evaluated for its performance on a particular task;
• A new model has been trained with federated learning on fresh data, and needs to be compared with earlier models on distributed test data before launching;
• A model has been deployed to production, but its predictions must be continuously evaluated against user behavior (e.g., click-through rate) to determine when model retraining is needed.
Federated calibration of a classifier score function remaps raw score values to probabilities, so that examples assigned probability p are positive (approximately) a proportion p of the time. Since classifier decisions are routinely made by comparing the evaluated score to a threshold, calibration ensures the validity of the threshold (especially for nonstationary data) and the transparency/explainability of the classification procedure. As noted by Guo et al. (2017), "modern neural networks are not well calibrated", prompting recent study of calibration techniques (Minderer et al., 2021). Calibration (if done well) does not affect precision-recall tradeoffs.
Federated evaluation of standard binary-classifier metrics is the second challenge we address: precision, recall, accuracy and ROC-AUC (Russell & Norvig, 2020). Accuracy measures the fraction of predictions that are correct, while recall and precision focus only on examples from a particular class, giving the fraction of these examples that are predicted correctly, and the fraction of examples labeled as being in the class that are correct, respectively. ROC-AUC is defined in terms of the area under the curve of a plot of the tradeoff between false positive ratio and true positive ratio. These simple metrics are easy to compute when the test data is held centrally. It is more challenging to compute them in a distributed setting, while providing formal guarantees of privacy and accuracy. For privacy, we leverage secure aggregation and/or differential privacy (Section 2.2). For accuracy, we seek error bounds when estimating metrics, so as to facilitate practical use. We present bounds as a function of the number of participating clients (M) and the privacy parameter (ε). For the evaluation metrics we studied, errors decrease as low-degree polynomials of M — good news for medium-to-large deployments.
The federated computational techniques we use are instrumental for both challenges addressed in this paper. These techniques compile statistical descriptions of the classifier score function as evaluated on the examples held by distributed clients. Then we directly estimate classifier quality, ROC AUC, and the calibration mapping. The description we need is essentially a histogram of the score distribution, whose buckets divide up the examples evenly. Estimation error drops as this description gets refined: for a B-bucket histogram, estimation error due to the histogram scales as 1/B or even 1/B^2 in some cases. Hence, B ≈ 100 leads to very accurate results. Fortunately, the histogram-based approach is compatible with various privacy models that provide strong guarantees under different scenarios by adding noise to data and combining it. The novelty in our work is in showing the reductions of these important problems to histogram computations, and in analyzing the resulting accuracy bounds. The privacy noise adds an error term that depends on the number of clients M, scaling as 1/M^{1/4} in the worst case and as 1/M in the best case. Hence, we obtain good accuracy (under privacy) with upwards of 10,000 participating clients.
The three models of privacy considered in this work are: (1) Federated privacy via Secure Aggregation (Federated for short), where the protocol reveals the true output of the computed aggregate (without noise addition) (Segal et al., 2017). The secrecy of the clients' inputs is achieved by using Secure Aggregation to gather their values, only revealing the aggregate (typically, the sum); (2) Distributed Differential Privacy (DistDP), where each client introduces a small amount of noise, so that the sum of all these noise fragments is equivalent in distribution to a central noise distribution such as Laplace or Gaussian, leading to a guarantee of differential privacy (Dwork & Roth, 2014). Using Secure Aggregation then ensures that only the sum of inputs with privacy noise is revealed; and (3) Local Differential Privacy (LocalDP), where each client adds sufficient noise to their report to ensure differential privacy on their message, so that Secure Aggregation is not needed (Yang et al., 2020b). These models imply different error bounds as we trade accuracy for the level of privacy and trust needed.
Our contributions offer algorithmic techniques and error bounds for federated calibration of classifier scores and key classifier quality metrics. These honor the three privacy models described above. For local and distributed differential privacy we use the standard privacy parameter ε (see Section 2.2), along with the parameters B and M (above) and the parameter h explained in Section 2.3. Our theoretical bounds are summarized in a 4×3 matrix seen in Table 1, and explained in greater detail in Section 2.4. Several key questions we address have largely eluded prior studies in the federated setting, and even modest asymptotic improvements over prior results are significant in practice due to large values of M. To better understand the volume of data needed to achieve acceptable accuracy bounds, we perform experimental studies. Compared to a recent result on ROC AUC estimation under LocalDP (Bell et al., 2020), we asymptotically improve bounds under less restrictive assumptions. Most recently, heuristics have been proposed for AUC estimation based on local noise addition (Sun et al., 2022a,b). These do not provide any accuracy guarantees, and we see that our approach provides better results in our experimental study. Similar questions have also been studied in the centralized model of DP (Stoddard et al., 2014), but these too lack the accuracy guarantees we can provide.
The paper outline is as follows. Section 2 reviews technical background on classifier calibration and evaluation, as well as privacy models. It also states the technical assumptions we use in proofs. Section 2.3 develops technical machinery for federated aggregation using score histograms. These histograms are needed (a) in Section 3 for federated evaluation of precision, recall and accuracy of binary classifiers, (b) in Section 4 for ROC AUC, and (c) in Section 5 for calibration. Empirical evaluation of our techniques is reported in Section 6, and conclusions are drawn in Section 7.
2 Background, Notation and Assumptions
Supervised binary classification supports many practical applications, and its theoretical setting is conducive to formal analyses. It also helps address multiclass classification and subset selection (via indicator classifiers), while lightweight ranking is routinely implemented by sorting binary classifier scores trained to predict which items are more likely to be selected. Standard classifier metrics are often computed approximately in practice (e.g., by Monte Carlo sampling), but this becomes more challenging in the federated setting.
2.1 Classifier Calibration and Evaluation Metrics
Given a trained score function s(·) which takes in examples x and outputs a score s(x) ∈ [0, 1], we define a binary classifier based on a threshold T, via

    pred(x) = 0 if s(x) ≤ T,   and   pred(x) = 1 if s(x) > T.

At T = 0, the false positive ratio (FPR) and true positive ratio (TPR) are 1, and drop to 0 as T → 1.
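To make the threshold rule concrete, here is a minimal Python sketch (ours, not from the paper) of the prediction function and the resulting TPR/FPR at a threshold T; the helper names are illustrative only.

```python
import numpy as np

def pred(scores, T):
    """Thresholded binary prediction: 1 if s(x) > T, else 0."""
    return (np.asarray(scores, float) > T).astype(int)

def tpr_fpr(scores, labels, T):
    """TPR and FPR of the thresholded classifier at threshold T."""
    yhat, y = pred(scores, T), np.asarray(labels)
    tpr = np.sum((yhat == 1) & (y == 1)) / np.sum(y == 1)
    fpr = np.sum((yhat == 1) & (y != 1)) / np.sum(y != 1)
    return tpr, fpr

# At T = 0 both ratios are 1 (assuming all scores are positive);
# they fall to 0 as T approaches 1.
```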
Well-behaved score distributions. For arbitrary score distributions, strong bounds for our problems seem elusive. But empirically significant cases often exhibit some type of smoothness (moderate change), except for point spikes – score values repeated for a large fraction of positive or negative examples: (i) for classifiers with a limited range of possible output scores, (ii) when some inputs repeat verbatim many times, (iii) when a dominant feature value determines the score. We call such score distributions well-behaved. Formally, we define a spike as any point with probability mass > φ, hence at most 1/φ spikes exist.
Definition 1. Let p(s) and n(s) denote the score (mass) functions of the positive and negative examples, respectively. Spikes are those points s for which p(s) > φ or n(s) > φ. We say that the score distribution is (φ, ℓ)-well-behaved if it is ℓ-Lipschitz between spikes:

    |p(s) − p(s + ∆)| ≤ ℓ∆   and   |n(s) − n(s + ∆)| ≤ ℓ∆    (1)

This smoothness condition for a parameter ℓ captures the idea that the amount of positive and negative examples does not change too quickly with s (barring spikes in [s, s + ∆]).
Balanced input assumption. For brevity, we assume that the counts of positive and negative examples, P and N respectively, are bounded fractions (not too skewed) of the total number of examples M = P + N. Our results hold regardless of the balance, but the simplified proofs expose core dependencies between results. In some places, P = N (perfect balance) maximizes our error bounds and illustrates worst-case behavior. Class imbalance occurs in practice, e.g., in recognizing hate speech, where P ≪ N. We assume that classifier accuracy is > 0.5 (otherwise negating classifier output improves accuracy). Hence, for balanced inputs, the fraction of true positives is at least a constant.
Calibration. Given a score function s, calibration defines a transformation to apply to s to obtain an accurate estimate of the probability that the example is positive. That is, for a set of examples and labels (xi, yi), we want a function c(), so that c(s(xi)) ≈ Pr[yi = 1]. There are many approaches to find such a mapping c, such as isotonic regression, or fitting a sigmoid function. A baseline approach is to perform histogram binning on the function, with buckets chosen based on quantile boundaries. More advanced approaches combine information from multiple histograms in parallel (Naeini et al., 2015). To measure the calibration quality, expected calibration error (ECE) arranges the predictions for a set of test examples into a fixed number of bins and computes the expected deviation between the true fraction of positives and predicted probabilities for each bin.¹
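As a concrete (centralized, non-private) illustration of quantile-based histogram binning and the ECE measure described above, the following Python sketch may help; the function names and the choices of B and K are our own illustrative assumptions, not the paper's implementation.

```python
import numpy as np

def fit_histogram_binning(scores, labels, B=20):
    """Histogram-binning calibration: bucket boundaries are score quantiles,
    and each bucket maps to its empirical fraction of positive labels."""
    scores, labels = np.asarray(scores, float), np.asarray(labels)
    edges = np.quantile(scores, np.linspace(0.0, 1.0, B + 1))
    edges[0], edges[-1] = 0.0, 1.0                       # buckets cover [0, 1]
    idx = np.clip(np.searchsorted(edges, scores, side="right") - 1, 0, B - 1)
    pos_rate = np.array([labels[idx == j].mean() if np.any(idx == j) else 0.5
                         for j in range(B)])
    def calibrate(s):                                    # the mapping c(s)
        j = np.clip(np.searchsorted(edges, s, side="right") - 1, 0, B - 1)
        return pos_rate[j]
    return calibrate

def expected_calibration_error(probs, labels, K=10):
    """ECE (Footnote 1): sum over K equal-width bins of P(j) * |o(j) - e(j)|."""
    probs, labels = np.asarray(probs, float), np.asarray(labels)
    bins = np.clip((probs * K).astype(int), 0, K - 1)
    ece = 0.0
    for j in range(K):
        mask = bins == j
        if mask.any():
            ece += mask.mean() * abs(labels[mask].mean() - probs[mask].mean())
    return ece
```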
Precision, Recall, Accuracy. Given a classifier that makes (binary) predictions pred(xi) of the ground truth label yi on M examples xi, standard classifier quality metrics include:
• Accuracy: the fraction of correctly predicted examples, i.e., |{i : pred(xi) = yi}|/M.
• Recall: the fraction of correctly predicted positive examples (a.k.a. true positive ratio), i.e., |{i : pred(xi) = yi = 1}|/|{i : yi = 1}|.
• Precision: the fraction of examples labeled positive that are labeled correctly, i.e., |{i : pred(xi) = yi = 1}|/|{i : pred(xi) = 1}|.
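For reference, these three metrics are straightforward when predictions and labels sit in one place; the sketch below (plain NumPy, names ours) is the centralized baseline that the federated protocols of Section 3 approximate from privately aggregated counts.

```python
import numpy as np

def accuracy(pred, y):
    pred, y = np.asarray(pred), np.asarray(y)
    return np.mean(pred == y)                      # |{i : pred(xi) = yi}| / M

def recall(pred, y):
    pred, y = np.asarray(pred), np.asarray(y)
    return np.sum((pred == 1) & (y == 1)) / np.sum(y == 1)

def precision(pred, y):
    pred, y = np.asarray(pred), np.asarray(y)
    return np.sum((pred == 1) & (y == 1)) / np.sum(pred == 1)
```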
ROC AUC (Receiver Operating Characteristic Area Under the Curve) is often used to capture the quality of a trained ML classifier. Plotting TPR against FPR generates the ROC curve, and the ROC AUC (AUC for short) represents the area under this curve. The AUC can be computed in several equivalent ways. Given a set of labeled examples E = {(x, y)} with ±1 label y, the AUC equals the probability that a uniformly-selected positive example (y = 1) is ranked above a uniformly-selected negative example (y = −1). Let N be the number of negative examples, |{(x, −1) ∈ E}|, and P = |{(x, 1) ∈ E}|. Then (Hand & Till, 2001)

    AUC = (1 / (P·N)) Σ_{(x,1)∈E} Σ_{(z,−1)∈E} I[s(x) > s(z)]    (2)

This expression of AUC simplifies computation, but compares pairs of examples across clients. We avoid such distributed interactions by evaluating metrics via histograms.
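The following sketch (ours, not the paper's code) illustrates both views: the exact pairwise form of Equation (2), and a histogram-based approximation that only needs per-bucket counts of positives and negatives, crediting within-bucket ties as one half. The exact histogram estimator used in Section 4 may differ in details.

```python
import numpy as np

def auc_pairwise(scores, labels):
    """AUC via Eq. (2): fraction of (positive, negative) pairs in which the
    positive example gets the strictly higher score (ties score zero here)."""
    s, y = np.asarray(scores, float), np.asarray(labels)
    pos, neg = s[y == 1], s[y != 1]
    return np.mean(pos[:, None] > neg[None, :])    # quadratic; for illustration

def auc_from_histograms(p, n):
    """Approximate AUC from per-bucket positive (p) and negative (n) counts,
    with buckets ordered by increasing score; within-bucket ties count as 1/2."""
    p, n = np.asarray(p, float), np.asarray(n, float)
    neg_below = np.concatenate(([0.0], np.cumsum(n)[:-1]))
    return float(np.sum(p * (neg_below + 0.5 * n)) / (p.sum() * n.sum()))
```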
2.2 Privacy models
Federated Privacy via Secure Aggregation (Federated) assumes that each client holds one (scalar or vector) value x(i). In practice, clients may hold multiple values, but this can easily be handled in our protocols by working with their sum. In what follows, we assume a single value since this represents the hardest case for ensuring privacy and security. The Secure Aggregation protocol computes the sum X = Σ_{i=1}^{M} x(i) without revealing any intermediate values. Various cryptographic protocols provide distributed implementations of Secure Aggregation (Segal et al., 2017; Bell et al., 2020), or aggregation can be performed by a trusted aggregator, e.g., a server with secure hardware (Zhao et al., 2021). While secure aggregation provides a practical and simple-to-understand privacy primitive, it does not fully protect against a knowledgeable adversary. In particular, knowing the inputs of all clients except for one, the adversary can subtract the values that they already hold. Hence, DP guarantees are sought for additional protection.
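As a toy illustration of the additive-masking idea (and only that), the sketch below has every pair of clients agree on a random mask that one adds and the other subtracts, so the masks cancel in the server's sum. Real Secure Aggregation protocols (Segal et al., 2017; Bell et al., 2020) additionally handle dropouts, derive masks from pairwise key agreement and keyed PRGs, and use sparse communication graphs; none of that is modeled here.

```python
import numpy as np

def secure_aggregate(values, modulus=2**32, seed=0):
    """Toy additive-masking aggregation: clients i < j share a random mask;
    i adds it and j subtracts it, so all masks cancel in the final sum."""
    rng = np.random.default_rng(seed)     # stands in for pairwise-agreed seeds
    masked = np.array(values, dtype=np.int64) % modulus
    M = len(masked)
    for i in range(M):
        for j in range(i + 1, M):
            mask = int(rng.integers(modulus))
            masked[i] = (masked[i] + mask) % modulus
            masked[j] = (masked[j] - mask) % modulus
    # The server only ever sees the masked reports; their sum is the true total.
    return int(np.sum(masked) % modulus)

# e.g. secure_aggregate([3, 5, 7]) == 15
```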
Distributed Differential Privacy (DistDP). The model of differential privacy (DP) requires the output of a computation to be a sample from a random variable, so that for two inputs that are close, their outputs have similar probabilities – in our case, captured as inputs that vary by the addition or removal of one item (event-level privacy) (Dwork & Roth, 2014). DP is often achieved by adding calibrated noise from a suitable statistical distribution to the exact answer: in our setting, this is instantiated by introducing distributed noise at each client which sums to give discrete Laplace noise with variance O(1/ε²). The resulting (expected) absolute error for count queries is O(1/ε), and for means of M values is O(1/(εM)). The distributed noise is achieved by sampling from the difference of two (discrete) Pólya distributions (Balle et al., 2020), which neatly combines with secure aggregation so only the noisy sum is revealed.²
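A sketch of this distributed noise generation, under the assumption that each client perturbs an integer count with the difference of two Pólya (negative binomial) samples of shape 1/M and parameter α = exp(−ε), following Balle et al. (2020); the Pólya samples are drawn via the standard Gamma–Poisson mixture rather than any particular negative-binomial API, and the reports would in practice only be revealed through Secure Aggregation.

```python
import numpy as np

def polya_sample(r, alpha, rng):
    """Pólya / negative binomial NB(r, alpha) via the Gamma-Poisson mixture:
    lam ~ Gamma(shape=r, scale=alpha/(1-alpha)), then k ~ Poisson(lam)."""
    lam = rng.gamma(shape=r, scale=alpha / (1.0 - alpha))
    return int(rng.poisson(lam))

def distdp_noisy_sum(client_values, epsilon, seed=0):
    """Each client adds eta_i = X_i - Y_i with X_i, Y_i ~ NB(1/M, exp(-eps));
    summed over M clients, the noise follows a discrete Laplace (two-sided
    geometric) distribution, giving eps-DP for a sensitivity-1 count."""
    rng = np.random.default_rng(seed)
    M = len(client_values)
    alpha = np.exp(-epsilon)
    reports = [int(v) + polya_sample(1.0 / M, alpha, rng)
                      - polya_sample(1.0 / M, alpha, rng)
               for v in client_values]
    return sum(reports)   # in deployment, only this sum is revealed
```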
Local Differential Privacy (LocalDP). Local Differential Privacy can be understood as applying the DP definition at the level of an individual client: each client creates a message based on their input, so that whatever inputs the client holds, the probability of producing each possible message is similar. For a set of M clients who each hold a binary value bi, we can estimate the sum and the mean of the bi under ε-LDP by applying randomized response (Warner, 1965). For the sum over M clients, the variance is O(M/ε²), and so the absolute error is proportional to √M/ε. After rescaling by M, the variance of the mean is O(1/(Mε²)), and so the absolute error is proportional to 1/(ε√M) (Yang et al., 2020b). These bounds hold for histograms via "frequency oracles", when each client holds one out of B possibilities — we use Optimal Unary Encoding (Wang et al., 2017) and build a (private) view of the frequency histogram, which can be normalized to get an empirical PMF.
In what follows, we treat ε as a fixed parameter close to 1 (for both LocalDP and DistDP cases).
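For intuition, here is a minimal sketch of randomized response for a single bit with the standard debiased sum estimator; the paper's histogram construction instead uses Optimal Unary Encoding (Wang et al., 2017), which applies an analogous perturb-then-debias step per histogram bucket.

```python
import numpy as np

def rr_perturb(bit, epsilon, rng):
    """Report the true bit with probability e^eps / (1 + e^eps), else flip it."""
    p_keep = np.exp(epsilon) / (1.0 + np.exp(epsilon))
    return bit if rng.random() < p_keep else 1 - bit

def rr_estimate_sum(reports, epsilon):
    """Debias the sum: E[report] = (2p - 1) * b + (1 - p), so invert that."""
    p = np.exp(epsilon) / (1.0 + np.exp(epsilon))
    reports = np.asarray(reports, dtype=float)
    return (reports.sum() - len(reports) * (1.0 - p)) / (2.0 * p - 1.0)

# Usage sketch:
# rng = np.random.default_rng(0)
# reports = [rr_perturb(b, 1.0, rng) for b in true_bits]
# est_sum = rr_estimate_sum(reports, 1.0); est_mean = est_sum / len(reports)
```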
¹Formally, the ECE is defined by Naeini et al. (2015) as Σ_{j=1}^{K} P(j) |o(j) − e(j)|, where P(j) is the (empirical) probability that an example falls in the jth bucket (out of K), while o(j) is the true fraction of positive examples that fall in the jth bin, and e(j) is the fraction predicted by the model.
²Other privacy noise is possible via Binomial (Dwork et al., 2006) or Skellam (Agarwal et al., 2021) noise addition.
Table 1: Simplified variants of error bounds derived in this paper for three privacy models.

                                                      Federated    DistDP                 LocalDP
P/R/A (for a given classifier)                        0            1/(εM)                 1/(ε M^{1/2})
P/R/A (for a score function and varying threshold)    1/B          1/(ε^{2/3} M^{2/3})    1/(ε^{2/3} M^{1/3})
ROC AUC                                               1/B^2        (1 + 1/B) · 1/(εM)     h/(ε M^{1/2})
Expected Calibration Error (ECE)                      1/M^{1/3}    1/M^{1/3}              1/(ε^{1/2} M^{1/4})
2.3 Building score histograms
A key step in our algorithms is building an equi-depth histogram of the clients’ data. That is, given
M
samples as
scalar values, we seek a set of boundaries that partition the range into
B
buckets with (approximately) equal number of
samples per bucket. Fortunately, this is a well-studied problem, so in Appendix A we review how such histograms can
be computed under each privacy model and analyze their accuracy as a function of parameters
h
and
. The novelty in
our work is the way that we can use these histograms to provide classifier quality measures with privacy and accuracy
guarantees across a range of federated models.
For federated computation, we want to represent the distribution of scores with histograms. Based on a set of B buckets that partition [0, 1], we obtain two histograms, n and p, which provide the number of examples whose score falls in each bucket. Specifically, pi and ni give the number of positive and negative examples respectively whose score falls in bucket i. In what follows, we set the histogram bucket boundaries based on the (approximate) quantiles of the score function for the given set of examples E, so that pi + ni ≈ (P + N)/B, where P and N denote the total numbers of positive and negative examples respectively.
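A centralized sketch of this construction (ignoring, for the moment, the private quantile estimation reviewed in Appendix A): bucket boundaries are taken as empirical score quantiles, and p and n count positives and negatives per bucket. The function name and defaults are ours.

```python
import numpy as np

def score_histograms(scores, labels, B=100):
    """Equi-depth bucketing of scores, then per-bucket counts of positive (p)
    and negative (n) examples, so that p_i + n_i is roughly (P + N) / B."""
    scores, labels = np.asarray(scores, float), np.asarray(labels)
    edges = np.quantile(scores, np.linspace(0.0, 1.0, B + 1))
    edges[0], edges[-1] = 0.0, 1.0                   # buckets partition [0, 1]
    idx = np.clip(np.searchsorted(edges, scores, side="right") - 1, 0, B - 1)
    p = np.bincount(idx[labels == 1], minlength=B)
    n = np.bincount(idx[labels != 1], minlength=B)
    return edges, p, n
```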
Overcoming heterogeneity.
A common concern when working in the federated model is data heterogeneity: the data
held by clients may be non-iid (i.e., some clients are more likely to have examples of a single class), and some clients
may hold many more examples than others. By working with histogram representations we are able to overcome
these concerns: the histograms we build are insensitive to how the data is distributed to clients, and the noise added
for privacy is similarly indifferent to data heterogeneity. Thus we can state our results solely in terms of a few basic
parameters (number of clients, number of histogram buckets etc.), independent of any other parameters of the data.
2.4 Summary of our results
Table 1 presents simplified versions of our main results. Here and throughout, we express error bounds in terms of the
expected absolute error, which can also be used as a bound that holds with high probability via standard concentration
inequalities. Without requiring any i.i.d. assumptions for distributed clients, our results clarify the expected magnitude
of the error, which should be small in comparison to the quantity being estimated: tightly bounding the error values
ensures accurate results. All the estimated quantities are in the
[0,1]
range, and for most (binary) classifiers of interest
the four quality metrics will be 1
2. For calibration, the expected calibration error is a small value in [0, 1].
To keep the presentation of these bounds simple, we make the balanced input assumption (from Section 2.1), i.e.,
there are
Θ(M)
positive examples and
Θ(M)
negative examples among the
M
clients. Error bounds are presented as a
function of the number of clients,
M
, the privacy parameter
, the number of buckets used to build a score histogram,
B
, and the height of the hierarchy used,
h
(see Section 3.2). Across the various problems the error increases as we
move from Federated to DistDP to LocalDP. This is expected, as the noise added in each case increases to compensate
for the weaker trust model.
Other trends we see are not as easy to guess. Increasing the number of buckets B often helps reduce the error, but this is not always a factor, particularly for the LocalDP results. Increasing the number of examples, M, typically decreases the error, although the rate of improvement as a function of M varies from 1/M^{1/4} in the worst case to 1/M in the best case. Our experimental findings, presented in Appendix C, agree with this analysis and confirm the anticipated impact of increasing M and of varying the parameters B and h. We observe high accuracy in the Federated case and good accuracy when DP noise is added. Calibration error for DistDP is insensitive to ε, as explained after Theorem 9. These results help build full-stack support for practical federated learning, and show the practicality of federated classifier evaluation.