Differentially private multivariate medians Kelly Ramsay Aukosh Jagannath Shojaeddin Chenouri Abstract

2025-04-26 0 0 1.65MB 42 页 10玖币

侵权投诉

Diﬀerentially private multivariate medians

Kelly Ramsay, Aukosh Jagannath, Shoja’eddin Chenouri

Abstract

Statistical tools which satisfy rigorous privacy guarantees are necessary for modern data

analysis. It is well-known that robustness against contamination is linked to diﬀerential privacy.

Despite this fact, using multivariate medians for diﬀerentially private and robust multivariate

location estimation has not been systematically studied. We develop novel ﬁnite-sample perfor-

mance guarantees for diﬀerentially private multivariate depth-based medians, which are essen-

tially sharp. Our results cover commonly used depth functions, such as the halfspace (or Tukey)

depth, spatial depth, and the integrated dual depth. We show that under Cauchy marginals,

the cost of heavy-tailed location estimation outweighs the cost of privacy. We demonstrate our

results numerically using a Gaussian contamination model in dimensions up to d= 100, and

compare them to a state-of-the-art private mean estimation algorithm. As a by-product of our

investigation, we prove concentration inequalities for the output of the exponential mechanism

about the maximizer of the population objective function. This bound applies to objective

functions that satisfy a mild regularity condition.

1 Introduction

Protecting user privacy is necessary for a safe and fair society. In order to protect user privacy,

many large institutions, including the United States Census Bureau (Abowd et al., 2022), Apple

(Diﬀerential Privacy Team, 2017), and Google (Guevara, 2021), utilize a strong notion of privacy

known as diﬀerential privacy. Diﬀerential privacy is favored because it protects against many ad-

versarial attacks while requiring minimal assumptions on both the data and the adversary (Dwork

et al., 2017). It is now well-known that diﬀerential privacy is linked to robustness against con-

tamination, see (Dwork and Lei, 2009; Avella-Medina, 2020; Liu et al., 2021a) and the references

therein. Surprisingly, however, multivariate medians have not been systematically studied in the

diﬀerential privacy literature. We study this problem here. Note that there are several variants of

diﬀerential privacy. Here, we focus on the setting of pure diﬀerential privacy, which is the strictest

variant of diﬀerential privacy.

To date, the literature on diﬀerentially private location estimation has largely focused on means.

Foygel Barber and Duchi (2014) proved a lower bound on the minimax risk of diﬀerentially private

mean estimation in terms of the number of moments possessed by the population measure. Other

early works focused on the univariate setting (Karwa and Vadhan, 2017; Bun and Steinke, 2019).

Kamath et al. (2018) and Bun et al. (2019) studied diﬀerentially private mean estimation for several

parametric models, including the multivariate Gaussian model. Later, several authors studied

diﬀerentially private mean estimation for sub-Gaussian measures (Cai et al., 2021; Biswas et al.,

2020; Brown et al., 2021; Liu et al., 2021b). After which, Kamath et al. (2020); Liu et al. (2021a) and

Hopkins et al. (2022) studied diﬀerentially private mean estimation for measures with a ﬁnite second

moment. Within this framework, Kamath et al. (2020) considered heavy tails and Liu et al. (2021a)

arXiv:2210.06459v2 [math.ST] 26 Mar 2024

and Hopkins et al. (2022) considered robustness in terms of contamination models. Kamath et al.

(2020) quantiﬁed the sample complexity of their (purely) diﬀerentially private mean estimator in

terms of the number of moments possessed by the population measure. Liu et al. (2021a) quantiﬁed

the relationship between the accuracy of an (approximately) diﬀerentially private mean estimator

and the level of contamination in the observed sample. On a similar note, Hopkins et al. (2022)

showed that there exists a level of contamination under which the sample complexity of their

(purely) diﬀerentially private mean estimator remains unchanged.

On the other hand, if one wishes to estimate location robustly, the canonical estimators are

medians. The diﬀerential privacy literature on medians has focused on the univariate setting: Dwork

and Lei (2009) introduced an approximately diﬀerentially private median estimator with asymptotic

consistency guarantees, Avella-Medina and Brunel (2019a); Brunel and Avella-Medina (2020) then

introduced several median estimators which achieve sub-Gaussian error rates, and Tzamos et al.

(2020) introduced median estimators with minimax optimal sample complexity. Private estimation

of multivariate medians, however, has not been carefully studied.

In this paper, we study diﬀerentially private, multivariate median estimation through the frame-

work of depth functions. Depth-based median estimation is the standard approach to multivariate

median estimation. Here, one has a function, called the depth function, which provides a measure

of centrality. Its maximizer is then the median, which is generally robust. Popular depth-based

medians include the halfspace (or Tukey) median (Tukey, 1974), the spatial median (Vardi and

Zhang, 2000), and the simplicial median (Liu, 1990). For more on the motivation of the depth-

based medians approach, see Section 2 below and Small (1990); Vardi and Zhang (2000); Serﬂing

(2006). Our results apply to a broad class of depth functions, including those listed above, and we

make minimal distributional assumptions. In particular, we do not require concavity of the depth

function or that the population measure has moments of any order.

Our contributions are as follows: Our main result is a general ﬁnite-sample deviations bound for

private multivariate medians based on the exponential mechanism (Theorem 1). We use this result

to give upper bounds on the deviations (and sample complexities) of several, purely diﬀerentially

private multivariate medians arising from depth functions. These bounds are essentially sharp given

recent results of Kamath et al. (2018); Cai et al. (2021) and do not require any moment assumptions

on the population measure. In addition, we provide a fast implementation of a smoothed version

of the integrated dual depth-based median (Cuevas and Fraiman, 2009); we can compute the (non-

private) median of ten thousand 100-dimensional samples in less than one second on a personal

computer. We show that this smoothed version of the integrated dual depth satisﬁes desirable

properties for a depth function and can be approximated in polynomial time with ﬁnite-sample

performance guarantees. As a by-product of our analysis, we uncover a general but elementary

concentration bound (Theorem 11) for the exponential mechanism. We give a novel regularity

condition “(K, F)-regularity” (Condition 3) on the objective function. This regularity condition

implies both an upper bound on the sample complexity of a draw from the exponential mechanism

and an upper bound on the global sensitivity of the objective function. As a second by-product of

our analysis, we uncover uniform non-asymptotic error bounds for several depth functions and a

diﬀerentially private estimator of data depth values.

Before turning to our main results, it is important to note that several authors have recently used

depth functions in the context of diﬀerential privacy, likely due to their robustness properties, with

a heavy focus on the halfspace depth. Depth functions have been used in settings such as ﬁnding a

point in a convex hull (Beimel et al., 2019; Gao and Sheﬀet, 2020), comparing k-norm mechanisms

(Awan and Slavkovi´c, 2021), and improving robustness for mean estimation for Gaussian models

(Brown et al., 2021; Liu et al., 2021a). Another related work is that of Ben-Eliezer et al. (2022),

who focus on various problems concerning high-dimensional quantile estimation. In particular, one

problem Ben-Eliezer et al. (2022) consider is that of privately obtaining one or more points which

have halfspace depth above a given threshold. Though this problem is closely related, it is distinct.

For instance, their algorithm can be used to estimate a point with high halfspace depth. However,

it could not be used to directly estimate the halfspace median, since that would require setting the

threshold to be close to the depth of the empirical halfspace median, which depends on the sample.

To our knowledge, none of the aforementioned works have directly addressed depth-based median

estimation.

2 Robust private multivariate median estimation

In this section, we provide ﬁnite-sample deviation bounds for diﬀerentially private robust multi-

variate location estimation. Let us begin by brieﬂy recalling the following essential notions and

deﬁnitions from diﬀerential privacy. For a textbook introduction to the concept of diﬀerential

privacy see Dwork and Roth (2014).

Adataset of size n×dis a collection of npoints in Rd,Xn= (Xℓ)n

ℓ=1, with repetitions

allowed. Let Dn×dbe the set of all datasets of size n×d. For a dataset Yn, let D(Yn, m) =

{Zn∈Dn×d:|Zn△Yn|=m},that is, D(Yn, m) is the collection of datasets of size n×dwhich

diﬀer from Ynin exactly mpoints. Two datasets Ynand Znare said to be adjacent if Zn∈ D(Yn,1).

Finally, for a dataset Ynwith empirical measure ˆµYndeﬁne

M(Yn) = {˜µ∈ M1(Rd): ˜µ= ˆµZnfor some Zn∈ D(Yn,1)},

where, for a space S,M1(S) denotes the space of probability measures on S.

Let us now recall the notion of diﬀerential privacy. Suppose that we observe a dataset Xn

comprised of ni.i.d. samples from some unknown measure µ∈ M1(Rd) and produce a statistic, ˜

θn,

whose law conditionally on the samples depends on their empirical measure, ˆµn. Let us denote this

law by Pˆµn. The goal of diﬀerential privacy is to produce a statistic such that Pˆµnis non-degenerate

and, more precisely, obeys a certain “privacy guarantee” which is deﬁned as follows. Following the

privacy literature, we call the algorithm that takes ˆµnand returns the (random) statistic, ˜

θn, a

mechanism. We also follow the standard abuse of terminology and call ˜

θnthe mechanism when it

is clear from context.

Deﬁnition 1.A mechanism ˜

θnis ϵ-diﬀerentially private if for any dataset Yn, any ˜µ∈f

M(Yn), and

any measurable set B, it holds that

PˆµYn(B)≤eϵP˜µ(B).(2.1)

Here, ϵ > 0 is the privacy parameter, for which smaller values enforce stricter levels of privacy.

Many general purpose diﬀerentially private mechanisms rely on the concept of global sensitivity.

The global sensitivity GSnof a function ϕ:Rd× M1(Rd)→Ris given by

GSn(ϕ) = sup

Xn∈Dn×d,

˜µn∈

M(Xn)

sup

x∈Rd|ϕ(x, ˆµn)−ϕ(x, ˜µn)|.

The simplest diﬀerentially private mechanisms are the additive noise mechanisms, e.g., the

Laplace and Gaussian mechanisms. These mechanisms require the non-private estimator to have

ﬁnite global sensitivity. Unfortunately, virtually all standard location estimators, including both

univariate and multivariate medians, have inﬁnite global sensitivity, see e.g., (Avella-Medina and

Brunel, 2019b).

However, depth functions, the objective functions for multivariate medians, have ﬁnite global

sensitivity. This fact makes them a good candidate for use with the exponential mechanism, which

is a general mechanism used to produce diﬀerentially private estimators from non-private estimators

that are maximizers of a given objective function. For a given β > 0, base measure (i.e., prior)

π∈ M1(Rd) and objective function ϕ:Rd×M1(Rd)→R, suppose that Rexp(βϕ(θ, ν))dπ(θ)<∞

for all ν∈ M1(Rd), then the exponential mechanism with base measure πis given by

Qˆµn,β ∝exp(βϕ(θ, ˆµn))dπ. (2.2)

A sample from the exponential mechanism ˜

θn∼Qˆµn,β then provides an estimate of the population

value θ0= argmaxRdϕ(θ, µ). In regard to the choice of β, it was shown by McSherry and Talwar

(2007) that if ˜

θnis produced by the exponential mechanism, with β≤ϵ/2GSn(ϕ) then ˜

θnis

ϵ-diﬀerentially private.

When studying multivariate median estimation, one might try to naively extend prior work on

univariate medians to the multivariate setting, i.e., the coordinate-wise median. This, however, is

well-known to be an unsatisfactory as a measure of center (Small, 1990). Indeed, the coordinate-

wise median can reside outside the convex hull of the data (Serﬂing, 2006). Instead, the most

popular framework for multivariate median estimation is through maximizing a depth function.

Depth functions are functions of the form D: Rd× M1(Rd)→R+which, given a measure

and a point in the domain, assign a number. This number represents how central that point is

with respect to the measure, it is called the depth of that point. Given a depth function, the

corresponding median is then deﬁned to be a maximizer of this depth, or simply the deepest point:

MED(µ; D) = argmax

x∈Rd

D(x, µ).

Depth-based medians are generally robust, in the sense that they are not aﬀected by outliers. For

instance, depth-based medians have a high breakdown point and favorable properties related to the

inﬂuence function (Chen and Tyler, 2002; Zuo, 2004). This makes them an interesting direction of

study in a diﬀerentially private setting. Examples of commonly used depth functions include the

halfspace depth (or Tukey depth) (Tukey, 1974), the simplicial depth (Liu, 1988), the spatial depth

(Vardi and Zhang, 2000; Serﬂing, 2002), the integrated dual depth (IDD) (Cuevas and Fraiman,

2009), and the integrated-rank-weighted (IRW) depth (Ramsay et al., 2019). For a precise deﬁnition

of the depth functions we discuss here, as well as a discussion of their basic properties, see Section 5.

If, instead of the median, one is interested in computing depth values themselves, this can be done

privately with an additive noise mechanism, see Section 7.

Our main result relies on a regularity condition, for which we need to introduce the following

notation. Let Bdenote the space of Borel functions from Rdto [0,1]. For a family of functions,

F⊆B, deﬁne a pseudometric on M1(Rd), dF(µ, ν) = supg∈F|Rgd(µ−ν)|, where µ, ν ∈ M1(Rd).

Recall that many standard metrics on probability measures can be written in this fashion. For

example, taking Fto be the set of indicator functions of Borel sets yields the Total Variation

distance and taking Fto be the set of indicator functions of semi-inﬁnite rectangles yields the

Kolmogorov–Smirnov distance. We then deﬁne the following important regularity condition.

Deﬁnition 2.We say that D is (K, F)-regular if there exists a class of functions F⊂Bsuch that

D(x, ·) is K-Lipschitz with respect to the F-pseudometric uniformly in x, i.e., for all µ, ν ∈ M1(Rd)

sup

x|D(x, µ)−D(x, ν)| ≤ KdF(µ, ν).

The property (K, F)-regularity can be thought of as a robustness condition in the sense that

(K, F)-regular functions are aﬀected by extreme observations in a very limited manner, through

the boundedness of g∈F. Speciﬁcally, one can show that (K, F)-regularity implies the bound

GSn(D) ≤K/n on the global sensitivity, see Lemma 28. Thus, for the proposed private median

estimator, we will assume that D is (K, F)-regular (Condition 3) and take β=nϵ/2K, which

ensures ˜

θnis ϵ-diﬀerentially private, see Lemma 27. In our applications, we will typically take F

to be a class of functions with low complexity (in the sense of Vapnik–Chervonenkis dimension).

We now state the necessary regularity conditions on the pair (µ, D) that are required for our

main result. In the following, let VC(F) denote the Vapnik–Chervonenkis dimension of F.

Condition 1.The function D(·, µ) has a maximizer.

Condition 2.The map x7→ D(x, µ) is L-Lipschitz a.e. for some L > 0.

Condition 3.There is some K > 0 and some family of functions F⊆Bwith VC(F)<∞such

that D is (K, F)-regular.

To state our main result, we must also introduce the discrepancy function. Let ED,µ ={θ:

D(θ, µ) = max

x∈RdD(x, µ)}denote the set of maximizers of D(·, µ) for ﬁxed D and µ. For a set Aand

r > 0, let Br(A) = {x: min

y∈A∥x−y∥ ≤ r}. The discrepancy function of the pair (D, µ) is

α(t) = D(θ0, µ)−sup

x∈Bc

t(ED,µ)

D(x, µ),(2.3)

where θ0∈ED,µ.In Table 1 below, we explicitly calculate αfor several depth functions. Note that

αis a non-decreasing function because the optimization problem involved is over decreasing sets.

We are now ready to state our main result, which provides bounds on the ﬁnite-sample deviations

of the proposed diﬀerentially private multivariate medians via depths. For concreteness, we present

our results for two of the most popular choices of priors, namely Gaussian measures and the uniform

measure on AR(y), the d-dimensional cube of side-length Rcentered at y. In the following, let

dR,y(x) denote the minimum distance from a point x= (x1, . . . , xd) to a face of the cube AR(y):

dR,y(x) = min

1≤i≤d|xi−yi±R/2|. Similarly, denote the minimum distance from a set B⊂Rdto a

face of the cube AR(y) by dR,y(B) = inf

x∈BdR,y(x) and let d(x, B) = inf

y∈B∥x−y∥denote the usual

point-to-set distance. Deﬁne α−1(t) = sup{r≥0: α(r) = t}, where one notes that α−1exists

because αis a monotone function. Lastly, we write a≲bif a≤Cb for some universal constant

C > 0.

Theorem 1. If Conditions 1–3 hold, then the following holds:

(i) Suppose that L≥1. If π=N(θπ, σ2

πI), then there exists a universal constant c > 0such that

文档加载中……请稍候！
如果长时间未打开，您也可以点击刷新试试。

下载文档到电脑，查找使用更方便

10 玖币 0人已下载

立即下载

摘要：

DifferentiallyprivatemultivariatemediansKellyRamsay,AukoshJagannath,Shoja’eddinChenouriAbstractStatisticaltoolswhichsatisfyrigorousprivacyguaranteesarenecessaryformoderndataanalysis.Itiswell-knownthatrobustnessagainstcontaminationislinkedtodifferentialprivacy.Despitethisfact,usingmultivariatemedians...

展开>> 收起<<

Differentially private multivariate medians Kelly Ramsay Aukosh Jagannath Shojaeddin Chenouri Abstract.pdf

共42页,预览5页

还剩页未读，继续阅读

声明：本站为文档C2C交易模式，即用户上传的文档直接被用户下载，本站只是中间服务平台，本站所有文档下载所得的收益归上传人(含作者)所有。玖贝云文库仅提供信息存储空间，仅对用户上传内容的表现方式做保护处理，对上载内容本身不做任何修改或编辑。若文档所含内容侵犯了您的版权或隐私，请立即通知玖贝云文库，我们立即给予删除！

Differentially private multivariate medians Kelly Ramsay Aukosh Jagannath Shojaeddin Chenouri Abstract

相关推荐

开通VIP享超值会员特权

作者详情

相关内容

热门标签

举报选择: