Differentially private multivariate medians Kelly Ramsay Aukosh Jagannath Shojaeddin Chenouri Abstract

2025-04-26 0 0 1.65MB 42 页 10玖币
侵权投诉
Differentially private multivariate medians
Kelly Ramsay, Aukosh Jagannath, Shoja’eddin Chenouri
Abstract
Statistical tools which satisfy rigorous privacy guarantees are necessary for modern data
analysis. It is well-known that robustness against contamination is linked to differential privacy.
Despite this fact, using multivariate medians for differentially private and robust multivariate
location estimation has not been systematically studied. We develop novel finite-sample perfor-
mance guarantees for differentially private multivariate depth-based medians, which are essen-
tially sharp. Our results cover commonly used depth functions, such as the halfspace (or Tukey)
depth, spatial depth, and the integrated dual depth. We show that under Cauchy marginals,
the cost of heavy-tailed location estimation outweighs the cost of privacy. We demonstrate our
results numerically using a Gaussian contamination model in dimensions up to d= 100, and
compare them to a state-of-the-art private mean estimation algorithm. As a by-product of our
investigation, we prove concentration inequalities for the output of the exponential mechanism
about the maximizer of the population objective function. This bound applies to objective
functions that satisfy a mild regularity condition.
1 Introduction
Protecting user privacy is necessary for a safe and fair society. In order to protect user privacy,
many large institutions, including the United States Census Bureau (Abowd et al., 2022), Apple
(Differential Privacy Team, 2017), and Google (Guevara, 2021), utilize a strong notion of privacy
known as differential privacy. Differential privacy is favored because it protects against many ad-
versarial attacks while requiring minimal assumptions on both the data and the adversary (Dwork
et al., 2017). It is now well-known that differential privacy is linked to robustness against con-
tamination, see (Dwork and Lei, 2009; Avella-Medina, 2020; Liu et al., 2021a) and the references
therein. Surprisingly, however, multivariate medians have not been systematically studied in the
differential privacy literature. We study this problem here. Note that there are several variants of
differential privacy. Here, we focus on the setting of pure differential privacy, which is the strictest
variant of differential privacy.
To date, the literature on differentially private location estimation has largely focused on means.
Foygel Barber and Duchi (2014) proved a lower bound on the minimax risk of differentially private
mean estimation in terms of the number of moments possessed by the population measure. Other
early works focused on the univariate setting (Karwa and Vadhan, 2017; Bun and Steinke, 2019).
Kamath et al. (2018) and Bun et al. (2019) studied differentially private mean estimation for several
parametric models, including the multivariate Gaussian model. Later, several authors studied
differentially private mean estimation for sub-Gaussian measures (Cai et al., 2021; Biswas et al.,
2020; Brown et al., 2021; Liu et al., 2021b). After which, Kamath et al. (2020); Liu et al. (2021a) and
Hopkins et al. (2022) studied differentially private mean estimation for measures with a finite second
moment. Within this framework, Kamath et al. (2020) considered heavy tails and Liu et al. (2021a)
1
arXiv:2210.06459v2 [math.ST] 26 Mar 2024
and Hopkins et al. (2022) considered robustness in terms of contamination models. Kamath et al.
(2020) quantified the sample complexity of their (purely) differentially private mean estimator in
terms of the number of moments possessed by the population measure. Liu et al. (2021a) quantified
the relationship between the accuracy of an (approximately) differentially private mean estimator
and the level of contamination in the observed sample. On a similar note, Hopkins et al. (2022)
showed that there exists a level of contamination under which the sample complexity of their
(purely) differentially private mean estimator remains unchanged.
On the other hand, if one wishes to estimate location robustly, the canonical estimators are
medians. The differential privacy literature on medians has focused on the univariate setting: Dwork
and Lei (2009) introduced an approximately differentially private median estimator with asymptotic
consistency guarantees, Avella-Medina and Brunel (2019a); Brunel and Avella-Medina (2020) then
introduced several median estimators which achieve sub-Gaussian error rates, and Tzamos et al.
(2020) introduced median estimators with minimax optimal sample complexity. Private estimation
of multivariate medians, however, has not been carefully studied.
In this paper, we study differentially private, multivariate median estimation through the frame-
work of depth functions. Depth-based median estimation is the standard approach to multivariate
median estimation. Here, one has a function, called the depth function, which provides a measure
of centrality. Its maximizer is then the median, which is generally robust. Popular depth-based
medians include the halfspace (or Tukey) median (Tukey, 1974), the spatial median (Vardi and
Zhang, 2000), and the simplicial median (Liu, 1990). For more on the motivation of the depth-
based medians approach, see Section 2 below and Small (1990); Vardi and Zhang (2000); Serfling
(2006). Our results apply to a broad class of depth functions, including those listed above, and we
make minimal distributional assumptions. In particular, we do not require concavity of the depth
function or that the population measure has moments of any order.
Our contributions are as follows: Our main result is a general finite-sample deviations bound for
private multivariate medians based on the exponential mechanism (Theorem 1). We use this result
to give upper bounds on the deviations (and sample complexities) of several, purely differentially
private multivariate medians arising from depth functions. These bounds are essentially sharp given
recent results of Kamath et al. (2018); Cai et al. (2021) and do not require any moment assumptions
on the population measure. In addition, we provide a fast implementation of a smoothed version
of the integrated dual depth-based median (Cuevas and Fraiman, 2009); we can compute the (non-
private) median of ten thousand 100-dimensional samples in less than one second on a personal
computer. We show that this smoothed version of the integrated dual depth satisfies desirable
properties for a depth function and can be approximated in polynomial time with finite-sample
performance guarantees. As a by-product of our analysis, we uncover a general but elementary
concentration bound (Theorem 11) for the exponential mechanism. We give a novel regularity
condition “(K, F)-regularity” (Condition 3) on the objective function. This regularity condition
implies both an upper bound on the sample complexity of a draw from the exponential mechanism
and an upper bound on the global sensitivity of the objective function. As a second by-product of
our analysis, we uncover uniform non-asymptotic error bounds for several depth functions and a
differentially private estimator of data depth values.
Before turning to our main results, it is important to note that several authors have recently used
depth functions in the context of differential privacy, likely due to their robustness properties, with
a heavy focus on the halfspace depth. Depth functions have been used in settings such as finding a
point in a convex hull (Beimel et al., 2019; Gao and Sheffet, 2020), comparing k-norm mechanisms
(Awan and Slavkovi´c, 2021), and improving robustness for mean estimation for Gaussian models
2
(Brown et al., 2021; Liu et al., 2021a). Another related work is that of Ben-Eliezer et al. (2022),
who focus on various problems concerning high-dimensional quantile estimation. In particular, one
problem Ben-Eliezer et al. (2022) consider is that of privately obtaining one or more points which
have halfspace depth above a given threshold. Though this problem is closely related, it is distinct.
For instance, their algorithm can be used to estimate a point with high halfspace depth. However,
it could not be used to directly estimate the halfspace median, since that would require setting the
threshold to be close to the depth of the empirical halfspace median, which depends on the sample.
To our knowledge, none of the aforementioned works have directly addressed depth-based median
estimation.
2 Robust private multivariate median estimation
In this section, we provide finite-sample deviation bounds for differentially private robust multi-
variate location estimation. Let us begin by briefly recalling the following essential notions and
definitions from differential privacy. For a textbook introduction to the concept of differential
privacy see Dwork and Roth (2014).
Adataset of size n×dis a collection of npoints in Rd,Xn= (X)n
=1, with repetitions
allowed. Let Dn×dbe the set of all datasets of size n×d. For a dataset Yn, let D(Yn, m) =
{ZnDn×d:|ZnYn|=m},that is, D(Yn, m) is the collection of datasets of size n×dwhich
differ from Ynin exactly mpoints. Two datasets Ynand Znare said to be adjacent if Zn∈ D(Yn,1).
Finally, for a dataset Ynwith empirical measure ˆµYndefine
f
M(Yn) = {˜µ∈ M1(Rd): ˜µ= ˆµZnfor some Zn∈ D(Yn,1)},
where, for a space S,M1(S) denotes the space of probability measures on S.
Let us now recall the notion of differential privacy. Suppose that we observe a dataset Xn
comprised of ni.i.d. samples from some unknown measure µ∈ M1(Rd) and produce a statistic, ˜
θn,
whose law conditionally on the samples depends on their empirical measure, ˆµn. Let us denote this
law by Pˆµn. The goal of differential privacy is to produce a statistic such that Pˆµnis non-degenerate
and, more precisely, obeys a certain “privacy guarantee” which is defined as follows. Following the
privacy literature, we call the algorithm that takes ˆµnand returns the (random) statistic, ˜
θn, a
mechanism. We also follow the standard abuse of terminology and call ˜
θnthe mechanism when it
is clear from context.
Definition 1.A mechanism ˜
θnis ϵ-differentially private if for any dataset Yn, any ˜µf
M(Yn), and
any measurable set B, it holds that
PˆµYn(B)eϵP˜µ(B).(2.1)
Here, ϵ > 0 is the privacy parameter, for which smaller values enforce stricter levels of privacy.
Many general purpose differentially private mechanisms rely on the concept of global sensitivity.
The global sensitivity GSnof a function ϕ:Rd× M1(Rd)Ris given by
GSn(ϕ) = sup
XnDn×d,
˜µn
f
M(Xn)
sup
xRd|ϕ(x, ˆµn)ϕ(x, ˜µn)|.
The simplest differentially private mechanisms are the additive noise mechanisms, e.g., the
Laplace and Gaussian mechanisms. These mechanisms require the non-private estimator to have
3
finite global sensitivity. Unfortunately, virtually all standard location estimators, including both
univariate and multivariate medians, have infinite global sensitivity, see e.g., (Avella-Medina and
Brunel, 2019b).
However, depth functions, the objective functions for multivariate medians, have finite global
sensitivity. This fact makes them a good candidate for use with the exponential mechanism, which
is a general mechanism used to produce differentially private estimators from non-private estimators
that are maximizers of a given objective function. For a given β > 0, base measure (i.e., prior)
π∈ M1(Rd) and objective function ϕ:Rd×M1(Rd)R, suppose that Rexp(βϕ(θ, ν))(θ)<
for all ν∈ M1(Rd), then the exponential mechanism with base measure πis given by
Qˆµnexp(βϕ(θ, ˆµn)). (2.2)
A sample from the exponential mechanism ˜
θnQˆµnthen provides an estimate of the population
value θ0= argmaxRdϕ(θ, µ). In regard to the choice of β, it was shown by McSherry and Talwar
(2007) that if ˜
θnis produced by the exponential mechanism, with βϵ/2GSn(ϕ) then ˜
θnis
ϵ-differentially private.
When studying multivariate median estimation, one might try to naively extend prior work on
univariate medians to the multivariate setting, i.e., the coordinate-wise median. This, however, is
well-known to be an unsatisfactory as a measure of center (Small, 1990). Indeed, the coordinate-
wise median can reside outside the convex hull of the data (Serfling, 2006). Instead, the most
popular framework for multivariate median estimation is through maximizing a depth function.
Depth functions are functions of the form D: Rd× M1(Rd)R+which, given a measure
and a point in the domain, assign a number. This number represents how central that point is
with respect to the measure, it is called the depth of that point. Given a depth function, the
corresponding median is then defined to be a maximizer of this depth, or simply the deepest point:
MED(µ; D) = argmax
xRd
D(x, µ).
Depth-based medians are generally robust, in the sense that they are not affected by outliers. For
instance, depth-based medians have a high breakdown point and favorable properties related to the
influence function (Chen and Tyler, 2002; Zuo, 2004). This makes them an interesting direction of
study in a differentially private setting. Examples of commonly used depth functions include the
halfspace depth (or Tukey depth) (Tukey, 1974), the simplicial depth (Liu, 1988), the spatial depth
(Vardi and Zhang, 2000; Serfling, 2002), the integrated dual depth (IDD) (Cuevas and Fraiman,
2009), and the integrated-rank-weighted (IRW) depth (Ramsay et al., 2019). For a precise definition
of the depth functions we discuss here, as well as a discussion of their basic properties, see Section 5.
If, instead of the median, one is interested in computing depth values themselves, this can be done
privately with an additive noise mechanism, see Section 7.
Our main result relies on a regularity condition, for which we need to introduce the following
notation. Let Bdenote the space of Borel functions from Rdto [0,1]. For a family of functions,
FB, define a pseudometric on M1(Rd), dF(µ, ν) = supgF|Rgd(µν)|, where µ, ν ∈ M1(Rd).
Recall that many standard metrics on probability measures can be written in this fashion. For
example, taking Fto be the set of indicator functions of Borel sets yields the Total Variation
distance and taking Fto be the set of indicator functions of semi-infinite rectangles yields the
Kolmogorov–Smirnov distance. We then define the following important regularity condition.
4
Definition 2.We say that D is (K, F)-regular if there exists a class of functions FBsuch that
D(x, ·) is K-Lipschitz with respect to the F-pseudometric uniformly in x, i.e., for all µ, ν ∈ M1(Rd)
sup
x|D(x, µ)D(x, ν)| ≤ KdF(µ, ν).
The property (K, F)-regularity can be thought of as a robustness condition in the sense that
(K, F)-regular functions are affected by extreme observations in a very limited manner, through
the boundedness of gF. Specifically, one can show that (K, F)-regularity implies the bound
GSn(D) K/n on the global sensitivity, see Lemma 28. Thus, for the proposed private median
estimator, we will assume that D is (K, F)-regular (Condition 3) and take β=nϵ/2K, which
ensures ˜
θnis ϵ-differentially private, see Lemma 27. In our applications, we will typically take F
to be a class of functions with low complexity (in the sense of Vapnik–Chervonenkis dimension).
We now state the necessary regularity conditions on the pair (µ, D) that are required for our
main result. In the following, let VC(F) denote the Vapnik–Chervonenkis dimension of F.
Condition 1.The function D(·, µ) has a maximizer.
Condition 2.The map x7→ D(x, µ) is L-Lipschitz a.e. for some L > 0.
Condition 3.There is some K > 0 and some family of functions FBwith VC(F)<such
that D is (K, F)-regular.
To state our main result, we must also introduce the discrepancy function. Let ED={θ:
D(θ, µ) = max
xRdD(x, µ)}denote the set of maximizers of D(·, µ) for fixed D and µ. For a set Aand
r > 0, let Br(A) = {x: min
yAxy∥ ≤ r}. The discrepancy function of the pair (D, µ) is
α(t) = D(θ0, µ)sup
xBc
t(ED)
D(x, µ),(2.3)
where θ0ED.In Table 1 below, we explicitly calculate αfor several depth functions. Note that
αis a non-decreasing function because the optimization problem involved is over decreasing sets.
We are now ready to state our main result, which provides bounds on the finite-sample deviations
of the proposed differentially private multivariate medians via depths. For concreteness, we present
our results for two of the most popular choices of priors, namely Gaussian measures and the uniform
measure on AR(y), the d-dimensional cube of side-length Rcentered at y. In the following, let
dR,y(x) denote the minimum distance from a point x= (x1, . . . , xd) to a face of the cube AR(y):
dR,y(x) = min
1id|xiyi±R/2|. Similarly, denote the minimum distance from a set BRdto a
face of the cube AR(y) by dR,y(B) = inf
xBdR,y(x) and let d(x, B) = inf
yBxydenote the usual
point-to-set distance. Define α1(t) = sup{r0: α(r) = t}, where one notes that α1exists
because αis a monotone function. Lastly, we write abif aCb for some universal constant
C > 0.
Theorem 1. If Conditions 1–3 hold, then the following holds:
(i) Suppose that L1. If π=N(θπ, σ2
πI), then there exists a universal constant c > 0such that
5
摘要:

DifferentiallyprivatemultivariatemediansKellyRamsay,AukoshJagannath,Shoja’eddinChenouriAbstractStatisticaltoolswhichsatisfyrigorousprivacyguaranteesarenecessaryformoderndataanalysis.Itiswell-knownthatrobustnessagainstcontaminationislinkedtodifferentialprivacy.Despitethisfact,usingmultivariatemedians...

展开>> 收起<<
Differentially private multivariate medians Kelly Ramsay Aukosh Jagannath Shojaeddin Chenouri Abstract.pdf

共42页,预览5页

还剩页未读, 继续阅读

声明:本站为文档C2C交易模式,即用户上传的文档直接被用户下载,本站只是中间服务平台,本站所有文档下载所得的收益归上传人(含作者)所有。玖贝云文库仅提供信息存储空间,仅对用户上传内容的表现方式做保护处理,对上载内容本身不做任何修改或编辑。若文档所含内容侵犯了您的版权或隐私,请立即通知玖贝云文库,我们立即给予删除!
分类:图书资源 价格:10玖币 属性:42 页 大小:1.65MB 格式:PDF 时间:2025-04-26

开通VIP享超值会员特权

  • 多端同步记录
  • 高速下载文档
  • 免费文档工具
  • 分享文档赚钱
  • 每日登录抽奖
  • 优质衍生服务
/ 42
客服
关注