DiPPS: Differentially Private Propensity Scores for Bias Correction
Liangwei Chen*†1, Valentin Hartmann*2, Robert West2
1Google
2EPFL
chenlw@google.com, valentin.hartmann@epfl.ch, robert.west@epfl.ch
Abstract
In surveys, it is typically up to the individuals to decide if they
want to participate or not, which leads to participation bias:
the individuals willing to share their data might not be repre-
sentative of the entire population. Similarly, there are cases
where one does not have direct access to any data of the target
population and has to resort to publicly available proxy data
sampled from a different distribution. In this paper, we present
Differentially Private Propensity Scores for Bias Correction
(DiPPS), a method for approximating the true data distribution
of interest in both of the above settings. We assume that the
data analyst has access to a dataset $\tilde{D}$ that was sampled from
the distribution of interest in a biased way. As individuals may
be more willing to share their data when given a privacy guar-
antee, we further assume that the analyst is allowed locally
differentially private access to a set of samples $D$ from the true, unbiased distribution. Each data point from the private, unbiased dataset $D$ is mapped to a probability distribution over clusters (learned from the biased dataset $\tilde{D}$), from which
a single cluster is sampled via the exponential mechanism and
shared with the data analyst. This way, the analyst gathers a
distribution over clusters, which they use to compute propen-
sity scores for the points in the biased $\tilde{D}$, which are in turn used to reweight the points in $\tilde{D}$ to approximate the true data
distribution. It is now possible to compute any function on
the resulting reweighted dataset without further access to the
private $D$. In experiments on datasets from various domains,
we show that DiPPS successfully brings the distribution of
the available dataset closer to the distribution of interest in
terms of Wasserstein distance. We further show that this results
in improved estimates for different statistics, in many cases
even outperforming differential privacy mechanisms that are
specifically designed for these statistics.
Introduction
Participation bias occurs when the non-participation of certain individuals in a study biases the collected data. Participation bias has been identified as a problem in
surveys in many different domains:
In a sexuality survey (Dunne et al. 1997), participants had higher levels of education, were less politically conservative, less harm-avoidant, and more sexually liberal than non-participants.

*These authors contributed equally. †Work done while at EPFL.
Copyright © 2023, Association for the Advancement of Artificial Intelligence (www.aaai.org). All rights reserved.
In a longitudinal health study (Lissner et al. 2003), individuals who participated not only in the initial but also in later stages of the study were better educated and healthier than individuals who dropped out during the study period.
In a study on employees' sick leave (Van Goor and Verhage 1999), people who were on sick leave less often and for shorter periods were more likely to participate than others.
Participation bias may also occur outside of traditional
surveys. Nowadays many software products ask their users
whether they want to share anonymous usage statistics. Take
the example of a browser developer that wants to collect
data about how often different features of their browser are
used. Users who refuse to share their data with the browser
developer (non-participants) might do so because they are
more concerned about their privacy than users who do share
their data (participants) (Korkeila et al. 2001; Rönmark et al. 1999). Non-participants would hence be more likely to use
privacy-related features such as tracking protection or the
private browsing mode. Consequently, the browser developer
would underestimate the use of such features if they based their analysis solely on the data of the participants. Note
that the promise of collecting anonymous usage statistics is
of little worth to users, since their identity is typically
still known to the browser developer, e.g., via the account
with which they logged into the browser, via their IP address,
etc. Even if the developer does not associate the collected
records with the identity of the user, users are still at risk of de-
anonymization (Narayanan and Shmatikov 2008; Sweeney
2000).
A related problem occurs when initially no data from the
target population exists at all and one has to resort to proxy
data. E.g., a linguist might want to study language patterns
in private messages, but only has access to public Twitter
messages. The developer of a camera app for smartphones
might want to improve the post-processing algorithms by
analyzing the most typical lighting conditions in their users’
photos, but since the photos are stored locally, the developer
must resort to publicly available photos on platforms like
Flickr. In these cases, the public proxy data and the target data
come from the same domain, but differ in their distribution:
Private messages contain more intimate information than
public ones and people select only their most beautiful photos
to upload to Flickr.
For this reason, great efforts are made to convince more
individuals to participate in studies (de Winter et al. 2005). A
particularly convincing argument for participation might be
the promise of local differential privacy. Users might be more
willing to share, e.g., a scalar differentially private value
than a full plain-text vector with information about them-
selves (Warner 1965). A meta-analysis by Lensvelt-Mulders
et al. (2005) based on 38 studies shows that individuals are
more prone to providing correct answers to survey questions
when given a differential privacy guarantee via the random-
ized response mechanism (Warner 1965). Companies such as
Google (Erlingsson, Pihur, and Korolova 2014), Apple (2017)
and Microsoft (Ding, Kulkarni, and Yekhanin 2017) have
recognized this potential and implemented data collection
mechanisms in their products that provide local differential
privacy.
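To make the randomized response mechanism concrete, the following is a minimal sketch of Warner's binary variant in Python. The flipping probability and the helper names are illustrative assumptions, not the exact protocols used in the deployments cited above.

```python
import math
import random


def randomized_response(true_answer: bool, epsilon: float) -> bool:
    """Warner-style binary randomized response (user side).

    The true answer is reported with probability p = e^eps / (e^eps + 1)
    and flipped otherwise; the worst-case output-probability ratio is then
    p / (1 - p) = e^eps, i.e., eps-local differential privacy.
    """
    p_truth = math.exp(epsilon) / (math.exp(epsilon) + 1.0)
    return true_answer if random.random() < p_truth else (not true_answer)


def debias_yes_rate(reported_yes_rate: float, epsilon: float) -> float:
    """Analyst side: unbiased estimate of the true "yes" rate.

    If the true rate is q, the expected reported rate is
    r = q * p + (1 - q) * (1 - p), hence q = (r - (1 - p)) / (2p - 1).
    """
    p_truth = math.exp(epsilon) / (math.exp(epsilon) + 1.0)
    return (reported_yes_rate - (1.0 - p_truth)) / (2.0 * p_truth - 1.0)
```

For ε = 1 roughly one in four answers is flipped (p ≈ 0.73), yet aggregate statistics can still be recovered, which is what makes such mechanisms attractive for the large-scale deployments mentioned above.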
In this paper, we assume that there are two sets of indi-
viduals: participants, to whose data we have full access, and
non-participants, to whose data we only have locally differen-
tially private access (for individuals to whose data we do not
even have locally differentially private access, see ‘Problem
Definition’). Our method, Differentially Private Propensity
Scores for Bias Correction (DiPPS), uses the differentially
private access to the non-participants’ data to estimate the
data distribution of all individuals. It reduces the participa-
tion bias that would occur from using only the participants’
data for drawing conclusions about the entire population. Our
method can even be used when there exist no participants;
then, the participant data is replaced by a proxy dataset, and
only the non-participants’ data distribution is approximated.
The differentially private value that the non-participants share
with the data analyst is the value of a single categorical vari-
able and requires only one round of communication. This
makes DiPPS suitable even for offline settings such as offline
surveys, and further makes it easy to explain to laypeople what data is shared and how privacy is preserved.
Overview of DiPPS. Our method consists of three main
steps.
1. First, a clustering model is trained on the participant data.
This model transforms each data point into a probability
distribution over a finite number of classes. In our case,
this is a probabilistic clustering model, but other imple-
mentations such as dimensionality reduction models are
possible as well.
2. Then, this model is shipped to the non-participants. They
apply the clustering model to their data and sample a
single value from the resulting probability distribution
in a locally differentially private way. Afterwards, they
return this value (see the sketch after this list).
3. In the last step, the values returned by the non-partici-
pants are used to estimate the propensity of each of the
participants’ data points to be indeed part of the partic-
ipant dataset. These propensity scores are then used to
reweight the participant data points to either model the
distribution of all individuals, or the distribution of only
the non-participants in the case of a proxy dataset.
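A minimal sketch of the non-participant side of step 2 follows, using the exponential mechanism mentioned in the abstract. The choice of utility function (the probability the clustering model assigns to each cluster, with sensitivity at most 1) and the function name are assumptions made for illustration; the paper's concrete clustering model and utility may differ.

```python
import math
import random
from typing import Sequence


def sample_cluster_ldp(cluster_probs: Sequence[float], epsilon: float) -> int:
    """Non-participant side: report one cluster index via the exponential mechanism.

    cluster_probs: the probability distribution over clusters that the
        (pre-trained) clustering model assigns to this individual's data point.
    Using u(x, c) = cluster_probs[c] as utility (sensitivity at most 1),
    sampling cluster c with weight exp(epsilon * u / 2) provides
    epsilon-local differential privacy.
    """
    weights = [math.exp(epsilon * p / 2.0) for p in cluster_probs]
    threshold = random.random() * sum(weights)
    cumulative = 0.0
    for index, weight in enumerate(weights):
        cumulative += weight
        if threshold < cumulative:
            return index
    return len(weights) - 1  # guard against floating-point round-off


# Hypothetical example: a point that the model places mostly in cluster 2.
print(sample_cluster_ldp([0.1, 0.2, 0.7], epsilon=1.0))
```

Because the reported value is a single categorical cluster index, one report per individual suffices, which is what enables the single round of communication.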
The non-participants are guaranteed local differential privacy:
Local differential privacy. Differential privacy (Dwork et al.
2006) is a privacy notion that is widely used in academic
research and increasingly also in industry applications. It
states that the output of a randomized mechanism that is in-
voked on a database should reveal only very little information
about any individual record in the database. Local differential
privacy applies the concept to distributed databases, where
each individual holds their own data (Kasiviswanathan et al.
2011):
Definition 1. Let $M$ be a randomized mechanism and let $\varepsilon > 0$. $M$ provides $\varepsilon$-local differential privacy if, for all pairs of possible values $x, x'$ of an individual's data and all possible sets of outputs $S$:
$$\Pr[M(x) \in S] \le e^{\varepsilon} \Pr[M(x') \in S].$$
An $\varepsilon$-differential privacy guarantee for a mechanism $M$ is an upper bound on the amount by which an adversary can update any prior belief about the database, given the output of $M$ (Kasiviswanathan and Smith 2014). Smaller values of $\varepsilon$ mean more privacy, larger values less privacy. A typical choice is $\varepsilon = 1$.
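As a worked instance of Definition 1, consider the binary randomized response mechanism from the introduction, which reports the true value with probability $p$ and flips it otherwise (a standard textbook calculation, not specific to DiPPS):

```latex
% For any two inputs x, x' and any output s, the worst-case ratio is
\[
  \frac{\Pr[M(x) = s]}{\Pr[M(x') = s]} \;\le\; \frac{p}{1 - p},
  \qquad\text{so}\qquad
  \varepsilon \;=\; \ln\frac{p}{1 - p}.
\]
% The typical choice eps = 1 therefore corresponds to
\[
  p \;=\; \frac{e^{\varepsilon}}{e^{\varepsilon} + 1} \;=\; \frac{e}{e + 1} \;\approx\; 0.73,
\]
% i.e., roughly one in four answers is randomly flipped.
```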
Overview of the paper. We first discuss related work. Then
we formally define the participation bias correction problem
that DiPPS solves, followed by the description of the different
components of our solution (see Fig. 1 for an overview) and
the results of the various experiments that we use to evaluate
it. Finally, we discuss limitations and possible extensions of
DiPPS and summarize the paper.
Related Work
Traditional methods for participation bias correction (Lundström and Särndal 1999; Valliant 1993; Ekholm and Laaksonen 1991) assume that one has auxiliary information about
the respondents and the non-respondents. This could, e.g.,
be geographical information when doing a survey via house
visits, or known population totals. If, for example, the target
population is the entire population of a country, then the pop-
ulation totals can come from census data. If the covariates
$D$, the auxiliary information $A$, and the variable $Z$ indicating participation/non-participation form the Markov chain $D - A - Z$,
then these methods can work well. See (Groves 2006) for
more details. Often methods for participation bias correction
use propensity scores, which are the probabilities of the indi-
viduals being respondents, given their covariates (Little and
Vartivarian 2003). The collected samples are then reweighted
with the inverse of the propensity scores to correct for the
participation bias. This is what we do in our method as well.
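A minimal sketch of inverse propensity weighting in Python follows. The propensity scores are taken as given here (in DiPPS they would be derived from the privately collected cluster distribution), so this illustrates the generic reweighting step rather than the paper's full estimator; the function name and example numbers are hypothetical.

```python
from typing import Sequence


def ipw_mean(values: Sequence[float], propensities: Sequence[float]) -> float:
    """Self-normalized inverse-propensity-weighted mean.

    values: a statistic of interest for each participant data point.
    propensities: for each point, the estimated probability
        P(participation | covariates of the point).
    Points over-represented among participants (high propensity) are
    down-weighted, under-represented ones are up-weighted.
    """
    weights = [1.0 / p for p in propensities]
    total = sum(weights)
    return sum(w * v for w, v in zip(weights, values)) / total


# Hypothetical example: a binary feature-usage indicator for three participants
# whose estimated participation probabilities are 0.8, 0.5 and 0.2.
print(ipw_mean([1.0, 0.0, 1.0], [0.8, 0.5, 0.2]))
```

The estimator is self-normalized: dividing by the total weight keeps the result a proper weighted average even when the propensity estimates are only known up to a constant.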
Reweighting data records can also be required for causal
inference. Agarwal and Singh (2021) propose a reweighting
method for estimating causal parameters such as the average
treatment effect from data that has been released with differ-
ential privacy. Instead of weighting points with their inverse
propensity score, they use an error-in-variable balancing tech-
nique. Like us in our concrete choice of implementation, they
assume a low-rank data matrix, and confirm the validity of
this assumption on US census data.
Another related setting is the following: For a machine
learning (ML) task, there exist two datasets, one is labeled
and the other one unlabeled, and there is a covariate shift
between the two. When training an ML model, one would
like to account for this shift by also taking into account the
unlabeled data. Several methods for solving this problem
have been proposed (Huang et al. 2007; Rosset et al. 2005;
Zadrozny 2004). In this context, the idea of using cluster-
ing to correct for sample selection bias has already been
explored (Cortes et al. 2008). All of these methods work only
in the non-private setting, where one has direct access to the
unlabeled data.
We, however, assume that we neither have auxiliary infor-
mation about the non-participants, nor non-private access to
parts of their data. Access to data of non-participants is only
allowed in a locally differentially private way. For providing
local differential privacy in the processing of distributed data,
there exists a multitude of methods: for computing means
(Wang et al. 2019; Duchi, Jordan, and Wainwright 2018), for
computing counts (Erlingsson, Pihur, and Korolova 2014)
or even for training machine learning models (Truex et al.
2019), to just name a few. What all of these methods have in
common is that each one only serves a single purpose. The
data analyst has to decide beforehand which function they
want to compute on the data. If they decide to perform addi-
tional analyses later, which is, e.g., the case in exploratory
and adaptive data analysis, they have to invoke another dif-
ferentially private mechanism. This requires further rounds
of communication with the individuals that hold the data
and, more importantly, each additional function computation
decreases the level of privacy (Rogers et al. 2016; Kairouz,
Oh, and Viswanath 2015). As opposed to that, our method
estimates a distribution. This means that once the method has
been executed, the data analyst can compute any number of
arbitrary functions they want on this distribution, including
all kinds of statistics, but also more complicated functions
such as training ML models. Furthermore, while methods
for, e.g., locally differentially private gradient descent require
many rounds of communication, our method works with a
single round of communication.
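For contrast with such single-purpose approaches, below is a minimal sketch of a one-shot locally private mean estimator using the Laplace mechanism; it is a simple stand-in for, not a reimplementation of, the cited estimators, and the value range [0, 1] is an assumption.

```python
import random
from typing import Sequence


def ldp_report(value: float, epsilon: float) -> float:
    """User side: perturb one value known to lie in [0, 1].

    The identity query on [0, 1] has sensitivity 1, so adding Laplace noise
    with scale 1/epsilon yields an epsilon-LDP report. The noise is sampled
    as the difference of two exponentials, which is Laplace-distributed.
    """
    scale = 1.0 / epsilon
    noise = random.expovariate(1.0 / scale) - random.expovariate(1.0 / scale)
    return value + noise


def ldp_mean(reports: Sequence[float]) -> float:
    """Analyst side: the average of the unbiased noisy reports estimates the mean."""
    return sum(reports) / len(reports)


# Hypothetical usage: 1000 users, each holding one value in [0, 1].
reports = [ldp_report(random.random(), epsilon=1.0) for _ in range(1000)]
print(ldp_mean(reports))
```

The analyst here can only recover the one statistic the mechanism was designed for; estimating a different quantity later would require a fresh round of noisy reports and an additional privacy budget.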
A trust setting similar to ours has been introduced earlier
by Avent et al. (2017). As opposed to us, they assume one
dataset with locally differentially private access and one with
centrally differentially private access, whereas we assume
one dataset with locally differentially private access and one
with non-private access. They describe a method for com-
puting the most popular records of a web search log (Avent
et al. 2017) and methods for mean estimation (Avent, Dubey,
and Korolova 2020), whereas we consider the more general
problem of distribution estimation. Note that our method can
in principle be extended to their setting with purely differentially private access to data; see our discussion section.
Other works (Kancharla and Kang 2021; Clark and De-
sharnais 1998) consider bias in locally differentially private
surveys due to users not following the DP protocol faithfully.
This might occur in our setting if the data collecting party
gives the users only the options to share their data without
a privacy guarantee or with DP, but no option to withhold their data entirely. The authors propose to split the users into two groups, let
those groups invoke DP mechanisms with different parame-
ters, and compare the two sets of responses to correct for this
bias.
Problem Definition
Our method is also applicable in an offline setting, but assume for simplicity that there is a company that sells a software product and wants to collect data from its users over the Internet to, e.g., analyze usage patterns to improve the software, train ML models that are to be integrated into the software, or spot
market opportunities for new products, etc. We consider two
settings:
1. Users of the software get the option to share their data,
e.g., usage statistics, as is common in much of today's software (Windows, Firefox, ...). Some users decide to
share their data directly (without a privacy mechanism
in place), some decide to only share data with a local
differential privacy guarantee. The company therefore has
direct access to a (potentially biased) subset of the data
and in addition it has locally differentially private access
to the rest of the data.
2. The company does not have direct access to any user data,
but only to a proxy dataset that comes from a similar
distribution as the user data. If the user data consists of
private text messages, the proxy dataset could for example
be tweets from Twitter or public forum posts. Assume that
the company has locally differentially private access to
the user data.
In both settings, the company wants to use the data to which it
has direct access to perform data analysis, ML model training
or other data-dependent tasks. But in both cases, that data is
most likely biased: in Setting 1, the covariate distribution of users
who are willing to share their data might differ from the
covariate distribution of users who are not willing to share
their data. In Setting 2, the data even comes from a different source.
The problem that we are solving is the reduction of this bias.
We now formalize this problem.
Let $D$ be the random variable that subsumes the covariates of the user data. Let $Z$ be a binary random variable indicating whether the company has direct, non-private access to a data point or not. This gives rise to the joint distribution $(D, Z)$. Assume that there exists a multiset $X$ of samples $(d, z)$ of $(D, Z)$. Using $\{\{\cdot\}\}$ to denote multisets, let $X_0 = \{\{d \mid (d, 0) \in X\}\}$ be the data to which the company only has locally differentially private access and let $X_1 = \{\{d \mid (d, 1) \in X\}\}$ be the data to which the company has direct access. The goal is to estimate the distribution of $D$ (in Setting 1) or the distribution of $D \mid Z = 0$ (in Setting 2).
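One way to see why propensity-score reweighting targets exactly these distributions is Bayes' rule. Writing $\pi(d) = \Pr[Z = 1 \mid D = d]$ for the propensity score (how $\pi$ is estimated from the private reports is the paper's contribution and is not shown here), a sketch of the identities:

```latex
% Setting 1: distribution of all individuals, expressed via the directly
% accessible data D | Z = 1 and the propensity score pi(d) = Pr[Z = 1 | D = d]:
\[
  p_D(d) \;=\; \frac{\Pr[Z = 1]}{\pi(d)}\, p_{D \mid Z = 1}(d).
\]
% Setting 2: distribution of the non-participants' data:
\[
  p_{D \mid Z = 0}(d) \;=\; \frac{1 - \pi(d)}{\pi(d)} \cdot \frac{\Pr[Z = 1]}{\Pr[Z = 0]}\, p_{D \mid Z = 1}(d).
\]
```

Since $\Pr[Z = 1]$ and $\Pr[Z = 0]$ are constants, they cancel once the weights are normalized, so reweighting the directly accessible points by $1/\pi(d)$ (Setting 1) or $(1 - \pi(d))/\pi(d)$ (Setting 2) is enough.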
In the following we will refer to the data that can be directly
accessed (i.e., directly shared data in Setting 1 and proxy data
in Setting 2) as the participant data $U_1$ and the data that can only be accessed in a locally differentially private way as the non-participant data $U_2$. Note that we do not consider a third
group of users: those who are not willing to share any data,
not even when provided a privacy guarantee. We denote their
data by $U_3$. This third group of users is empty if the company