assume a low rank data matrix, and confirm the validity of
this assumption on US census data.
Another related setting is the following: For a machine
learning (ML) task, there exist two datasets, one is labeled
and the other one unlabeled, and there is a covariate shift
between the two. When training an ML model, one would
like to account for this shift by also taking into account the
unlabeled data. Several methods for solving this problem
have been proposed (Huang et al. 2007; Rosset et al. 2005;
Zadrozny 2004). In this context, the idea of using cluster-
ing to correct for sample selection bias has already been
explored (Cortes et al. 2008). All of these methods work only
in the non-private setting, where one has direct access to the
unlabeled data.
We, however, assume that we have neither auxiliary information about the non-participants nor non-private access to parts of their data; access to the data of non-participants is only allowed in a locally differentially private way. For providing local differential privacy when processing distributed data, there exists a multitude of methods: for computing means
(Wang et al. 2019; Duchi, Jordan, and Wainwright 2018), for
computing counts (Erlingsson, Pihur, and Korolova 2014)
or even for training machine learning models (Truex et al.
2019), to just name a few. What all of these methods have in
common is that each one only serves a single purpose. The
data analyst has to decide beforehand which function they
want to compute on the data. If they decide to perform addi-
tional analyses later, which is, e.g., the case in exploratory
and adaptive data analysis, they have to invoke another dif-
ferentially private mechanism. This requires further rounds
of communication with the individuals who hold the data and, more importantly, each additional function computation decreases the level of privacy (Rogers et al. 2016; Kairouz, Oh, and Viswanath 2015). In contrast, our method
estimates a distribution. This means that once the method has
been executed, the data analyst can compute arbitrarily many functions on this distribution, including all kinds of statistics but also more complex tasks such as training ML models. Furthermore, while methods
for, e.g., locally differentially private gradient descent require
many rounds of communication, our method works with a
single round of communication.
A trust setting similar to ours was introduced by Avent et al. (2017). They assume one dataset with locally differentially private access and one with centrally differentially private access, whereas we assume one dataset with locally differentially private access and one with non-private access. They describe a method for computing the most popular records of a web search log (Avent et al. 2017) and methods for mean estimation (Avent, Dubey, and Korolova 2020), whereas we consider the more general problem of distribution estimation. Note that our method can in principle be extended to their setting with purely differentially private access to the data; see our discussion section.
Other works (Kancharla and Kang 2021; Clark and Desharnais 1998) consider bias in locally differentially private surveys that arises when users do not follow the DP protocol faithfully. This might occur in our setting if the data-collecting party only gives users the options of sharing their data without a privacy guarantee or with DP, but no option of not sharing data at all. The authors propose to split the users into two groups, let each group invoke a DP mechanism with different parameters, and compare the two sets of responses to correct for this bias.
Problem Definition
Our method is also applicable in an offline setting, but for simplicity assume that a company sells a software product and wants to collect data from its users over the Internet in order to, e.g., analyze usage patterns to improve the software, train ML models that are to be integrated into the software, or spot market opportunities for new products. We consider two settings:
1. Users of the software get the option to share their data, e.g., usage statistics, as is common in much of today's software (Windows, Firefox, ...). Some users decide to share their data directly (without a privacy mechanism in place), some decide to only share data with a local differential privacy guarantee. The company therefore has direct access to a (potentially biased) subset of the data and, in addition, locally differentially private access to the rest of the data.
2. The company does not have direct access to any user data, but only to a proxy dataset that comes from a distribution similar to that of the user data. If the user data consists of private text messages, the proxy dataset could, for example, be tweets from Twitter or public forum posts. Assume that the company has locally differentially private access to the user data (see the sketch after this list for an illustration of such access).
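As an illustration only (this is not the mechanism used by our method), the following Python sketch contrasts the two access modes for a single binary attribute, using classical randomized response as the local DP mechanism; the function names and the parameter epsilon are our own choices for this example.

```python
import math
import random

def randomized_response(true_bit: int, epsilon: float) -> int:
    """Report a binary attribute under epsilon-local differential privacy:
    keep the true value with probability e^eps / (e^eps + 1), flip it otherwise."""
    p_keep = math.exp(epsilon) / (math.exp(epsilon) + 1.0)
    return true_bit if random.random() < p_keep else 1 - true_bit

def collect_reports(participant_bits, non_participant_bits, epsilon=1.0):
    # Participants (or the proxy dataset) are accessed directly, without noise.
    direct = list(participant_bits)
    # Non-participants only ever send a locally randomized report.
    private = [randomized_response(b, epsilon) for b in non_participant_bits]
    return direct, private
```

Any other locally differentially private report could take the place of randomized response here; the point is only that the company never sees the non-participants' raw values.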
In both settings, the company wants to use the data to which it has direct access to perform data analysis, ML model training, or other data-dependent tasks. But in both cases, that data is most likely biased: in Setting 1, the covariate distribution of users who are willing to share their data might differ from the covariate distribution of users who are not willing to share their data; in Setting 2, the data even comes from a different source. The problem we solve is the reduction of this bias. We now formalize it.
Let $D$ be the random variable that subsumes the covariates of the user data. Let $Z$ be a binary random variable indicating whether the company has direct, non-private access to a data point or not. This gives rise to the joint distribution $(D, Z)$. Assume that there exists a multiset $X$ of samples $(d, z)$ of $(D, Z)$. Using $\{\{\cdot\}\}$ to denote multisets, let $X_0 = \{\{\, d \mid (d, 0) \in X \,\}\}$ be the data to which the company only has locally differentially private access and let $X_1 = \{\{\, d \mid (d, 1) \in X \,\}\}$ be the data to which the company has direct access. The goal is to estimate the distribution of $D$ (in Setting 1) or the distribution of $D \mid Z = 0$ (in Setting 2).
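To make this setup concrete, the following toy sketch (in Python; not part of our method) instantiates $D$ and $Z$ for a single binary covariate; all variable names and parameter values below are chosen purely for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy instantiation of the setup: D is a binary covariate, and the probability
# of direct access (Z = 1) depends on D, which is what creates the bias.
n = 100_000
d = rng.binomial(1, 0.5, size=n)        # samples of D
p_share = np.where(d == 1, 0.8, 0.3)    # P(Z = 1 | D) differs with D
z = rng.binomial(1, p_share)            # samples of Z

X0 = d[z == 0]   # only locally differentially private access
X1 = d[z == 1]   # direct access (participant data)

print("P(D = 1)          (target in Setting 1):", d.mean())
print("P(D = 1 | Z = 0)  (target in Setting 2):", X0.mean())
print("naive estimate from X1 alone:           ", X1.mean())  # biased
```

Because $P(Z = 1 \mid D)$ depends on $D$, the empirical distribution of $X_1$ alone differs from both estimation targets; this is exactly the bias we aim to reduce.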
In the following, we will refer to the data that can be directly accessed (i.e., directly shared data in Setting 1 and proxy data in Setting 2) as the participant data $U_1$, and to the data that can only be accessed in a locally differentially private way as the non-participant data $U_2$. Note that we do not consider a third group of users: those who are not willing to share any data, not even when provided a privacy guarantee. We denote their data by $U_3$. This third group of users is empty if the company