assume a low rank data matrix, and confirm the validity of
this assumption on US census data.
Another related setting is the following: For a machine
learning (ML) task, there exist two datasets, one is labeled
and the other one unlabeled, and there is a covariate shift
between the two. When training an ML model, one would
like to account for this shift by also taking into account the
unlabeled data. Several methods for solving this problem
have been proposed (Huang et al. 2007; Rosset et al. 2005;
Zadrozny 2004). In this context, the idea of using cluster-
ing to correct for sample selection bias has already been
explored (Cortes et al. 2008). All of these methods work only
in the non-private setting, where one has direct access to the
unlabeled data.
We, however, assume that we have neither auxiliary information about the non-participants nor non-private access to parts of their data; access to the data of non-participants is only allowed in a locally differentially private way. For providing local differential privacy when processing distributed data, there exists a multitude of methods: for computing means
(Wang et al. 2019; Duchi, Jordan, and Wainwright 2018), for
computing counts (Erlingsson, Pihur, and Korolova 2014)
or even for training machine learning models (Truex et al.
2019), to just name a few. What all of these methods have in
common is that each one only serves a single purpose. The
data analyst has to decide beforehand which function they
want to compute on the data. If they decide to perform addi-
tional analyses later, which is, e.g., the case in exploratory
and adaptive data analysis, they have to invoke another dif-
ferentially private mechanism. This requires further rounds
of communication with the individuals who hold the data and, more importantly, each additional function computation decreases the level of privacy (Rogers et al. 2016; Kairouz, Oh, and Viswanath 2015). In contrast, our method
estimates a distribution. This means that once the method has
been executed, the data analyst can compute arbitrarily many functions on this distribution, including all kinds of statistics but also more complex tasks such as training ML models. Furthermore, while methods
for, e.g., locally differentially private gradient descent require
many rounds of communication, our method works with a
single round of communication.
A trust setting similar to ours was introduced by Avent et al. (2017). They assume one dataset with locally differentially private access and one with centrally differentially private access, whereas we assume one dataset with locally differentially private access and one with non-private access. They describe a method for computing the most popular records of a web search log (Avent et al. 2017) and methods for mean estimation (Avent, Dubey, and Korolova 2020), whereas we consider the more general problem of distribution estimation. Note that our method can in principle be extended to their setting with purely differentially private access to the data; see our discussion section.
Other works (Kancharla and Kang 2021; Clark and Desharnais 1998) consider bias in locally differentially private surveys that arises when users do not follow the DP protocol faithfully. This might occur in our setting if the data-collecting party only gives users the options of sharing their data without a privacy guarantee or with DP, but no option of not sharing data at all. The authors propose to split the users into two groups, let each group invoke a DP mechanism with different parameters, and compare the two sets of responses to correct for this bias.
Problem Definition
Our method is also applicable in an offline setting, but for simplicity assume that a company sells a software product and wants to collect data from its users over the Internet in order to, e.g., analyze usage patterns to improve the software, train ML models that are to be integrated into the software, or spot market opportunities for new products. We consider two settings:
1. Users of the software get the option to share their data, e.g., usage statistics, as is common in much of today's software (Windows, Firefox, ...). Some users decide to share their data directly (without a privacy mechanism in place), some decide to only share data with a local differential privacy guarantee. The company therefore has direct access to a (potentially biased) subset of the data and, in addition, locally differentially private access to the rest of the data.
2. The company does not have direct access to any user data, but only to a proxy dataset that comes from a distribution similar to that of the user data. If the user data consists of private text messages, the proxy dataset could, for example, be tweets from Twitter or public forum posts. Assume that the company has locally differentially private access to the user data (see the sketch after this list for an illustration of such access).
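As an illustration only (this is not the mechanism used by our method), the following Python sketch contrasts the two access modes for a single binary attribute, using classical randomized response as the local DP mechanism; the function names and the parameter epsilon are our own choices for this example.

```python
import math
import random

def randomized_response(true_bit: int, epsilon: float) -> int:
    """Report a binary attribute under epsilon-local differential privacy:
    keep the true value with probability e^eps / (e^eps + 1), flip it otherwise."""
    p_keep = math.exp(epsilon) / (math.exp(epsilon) + 1.0)
    return true_bit if random.random() < p_keep else 1 - true_bit

def collect_reports(participant_bits, non_participant_bits, epsilon=1.0):
    # Participants (or the proxy dataset) are accessed directly, without noise.
    direct = list(participant_bits)
    # Non-participants only ever send a locally randomized report.
    private = [randomized_response(b, epsilon) for b in non_participant_bits]
    return direct, private
```

Any other locally differentially private report could take the place of randomized response here; the point is only that the company never sees the non-participants' raw values.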
In both settings, the company wants to use the data to which it has direct access to perform data analysis, ML model training, or other data-dependent tasks. But in both cases, that data is most likely biased: in Setting 1, the covariate distribution of users who are willing to share their data might differ from the covariate distribution of users who are not willing to share their data; in Setting 2, the data even comes from a different source. The problem we solve is the reduction of this bias. We now formalize it.
Let $D$ be the random variable that subsumes the covariates of the user data. Let $Z$ be a binary random variable indicating whether the company has direct, non-private access to a data point or not. This gives rise to the joint distribution $(D, Z)$. Assume that there exists a multiset $X$ of samples $(d, z)$ of $(D, Z)$. Using $\{\{\cdot\}\}$ to denote multisets, let $X_0 = \{\{\, d \mid (d, 0) \in X \,\}\}$ be the data to which the company only has locally differentially private access and let $X_1 = \{\{\, d \mid (d, 1) \in X \,\}\}$ be the data to which the company has direct access. The goal is to estimate the distribution of $D$ (in Setting 1) or the distribution of $D \mid Z = 0$ (in Setting 2).
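To make this setup concrete, the following toy sketch (in Python; not part of our method) instantiates $D$ and $Z$ for a single binary covariate; all variable names and parameter values below are chosen purely for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy instantiation of the setup: D is a binary covariate, and the probability
# of direct access (Z = 1) depends on D, which is what creates the bias.
n = 100_000
d = rng.binomial(1, 0.5, size=n)        # samples of D
p_share = np.where(d == 1, 0.8, 0.3)    # P(Z = 1 | D) differs with D
z = rng.binomial(1, p_share)            # samples of Z

X0 = d[z == 0]   # only locally differentially private access
X1 = d[z == 1]   # direct access (participant data)

print("P(D = 1)          (target in Setting 1):", d.mean())
print("P(D = 1 | Z = 0)  (target in Setting 2):", X0.mean())
print("naive estimate from X1 alone:           ", X1.mean())  # biased
```

Because $P(Z = 1 \mid D)$ depends on $D$, the empirical distribution of $X_1$ alone differs from both estimation targets; this is exactly the bias we aim to reduce.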
In the following, we will refer to the data that can be directly accessed (i.e., directly shared data in Setting 1 and proxy data in Setting 2) as the participant data $U_1$, and to the data that can only be accessed in a locally differentially private way as the non-participant data $U_2$. Note that we do not consider a third group of users: those who are not willing to share any data, not even when provided a privacy guarantee. We denote their data by $U_3$. This third group of users is empty if the company