RWN: A Novel Neighborhood-Based Method for
Statistical Disclosure Control
Noah Perry, Department of Statistics, University of California, Davis
Norman Matloff, Department of Computer Science, University of California, Davis
Patrick Tendick, FINRA
October 14, 2022
Abstract
A novel variation of the data swapping approach to statistical disclosure
control is presented, aimed particularly at preservation of multivariate rela-
tions in the original dataset. A theorem is proved in support of the method,
and extensive empirical investigation is reported.
1 Introduction
The field of statistical disclosure control (SDC)—maintaining privacy of individual
records in a dataset while retaining the statistical utility of the data—has long been
the subject of arcane technical analysis, conducted mainly by statisticians. The
advent of differential privacy (DP) methods in 2006 [7] brought computer scientists
into the field, and the arcane nature of the field changed with the well-publicized
adoption of DP by the United States Census Bureau in 2017. That move by the
Bureau has become somewhat controversial [3] [19] and though we do not address
that controversy, we note the salutary effect of greatly increasing public awareness
of SDC.
Different SDC tools may be appropriate for different databases in different
settings, not just in terms of the numeric degree of protection afforded by a tool, but
also in terms of usability, interpretability and transparency for end users [18].
Here we develop new methodology that we believe database administrators (DBAs)
will find useful in a variety of settings.
Our proposed method, Randomization within Neighborhoods (RWN), to be
presented below, is inspired by data swapping, a classic approach to SDC. However,
RWN differs from previous data swapping approaches in that it exploits a
certain statistical independence property, to be described in detail in the next section.
arXiv:2210.06687v1 [stat.ME] 13 Oct 2022
This paper is organized as follows. We give a brief overview of SDC meth-
ods in Section 2, followed by a discussion of considerations in SDC specific to
databases used for statistical analysis in Sections 3 and 4. The RWN method is
presented in Section 5. The underlying theory is given in Section 6. Tuning pa-
rameter selection is covered in Section 7. Our empirical investigation is discussed
in Section 8, and computational issues are discussed in Section 9.
Note that while we refer at some points to census data as a convenient
concrete example, our methodology is meant to be general, not specific
to the census. It could be applied to medical data, employee data and so on.
2 SDC Methods
Good surveys of SDC methods are in [6] [11] [4]. In order to contrast with RWN,
we give a brief overview here.
Previously the Census Bureau had used the popular data swapping method
for SDC [10] [6] [9]. A set of key variables is defined that may render certain
individual records in the data vulnerable to disclosure. Some records, especially
those deemed most at risk to disclosure, will have the values of their key variables
swapped with those in other records, say drawn from the same geographic region.
There are variants, notably data shuffling [17]. These methods are typically ap-
plied one variable at a time, thus creating the concern of attenuation of multivariate
relations.
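The attenuation concern can be illustrated with a small simulation. The following is a minimal sketch, not the Bureau's procedure: the function name `swap_one_variable`, the swap fraction and the simulated data are all hypothetical, and records are chosen for swapping purely at random rather than by disclosure risk or geography.

```python
import numpy as np

def swap_one_variable(data, col, frac, rng):
    """Permute the values of column `col` among a random fraction of records."""
    out = data.copy()
    n = data.shape[0]
    m = int(frac * n)
    chosen = rng.choice(n, size=m, replace=False)           # records selected for swapping
    out[chosen, col] = data[rng.permutation(chosen), col]   # shuffle values among them
    return out

rng = np.random.default_rng(0)
n = 5000
x = rng.normal(size=n)
y = 0.8 * x + 0.6 * rng.normal(size=n)   # simulated pair with population correlation 0.8
data = np.column_stack([x, y])

swapped = swap_one_variable(data, col=1, frac=0.5, rng=rng)
corr_before = np.corrcoef(data[:, 0], data[:, 1])[0, 1]
corr_after = np.corrcoef(swapped[:, 0], swapped[:, 1])[0, 1]
print(round(corr_before, 3), round(corr_after, 3))
```

The marginal distribution of the swapped variable is untouched, since its values are only rearranged, but its correlation with the other variable shrinks roughly in proportion to the fraction of records swapped.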
The Bureau has also used cell suppression, in which any query concerning a
very small number of database records is denied. However, in 2017, the Bureau,
after performing various simulations, decided that the swapping approach was
vulnerable to reconstruction attacks and turned to DP.
Another standard SDC approach is data perturbation, in which random noise
is added to achieve privacy. The Census Bureau has used this method in the past as
well.
DP is also a perturbation method. Its novelty, though, is its ability
to quantify the degree of privacy, in a manner having certain mathematical properties,
such as composability. Another difference from classic noise addition methods is
that in DP, the noise is typically added to a statistic of interest, say a mean (global
DP), rather than to the microdata itself (local DP).
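To make the global DP case concrete, the following is a minimal sketch of the standard Laplace mechanism applied to a mean. It is not taken from this paper: the function name `dp_mean`, the bounds, the privacy parameter and the simulated ages are hypothetical, and the sketch assumes the number of records is public and that one record is altered rather than added or removed.

```python
import numpy as np

def dp_mean(values, lower, upper, epsilon, rng):
    """Release a mean with Laplace noise added to the statistic itself."""
    clipped = np.clip(values, lower, upper)   # bound each record's influence
    n = len(clipped)
    sensitivity = (upper - lower) / n         # largest change in the mean from altering one record
    noise = rng.laplace(loc=0.0, scale=sensitivity / epsilon)
    return clipped.mean() + noise

rng = np.random.default_rng(1)
ages = rng.integers(18, 91, size=10_000).astype(float)   # simulated ages in 18..90
true_mean = ages.mean()
private_mean = dp_mean(ages, lower=18, upper=90, epsilon=1.0, rng=rng)
print(true_mean, private_mean)
```

Note that the noise is tailored to one statistic. A different query, say a regression coefficient, would need its own sensitivity analysis and its own mechanism.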
Two of the present authors, NM and PT, have a data perturbation background
[15] [13] [22] [21], and were interested in modernizing that approach,
making a proposal in 2016 [14] for a novel SDC method inspired by data swapping.
The present work, joined by author NP, develops those ideas.
3 Statistical Views and Goals
Leo Breiman’s famous essay [5] on predictive modeling described a “cultural” dif-
ference between researchers in statistics and computer science, the former viewing
the world more in terms of probabilistic behavior, the latter in terms of algorithms.
Curiously, as pointed out by statistician Larry Wasserman [23], a similar difference
later arose in the SDC field. Referring to the need in (global) DP to develop a sep-
arate DP-compliant method for each statistical procedure used (mean, regression
analysis etc.), he noted the contrasting views:
CS view: Receive a query for a [specific statistical procedure], return
a private answer.
Statistics view: Give me data. Then I can: draw plots, fit models, test
fit, estimate parameters, make predictions ...
An amusement park metaphor will be useful. Under the CS approach, one must
buy a separate ticket for each ride. Some rides won’t be available at all, pending
development of tickets tailored to those rides. With statistics, one purchases a day
pass, good for all rides. Our focus here is on settings in which one wishes to have
a “statistical day pass.” We protect the data in some way, say perturbation, then
let users conduct whatever types of statistical analyses they wish in an open-ended
way.
Good arguments can be made for either view. As noted, the one-query-at-a-
time nature of the DP/CS approach enables the setting of precise guarantees of
privacy, which would be difficult or impossible in the statistics approach. On the
other hand, this means the user is restricted to only the types of queries for which
a DP version has already been developed and implemented. Linear regression may
be available, say, but not quantile regression.
(We note in passing that unlike most applications of DP, the US Census Bureau’s
DP methodology does amount to a “day pass.” This is because they add
noise to the raw data, which are cell counts in a huge contingency table, rather than
to the output of a statistical analysis, such as a mean.)
Due to our goal of developing a statistical analysis “day pass” in the sense
Wasserman described (estimating parameters, fitting models and so on, in an
open-ended way), we look at the data in the usual statistical manner, i.e. as a sample from
a population. This is in contrast to many SDC applications in which the data themselves
are of primary interest, say the total count of people in a given income range
for a given census block.
A typical example might be that of a medical database, in which the privacy of
individual patients is required, but with which medical researchers can still conduct
statistical analyses, making population inferences.
We suppose here that there is some population value $\theta$ for which we wish to obtain
a sample estimate $\hat{\theta}$, performing statistical operations such as inference
(confidence intervals, hypothesis tests). These operations will be conducted on the
perturbed data obtained by applying RWN to our original microdata. Typically $\theta$
will be vector valued, such as a vector of regression coefficients.
4 Desirable Statistical “Day Pass” Characteristics
In developing a new SDC procedure, such as our proposed RWN, these goals are
key:
• Ability to handle mixed continuous and discrete/categorical data.
• Preservation, to the degree possible, of not only univariate but also multivariate
  distributions/relations.
• Limiting the increase in size of the standard errors of $\hat{\theta}$.
• Preservation, to the degree possible, of statistical inference levels related to
  $\hat{\theta}$.
On the other hand, as noted, we do not take as a goal the preservation of marginal
totals as in Census data.
Let’s elaborate a bit on these goals.
4.1 Handling Mixed Continuous and Categorical Data
A major obstacle to data perturbation methods is their inability to handle
discrete/categorical data. Consider a variable such as Number of Children in Family. After
noise addition, a value may become negative, an unacceptable situation. A simi-
lar difficulty arises with categorical variables, after they are converted to dummy
(one-hot) form.
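The difficulty is easy to reproduce. The following is a minimal sketch with simulated data: the Poisson-distributed counts and the noise standard deviation are hypothetical choices, made only to show what additive Gaussian noise does to a count variable.

```python
import numpy as np

rng = np.random.default_rng(2)
children = rng.poisson(lam=1.5, size=1000)            # simulated Number of Children values
noisy = children + rng.normal(scale=1.0, size=1000)   # classic additive noise

n_negative = int((noisy < 0).sum())                   # impossible values after perturbation
n_non_integer = int((noisy != np.round(noisy)).sum()) # values that are no longer counts
print(n_negative, n_non_integer)
```

Rounding and truncating at zero would restore valid counts, but at the cost of further distorting the distribution. The same applies to dummy variables, whose perturbed values are no longer 0 or 1.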
Indeed, the vast majority of the Census Bureau’s TopDown algorithm [1] is
devoted to making adjustments to negative values, and satisfying certain constraints
involving marginal totals.
RWN will be seen to handle mixed continuous and discrete/categorical data in
a simple, natural manner.
4.2 Preservation of Multivariate Relations
Absent some compensating feature, any change to the data arising from applying
a “day pass” SDC procedure, say perturbation or swapping, will result in distortions
of the relations between variables in the data. This will also occur with cell
suppression methods. Since analysis of multivariate relations constitutes the very
core of statistics, we take as a major goal at least approximately preserving such
relations.
We are of course willing to let those relations be one aspect of the utility/pri-
vacy tradeoff that is necessary to any disclosure avoidance technique. Let’s call
this property Multivariate Relations Attenuation Resistance (MRAR).
The goal then is to develop an SDC method that includes MRAR, with the
method providing the user a “lever” that she can use to choose her desired utili-
ty/privacy tradeoff level.
Comparatively little work in the SDC field has focused on MRAR. It is men-
tioned only briefly in [11] and [6] — and no wonder, as MRAR is a challenging
condition to meet.
Consider noise addition methods. One actually can preserve second-order mo-
ment structure by setting the covariance matrix of the noise to that of the data [12]
[15] [21]. But higher-order moments are lost and other distortions can occur. And
there are no obvious techniques for extending this property with noise addition in
mixed continuous/categorical variable settings.
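The second-order claim can be checked numerically. If independent noise with covariance $c\Sigma$ is added to data with covariance $\Sigma$, the result has covariance $(1 + c)\Sigma$, so correlations are unchanged. The following is a minimal sketch under that assumption: the ratio `c`, the simulated bivariate normal data and all variable names are hypothetical, and the sample covariance stands in for $\Sigma$.

```python
import numpy as np

rng = np.random.default_rng(3)
n = 20_000
cov = np.array([[1.0, 0.7], [0.7, 2.0]])
data = rng.multivariate_normal(mean=[0.0, 0.0], cov=cov, size=n)

c = 0.5                                     # hypothetical noise-to-data variance ratio
sample_cov = np.cov(data, rowvar=False)
noise = rng.multivariate_normal(mean=[0.0, 0.0], cov=c * sample_cov, size=n)
released = data + noise                     # noise shares the data's covariance shape

corr_orig = np.corrcoef(data, rowvar=False)
corr_released = np.corrcoef(released, rowvar=False)
cov_released = np.cov(released, rowvar=False)
print(corr_orig[0, 1], corr_released[0, 1])
```

Variances are inflated by the known factor 1 + c, which an analyst can correct for, while the correlation matrix is approximately preserved. Nothing in this construction protects skewness, tail behavior or other higher-order structure.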
5 RWN: Randomization within Neighborhoods
The method works roughly as follows. For each record in the data, we define a
neighborhood using either a Euclidean distance-based radius or k-nearest neighbors.
Then, for each record r, we randomly choose a subset of the variables to
perturb. For each such variable, we replace its original value by its counterpart in
a randomly chosen record in the neighborhood of r. A key point is that a different
random neighbor record is used for each of the variables to be perturbed in r.
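The rough description above can be sketched in code. This is an illustration under stated assumptions, not the authors' implementation: the function name `rwn_knn` is hypothetical, all columns are taken to be numeric, distances are Euclidean on the raw values, a record is not counted as its own neighbor, each variable is selected for perturbation independently with some probability `q`, and replacement values are always drawn from the original data rather than from already-perturbed records.

```python
import numpy as np

def rwn_knn(W, k, q, rng):
    """Sketch of Randomization within Neighborhoods with k nearest neighbors."""
    W = np.asarray(W, dtype=float)
    n, p = W.shape
    # Pairwise squared Euclidean distances (O(n^2) memory; fine for a small demo).
    diff = W[:, None, :] - W[None, :, :]
    dist = np.einsum("ijk,ijk->ij", diff, diff)
    np.fill_diagonal(dist, np.inf)                # assumption: exclude the record itself
    neighbors = np.argsort(dist, axis=1)[:, :k]   # indices of the k nearest records

    released = W.copy()
    for i in range(n):
        perturb = rng.random(p) < q               # assumption: each variable chosen with prob. q
        for j in np.flatnonzero(perturb):
            donor = rng.choice(neighbors[i])      # a fresh random neighbor for each variable
            released[i, j] = W[donor, j]          # copy the donor's original value
    return released

rng = np.random.default_rng(4)
W = rng.normal(size=(200, 3))                     # simulated data, 200 records, 3 variables
W_released = rwn_knn(W, k=5, q=0.5, rng=rng)
print(np.mean(W_released != W))                   # fraction of cells that changed
```

Because every released value is an actual value of the same variable from a nearby record, no impossible values such as negative counts can arise, and drawing a separate donor for each variable is what distinguishes this from swapping whole records.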
More formally:
Let $W = (w_{ij})$, $i = 1, \dots, n$, $j = 1, \dots, p$, denote our original data on $n$
individuals and $p$ variables, and let $W' = (w'_{ij})$, $i = 1, \dots, n$, $j = 1, \dots, p$, be the
released (i.e. perturbed) data.
Choose neighborhood radius $\epsilon > 0$, or number of nearest neighbors $k$, and
modification probability $q$. Then we form our released data $W'$ as follows:
For $i = 1, \dots, n$: