RWN: A Novel Neighborhood-Based Method for

Statistical Disclosure Control

Noah Perry, Department of Statistics, University of California, Davis

Norman Matloff, Department of Computer Science, University of California, Davis

Patrick Tendick, FINRA

October 14, 2022

Abstract

A novel variation of the data swapping approach to statistical disclosure

control is presented, aimed particularly at preservation of multivariate rela-

tions in the original dataset. A theorem is proved in support of the method,

and extensive empirical investigation is reported.

1 Introduction

The field of statistical disclosure control (SDC), that is, maintaining the privacy of individual records in a dataset while retaining the statistical utility of the data, has long been the subject of arcane technical analysis, conducted mainly by statisticians. The advent of differential privacy (DP) methods in 2006 [7] brought computer scientists into the field, and the arcane nature of the field changed with the well-publicized adoption of DP by the United States Census Bureau in 2017. That move by the Bureau has become somewhat controversial [3] [19], and though we do not address that controversy, we note its salutary effect of greatly increasing public awareness of SDC.

Different SDC tools may be appropriate for different databases in different settings, not just in terms of the numeric degree of protection afforded by a tool, but also in terms of usability, interpretability and transparency for end users [18]. Here we develop new methodology that we believe database administrators (DBAs) will find useful in a variety of settings.

Our proposed method, Randomization within Neighborhoods (RWN), to be presented below, is inspired by data swapping, a classic approach to SDC. However, RWN differs from previous data swapping approaches in that it exploits a certain statistical independence property, to be described in detail in the next section.

arXiv:2210.06687v1 [stat.ME] 13 Oct 2022

This paper is organized as follows. We give a brief overview of SDC methods in Section 2, followed by a discussion of considerations in SDC specific to databases used for statistical analysis in Sections 3 and 4. The RWN method is presented in Section 5. The underlying theory is given in Section 6. Tuning parameter selection is covered in Section 7. Our empirical investigation is discussed in Section 8, and computational issues are discussed in Section 9.

Note that while we refer for convenience at some points to census data as a concrete example, our methodology is meant to be general, not specific to the census. It could be applied to medical data, employee data and so on.

2 SDC Methods

Good surveys of SDC methods are in [6] [11] [4]. In order to contrast with RWN,

we give a brief overview here.

Previously the Census Bureau had used the popular data swapping method for SDC [10] [6] [9]. A set of key variables is defined that may render certain individual records in the data vulnerable to disclosure. Some records, especially those deemed most at risk of disclosure, will have the values of their key variables swapped with those in other records, say drawn from the same geographic region. There are variants, notably data shuffling [17]. These methods are typically applied one variable at a time, thus creating the concern of attenuation of multivariate relations.
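The attenuation concern can be illustrated with a small simulation. The following is only a sketch under our own assumptions, not the Bureau's procedure: the function `swap_one_variable`, the swap fraction `frac`, and the synthetic two-variable data are all hypothetical, and the swap partners are drawn at random rather than from the same geographic region.

```python
import numpy as np

def swap_one_variable(data, col, frac, rng):
    """Swap the values in column `col` between randomly paired records.

    `frac` (a hypothetical knob) is the fraction of records involved in swaps.
    """
    out = data.copy()
    n = data.shape[0]
    m = int(frac * n) // 2 * 2            # even number of records to pair up
    idx = rng.choice(n, size=m, replace=False)
    a, b = idx[: m // 2], idx[m // 2:]
    # Each record in `a` trades its value of `col` with its partner in `b`.
    out[a, col], out[b, col] = data[b, col], data[a, col]
    return out

rng = np.random.default_rng(0)
n = 5000
x = rng.normal(size=n)
y = 0.8 * x + 0.6 * rng.normal(size=n)    # population correlation 0.8
data = np.column_stack([x, y])

swapped = swap_one_variable(data, col=1, frac=0.5, rng=rng)

corr_before = np.corrcoef(data[:, 0], data[:, 1])[0, 1]
corr_after = np.corrcoef(swapped[:, 0], swapped[:, 1])[0, 1]
print(f"correlation before: {corr_before:.3f}, after: {corr_after:.3f}")
```

The swapped column keeps exactly the same set of values, so its univariate distribution is untouched, while its correlation with the other column shrinks because the swapped records now carry values unrelated to the rest of the record.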

The Bureau has also used cell suppression, in which any query concerning a very small number of database records is denied. However, in 2017, the Bureau, after performing various simulations, decided that the swapping approach was vulnerable to reconstruction attacks and turned to DP.

Another standard SDC approach is data perturbation, in which random noise is added to achieve privacy. The Census Bureau has used this method in the past as well.

DP is also a perturbation method. Its novelty, though, is in its ability to quantify the degree of privacy, in a manner having certain mathematical traits, such as composability. Another difference from classic noise addition methods is that in DP, the noise is typically added to a statistic of interest, say a mean (global DP), rather than to the microdata itself (local DP).
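To make the distinction concrete, here is a minimal sketch using the standard Laplace mechanism, which is our choice for illustration and is not named in the text above. The function names, the clipping bounds `lo` and `hi`, and the synthetic age data are assumptions.

```python
import numpy as np

def global_dp_mean(values, lo, hi, epsilon, rng):
    """Noise added to the statistic: Laplace mechanism on a clipped mean."""
    clipped = np.clip(values, lo, hi)
    # Changing one record moves the mean by at most (hi - lo) / n.
    sensitivity = (hi - lo) / len(clipped)
    return clipped.mean() + rng.laplace(scale=sensitivity / epsilon)

def local_dp_records(values, lo, hi, epsilon, rng):
    """Noise added to the microdata: each record is perturbed before release."""
    clipped = np.clip(values, lo, hi)
    # A single record can range over the whole interval, so the scale is larger.
    return clipped + rng.laplace(scale=(hi - lo) / epsilon, size=len(clipped))

rng = np.random.default_rng(0)
ages = rng.uniform(20, 80, size=10_000)   # synthetic data for the demo
true_mean = ages.mean()

private_mean = global_dp_mean(ages, 0, 100, epsilon=1.0, rng=rng)
noisy_ages = local_dp_records(ages, 0, 100, epsilon=1.0, rng=rng)

print(f"true mean {true_mean:.2f}, global-DP mean {private_mean:.2f}")
print(f"mean of locally perturbed records {noisy_ages.mean():.2f}")
```

In the first function only the single released number is noisy, so a new mechanism is needed for each kind of statistic. In the second, the perturbed records themselves are released and can be fed to any analysis, at the cost of much heavier noise per record.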

Two of the present authors, NM and PT, have a data perturbation background [15] [13] [22] [21], and were interested in modernizing that approach, making a proposal in 2016 [14] for a novel SDC method inspired by data swapping. The present work, joined by author NP, develops those ideas.

3 Statistical Views and Goals

Leo Breiman’s famous essay [5] on predictive modeling described a “cultural” difference between researchers in statistics and computer science, the former viewing the world more in terms of probabilistic behavior, the latter in terms of algorithms. Curiously, as pointed out by statistician Larry Wasserman [23], a similar difference later arose in the SDC field. Referring to the need in (global) DP to develop a separate DP-compliant method for each statistical procedure used (mean, regression analysis etc.), he noted the contrasting views:

CS view: Receive a query for a [specific statistical procedure], return a private answer.

Statistics view: Give me data. Then I can: draw plots, fit models, test fit, estimate parameters, make predictions ...

An amusement park metaphor will be useful. Under the CS approach, one must

buy a separate ticket for each ride. Some rides won’t be available at all, pending

development of tickets tailored to those rides. With statistics, one purchases a day

pass, good for all rides. Our focus here is on settings in which one wishes to have

a “statistical day pass.” We protect the data in some way, say perturbation, then

let users conduct whatever types of statistical analyses they wish in an open-ended

way.

Good arguments can be made for either view. As noted, the one-query-at-a-

time nature of the DP/CS approach enables the setting of precise guarantees of

privacy, which would be difficult or impossible in the statistics approach. On the

other hand, this means the user is restricted to only the types of queries for which

a DP version has already been developed and implemented. Linear regression may

be available, say, but not quantile regression.

(We note in passing that unlike most applications of DP, the US Census Bu-

reau’s DP methodology does amount to a “day pass.” This is because they add

noise to the raw data, which are cell counts in a huge contingency table, rather than

to the output of a statistical analysis, such as a mean.)

Due to our goal of developing a statistical analysis “day pass” in the sense Wasserman described (estimating parameters, fitting models and so on, open-endedly), we look at the data in the usual statistical manner, i.e. as a sample from a population. This is in contrast to many SDC applications in which the data themselves are of primary interest, say the total count of people in a given income range for a given census block.

A typical example might be that of a medical database, in which the privacy of

individual patients is required, but with which medical researchers can still conduct

statistical analyses, making population inferences.

We suppose here that there is some population value $\theta$ for which we wish to obtain a sample estimate $\hat{\theta}$, performing statistical operations such as inference (confidence intervals, hypothesis tests). These operations will be conducted on the perturbed data obtained by applying RWN to our original microdata. Typically $\theta$ will be vector valued, such as a vector of regression coefficients.

4 Desirable Statistical “Day Pass” Characteristics

In developing a new SDC procedure, such as our proposed RWN, these goals are

key:

• Ability to handle mixed continuous and discrete/categorical data.

• Preservation, to the degree possible, of not only univariate but also multi-

variate distributions/relations.

• Limiting the increase in size of the standard errors of $\hat{\theta}$.

• Preservation, to the degree possible, of statistical inference levels related to $\theta$.

On the other hand, as noted, we do not take as a goal the preservation of marginal

totals as in Census data.

Let’s elaborate a bit on these goals.

4.1 Handling Mixed Continuous and Categorical Data

A major obstacle to data perturbation methods is their inability to handle discrete/categorical data. Consider a variable such as Number of Children in Family. After noise addition, a value may become negative, an unacceptable situation. A similar difficulty arises with categorical variables, after they are converted to dummy (one-hot) form.
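A short simulation shows both failures. This is only an illustration on synthetic data; the Poisson counts, the three-level categorical variable and the Gaussian noise scales are assumptions chosen for the demo.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical count variable: number of children per family.
children = rng.poisson(lam=1.5, size=1000)
noisy_children = children + rng.normal(scale=1.0, size=children.size)

# Hypothetical 3-level categorical variable, converted to one-hot form.
levels = rng.integers(0, 3, size=1000)
one_hot = np.eye(3)[levels]
noisy_one_hot = one_hot + rng.normal(scale=0.3, size=one_hot.shape)

# Count the records that no longer make sense after noise addition.
n_negative = int((noisy_children < 0).sum())
n_valid_rows = int(np.isin(noisy_one_hot, [0.0, 1.0]).all(axis=1).sum())

print(f"negative child counts after noise: {n_negative}")
print(f"one-hot rows still valid after noise: {n_valid_rows}")
```

The perturbed counts are neither integers nor guaranteed to be nonnegative, and the perturbed dummy rows no longer consist of zeros and a single one, so some post-processing step would be needed before release.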

Indeed, the vast majority of the Census Bureau’s TopDown algorithm [1] is

devoted to making adjustments to negative values, and satisfying certain constraints

involving marginal totals.

RWN will be seen to handle mixed continuous and discrete/categorical data in

a simple, natural manner.

4.2 Preservation of Multivariate Relations

Absent some compensating feature, any change to the data arising from applying a “day pass” SDC procedure, say perturbation or swapping, will result in distortions of the relations between variables in the data. This will also occur with cell suppression methods. Since analysis of multivariate relations comprises the very core of statistics, we take as a major goal at least approximately preserving such relations.

We are of course willing to let those relations be one aspect of the utility/pri-

vacy tradeoff that is necessary to any disclosure avoidance technique. Let’s call

this property Multivariate Relations Attenuation Resistance (MRAR).

The goal then is to develop an SDC method that includes MRAR, with the

method providing the user a “lever” that she can use to choose her desired utili-

ty/privacy tradeoff level.

Comparatively little work in the SDC field has focused on MRAR. It is mentioned only briefly in [11] and [6], and no wonder, as MRAR is a challenging condition to meet.

Consider noise addition methods. One actually can preserve second-order mo-

ment structure by setting the covariance matrix of the noise to that of the data [12]

[15] [21]. But higher-order moments are lost and other distortions can occur. And

there are no obvious techniques for extending this property with noise addition in

mixed continuous/categorical variable settings.
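The second-order claim is easy to check numerically. In the sketch below, which uses synthetic Gaussian data and variable names of our own choosing, the noise is drawn with the sample covariance matrix of the data, so the released data have roughly twice the covariance of the original and the same correlations.

```python
import numpy as np

rng = np.random.default_rng(1)
n = 20_000

# Synthetic continuous data with a known covariance structure.
cov = np.array([[1.0, 0.7],
                [0.7, 2.0]])
data = rng.multivariate_normal(mean=np.zeros(2), cov=cov, size=n)

# Noise whose covariance matrix is set to that of the data.
sample_cov = np.cov(data, rowvar=False)
noise = rng.multivariate_normal(mean=np.zeros(2), cov=sample_cov, size=n)
released = data + noise

corr_data = np.corrcoef(data, rowvar=False)[0, 1]
corr_released = np.corrcoef(released, rowvar=False)[0, 1]
released_cov = np.cov(released, rowvar=False)

print(f"correlation, original: {corr_data:.3f}, released: {corr_released:.3f}")
print("released covariance / original covariance:")
print(released_cov / sample_cov)
```

Since the released covariance is a scalar multiple of the original, an analyst who knows the multiplier can rescale it. Nothing comparable holds for higher-order moments, and drawing Gaussian noise presupposes continuous variables.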

5 RWN: Randomization within Neighborhoods

The method works roughly as follows. For each record in the data, we define a neighborhood using either a Euclidean distance-based radius or $k$-nearest neighbors. Then, for each record $r$, we randomly choose a subset of the variables to perturb. For each such variable, we replace its original value by its counterpart in a randomly chosen record in the neighborhood of $r$. A key point is that a different random neighbor record is used for each of the variables to be perturbed in $r$.
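The rough description above can be sketched in a few lines. This is a minimal illustration under stated assumptions, not the authors' implementation: it handles numeric data only, uses the $k$-nearest-neighbor form of the neighborhood, excludes a record from its own neighborhood, and selects each variable for perturbation independently with probability `q`. The function name `rwn_sketch` is hypothetical.

```python
import numpy as np

def rwn_sketch(W, k, q, rng):
    """Perturb a numeric data matrix by borrowing values from near neighbors."""
    n, p = W.shape
    # Pairwise Euclidean distances (O(n^2) memory; fine for a small demo).
    diffs = W[:, None, :] - W[None, :, :]
    dist = np.sqrt((diffs ** 2).sum(axis=2))
    np.fill_diagonal(dist, np.inf)       # assumption: a record is not its own neighbor
    neighbors = np.argsort(dist, axis=1)[:, :k]   # indices of the k nearest records

    W_new = W.copy()
    for i in range(n):
        # Assumption: each variable is chosen independently with probability q.
        perturb = rng.random(p) < q
        for j in np.flatnonzero(perturb):
            donor = rng.choice(neighbors[i])   # a fresh random neighbor per variable
            W_new[i, j] = W[donor, j]          # copy the donor's original value
    return W_new

rng = np.random.default_rng(0)
W = rng.normal(size=(200, 3))            # synthetic data for the demo
W_new = rwn_sketch(W, k=5, q=0.5, rng=rng)
print(f"fraction of cells changed: {(W_new != W).mean():.2f}")
```

Because every released value is copied from an original record, a count stays a nonnegative integer and a category stays a valid category. Drawing a separate donor for each perturbed variable means the released record is generally a composite of several nearby records rather than a copy of any single one.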

More formally:

Let $W = (w_{ij})$, $i = 1, \ldots, n$, $j = 1, \ldots, p$ denote our original data on $n$ individuals and $p$ variables, and let $W' = (w'_{ij})$, $i = 1, \ldots, n$, $j = 1, \ldots, p$ be the released (i.e. perturbed) data.

Choose neighborhood radius $\epsilon > 0$, or number of nearest neighbors $k$, and modification probability $q$. Then we form our released data $W'$ as follows:

For $i = 1, \ldots, n$:
