Generating Synthetic Data with Locally Estimated
Distributions for Disclosure Control
Ali Furkan Kalay
February 18, 2025
Abstract
Sensitive datasets are often underutilized in research and industry due to privacy
concerns, limiting the potential of valuable data-driven insights. Synthetic data gen-
eration presents a promising solution to address this challenge by balancing privacy
protection with data utility. This paper introduces a new approach to mitigate privacy
risks associated with outlier observations in synthetic datasets: the Local Resampler
(LR). The LR leverages the k-nearest neighbors algorithm to generate synthetic data
while minimizing disclosure risks by underrepresenting outliers, even when they are not
detectable in marginal distributions. Theoretical and empirical analyses demonstrate
that the LR effectively mitigates outlier-driven disclosure risks, and accurately repli-
cates multimodal, skewed, and non-convex support distributions. The semiparametric
nature of the LR ensures a low computational burden and works efficiently even with
small samples. By parameterizing the balance between privacy risks and data utility,
this approach promotes broader access to sensitive datasets for research.
Keywords: Synthetic Data, Disclosure Control, k-Nearest Neighbors, Data Engineering
This study is part of Kalay's (2024) thesis entitled "Essays on Administrative Data Methodologies" at the University of Queensland, School of Economics. The implementation of the proposed algorithms is available at the Python Package Index: pypi.org/project/synloc/.
alifurkan.kalay@mq.edu.au, Macquarie University Centre for Health Economy
1. Introduction
Sensitive datasets, such as financial records, healthcare data, and administrative data, hold
significant potential for generating data-driven insights. Yet, stringent privacy regulations
and the need to protect sensitive information often limit access to these resources. Synthetic
data generation methods offer an alternative, allowing researchers to work with stochastic-
model-generated datasets that preserve statistical characteristics while minimizing privacy
risks. Optimal synthetic data methods must meet two critical standards: (1) accurate
representation of the original dataset’s statistical properties, and (2) robust disclosure risk
control to protect the identity of individuals, especially those who exhibit atypical patterns
(outliers). While synthetic data generation techniques are evolving rapidly, inherent control
of disclosure risks remains a challenge. The lack of robust statistical disclosure control
methods causes authorities to adopt cautious approaches when developing synthetic data
infrastructures, limiting the utility of sensitive datasets.
This paper introduces a new approach, the Local Resampler (LR), which leverages the
k-nearest neighbor algorithm for the generation of synthetic datasets. LR demonstrates
two key advantages: (1) the ability to accurately replicate complex distributions (e.g.,
multimodal, skewed and non-convex support distributions) with minimal hyperparameter
tuning, and (2) the capacity to minimize disclosure risks by underrepresenting outliers in
the synthetic data. In contrast to benchmark approaches, the latter feature eliminates the
need for analysts[1] to resort to time-consuming manual trimming processes, often limited to marginal distributions, which are insufficient for comprehensive outlier identification.

[1] I refer to the "analyst" as the individual responsible for creating synthetic data to ensure disclosure control while maintaining the utility of the dataset. This means that any statistical analysis performed on the synthetic data should closely resemble the results obtained from the original sample.
While the importance of outliers is recognized in the synthetic data literature (Sweeney, 2002; Drechsler, 2011; Bowen, 2021; Jordon et al., 2022), existing methods lack the ability to statistically control the disclosure risks they pose. This paper uniquely addresses risks driven specifically by outliers within joint distributions and shows that the LR inherently handles such outlier-driven disclosure risks.
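To make this mechanism concrete, the following is a minimal sketch of kNN-based local resampling, assuming numeric data and a multivariate normal as the locally estimated distribution; the function and parameter names are illustrative, and the interface and defaults of the released synloc package may differ.

    import numpy as np
    from sklearn.neighbors import NearestNeighbors

    def local_resampler(X, k=15, seed=0):
        # For each observation, estimate a multivariate normal on its
        # k-nearest-neighbor neighborhood and draw one synthetic point.
        rng = np.random.default_rng(seed)
        _, idx = NearestNeighbors(n_neighbors=k).fit(X).kneighbors(X)
        synthetic = np.empty_like(X, dtype=float)
        for i, neigh in enumerate(idx):
            local = X[neigh]             # neighborhood of observation i
            mu = local.mean(axis=0)      # local mean: draws are pulled toward
                                         # dense regions, underrepresenting outliers
            cov = np.cov(local, rowvar=False)
            synthetic[i] = rng.multivariate_normal(mu, cov)
        return synthetic

    # Example: a skewed, two-dimensional sample
    rng = np.random.default_rng(1)
    X = np.column_stack([rng.lognormal(size=500), rng.normal(size=500)])
    X_syn = local_resampler(X, k=20)

Because every draw is centered on a neighborhood mean, an isolated outlier can only appear in the synthetic sample shifted toward its nearest neighbors, which is precisely the underrepresentation property described above.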
The concept of synthetic data in research has been discussed and explored for decades, initiated by Rubin (1993). A notable example is the Longitudinal Business Database (LBD). The synthetic LBD allows researchers to obtain preliminary results and test the compatibility of their code before submitting scripts (Stata and SAS) to the United States Census Bureau.
The use of sensitive datasets for research purposes is becoming increasingly common. Consequently, there will be a growing demand for infrastructure similar to the synthetic LBD and, hence, for reliable synthetic data methods. For instance, a search for
“administrative data” on sciencedirect.com yielded 3,206 articles in the year 2000. This
number grew to 26,307 results in 2023.[2] Research in health is another key domain often
using sensitive datasets (e.g., Chen et al. (2021); Hernandez et al. (2022)). These datasets
must be protected due to privacy concerns; however, they also hold significant potential
for providing valuable insights. Reliable methods for disclosure control are essential for
ensuring robust data governance practices that balance research potential with the ethical
and responsible use of sensitive information.

[2] Indeed, this surge is mainly driven by the overall increase in academic publications. Figure A.1 isolates this trend and displays the ratio of "administrative data" search results on sciencedirect.com to total search results by year. The proportion in 2023 is threefold higher than in 2000. Not surprisingly, this increase is especially pronounced in data-oriented disciplines, such as economics, in comparison to other fields across all years. This likely reflects the particular value of administrative data for policy-relevant research.
The LR is a generic approach that aims to address this demand by allowing users to control disclosure risks. I illustrate in Appendix C how the locality of the distributions
can be defined using other methods such as clustering algorithms. The LR algorithm is
especially suitable for statistical disclosure control, as it has a unique property of being
biased towards the sample mean (or cluster means in multimodal distributions), which in-
herently reduces disclosure risk by mitigating the impact of outliers that typically require
manual intervention. This property is particularly important as it automates the process
of safeguarding data privacy. Lastly, LR does not require large samples to work efficiently,
unlike deep learning algorithms, which are computationally burdensome and difficult to
optimize.
Section 2 reviews existing methods for creating synthetic data and highlights their broader
applications. Section 3 details the LR algorithm, along with its assumptions, limitations,
and possible extensions. Section 4 demonstrates the application of LR using artificial
datasets. Finally, Section 5 concludes the paper and suggests directions for future research
and applications.
2. Literature Review
Rubin (1993) proposed a multiple imputation approach to create synthetic versions of the
microdata to “honor the confidentiality constraints.” Since then, the multiple imputation
approach has been further developed.[3] Various software packages are available to create
synthetic data today, such as synthpop (Nowok et al., 2016) and SDV (Patki et al.,
2016). synthpop adapts the multiple imputation approach suggested by Rubin (1993) and Raghunathan et al. (2003), and allows incorporating both parametric and nonparametric models.[4] SDV, on the other hand, includes various methods, such as copula and deep
learning algorithms.

[3] See Raghunathan (2021) for a literature review.

[4] The package offers considerable flexibility to users, such as handling and synthesizing missing values, selecting synthesis sequences, stratifying the sample, sampling from the posterior predictive distributions of the models, and so on. The theoretical aspects of the synthpop package are discussed in Raghunathan et al. (2003) and Raab et al. (2014).
Multiple imputation methods replace observed values with model-predicted values as if
they were missing. These methods are well-established for addressing missing data problems
(Van Buuren & Oudshoorn, 1999; Stekhoven & Bühlmann, 2012). In the context of synthetic data generation, the approach differs slightly: the analyst can choose separate models for
each variable, and nonparametric methods are often preferred over parametric models to
capture the nonlinearity and complexity of the original data distribution (Raab et al., 2014;
Drechsler & Reiter, 2011). However, nonparametric methods can overfit the data, and outliers pose a particular privacy risk in such circumstances. Therefore, analysts are encouraged to trim outliers from the marginal distributions before generating synthetic data. This practice, while efficient, is not sufficient to control such risks, as shown in Section 3.2.
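As an illustration of this sequential logic, the sketch below bootstraps the first variable's marginal and then synthesizes each subsequent variable from a regression tree fit on the previously synthesized columns, with resampled residuals added as noise. It is a simplified stand-in for this family of methods, not synthpop's actual algorithm, and it assumes an all-numeric data frame; the names are illustrative.

    import numpy as np
    import pandas as pd
    from sklearn.tree import DecisionTreeRegressor

    def sequential_synthesis(df, seed=0):
        # Synthesize columns one at a time, conditioning each column on
        # the columns synthesized before it.
        rng = np.random.default_rng(seed)
        n, cols = len(df), list(df.columns)
        syn = pd.DataFrame(index=range(n))
        syn[cols[0]] = rng.choice(df[cols[0]].to_numpy(), size=n)  # bootstrap the marginal
        for j, col in enumerate(cols[1:], start=1):
            tree = DecisionTreeRegressor(min_samples_leaf=25, random_state=0)
            tree.fit(df[cols[:j]], df[col])
            resid = df[col].to_numpy() - tree.predict(df[cols[:j]])
            syn[col] = tree.predict(syn[cols[:j]]) + rng.choice(resid, size=n)
        return syn

A deep tree with a small leaf size makes the overfitting risk described above easy to see: leaves dominated by a single atypical observation reproduce that observation's values almost verbatim in the synthetic data.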
An alternative approach involves estimating a joint distribution and drawing values from
it, rather than sequentially estimating posterior predictive distributions for each variable.
Copulas are a preferred solution due to their flexibility, accommodating both parametric
(e.g., Gaussian copula) and nonparametric distributions (e.g., Vine copula) (Sun et al.,
2019). However, copula-based models can be challenging to generalize, requiring appropriate
specification for marginal distributions and struggling to synthesize continuous and discrete
variables simultaneously.
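For intuition, a minimal Gaussian copula sampler with empirical margins might look as follows; the names are illustrative, and the handling of discrete variables is deliberately omitted, mirroring the limitation just noted.

    import numpy as np
    from scipy import stats

    def gaussian_copula_sample(X, n_new, seed=0):
        # Map each margin to normal scores, estimate their correlation,
        # sample jointly, then invert through the empirical quantiles.
        rng = np.random.default_rng(seed)
        n, p = X.shape
        U = stats.rankdata(X, axis=0) / (n + 1)    # margins -> uniform scores
        corr = np.corrcoef(stats.norm.ppf(U), rowvar=False)
        Z_new = rng.multivariate_normal(np.zeros(p), corr, size=n_new)
        U_new = stats.norm.cdf(Z_new)
        # Back-transform each margin via its empirical quantile function.
        return np.column_stack([np.quantile(X[:, j], U_new[:, j])
                                for j in range(p)])

The dependence structure is captured by a single correlation matrix, which is what makes the Gaussian copula convenient but also restrictive compared with, for example, vine copulas.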
Deep learning algorithms, such as Generative Adversarial Networks (GAN) (Goodfellow
et al., 2014) and Variational Autoencoders (VAE) (Kingma & Welling, 2013), have also
been employed for synthetic data generation. While these methods can model complex
data distributions, they are computationally intensive and difficult to optimize. Recent
advancements, such as the use of normalizing flows to synthesize tabular data (Kamthe
et al., 2021), have improved their applicability. These methods can synthesize mixed data
types and replicate distributions with non-convex support.
Distance-based methods like k-Nearest Neighbors (kNN) have been utilized in various ways
for creating synthetic data. Chawla et al. (2002) introduced the Synthetic Minority Over-
sampling Technique (SMOTE), which generates synthetic samples by interpolating between
neighboring observations. This technique is primarily used to augment imbalanced datasets
in classification problems. Numerous extensions and variants of SMOTE have been devel-
oped for different purposes; see Kovács (2019) for a comprehensive overview.
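The core SMOTE step is a single interpolation between an observation and one of its nearest neighbors. A minimal sketch, with illustrative names:

    import numpy as np
    from sklearn.neighbors import NearestNeighbors

    def smote_like(X, k=5, seed=0):
        # Each synthetic point lies on the segment between an observation
        # and one of its k nearest neighbors, chosen at random.
        rng = np.random.default_rng(seed)
        # k + 1 neighbors because the nearest neighbor of a training
        # point is the point itself (column 0 of idx).
        _, idx = NearestNeighbors(n_neighbors=k + 1).fit(X).kneighbors(X)
        picks = idx[np.arange(len(X)), rng.integers(1, k + 1, size=len(X))]
        u = rng.uniform(size=(len(X), 1))   # interpolation weights in [0, 1)
        return X + u * (X[picks] - X)       # x_i + u * (neighbor - x_i)

Interpolated points always fall inside the convex hull of each neighborhood, which already hints at why neighbor-based resampling shrinks synthetic data away from extreme observations.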
Recently, Sivakumar et al. (2022) proposed a method combining Mega-Trend Diffusion (Li
et al., 2007) and kNN, named k-Nearest Neighbor Mega-Trend Diffusion. One of the main
advantages of this approach is its ability to create high-quality synthetic datasets from
small samples. It resembles the LR approach with kNN, but differs
significantly in how kNN is employed.
The use of clustering algorithms like K-means for synthetic data generation is not a new
concept. The LR approach that incorporates K-means shares similarities with mixture
models, which have been previously applied for synthetic data generation (Chokwitthaya
et al., 2020). However, estimating mixture models can be computationally demanding. A
related technique is the K-means SMOTE algorithm introduced by Douzas et al. (2018),
which utilizes K-means to enhance imbalanced datasets for classification purposes, similar
to other SMOTE variants.
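A rough sketch of the cluster-based idea mentioned above, defining localities by K-means clusters rather than kNN, is given below; it assumes numeric data and a multivariate normal within each cluster (a crude analogue of a fitted mixture model), and the names are illustrative.

    import numpy as np
    from sklearn.cluster import KMeans

    def kmeans_resampler(X, n_clusters=5, seed=0):
        # Redraw every observation from a multivariate normal estimated
        # within its K-means cluster.
        rng = np.random.default_rng(seed)
        labels = KMeans(n_clusters=n_clusters, n_init=10,
                        random_state=seed).fit_predict(X)
        synthetic = np.empty_like(X, dtype=float)
        for c in range(n_clusters):
            members = X[labels == c]
            mu, cov = members.mean(axis=0), np.cov(members, rowvar=False)
            synthetic[labels == c] = rng.multivariate_normal(mu, cov,
                                                             size=len(members))
        return synthetic

Unlike a mixture model, no likelihood is maximized here; the clusters simply define the localities, which keeps the computational cost low.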
The multiple imputation method, as originally proposed by Rubin (1993), offers a straight-
forward implementation and demonstrates robustness compared to many alternative meth-
ods. Nonparametric approaches within multiple imputation are capable of effectively cap-
turing asymmetric and nonlinear distributions for both discrete and continuous data (Drech-
sler & Reiter, 2011). Moreover, Raab et al. (2014) provided a theoretical foundation for mak-
ing valid inferences from synthetic samples—an aspect often lacking in other approaches.
Given these benefits, synthpop is used as the primary benchmark method in this study,
representing this approach. Additionally, methods available in the SDV package (Patki et
al., 2016), which is continuously evolving and includes diverse techniques such as copulas
and deep learning, are also evaluated. synthpop and SDV were selected due to their
popularity and their focus on generating synthetic data for disclosure control, which aligns
well with the objectives of this paper.
While kNN and other distance-based methods are widely recognized for their potential in
synthetic data generation—exemplified by techniques such as SMOTE and kNN diffusion
models—this study reconceptualizes the SMOTE algorithm specifically to address synthetic
data generation with a focus on privacy preservation. The LR approach is essentially a
variant of the SMOTE algorithm, but it has been adapted to mitigate disclosure risks
posed by outliers, particularly within joint distributions—an area largely overlooked in
existing literature. This study conceptualizes the outlier problem in synthetic datasets and
demonstrates the effectiveness of LR both theoretically and empirically. By establishing
the theoretical and practical advantages of using kNN within the LR framework, this study
illustrates how LR can be parameterized to balance data utility (i.e., replication accuracy)
with privacy risks, all while maintaining a low computational cost.
3. Local Resampler Algorithm
Let $x_i$ be a $p$-dimensional vector, where $i$ represents the observation in our sample $S$ of size $n$. Our goal is to create a synthetic sample $\{\hat{x}_i\}_{i=1}^{n}$ of size $n$ that has similar distributional properties to the original sample.