Generating Synthetic Data with Locally Estimated
Distributions for Disclosure Control
Ali Furkan Kalay
February 18, 2025
Abstract
Sensitive datasets are often underutilized in research and industry due to privacy
concerns, limiting the potential of valuable data-driven insights. Synthetic data gen-
eration presents a promising solution to address this challenge by balancing privacy
protection with data utility. This paper introduces a new approach to mitigate privacy
risks associated with outlier observations in synthetic datasets: the Local Resampler
(LR). The LR leverages the k-nearest neighbors algorithm to generate synthetic data
while minimizing disclosure risks by underrepresenting outliers, even when they are not
detectable in marginal distributions. Theoretical and empirical analyses demonstrate
that the LR effectively mitigates outlier-driven disclosure risks, and accurately repli-
cates multimodal, skewed, and non-convex support distributions. The semiparametric
nature of the LR ensures a low computational burden and works efficiently even with
small samples. By parameterizing the balance between privacy risks and data utility,
this approach promotes broader access to sensitive datasets for research.
Keywords: Synthetic Data, Disclosure Control, k-Nearest Neighbors, Data Engineering
This study is part of Kalay's (2024) thesis entitled "Essays on Administrative Data Methodologies" at the University of Queensland, School of Economics. The implementation of the proposed algorithms is available at the Python Package Index: pypi.org/project/synloc/.
alifurkan.kalay@mq.edu.au, Macquarie University Centre for Health Economy
1. Introduction
Sensitive datasets, such as financial records, healthcare data, and administrative data, hold
significant potential for generating data-driven insights. Yet, stringent privacy regulations
and the need to protect sensitive information often limit access to these resources. Synthetic
data generation methods offer an alternative, allowing researchers to work with stochastic-
model-generated datasets that preserve statistical characteristics while minimizing privacy
risks. Optimal synthetic data methods must meet two critical standards: (1) accurate
representation of the original dataset’s statistical properties, and (2) robust disclosure risk
control to protect the identity of individuals, especially those who exhibit atypical patterns
(outliers). While synthetic data generation techniques are evolving rapidly, inherent control
of disclosure risks remains a challenge. The lack of robust statistical disclosure control
methods causes authorities to adopt cautious approaches when developing synthetic data
infrastructures, limiting the utility of sensitive datasets.
This paper introduces a new approach, the Local Resampler (LR), which leverages the
k-nearest neighbor algorithm for the generation of synthetic datasets. LR demonstrates
two key advantages: (1) the ability to accurately replicate complex distributions (e.g.,
multimodal, skewed and non-convex support distributions) with minimal hyperparameter
tuning, and (2) the capacity to minimize disclosure risks by underrepresenting outliers in
the synthetic data. In contrast to benchmark approaches, the latter feature eliminates the
need for analysts[1] to resort to time-consuming manual trimming processes, often limited to marginal distributions, which are insufficient for comprehensive outlier identification.

[1] I refer to the "analyst" as the individual responsible for creating synthetic data to ensure disclosure control while maintaining the utility of the dataset. This means that any statistical analysis performed on the synthetic data should closely resemble the results obtained from the original sample.
While the importance of outliers is recognized in the synthetic data literature (Sweeney, 2002; Drechsler, 2011; Bowen, 2021; Jordon et al., 2022), existing methods lack the ability to statistically control the disclosure risks they pose. This paper uniquely addresses risks driven specifically by outliers within joint distributions and shows that the LR inherently handles such outlier-driven disclosure risks.
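To make this mechanism concrete, the following is a minimal sketch of kNN-based local resampling, assuming numeric data and a multivariate normal as the locally estimated distribution; the function and parameter names are illustrative, and the interface and defaults of the released synloc package may differ.

    import numpy as np
    from sklearn.neighbors import NearestNeighbors

    def local_resampler(X, k=15, seed=0):
        # For each observation, estimate a multivariate normal on its
        # k-nearest-neighbor neighborhood and draw one synthetic point.
        rng = np.random.default_rng(seed)
        _, idx = NearestNeighbors(n_neighbors=k).fit(X).kneighbors(X)
        synthetic = np.empty_like(X, dtype=float)
        for i, neigh in enumerate(idx):
            local = X[neigh]             # neighborhood of observation i
            mu = local.mean(axis=0)      # local mean: draws are pulled toward
                                         # dense regions, underrepresenting outliers
            cov = np.cov(local, rowvar=False)
            synthetic[i] = rng.multivariate_normal(mu, cov)
        return synthetic

    # Example: a skewed, two-dimensional sample
    rng = np.random.default_rng(1)
    X = np.column_stack([rng.lognormal(size=500), rng.normal(size=500)])
    X_syn = local_resampler(X, k=20)

Because every draw is centered on a neighborhood mean, an isolated outlier can only appear in the synthetic sample shifted toward its nearest neighbors, which is precisely the underrepresentation property described above.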
The concept of synthetic data in research has been discussed and explored for decades, initiated by Rubin (1993). A notable example is the Longitudinal Business Database (LBD). The synthetic LBD allows researchers to obtain preliminary results and test the compatibility of their code before submitting scripts (Stata and SAS) to the United States Census Bureau.
The use of sensitive datasets for research purposes is becoming increasingly common. Consequently, there will be a growing demand for infrastructure similar to the synthetic LBD and, hence, for reliable synthetic data methods. For instance, a search for
“administrative data” on sciencedirect.com yielded 3,206 articles in the year 2000. This
number grew to 26,307 results in 2023.[2] Research in health is another key domain often
using sensitive datasets (e.g., Chen et al. (2021); Hernandez et al. (2022)). These datasets
must be protected due to privacy concerns; however, they also hold significant potential
for providing valuable insights. Reliable methods for disclosure control are essential for
ensuring robust data governance practices that balance research potential with the ethical
and responsible use of sensitive information.

[2] Indeed, this surge is mainly driven by the overall increase in academic publications. Figure A.1 isolates this trend and displays the ratio of "administrative data" search results on sciencedirect.com to total search results by year. The proportion in 2023 is threefold higher than in 2000. Not surprisingly, this increase is especially pronounced in data-oriented disciplines, such as economics, in comparison to other fields across all years. This likely reflects the particular value of administrative data for policy-relevant research.
The LR is a generic approach that aims to address this demand by allowing users to control disclosure risks. I illustrate in Appendix C how the locality of the distributions
can be defined using other methods such as clustering algorithms. The LR algorithm is
especially suitable for statistical disclosure control, as it has a unique property of being
biased towards the sample mean (or cluster means in multimodal distributions), which in-
herently reduces disclosure risk by mitigating the impact of outliers that typically require
manual intervention. This property is particularly important as it automates the process
of safeguarding data privacy. Lastly, LR does not require large samples to work efficiently,
unlike deep learning algorithms, which are computationally burdensome and difficult to
optimize.
Section 2 reviews existing methods for creating synthetic data and highlights their broader
applications. Section 3 details the LR algorithm, along with its assumptions, limitations,
and possible extensions. Section 4 demonstrates the application of LR using artificial
datasets. Finally, Section 5 concludes the paper and suggests directions for future research
and applications.
2. Literature Review
Rubin (1993) proposed a multiple imputation approach to create synthetic versions of the
microdata to “honor the confidentiality constraints.” Since then, the multiple imputation
approach has been further developed.[3] Various software packages are available to create
synthetic data today, such as synthpop (Nowok et al., 2016) and SDV (Patki et al.,
2016). synthpop adapts the multiple imputation approach suggested by Rubin (1993) and Raghunathan et al. (2003), and allows incorporating both parametric and nonparametric models.[4] SDV, on the other hand, includes various methods, such as copula and deep
learning algorithms.

[3] See Raghunathan (2021) for a literature review.

[4] The package offers considerable flexibility to users, such as handling and synthesizing missing values, selecting synthesis sequences, stratifying the sample, sampling from the posterior predictive distributions of the models, and so on. The theoretical aspects of the synthpop package are discussed in Raghunathan et al. (2003) and Raab et al. (2014).
Multiple imputation methods replace observed values with model-predicted values as if
they were missing. These methods are well-established for addressing missing data problems
(Van Buuren & Oudshoorn, 1999; Stekhoven & Bühlmann, 2012). In the context of synthetic data generation, the approach differs slightly: the analyst can choose separate models for
each variable, and nonparametric methods are often preferred over parametric models to
capture the nonlinearity and complexity of the original data distribution (Raab et al., 2014;
Drechsler & Reiter, 2011). However, nonparametric methods can overfit the data, and outliers pose a particular privacy risk in such circumstances. Therefore, analysts are encouraged to trim outliers from the marginal distributions before generating synthetic data. This practice, while efficient, is not sufficient to control such risks, as shown in Section 3.2.
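As an illustration of this sequential logic, the sketch below bootstraps the first variable's marginal and then synthesizes each subsequent variable from a regression tree fit on the previously synthesized columns, with resampled residuals added as noise. It is a simplified stand-in for this family of methods, not synthpop's actual algorithm, and it assumes an all-numeric data frame; the names are illustrative.

    import numpy as np
    import pandas as pd
    from sklearn.tree import DecisionTreeRegressor

    def sequential_synthesis(df, seed=0):
        # Synthesize columns one at a time, conditioning each column on
        # the columns synthesized before it.
        rng = np.random.default_rng(seed)
        n, cols = len(df), list(df.columns)
        syn = pd.DataFrame(index=range(n))
        syn[cols[0]] = rng.choice(df[cols[0]].to_numpy(), size=n)  # bootstrap the marginal
        for j, col in enumerate(cols[1:], start=1):
            tree = DecisionTreeRegressor(min_samples_leaf=25, random_state=0)
            tree.fit(df[cols[:j]], df[col])
            resid = df[col].to_numpy() - tree.predict(df[cols[:j]])
            syn[col] = tree.predict(syn[cols[:j]]) + rng.choice(resid, size=n)
        return syn

A deep tree with a small leaf size makes the overfitting risk described above easy to see: leaves dominated by a single atypical observation reproduce that observation's values almost verbatim in the synthetic data.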
An alternative approach involves estimating a joint distribution and drawing values from
it, rather than sequentially estimating posterior predictive distributions for each variable.
Copulas are a preferred solution due to their flexibility, accommodating both parametric
(e.g., Gaussian copula) and nonparametric distributions (e.g., Vine copula) (Sun et al.,
2019). However, copula-based models can be challenging to generalize, requiring appropriate
specification for marginal distributions and struggling to synthesize continuous and discrete
variables simultaneously.
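For intuition, a minimal Gaussian copula sampler with empirical margins might look as follows; the names are illustrative, and the handling of discrete variables is deliberately omitted, mirroring the limitation just noted.

    import numpy as np
    from scipy import stats

    def gaussian_copula_sample(X, n_new, seed=0):
        # Map each margin to normal scores, estimate their correlation,
        # sample jointly, then invert through the empirical quantiles.
        rng = np.random.default_rng(seed)
        n, p = X.shape
        U = stats.rankdata(X, axis=0) / (n + 1)    # margins -> uniform scores
        corr = np.corrcoef(stats.norm.ppf(U), rowvar=False)
        Z_new = rng.multivariate_normal(np.zeros(p), corr, size=n_new)
        U_new = stats.norm.cdf(Z_new)
        # Back-transform each margin via its empirical quantile function.
        return np.column_stack([np.quantile(X[:, j], U_new[:, j])
                                for j in range(p)])

The dependence structure is captured by a single correlation matrix, which is what makes the Gaussian copula convenient but also restrictive compared with, for example, vine copulas.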
Deep learning algorithms, such as Generative Adversarial Networks (GAN) (Goodfellow
et al., 2014) and Variational Autoencoders (VAE) (Kingma & Welling, 2013), have also
been employed for synthetic data generation. While these methods can model complex
data distributions, they are computationally intensive and difficult to optimize. Recent
advancements, such as the use of normalizing flows to synthesize tabular data (Kamthe
et al., 2021), have improved their applicability. These methods can synthesize mixed data
types and replicate distributions with non-convex support.
Distance-based methods like k-Nearest Neighbors (kNN) have been utilized in various ways
for creating synthetic data. Chawla et al. (2002) introduced the Synthetic Minority Over-
sampling Technique (SMOTE), which generates synthetic samples by interpolating between
neighboring observations. This technique is primarily used to augment imbalanced datasets
in classification problems. Numerous extensions and variants of SMOTE have been devel-
oped for different purposes; see Kovács (2019) for a comprehensive overview.
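The core SMOTE step is a single interpolation between an observation and one of its nearest neighbors. A minimal sketch, with illustrative names:

    import numpy as np
    from sklearn.neighbors import NearestNeighbors

    def smote_like(X, k=5, seed=0):
        # Each synthetic point lies on the segment between an observation
        # and one of its k nearest neighbors, chosen at random.
        rng = np.random.default_rng(seed)
        # k + 1 neighbors because the nearest neighbor of a training
        # point is the point itself (column 0 of idx).
        _, idx = NearestNeighbors(n_neighbors=k + 1).fit(X).kneighbors(X)
        picks = idx[np.arange(len(X)), rng.integers(1, k + 1, size=len(X))]
        u = rng.uniform(size=(len(X), 1))   # interpolation weights in [0, 1)
        return X + u * (X[picks] - X)       # x_i + u * (neighbor - x_i)

Interpolated points always fall inside the convex hull of each neighborhood, which already hints at why neighbor-based resampling shrinks synthetic data away from extreme observations.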
Recently, Sivakumar et al. (2022) proposed a method combining Mega-Trend Diffusion (Li
et al., 2007) and kNN, named k-Nearest Neighbor Mega-Trend Diffusion. One of the main
advantages of this approach is its ability to create high-quality synthetic datasets from
small samples. It resembles the LR approach with kNN, but differs
significantly in how kNN is employed.
The use of clustering algorithms like K-means for synthetic data generation is not a new
concept. The LR approach that incorporates K-means shares similarities with mixture
models, which have been previously applied for synthetic data generation (Chokwitthaya
et al., 2020). However, estimating mixture models can be computationally demanding. A
related technique is the K-means SMOTE algorithm introduced by Douzas et al. (2018),
which utilizes K-means to enhance imbalanced datasets for classification purposes, similar
to other SMOTE variants.
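A rough sketch of the cluster-based idea mentioned above, defining localities by K-means clusters rather than kNN, is given below; it assumes numeric data and a multivariate normal within each cluster (a crude analogue of a fitted mixture model), and the names are illustrative.

    import numpy as np
    from sklearn.cluster import KMeans

    def kmeans_resampler(X, n_clusters=5, seed=0):
        # Redraw every observation from a multivariate normal estimated
        # within its K-means cluster.
        rng = np.random.default_rng(seed)
        labels = KMeans(n_clusters=n_clusters, n_init=10,
                        random_state=seed).fit_predict(X)
        synthetic = np.empty_like(X, dtype=float)
        for c in range(n_clusters):
            members = X[labels == c]
            mu, cov = members.mean(axis=0), np.cov(members, rowvar=False)
            synthetic[labels == c] = rng.multivariate_normal(mu, cov,
                                                             size=len(members))
        return synthetic

Unlike a mixture model, no likelihood is maximized here; the clusters simply define the localities, which keeps the computational cost low.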
The multiple imputation method, as originally proposed by Rubin (1993), offers a straight-
forward implementation and demonstrates robustness compared to many alternative meth-
ods. Nonparametric approaches within multiple imputation are capable of effectively cap-
turing asymmetric and nonlinear distributions for both discrete and continuous data (Drech-
sler & Reiter, 2011). Moreover, Raab et al. (2014) provided a theoretical foundation for mak-
ing valid inferences from synthetic samples—an aspect often lacking in other approaches.
Given these benefits, synthpop is used as the primary benchmark method in this study,
representing this approach. Additionally, methods available in the SDV package (Patki et
al., 2016), which is continuously evolving and includes diverse techniques such as copulas
and deep learning, are also evaluated. synthpop and SDV were selected due to their
popularity and their focus on generating synthetic data for disclosure control, which aligns
well with the objectives of this paper.
While kNN and other distance-based methods are widely recognized for their potential in
synthetic data generation—exemplified by techniques such as SMOTE and kNN diffusion
models—this study reconceptualizes the SMOTE algorithm specifically to address synthetic
data generation with a focus on privacy preservation. The LR approach is essentially a
variant of the SMOTE algorithm, but it has been adapted to mitigate disclosure risks
posed by outliers, particularly within joint distributions—an area largely overlooked in
existing literature. This study conceptualizes the outlier problem in synthetic datasets and
demonstrates the effectiveness of LR both theoretically and empirically. By establishing
the theoretical and practical advantages of using kNN within the LR framework, this study
illustrates how LR can be parameterized to balance data utility (i.e., replication accuracy)
with privacy risks, all while maintaining a low computational cost.
3. Local Resampler Algorithm
Let $x_i$ be a $p$-dimensional vector, where $i$ represents the observation in our sample $S$ of size $n$. Our goal is to create a synthetic sample $\{\hat{x}_i\}_{i=1}^{n}$ of size $n$ that has similar distributional properties to the original sample.