Synthetic Dataset Generation for
Privacy-Preserving Machine Learning
Efstathia Soufleri, Gobinda Saha, Kaushik Roy
ECE, Purdue University, USA
Abstract. Machine Learning (ML) has achieved enormous success in
solving a variety of problems in computer vision, speech recognition, and
object detection, to name a few. The principal reason for this success is the
availability of huge datasets for training deep neural networks (DNNs).
However, datasets cannot be publicly released if they contain sensitive
information such as medical or financial records. In such cases, data
privacy becomes a major concern. Encryption methods offer a possible
solution to this issue; however, their deployment in ML applications is
non-trivial, as they seriously impact classification accuracy and result
in substantial computational overhead. Alternatively, obfuscation
techniques can be used, but maintaining a good balance between visual
privacy and accuracy is challenging. In this work, we propose a method
to generate secure synthetic datasets from the original private datasets.
In our method, given a network with Batch Normalization (BN) layers
pre-trained on the original dataset, we first record the layer-wise BN
statistics. Next, using the BN statistics and the pre-trained model, we
generate the synthetic dataset by optimizing random noise such that
the synthetic data match the layer-wise statistical distribution of the
original model. We evaluate our method on an image classification dataset
(CIFAR10) and show that our synthetic data can be used for training
networks from scratch, producing reasonable classification performance.¹
Keywords: Synthetic Images, Privacy, Deep Learning, Neural Networks,
Privacy-Preserving Machine Learning
1 Introduction
Machine Learning (ML) has been integrated with great success into a wide range
of applications such as computer vision, autonomous driving, speech recognition,
natural language processing, and object detection. The availability of large
datasets and advances in techniques for training deep neural network models
have played integral roles in this success. Moreover, cloud providers
offer various Machine Learning as a Service (MLaaS) platforms, such as Microsoft
Azure ML Studio [1], Google Cloud ML Engine [2], and Amazon SageMaker [3],
where computational resources are provided for running ML workloads. For
such cloud-based computing, the ML algorithms are either provided by the users
or selected from standard ML algorithm libraries [41], whereas the datasets
are usually shared with the cloud by the users to meet application-specific
requirements. However, it might not always be possible to share private data
with the cloud if they contain sensitive information such as medical or
financial records, or if the user must follow terms and regulations that forbid
public release of the data.
¹ Work in progress.
arXiv:2210.03205v5 [cs.CR] 11 Feb 2023
Thus, it is essential to protect the privacy of the training data before publicly
releasing them, and doing so would offer numerous advantages. First, it would
benefit the research community: scientists could open-source their data without
privacy concerns and consequences. Access to the data would help researchers
reproduce experiments from other studies and hence promote transparency.
Additionally, combining data from different sources can lead to
better models. Moreover, data can potentially be traded in
data markets, where protection of sensitive information is of utmost importance
[39]. Overall, collaboration between users would be facilitated, which would
broadly help advance ML research.
In the literature, several approaches have been proposed to ensure the visual
privacy of data. Among them, a popular approach is data encryption
[10,37,7,9,12,23], where the user encrypts the data before sending them to the
cloud. This is considered to be a highly successful method and has
demonstrated exceptional results in protecting data privacy [17,29]. However,
this method is computationally expensive, making it prohibitive for wide
adoption in ML applications [21]. Alternatively, researchers have suggested
several image-obfuscation techniques for visual privacy preservation [39]. These
techniques include image blurring [30], mixing [20,40], adding noise [41], and
pixelizing [35], which allow the model to be trained on the obfuscated images and
achieve a good accuracy-privacy trade-off. However, when the images are highly
obfuscated, visual privacy is well ensured, but the performance of the network
might drop below an acceptable level.
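As a rough illustration of two of the obfuscation techniques mentioned above, the following sketch pixelizes a small grayscale image by block averaging and adds clipped Gaussian noise. It is not taken from the cited works: the function names, the list-of-lists image representation with pixel values in [0, 1], and the parameters `block` and `sigma` are all assumptions made for this example.

```python
import random


def pixelize(image, block):
    """Replace every block x block tile with its mean value (hypothetical helper)."""
    height, width = len(image), len(image[0])
    out = [row[:] for row in image]
    for top in range(0, height, block):
        for left in range(0, width, block):
            cells = [
                (r, c)
                for r in range(top, min(top + block, height))
                for c in range(left, min(left + block, width))
            ]
            avg = sum(image[r][c] for r, c in cells) / len(cells)
            for r, c in cells:
                out[r][c] = avg
    return out


def add_noise(image, sigma, seed=0):
    """Add Gaussian noise to each pixel and clip to [0, 1] (hypothetical helper)."""
    rng = random.Random(seed)
    return [
        [min(1.0, max(0.0, p + rng.gauss(0.0, sigma))) for p in row]
        for row in image
    ]


# A 4x4 toy "image"; larger block sizes or sigma values hide more detail.
IMAGE = [
    [0.0, 1.0, 0.2, 0.4],
    [1.0, 0.0, 0.6, 0.8],
    [0.5, 0.5, 0.1, 0.1],
    [0.5, 0.5, 0.3, 0.3],
]
pixelized = pixelize(IMAGE, block=2)
noisy = add_noise(IMAGE, sigma=0.5)
```

Increasing `block` or `sigma` removes more visual detail, which is the source of the trade-off described above: stronger obfuscation improves visual privacy but leaves less signal for the network to learn from.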
In this work, we introduce an algorithm for generating secure synthetic data
from the original private data. Specifically, as inputs, our method requires the
original data and a network with Batch Normalization (BN) layers pre-trained
on these data. From this network we record the BN statistics (the running
mean and running variance) of each layer. Then, we initialize the synthetic
data as Gaussian noise and optimize them to match the recorded BN statistics
[11]. Once generated, the synthetic dataset can be publicly released
and used for training DNNs from scratch. We evaluate our methodology on
the CIFAR10 [25] image classification dataset. We train a network (ResNet20) from
scratch on synthetic CIFAR10 images and obtain up to 61.79% classification
accuracy. We show that this classification performance depends on the number of
optimization steps used for synthetic data generation. With more optimization
steps, classification performance increases; however, the generated images look
more realistic, sacrificing some data privacy.
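The idea of optimizing noise to match recorded statistics can be sketched on a toy problem. The example below is not the authors' implementation: a single fixed linear layer stands in for the pre-trained network, the "recorded BN statistics" are hand-picked per-channel targets, and the loss (squared differences between batch and target mean and variance) is an assumed form. All names and values are hypothetical.

```python
import random


def batch_stats(values):
    """Mean and (biased) variance of one channel over the batch."""
    n = len(values)
    mean = sum(values) / n
    var = sum((v - mean) ** 2 for v in values) / n
    return mean, var


def features(weights, batch):
    """Per-channel activations of a linear layer: acts[c][i] = weights[c] . batch[i]."""
    return [
        [sum(w * x for w, x in zip(row, sample)) for sample in batch]
        for row in weights
    ]


def bn_matching_loss(weights, batch, target_mean, target_var):
    """Sum over channels of squared mean and variance mismatches (assumed loss)."""
    loss = 0.0
    for c, acts in enumerate(features(weights, batch)):
        mean, var = batch_stats(acts)
        loss += (mean - target_mean[c]) ** 2 + (var - target_var[c]) ** 2
    return loss


def synthesize(weights, target_mean, target_var,
               n_samples=64, steps=300, step_size=0.02, seed=0):
    """Start from Gaussian noise and run gradient descent on the inputs."""
    rng = random.Random(seed)
    dim = len(weights[0])
    batch = [[rng.gauss(0.0, 1.0) for _ in range(dim)] for _ in range(n_samples)]
    for _ in range(steps):
        acts = features(weights, batch)
        stats = [batch_stats(a) for a in acts]
        for i, sample in enumerate(batch):
            # Gradient of the loss w.r.t. each activation of sample i;
            # the common 1/n factor is folded into step_size.
            grad_act = [
                2.0 * (mean - target_mean[c])
                + 4.0 * (var - target_var[c]) * (acts[c][i] - mean)
                for c, (mean, var) in enumerate(stats)
            ]
            # Chain rule through the linear layer back to the input.
            for j in range(dim):
                grad_x = sum(g * weights[c][j] for c, g in enumerate(grad_act))
                sample[j] -= step_size * grad_x
    return batch


# Hypothetical stand-ins for the frozen layer and its recorded statistics.
W = [[1.0, 0.5, -0.5], [0.0, 1.0, 0.5]]
TARGET_MEAN = [0.5, -1.0]
TARGET_VAR = [2.0, 0.5]

synthetic = synthesize(W, TARGET_MEAN, TARGET_VAR)
print("loss after optimization:",
      bn_matching_loss(W, synthetic, TARGET_MEAN, TARGET_VAR))
```

In this sketch only the inputs are updated, while the layer weights and the target statistics stay fixed, mirroring the description above in which the pre-trained model is used as is. The `steps` argument plays the role of the number of optimization steps, which the text identifies as the knob trading classification performance against privacy.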