or selected from the standard ML algorithm libraries [41], whereas the datasets are usually shared to the cloud by users to meet application-specific requirements. However, it might not always be possible to share private data with the cloud if the data contain sensitive information, such as medical or financial records, or if the user must comply with terms and regulations that forbid public release of the data.
Thus, it is essential to protect the privacy of the training data before publicly
releasing them, as this would offer numerous advantages. First, it would benefit the research community: scientists could open-source their data without privacy concerns or consequences. Access to the data would enable researchers to reproduce experiments from other studies, thereby promoting transparency. Additionally, combining data from different sources can lead to better models. Moreover, data could potentially be traded in data markets, where protection of sensitive information is of utmost importance [39]. Overall, collaboration between users would be facilitated, broadly advancing ML research.
In the literature, several approaches have been proposed to ensure the visual privacy of data. Among these, a popular approach is data encryption [10,37,7,9,12,23], where the user encrypts the data before sending it to the cloud. Encryption is highly effective at protecting data privacy [17,29]; however, it is computationally expensive, making it prohibitive for wide adoption in ML applications [21]. Alternatively, researchers have suggested several image-obfuscation techniques for visual privacy preservation [39]. These techniques include image blurring [30], mixing [20,40], adding noise [41], and pixelization [35], which allow the model to be trained on the obfuscated images while achieving a good accuracy-privacy trade-off. However, when the images are highly obfuscated, visual privacy is well ensured, but the performance of the network may drop below a desirable point.
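To make the obfuscation idea concrete, the following is a minimal, hypothetical sketch (not taken from any of the cited works) of one such technique, pixelization, in PyTorch; the block size controls the accuracy-privacy trade-off.

```python
import torch
import torch.nn.functional as F

def pixelize(images: torch.Tensor, block: int = 4) -> torch.Tensor:
    """Obfuscate a batch of images (N, C, H, W) by averaging over
    block x block patches and upsampling back to the original size."""
    coarse = F.avg_pool2d(images, kernel_size=block)
    return F.interpolate(coarse, scale_factor=block, mode="nearest")

# Example: obfuscate a batch of CIFAR10-sized images before sharing or training.
batch = torch.rand(8, 3, 32, 32)
obfuscated = pixelize(batch, block=4)  # larger block -> more privacy, lower accuracy
```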
In this work, we introduce an algorithm for generating secure synthetic data from the original private data. Specifically, our method takes as inputs the original data and a network with Batch Normalization (BN) layers pre-trained on these data. From this network we record the BN statistics, namely the running mean and running variance, of each layer. Then, we initialize the synthetic data as Gaussian noise and optimize them to match the recorded BN statistics [11]. Once generated, the synthetic dataset can be publicly released and used for training DNNs from scratch. We evaluate our methodology on the CIFAR10 [25] image classification dataset. We train a network (ResNet20) from scratch on synthetic CIFAR10 images and obtain up to 61.79% classification accuracy. We show that this classification performance depends on the number of optimization steps used for synthetic data generation: with more optimization steps, classification performance increases; however, the generated images look more realistic, sacrificing some data privacy.
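A minimal PyTorch sketch of this BN-statistics-matching step is given below, assuming CIFAR10-style inputs (3x32x32) and 10 classes. The names (BNStatLoss, generate_synthetic), the choice of Adam, and the cross-entropy label-guidance term are illustrative assumptions, not details prescribed by our method.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class BNStatLoss:
    """Forward hook: penalize the distance between the batch statistics of the
    current activations and the BN layer's stored running statistics."""
    def __init__(self, bn: nn.BatchNorm2d):
        self.bn, self.loss = bn, torch.tensor(0.0)

    def __call__(self, module, inputs, output):
        x = inputs[0]
        mean = x.mean(dim=[0, 2, 3])
        var = x.var(dim=[0, 2, 3], unbiased=False)
        self.loss = F.mse_loss(mean, self.bn.running_mean) + \
                    F.mse_loss(var, self.bn.running_var)

def generate_synthetic(model, num_images=64, num_classes=10, steps=2000, lr=0.05,
                       device="cuda" if torch.cuda.is_available() else "cpu"):
    """Optimize Gaussian-noise images so their per-layer batch statistics match
    the pre-trained network's recorded BN running mean/variance."""
    model = model.to(device).eval()  # eval mode: running statistics are not updated
    hooks = [BNStatLoss(m) for m in model.modules() if isinstance(m, nn.BatchNorm2d)]
    handles = [h.bn.register_forward_hook(h) for h in hooks]

    # Synthetic images start as Gaussian noise; labels are sampled randomly (assumption).
    x = torch.randn(num_images, 3, 32, 32, device=device, requires_grad=True)
    y = torch.randint(0, num_classes, (num_images,), device=device)
    opt = torch.optim.Adam([x], lr=lr)

    for _ in range(steps):
        opt.zero_grad()
        logits = model(x)
        bn_loss = sum(h.loss for h in hooks)   # match recorded BN statistics
        ce_loss = F.cross_entropy(logits, y)   # optional class guidance
        (bn_loss + ce_loss).backward()
        opt.step()

    for handle in handles:
        handle.remove()
    return x.detach(), y
```

The returned images and labels can then be released and used to train a fresh network (e.g., ResNet20) with a standard supervised training loop.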