or selected from the standard ML algorithm libraries [41], whereas the datasets are usually shared to the cloud by users to meet application-specific requirements. However, it might not always be possible to share private data with the cloud if the data contain sensitive information, such as medical or financial records, or if the user must comply with terms and regulations that forbid public release of the data.
Thus, it is essential to protect the privacy of the training data before publicly
releasing them, as this would offer numerous advantages. First, it would benefit the research community: scientists could open-source their data without privacy concerns or consequences. Access to the data would enable researchers to reproduce experiments from other studies, thereby promoting transparency. Additionally, combining data from different sources can lead to better models. Moreover, data could potentially be traded in data markets, where protection of sensitive information is of utmost importance [39]. Overall, collaboration between users would be facilitated, broadly advancing ML research.
In the literature, several approaches have been proposed to ensure the visual privacy of data. Among these, a popular approach is data encryption [10,37,7,9,12,23], where the user encrypts the data before sending it to the cloud. Encryption is highly effective at protecting data privacy [17,29]; however, it is computationally expensive, making it prohibitive for wide adoption in ML applications [21]. Alternatively, researchers have suggested several image-obfuscation techniques for visual privacy preservation [39]. These techniques include image blurring [30], mixing [20,40], adding noise [41], and pixelization [35], which allow the model to be trained on the obfuscated images while achieving a good accuracy-privacy trade-off. However, when the images are highly obfuscated, visual privacy is well ensured, but the performance of the network may drop below a desirable point.
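To make the obfuscation idea concrete, the following is a minimal, hypothetical sketch (not taken from any of the cited works) of one such technique, pixelization, in PyTorch; the block size controls the accuracy-privacy trade-off.

```python
import torch
import torch.nn.functional as F

def pixelize(images: torch.Tensor, block: int = 4) -> torch.Tensor:
    """Obfuscate a batch of images (N, C, H, W) by averaging over
    block x block patches and upsampling back to the original size."""
    coarse = F.avg_pool2d(images, kernel_size=block)
    return F.interpolate(coarse, scale_factor=block, mode="nearest")

# Example: obfuscate a batch of CIFAR10-sized images before sharing or training.
batch = torch.rand(8, 3, 32, 32)
obfuscated = pixelize(batch, block=4)  # larger block -> more privacy, lower accuracy
```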
In this work, we introduce an algorithm for generating secure synthetic data from the original private data. Specifically, our method takes as inputs the original data and a network with Batch Normalization (BN) layers pre-trained on these data. From this network we record the BN statistics, namely the running mean and running variance, of each layer. Then, we initialize the synthetic data as Gaussian noise and optimize them to match the recorded BN statistics [11]. Once generated, the synthetic dataset can be publicly released and used for training DNNs from scratch. We evaluate our methodology on the CIFAR10 [25] image classification dataset. We train a network (ResNet20) from scratch on synthetic CIFAR10 images and obtain up to 61.79% classification accuracy. We show that this classification performance depends on the number of optimization steps used for synthetic data generation: with more optimization steps, classification performance increases; however, the generated images look more realistic, sacrificing some data privacy.
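A minimal PyTorch sketch of this BN-statistics-matching step is given below, assuming CIFAR10-style inputs (3x32x32) and 10 classes. The names (BNStatLoss, generate_synthetic), the choice of Adam, and the cross-entropy label-guidance term are illustrative assumptions, not details prescribed by our method.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class BNStatLoss:
    """Forward hook: penalize the distance between the batch statistics of the
    current activations and the BN layer's stored running statistics."""
    def __init__(self, bn: nn.BatchNorm2d):
        self.bn, self.loss = bn, torch.tensor(0.0)

    def __call__(self, module, inputs, output):
        x = inputs[0]
        mean = x.mean(dim=[0, 2, 3])
        var = x.var(dim=[0, 2, 3], unbiased=False)
        self.loss = F.mse_loss(mean, self.bn.running_mean) + \
                    F.mse_loss(var, self.bn.running_var)

def generate_synthetic(model, num_images=64, num_classes=10, steps=2000, lr=0.05,
                       device="cuda" if torch.cuda.is_available() else "cpu"):
    """Optimize Gaussian-noise images so their per-layer batch statistics match
    the pre-trained network's recorded BN running mean/variance."""
    model = model.to(device).eval()  # eval mode: running statistics are not updated
    hooks = [BNStatLoss(m) for m in model.modules() if isinstance(m, nn.BatchNorm2d)]
    handles = [h.bn.register_forward_hook(h) for h in hooks]

    # Synthetic images start as Gaussian noise; labels are sampled randomly (assumption).
    x = torch.randn(num_images, 3, 32, 32, device=device, requires_grad=True)
    y = torch.randint(0, num_classes, (num_images,), device=device)
    opt = torch.optim.Adam([x], lr=lr)

    for _ in range(steps):
        opt.zero_grad()
        logits = model(x)
        bn_loss = sum(h.loss for h in hooks)   # match recorded BN statistics
        ce_loss = F.cross_entropy(logits, y)   # optional class guidance
        (bn_loss + ce_loss).backward()
        opt.step()

    for handle in handles:
        handle.remove()
    return x.detach(), y
```

The returned images and labels can then be released and used to train a fresh network (e.g., ResNet20) with a standard supervised training loop.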