InfoShape: Task-Based Neural Data Shaping via Mutual Information
Homa Esfahanizadeh, William Wu, Manya Ghobadi, Regina Barzilay, and Muriel Médard
EECS Department, Massachusetts Institute of Technology (MIT), Cambridge, MA 02139
Abstract—The use of mutual information as a tool in private data sharing has remained an open challenge due to the difficulty of its estimation in practice. In this paper, we propose InfoShape, a task-based encoder that aims to remove unnecessary sensitive information from training data while maintaining enough relevant information for a particular ML training task. We achieve this goal by utilizing neural-network-based mutual information estimators to measure two performance metrics: privacy and utility. Combining these in a Lagrangian optimization, we train a separate neural network as a lossy encoder. We empirically show that InfoShape is capable of shaping the encoded samples to be informative for a specific downstream task while eliminating unnecessary sensitive information. Moreover, we demonstrate that the classification accuracy of downstream models has a meaningful connection with our utility and privacy measures.
Index Terms—Task-based encoding, privacy, utility, mutual information, private training.
I. INTRODUCTION
Mutual information (MI) is a measure that quantifies how much information is obtained about one random variable by observing another random variable [1]. In a data-sharing setting, the data-owner often would like to transform their sensitive samples such that only the information necessary for a specific task is preserved, while sensitive information that could be exploited for adversarial purposes is eliminated. MI is an excellent candidate for developing task-based compression for data sharing that addresses the privacy-utility trade-off problem [2]. However, estimating MI without knowing the distributions of the original and transformed data is very difficult, and, consequently, the use of this critical metric has remained limited. In this paper, we utilize numerical estimation of MI to train a task-based lossy encoder for data sharing.
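Concretely, for random variables $X$ and $Z$ with joint density $p(x,z)$ and marginals $p(x)$ and $p(z)$, MI is defined as
$$I(X;Z) \;=\; \mathbb{E}_{p(x,z)}\!\left[\log\frac{p(x,z)}{p(x)\,p(z)}\right] \;=\; H(X) - H(X \mid Z),$$
which makes the estimation difficulty explicit: the densities inside the expectation are exactly what is unknown for real high-dimensional data.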
Machine learning (ML) efforts in various sensitive domains face a major bottleneck due to the shortage of publicly available training data [3]. The acquisition and release of sensitive data is a primary issue currently hindering the creation of public large-scale datasets. For example, certain federal regulatory laws such as HIPAA [4] and GDPR [5] prohibit medical centers from sharing their patients' identifiable information. This motivates us to approach the issue from an information-theoretic perspective. Our goal is to enable data-owners to eliminate the sensitive parts of their data that are not critical for a specific training task before data sharing. We consider a setting where a lossy compressor encodes the data according to two objectives: (i) training a shared model on the combined encoded data of several institutions with a predictive utility that is comparable to the un-encoded baseline; and (ii) limiting the use of the data for adversarial purposes. In practice, there is a trade-off between the utility and privacy goals.
State-of-the-art solutions to this privacy-utility trade-off problem mainly involve data-owners sharing encrypted, distorted, or transformed versions of their data. Cryptographic methods [6], [7], [8] enable training ML models on encrypted data and offer extremely strong security guarantees. However, these methods have high computational and communication overhead, which hinders practical deployment. Distorting the data by adding noise is another solution, one that can achieve the theoretical notion of differential privacy [9], [10], [11], but it often incurs a notable utility cost. Finally, transformation schemes convert the sensitive data from its original representation to an encoded representation using a randomly chosen encoder [12], [13], [14]; however, if the instance of the random encoder chosen by the data-owner is revealed, the original data can be reconstructed.
In contrast, we design an encoding scheme that converts the original representation of the training data into a new representation that excludes sensitive information. Thus, privacy comes from the lossy behavior of the encoder (i.e., compressor) that we design for a targeted training task. The privacy goal is to limit the information disclosed about sensitive features of a sample given its encoded representation, and the utility goal is to obtain a competent classifier when training on the encoded data. We propose a dual optimization approach to preserve privacy while maintaining utility. In particular, we use MI to quantify the privacy and utility performance, and we train a neural network that plays the role of our lossy encoder.
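One natural way to cast this dual optimization as a single Lagrangian objective is
$$\min_{T} \; -I\big(T(X);\,Y\big) \;+\; \lambda\, I\big(T(X);\,S\big),$$
where $X$ denotes a raw sample, $Y$ its task label, $S$ its sensitive features, and $\lambda \geq 0$ controls the trade-off; we use this notation here only as one illustrative formalization of the two goals above.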
There has been recent progress in estimating bounds on mutual information via numerical methods [15], [16], [17], [18]. We combine the privacy and utility measures, obtained via such MI estimators, into a single loss metric used to improve the encoder during its training phase. Once the encoder is trained, it is utilized by individual data owners as a task-based lossy compressor to encode their data for release with the associated labels.
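As a concrete illustration, the following PyTorch sketch shows one possible way to assemble such a combined loss from neural MI estimators; the Donsker-Varadhan (MINE-style) bound, the architectures, and all names below are illustrative choices rather than a prescription of our exact setup.

```python
# Illustrative sketch (assumed names/architectures): a Donsker-Varadhan (MINE-style)
# lower bound on MI is used twice, once for utility I(Z;Y) and once for privacy I(Z;S).
import math
import torch
import torch.nn as nn

class Critic(nn.Module):
    """Scores (z, v) pairs; parameterizes a neural lower bound on I(Z; V)."""
    def __init__(self, dim_z: int, dim_v: int, hidden: int = 128):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(dim_z + dim_v, hidden), nn.ReLU(),
            nn.Linear(hidden, 1))

    def forward(self, z: torch.Tensor, v: torch.Tensor) -> torch.Tensor:
        return self.net(torch.cat([z, v], dim=-1))

def mi_lower_bound(critic: Critic, z: torch.Tensor, v: torch.Tensor) -> torch.Tensor:
    """Donsker-Varadhan bound: I(Z;V) >= E_joint[f(z,v)] - log E_marginal[exp(f(z,v))]."""
    joint = critic(z, v).mean()
    v_perm = v[torch.randperm(v.size(0))]  # shuffle pairing to emulate marginal samples
    log_mean_exp = torch.logsumexp(critic(z, v_perm), dim=0) - math.log(v.size(0))
    return joint - log_mean_exp.squeeze()

# Toy shapes: x = raw sample, y = task label (one-hot), s = sensitive feature.
dim_x, dim_z, dim_y, dim_s, lam = 32, 8, 2, 4, 1.0
encoder = nn.Sequential(nn.Linear(dim_x, 64), nn.ReLU(), nn.Linear(64, dim_z))
utility_critic, privacy_critic = Critic(dim_z, dim_y), Critic(dim_z, dim_s)

x = torch.randn(256, dim_x)
y = nn.functional.one_hot(torch.randint(0, dim_y, (256,)), dim_y).float()
s = torch.randn(256, dim_s)

z = encoder(x)
# Single Lagrangian loss: reward task-relevant information, penalize sensitive information.
loss = -mi_lower_bound(utility_critic, z, y) + lam * mi_lower_bound(privacy_critic, z, s)
loss.backward()
```

In a full training loop, the critic networks and the encoder would be updated in alternating optimization steps, as is typical for MINE-style estimation.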
II. PROBLEM STATEMENT
We denote the set of all samples of a distribution by $\mathcal{X}$. Each sample $x \in \mathcal{X}$ is labeled via a function $L: \mathcal{X} \to \mathcal{Y}$. A data-owner has a sensitive dataset $\mathcal{D} \subseteq \mathcal{X}$ that she wishes to outsource to a third party for training a specific classification model (i.e., to learn the function $L$). Due to privacy concerns, the data-owner first encodes the sensitive data via an encoder $T: \mathcal{X} \to \mathcal{Z}$, and then publicly releases the labeled encoded data $\{(T(x), L(x))\}_{x \in \mathcal{D}}$. An adversary may have access to