InfoShape: Task-Based Neural Data Shaping via
Mutual Information
Homa Esfahanizadeh, William Wu, Manya Ghobadi, Regina Barzilay, and Muriel Médard
EECS Department, Massachusetts Institute of Technology (MIT), Cambridge, MA 02139
Abstract—The use of mutual information as a tool in private
data sharing has remained an open challenge due to the difficulty
of its estimation in practice. In this paper, we propose InfoShape,
a task-based encoder that aims to remove unnecessary sensitive
information from training data while maintaining enough relevant
information for a particular ML training task. We achieve this goal by utilizing neural-network-based mutual information estimators to measure two performance metrics: privacy and utility. Combining these in a Lagrangian
optimization, we train a separate neural network as a lossy encoder.
We empirically show that InfoShape is capable of shaping the
encoded samples to be informative for a specific downstream task
while eliminating unnecessary sensitive information. Moreover,
we demonstrate that the classification accuracy of downstream
models has a meaningful connection with our utility and privacy
measures.
Index Terms—Task-based encoding, privacy, utility, mutual
information, private training.
I. INTRODUCTION
Mutual information (MI) quantifies how much information is obtained about one random variable by observing another [1]. In a data-sharing setting, the data-owner often wishes to transform their sensitive samples so that only the information necessary for a specific task is preserved, while sensitive information that could be exploited for adversarial purposes is eliminated. MI is therefore an excellent candidate for developing task-based compression for data sharing that addresses the privacy-utility trade-off problem [2]. However, estimating MI without knowing the distributions of the original and transformed data is very difficult, and consequently, the use of this critical metric has remained limited. In this paper, we utilize numerical estimation of MI to train a task-based lossy encoder for data sharing.
Machine learning (ML) efforts in various sensitive domains face a major bottleneck due to the shortage of publicly available training data [3]. The acquisition and release of sensitive data is a primary issue currently hindering the creation of public large-scale datasets. For example, federal regulatory laws such as HIPAA [4] and GDPR [5] prohibit medical centers from sharing their patients' identifiable information. This motivates us to approach the issue from an information-theoretic perspective. Our goal is to enable data-owners to eliminate the sensitive parts of their data that are not critical for a specific training task before data sharing. We consider a setting where a lossy compressor encodes the data according to two objectives: (i) training a shared model on the combined encoded data of several institutions with a predictive utility that is comparable to the un-encoded baseline; and (ii) limiting the use of the data for adversarial purposes. In practice, there is a trade-off between the utility and privacy goals.
The state-of-the-art solutions to this privacy-utility trade-off problem mainly involve data-owners sharing their encrypted, distorted, or transformed data. Cryptographic methods [6], [7], [8] enable training ML models on encrypted data and offer extremely strong security guarantees. However, these methods have a high computational and communication overhead, thereby hindering practical deployment. Distorting the data by adding noise is another solution, which can achieve the theoretical notion of differential privacy [9], [10], [11], but unfortunately often incurs a notable utility cost. Finally, transformation schemes convert the sensitive data from its original representation to an encoded representation using a randomly-chosen encoder [12], [13], [14]; however, if the instance of the random encoder chosen by the data-owner is revealed, the original data can be reconstructed.
In contrast, we design an encoding scheme to convert
the original representation of the training data into a new
representation that excludes sensitive information. Thus, the
privacy comes from the lossy behavior of the encoder (i.e.,
compressor) that we design for a targeted training task. The
privacy goal is to limit the disclosed information about sensitive
features of a sample given its encoded representation, and the
utility goal is to obtain a competent classifier when trained on
the encoded data. We propose a dual optimization approach
to preserve privacy while maintaining utility. In particular, we
use MI to quantify the privacy and utility performance, and we
train a neural network that plays the role of our lossy encoder.
There has been recent progress in estimating bounds on the mutual information via numerical methods [15], [16], [17], [18]. We combine the privacy and utility measures, obtained via MI estimation, into a single loss metric for improving the encoder during its training phase. Once the encoder is trained, it is utilized by individual data-owners as a task-based lossy compressor to encode their data for release with the associated labels.
II. PROBLEM STATEMENT
We denote the set of all samples of a distribution by $\mathcal{X}$. Each sample $x \in \mathcal{X}$ is labeled via a function $L : \mathcal{X} \to \mathcal{Y}$. A data-owner has a sensitive dataset $\mathcal{D} \subseteq \mathcal{X}$ that she wishes to outsource to a third party for training a specific classification model (i.e., to learn the function $L$). For privacy concerns, the data-owner first encodes the sensitive data, via an encoder $T : \mathcal{X} \to \mathcal{Z}$, and then publicly releases the labeled encoded data $\{(T(x), L(x))\}_{x \in \mathcal{D}}$. An adversary may have access to
the deposited dataset, but uses it for adversarial purposes, i.e., deriving a sensitive feature $S(x)$ from $T(x)$, where $S : \mathcal{X} \to \mathcal{Y}$. We call $L(x)$ and $S(x)$ the public and private label of sample $x \in \mathcal{X}$, respectively.
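For concreteness, the release step can be sketched as follows; this is our own illustration, not code from the paper, and `encoder` and `label_fn` are hypothetical stand-ins for $T$ and $L$:

```python
import torch

def release_dataset(encoder, label_fn, sensitive_data):
    """Publish the pairs {(T(x), L(x))} for x in D; the raw samples x
    never leave the data-owner's side."""
    encoder.eval()  # T is already trained and frozen at release time
    released = []
    with torch.no_grad():
        for x in sensitive_data:
            z = encoder(x.unsqueeze(0)).squeeze(0)  # z = T(x)
            released.append((z, label_fn(x)))       # the pair (T(x), L(x))
    return released
```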
The utility goal is to preserve from each sample as much information as is needed to train a competitive downstream classification model. The privacy goal is to eliminate unnecessary sensitive data from each sample, i.e., data that is not critical for the training task but might be misused by an adversary. There are several ways to quantify the privacy and utility performance; here, we use Shannon entropy [1].
Definition 1. The utility score is the negative of the average uncertainty about the public label given its encoded representation,
$$M_{\text{utility}}(T) \triangleq -H[L(x) \mid T(x)]. \tag{1}$$
There are two potential ways to express privacy given the encoded representation: one is the average uncertainty about the original sample, and the other is the average uncertainty about a sensitive feature of the original sample. While each can be advantageous depending on the problem setting, without loss of generality and for simplicity, we use the second approach in this paper.
Definition 2. The privacy score is the average uncertainty about the private label given its encoded representation,
$$M_{\text{privacy}}(T) \triangleq H[S(x) \mid T(x)]. \tag{2}$$
Privacy and utility are competing targets, and in this paper, we design a lossy encoder that offers a desired trade-off via a Lagrangian optimization. Let $\mathcal{T}$ denote the family of possible encoders. An optimal encoder $T^* \in \mathcal{T}$ is obtained as
$$T^* = \arg\min_{T \in \mathcal{T}} \; -\big(M_{\text{utility}}(T) + \lambda\, M_{\text{privacy}}(T)\big),$$
where $\lambda$ is a non-negative coefficient that controls the trade-off between privacy and utility, chosen to be $1$ in our experiments.
There has been increasing theoretical interest in using information-theoretic measures to encode data for privacy goals. These efforts are organized under the Information Bottleneck method [19]. However, since these measures are difficult to calculate due to their dependence on certain (often intractable) probability distributions, they have remained impractical to use. Recent efforts at estimating and incorporating these measures have also faced practical challenges, and deriving connections between optimizing variants of these measures and the success of task-based encoding (in both its utility and privacy aspects) remains an open challenge [15], [17], [20].
III. ELIMINATING SENSITIVE DATA
We propose a dual optimization mechanism, dubbed InfoShape, to simultaneously preserve privacy while also maintaining utility on downstream classification tasks; see Fig. 1. We chose the name InfoShape because our scheme trains a neural-network encoder to act as a task-specific lossy compressor, keeping as much relevant information as possible for the intended downstream task while "shaping" the data to achieve a private representation.
[Fig. 1 block diagram: a sample x enters the Encoder, producing z; privacy and utility performance estimation is applied to z (the utility estimator also uses the label L(x)), and the utility score, weighted by λ, is combined with the privacy score into the training loss.]
Fig. 1. InfoShape design procedure: At each training iteration, the privacy and utility are scored for improving the encoder.
Consider InfoShape as an encoder $T_\theta$ with a set of parameters $\theta$ (i.e., an ML model with weights described by $\theta$). We define the loss metric $Q(\theta)$ for evaluating the privacy-utility performance of $T_\theta$ as follows:
$$Q(\theta) = -\big(M_{\text{privacy}}(T_\theta) + \lambda\, M_{\text{utility}}(T_\theta)\big). \tag{3}$$
This loss is used for training the encoder by adjusting $\theta$.
Our optimization problem is to determine the set of parameters $\theta^*$ such that the loss metric defined in Eq. (3) is minimized. We solve this optimization, i.e., $\theta^* = \arg\min_\theta Q(\theta)$, numerically via the stochastic gradient descent (SGD) method [21],
$$G = \nabla_\theta\, Q(\theta_{\mathrm{itr}}), \qquad \theta_{\mathrm{itr}+1} = h\big(\theta_{\mathrm{itr}}, G\big). \tag{4}$$
Eq. (4) shows the gradient of the loss function with respect to the weights of the encoder, as well as the weight-update step. Here, $h(\cdot)$ is a gradient-based optimizer.
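To make Eqs. (3)-(4) concrete, the following is a minimal PyTorch-style sketch of the encoder-update loop; it is our own illustration, not the paper's released code. The names `est_utility` and `est_privacy` are hypothetical stand-ins for the neural MI estimators described in Section III-A (which are themselves trained, e.g., in alternation with the encoder, a detail this sketch omits), and Adam plays the role of the gradient-based optimizer $h(\cdot)$.

```python
import torch

def train_infoshape(encoder, est_utility, est_privacy, loader, lam=1.0, epochs=10):
    """Minimize Q(theta) = -(M_privacy + lam * M_utility), per Eq. (3)-(4)."""
    opt = torch.optim.Adam(encoder.parameters(), lr=1e-4)  # plays the role of h(.)
    for _ in range(epochs):
        for x, pub_label, priv_label in loader:
            z = encoder(x)  # z = T_theta(x)
            # Per Eq. (5), the constant entropy terms H[L], H[S] are dropped:
            # they do not depend on theta and vanish in the gradient.
            m_utility = est_utility(z, pub_label)    # approximates I[L(x); T(x)]
            m_privacy = -est_privacy(z, priv_label)  # approximates -I[S(x); T(x)]
            loss = -(m_privacy + lam * m_utility)    # Q(theta)
            opt.zero_grad()
            loss.backward()  # G: gradient of Q w.r.t. theta
            opt.step()       # theta <- h(theta, G)
    return encoder
```

With $\lambda = 1$ as in the experiments, the effective objective reduces to $I[S(x); T(x)] - I[L(x); T(x)]$: leak less about the private label while keeping more about the public one.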
Once the encoder is trained, it can be utilized by individual
data-owners as a task-based lossy compressor to encode their
data and to enable the release of data for collaborative training.
A. Neural Estimation of Performance Scores
We utilize neural estimation of MI to numerically approximate the privacy and utility scores. To this end, we rewrite the privacy and utility scores in Eqs. (1)-(2) as follows:
$$M_{\text{utility}}(T) = I[L(x); T(x)] - H[L(x)], \qquad M_{\text{privacy}}(T) = H[S(x)] - I[S(x); T(x)]. \tag{5}$$
Note that the terms $H[L(x)]$ and $H[S(x)]$ can be computed using the closed-form formulation of Shannon entropy, given the empirical distribution of the public or private labels. Nevertheless, as they do not depend on the encoder, they vanish in the gradient.
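As a small illustration of this closed-form computation (our own sketch), the constant term $H[L(x)]$ follows directly from the empirical label frequencies:

```python
import numpy as np

def empirical_entropy(labels):
    """Shannon entropy (in nats) of the empirical distribution of a label list."""
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    return -np.sum(p * np.log(p))

# A balanced binary public label gives H[L(x)] = log 2 ~ 0.693 nats.
print(empirical_entropy([0, 1, 0, 1, 1, 0]))
```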
For training the lossy encoder, we use a set of samples $\{(x, L(x), S(x))\}_{x \in \mathcal{P}}$, such that $\mathcal{P} \subseteq \mathcal{X} \setminus \mathcal{D}$. The underlying distributions are unknown (e.g., $P[L(x) \mid x]$, which characterizes a perfect classifier, and $P[S(x) \mid T(x)]$, which characterizes a computationally unbounded adversary). Consequently, MI is difficult to compute for a finite dataset of high-dimensional inputs [22]. Thus, we consider tractable variational lower bounds that approximate MI [15], [18].
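As one concrete instance of such a bound, the sketch below implements a MINE-style Donsker-Varadhan estimator in PyTorch. This is our illustration of the general family of estimators cited above, not necessarily the exact estimator or architecture used in the paper; discrete labels such as $L(x)$ or $S(x)$ would be one-hot encoded before being passed in as `beta`.

```python
import math
import torch
import torch.nn as nn

class MINECritic(nn.Module):
    """Critic f(alpha, beta) used in the Donsker-Varadhan bound."""
    def __init__(self, dim_alpha, dim_beta, hidden=128):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(dim_alpha + dim_beta, hidden), nn.ReLU(),
            nn.Linear(hidden, 1),
        )

    def forward(self, alpha, beta):
        return self.net(torch.cat([alpha, beta], dim=-1)).squeeze(-1)

def dv_lower_bound(critic, alpha, beta):
    """Donsker-Varadhan bound: I(alpha; beta) >= E_P[f] - log E_{PxP}[exp f].

    Joint samples are the paired rows of (alpha, beta); samples from the
    product of marginals are obtained by shuffling beta within the batch."""
    joint_term = critic(alpha, beta).mean()
    beta_shuf = beta[torch.randperm(beta.size(0))]  # break the pairing
    # log of the empirical mean of exp(f): logsumexp(f) - log n
    marginal_term = torch.logsumexp(critic(alpha, beta_shuf), dim=0) - math.log(beta.size(0))
    return joint_term - marginal_term
```

Maximizing `dv_lower_bound` over the critic's parameters tightens the estimate; the resulting value can then serve as `est_utility` or `est_privacy` in the training loop sketched earlier.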
Let us consider two random variables $\alpha \in \mathcal{A}$ and $\beta \in \mathcal{B}$. By definition, MI can be expressed in terms of the KL-divergence (a measure of distance between two distributions) between the joint distribution $P_{\alpha\beta}$ and the product of the marginals $P_{\alpha} \otimes P_{\beta}$.
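Written out, together with the standard Donsker-Varadhan lower bound that MINE-style estimators optimize:

```latex
I(\alpha;\beta)
  = D_{\mathrm{KL}}\!\big(P_{\alpha\beta}\,\big\|\,P_{\alpha}\otimes P_{\beta}\big)
  \;\ge\; \mathbb{E}_{P_{\alpha\beta}}\!\big[f(\alpha,\beta)\big]
          - \log \mathbb{E}_{P_{\alpha}\otimes P_{\beta}}\!\big[e^{f(\alpha,\beta)}\big],
```

which holds for any critic function $f$, with equality in the supremum over all $f$.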