InfoShape: Task-Based Neural Data Shaping via Mutual Information
Homa Esfahanizadeh, William Wu, Manya Ghobadi, Regina Barzilay, and Muriel Médard
EECS Department, Massachusetts Institute of Technology (MIT), Cambridge, MA 02139
Abstract—The use of mutual information as a tool in private data sharing has remained an open challenge due to the difficulty of its estimation in practice. In this paper, we propose InfoShape, a task-based encoder that aims to remove unnecessary sensitive information from training data while maintaining enough relevant information for a particular ML training task. We achieve this goal by utilizing neural-network-based mutual information estimators to measure two performance metrics: privacy and utility. Combining these in a Lagrangian optimization, we train a separate neural network as a lossy encoder. We empirically show that InfoShape is capable of shaping the encoded samples to be informative for a specific downstream task while eliminating unnecessary sensitive information. Moreover, we demonstrate that the classification accuracy of downstream models has a meaningful connection with our utility and privacy measures.
Index Terms—Task-based encoding, privacy, utility, mutual information, private training.
I. INTRODUCTION
Mutual information (MI) is a measure that quantifies how much information is obtained about one random variable by observing another random variable [1]. In a data-sharing setting, the data-owner often would like to transform their sensitive samples such that only the information necessary for a specific task is preserved, while sensitive information that could be exploited for adversarial purposes is eliminated. MI is an excellent candidate for developing task-based compression for data sharing that addresses the privacy-utility trade-off problem [2]. However, estimating MI without knowing the distributions of the original and transformed data is very difficult, and, consequently, the use of this critical metric has remained limited. In this paper, we utilize numerical estimation of MI to train a task-based lossy encoder for data sharing.
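Concretely, for random variables $X$ and $Z$ with joint density $p(x,z)$ and marginals $p(x)$ and $p(z)$, MI is defined as
$$I(X;Z) \;=\; \mathbb{E}_{p(x,z)}\!\left[\log\frac{p(x,z)}{p(x)\,p(z)}\right] \;=\; H(X) - H(X \mid Z),$$
which makes the estimation difficulty explicit: the densities inside the expectation are exactly what is unknown for real high-dimensional data.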
Machine learning (ML) efforts in various sensitive domains face a major bottleneck due to the shortage of publicly available training data [3]. The acquisition and release of sensitive data is a primary issue currently hindering the creation of public large-scale datasets. For example, certain federal regulatory laws such as HIPAA [4] and GDPR [5] prohibit medical centers from sharing their patients' identifiable information. This motivates us to approach the issue from an information-theoretic perspective. Our goal is to enable data-owners to eliminate the sensitive parts of their data that are not critical for a specific training task before data sharing. We consider a setting where a lossy compressor encodes the data according to two objectives: (i) training a shared model on the combined encoded data of several institutions with a predictive utility that is comparable to the un-encoded baseline; and (ii) limiting the use of the data for adversarial purposes. In practice, there is a trade-off between the utility and privacy goals.
State-of-the-art solutions to this privacy-utility trade-off problem mainly involve data-owners sharing encrypted, distorted, or transformed versions of their data. Cryptographic methods [6], [7], [8] enable training ML models on encrypted data and offer extremely strong security guarantees. However, these methods have high computational and communication overhead, which hinders practical deployment. Distorting the data by adding noise is another solution, one that can achieve the theoretical notion of differential privacy [9], [10], [11], but it often incurs a notable utility cost. Finally, transformation schemes convert the sensitive data from its original representation to an encoded representation using a randomly chosen encoder [12], [13], [14]; however, if the instance of the random encoder chosen by the data-owner is revealed, the original data can be reconstructed.
In contrast, we design an encoding scheme that converts the original representation of the training data into a new representation that excludes sensitive information. Thus, privacy comes from the lossy behavior of the encoder (i.e., compressor) that we design for a targeted training task. The privacy goal is to limit the information disclosed about sensitive features of a sample given its encoded representation, and the utility goal is to obtain a competent classifier when training on the encoded data. We propose a dual optimization approach to preserve privacy while maintaining utility. In particular, we use MI to quantify the privacy and utility performance, and we train a neural network that plays the role of our lossy encoder.
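One natural way to cast this dual optimization as a single Lagrangian objective is
$$\min_{T} \; -I\big(T(X);\,Y\big) \;+\; \lambda\, I\big(T(X);\,S\big),$$
where $X$ denotes a raw sample, $Y$ its task label, $S$ its sensitive features, and $\lambda \geq 0$ controls the trade-off; we use this notation here only as one illustrative formalization of the two goals above.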
There has been recent progress in estimating bounds on mutual information via numerical methods [15], [16], [17], [18]. We combine the privacy and utility measures, obtained via such MI estimators, into a single loss metric used to improve the encoder during its training phase. Once the encoder is trained, it is utilized by individual data owners as a task-based lossy compressor to encode their data for release with the associated labels.
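As a concrete illustration, the following PyTorch sketch shows one possible way to assemble such a combined loss from neural MI estimators; the Donsker-Varadhan (MINE-style) bound, the architectures, and all names below are illustrative choices rather than a prescription of our exact setup.

```python
# Illustrative sketch (assumed names/architectures): a Donsker-Varadhan (MINE-style)
# lower bound on MI is used twice, once for utility I(Z;Y) and once for privacy I(Z;S).
import math
import torch
import torch.nn as nn

class Critic(nn.Module):
    """Scores (z, v) pairs; parameterizes a neural lower bound on I(Z; V)."""
    def __init__(self, dim_z: int, dim_v: int, hidden: int = 128):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(dim_z + dim_v, hidden), nn.ReLU(),
            nn.Linear(hidden, 1))

    def forward(self, z: torch.Tensor, v: torch.Tensor) -> torch.Tensor:
        return self.net(torch.cat([z, v], dim=-1))

def mi_lower_bound(critic: Critic, z: torch.Tensor, v: torch.Tensor) -> torch.Tensor:
    """Donsker-Varadhan bound: I(Z;V) >= E_joint[f(z,v)] - log E_marginal[exp(f(z,v))]."""
    joint = critic(z, v).mean()
    v_perm = v[torch.randperm(v.size(0))]  # shuffle pairing to emulate marginal samples
    log_mean_exp = torch.logsumexp(critic(z, v_perm), dim=0) - math.log(v.size(0))
    return joint - log_mean_exp.squeeze()

# Toy shapes: x = raw sample, y = task label (one-hot), s = sensitive feature.
dim_x, dim_z, dim_y, dim_s, lam = 32, 8, 2, 4, 1.0
encoder = nn.Sequential(nn.Linear(dim_x, 64), nn.ReLU(), nn.Linear(64, dim_z))
utility_critic, privacy_critic = Critic(dim_z, dim_y), Critic(dim_z, dim_s)

x = torch.randn(256, dim_x)
y = nn.functional.one_hot(torch.randint(0, dim_y, (256,)), dim_y).float()
s = torch.randn(256, dim_s)

z = encoder(x)
# Single Lagrangian loss: reward task-relevant information, penalize sensitive information.
loss = -mi_lower_bound(utility_critic, z, y) + lam * mi_lower_bound(privacy_critic, z, s)
loss.backward()
```

In a full training loop, the critic networks and the encoder would be updated in alternating optimization steps, as is typical for MINE-style estimation.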
II. PROBLEM STATEMENT
We denote the set of all samples of a distribution by $\mathcal{X}$. Each sample $x \in \mathcal{X}$ is labeled via a function $L: \mathcal{X} \to \mathcal{Y}$. A data-owner has a sensitive dataset $\mathcal{D} \subseteq \mathcal{X}$ that she wishes to outsource to a third party for training a specific classification model (i.e., to learn the function $L$). Due to privacy concerns, the data-owner first encodes the sensitive data via an encoder $T: \mathcal{X} \to \mathcal{Z}$, and then publicly releases the labeled encoded data $\{(T(x), L(x))\}_{x \in \mathcal{D}}$. An adversary may have access to