TabLeak: Tabular Data Leakage in Federated Learning
Mark Vero¹  Mislav Balunović²  Dimitar I. Dimitrov²  Martin Vechev²
Abstract
While federated learning (FL) promises to preserve privacy, recent works in the image and text domains have shown that training updates leak private client data. However, most high-stakes applications of FL (e.g., in healthcare and finance) use tabular data, where the risk of data leakage has not yet been explored. A successful attack for tabular data must address two key challenges unique to the domain: (i) obtaining a solution to a high-variance mixed discrete-continuous optimization problem, and (ii) enabling human assessment of the reconstruction, as, unlike for image and text data, direct human inspection is not possible. In this work we address these challenges and propose TabLeak, the first comprehensive reconstruction attack on tabular data. TabLeak is based on two key contributions: (i) a method which leverages a softmax relaxation and pooled ensembling to solve the optimization problem, and (ii) an entropy-based uncertainty quantification scheme to enable human assessment. We evaluate TabLeak on four tabular datasets for both FedSGD and FedAvg training protocols, and show that it successfully breaks several settings previously deemed safe. For instance, we extract large subsets of private data at >90% accuracy even at the large batch size of 128. Our findings demonstrate that current high-stakes tabular FL is excessively vulnerable to leakage attacks.
1. Introduction
¹Department of Information Technology and Electrical Engineering, ETH Zurich, Zurich, Switzerland. ²Department of Computer Science, ETH Zurich, Zurich, Switzerland. Correspondence to: Mark Vero <mveroe@ethz.ch>.

Proceedings of the 40th International Conference on Machine Learning, Honolulu, Hawaii, USA. PMLR 202, 2023. Copyright 2023 by the author(s). arXiv:2210.01785v2 [cs.LG] 7 Jul 2023.

Figure 1: Comparison of image, text, and tabular data reconstruction. While the attack success can be judged by human inspection in images and text, for tabular data it is not possible, as both reconstructions look plausible. The image reconstruction example is taken from Yin et al. (2021).

Federated Learning (McMahan et al., 2017) (FL) has emerged as the most prominent approach to training machine learning models collaboratively without requiring sensitive data of different parties to be collected in a central
database. While prior work has examined privacy leakage
from exchanged updates in FL on images (Zhu et al., 2019;
Geiping et al., 2020; Yin et al., 2021) and text (Deng et al.,
2021; Dimitrov et al., 2022a; Gupta et al., 2022), many
applications of FL involve tabular datasets incorporating
highly sensitive personal data such as financial information
and health status (Borisov et al., 2021; Long et al., 2021;
Rieke et al., 2020). However, as no prior work has studied
the issue of privacy leakage in tabular data, we are unaware
of the true extent of its risks. This is also a cause of con-
cern for US and UK public institutions which have recently
launched a $1.6 mil. prize competition¹ to develop privacy-
preserving FL solutions for financial fraud detection and
infection risk prediction, both being tabular datasets.
Ingredients of a Data Leakage Attack A successful at-
tack builds on two pillars: (i) ability to reconstruct private
data from client updates with high accuracy, and (ii) a mech-
anism that allows a human to assess the obtained reconstruc-
tions without knowledge of the true data. Advancing along
the first pillar typically requires leveraging the unique as-
pects of the given domain, e.g., image attacks employ image
priors (Geiping et al., 2020; Yin et al., 2021), while attacks
on text make use of pre-trained language models (Dimitrov
et al., 2022a; Gupta et al., 2022). However, in the image and
text domains, the second pillar naturally comes for free, as
the credibility of the obtained data can be assessed simply
by human inspection, in contrast to tabular data, where this
is not possible, as illustrated in Fig. 1.
¹ https://petsprizechallenges.com/
Figure 2: Overview of TabLeak. Our approach transforms the optimization problem into a fully continuous one by optimizing continuous versions of the discrete features, obtained by applying a softmax (Attack Step 1, middle boxes), resulting in $N$ candidate solutions (Attack Step 1, bottom). Then, we pool together an ensemble of $N$ different solutions $z_1, z_2, \ldots, z_N$ obtained from the optimization to reduce the variance of the reconstruction (Attack Step 2). Finally, we assess the quality of the reconstruction by computing the entropy from the feature distributions in the ensemble (Assessment).
Key Challenges A strong attack for tabular data must ad-
dress two unique challenges, one along each pillar: (i) due
to the presence of both discrete and continuous features,
the attack needs to solve a mixed discrete-continuous op-
timization problem of high variance, and (ii) unlike with
image and text data, assessing the quality of the reconstruc-
tion is no longer possible via human inspection, requiring a
mechanism to quantify the uncertainty of the reconstruction.
This Work In this work we propose the first comprehen-
sive attack on tabular data, TabLeak, addressing the above
challenges. Using our attack, we conduct the first compre-
hensive evaluation of the privacy risks posed by data leak-
age in tabular FL. We provide an overview of our approach
in Fig. 2, showing the reconstruction of a client's private data point $x = [\text{male}, 18, \text{white}]$, from the corresponding update $\nabla f$ received by the server. We tackle the first challenge in two steps. In Attack Step 1, we create $N$ separate optimization problems with different initializations. We transform the mixed discrete-continuous optimization problem into a fully continuous one using a softmax relaxation. Once optimization completes, in Attack Step 2, we reduce the variance of the final reconstruction by pooling over the different solutions. To address challenge (ii, Assessment), we rely on the observation that when the $N$ reconstructions agree on a certain feature, it tends to be reconstructed well. We measure the agreement using entropy. In our example, sex and age exhibit a low-entropy reconstruction and are also correct. Meanwhile, the high disagreement over the race feature is indicative of its incorrect reconstruction.
Comparing our domain-specific attack with prior works adapted from other domains, on both FL protocols (FedSGD and FedAvg), in various settings, and on four popular tabular datasets, we reveal the high vulnerability of such systems on tabular data, even in scenarios previously deemed safe. We observe that for small batch sizes tabular FL systems are nearly transparent, with most attacks recovering >90% of the private data. Further, our attack retrieves 70.8%-84.9% of the client data at the practically relevant batch size of 32 on the examined datasets, improving over prior art by 12.7%-14.5%. Additionally, even at batch sizes as large as 128, we show how an adversary can recover a quarter of the private data at well above 90% accuracy, leading to alarming conclusions about the privacy of FL on tabular data.
Main Contributions Our main contributions are:
- TabLeak, the first effective domain-specific data leakage attack on tabular data, enabling novel insights into the unique aspects of tabular data leakage.
- An effective uncertainty quantification scheme, enabling the assessment of obtained samples and allowing an attacker to extract highly accurate subsets of features even from poor reconstructions.
- An extensive experimental evaluation, revealing the excessively high vulnerability of FL with tabular data by successfully conducting attacks even in setups previously deemed safe.
2. Background and Related Work
Federated Learning FL is a framework developed to facilitate the distributed training of a parametric model while preserving the privacy of the data at its source (McMahan et al., 2017). Formally, we have a parametric function $f_\theta(x) = y$, where $\theta$ are the parameters. Given a dataset formed as the union of the private datasets of the clients, $S = \bigcup_{k=1}^{K} S_k$, we now wish to find a $\theta$ such that $\frac{1}{N}\sum_{(x_i, y_i) \in S} \mathcal{L}(f_\theta(x_i), y_i)$ is minimized, without first collecting the dataset $S$ in a central database. McMahan et al. (2017) propose two training algorithms, FedSGD (a similar algorithm was also proposed by Shokri & Shmatikov (2015)) and FedAvg, that allow for the distributed training of $f_\theta$ while keeping the data partitions $S_k$ at the client sources. The two protocols differ in how the clients compute their local updates in each step of training. In FedSGD, each client calculates the update gradient with respect to a randomly selected batch of their own data and shares it with the server. In FedAvg, the clients conduct a few epochs of local training on their own data before sharing their resulting parameters with the server. In each case, after the server has received the gradients/parameters from the clients, it aggregates them, updates the model, and broadcasts it to the clients, concluding one FL training step.
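For concreteness, the sketch below (our illustration, not code from McMahan et al.) shows what a single client would compute and send to the server in one round of each protocol; the model, data loader, loss function, and hyperparameters are placeholders.

```python
import copy

import torch
from torch import nn


def fedsgd_client_update(model: nn.Module, batch, loss_fn):
    """FedSGD: the client shares the gradient of one randomly selected local batch."""
    x, y = batch
    loss = loss_fn(model(x), y)
    grads = torch.autograd.grad(loss, list(model.parameters()))
    return [g.detach() for g in grads]  # sent to the server


def fedavg_client_update(model: nn.Module, loader, loss_fn, epochs: int = 1, lr: float = 0.1):
    """FedAvg: the client trains locally for a few epochs and shares the resulting weights."""
    local = copy.deepcopy(model)  # start from the broadcast global parameters
    opt = torch.optim.SGD(local.parameters(), lr=lr)
    for _ in range(epochs):
        for x, y in loader:
            opt.zero_grad()
            loss_fn(local(x), y).backward()
            opt.step()
    return local.state_dict()  # sent to the server
```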
Data Leakage Attacks Although the design goal of FL
was to preserve the privacy of clients’ data, recent work has
uncovered substantial vulnerabilities. Melis et al. (2019)
first presented how one can infer certain properties of the
clients’ data. Later, Zhu et al. (2019) demonstrated that an
honest-but-curious server can use the current state of the
model and the received updates to reconstruct the clients’
data, breaking the privacy promise of FL. Under this threat
model, there has been extensive research on designing tai-
lored attacks for images (Geiping et al., 2020; Zhao et al.,
2020; Geng et al., 2021; Huang et al., 2021; Jin et al., 2021;
Balunović et al., 2021; Yin et al., 2021; Jeon et al., 2021;
Dimitrov et al., 2022b) and natural language (Deng et al.,
2021; Dimitrov et al., 2022a; Gupta et al., 2022). However,
no prior work has comprehensively dealt with data leakage
attacks on tabular data, despite its significance in real-world
high-stakes applications (Borisov et al., 2021). While Wu
et al. (2022) describe an attack on tabular data where a ma-
licious client learns some distributional information from
other clients, they do not reconstruct any private data points.
Some works also consider a threat scenario where a mali-
cious server may change the model or the updates sent to
the clients (Fowl et al., 2021; Wen et al., 2022); but in this
work we focus on the honest-but-curious setting.
In FedSGD, given the gradient $\nabla_\theta \mathcal{L}(f_\theta(x), y)$ of some client (shorthand: $g(x, y)$), we solve the following optimization problem to retrieve the client's private data $(x, y)$:

$$\hat{x}, \hat{y} = \arg\min_{x', y'} \; \mathcal{E}\big(g(x, y), g(x', y')\big) + \lambda \mathcal{R}(x'). \tag{1}$$

In Eq. 1 we denote the gradient matching loss as $\mathcal{E}$, and $\mathcal{R}$ is an optional regularizer for the reconstruction. The work of Zhu et al. (2019) used the mean squared error for $\mathcal{E}$, on which Geiping et al. (2020) improved using the cosine similarity loss. Zhao et al. (2020) first demonstrated that the private labels $y$ can be estimated before solving Eq. 1, reducing its complexity and improving the attack results. Their method was later extended to batches by Yin et al. (2021) and refined by Geng et al. (2021). Eq. 1 is typically solved using continuous optimization tools such as L-BFGS (Liu & Nocedal, 1989) and Adam (Kingma & Ba, 2015). Although analytical approaches exist, they do not generalize to batches with more than a single data point (Zhu & Blaschko, 2021).
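As an illustration of how Eq. 1 is typically instantiated, the following PyTorch-style sketch runs Adam on a randomly initialized dummy input to match an observed client gradient; the cross-entropy task loss, mean squared error matching loss, and hyperparameters are illustrative assumptions and not the exact setup of any cited attack.

```python
import torch


def gradient_matching_attack(model, observed_grads, y_hat, input_dim, steps=1000, lr=0.1):
    """Solve Eq. 1 with Adam: find an input whose gradient matches the observed one."""
    x_dummy = torch.rand(y_hat.shape[0], input_dim, requires_grad=True)  # random init
    optimizer = torch.optim.Adam([x_dummy], lr=lr)
    task_loss = torch.nn.CrossEntropyLoss()
    for _ in range(steps):
        optimizer.zero_grad()
        loss = task_loss(model(x_dummy), y_hat)
        dummy_grads = torch.autograd.grad(loss, list(model.parameters()), create_graph=True)
        # Gradient matching loss E, here the mean squared error of Zhu et al. (2019).
        match = sum(((dg - og) ** 2).sum() for dg, og in zip(dummy_grads, observed_grads))
        match.backward()
        optimizer.step()
    return x_dummy.detach()
```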
Domain-Specific Attacks Depending on the data domain,
distinct tailored alterations to Eq. 1 have been proposed
in the literature, e.g., using the total variation regularizer
for images (Geiping et al., 2020) and exploiting pre-trained
language models in language tasks (Dimitrov et al., 2022a;
Gupta et al., 2022). These mostly non-transferable domain-
specific solutions are necessary as each domain poses unique
challenges. Our work is the first to identify and tackle the key challenges of data leakage in the tabular domain.
Privacy Threat of Tabular FL Regulations and personal
interests prevent institutions from sharing privacy-sensitive
tabular data, such as STI and drug test results, social security
numbers, credit scores, and passwords. To this end, FL was
proposed to enable inter-owner usage of such data. However,
in a strict sense, if FL on tabular data leaks any private
information, it does not fulfill its original design purpose,
severely undermining trust in institutions employing such
solutions. In our work we show that tabular FL, in fact,
leaks large amounts of private information.
Mixed Type Tabular Data Mixed type tabular data is commonly used in healthcare, finance, and the social sciences, which entail high-stakes privacy-critical applications (Borisov et al., 2021). Here, data is collected in a table with mostly human-interpretable columns, e.g., the age and race of an individual. Formally, let $x \in \mathcal{X}$ be one row of data and let $\mathcal{X}$ contain $K$ discrete columns and $L$ continuous columns, i.e., $\mathcal{X} = \mathcal{D}_1 \times \cdots \times \mathcal{D}_K \times \mathcal{U}_1 \times \cdots \times \mathcal{U}_L$, where $\mathcal{D}_i \subseteq \mathbb{N}$ and $\mathcal{U}_i \subseteq \mathbb{R}$. For processing with neural networks, discrete features are usually one-hot encoded, while continuous features are preserved. The one-hot encoding of the $i$-th discrete feature $x^D_i$ is a binary vector $c^D_i(x)$ of length $|\mathcal{D}_i|$ that has a single non-zero entry at the position marking the encoded category. We retrieve the represented category by taking the argmax of $c^D_i(x)$ (projection to obtain $x$). Using the described encoding, one row of data $x \in \mathcal{X}$ is encoded as $c(x) = \big[c^D_1(x), \ldots, c^D_K(x), x^C_1, \ldots, x^C_L\big]$, containing $d := L + \sum_{i=1}^{K} |\mathcal{D}_i|$ entries.
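As a small illustration of the encoding $c(x)$ and its projection, the following sketch uses a hypothetical schema with two discrete columns and one continuous column; the column names and category lists are invented for the example.

```python
import numpy as np

# Hypothetical schema: two discrete columns and one continuous column ("age").
DOMAINS = {"sex": ["male", "female"], "race": ["white", "black", "asian", "other"]}


def encode_row(row):
    """c(x): one-hot encode each discrete feature, keep the continuous feature as-is."""
    parts = []
    for col, domain in DOMAINS.items():
        one_hot = np.zeros(len(domain))
        one_hot[domain.index(row[col])] = 1.0
        parts.append(one_hot)
    parts.append(np.array([row["age"]], dtype=float))
    return np.concatenate(parts)  # d = sum(|D_i|) + L entries


def project_row(encoded):
    """Projection: recover each category as the argmax of its one-hot block."""
    row, offset = {}, 0
    for col, domain in DOMAINS.items():
        row[col] = domain[int(np.argmax(encoded[offset:offset + len(domain)]))]
        offset += len(domain)
    row["age"] = float(encoded[offset])
    return row


x = {"sex": "male", "age": 18, "race": "white"}
assert project_row(encode_row(x)) == {"sex": "male", "race": "white", "age": 18.0}
```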
3. Tabular Leakage
In this section, we briefly summarize the challenges in tab-
ular leakage and present our solution to these, followed by
our end-to-end reconstruction attack.
Key Challenges In the tabular domain, a strong attack has
to address two unique challenges: (i) the presence of both
categorical and continuous features requires the attacker to
solve a significantly harder mixed discrete-continuous opti-
mization problem of higher variance (addressed in Sec. 3.1.1
and Sec. 3.1.2), and (ii) as exemplified previously in Fig. 1,
in contrast to images and text, it is hard for an unassisted
adversary to assess the credibility of the reconstructed data
in the tabular domain (addressed in Sec. 3.2).
3.1. Building a Strong Base Attack
We solve challenge (i) by introducing two components
to our attack: a softmax relaxation to turn the mixed
discrete-continuous problem into a fully continuous one
(see Sec. 3.1.1), and pooled ensembling to reduce the vari-
ance in the final reconstruction (see Sec. 3.1.2).
3.1.1. THE SOFTMAX RELAXATION
In accordance with prior literature on data leakage attacks,
we aim to conduct the optimization in the continuous domain.
For this we employ the softmax relaxation, which turns
the hard mixed discrete-continuous optimization problem
into a fully continuous one. This drastically reduces its
complexity, while still facilitating the recovery of correct
discrete structures.
The recovery of one-hot vectors requires the integer constraints that all entries take values in $\{0, 1\}$ and sum to one. Relaxing the integer constraints by allowing the reconstructed entries to take real values in $[0, 1]$, we are still left with a constrained optimization problem not well suited for popular continuous optimization tools, such as Adam (Kingma & Ba, 2015). Therefore, we aim to implicitly enforce the constraints introduced above.
For this, we extend the method of Zhu et al. (2019) used for inverting the discrete labels when jointly optimizing for both the labels and the data. Let $z \in \mathbb{R}^d$ be our approximate intermediate solution for the true one-hot encoded data $c(x)$ during optimization. Then we can implicitly enforce all constraints described above by applying a softmax to $z^D_i$ for all $i$ between 1 and $K$, i.e., define:

$$\sigma(z^D_i)[j] := \frac{\exp(z^D_i[j])}{\sum_{k=1}^{|\mathcal{D}_i|} \exp(z^D_i[k])} \quad \forall j \in \mathcal{D}_i. \tag{2}$$

Therefore, in each round of optimization we will have the following approximation of the true data point: $c(x) \approx \sigma(z) = \big[\sigma(z^D_1), \ldots, \sigma(z^D_K), z^C_1, \ldots, z^C_L\big]$. In order to preserve notational simplicity, we write $\sigma(z)$ to mean the application of the softmax to each group of entries representing a given categorical variable separately. When inverting a batch of data, the softmax is applied in parallel to the batch points.
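A minimal sketch of the relaxation in Eq. 2 is given below: a separate softmax is applied to each block of $z$ corresponding to one categorical feature, while the continuous entries pass through unchanged. The assumed memory layout (one-hot blocks first, continuous features last) is an illustrative choice.

```python
import torch


def grouped_softmax(z, domain_sizes, num_continuous):
    """Softmax relaxation: one softmax per one-hot block, continuous entries unchanged.

    z has shape (batch, d), with the one-hot blocks first and the continuous features last.
    """
    parts, offset = [], 0
    for size in domain_sizes:  # one block per discrete feature
        parts.append(torch.softmax(z[:, offset:offset + size], dim=1))
        offset += size
    parts.append(z[:, offset:offset + num_continuous])  # continuous part kept as-is
    return torch.cat(parts, dim=1)


# Example: two discrete features with |D_1| = 2 and |D_2| = 4, and one continuous feature.
z = torch.randn(8, 7, requires_grad=True)
sigma_z = grouped_softmax(z, domain_sizes=[2, 4], num_continuous=1)
```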
3.1.2. POOLED ENSEMBLING
In general, the data leakage optimization problem possesses
multiple local minima (Zhu & Blaschko, 2021) and is sen-
sitive to initialization (Wei et al., 2020). Additionally, we
observed and confirmed in a targeted experiment in App. E
that in tabular data the mix of discrete and continuous fea-
tures introduces further variance, in contrast to image and
text, where the problem is fully continuous or fully discrete,
respectively. We alleviate this problem by running inde-
pendent optimization processes with different initializations
and ensembling their results through feature-wise pooling.
Exploiting the structural regularity of tabular data, we can
combine independent reconstructions to obtain an improved
and more robust final estimate of the true data by applying
feature-wise pooling. Formally, we run $N$ independent rounds of optimization with i.i.d. initializations, recovering potentially different reconstructions $\{\sigma(z_j)\}_{j=1}^{N}$. Then, we obtain a final estimate of the true encoded data, denoted as $\sigma(\hat{z})$, by pooling across these reconstructions in parallel for each batch point and feature:

$$\sigma^D_i(\hat{z}) = \mathrm{pool}\big(\{\sigma^D_i(z_j)\}_{j=1}^{N}\big) \quad \forall i \in [K], \tag{3}$$

$$\hat{z}^C_i = \mathrm{pool}\big(\{(z^C_i)_j\}_{j=1}^{N}\big) \quad \forall i \in [L], \tag{4}$$

where the $\mathrm{pool}(\cdot)$ operation can be any permutation-invariant mapping. In our attack we use median pooling.
However, the above equations cannot be applied in a straightforward manner as soon as we aim to reconstruct batches containing more than a single data point. As the batch gradient is an average of the per-sample gradients, when running the leakage attack we may retrieve the batch points in a different order at every optimization instance. Hence, it is not immediately clear how we can combine the obtained samples; i.e., we need to reorder each batch such that their rows match each other, and only then can we pool. We reorder by first selecting the sample that produced the best reconstruction loss at the end of optimization, $\hat{z}_{\text{best}}$, with projection $\hat{x}_{\text{best}}$. Then, we match the rows of every other sample in the collection with respect to $\hat{x}_{\text{best}}$. Concretely, we calculate the similarity (shown in Eq. 6 in Sec. 4) between each pair of rows of $\hat{x}_{\text{best}}$ and another sample $\hat{x}_i$ in the collection, and find the maximum-similarity reordering of the rows with the help of bipartite matching solved by the Hungarian algorithm (Kuhn, 1955). This process is depicted in Fig. 3. Repeating this for each sample, we reorder the entire collection with respect to the best-loss sample, effectively reversing the permutation differences in the independent reconstructions. Finally, we can apply feature-wise pooling for each row over the collection.

Figure 3: Maximum similarity matching of a sample $\hat{x}_i$ of batch size 4 from the collection of reconstructions to the best-loss sample $\hat{x}_{\text{best}}$.
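One possible implementation of this matching-and-pooling step is sketched below using SciPy's linear_sum_assignment (an assignment-problem solver equivalent in effect to the Hungarian algorithm); the cosine row similarity merely stands in for the similarity measure of Eq. 6, which is only defined later in the paper.

```python
import numpy as np
from scipy.optimize import linear_sum_assignment


def match_and_pool(reconstructions, best):
    """Reorder every reconstructed batch to the best-loss sample, then median-pool.

    reconstructions: list of arrays of shape (batch_size, d); best: array (batch_size, d).
    """
    aligned = []
    for rec in reconstructions:
        # Row-wise cosine similarity between the best-loss sample and this reconstruction.
        norm_best = best / (np.linalg.norm(best, axis=1, keepdims=True) + 1e-12)
        norm_rec = rec / (np.linalg.norm(rec, axis=1, keepdims=True) + 1e-12)
        sim = norm_best @ norm_rec.T
        # Maximum-similarity bipartite matching (minimize the negated similarity).
        row_idx, col_idx = linear_sum_assignment(-sim)
        reordered = np.empty_like(rec)
        reordered[row_idx] = rec[col_idx]
        aligned.append(reordered)
    # Feature-wise median pooling over the aligned ensemble, as in Eqs. 3 and 4.
    return np.median(np.stack(aligned), axis=0)
```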
3.2. Assessment via Entropy
We now address challenge (ii), assessing reconstructions.
To recap, it is close-to-impossible for an uninformed ad-
versary to assess the quality of the obtained private sample
when it comes to tabular data, as almost any reconstruction
may constitute a credible data point when projected back to
mixed discrete-continuous space. This challenge does not
arise as prominently in the image (or text) domain, because
one can easily judge by looking at a picture whether it is just noise
or an actual image, as exemplified in Fig. 1. To address this
issue, we propose to estimate the reconstruction uncertainty
by looking at the level of agreement over a certain feature
for different reconstructions. Concretely, given a collection
of leaked samples as in Sec. 3.1.2, we can observe the dis-
tribution of each feature over the samples. Intuitively, if
this distribution is "peaky", i.e., concentrates the mass heav-
ily on a certain value, then we can assume that the feature
has been reconstructed correctly, whereas if there is high
disagreement between the reconstructed samples, we can
assume that this feature’s recovered final value should not
be trusted. We can quantify this by measuring the entropy of
the feature distributions induced by the recovered samples.
Discrete Features Let $p(\hat{x}^D_i)_m := \frac{1}{N}\,\mathrm{Count}_j(\hat{x}^D_{ij} = m)$ be the relative frequency of projected reconstructions of the $i$-th discrete feature of value $m$ in the ensemble. Then, we can calculate the normalized entropy of the feature as $\bar{H}^D_i = -\frac{1}{\log |\mathcal{D}_i|}\sum_{m=1}^{|\mathcal{D}_i|} p(\hat{x}^D_i)_m \log p(\hat{x}^D_i)_m$. Note that the normalization allows for comparing features with different domain sizes, i.e., it ensures that $\bar{H}^D_i \in [0, 1]$, as $H(k) \in [0, \log |\mathcal{K}|]$ for any finite discrete random variable $k \in \mathcal{K}$.
Continuous Features In case of continuous features, we calculate the entropy by first making the standard assumption that the errors of the reconstructed continuous features follow a Gaussian distribution. As such, we first estimate the sample variance $\hat{\sigma}^2_i$ for the $i$-th continuous feature and then plug it in to calculate the entropy of the corresponding Gaussian: $H^C_i = \frac{1}{2} + \frac{1}{2}\log\big(2\pi\hat{\sigma}^2_i\big)$. Cross-feature comparability can be achieved by scaling all features, e.g., standardization.
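Both uncertainty scores are simple to compute over the ensemble; the sketch below is our own illustration, using natural logarithms and the unbiased sample variance.

```python
import numpy as np


def discrete_feature_entropy(values, domain_size):
    """Normalized entropy of one discrete feature over the N projected reconstructions."""
    _, counts = np.unique(values, return_counts=True)
    p = counts / counts.sum()
    entropy = -(p * np.log(p)).sum()
    return entropy / np.log(domain_size)  # in [0, 1], comparable across domain sizes


def continuous_feature_entropy(values):
    """Gaussian (differential) entropy of one continuous feature from its sample variance."""
    var = np.var(values, ddof=1)
    return 0.5 + 0.5 * np.log(2.0 * np.pi * var)


# Example: N = 5 projected reconstructions of a discrete feature with a 16-value domain.
print(discrete_feature_entropy(np.array([3, 3, 3, 3, 7]), domain_size=16))
```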
Algorithm 1 TabLeak against training by FedSGD

1: function SINGLEINVERSION(Neural Network: $f_\theta$, Client Gradient: $g(c(x), y)$, Reconstructed Labels: $\hat{y}$, Initial Reconstruction: $z^0_j$, Iterations: $T$, # Discrete Features: $K$)
2:   for $t$ in $0, 1, \ldots, T-1$ do
3:     for $k$ in $1, 2, \ldots, K$ do
4:       $\sigma(z^D_{kj}) \leftarrow \mathrm{softmax}(z^D_{kj})$
5:     end for
6:     $z^{t+1}_j \leftarrow z^t_j - \eta \nabla_z \mathcal{E}_{CS}\big(g(c(x), y), g(\sigma(z^t_j), \hat{y})\big)$
7:   end for
8:   return $z^T_j$
9: end function
10:
11: function TABLEAK(Neural Network: $f_\theta$, Client Gradient: $g(c(x), y)$, Reconstructed Labels: $\hat{y}$, Ensemble Size: $N$, Iterations: $T$, # Discrete Features: $K$)
12:   $\{z^0_j\}_{j=1}^{N} \sim \mathcal{U}[0, 1]^d$
13:   for $j$ in $1, 2, \ldots, N$ do
14:     $z^T_j \leftarrow$ SINGLEINVERSION($f_\theta$, $g(c(x), y)$, $\hat{y}$, $z^0_j$, $T$, $K$)
15:   end for
16:   $\hat{z}_{\text{best}} \leftarrow \arg\min_{z^T_j} \mathcal{E}_{CS}\big(g(c(x), y), g(\sigma(z^T_j), \hat{y})\big)$
17:   $\sigma(\hat{z}) \leftarrow$ MATCHANDPOOL($\{\sigma(z^T_j)\}_{j=1}^{N}$, $\hat{z}_{\text{best}}$)
18:   $\bar{H}^D, H^C \leftarrow$ CALCULATEENTROPY($\{\sigma(z^T_j)\}_{j=1}^{N}$)
19:   $\hat{x} \leftarrow$ PROJECT($\sigma(\hat{z})$)
20:   return $\hat{x}$, $\bar{H}^D$, $H^C$
21: end function
3.3. Combined Attack
Following Geiping et al. (2020), we use the cosine similarity loss as our reconstruction objective, defined as:

$$\mathcal{E}_{CS}(z) := 1 - \frac{\langle g(c(x), y),\, g(\sigma(z), \hat{y}) \rangle}{\|g(c(x), y)\|_2 \, \|g(\sigma(z), \hat{y})\|_2}, \tag{5}$$

where $(x, y)$ are the true data, $\hat{y}$ are the labels reconstructed beforehand, and we optimize for $z$. Our end-to-end attack, TabLeak, is shown in Alg. 1. First, we reconstruct the labels using the label reconstruction method of Geng et al. (2021) and input them into our attack. Then, we initialize $N$ independent dummy samples for an ensemble of size $N$ (Line 12). Starting from each initial sample we optimize independently (Lines 13-15) via the SINGLEINVERSION function. In each optimization step, we apply the softmax relaxation of Sec. 3.1.1 and let the optimizer differentiate through it (Line 4). After the optimization processes have reached the maximum number of allowed iterations $T$, we identify the sample $\hat{z}_{\text{best}}$ producing the best reconstruction loss (Line 16). Using $\hat{z}_{\text{best}}$, we match and pool to obtain the final encoded reconstruction $\sigma(\hat{z})$ in Line 17, as described in Sec. 3.1.2. Finally, we return the projected private data reconstruction $\hat{x}$ and the corresponding feature entropies $\bar{H}^D$ and $H^C$, quantifying the uncertainty in the leaked sample.
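For completeness, a short sketch of the matching loss in Eq. 5; flattening the per-parameter gradients into a single vector before taking the cosine similarity is our simplification.

```python
import torch


def cosine_gradient_loss(observed_grads, dummy_grads):
    """E_CS of Eq. 5: one minus the cosine similarity between the flattened gradients."""
    g_true = torch.cat([g.reshape(-1) for g in observed_grads])
    g_dummy = torch.cat([g.reshape(-1) for g in dummy_grads])
    return 1.0 - torch.dot(g_true, g_dummy) / (g_true.norm() * g_dummy.norm() + 1e-12)
```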