TabLeak: Tabular Data Leakage in Federated Learning
Mark Vero¹  Mislav Balunović²  Dimitar I. Dimitrov²  Martin Vechev²
Abstract
While federated learning (FL) promises to preserve privacy, recent works in the image and text domains have shown that training updates leak private client data. However, most high-stakes applications of FL (e.g., in healthcare and finance) use tabular data, where the risk of data leakage has not yet been explored. A successful attack for tabular data must address two key challenges unique to the domain: (i) obtaining a solution to a high-variance mixed discrete-continuous optimization problem, and (ii) enabling human assessment of the reconstruction, as, unlike for image and text data, direct human inspection is not possible. In this work we address these challenges and propose TabLeak, the first comprehensive reconstruction attack on tabular data. TabLeak is based on two key contributions: (i) a method which leverages a softmax relaxation and pooled ensembling to solve the optimization problem, and (ii) an entropy-based uncertainty quantification scheme to enable human assessment. We evaluate TabLeak on four tabular datasets for both FedSGD and FedAvg training protocols, and show that it successfully breaks several settings previously deemed safe. For instance, we extract large subsets of private data at >90% accuracy even at the large batch size of 128. Our findings demonstrate that current high-stakes tabular FL is excessively vulnerable to leakage attacks.
1. Introduction
¹Department of Information Technology and Electrical Engineering, ETH Zurich, Zurich, Switzerland. ²Department of Computer Science, ETH Zurich, Zurich, Switzerland. Correspondence to: Mark Vero <mveroe@ethz.ch>.

Proceedings of the 40th International Conference on Machine Learning, Honolulu, Hawaii, USA. PMLR 202, 2023. Copyright 2023 by the author(s). arXiv:2210.01785v2 [cs.LG] 7 Jul 2023.

Figure 1: Comparison of image, text, and tabular data reconstruction. While the attack success can be judged by human inspection in images and text, for tabular data it is not possible, as both reconstructions look plausible. The image reconstruction example is taken from Yin et al. (2021).

Federated Learning (McMahan et al., 2017) (FL) has emerged as the most prominent approach to training machine learning models collaboratively without requiring sensitive data of different parties to be collected in a central
database. While prior work has examined privacy leakage
from exchanged updates in FL on images (Zhu et al., 2019;
Geiping et al., 2020; Yin et al., 2021) and text (Deng et al.,
2021; Dimitrov et al., 2022a; Gupta et al., 2022), many
applications of FL involve tabular datasets incorporating
highly sensitive personal data such as financial information
and health status (Borisov et al., 2021; Long et al., 2021;
Rieke et al., 2020). However, as no prior work has studied
the issue of privacy leakage in tabular data, we are unaware
of the true extent of its risks. This is also a cause of con-
cern for US and UK public institutions which have recently
launched a $1.6 mil. prize competition¹ to develop privacy-
preserving FL solutions for financial fraud detection and
infection risk prediction, both being tabular datasets.
Ingredients of a Data Leakage Attack A successful at-
tack builds on two pillars: (i) ability to reconstruct private
data from client updates with high accuracy, and (ii) a mech-
anism that allows a human to assess the obtained reconstruc-
tions without knowledge of the true data. Advancing along
the first pillar typically requires leveraging the unique as-
pects of the given domain, e.g., image attacks employ image
priors (Geiping et al., 2020; Yin et al., 2021), while attacks
on text make use of pre-trained language models (Dimitrov
et al., 2022a; Gupta et al., 2022). However, in the image and
text domains, the second pillar naturally comes for free, as
the credibility of the obtained data can be assessed simply
by human inspection, in contrast to tabular data, where this
is not possible, as illustrated in Fig. 1.
¹ https://petsprizechallenges.com/
Figure 2: Overview of TabLeak. Our approach transforms the optimization problem into a fully continuous one by optimizing continuous versions of the discrete features, obtained by applying a softmax (Attack Step 1, middle boxes), resulting in $N$ candidate solutions (Attack Step 1, bottom). Then, we pool together an ensemble of $N$ different solutions $z_1, z_2, \ldots, z_N$ obtained from the optimization to reduce the variance of the reconstruction (Attack Step 2). Finally, we assess the quality of the reconstruction by computing the entropy from the feature distributions in the ensemble (Assessment).
Key Challenges A strong attack for tabular data must ad-
dress two unique challenges, one along each pillar: (i) due
to the presence of both discrete and continuous features,
the attack needs to solve a mixed discrete-continuous op-
timization problem of high variance, and (ii) unlike with
image and text data, assessing the quality of the reconstruc-
tion is no longer possible via human inspection, requiring a
mechanism to quantify the uncertainty of the reconstruction.
This Work In this work we propose the first comprehen-
sive attack on tabular data, TabLeak, addressing the above
challenges. Using our attack, we conduct the first compre-
hensive evaluation of the privacy risks posed by data leak-
age in tabular FL. We provide an overview of our approach
in Fig. 2, showing the reconstruction of a client's private data point $x = [\text{male}, 18, \text{white}]$, from the corresponding update $\nabla f$ received by the server. We tackle the first challenge in two steps. In Attack Step 1, we create $N$ separate optimization problems with different initializations. We transform the mixed discrete-continuous optimization problem into a fully continuous one using a softmax relaxation. Once optimization completes, in Attack Step 2, we reduce the variance of the final reconstruction by pooling over the different solutions. To address challenge (ii, Assessment), we rely on the observation that when the $N$ reconstructions agree on a certain feature, it tends to be reconstructed well. We measure the agreement using entropy. In our example, sex and age exhibit a low-entropy reconstruction and are also correct. Meanwhile, the high disagreement over the race feature is indicative of its incorrect reconstruction.
Comparing our domain-specific attack with prior works adapted from other domains, on both FL protocols (FedSGD and FedAvg), in various settings, and on four popular tabular datasets, we reveal the high vulnerability of such systems on tabular data, even in scenarios previously deemed safe. We observe that for small batch sizes tabular FL systems are nearly transparent, with most attacks recovering >90% of the private data. Further, our attack retrieves 70.8%-84.9% of the client data at the practically relevant batch size of 32 on the examined datasets, improving over prior art by 12.7%-14.5%. Additionally, even at batch sizes as large as 128, we show how an adversary can recover a quarter of the private data at well above 90% accuracy, leading to alarming conclusions about the privacy of FL on tabular data.
Main Contributions Our main contributions are:
- TabLeak, the first effective domain-specific data leakage attack on tabular data, enabling novel insights into the unique aspects of tabular data leakage.
- An effective uncertainty quantification scheme, enabling the assessment of obtained samples and allowing an attacker to extract highly accurate subsets of features even from poor reconstructions.
- An extensive experimental evaluation, revealing the excessively high vulnerability of FL with tabular data by successfully conducting attacks even in setups previously deemed safe.
2. Background and Related Work
Federated Learning FL is a framework developed to facilitate the distributed training of a parametric model while preserving the privacy of the data at its source (McMahan et al., 2017). Formally, we have a parametric function $f_\theta(x) = y$, where $\theta$ are the parameters. Given a dataset formed as the union of the private datasets of the clients, $S = \bigcup_{k=1}^{K} S_k$, we now wish to find a $\theta$ such that $\frac{1}{N}\sum_{(x_i, y_i) \in S} \mathcal{L}(f_\theta(x_i), y_i)$ is minimized, without first collecting the dataset $S$ in a central database. McMahan et al. (2017) propose two training algorithms, FedSGD (a similar algorithm was also proposed by Shokri & Shmatikov (2015)) and FedAvg, that allow for the distributed training of $f_\theta$ while keeping the data partitions $S_k$ at the client sources. The two protocols differ in how the clients compute their local updates in each step of training. In FedSGD, each client calculates the update gradient with respect to a randomly selected batch of their own data and shares it with the server. In FedAvg, the clients conduct a few epochs of local training on their own data before sharing their resulting parameters with the server. In each case, after the server has received the gradients/parameters from the clients, it aggregates them, updates the model, and broadcasts it to the clients, concluding one FL training step.
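For concreteness, the sketch below (our illustration, not code from McMahan et al.) shows what a single client would compute and send to the server in one round of each protocol; the model, data loader, loss function, and hyperparameters are placeholders.

```python
import copy

import torch
from torch import nn


def fedsgd_client_update(model: nn.Module, batch, loss_fn):
    """FedSGD: the client shares the gradient of one randomly selected local batch."""
    x, y = batch
    loss = loss_fn(model(x), y)
    grads = torch.autograd.grad(loss, list(model.parameters()))
    return [g.detach() for g in grads]  # sent to the server


def fedavg_client_update(model: nn.Module, loader, loss_fn, epochs: int = 1, lr: float = 0.1):
    """FedAvg: the client trains locally for a few epochs and shares the resulting weights."""
    local = copy.deepcopy(model)  # start from the broadcast global parameters
    opt = torch.optim.SGD(local.parameters(), lr=lr)
    for _ in range(epochs):
        for x, y in loader:
            opt.zero_grad()
            loss_fn(local(x), y).backward()
            opt.step()
    return local.state_dict()  # sent to the server
```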
Data Leakage Attacks Although the design goal of FL
was to preserve the privacy of clients’ data, recent work has
uncovered substantial vulnerabilities. Melis et al. (2019)
first presented how one can infer certain properties of the
clients’ data. Later, Zhu et al. (2019) demonstrated that an
honest-but-curious server can use the current state of the
model and the received updates to reconstruct the clients’
data, breaking the privacy promise of FL. Under this threat
model, there has been extensive research on designing tai-
lored attacks for images (Geiping et al., 2020; Zhao et al.,
2020; Geng et al., 2021; Huang et al., 2021; Jin et al., 2021;
Balunović et al., 2021; Yin et al., 2021; Jeon et al., 2021;
Dimitrov et al., 2022b) and natural language (Deng et al.,
2021; Dimitrov et al., 2022a; Gupta et al., 2022). However,
no prior work has comprehensively dealt with data leakage
attacks on tabular data, despite its significance in real-world
high-stakes applications (Borisov et al., 2021). While Wu
et al. (2022) describe an attack on tabular data where a ma-
licious client learns some distributional information from
other clients, they do not reconstruct any private data points.
Some works also consider a threat scenario where a mali-
cious server may change the model or the updates sent to
the clients (Fowl et al., 2021; Wen et al., 2022); but in this
work we focus on the honest-but-curious setting.
In FedSGD, given the gradient $\nabla_\theta \mathcal{L}(f_\theta(x), y)$ of some client (shorthand: $g(x, y)$), we solve the following optimization problem to retrieve the client's private data $(x, y)$:

$$\hat{x}, \hat{y} = \arg\min_{x', y'} \; \mathcal{E}\big(g(x, y), g(x', y')\big) + \lambda \mathcal{R}(x'). \tag{1}$$

In Eq. 1 we denote the gradient matching loss as $\mathcal{E}$, and $\mathcal{R}$ is an optional regularizer for the reconstruction. The work of Zhu et al. (2019) used the mean squared error for $\mathcal{E}$, on which Geiping et al. (2020) improved using the cosine similarity loss. Zhao et al. (2020) first demonstrated that the private labels $y$ can be estimated before solving Eq. 1, reducing its complexity and improving the attack results. Their method was later extended to batches by Yin et al. (2021) and refined by Geng et al. (2021). Eq. 1 is typically solved using continuous optimization tools such as L-BFGS (Liu & Nocedal, 1989) and Adam (Kingma & Ba, 2015). Although analytical approaches exist, they do not generalize to batches with more than a single data point (Zhu & Blaschko, 2021).
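As an illustration of how Eq. 1 is typically instantiated, the following PyTorch-style sketch runs Adam on a randomly initialized dummy input to match an observed client gradient; the cross-entropy task loss, mean squared error matching loss, and hyperparameters are illustrative assumptions and not the exact setup of any cited attack.

```python
import torch


def gradient_matching_attack(model, observed_grads, y_hat, input_dim, steps=1000, lr=0.1):
    """Solve Eq. 1 with Adam: find an input whose gradient matches the observed one."""
    x_dummy = torch.rand(y_hat.shape[0], input_dim, requires_grad=True)  # random init
    optimizer = torch.optim.Adam([x_dummy], lr=lr)
    task_loss = torch.nn.CrossEntropyLoss()
    for _ in range(steps):
        optimizer.zero_grad()
        loss = task_loss(model(x_dummy), y_hat)
        dummy_grads = torch.autograd.grad(loss, list(model.parameters()), create_graph=True)
        # Gradient matching loss E, here the mean squared error of Zhu et al. (2019).
        match = sum(((dg - og) ** 2).sum() for dg, og in zip(dummy_grads, observed_grads))
        match.backward()
        optimizer.step()
    return x_dummy.detach()
```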
Domain-Specific Attacks Depending on the data domain,
distinct tailored alterations to Eq. 1 have been proposed
in the literature, e.g., using the total variation regularizer
for images (Geiping et al., 2020) and exploiting pre-trained
language models in language tasks (Dimitrov et al., 2022a;
Gupta et al., 2022). These mostly non-transferable domain-
specific solutions are necessary as each domain poses unique
challenges. Our work is the first to identify and tackle the key challenges of data leakage in the tabular domain.
Privacy Threat of Tabular FL Regulations and personal
interests prevent institutions from sharing privacy-sensitive
tabular data, such as STI and drug test results, social security
numbers, credit scores, and passwords. To this end, FL was
proposed to enable inter-owner usage of such data. However,
in a strict sense, if FL on tabular data leaks any private
information, it does not fulfill its original design purpose,
severely undermining trust in institutions employing such
solutions. In our work we show that tabular FL, in fact,
leaks large amounts of private information.
Mixed Type Tabular Data Mixed type tabular data is commonly used in healthcare, finance, and the social sciences, which entail high-stakes privacy-critical applications (Borisov et al., 2021). Here, data is collected in a table with mostly human-interpretable columns, e.g., the age and race of an individual. Formally, let $x \in \mathcal{X}$ be one row of data and let $\mathcal{X}$ contain $K$ discrete columns and $L$ continuous columns, i.e., $\mathcal{X} = \mathcal{D}_1 \times \cdots \times \mathcal{D}_K \times \mathcal{U}_1 \times \cdots \times \mathcal{U}_L$, where $\mathcal{D}_i \subseteq \mathbb{N}$ and $\mathcal{U}_i \subseteq \mathbb{R}$. For processing with neural networks, discrete features are usually one-hot encoded, while continuous features are preserved. The one-hot encoding of the $i$-th discrete feature $x^D_i$ is a binary vector $c^D_i(x)$ of length $|\mathcal{D}_i|$ that has a single non-zero entry at the position marking the encoded category. We retrieve the represented category by taking the argmax of $c^D_i(x)$ (projection to obtain $x$). Using the described encoding, one row of data $x \in \mathcal{X}$ is encoded as $c(x) = \big[c^D_1(x), \ldots, c^D_K(x), x^C_1, \ldots, x^C_L\big]$, containing $d := L + \sum_{i=1}^{K} |\mathcal{D}_i|$ entries.
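As a small illustration of the encoding $c(x)$ and its projection, the following sketch uses a hypothetical schema with two discrete columns and one continuous column; the column names and category lists are invented for the example.

```python
import numpy as np

# Hypothetical schema: two discrete columns and one continuous column ("age").
DOMAINS = {"sex": ["male", "female"], "race": ["white", "black", "asian", "other"]}


def encode_row(row):
    """c(x): one-hot encode each discrete feature, keep the continuous feature as-is."""
    parts = []
    for col, domain in DOMAINS.items():
        one_hot = np.zeros(len(domain))
        one_hot[domain.index(row[col])] = 1.0
        parts.append(one_hot)
    parts.append(np.array([row["age"]], dtype=float))
    return np.concatenate(parts)  # d = sum(|D_i|) + L entries


def project_row(encoded):
    """Projection: recover each category as the argmax of its one-hot block."""
    row, offset = {}, 0
    for col, domain in DOMAINS.items():
        row[col] = domain[int(np.argmax(encoded[offset:offset + len(domain)]))]
        offset += len(domain)
    row["age"] = float(encoded[offset])
    return row


x = {"sex": "male", "age": 18, "race": "white"}
assert project_row(encode_row(x)) == {"sex": "male", "race": "white", "age": 18.0}
```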
3. Tabular Leakage
In this section, we briefly summarize the challenges in tab-
ular leakage and present our solution to these, followed by
our end-to-end reconstruction attack.
Key Challenges In the tabular domain, a strong attack has
to address two unique challenges: (i) the presence of both
categorical and continuous features requires the attacker to
solve a significantly harder mixed discrete-continuous opti-
mization problem of higher variance (addressed in Sec. 3.1.1
and Sec. 3.1.2), and (ii) as exemplified previously in Fig. 1,
in contrast to images and text, it is hard for an unassisted
adversary to assess the credibility of the reconstructed data
in the tabular domain (addressed in Sec. 3.2).
3.1. Building a Strong Base Attack
We solve challenge (i) by introducing two components
to our attack: a softmax relaxation to turn the mixed
discrete-continuous problem into a fully continuous one
(see Sec. 3.1.1), and pooled ensembling to reduce the vari-
ance in the final reconstruction (see Sec. 3.1.2).
3.1.1. THE SOFTMAX RELAXATION
In accordance with prior literature on data leakage attacks,
we aim to conduct the optimization in the continuous domain.
For this we employ the softmax relaxation, which turns
the hard mixed discrete-continuous optimization problem
into a fully continuous one. This drastically reduces its
complexity, while still facilitating the recovery of correct
discrete structures.
The recovery of one-hot vectors requires the integer constraints that all entries take values in $\{0, 1\}$ and sum to one. Relaxing the integer constraints by allowing the reconstructed entries to take real values in $[0, 1]$, we are still left with a constrained optimization problem not well suited for popular continuous optimization tools, such as Adam (Kingma & Ba, 2015). Therefore, we aim to implicitly enforce the constraints introduced above.
For this, we extend the method of Zhu et al. (2019) used for inverting the discrete labels when jointly optimizing for both the labels and the data. Let $z \in \mathbb{R}^d$ be our approximate intermediate solution for the true one-hot encoded data $c(x)$ during optimization. Then we can implicitly enforce all constraints described above by applying a softmax to $z^D_i$ for all $i$ between 1 and $K$, i.e., define:

$$\sigma(z^D_i)[j] := \frac{\exp(z^D_i[j])}{\sum_{k=1}^{|\mathcal{D}_i|} \exp(z^D_i[k])} \quad \forall j \in \mathcal{D}_i. \tag{2}$$

Therefore, in each round of optimization we will have the following approximation of the true data point: $c(x) \approx \sigma(z) = \big[\sigma(z^D_1), \ldots, \sigma(z^D_K), z^C_1, \ldots, z^C_L\big]$. In order to preserve notational simplicity, we write $\sigma(z)$ to mean the application of the softmax to each group of entries representing a given categorical variable separately. When inverting a batch of data, the softmax is applied in parallel to the batch points.
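A minimal sketch of the relaxation in Eq. 2 is given below: a separate softmax is applied to each block of $z$ corresponding to one categorical feature, while the continuous entries pass through unchanged. The assumed memory layout (one-hot blocks first, continuous features last) is an illustrative choice.

```python
import torch


def grouped_softmax(z, domain_sizes, num_continuous):
    """Softmax relaxation: one softmax per one-hot block, continuous entries unchanged.

    z has shape (batch, d), with the one-hot blocks first and the continuous features last.
    """
    parts, offset = [], 0
    for size in domain_sizes:  # one block per discrete feature
        parts.append(torch.softmax(z[:, offset:offset + size], dim=1))
        offset += size
    parts.append(z[:, offset:offset + num_continuous])  # continuous part kept as-is
    return torch.cat(parts, dim=1)


# Example: two discrete features with |D_1| = 2 and |D_2| = 4, and one continuous feature.
z = torch.randn(8, 7, requires_grad=True)
sigma_z = grouped_softmax(z, domain_sizes=[2, 4], num_continuous=1)
```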
3.1.2. POOLED ENSEMBLING
In general, the data leakage optimization problem possesses
multiple local minima (Zhu & Blaschko, 2021) and is sen-
sitive to initialization (Wei et al., 2020). Additionally, we
observed and confirmed in a targeted experiment in App. E
that in tabular data the mix of discrete and continuous fea-
tures introduces further variance, in contrast to image and
text, where the problem is fully continuous or fully discrete,
respectively. We alleviate this problem by running inde-
pendent optimization processes with different initializations
and ensembling their results through feature-wise pooling.
Exploiting the structural regularity of tabular data, we can
combine independent reconstructions to obtain an improved
and more robust final estimate of the true data by applying
feature-wise pooling. Formally, we run $N$ independent rounds of optimization with i.i.d. initializations, recovering potentially different reconstructions $\{\sigma(z_j)\}_{j=1}^{N}$. Then, we obtain a final estimate of the true encoded data, denoted as $\sigma(\hat{z})$, by pooling across these reconstructions in parallel for each batch point and feature:

$$\sigma^D_i(\hat{z}) = \mathrm{pool}\big(\{\sigma^D_i(z_j)\}_{j=1}^{N}\big) \quad \forall i \in [K], \tag{3}$$

$$\hat{z}^C_i = \mathrm{pool}\big(\{(z^C_i)_j\}_{j=1}^{N}\big) \quad \forall i \in [L], \tag{4}$$

where the $\mathrm{pool}(\cdot)$ operation can be any permutation-invariant mapping. In our attack we use median pooling.
However, the above equations cannot be applied in a straightforward manner as soon as we aim to reconstruct batches containing more than a single data point. As the batch gradient is an average of the per-sample gradients, when running the leakage attack we may retrieve the batch points in a different order at every optimization instance. Hence, it is not immediately clear how we can combine the obtained samples; i.e., we need to reorder each batch such that their rows match each other, and only then can we pool. We reorder by first selecting the sample that produced the best reconstruction loss at the end of optimization, $\hat{z}_{\text{best}}$, with projection $\hat{x}_{\text{best}}$. Then, we match the rows of every other sample in the collection with respect to $\hat{x}_{\text{best}}$. Concretely, we calculate the similarity (shown in Eq. 6 in Sec. 4) between each pair of rows of $\hat{x}_{\text{best}}$ and another sample $\hat{x}_i$ in the collection, and find the maximum-similarity reordering of the rows with the help of bipartite matching solved by the Hungarian algorithm (Kuhn, 1955). This process is depicted in Fig. 3. Repeating this for each sample, we reorder the entire collection with respect to the best-loss sample, effectively reversing the permutation differences in the independent reconstructions. Finally, we can apply feature-wise pooling for each row over the collection.

Figure 3: Maximum similarity matching of a sample $\hat{x}_i$ of batch size 4 from the collection of reconstructions to the best-loss sample $\hat{x}_{\text{best}}$.
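One possible implementation of this matching-and-pooling step is sketched below using SciPy's linear_sum_assignment (an assignment-problem solver equivalent in effect to the Hungarian algorithm); the cosine row similarity merely stands in for the similarity measure of Eq. 6, which is only defined later in the paper.

```python
import numpy as np
from scipy.optimize import linear_sum_assignment


def match_and_pool(reconstructions, best):
    """Reorder every reconstructed batch to the best-loss sample, then median-pool.

    reconstructions: list of arrays of shape (batch_size, d); best: array (batch_size, d).
    """
    aligned = []
    for rec in reconstructions:
        # Row-wise cosine similarity between the best-loss sample and this reconstruction.
        norm_best = best / (np.linalg.norm(best, axis=1, keepdims=True) + 1e-12)
        norm_rec = rec / (np.linalg.norm(rec, axis=1, keepdims=True) + 1e-12)
        sim = norm_best @ norm_rec.T
        # Maximum-similarity bipartite matching (minimize the negated similarity).
        row_idx, col_idx = linear_sum_assignment(-sim)
        reordered = np.empty_like(rec)
        reordered[row_idx] = rec[col_idx]
        aligned.append(reordered)
    # Feature-wise median pooling over the aligned ensemble, as in Eqs. 3 and 4.
    return np.median(np.stack(aligned), axis=0)
```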
3.2. Assessment via Entropy
We now address challenge (ii), assessing reconstructions.
To recap, it is close-to-impossible for an uninformed ad-
versary to assess the quality of the obtained private sample
when it comes to tabular data, as almost any reconstruction
may constitute a credible data point when projected back to
mixed discrete-continuous space. This challenge does not
arise as prominently in the image (or text) domain, because
one can easily judge by looking at a picture whether it is just noise
or an actual image, as exemplified in Fig. 1. To address this
issue, we propose to estimate the reconstruction uncertainty
by looking at the level of agreement over a certain feature
for different reconstructions. Concretely, given a collection
of leaked samples as in Sec. 3.1.2, we can observe the dis-
tribution of each feature over the samples. Intuitively, if
this distribution is "peaky", i.e., concentrates the mass heav-
ily on a certain value, then we can assume that the feature
has been reconstructed correctly, whereas if there is high
disagreement between the reconstructed samples, we can
assume that this feature’s recovered final value should not
be trusted. We can quantify this by measuring the entropy of
the feature distributions induced by the recovered samples.
Discrete Features Let $p(\hat{x}^D_i)_m := \frac{1}{N}\,\mathrm{Count}_j(\hat{x}^D_{ij} = m)$ be the relative frequency of projected reconstructions of the $i$-th discrete feature of value $m$ in the ensemble. Then, we can calculate the normalized entropy of the feature as $\bar{H}^D_i = -\frac{1}{\log |\mathcal{D}_i|}\sum_{m=1}^{|\mathcal{D}_i|} p(\hat{x}^D_i)_m \log p(\hat{x}^D_i)_m$. Note that the normalization allows for comparing features with different domain sizes, i.e., it ensures that $\bar{H}^D_i \in [0, 1]$, as $H(k) \in [0, \log |\mathcal{K}|]$ for any finite discrete random variable $k \in \mathcal{K}$.
Continuous Features In case of continuous features, we calculate the entropy by first making the standard assumption that the errors of the reconstructed continuous features follow a Gaussian distribution. As such, we first estimate the sample variance $\hat{\sigma}^2_i$ for the $i$-th continuous feature and then plug it in to calculate the entropy of the corresponding Gaussian: $H^C_i = \frac{1}{2} + \frac{1}{2}\log\big(2\pi\hat{\sigma}^2_i\big)$. Cross-feature comparability can be achieved by scaling all features, e.g., standardization.
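Both uncertainty scores are simple to compute over the ensemble; the sketch below is our own illustration, using natural logarithms and the unbiased sample variance.

```python
import numpy as np


def discrete_feature_entropy(values, domain_size):
    """Normalized entropy of one discrete feature over the N projected reconstructions."""
    _, counts = np.unique(values, return_counts=True)
    p = counts / counts.sum()
    entropy = -(p * np.log(p)).sum()
    return entropy / np.log(domain_size)  # in [0, 1], comparable across domain sizes


def continuous_feature_entropy(values):
    """Gaussian (differential) entropy of one continuous feature from its sample variance."""
    var = np.var(values, ddof=1)
    return 0.5 + 0.5 * np.log(2.0 * np.pi * var)


# Example: N = 5 projected reconstructions of a discrete feature with a 16-value domain.
print(discrete_feature_entropy(np.array([3, 3, 3, 3, 7]), domain_size=16))
```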
Algorithm 1 TabLeak against training by FedSGD

1: function SINGLEINVERSION(Neural Network: $f_\theta$, Client Gradient: $g(c(x), y)$, Reconstructed Labels: $\hat{y}$, Initial Reconstruction: $z^0_j$, Iterations: $T$, # Discrete Features: $K$)
2:   for $t$ in $0, 1, \ldots, T-1$ do
3:     for $k$ in $1, 2, \ldots, K$ do
4:       $\sigma(z^D_{kj}) \leftarrow \mathrm{softmax}(z^D_{kj})$
5:     end for
6:     $z^{t+1}_j \leftarrow z^t_j - \eta \nabla_z \mathcal{E}_{CS}\big(g(c(x), y), g(\sigma(z^t_j), \hat{y})\big)$
7:   end for
8:   return $z^T_j$
9: end function
10:
11: function TABLEAK(Neural Network: $f_\theta$, Client Gradient: $g(c(x), y)$, Reconstructed Labels: $\hat{y}$, Ensemble Size: $N$, Iterations: $T$, # Discrete Features: $K$)
12:   $\{z^0_j\}_{j=1}^{N} \sim \mathcal{U}[0, 1]^d$
13:   for $j$ in $1, 2, \ldots, N$ do
14:     $z^T_j \leftarrow$ SINGLEINVERSION($f_\theta$, $g(c(x), y)$, $\hat{y}$, $z^0_j$, $T$, $K$)
15:   end for
16:   $\hat{z}_{\text{best}} \leftarrow \arg\min_{z^T_j} \mathcal{E}_{CS}\big(g(c(x), y), g(\sigma(z^T_j), \hat{y})\big)$
17:   $\sigma(\hat{z}) \leftarrow$ MATCHANDPOOL($\{\sigma(z^T_j)\}_{j=1}^{N}$, $\hat{z}_{\text{best}}$)
18:   $\bar{H}^D, H^C \leftarrow$ CALCULATEENTROPY($\{\sigma(z^T_j)\}_{j=1}^{N}$)
19:   $\hat{x} \leftarrow$ PROJECT($\sigma(\hat{z})$)
20:   return $\hat{x}$, $\bar{H}^D$, $H^C$
21: end function
3.3. Combined Attack
Following Geiping et al. (2020), we use the cosine similarity loss as our reconstruction objective, defined as:

$$\mathcal{E}_{CS}(z) := 1 - \frac{\langle g(c(x), y),\, g(\sigma(z), \hat{y}) \rangle}{\|g(c(x), y)\|_2 \, \|g(\sigma(z), \hat{y})\|_2}, \tag{5}$$

where $(x, y)$ are the true data, $\hat{y}$ are the labels reconstructed beforehand, and we optimize for $z$. Our end-to-end attack, TabLeak, is shown in Alg. 1. First, we reconstruct the labels using the label reconstruction method of Geng et al. (2021) and input them into our attack. Then, we initialize $N$ independent dummy samples for an ensemble of size $N$ (Line 12). Starting from each initial sample we optimize independently (Lines 13-15) via the SINGLEINVERSION function. In each optimization step, we apply the softmax relaxation of Sec. 3.1.1 and let the optimizer differentiate through it (Line 4). After the optimization processes have reached the maximum number of allowed iterations $T$, we identify the sample $\hat{z}_{\text{best}}$ producing the best reconstruction loss (Line 16). Using $\hat{z}_{\text{best}}$, we match and pool to obtain the final encoded reconstruction $\sigma(\hat{z})$ in Line 17, as described in Sec. 3.1.2. Finally, we return the projected private data reconstruction $\hat{x}$ and the corresponding feature entropies $\bar{H}^D$ and $H^C$, quantifying the uncertainty in the leaked sample.
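For completeness, a short sketch of the matching loss in Eq. 5; flattening the per-parameter gradients into a single vector before taking the cosine similarity is our simplification.

```python
import torch


def cosine_gradient_loss(observed_grads, dummy_grads):
    """E_CS of Eq. 5: one minus the cosine similarity between the flattened gradients."""
    g_true = torch.cat([g.reshape(-1) for g in observed_grads])
    g_dummy = torch.cat([g.reshape(-1) for g in dummy_grads])
    return 1.0 - torch.dot(g_true, g_dummy) / (g_true.norm() * g_dummy.norm() + 1e-12)
```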