
TabLeak: Tabular Data Leakage in Federated Learning
2. Background and Related Work
Federated Learning  Federated Learning (FL) is a framework developed to facilitate the distributed training of a parametric model while preserving the privacy of the data at its source (McMahan et al., 2017). Formally, we have a parametric function $f_\theta(x) = y$, where $\theta$ are the parameters. Given a dataset formed as the union of the clients' private datasets $S = \bigcup_{k=1}^{K} S_k$, we now wish to find a $\theta^*$ such that $\frac{1}{N} \sum_{(x_i, y_i) \in S} \mathcal{L}(f_{\theta^*}(x_i), y_i)$ is minimized, without first collecting the dataset $S$ in a central
database. McMahan et al. (2017) propose two training algo-
rithms: FedSGD (a similar algorithm was also proposed by
Shokri & Shmatikov (2015)) and FedAvg, that allow for the
distributed training of $f_\theta$, while keeping the data partitions $S_k$ at client sources. The two protocols differ in how the
clients compute their local updates in each step of training.
In FedSGD, each client calculates the update gradient with
respect to a randomly selected batch of their own data and
shares it with the server. During FedAvg, the clients con-
duct a few epochs of local training on their own data before
sharing their resulting parameters with the server. In each
case, after the server has received the gradients/parameters
from the clients, it aggregates them, updates the model, and
broadcasts it to the clients, concluding an FL training step.
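To make the difference concrete, the following PyTorch sketch contrasts the two client-side updates; it is a minimal illustration under assumed placeholder names (`global_model`, `loss_fn`, `batch`, `loader`), not the exact protocol of McMahan et al. (2017).

```python
import copy
import torch

def fedsgd_client_update(global_model, loss_fn, batch):
    """FedSGD: the client shares the gradient computed on one random batch."""
    x, y = batch
    global_model.zero_grad()
    loss_fn(global_model(x), y).backward()
    return [p.grad.detach().clone() for p in global_model.parameters()]

def fedavg_client_update(global_model, loss_fn, loader, epochs=5, lr=0.01):
    """FedAvg: the client trains locally for a few epochs and shares its parameters."""
    model = copy.deepcopy(global_model)
    opt = torch.optim.SGD(model.parameters(), lr=lr)
    for _ in range(epochs):
        for x, y in loader:
            opt.zero_grad()
            loss_fn(model(x), y).backward()
            opt.step()
    return [p.detach().clone() for p in model.parameters()]
```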
Data Leakage Attacks Although the design goal of FL
was to preserve the privacy of clients’ data, recent work has
uncovered substantial vulnerabilities. Melis et al. (2019)
first presented how one can infer certain properties of the
clients’ data. Later, Zhu et al. (2019) demonstrated that an
honest-but-curious server can use the current state of the
model and the received updates to reconstruct the clients’
data, breaking the privacy promise of FL. Under this threat
model, there has been extensive research on designing tai-
lored attacks for images (Geiping et al., 2020; Zhao et al.,
2020; Geng et al., 2021; Huang et al., 2021; Jin et al., 2021;
Balunović et al., 2021; Yin et al., 2021; Jeon et al., 2021;
Dimitrov et al., 2022b) and natural language (Deng et al.,
2021; Dimitrov et al., 2022a; Gupta et al., 2022). However,
no prior work has comprehensively dealt with data leakage
attacks on tabular data, despite its significance in real-world
high-stakes applications (Borisov et al., 2021). While Wu
et al. (2022) describe an attack on tabular data where a ma-
licious client learns some distributional information from
other clients, they do not reconstruct any private data points.
Some works also consider a threat scenario where a mali-
cious server may change the model or the updates sent to
the clients (Fowl et al., 2021; Wen et al., 2022), but in this
work we focus on the honest-but-curious setting.
In FedSGD, given the gradient $\nabla_\theta \mathcal{L}(f_\theta(x), y)$ of some client (shorthand: $g(x, y)$), we solve the following optimization problem to retrieve the client's private data $(x, y)$:

$$\hat{x}, \hat{y} = \arg\min_{x', y'} \; \mathcal{E}(g(x, y), g(x', y')) + \lambda \mathcal{R}(x'). \tag{1}$$
In Eq. 1 we denote the gradient matching loss as $\mathcal{E}$, and $\mathcal{R}$ is an optional regularizer for the reconstruction. The work of Zhu et al. (2019) used the mean squared error for $\mathcal{E}$, on which Geiping et al. (2020) improved using the cosine similarity loss. Zhao et al. (2020) first demonstrated that the private labels $y$ can be estimated before solving Eq. 1,
reducing the complexity of Eq. 1 and improving the attack
results. Their method was later extended to batches by Yin
et al. (2021) and refined by Geng et al. (2021). Eq. 1 is
typically solved using continuous optimization tools such
as L-BFGS (Liu & Nocedal, 1989) and Adam (Kingma
& Ba, 2015). Although analytical approaches exist, they
do not generalize to batches with more than a single data
point (Zhu & Blaschko, 2021).
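As an illustration of how Eq. 1 is solved in practice, the sketch below performs gradient matching with the cosine similarity loss of Geiping et al. (2020) and Adam; the label estimate `y_hat` is assumed to come from a prior label-recovery step (Zhao et al., 2020), and all names are placeholders rather than the attack used in this work.

```python
import torch

def cosine_gradient_loss(g_true, g_guess):
    # 1 - cosine similarity between the flattened client and dummy gradients
    a = torch.cat([g.flatten() for g in g_true])
    b = torch.cat([g.flatten() for g in g_guess])
    return 1.0 - torch.nn.functional.cosine_similarity(a, b, dim=0)

def reconstruct(model, loss_fn, g_true, y_hat, x_shape, steps=1000, lr=0.1,
                lam=0.0, reg=None):
    """Approximately solve Eq. 1: find x' whose gradient matches g_true."""
    params = [p for p in model.parameters() if p.requires_grad]
    x_guess = torch.randn(x_shape, requires_grad=True)
    opt = torch.optim.Adam([x_guess], lr=lr)
    for _ in range(steps):
        opt.zero_grad()
        # gradient of the training loss at the current guess, kept differentiable
        g_guess = torch.autograd.grad(loss_fn(model(x_guess), y_hat),
                                      params, create_graph=True)
        loss = cosine_gradient_loss(g_true, g_guess)
        if reg is not None:  # optional regularizer R(x')
            loss = loss + lam * reg(x_guess)
        loss.backward()
        opt.step()
    return x_guess.detach()
```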
Domain-Specific Attacks Depending on the data domain,
distinct tailored alterations to Eq. 1 have been proposed
in the literature, e.g., using the total variation regularizer
for images (Geiping et al., 2020) and exploiting pre-trained
language models in language tasks (Dimitrov et al., 2022a;
Gupta et al., 2022). These mostly non-transferable domain-
specific solutions are necessary as each domain poses unique
challenges. Our work is the first to identify and tackle the key challenges of data leakage in the tabular domain.
Privacy Threat of Tabular FL Regulations and personal
interests prevent institutions from sharing privacy-sensitive
tabular data, such as STI and drug test results, social security
numbers, credit scores, and passwords. To address this, FL was proposed to enable the joint use of such data across owners. However,
in a strict sense, if FL on tabular data leaks any private
information, it does not fulfill its original design purpose,
severely undermining trust in institutions employing such
solutions. In our work we show that tabular FL, in fact,
leaks large amounts of private information.
Mixed Type Tabular Data  Mixed type tabular data is commonly used in healthcare, finance, and social sciences, which entail high-stakes privacy-critical applications (Borisov et al., 2021). Here, data is collected in a table with mostly human-interpretable columns, e.g., the age and race of an individual. Formally, let $x \in \mathcal{X}$ be one row of data and let $\mathcal{X}$ contain $K$ discrete columns and $L$ continuous columns, i.e., $\mathcal{X} = D_1 \times \cdots \times D_K \times U_1 \times \cdots \times U_L$, where $D_i \subset \mathbb{N}$ and $U_i \subset \mathbb{R}$. For processing with neural networks, discrete features are usually one-hot encoded, while continuous features are preserved. The one-hot encoding of the $i$-th discrete feature $x_i^D$ is a binary vector $c_i^D(x)$ of length $|D_i|$ that has a single non-zero entry at the position marking the encoded category. We retrieve the represented category by taking the argmax of $c_i^D(x)$ (projection to obtain $x$). Using the described encoding, one row of data $x \in \mathcal{X}$ is encoded as $c(x) = [c_1^D(x), \ldots, c_K^D(x), x_1^C, \ldots, x_L^C]$, containing $d := L + \sum_{i=1}^{K} |D_i|$ entries.
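A minimal NumPy sketch of this encoding and its argmax projection is given below; the function names are ours and serve only to illustrate the notation introduced above.

```python
import numpy as np

def encode_row(x_disc, x_cont, domain_sizes):
    """Build c(x): one-hot encode the K discrete features (given as category
    indices) and append the L continuous features unchanged."""
    parts = []
    for value, size in zip(x_disc, domain_sizes):
        one_hot = np.zeros(size)
        one_hot[value] = 1.0
        parts.append(one_hot)
    parts.append(np.asarray(x_cont, dtype=float))
    return np.concatenate(parts)  # length d = sum_i |D_i| + L

def project_row(c, domain_sizes, num_cont):
    """Recover each discrete category via a per-feature argmax; the continuous
    entries are passed through unchanged."""
    x_disc, offset = [], 0
    for size in domain_sizes:
        x_disc.append(int(np.argmax(c[offset:offset + size])))
        offset += size
    return x_disc, c[offset:offset + num_cont]

# Example: two discrete features with |D_1| = 3, |D_2| = 2 and one continuous feature.
c = encode_row([2, 0], [37.5], domain_sizes=[3, 2])
print(project_row(c, domain_sizes=[3, 2], num_cont=1))  # ([2, 0], array([37.5]))
```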