Federated Boosted Decision Trees with Dierential Privacy
Samuel Maddock
University of Warwick
Graham Cormode
Meta AI
Tianhao Wang
University of Virginia
Carsten Maple
University of Warwick
Somesh Jha
University of Wisconsin-Madison

Full version of a paper to appear at ACM CCS'22. Author correspondence to s.maddock@warwick.ac.uk. Tianhao Wang and Somesh Jha: work was done while part-time at Meta AI.
ABSTRACT
There is great demand for scalable, secure, and efficient privacy-preserving machine learning models that can be trained over distributed data. While deep learning models typically achieve the best results in a centralized non-secure setting, different models can excel when privacy and communication constraints are imposed. Instead, tree-based approaches such as XGBoost have attracted much attention for their high performance and ease of use; in particular, they often achieve state-of-the-art results on tabular data. Consequently, several recent works have focused on translating Gradient Boosted Decision Tree (GBDT) models like XGBoost into federated settings, via cryptographic mechanisms such as Homomorphic Encryption (HE) and Secure Multi-Party Computation (MPC). However, these do not always provide formal privacy guarantees, or consider the full range of hyperparameters and implementation settings. In this work, we implement the GBDT model under Differential Privacy (DP). We propose a general framework that captures and extends existing approaches for differentially private decision trees. Our framework of methods is tailored to the federated setting, and we show that with a careful choice of techniques it is possible to achieve very high utility while maintaining strong levels of privacy.
KEYWORDS
Gradient Boosting, Differential Privacy, Federated Learning
1 INTRODUCTION
It is well known that machine learning models can leak private information about individuals in the training set [14, 58]. Differential privacy (DP) [21] is a popular definition that has been developed to mitigate such privacy risks and has become the dominant notion of privacy in recent years. Much of the current research on private machine learning is focused on training deep learning models with differential privacy [1, 35, 41, 60]. DP is often combined with federated learning, where data resides on client devices, and only small information about model updates is collected from clients, in order to further minimize the privacy risk [36].
While deep learning models are powerful for a range of real-world tasks in a centralized setting, they are sometimes beaten by "simpler" models on tabular datasets. One such competitor is Gradient Boosted Decision Trees (GBDTs) [22, 31, 32, 59]. GBDT methods build an ensemble of weak decision trees that incrementally correct for past mistakes in training to improve predictions. Many GBDT frameworks such as XGBoost [16], LightGBM [38], and CatBoost [19] have seen widespread industry adoption [6, 11, 40, 44]. GBDT methods are an attractive alternative to deep learning due to their speed, scalability, ease of use, and impressive performance on tabular datasets.
Recent works have studied GBDT implementations such as XGBoost under secure training in the federated setting [17, 18, 24, 48]. These methods typically rely on cryptographic techniques such as Homomorphic Encryption (HE) or Secure Multi-Party Computation (MPC). While this allows secure joint training of a GBDT model without any participant directly releasing their data, the end model may not necessarily be private and will not guarantee formal differential privacy (DP) [23]. For instance, in the case of decision trees, split decisions in a tree can directly reveal sensitive information regarding the training set. Moreover, reliance on heavyweight cryptographic techniques such as HE or MPC often makes methods computationally intensive or requires a large number of communication rounds, making them impractical to scale beyond more than a few participants [17, 42].
In parallel, many works have studied decision tree models under the central model of DP [25, 52, 65]. Most studies focus on training random forest (RF) models, and there has been little research exploring the trade-offs between gradient boosting and DP; those works that do often use central DP mechanisms that cannot easily be extended to federated settings [33]. It therefore remains an open problem to implement GBDTs in the federated DP setting, and to show how to obtain utility comparable to their centralized non-private counterparts.
Our focus is on DP-GBDT methods that operate within the federated setting via lightweight MPC methods such as secure aggregation [7, 10]. This setting has recently risen to prominence, as it promises an attractive trade-off between the computational efficiency of central DP techniques and the security of cryptographic methods. Recent federated works that consider GBDTs have proposed methods under the local model of DP, but due to the use of local noise they incur a significant loss in utility [43, 61, 62].
In this paper, we bring together existing methods under a unified framework where we propose techniques to satisfy DP that are well suited to the federated setting. We find that by dissecting the GBDT algorithm into its constituent parts and carefully considering the options for each component of the algorithm, we can identify specific combinations that achieve the best balance of privacy and utility. We also emphasise variants that can train such private GBDT models in only a small number of communication rounds, which is of particular importance to the federated setting.
Our high-level finding is that it is possible to achieve high performance with GBDT models, even comparable to that of non-private methods. In order to do so, one must allocate privacy budget to the quantities that are most important for the learning process. For example, we show that spending such budget on computing split decisions of trees is not as important as spending it on the leaf
weights. Using our findings under the efficient privacy accounting of Rényi Differential Privacy (RDP) leads to performance that is far closer to the non-private setting than seen in previous works.
Our main contributions are as follows:
• A clear and concise framework for differentially private gradient boosting with decision trees. We deconstruct the GBDT algorithm into five main components, showing how to federate each component while satisfying Rényi Differential Privacy (RDP). We present a unifying approach, capturing recently proposed DP tree-based models as special cases.
• A new set of techniques for improving the utility of private federated GBDT models. For example, we propose a private method for discretising continuous features that makes as much use of the private training information as possible, incurring little additional privacy cost. Additionally, we explore batching weight updates, showing it is possible to maintain competitive model performance while reducing the number of communication rounds needed.
• An extensive set of experiments on a range of benchmark datasets exploring the trade-offs between various options in our framework. By evaluating the choices in each of the components of our framework, we find that a clear dominant approach is formed by adapting and simplifying the GBDT algorithm while combining it with our improved split candidate method. We show it is possible to achieve higher utility than state-of-the-art (SOTA) DP-RF and DP-GBDT methods on a range of datasets with reasonable levels of privacy.
• We provide open-source code at https://github.com/Samuel-Maddock/federated-boosted-dp-trees
Roadmap. In Section 2 we outline the technical preliminaries required to understand differentially private GBDTs, before covering related work in Section 3. In Sections 4 and 5 we describe our framework for DP-GBDTs, fitting existing methods within it and proposing combinations to study. In Section 6 we provide extensive experimental evaluations, comparing our methods to existing baselines within our framework, before concluding in Section 7.
2 PRELIMINARIES
2.1 Gradient Boosted Decision Trees (GBDT)
Tree-based ensemble methods form a collection of $T$ decision trees that predict $\hat{y}_i$ for each input $\boldsymbol{x}_i$:
$$\hat{y}_i := f(\boldsymbol{x}_i) = \sum_{t=1}^{T} f_t(\boldsymbol{x}_i)$$
For a specific tree $f_t$, let $L_t$ denote the number of leaf nodes. Each leaf node of a tree contains a weight, which will be the output of the tree for observations that are classified into that leaf. We denote $\boldsymbol{w}^{(t)} \in \mathbb{R}^{L_t}$ as the vector of leaf node weights for a tree $f_t$.
GBDT methods train trees sequentially, making use of past predictions to correct for mistakes. This is in contrast to random forest (RF) methods that train $T$ trees in parallel, averaging the weights of trees for the final prediction.
For a set of examples $D = \{(\boldsymbol{x}_i, y_i)\}_{i=1}^{n}$ with corresponding predictions $\{\hat{y}_i\}_{i=1}^{n}$, the GBDT objective function is defined as
$$\mathcal{L}(f) = \sum_{i=1}^{n} \ell(y_i, \hat{y}_i) + \sum_{t=1}^{T} \Omega(f_t) \quad (1)$$
where $\ell$ is a twice-differentiable loss function, typically the cross-entropy loss (binary classification) or squared-error loss (regression). The term $\Omega(f_t) = \gamma L_t + \frac{\lambda}{2}\|\boldsymbol{w}^{(t)}\|_2^2$ is a form of regularisation such that $\gamma \geq 0$ penalises the size of the tree and $\lambda \geq 0$ penalises the magnitude of weights. This regularisation term is present in the popular XGBoost algorithm but is often omitted in other GBDT variants; we adopt it for our experimental study.
Equation (1) evades direct optimization. Rather, GBDT models are trained sequentially based on previous models. At step $t$ we can define the model prediction $\hat{y}_i^{(t)} = \sum_{k=1}^{t} f_k(\boldsymbol{x}_i) = \hat{y}_i^{(t-1)} + f_t(\boldsymbol{x}_i)$. The objective for optimising $f_t$ becomes
$$\mathcal{L}^{(t)}(f_t) = \sum_{i=1}^{n} \ell\big(y_i, \hat{y}_i^{(t-1)} + f_t(\boldsymbol{x}_i)\big) + \Omega(f_t) \quad (2)$$
For step $t$ we are concerned with finding a tree $f_t$ that minimises (2). Since $f_t$ is not differentiable we can use a Taylor approximation. Taking the first-order approximation leads to the standard Gradient Boosting Machine (GBM) method. Taking a second-order approximation leads to Newton boosting as used by XGBoost [16].
When taking a first-order approximation we obtain
$$\mathcal{L}^{(t)}(f_t) \approx \sum_{i=1}^{n} \Big[\ell\big(y_i, \hat{y}_i^{(t-1)}\big) + g_i^{(t)} f_t(\boldsymbol{x}_i)\Big] + \Omega(f_t)$$
where $g_i^{(t)} = \frac{\partial}{\partial \hat{y}_i^{(t-1)}} \ell\big(y_i, \hat{y}_i^{(t-1)}\big)$ is the gradient of the loss function at the start of step $t$. By considering the index set of examples mapped to leaf node $l$, i.e., $I_l = \{i \mid \boldsymbol{x}_i \text{ belongs to leaf } l \text{ of } f_t\}$, one can show by expanding the above and differentiating with respect to $w_l^{(t)}$ that the optimal leaf weight is
$$w_l^{(t)} = -\frac{\sum_{i \in I_l} g_i^{(t)}}{|I_l| + \lambda} \quad (3)$$
We denote this as a gradient weight update. Taking a second-order approximation of (2) instead gives
$$\mathcal{L}^{(t)}(f_t) \approx \sum_{i=1}^{n} \Big[\ell\big(y_i, \hat{y}_i^{(t-1)}\big) + g_i^{(t)} f_t(\boldsymbol{x}_i) + h_i^{(t)} \frac{f_t^2(\boldsymbol{x}_i)}{2}\Big] + \Omega(f_t)$$
where $h_i^{(t)} = \frac{\partial^2}{\partial (\hat{y}_i^{(t-1)})^2} \ell\big(y_i, \hat{y}_i^{(t-1)}\big)$ is the Hessian of the loss at the start of step $t$. As before, one can show that the optimal weights this time are
$$w_l^{(t)} = -\frac{\sum_{i \in I_l} g_i^{(t)}}{\sum_{i \in I_l} h_i^{(t)} + \lambda} \quad (4)$$
which we denote as a Newton weight update. Substituting optimal weights from either the first or second-order approximation into Equation (2) leads to quantities that can be used to measure a split score. In other words, when considering a split option that partitions examples into disjoint index sets $I = I_1 \cup I_2$, the split score is a measure of how useful a split is for classification. The split score for Newton updates can be computed as
$$SS(I_1, I_2) = \frac{1}{2}\left[\frac{\big(\sum_{i \in I_1} g_i\big)^2}{\sum_{i \in I_1} h_i + \lambda} + \frac{\big(\sum_{i \in I_2} g_i\big)^2}{\sum_{i \in I_2} h_i + \lambda} - \frac{\big(\sum_{i \in I} g_i\big)^2}{\sum_{i \in I} h_i + \lambda}\right] - \gamma \quad (5)$$
In practice, to form such split options, GBDT methods often discretize continuous features (e.g., via quantiles) into $Q$ split candidates. In order to handle categorical features, GBDT methods like XGBoost typically transform them, e.g., via a one-hot encoding. In either case, this leads to splits of the form $I_{\leq} = \{i : x_{ij} \leq s^j_q\}$ for a split candidate $s^j_q$. Equation (5) can then be used to greedily choose the feature split-candidate pair with the largest score when growing the tree structure during training.
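To make these quantities concrete, the following is a minimal, non-private sketch of the Newton weight update (Equation (4)) and split score (Equation (5)); the function names and NumPy-array interface are illustrative assumptions, not the paper's implementation.

```python
import numpy as np

def newton_leaf_weight(g, h, lam=1.0):
    # Equation (4): w_l = -(sum of gradients) / (sum of Hessians + lambda)
    return -np.sum(g) / (np.sum(h) + lam)

def newton_split_score(g, h, left_mask, lam=1.0, gamma=0.0):
    # Equation (5): score of splitting a node's examples (g, h) into
    # I_1 (left_mask) and I_2 (~left_mask), minus the size penalty gamma.
    def term(gs, hs):
        return np.sum(gs) ** 2 / (np.sum(hs) + lam)
    return 0.5 * (term(g[left_mask], h[left_mask])
                  + term(g[~left_mask], h[~left_mask])
                  - term(g, h)) - gamma
```

Note that both functions depend on the data only through sums of gradients and Hessians, which is what makes the secure-aggregation approach of Section 4 natural.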
2.2 Differential Privacy
Differential Privacy (DP) is a formal definition of privacy that guarantees the output of a data analysis does not depend significantly on a single individual's data item. Such a definition can be based on the notion of privacy loss.
Definition 2.1 (Privacy Loss Random Variable). Given a randomised mechanism $\mathcal{M} : \mathcal{X} \to \mathcal{Y}$, we define the privacy loss random variable $L_{\mathcal{M},x,x'}$ over "neighbouring" datasets $x, x' \in \mathcal{X}$ as
$$L_{\mathcal{M},x,x'} = \log \frac{p_{\mathcal{M}(x)}(X)}{p_{\mathcal{M}(x')}(X)}$$
where $X \sim \mathcal{M}(x)$ and $p_{\mathcal{M}(\cdot)}$ is the density of the mechanism applied to the respective dataset.
We take neighbouring datasets $x, x' \in \mathcal{X}$ to mean that $x$ and $x'$ differ on a single individual. The privacy loss allows us to succinctly describe differential privacy.
Denition 2.2 (Dierential Privacy in terms of privacy loss [
5
]). We
say that a randomised mechanism
M
:
X → Y
satises
(𝜖, 𝛿)
-DP
if for any adjacent datasets 𝑥, 𝑥 ∈ X
P(𝐿M,𝑥,𝑥𝜖) 𝛿
The privacy parameter
𝜖
is referred to as the privacy budget. When
𝛿=
0, we say that
M
satises
𝜖
-DP. In this work we only consider
privacy guarantees where 𝛿>0i.e., the case of approximate-DP.
While $(\epsilon, \delta)$-DP is a useful definition of privacy, it does not allow us to tightly quantify the privacy loss from the composition of multiple mechanisms [37]. This is particularly important in machine learning, where we wish to use mechanisms many times over the same dataset to train models. Instead, the notion of Rényi Differential Privacy (RDP) provides a succinct way to track the privacy loss from a composition of multiple mechanisms by representing privacy guarantees through moments of the privacy loss.
Definition 2.3 (Rényi Differential Privacy [49]). A mechanism $\mathcal{M} : \mathcal{X} \to \mathcal{Y}$ is said to satisfy $(\alpha, \tau)$-RDP if the following holds for any two adjacent datasets $x, x' \in \mathcal{X}$:
$$\mathbb{E}\Big[e^{(\alpha - 1) L_{\mathcal{M},x,x'}}\Big] \leq \exp\big((\alpha - 1)\tau\big)$$
One of the simplest and most widely-used mechanisms to guarantee $(\alpha, \tau)$-RDP is the Gaussian mechanism.
Fact 2.1 (Gaussian Mechanism [21, 49]). The Gaussian mechanism $\mathcal{M} : \mathcal{X} \to \mathbb{R}^m$ of the form
$$\mathcal{M}(x) = q(x) + N\big(0, \Delta_2(q)^2 \sigma^2 I_m\big)$$
satisfies $(\alpha, \tau)$-RDP with $\tau = \frac{\alpha}{2\sigma^2}$ and
$$\Delta_2(q) = \max_{x, x'} \|q(x) - q(x')\|_2$$
The quantity $\Delta_2(q)$ is the $L_2$-sensitivity of the query $q$. The above shows that in order to make a real-valued query $q$ differentially private we just need to add suitably calibrated Gaussian noise.
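For illustration, a minimal sketch of the Gaussian mechanism for a vector-valued query (the helper name and interface are our own; in the federated protocol of this paper the noise is added to securely aggregated sums rather than by a single data holder):

```python
import numpy as np

def gaussian_mechanism(query_value, l2_sensitivity, sigma, rng=None):
    # Releases q(x) + N(0, (Delta_2(q) * sigma)^2 I_m), which satisfies
    # (alpha, alpha / (2 sigma^2))-RDP for every alpha > 1 (Fact 2.1).
    rng = rng or np.random.default_rng()
    q = np.asarray(query_value, dtype=float)
    return q + rng.normal(0.0, l2_sensitivity * sigma, size=q.shape)
```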
An attractive property of this formulation of DP is that it is easy to reason about the privacy of an analysis where mechanisms are used multiple times on the same dataset.
Fact 2.2 (Parallel Composition). Given a dataset $X$, a disjoint partition $X = X_1 \cup X_2 \cup \cdots \cup X_k$, and a mechanism $\mathcal{M}$ that satisfies $(\alpha, \tau)$-RDP, the mechanism $\mathcal{M}(X) := (\mathcal{M}(X_1), \ldots, \mathcal{M}(X_k))$ satisfies $(\alpha, \tau)$-RDP.
Fact 2.3 (Sequential Composition). If $\mathcal{M}_1$ and $\mathcal{M}_2$ are $(\alpha, \tau_1)$-RDP and $(\alpha, \tau_2)$-RDP respectively, then the mechanism that releases $(\mathcal{M}_1(\cdot), \mathcal{M}_2(\cdot))$ is $(\alpha, \tau_1 + \tau_2)$-RDP.
Fact 2.4 (Post-Processing). If $\mathcal{M}$ is an $(\alpha, \tau)$-RDP mechanism and $f$ is any function that does not depend on any private data, then $f(\mathcal{M}(\cdot))$ is also $(\alpha, \tau)$-RDP.
Sequential composition tells us that using a mechanism multiple times on the same data leads to an increase in privacy loss. In the case of composing $k$ Gaussian mechanisms, we must increase the noise added through $\sigma$ by the order of $\sqrt{k}$ under RDP.
In practice, we care about obtaining the more meaningful notion of $(\epsilon, \delta)$-DP. When working with RDP we can rely on conversion lemmas such as those presented in [13] to convert between $(\alpha, \tau)$-RDP and $(\epsilon, \delta)$-DP. In our implementations, we use the analytical moments accountant developed by Wang et al. to provide tight numerical accounting of the privacy loss under RDP [63].
It is common to fix the privacy parameters $(\epsilon, \delta)$ before the analysis and then minimise $\sigma$ over a range of $\alpha$ values to obtain the smallest such noise needed to guarantee the chosen level of privacy. We use the autodp package (https://github.com/yuxiangw/autodp) to verify that our accounting provides the correct $(\epsilon, \delta)$-DP guarantee. An additional benefit of working with RDP is that our framework easily extends to other mechanisms that satisfy RDP, such as the Skellam mechanism, which may be more suited to distributed settings [2].
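As a simple stand-in for that accountant, the standard conversion $\epsilon = \tau + \log(1/\delta)/(\alpha - 1)$, minimised over $\alpha$, can be used to calibrate $\sigma$; a minimal sketch (our own helper names, not the paper's code) is:

```python
import numpy as np

def eps_for_sigma(sigma, n_queries, delta, alphas=np.arange(1.5, 128.0, 0.5)):
    # n_queries composed Gaussian mechanisms give tau = n_queries * alpha / (2 sigma^2);
    # convert RDP to (eps, delta)-DP and take the best Renyi order alpha.
    eps = n_queries * alphas / (2.0 * sigma ** 2) + np.log(1.0 / delta) / (alphas - 1.0)
    return float(eps.min())

def calibrate_sigma(target_eps, n_queries, delta, lo=0.05, hi=1e4):
    # Binary search for the smallest sigma whose epsilon meets the target.
    for _ in range(60):
        mid = 0.5 * (lo + hi)
        if eps_for_sigma(mid, n_queries, delta) > target_eps:
            lo = mid  # sigma too small: epsilon still above target
        else:
            hi = mid
    return hi
```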
2.3 The Federated Model of Computation
Federated Learning (FL) has become a popular paradigm for large-scale distributed training of machine learning models [36]. In this work, we consider the horizontal setting, where a set of participants each hold a local dataset over the same space of $m$ features. We assume that there are $n$ data items in total, and we consider the problem of training a differentially private GBDT model over the distributed dataset. A powerful tool is secure aggregation, which allows the computation of a sum without revealing any intermediate values [7, 10]. Specifically, when each participant $P_k$ has a number $x_k \in \mathbb{Z}$, secure aggregation computes the result $\sum_k x_k$ securely without any participant directly sharing their $x_k$.
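A minimal sketch of the masking idea behind secure aggregation (the full protocols in [7, 10] additionally handle client dropouts and key agreement; the modulus and helper names here are illustrative assumptions):

```python
import secrets

MOD = 2 ** 32  # inputs are encoded as fixed-point values in Z_{2^32}

def masked_uploads(values):
    # Each pair (k, l) shares a random mask that k adds and l subtracts;
    # every individual upload looks uniformly random, yet masks cancel in the sum.
    uploads = [v % MOD for v in values]
    for k in range(len(values)):
        for l in range(k + 1, len(values)):
            mask = secrets.randbelow(MOD)
            uploads[k] = (uploads[k] + mask) % MOD
            uploads[l] = (uploads[l] - mask) % MOD
    return uploads

def secure_sum(values):
    return sum(masked_uploads(values)) % MOD  # equals sum(values) mod 2^32

assert secure_sum([3, 5, 7]) == 15
```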
Our focus is on a framework that combines secure aggregation with DP to securely and privately train GBDT models. For the rest of this paper, we present algorithms as if the data were held centrally, with the understanding that all the operations we use can be performed in the federated model, with rounding to fixed precision. (The rounding introduces a small amount of imprecision in representing values, but this is overwhelmed by the noise added for privacy.) This means that we avoid techniques designed for central evaluation such as the exponential mechanism [33, 45, 65].
Threat Model: In this work, in common with many other works in the federated setting, we assume an honest-but-curious model, where the clients do not trust others with their raw data.
Table 1: Summary of our Private Federated GBDT Framework

Component                 | Methods                                                | Privacy Cost (in terms of $\kappa_s, \kappa_w, \kappa_c$)
(C1) Split Method         | Histogram-based (Hist) (§4.3.1)                        | $\kappa_s = Tmd$
                          | Partially Random (PR) (§4.3.2)                         | $\kappa_s = Tmd$; does not require construction of a histogram
                          | Totally Random (TR) (§4.3.2)                           | $\kappa_s = 0$
(C2) Weight Update        | Averaging (§4.4.1), Gradient (§4.4.2), Newton (§4.4.3) | $\kappa_w = 0$ if using Hist or PR; otherwise $\kappa_w = T$
(C3) Split Candidate      | Uniform, Log (§4.5.1)                                  | Data-independent, $\kappa_c = 0$
                          | Quantiles (non-private) (§4.5.1)                       | N/A
                          | Iterative Hessian (IH) (§4.5.2)                        | $\kappa_c = 0$ if using Hist; with TR and $s$ rounds of IH, $\kappa_c = sm$
(A1) Feature Interactions | Cyclical $k$-way (§5.1)                                | If using Hist or PR, $\kappa_s = Tkd$ ($\kappa_s = T$ if $k = 1$)
                          | Random $k$-way (§5.1)                                  | If using TR with IH, $\kappa_c = sk$
(A2) Batched Updates      | $B = 1$ (Boosting) (§5.2)                              | Post-processing, no effect on privacy
                          | $B = T$ (RF-type predictions) (§5.2)                   |
                          | $B = p \cdot T$ for some $p \in (0, 1)$ (§5.2)         |
We study the aggregating server's knowledge based on the information gathered from clients. While there is potential for clients to attempt to disrupt the protocol, we leave the detailed study of more malicious threat models and model poisoning to future work. In order to combine secure aggregation with DP, we act as if there were a trusted central server that securely aggregates quantities and adds the required DP noise before sending the updated (private) model back to participants (as assumed in [47]). In practice, we can eliminate the need for a central server by well-established implementations of secure computation that rely on techniques from secure multi-party computation, either among a small number of honest-but-curious servers, or via clients working with small groups of neighbors and a single untrusted server [7]. Sufficient noise for DP guarantees can be added by honest-but-curious servers, or introduced by each client adding a small amount of discrete noise, such that the total noise across clients adds up to the desired volume [8, 9, 15, 53-57].
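A minimal sketch of this client-side noise-splitting idea (continuous Gaussian noise for clarity; the cited works use discrete noise so that each contribution fits the finite-group encoding of secure aggregation, and the helper name is our own):

```python
import numpy as np

def noisy_contribution(x_k, total_sigma, n_clients, rng):
    # Each of the n clients adds N(0, total_sigma^2 / n); the independent
    # shares sum to N(0, total_sigma^2), the noise a trusted server would add.
    return x_k + rng.normal(0.0, total_sigma / np.sqrt(n_clients))
```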
3 RELATED WORK
Dierentially private decision trees have been well studied in the
central setting with a strong focus on random forest (RF) models [
25
,
27
,
65
]. However, the boosted approach (i.e., private GBDT models)
has been less well-explored. Recently, federated XGBoost models
have been presented, with most works focused on secure training
via cryptographic primitives such as Homomorphic Encryption
(HE) and Secure Multi-Party Computation (MPC) and with no DP
guarantees [17, 18, 24].
Some related works (e.g., [61]) study XGBoost in a federated setting with local DP (LDP) guarantees. The closest work to ours in this regard is the FEVERLESS method [62], which translates the XGBoost algorithm into the vertical federated setting using secure aggregation and the Gaussian mechanism. In particular, FEVERLESS securely aggregates gradient information into a private histogram which is used to compute split scores and leaf weights (Equations (4) and (5)). A certain subset of the participants are chosen as "noise leaders" who add Gaussian noise to their gradient information before aggregation, achieving an overall DP guarantee after securely aggregating across all participants. As we will see, the main disadvantage of directly translating the XGBoost algorithm in this way is the high privacy cost of repeatedly computing split scores. This results in having to add more noise into the split score and leaf weight calculations, and a lower-utility model.
To reduce this privacy cost, one can consider making split decisions independently of the data. These so-called totally random (TR) trees have been studied in both the non-private and private settings with random forests [26, 29]. In the private setting, proposed methods often use central DP mechanisms that are hard to federate [3, 27]. For example, Fletcher and Islam [27] propose a DP-RF method that utilises the exponential mechanism to output the majority label in leaf nodes under the notion of smooth sensitivity, which is unsuited to the federated setting.
In this work, we also consider TR trees as an option under our framework, but for a federated and private GBDT model. To the best of our knowledge, the only other work that considers private boosting with random trees is that of Nori et al. [51]. They consider a central DP setting with a focus on training private explainable models via Explainable Boosting Machines (EBMs). We compare the technical differences in Section 5.1 and empirically in Section 6.6.
4 PRIVATE GBDT FRAMEWORK
In this section, we perform a comprehensive investigation of the main components needed to train GBDT models in the federated setting. We propose a framework of methods for training DP-GBDT models by identifying three main components that require DP noise and two additional components that interact with these. The full framework is summarized in Table 1.
We explain the various options in each component and how they affect privacy guarantees, and conclude by instantiating related work within the framework before empirically evaluating methods in Section 6. A particular strategy we highlight is replacing data-dependent choices with random or uniform choices. Although counter-intuitive, it often holds that the privacy "cost" of fitting the choices to the data is not made up for by the utility gain, and picking among a set of random options is sufficient for good results. This is evaluated in our experimental study.
Algorithm 1 General GBDT
Input: Number of trees $T$, maximum depth $d$, number of split candidates $Q$, privacy parameters $\epsilon, \delta$
1: For each feature $j = 1, \ldots, m$ generate $Q$ split candidates $S_j := \{s^j_1, \ldots, s^j_Q\}$ (C3)
2: Initialise the forest $\mathcal{T} \leftarrow \emptyset$
3: for $t = 1, \ldots, T$ do
4:   For each $(\boldsymbol{x}_i, y_i) \in D$ compute the required gradient information $(g_i, h_i)$ based on $\hat{y}^{(t-1)}_i$ (C2)
5:   Choose a subset of features $F^{(t)} \subseteq \{1, \ldots, m\}$ with $|F^{(t)}| = k$ for the current tree $f_t$ (A1)
6:   while depth of the current node (in $f_t$) is $\leq d$ do
7:     Choose a feature split-candidate pair $(j, s^j_q)$ from $F^{(t)}$ (C1)
8:     Split the current node with observations $I$ into two child nodes with index sets $I_{\leq} = \{i : x_{ij} \leq s^j_q\}$ and $I_{>} = I \setminus I_{\leq}$
9:     Repeat (6)-(9), recursing separately on the child nodes
10:  For each leaf $l$ calculate a weight $w^{(t)}_l$ from the examples in the leaf according to the chosen update method (C2)
11:  Update predictions $\hat{y}^{(t)}_i$ or batch updates (A2)
12:  Add the $t$-th tree $f_t$ to the ensemble, $\mathcal{T} \leftarrow \mathcal{T} \cup \{f_t\}$
13: return the trained forest $\mathcal{T}$
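For concreteness, a compact Python skeleton of Algorithm 1 (the hook functions standing in for components (C1)-(C3), the node representation, and all names are our own illustration; DP noise and secure aggregation are hidden behind those hooks):

```python
import numpy as np

def predict_tree(tree, x):
    if tree[0] == "leaf":
        return tree[1]
    _, j, s, left, right = tree
    return predict_tree(left, x) if x[j] <= s else predict_tree(right, x)

def train_gbdt(X, y, T, d, candidates, choose_split, leaf_weight, grad_info):
    # candidates: per-feature split candidates (C3); choose_split: (C1);
    # leaf_weight: (C2); grad_info: per-example (g_i, h_i) for current scores.
    forest, raw_score = [], np.zeros(len(y))

    def grow(idx, depth, g, h):
        if depth == d or len(idx) == 0:
            return ("leaf", leaf_weight(g[idx], h[idx]))           # line 10, (C2)
        j, s = choose_split(X[idx], g[idx], h[idx], candidates)    # line 7, (C1)
        left, right = idx[X[idx, j] <= s], idx[X[idx, j] > s]      # line 8
        return ("node", j, s, grow(left, depth + 1, g, h),
                grow(right, depth + 1, g, h))

    for _ in range(T):
        g, h = grad_info(y, raw_score)                             # line 4, (C2)
        tree = grow(np.arange(len(y)), 0, g, h)
        forest.append(tree)
        raw_score += np.array([predict_tree(tree, x) for x in X])  # line 11
    return forest
```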
For simplicity, we assume that each participant holds a single data item $(\boldsymbol{x}_i, y_i)$, with $n$ participants (data items) in total. We additionally assume that we have (publicly) known bounds on each feature. All of these assumptions can be easily removed, potentially with some additional privacy cost. Table 7 in the appendix displays commonly used notation for convenience.
4.1 A General Recipe
In order to train the GBDT algorithm outlined in Section 2.1, we only need to specify a few core choices: how to pick split candidates (for discretizing continuous features), how to calculate the split scores at each internal node, and how to compute the leaf weights for prediction. One can note from Equations (4) and (5) that the leaf weights and split scores only depend on the sum of gradients and Hessians at an internal or leaf node of a decision tree. It is therefore natural to utilise secure aggregation as a tool to federate the GBDT algorithm. In Algorithm 1 we present the general GBDT algorithm assuming these quantities can be gathered. Looking closely at Algorithm 1, the only time we need to directly query participants' data is when we compute the three quantities just mentioned.
Based on this, we divide the general algorithm into three core components that require some form of DP noise: Split Methods (C1), Weight Updates (C2), and Split Candidates (C3). These are the core components required for training a GBDT model. We also consider two additional aspects to specify when training a GBDT model: Feature Interactions (A1) and Batched Updates (A2). These are aspects that interact with the core components but do not require any additional noise. To reason about the privacy guarantees of our GBDT framework, we introduce some variables to count the number of queries needed when training a GBDT model with $T$ trees. Let $\kappa_c$ denote the number of queries needed to calculate split candidates, $\kappa_s$ the queries needed to calculate inner-node splits, and $\kappa_w$ the queries needed to calculate leaf weights. Counting the number of queries needed for each component is enough to give a privacy guarantee for Algorithm 1.
Theorem 4.1. Suppose that the mechanisms for the framework components satisfy $(\alpha, \tau_c)$-, $(\alpha, \tau_s)$-, and $(\alpha, \tau_w)$-RDP respectively. Then the GBDT algorithm satisfies $(\alpha, \tau)$-RDP with $\tau = \kappa_c \tau_c + \kappa_s \tau_s + \kappa_w \tau_w$.
The above simply follows from the sequential composition properties of RDP. In our experimental study we utilise the Gaussian mechanism for each core component, hence $\tau = (\kappa_c + \kappa_s + \kappa_w) \frac{\alpha}{2\sigma^2}$ and so $\sigma = O_\epsilon\big(\sqrt{\kappa_c + \kappa_s + \kappa_w}\big)$. This shows that if we can minimise the number of queries that each main component requires, then we reduce the amount of noise we add to the learning process while still maintaining privacy. Various methods affect the privacy cost in different ways. The privacy implications for different choices in terms of $\kappa_c, \kappa_s, \kappa_w$ are shown in Table 1.
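A minimal sketch of this bookkeeping (the query counts follow Table 1 for the two extreme split methods; the helper is our own illustration and can be combined with the $\sigma$-calibration sketch from Section 2.2):

```python
def query_counts(T, m, d, split_method="hist", s_rounds=0):
    # Returns (kappa_c, kappa_s, kappa_w) per Table 1: Hist pays T*m*d split
    # queries (with kappa_w = 0, since leaf weights reuse the histogram),
    # while TR pays no split queries but one weight query per tree, plus
    # s*m candidate queries if Iterative Hessian is used.
    if split_method == "hist":
        return 0, T * m * d, 0
    if split_method == "tr":
        return s_rounds * m, 0, T
    raise ValueError(f"unknown split method: {split_method}")

kc, ks, kw = query_counts(T=50, m=10, d=4, split_method="tr", s_rounds=2)
n_queries = kc + ks + kw  # feeds into calibrate_sigma(target_eps, n_queries, delta)
```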
4.2 Federating GBDTs
At the start of building the $t$-th tree, each participant calculates gradient information $(g^{(t)}_i, h^{(t)}_i)$ for their examples. Throughout the training of a single tree, to calculate the desired components we can rely on querying data in the form
$$q(I) = \Big(\sum_{i \in I} g^{(t)}_i, \; \sum_{i \in I} h^{(t)}_i\Big)$$
over some set of observations $I$, e.g., all observations in a specific tree node. To do this securely, we can apply secure aggregation to aggregate gradient information at the various stages that require it in Algorithm 1 (C1-C3).
In order to apply the Gaussian mechanism, we must bound the sensitivity of such a query function. In this case, we need bounds on the gradient quantities $g^{(t)}_i, h^{(t)}_i$. Our focus in this paper is on binary classification problems. In binary classification our loss function is of the form $\ell(y_i, \hat{y}_i) = -[y_i \log(\hat{y}_i) + (1 - y_i) \log(1 - \hat{y}_i)]$ (i.e., binary cross-entropy) and has gradients $g_i \in [-1, 1]$ and Hessians $h_i \in [0, \frac{1}{4}]$. Hence the sensitivity of aggregating gradient information is $\Delta_2(q) = \sqrt{1 + \frac{1}{16}} = \frac{\sqrt{17}}{4}$. If the chosen loss function has unbounded gradient information (e.g., regression problems) we can employ gradient clipping (similarly to DP-SGD) to obtain a bounded sensitivity [1].
The computational and communication costs of these steps are
low. Decision tree-based methods are often preferred for their ease
of construction, and this translates to the federated setting: each
client computes its local updates (e.g., gradients and Hessians) and
shares these through secure aggregation. The communication costs
are linear in the size of the updates computed, which are fairly low
dimensional: we quantify this in the subsequent sections.
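For illustration, a minimal sketch of the per-example computation (our own function names; with $p_i$ the sigmoid of the raw score, binary cross-entropy gives the familiar $g_i = p_i - y_i$ and $h_i = p_i(1 - p_i)$):

```python
import numpy as np

def bce_gradient_info(y, raw_score):
    # Gradients/Hessians of binary cross-entropy w.r.t. the raw (pre-sigmoid)
    # score: g = p - y lies in [-1, 1] and h = p(1 - p) lies in [0, 1/4].
    p = 1.0 / (1.0 + np.exp(-raw_score))
    return p - y, p * (1.0 - p)

def clip_l2(v, bound):
    # Gradient clipping (as in DP-SGD) for losses with unbounded gradients.
    norm = np.linalg.norm(v)
    return v if norm <= bound else v * (bound / norm)
```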
4.3 Component 1: Split Methods
4.3.1 Greedy Approach: Histogram-Based. As described in Section 2.1, the standard GBDT algorithm will calculate $Q$ split scores for each feature at every internal node.