OpBoost: A Vertical Federated Tree Boosting Framework Based
on Order-Preserving Desensitization
Xiaochen Li∗
Zhejiang University
Yuke Hu∗
Zhejiang University
Weiran Liu
Alibaba Group
Hanwen Feng
Alibaba Group
Li Peng
Alibaba Group
Yuan Hong
University of Connecticut
Kui Ren
Zhejiang University
Zhan Qin†
Zhejiang University
ABSTRACT
Vertical Federated Learning (FL) is a new paradigm that enables users with non-overlapping attributes of the same data samples to jointly train a model without directly sharing the raw data. Nevertheless, recent works show that it is still not sufficient to prevent privacy leakage from the training process or the trained model. This paper focuses on studying privacy-preserving tree boosting algorithms under vertical FL. The existing solutions based on cryptography involve heavy computation and communication overhead and are vulnerable to inference attacks. Although the solution based on Local Differential Privacy (LDP) addresses the above problems, it leads to low accuracy of the trained model.

This paper explores improving the accuracy of the widely deployed tree boosting algorithms satisfying differential privacy under vertical FL. Specifically, we introduce a framework called OpBoost. Three order-preserving desensitization algorithms satisfying a variant of LDP called distance-based LDP (dLDP) are designed to desensitize the training data. In particular, we optimize the dLDP definition and study efficient sampling distributions to further improve the accuracy and efficiency of the proposed algorithms. The proposed algorithms provide a trade-off between the privacy of pairs with large distance and the utility of desensitized values. Comprehensive evaluations show that OpBoost has better performance on prediction accuracy of trained models compared with existing LDP approaches under reasonable settings. Our code is open source.¹
1 INTRODUCTION
Federated Learning (FL) [47] is an emerging paradigm that enables multiple parties to jointly train a machine learning model without revealing their private data to each other. According to the way of data partitioning, FL can be classified into two categories: Horizontal FL and Vertical FL [65]. Horizontal FL considers the scenarios where different data samples with the same features are distributed among different parties. Vertical FL works when different parties hold the same data samples with disjoint features. Vertically distributed datasets are very common in real-world scenarios. A typical example of vertical FL is shown in Figure 1: a credit institution collaborates with an E-commerce company and a bank to train a model to predict the labels, i.e., users' credit ratings, based on features, i.e., shopping records and revenue.

[Figure 1: An example of Vertical FL. A bank and an E-commerce company collaborate with a credit institution to train a credit rating prediction model.]

∗Both authors contributed equally to this work.
†Corresponding author.
1 https://github.com/alibaba-edu/mpc4j/tree/main/mpc4j-sml-opboost
Email: {xiaochenli, yukehu, kuiren, qinzhan}@zju.edu.cn, {weiran.lwr, fenghanwen.fhw, jerry.pl}@alibaba-inc.com, yuan.hong@uconn.edu
Tree boosting algorithms (e.g., GBDT [33]) are popular supervised ML algorithms that enjoy high efficiency and interpretability. They are widely used in prediction tasks based on heterogeneous features (e.g., revenue, age) in scenarios such as credit scoring and price forecasting [16, 20, 24, 41]. Recent work [36] shows that when the training data mainly includes heterogeneous features, GBDT outperforms state-of-the-art deep learning solutions. However, most existing tree boosting algorithms are proposed in the centralized setting, which trains the model with direct access to the whole training data. With the increasing public privacy concerns and the promulgation of privacy regulations (e.g., GDPR), this centralized setting limits the widespread deployment of tree boosting algorithms. Therefore, it becomes an essential problem to design practical privacy-preserving vertical federated tree boosting algorithms.

To address this problem, there are some solutions based on two different technologies: cryptography and Local Differential Privacy (LDP). The framework SecureBoost [21] and its subsequent work VF2Boost [34] are based on additive homomorphic encryption. Although well-designed engineering optimization is performed, the large number of homomorphic operations still inevitably causes participants to suffer a prohibitive computational overhead. Besides, each party shares the true order of its feature values for training the model; to our knowledge, several attacks have been proposed that use auxiliary information to infer the distribution of the original values from the order of values [13, 14, 27]. Abspoel et al. [9] present a simple vertical federated boosting framework based on Multi-Party Computation (MPC) protocols. Nevertheless, MPC protocols cause a heavy communication overhead. Although the order of feature values is not visible to any party in the MPC protocol, the trained model is not a privacy-preserving model, which is vulnerable to inference attacks [42, 51, 54]. Tian et al. [55] propose FederBoost, which provides LDP privacy protection for each party's feature values and can resist all the aforementioned attacks. FederBoost is shown to
be much more ecient than the MPC-based and Encryption-based
solutions, which is favorable in real-world applications. However,
the introduction of randomness leads to a serious loss of the relative
order information, and results in low accuracy of the trained model.
Observation. The process of building the decision tree in tree boosting algorithms is to repeatedly find the split points of the features, and this process depends only on the order of feature values rather than the exact values. In existing solutions, each party is essentially sending the order of its feature values to the party holding the label for training the model. It is necessary to provide privacy protection for the order of feature values, since training the model with the true order of the feature values can leak the original private values. However, the mechanisms satisfying LDP are usually designed to provide privacy protection for values without order (e.g., enumeration values). They perturb the private values to achieve the same degree of indistinguishability for any pair of values in the data domain. Meanwhile, this causes desensitized values to lose too much order information, which seriously reduces the accuracy of the trained model. In fact, people require different degrees of indistinguishability for pairs of values with different distances. For example, an employee may not mind revealing that his income is less than his boss's, but he does mind others knowing that it is less than his colleagues'. Therefore, it is more suitable to provide different degrees of indistinguishability for pairs with different distances. Meanwhile, the relative order of value pairs, especially those that are far apart, can be preserved with high probability.
Another observation is that the existing distance-based LDP (dLDP) definition has limitations. There is only one privacy parameter $\epsilon$ in the existing dLDP definition, which allocates privacy budgets based on the $l_1$ distance between two values. Given the total privacy budget $\epsilon$, one might want one's private value to be as indistinguishable as possible from its nearby values, while not minding a weaker indistinguishability from values farther away. However, the existing dLDP definition cannot achieve this privacy requirement. Specifically, increasing $\epsilon$ can increase the probability of the desensitized value falling near the true value, but at the same time its distribution near the true value becomes more concentrated, and vice versa. If we can increase the probability of the desensitized value falling in a specific area around the true value, while flattening the probability distribution within this area, an optimized output probability distribution of the desensitized value can be obtained.
Contribution. Our contributions are summarized below.
Proposal of OpBoost: We propose a novel framework called OpBoost for privacy-preserving vertical federated tree boosting. Within the framework, we design three order-preserving desensitization algorithms for desensitizing the training data. Different from the existing LDP-based solution, the desensitized training data satisfy a variant of LDP called dLDP. It can preserve more relative order information of desensitized values than LDP while still providing privacy protection. When strong indistinguishability is required for close values, e.g., $\epsilon = 0.08$ for value pairs with distance $t = 1$, OpBoost can still achieve accuracy close to that without protection for both classification and regression tasks. For example, for a classification task, OpBoost achieves 60% accuracy where the unprotected model achieves 87%, while the LDP-based solution is close to 10%. Meanwhile, OpBoost also retains the advantages of LDP-based solutions over cryptography-based solutions. The total communication overhead of each party is about $O(rn \log n)$ bits, whereas $O(rnk \log^2 n)$ is required in the MPC-based solution ($n$, $k$, $r$ are the number of samples, values' bits, and features, respectively). Moreover, we replace the exponential mechanism with the (bounded) Laplace mechanism to reduce the computational complexity of desensitizing a sensitive value to $O(1)$.
Optimizing the existing dLDP definition: We also optimize the existing dLDP definition in order to break through its limitations. We divide the data domain into several partitions of length $\theta$. Then we introduce two privacy parameters, $\epsilon_{prt}$ and $\epsilon_{ner}$, to adjust the probability distribution of the desensitized value falling in different partitions and the probability distribution within one partition, respectively. We prove that the existing definition is just a special case in which $\epsilon_{prt}$ and $\epsilon_{ner}$ satisfy a fixed proportional relationship. By adjusting the ratio of $\epsilon_{prt}$ and $\epsilon_{ner}$, we can always get higher accuracy than the existing dLDP under the same privacy guarantee (a toy sketch of this two-level sampling follows after this list of contributions).
Introducing new order-preserving metrics: In addition to quantifying the privacy of the order-preserving desensitization algorithms with dLDP, we also introduce new theoretical and experimental metrics to quantify the order information preserved by desensitized values. We prove that the proposed order-preserving desensitization algorithms are probabilistic order-preserving in theory: the probability of any pair of desensitized values preserving the original relative order is at least $\gamma$. Besides, we introduce the weighted-Kendall coefficient, weighted by distance, to evaluate the order of desensitized feature values experimentally.
Comprehensive evaluation: We conduct comprehensive theoretical and experimental evaluations to analyze the performance of OpBoost, including all the designed order-preserving desensitization algorithms. We evaluate the order preservation of the desensitized values using all introduced metrics. We also conduct experiments on public datasets used for binary classification, multi-class classification, and regression tasks. Both GBDT and XGBoost are implemented in OpBoost. The experimental results show that OpBoost achieves prediction accuracy close to, and sometimes even higher than, plain models (e.g., a 1.0003× improvement over the plain XGBoost model), which is superior to the existing LDP approaches.
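To make the two-parameter idea concrete, the following toy sketch (ours, not the paper's algorithm; the exponential-style weights, the value domain, and the divisibility of the domain size by $\theta$ are all assumptions) samples a desensitized value in two stages: first a partition, penalized by $\epsilon_{prt}$ per unit of partition distance, then a value inside it, penalized by $\epsilon_{ner}$ per unit of $l_1$ distance.

```python
import numpy as np

def two_level_perturb(x, theta, eps_prt, eps_ner, domain_size, rng=None):
    """Two-stage sampling in the spirit of the optimized dLDP definition.

    eps_prt shapes how likely the output lands in each length-theta partition;
    eps_ner shapes the distribution inside the chosen partition.
    Assumes domain {0, ..., domain_size - 1} with domain_size divisible by theta.
    """
    rng = rng or np.random.default_rng()
    parts = np.arange(domain_size // theta)        # partition indices
    own = x // theta                               # partition containing x
    # Stage 1: choose a partition, penalizing partition-level distance.
    w_prt = np.exp(-eps_prt * np.abs(parts - own))
    m = rng.choice(parts, p=w_prt / w_prt.sum())
    # Stage 2: choose a value within partition m, penalizing |value - x|.
    values = np.arange(m * theta, (m + 1) * theta)
    w_ner = np.exp(-eps_ner * np.abs(values - x))
    return int(rng.choice(values, p=w_ner / w_ner.sum()))
```

Raising $\epsilon_{prt}$ while lowering $\epsilon_{ner}$ concentrates the output in the true value's partition while flattening the distribution inside it, which is exactly the degree of freedom the single-parameter definition lacks.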
2 PRELIMINARIES
2.1 Dierential Privacy
Dierential privacy [
28
] is the de facto privacy denition of data
disclosure, preventing attempts from learning private information
about any individual in a data release. In this work, we are interested
in local dierential privacy [
40
], which allows each user to perturb
his sensitive data using a randomization mechanism
M
such that
the perturbed results from dierent data values will be “close".
Denition 2.1. (Local Dierential Privacy, LDP). An algorithm
M
satises
𝜖
-LDP, where
𝜖
0, if and only if for any input
𝑣, 𝑣 D
,
and any output 𝑦𝑅𝑎𝑛𝑔𝑒 (M), we have
Pr [M(𝑣)=𝑦]𝑒𝜖Pr M (𝑣)=𝑦.
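For illustration, $k$-ary randomized response is a classic mechanism satisfying this definition (a minimal sketch in Python; it is not part of OpBoost):

```python
import numpy as np

def randomized_response(v, domain, eps, rng=None):
    """k-ary randomized response: report v truthfully with probability
    e^eps / (e^eps + k - 1), otherwise report a uniformly random other value.
    Any two inputs yield output probabilities within a factor e^eps (eps-LDP).
    """
    rng = rng or np.random.default_rng()
    k = len(domain)
    if rng.random() < np.exp(eps) / (np.exp(eps) + k - 1):
        return v
    return rng.choice([u for u in domain if u != v])
```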
The parameter $\epsilon$ above is called the privacy budget; a smaller $\epsilon$ means stronger privacy protection. On the other hand, since all pairs of sensitive data shall satisfy the $\epsilon$-privacy guarantee for the same $\epsilon$, it may hide too much information about a dataset, such that utility might be insufficient for certain applications. Distance-based LDP [11, 18, 38], which measures the level of privacy guarantee between any pair of sensitive data based on their distance, is proposed to improve the utility.
We use the $l_1$ distance in this paper, and distance-based local differential privacy is defined as follows.
Denition 2.2. (Distance-based Local Dierential Privacy, dLDP).
An algorithm
M
satises
𝜖
-dLDP, if and only if for any input
𝑥, 𝑥
Dsuch that |𝑥𝑥| 𝑡, and any output 𝑦𝑅𝑎𝑛𝑔𝑒 (M), we have
Pr [M(𝑥)=𝑦]𝑒𝑡𝜖 ·Pr M(𝑥)=𝑦,
where
𝑡𝜖
controls the level of indistinguishability between out-
puts of
M(𝑥)
and
M(𝑥)
. The indistinguishability decreases as the
distance 𝑡between 𝑥and 𝑥increases.
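For intuition, a mechanism whose output probability decays as $e^{-\epsilon|y-x|}$ satisfies $\epsilon$-dLDP: the probability ratio for two inputs at distance at most $t$ is bounded by $e^{t\epsilon}$. A minimal sketch using discrete Laplace (two-sided geometric) noise follows; it is our illustration, not one of the three OpBoost algorithms.

```python
import numpy as np

def discrete_laplace_mechanism(x, eps, rng=None):
    """Perturb integer x so that Pr[output = x + z] is proportional to
    exp(-eps * |z|).  For any x, x' with |x - x'| <= t, the two output
    distributions differ by at most a factor exp(t * eps): eps-dLDP.
    """
    rng = rng or np.random.default_rng()
    beta = np.exp(-eps)
    # The difference of two i.i.d. geometric variables on {0, 1, ...} with
    # success probability 1 - beta is discrete-Laplace distributed.
    g1 = rng.geometric(1.0 - beta) - 1   # numpy's geometric starts at 1
    g2 = rng.geometric(1.0 - beta) - 1
    return x + g1 - g2
```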
2.2 Order-Preserving Desensitization Algorithm
In some application scenarios (e.g., recommender systems, range queries), the accuracy of the algorithm mainly depends on the order of the dataset. It would be desirable for the numerical order of sensitive data to be somehow preserved after desensitization. Many order-preserving desensitization algorithms have been proposed in cryptographic studies [10, 14, 43, 44], in which the order is rigorously preserved after desensitization. The formal definition of an order-preserving desensitization algorithm is as follows.
Denition 2.3. (Order-Preserving Desensitization Algorithm). De-
note
𝑋=𝑥1, 𝑥2, ..., 𝑥𝑛(𝑖.𝑥𝑖N)
as the sensitive sequence, and
𝑌=𝑦1, 𝑦2, ...,𝑦𝑛(𝑖.𝑦𝑖N)
be the noisy sequence output by a
desensitization algorithm
R
, where
𝑦𝑖=R(𝑥𝑖)
. The algorithm
R
is
order-preserving if and only if the following conditions are satised:
𝑖, 𝑗. 𝑥𝑖>𝑥𝑗𝑦𝑖>𝑦𝑗, 𝑎𝑛𝑑
𝑖, 𝑗. 𝑦𝑖>𝑦𝑗𝑥𝑖𝑥𝑗.
However, rigorous order itself can be leveraged by attackers to perform attacks (e.g., the big-jump attack [14], the inter-column correlation-based attack [27], the multinomial attack [13], and inference attacks [42, 51]). These attacks use auxiliary information to estimate the distribution of the original values and then correlate them with the desensitized values based on their order. Besides, there is a lack of widely accepted cryptographic tools to quantify how much privacy is compromised through such attacks. The notion of differential privacy can help with these predicaments: it provides a rigorous upper bound for information disclosure and turns deterministic output into probabilistic results. Hence, we extend a relaxed version of the order-preserving notion, called Probabilistic Order-Preserving, in Definition 2.4.
Denition 2.4. (Probabilistic Order-Preserving Desensitization Al-
gorithm). Denote
𝑋=𝑥1, 𝑥2, ..., 𝑥𝑛(𝑖.𝑥𝑖N)
as the sensitive
sequence, and
𝑌=𝑦1, 𝑦2, ...,𝑦𝑛(𝑖.𝑦𝑖N)
be the noisy sequence
output by a desensitization algorithm
R
, where
𝑦𝑖=R(𝑥𝑖)
. The
algorithm
R
is probabilistic order-preserving if and only if the
following conditions are satised:
𝑖, 𝑗. 𝑥𝑖>𝑥𝑗𝑃𝑟 [𝑦𝑖>𝑦𝑗] 𝛾(𝑡), 𝑤ℎ𝑒𝑟𝑒 𝛾 (𝑡) ∈ [0,1],|𝑥𝑖𝑥𝑗| 𝑡.
Here, $\gamma$ is a function of the distance between $x_i$ and $x_j$. The definition coincides with the rigorous order-preserving notion in Definition 2.3 when $\gamma(t) = 1$ for any $t$. Algorithms satisfying probabilistic order preservation with $\gamma(t) < 1$ preserve the relative order of some pairs of values rather than all pairs. Specifically, probabilistic order-preserving desensitization algorithms can be achieved by adding carefully selected random noise to sensitive values. Meanwhile, the randomness provides a provable privacy guarantee to resist all the aforementioned attacks based on auxiliary information. All the proposed desensitization algorithms are probabilistic order-preserving. Moreover, we prove that these algorithms all satisfy dLDP.
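As a sanity check of the notion, one can estimate $\gamma(t)$ for the discrete Laplace sketch from Section 2.1 by Monte Carlo simulation (illustrative only; the paper derives $\gamma$ analytically for its own algorithms):

```python
import numpy as np

def estimate_gamma(t, eps, trials=100_000, seed=0):
    """Empirical Pr[y_i > y_j] for two inputs at distance t, each perturbed
    independently by the discrete Laplace mechanism sketched earlier."""
    rng = np.random.default_rng(seed)
    y_i = np.array([discrete_laplace_mechanism(t, eps, rng) for _ in range(trials)])
    y_j = np.array([discrete_laplace_mechanism(0, eps, rng) for _ in range(trials)])
    return float(np.mean(y_i > y_j))

for t in (1, 5, 20):
    print(t, estimate_gamma(t, eps=0.5))   # gamma(t) grows with the distance t
```

Since the noise is additive and translation-invariant, any pair of inputs at distance $t$ behaves the same, so simulating the pair $(t, 0)$ suffices.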
2.3 Gradient Tree Boosting
The term "gradient tree boosting" originates from the paper by
Friedman et al. [
32
]. Each iteration of training involves incremen-
tally adjusting the gradient to t the residual, with the goal of
minimizing the loss function. There are some gradient tree boost-
ing algorithms have been widely used such as GBDT [
33
], XGBoost
[20], where XGBoost is an ecient implementation of GBDT.
We analyze how XGBoost builds decision trees to help understand that only the order of feature values is necessary during gradient tree boosting. The algorithm repeatedly finds the split point with the greatest gain after splitting. Denote by $I_L$ and $I_R$ the sample sets of the left and right child nodes after splitting a node with sample set $I$, by $g_i$ and $h_i$ the first- and second-order gradient statistics, and by $\lambda$ and $\omega$ the regularization parameters; the gain of the split is given by
$$G_{split} = \frac{1}{2}\left[\frac{(\sum_{i \in I_L} g_i)^2}{\sum_{i \in I_L} h_i + \lambda} + \frac{(\sum_{i \in I_R} g_i)^2}{\sum_{i \in I_R} h_i + \lambda} - \frac{(\sum_{i \in I} g_i)^2}{\sum_{i \in I} h_i + \lambda}\right] - \omega.$$
Note that all the variables needed to calculate $G_{split}$ can be derived from the order of feature values alone. Thus we can build the boosting tree without knowing the exact value of each feature. With such a tree we can decide which node a sample should fall into, but we cannot know the value each split node represents, which makes it a "partial tree". Accurate predictions can be achieved with the assistance of the parties holding the corresponding features.
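To see why the order alone suffices, the following sketch scans the split candidates of one feature using only the permutation that sorts it (the names and the simple linear scan are ours; XGBoost's actual implementation adds histogram and sparsity optimizations):

```python
import numpy as np

def best_split_by_order(order, g, h, lam=1.0, omega=0.0):
    """Find the best split of one feature from its sort order alone.

    order : indices sorting the feature ascending; the only feature-specific
            input, so the exact feature values are never touched
    g, h  : first-/second-order gradient statistics per sample
    Returns (best_gain, position in the sorted order to split after).
    """
    g = np.asarray(g, dtype=float)[order]
    h = np.asarray(h, dtype=float)[order]
    G, H = g.sum(), h.sum()
    score = lambda gs, hs: gs * gs / (hs + lam)
    best_gain, best_pos, gl, hl = -np.inf, None, 0.0, 0.0
    for pos in range(len(order) - 1):          # candidate: split after `pos`
        gl += g[pos]
        hl += h[pos]
        gain = 0.5 * (score(gl, hl) + score(G - gl, H - hl) - score(G, H)) - omega
        if gain > best_gain:
            best_gain, best_pos = gain, pos
    return best_gain, best_pos
```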
3 SYSTEM OVERVIEW
3.1 Architecture
The training dataset is vertically partitioned and distributed among different users' devices. Each user holds different features of the same samples with overlapping sample IDs, and only one user holds the labels. Since the labels are essential for supervised learning, the user holding them is generally the central node responsible for aggregating information and updating the model. Therefore, our framework focuses on guiding the information exchange between the other users and the user holding the labels, rather than sharing labels among all users. For the sake of simplicity, we divide users into two parties.
Party A. Party A refers to the user who holds the labels of the training samples. It may also hold several features. In the training process, Party A acts as a central server that exchanges information with Party B and trains the model. The trained model is stored only at Party A, and Party A is responsible for using it to predict new samples.

Party B. We define Party B as the set of users who hold only several features of the samples. Party B acts as a client that participates in training by exchanging necessary information with Party A.
[Figure 2: Training process of OpBoost. Party B pre-processes its raw feature values into a unified discrete domain, desensitizes them with an order-preserving desensitization algorithm, and replaces the desensitized values with their ordinal numbers before sending them to Party A, who holds the labels and trains the model.]
3.2 Execution Workow
In the following, we describe the process of training decision trees
with OpBoost in detail, and also explain how to predict with these
trained decision trees.
Training Process. The overall process can be summarized into three steps, which are shown in Figure 2.
First, Party B desensitizes the local features before communicating with Party A, which is specified as follows.

Pre-processing for feature values. We focus on numerical features and categorical features that have a natural ordering between categories. Categorical features with no distance between values are not included, and tree boosting algorithms handle them differently. Such features are usually encoded by one-hot encoding, and the encoded values can be desensitized by existing LDP mechanisms [31, 59]. As ordinal categorical values can be mapped to discrete numerical values, w.l.o.g., we assume that all features are numerical, i.e., continuous or discrete numerical values. Besides, since the specific values do not affect the structure of the decision tree, it suffices to remap features coming with diverse distance metrics to a unified discrete data domain for the subsequent distance-based privacy-preserving algorithm to work with.

Desensitize values with an order-preserving desensitization algorithm. We design several order-preserving desensitization algorithms that satisfy dLDP, and give guidance to help Party B choose an algorithm to desensitize feature values.

Replace the values with ordinal numbers. Party B replaces all the desensitized feature values with their corresponding ordinal numbers and then sends them to Party A (see the ranking sketch after these steps).
Second, Party A nds the best split points over the features after
collecting all features’ information from Party B. Specically, Party
A does not know the values of split points for features stored in
Party B. It records the ordinal numbers as the split points.
Finally, Party A sends all order numbers of split points to Party
B to get their desensitized values. Then Party B sends the specic
desensitized values of corresponding split points back to Party A,
and Party A updates the tree models.
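The ordinal replacement in the first step is plain ranking. A minimal sketch, using the desensitized F1 column from Figure 2 (ties, which the paper does not discuss here, are broken by position):

```python
import numpy as np

def to_ordinals(column):
    """Replace each desensitized value with its rank; Party A later names a
    split point by this ordinal and asks Party B for the underlying value."""
    order = np.argsort(column, kind="stable")
    ordinals = np.empty(len(column), dtype=int)
    ordinals[order] = np.arange(1, len(column) + 1)   # ranks start at 1
    return ordinals

print(to_ordinals(np.array([65, 50, 36, 80])))        # -> [3 2 1 4]
```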
Prediction. After the above training steps, Party A obtains a complete decision tree model for predicting new samples. Party A can independently predict new samples stored locally (non-private), or continue to cooperate with Party B to predict new samples (private). All new samples need the same pre-processing as the training samples before being desensitized or input into the model.

Note that Party B is not required to be online all the time in either the training or the prediction procedure. It can go offline after sending all feature information and the values of the split points to Party A. In addition, Party A can utilize the trained decision tree to independently perform prediction tasks with non-private samples or desensitized private samples.

Table 1: Important Notations

$D$: finite, numerical input data domain
$D'$: finite, discrete data domain after mapping
$L$ / $R$: lower / upper bound of $D'$
$t$: distance between values in $D'$
$\theta$: length of a partition
$P_m$: the $m$-th partition of $D'$
$\epsilon_{prt}$ / $\epsilon_{ner}$: parameters adjusting the privacy budget across different partitions / within one partition
$\alpha$: ratio of $\epsilon_{prt}$ to $\theta\epsilon_{ner}$
$\gamma$: lower bound of the order-preserving probability
4 PROPOSED ALGORITHMS
4.1 Pre-Processing for Feature Values
Since the training samples come from multiple parties, tree boosting algorithms usually preprocess the values of all features before training, e.g., filling in missing values and handling erroneous values. In addition, we present an extra preprocessing step to improve the privacy and utility of the desensitization algorithms. First, existing works show that implementing differential privacy mechanisms over floating-point numerical values is vulnerable to privacy attacks [39, 50]. Second, note that the privacy guarantee provided by OpBoost satisfies distance-based LDP, so it is necessary to normalize the values of different features into a unified distance unit. To address these issues, we map the numerical values of all different features into a unified discrete value domain. We show the details in the following.
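As a rough illustration of such a mapping, the sketch below clips a raw feature to known bounds and buckets it uniformly into a discrete domain; the bounds and the bucketing granularity here are assumptions for the example, not the paper's concrete mapping.

```python
import numpy as np

def discretize(values, lower, upper, domain_size=1000):
    """Map raw (possibly floating-point) feature values into the unified
    discrete domain {1, ..., domain_size}, avoiding float-valued DP outputs."""
    values = np.clip(np.asarray(values, dtype=float), lower, upper)
    scaled = (values - lower) / (upper - lower)               # into [0, 1]
    return np.minimum((scaled * domain_size).astype(int) + 1, domain_size)

# e.g., the raw F1 values from Figure 2, bucketed into {1, ..., 100}
print(discretize([300.5, 40.5, 230.0, 305.0], lower=0.0, upper=500.0, domain_size=100))
```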