OpBoost: A Vertical Federated Tree Boosting Framework Based
on Order-Preserving Desensitization
Xiaochen Li∗
Zhejiang University
Yuke Hu∗
Zhejiang University
Weiran Liu
Alibaba Group
Hanwen Feng
Alibaba Group
Li Peng
Alibaba Group
Yuan Hong
University of Connecticut
Kui Ren
Zhejiang University
Zhan Qin†
Zhejiang University
ABSTRACT
Vertical Federated Learning (FL) is a new paradigm that enables users with non-overlapping attributes of the same data samples to jointly train a model without directly sharing the raw data. Nevertheless, recent works show that it is still not sufficient to prevent privacy leakage from the training process or the trained model. This paper focuses on studying privacy-preserving tree boosting algorithms under vertical FL. The existing solutions based on cryptography involve heavy computation and communication overhead and are vulnerable to inference attacks. Although the solution based on Local Differential Privacy (LDP) addresses the above problems, it leads to low accuracy of the trained model.

This paper explores improving the accuracy of the widely deployed tree boosting algorithms satisfying differential privacy under vertical FL. Specifically, we introduce a framework called OpBoost. Three order-preserving desensitization algorithms satisfying a variant of LDP called distance-based LDP (dLDP) are designed to desensitize the training data. In particular, we optimize the dLDP definition and study efficient sampling distributions to further improve the accuracy and efficiency of the proposed algorithms. The proposed algorithms provide a trade-off between the privacy of pairs with large distance and the utility of desensitized values. Comprehensive evaluations show that OpBoost has better performance on prediction accuracy of trained models compared with existing LDP approaches under reasonable settings. Our code is open source.¹
1 INTRODUCTION
Federated Learning (FL) [47] is an emerging paradigm that enables multiple parties to jointly train a machine learning model without revealing their private data to each other. According to the way of data partitioning, FL can be classified into two categories: Horizontal FL and Vertical FL [65]. Horizontal FL considers the scenarios where different data samples with the same features are distributed among different parties. Vertical FL works when different parties hold the same data samples with disjoint features. Vertically distributed datasets are very common in real-world scenarios. A typical example of vertical FL is shown in Figure 1: a credit institution collaborates with an E-commerce company and a bank to train a model to predict the labels, i.e., users' credit ratings, based on features, i.e., shopping records and revenue.

[Figure 1: An example of Vertical FL. A bank and an E-commerce company collaborate with a credit institution to train a credit rating prediction model.]

∗Both authors contributed equally to this work.
†Corresponding author.
1 https://github.com/alibaba-edu/mpc4j/tree/main/mpc4j-sml-opboost
Email: {xiaochenli, yukehu, kuiren, qinzhan}@zju.edu.cn, {weiran.lwr, fenghanwen.fhw, jerry.pl}@alibaba-inc.com, yuan.hong@uconn.edu
Tree boosting algorithms (e.g., GBDT [33]) are popular supervised ML algorithms that enjoy high efficiency and interpretability. They are widely used in prediction tasks based on heterogeneous features (e.g., revenue, age) in scenarios such as credit scoring and price forecasting [16, 20, 24, 41]. Recent work [36] shows that when the training data mainly includes heterogeneous features, GBDT outperforms state-of-the-art deep learning solutions. However, most existing tree boosting algorithms are proposed in the centralized setting, which trains the model with direct access to the whole training data. With the increasing public privacy concerns and the promulgation of privacy regulations (e.g., GDPR), this centralized setting limits the widespread deployment of tree boosting algorithms. Therefore, it becomes an essential problem to design practical privacy-preserving vertical federated tree boosting algorithms.

To address this problem, there are some solutions based on two different technologies: cryptography and Local Differential Privacy (LDP). The framework SecureBoost [21] and its subsequent work VF2Boost [34] are based on additive homomorphic encryption. Although well-designed engineering optimization is performed, the large number of homomorphic operations still inevitably causes participants to suffer a prohibitive computational overhead. Besides, each party shares the true order of its feature values for training the model; to our knowledge, several attacks have been proposed that use auxiliary information to infer the distribution of the original values from the order of values [13, 14, 27]. Abspoel et al. [9] present a simple vertical federated boosting framework based on Multi-Party Computation (MPC) protocols. Nevertheless, MPC protocols cause a heavy communication overhead. Although the order of feature values is not visible to any party in the MPC protocol, the trained model is not a privacy-preserving model, which is vulnerable to inference attacks [42, 51, 54]. Tian et al. [55] propose FederBoost, which provides LDP privacy protection for each party's feature values and can resist all the aforementioned attacks. FederBoost is shown to
be much more ecient than the MPC-based and Encryption-based
solutions, which is favorable in real-world applications. However,
the introduction of randomness leads to a serious loss of the relative
order information, and results in low accuracy of the trained model.
Observation. The process of building the decision tree in tree boosting algorithms is to repeatedly find the split points of the features, and this process depends only on the order of feature values rather than the exact values. In existing solutions, each party is essentially sending the order of its feature values to the party holding the label for training the model. It is necessary to provide privacy protection for the order of feature values, since training the model with the true order of the feature values can leak the original private values. However, the mechanisms satisfying LDP are usually designed to provide privacy protection for values without order (e.g., enumeration values). They perturb the private values to achieve the same degree of indistinguishability for any pair of values in the data domain. Meanwhile, this causes desensitized values to lose too much order information, which seriously reduces the accuracy of the trained model. In fact, people require different degrees of indistinguishability for pairs of values with different distances. For example, an employee may not mind revealing that his income is less than his boss's, but he does mind others knowing that it is less than his colleagues'. Therefore, it is more suitable to provide different degrees of indistinguishability for pairs with different distances. Meanwhile, the relative order of value pairs, especially those that are far apart, can be preserved with high probability.
Another observation is that the existing distance-based LDP (dLDP) definition has limitations. There is only one privacy parameter $\epsilon$ in the existing dLDP definition, which allocates privacy budgets based on the $l_1$ distance between two values. Given the total privacy budget $\epsilon$, one might want one's private value to be as indistinguishable as possible from its nearby values, while not minding a weaker indistinguishability from values farther away. However, the existing dLDP definition cannot achieve this privacy requirement. Specifically, increasing $\epsilon$ can increase the probability of the desensitized value falling near the true value, but at the same time its distribution near the true value becomes more concentrated, and vice versa. If we can increase the probability of the desensitized value falling in a specific area around the true value, while flattening the probability distribution within this area, an optimized output probability distribution of the desensitized value can be obtained.
Contribution. Our contributions are summarized below.
Proposal of OpBoost: We propose a novel framework called OpBoost for privacy-preserving vertical federated tree boosting. Within the framework, we design three order-preserving desensitization algorithms for desensitizing the training data. Different from the existing LDP-based solution, the desensitized training data satisfy a variant of LDP called dLDP. It can preserve more relative order information of desensitized values than LDP while still providing privacy protection. When strong indistinguishability is required for close values, e.g., $\epsilon = 0.08$ for value pairs with distance $t = 1$, OpBoost can still achieve accuracy close to that without protection for both classification and regression tasks. For example, for a classification task, OpBoost achieves 60% accuracy where the unprotected model achieves 87%, while the LDP-based solution is close to 10%. Meanwhile, OpBoost also retains the advantages of LDP-based solutions over cryptography-based solutions. The total communication overhead of each party is about $O(rn \log n)$ bits, whereas $O(rnk \log^2 n)$ is required in the MPC-based solution ($n$, $k$, $r$ are the number of samples, values' bits, and features, respectively). Moreover, we replace the exponential mechanism with the (bounded) Laplace mechanism to reduce the computational complexity of desensitizing a sensitive value to $O(1)$.
Optimizing the existing dLDP definition: We also optimize the existing dLDP definition in order to break through its limitations. We divide the data domain into several partitions of length $\theta$. Then we introduce two privacy parameters, $\epsilon_{prt}$ and $\epsilon_{ner}$, to adjust the probability distribution of the desensitized value falling in different partitions and the probability distribution within one partition, respectively. We prove that the existing definition is just a special case in which $\epsilon_{prt}$ and $\epsilon_{ner}$ satisfy a fixed proportional relationship. By adjusting the ratio of $\epsilon_{prt}$ and $\epsilon_{ner}$, we can always get higher accuracy than the existing dLDP under the same privacy guarantee (a toy sketch of this two-level sampling follows after this list of contributions).
Introducing new order-preserving metrics: In addition to quantifying the privacy of the order-preserving desensitization algorithms with dLDP, we also introduce new theoretical and experimental metrics to quantify the order information preserved by desensitized values. We prove that the proposed order-preserving desensitization algorithms are probabilistic order-preserving in theory: the probability of any pair of desensitized values preserving the original relative order is at least $\gamma$. Besides, we introduce the weighted-Kendall coefficient, weighted by distance, to evaluate the order of desensitized feature values experimentally.
Comprehensive evaluation: We conduct comprehensive theoretical and experimental evaluations to analyze the performance of OpBoost, including all the designed order-preserving desensitization algorithms. We evaluate the order preservation of the desensitized values using all introduced metrics. We also conduct experiments on public datasets used for binary classification, multi-class classification, and regression tasks. Both GBDT and XGBoost are implemented in OpBoost. The experimental results show that OpBoost achieves prediction accuracy close to, and sometimes even higher than, plain models (e.g., a 1.0003× improvement over the plain XGBoost model), which is superior to the existing LDP approaches.
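To make the two-parameter idea concrete, the following toy sketch (ours, not the paper's algorithm; the exponential-style weights, the value domain, and the divisibility of the domain size by $\theta$ are all assumptions) samples a desensitized value in two stages: first a partition, penalized by $\epsilon_{prt}$ per unit of partition distance, then a value inside it, penalized by $\epsilon_{ner}$ per unit of $l_1$ distance.

```python
import numpy as np

def two_level_perturb(x, theta, eps_prt, eps_ner, domain_size, rng=None):
    """Two-stage sampling in the spirit of the optimized dLDP definition.

    eps_prt shapes how likely the output lands in each length-theta partition;
    eps_ner shapes the distribution inside the chosen partition.
    Assumes domain {0, ..., domain_size - 1} with domain_size divisible by theta.
    """
    rng = rng or np.random.default_rng()
    parts = np.arange(domain_size // theta)        # partition indices
    own = x // theta                               # partition containing x
    # Stage 1: choose a partition, penalizing partition-level distance.
    w_prt = np.exp(-eps_prt * np.abs(parts - own))
    m = rng.choice(parts, p=w_prt / w_prt.sum())
    # Stage 2: choose a value within partition m, penalizing |value - x|.
    values = np.arange(m * theta, (m + 1) * theta)
    w_ner = np.exp(-eps_ner * np.abs(values - x))
    return int(rng.choice(values, p=w_ner / w_ner.sum()))
```

Raising $\epsilon_{prt}$ while lowering $\epsilon_{ner}$ concentrates the output in the true value's partition while flattening the distribution inside it, which is exactly the degree of freedom the single-parameter definition lacks.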
2 PRELIMINARIES
2.1 Dierential Privacy
Dierential privacy [
28
] is the de facto privacy denition of data
disclosure, preventing attempts from learning private information
about any individual in a data release. In this work, we are interested
in local dierential privacy [
40
], which allows each user to perturb
his sensitive data using a randomization mechanism
M
such that
the perturbed results from dierent data values will be “close".
Denition 2.1. (Local Dierential Privacy, LDP). An algorithm
M
satises
𝜖
-LDP, where
𝜖
0, if and only if for any input
𝑣, 𝑣 D
,
and any output 𝑦𝑅𝑎𝑛𝑔𝑒 (M), we have
Pr [M(𝑣)=𝑦]𝑒𝜖Pr M (𝑣)=𝑦.
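For illustration, $k$-ary randomized response is a classic mechanism satisfying this definition (a minimal sketch in Python; it is not part of OpBoost):

```python
import numpy as np

def randomized_response(v, domain, eps, rng=None):
    """k-ary randomized response: report v truthfully with probability
    e^eps / (e^eps + k - 1), otherwise report a uniformly random other value.
    Any two inputs yield output probabilities within a factor e^eps (eps-LDP).
    """
    rng = rng or np.random.default_rng()
    k = len(domain)
    if rng.random() < np.exp(eps) / (np.exp(eps) + k - 1):
        return v
    return rng.choice([u for u in domain if u != v])
```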
The parameter $\epsilon$ above is called the privacy budget; a smaller $\epsilon$ means stronger privacy protection. On the other hand, since all pairs of sensitive data shall satisfy the $\epsilon$-privacy guarantee for the same $\epsilon$, it may hide too much information about a dataset, such that utility might be insufficient for certain applications. Distance-based LDP [11, 18, 38], which measures the level of privacy guarantee between any pair of sensitive data based on their distance, is proposed to improve the utility.
We use the $l_1$ distance in this paper, and distance-based local differential privacy is defined as follows.
Denition 2.2. (Distance-based Local Dierential Privacy, dLDP).
An algorithm
M
satises
𝜖
-dLDP, if and only if for any input
𝑥, 𝑥
Dsuch that |𝑥𝑥| 𝑡, and any output 𝑦𝑅𝑎𝑛𝑔𝑒 (M), we have
Pr [M(𝑥)=𝑦]𝑒𝑡𝜖 ·Pr M(𝑥)=𝑦,
where
𝑡𝜖
controls the level of indistinguishability between out-
puts of
M(𝑥)
and
M(𝑥)
. The indistinguishability decreases as the
distance 𝑡between 𝑥and 𝑥increases.
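For intuition, a mechanism whose output probability decays as $e^{-\epsilon|y-x|}$ satisfies $\epsilon$-dLDP: the probability ratio for two inputs at distance at most $t$ is bounded by $e^{t\epsilon}$. A minimal sketch using discrete Laplace (two-sided geometric) noise follows; it is our illustration, not one of the three OpBoost algorithms.

```python
import numpy as np

def discrete_laplace_mechanism(x, eps, rng=None):
    """Perturb integer x so that Pr[output = x + z] is proportional to
    exp(-eps * |z|).  For any x, x' with |x - x'| <= t, the two output
    distributions differ by at most a factor exp(t * eps): eps-dLDP.
    """
    rng = rng or np.random.default_rng()
    beta = np.exp(-eps)
    # The difference of two i.i.d. geometric variables on {0, 1, ...} with
    # success probability 1 - beta is discrete-Laplace distributed.
    g1 = rng.geometric(1.0 - beta) - 1   # numpy's geometric starts at 1
    g2 = rng.geometric(1.0 - beta) - 1
    return x + g1 - g2
```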
2.2 Order-Preserving Desensitization Algorithm
In some application scenarios (e.g., recommender systems, range queries), the accuracy of the algorithm mainly depends on the order of the dataset. It would be desirable for the numerical order of sensitive data to be somehow preserved after desensitization. Many order-preserving desensitization algorithms have been proposed in cryptographic studies [10, 14, 43, 44], in which the order is rigorously preserved after desensitization. The formal definition of an order-preserving desensitization algorithm is as follows.
Denition 2.3. (Order-Preserving Desensitization Algorithm). De-
note
𝑋=𝑥1, 𝑥2, ..., 𝑥𝑛(𝑖.𝑥𝑖N)
as the sensitive sequence, and
𝑌=𝑦1, 𝑦2, ...,𝑦𝑛(𝑖.𝑦𝑖N)
be the noisy sequence output by a
desensitization algorithm
R
, where
𝑦𝑖=R(𝑥𝑖)
. The algorithm
R
is
order-preserving if and only if the following conditions are satised:
𝑖, 𝑗. 𝑥𝑖>𝑥𝑗𝑦𝑖>𝑦𝑗, 𝑎𝑛𝑑
𝑖, 𝑗. 𝑦𝑖>𝑦𝑗𝑥𝑖𝑥𝑗.
However, rigorous order itself can be leveraged by attackers to perform attacks (e.g., the big-jump attack [14], the inter-column correlation-based attack [27], the multinomial attack [13], and inference attacks [42, 51]). These attacks use auxiliary information to estimate the distribution of the original values and then correlate them with the desensitized values based on their order. Besides, there is a lack of widely accepted cryptographic tools to quantify how much privacy is compromised through such attacks. The notion of differential privacy can help with these predicaments: it provides a rigorous upper bound for information disclosure and turns deterministic output into probabilistic results. Hence, we extend a relaxed version of the order-preserving notion, called Probabilistic Order-Preserving, in Definition 2.4.
Denition 2.4. (Probabilistic Order-Preserving Desensitization Al-
gorithm). Denote
𝑋=𝑥1, 𝑥2, ..., 𝑥𝑛(𝑖.𝑥𝑖N)
as the sensitive
sequence, and
𝑌=𝑦1, 𝑦2, ...,𝑦𝑛(𝑖.𝑦𝑖N)
be the noisy sequence
output by a desensitization algorithm
R
, where
𝑦𝑖=R(𝑥𝑖)
. The
algorithm
R
is probabilistic order-preserving if and only if the
following conditions are satised:
𝑖, 𝑗. 𝑥𝑖>𝑥𝑗𝑃𝑟 [𝑦𝑖>𝑦𝑗] 𝛾(𝑡), 𝑤ℎ𝑒𝑟𝑒 𝛾 (𝑡) ∈ [0,1],|𝑥𝑖𝑥𝑗| 𝑡.
Here, $\gamma$ is a function of the distance between $x_i$ and $x_j$. The definition coincides with the rigorous order-preserving notion in Definition 2.3 when $\gamma(t) = 1$ for any $t$. Algorithms satisfying probabilistic order preservation with $\gamma(t) < 1$ preserve the relative order of some pairs of values rather than all pairs. Specifically, probabilistic order-preserving desensitization algorithms can be achieved by adding carefully selected random noise to sensitive values. Meanwhile, the randomness provides a provable privacy guarantee to resist all the aforementioned attacks based on auxiliary information. All the proposed desensitization algorithms are probabilistic order-preserving. Moreover, we prove that these algorithms all satisfy dLDP.
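As a sanity check of the notion, one can estimate $\gamma(t)$ for the discrete Laplace sketch from Section 2.1 by Monte Carlo simulation (illustrative only; the paper derives $\gamma$ analytically for its own algorithms):

```python
import numpy as np

def estimate_gamma(t, eps, trials=100_000, seed=0):
    """Empirical Pr[y_i > y_j] for two inputs at distance t, each perturbed
    independently by the discrete Laplace mechanism sketched earlier."""
    rng = np.random.default_rng(seed)
    y_i = np.array([discrete_laplace_mechanism(t, eps, rng) for _ in range(trials)])
    y_j = np.array([discrete_laplace_mechanism(0, eps, rng) for _ in range(trials)])
    return float(np.mean(y_i > y_j))

for t in (1, 5, 20):
    print(t, estimate_gamma(t, eps=0.5))   # gamma(t) grows with the distance t
```

Since the noise is additive and translation-invariant, any pair of inputs at distance $t$ behaves the same, so simulating the pair $(t, 0)$ suffices.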
2.3 Gradient Tree Boosting
The term "gradient tree boosting" originates from the paper by
Friedman et al. [
32
]. Each iteration of training involves incremen-
tally adjusting the gradient to t the residual, with the goal of
minimizing the loss function. There are some gradient tree boost-
ing algorithms have been widely used such as GBDT [
33
], XGBoost
[20], where XGBoost is an ecient implementation of GBDT.
We analyze how XGBoost builds decision trees to help understand that only the order of feature values is necessary during gradient tree boosting. The algorithm repeatedly finds the split point with the greatest gain after splitting. Denote by $I_L$ and $I_R$ the sample sets of the left and right child nodes after splitting a node with sample set $I$, by $g_i$ and $h_i$ the first- and second-order gradient statistics, and by $\lambda$ and $\omega$ the regularization parameters; the gain of the split is given by
$$G_{split} = \frac{1}{2}\left[\frac{(\sum_{i \in I_L} g_i)^2}{\sum_{i \in I_L} h_i + \lambda} + \frac{(\sum_{i \in I_R} g_i)^2}{\sum_{i \in I_R} h_i + \lambda} - \frac{(\sum_{i \in I} g_i)^2}{\sum_{i \in I} h_i + \lambda}\right] - \omega.$$
Note that all the variables needed to calculate $G_{split}$ can be derived from the order of feature values alone. Thus we can build the boosting tree without knowing the exact value of each feature. With such a tree we can decide which node a sample should fall into, but we cannot know the value each split node represents, which makes it a "partial tree". Accurate predictions can be achieved with the assistance of the parties holding the corresponding features.
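To see why the order alone suffices, the following sketch scans the split candidates of one feature using only the permutation that sorts it (the names and the simple linear scan are ours; XGBoost's actual implementation adds histogram and sparsity optimizations):

```python
import numpy as np

def best_split_by_order(order, g, h, lam=1.0, omega=0.0):
    """Find the best split of one feature from its sort order alone.

    order : indices sorting the feature ascending; the only feature-specific
            input, so the exact feature values are never touched
    g, h  : first-/second-order gradient statistics per sample
    Returns (best_gain, position in the sorted order to split after).
    """
    g = np.asarray(g, dtype=float)[order]
    h = np.asarray(h, dtype=float)[order]
    G, H = g.sum(), h.sum()
    score = lambda gs, hs: gs * gs / (hs + lam)
    best_gain, best_pos, gl, hl = -np.inf, None, 0.0, 0.0
    for pos in range(len(order) - 1):          # candidate: split after `pos`
        gl += g[pos]
        hl += h[pos]
        gain = 0.5 * (score(gl, hl) + score(G - gl, H - hl) - score(G, H)) - omega
        if gain > best_gain:
            best_gain, best_pos = gain, pos
    return best_gain, best_pos
```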
3 SYSTEM OVERVIEW
3.1 Architecture
The training dataset is vertically partitioned and distributed among different users' devices. Each user holds different features of the same samples with overlapping sample IDs, and only one user holds the labels. Since the labels are essential for supervised learning, the user holding them is generally the central node responsible for aggregating information and updating the model. Therefore, our framework focuses on guiding the information exchange between the other users and the user holding the labels, rather than sharing labels among all users. For the sake of simplicity, we divide users into two parties.
Party A. Party A refers to the user who holds the labels of the training samples. It may also hold several features. In the training process, Party A acts as a central server that exchanges information with Party B and trains the model. The trained model is stored only at Party A, and Party A is responsible for using it to predict new samples.

Party B. We define Party B as the set of users who hold only several features of the samples. Party B acts as a client that participates in training by exchanging necessary information with Party A.
[Figure 2: Training process of OpBoost. Party B pre-processes its raw feature values into a unified discrete domain, desensitizes them with an order-preserving desensitization algorithm, and replaces the desensitized values with their ordinal numbers before sending them to Party A, who holds the labels and trains the model.]
3.2 Execution Workow
In the following, we describe the process of training decision trees
with OpBoost in detail, and also explain how to predict with these
trained decision trees.
Training Process. The overall process can be summarized into three steps, which are shown in Figure 2.
First, Party B desensitizes the local features before communicating with Party A, which is specified as follows.

Pre-processing for feature values. We focus on numerical features and categorical features that have a natural ordering between categories. Categorical features with no distance between values are not included, and tree boosting algorithms handle them differently. Such features are usually encoded by one-hot encoding, and the encoded values can be desensitized by existing LDP mechanisms [31, 59]. As ordinal categorical values can be mapped to discrete numerical values, w.l.o.g., we assume that all features are numerical, i.e., continuous or discrete numerical values. Besides, since the specific values do not affect the structure of the decision tree, it suffices to remap features coming with diverse distance metrics to a unified discrete data domain for the subsequent distance-based privacy-preserving algorithm to work with.

Desensitize values with an order-preserving desensitization algorithm. We design several order-preserving desensitization algorithms that satisfy dLDP, and give guidance to help Party B choose an algorithm to desensitize feature values.

Replace the values with ordinal numbers. Party B replaces all the desensitized feature values with their corresponding ordinal numbers and then sends them to Party A (see the ranking sketch after these steps).
Second, Party A nds the best split points over the features after
collecting all features’ information from Party B. Specically, Party
A does not know the values of split points for features stored in
Party B. It records the ordinal numbers as the split points.
Finally, Party A sends all order numbers of split points to Party
B to get their desensitized values. Then Party B sends the specic
desensitized values of corresponding split points back to Party A,
and Party A updates the tree models.
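The ordinal replacement in the first step is plain ranking. A minimal sketch, using the desensitized F1 column from Figure 2 (ties, which the paper does not discuss here, are broken by position):

```python
import numpy as np

def to_ordinals(column):
    """Replace each desensitized value with its rank; Party A later names a
    split point by this ordinal and asks Party B for the underlying value."""
    order = np.argsort(column, kind="stable")
    ordinals = np.empty(len(column), dtype=int)
    ordinals[order] = np.arange(1, len(column) + 1)   # ranks start at 1
    return ordinals

print(to_ordinals(np.array([65, 50, 36, 80])))        # -> [3 2 1 4]
```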
Prediction. After the above training steps, Party A obtains a complete decision tree model for predicting new samples. Party A can independently predict new samples stored locally (non-private), or continue to cooperate with Party B to predict new samples (private). All new samples need the same pre-processing as the training samples before being desensitized or input into the model.

Note that Party B is not required to be online all the time in either the training or the prediction procedure. It can go offline after sending all feature information and the values of the split points to Party A. In addition, Party A can utilize the trained decision tree to independently perform prediction tasks with non-private samples or desensitized private samples.

Table 1: Important Notations

$D$: finite, numerical input data domain
$D'$: finite, discrete data domain after mapping
$L$ / $R$: lower / upper bound of $D'$
$t$: distance between values in $D'$
$\theta$: length of a partition
$P_m$: the $m$-th partition of $D'$
$\epsilon_{prt}$ / $\epsilon_{ner}$: parameters adjusting the privacy budget across different partitions / within one partition
$\alpha$: ratio of $\epsilon_{prt}$ to $\theta\epsilon_{ner}$
$\gamma$: lower bound of the order-preserving probability
4 PROPOSED ALGORITHMS
4.1 Pre-Processing for Feature Values
Since the training samples come from multiple parties, tree boosting algorithms usually preprocess the values of all features before training, e.g., filling in missing values and handling erroneous values. In addition, we present an extra preprocessing step to improve the privacy and utility of the desensitization algorithms. First, existing works show that implementing differential privacy mechanisms over floating-point numerical values is vulnerable to privacy attacks [39, 50]. Second, note that the privacy guarantee provided by OpBoost satisfies distance-based LDP, so it is necessary to normalize the values of different features into a unified distance unit. To address these issues, we map the numerical values of all different features into a unified discrete value domain. We show the details in the following.
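As a rough illustration of such a mapping, the sketch below clips a raw feature to known bounds and buckets it uniformly into a discrete domain; the bounds and the bucketing granularity here are assumptions for the example, not the paper's concrete mapping.

```python
import numpy as np

def discretize(values, lower, upper, domain_size=1000):
    """Map raw (possibly floating-point) feature values into the unified
    discrete domain {1, ..., domain_size}, avoiding float-valued DP outputs."""
    values = np.clip(np.asarray(values, dtype=float), lower, upper)
    scaled = (values - lower) / (upper - lower)               # into [0, 1]
    return np.minimum((scaled * domain_size).astype(int) + 1, domain_size)

# e.g., the raw F1 values from Figure 2, bucketed into {1, ..., 100}
print(discretize([300.5, 40.5, 230.0, 305.0], lower=0.0, upper=500.0, domain_size=100))
```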