Xiaochen Li, Yuke Hu, Weiran Liu, Hanwen Feng, Li Peng, Yuan Hong, Kui Ren, and Zhan Qin
be much more efficient than the MPC-based and Encryption-based
solutions, which is favorable in real-world applications. However,
the introduced randomness causes a serious loss of relative order
information and results in low accuracy of the trained model.
Observation. The process of building the decision tree in tree
boosting algorithms is to repeatedly find the split points of the
features, and this process depends only on the order of feature
values rather than on the exact values. In existing solutions, each
party essentially sends the order of its feature values to the party
holding the labels for training the model. It is necessary to protect
the privacy of this order, since training the model with the true
order of the feature values can leak the original private values.
However, mechanisms satisfying LDP are usually designed to protect
values without order (e.g., enumeration values). They perturb the
private values to achieve the same degree of indistinguishability for
any pair of values in the data domain. As a result, desensitized
values lose too much order information, which seriously reduces the
accuracy of the trained model. In practice, people require different
degrees of indistinguishability for pairs of values at different
distances. For example, an employee may not mind revealing that his
income is lower than his boss's, but he does mind others learning
that it is lower than his colleagues'. Therefore, it is more suitable
to provide different degrees of indistinguishability for pairs at
different distances. Meanwhile, the relative order of value pairs,
especially those that are far apart, can be preserved with high
probability.
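To illustrate why split finding depends only on order, here is a minimal sketch (not the paper's implementation, and the helper names are ours): replacing a feature column by its ranks, an order-preserving transform, yields exactly the same best split.

```python
def best_split_index(feature, labels):
    """Index (in sorted-by-feature order) of the split that minimizes
    the summed squared error of the two resulting child nodes."""
    order = sorted(range(len(feature)), key=lambda i: feature[i])
    y = [labels[i] for i in order]

    def sse(part):
        m = sum(part) / len(part)
        return sum((v - m) ** 2 for v in part)

    costs = [sse(y[:i]) + sse(y[i:]) for i in range(1, len(y))]
    return min(range(len(costs)), key=lambda i: costs[i]) + 1

feature = [3.2, 50.0, 7.1, 0.5, 12.9]
labels = [1.0, 9.0, 2.0, 1.0, 8.0]
# Rank transform: an order-preserving "desensitization" of the raw values.
ranks = [sorted(feature).index(v) for v in feature]

# The chosen split is identical on the raw values and on their ranks.
assert best_split_index(feature, labels) == best_split_index(ranks, labels)
```

Any transform that preserves the full order thus leaves the trained tree unchanged; the privacy question is how much of that order may safely be revealed.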
Another observation is that the existing distance-based LDP
(dLDP) definition has limitations. There is only one privacy
parameter 𝜖 in the existing dLDP definition, which allocates privacy
budget based on the 𝑙1 distance between two values. Given the total
privacy budget 𝜖, one might want one's private value to be as
indistinguishable as possible from its nearby values, while not
minding weaker indistinguishability from values farther away.
However, the existing dLDP definition cannot achieve this privacy
requirement. Specifically, increasing 𝜖 increases the probability
of the desensitized value falling near the true value, but at the
same time its distribution near the true value becomes more
concentrated, and vice versa. If we could increase the probability
of the desensitized value falling in a specific area around the true
value while flattening the probability distribution within this
area, an optimized output distribution of the desensitized value
could be obtained.
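This coupling can be seen numerically in a toy single-parameter dLDP mechanism (an illustrative sketch of our own, with Pr[output = 𝑦] ∝ exp(−𝜖·|𝑦 − 𝑣|); the paper's mechanisms come later):

```python
import math

def dldp_pmf(domain_size, v, eps):
    """Single-parameter dLDP sketch: Pr[output = y] is proportional to
    exp(-eps * |y - v|) over a finite integer domain."""
    w = [math.exp(-eps * abs(y - v)) for y in range(domain_size)]
    z = sum(w)
    return [x / z for x in w]

low, high = dldp_pmf(101, 50, 0.05), dldp_pmf(101, 50, 0.5)

# A larger eps puts more mass within distance 5 of the true value...
assert sum(high[45:56]) > sum(low[45:56])
# ...but also concentrates that mass more sharply around the peak, so
# near-mass and flatness cannot be tuned independently with one eps.
assert high[50] / high[55] > low[50] / low[55]
```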
Contribution. Our contributions are summarized below.
Proposal of OpBoost: We propose a novel framework called OpBoost
for privacy-preserving vertical federated tree boosting. Within the
framework, we design three order-preserving desensitization
algorithms for desensitizing the training data. Different from the
existing LDP-based solution, the desensitized training data satisfy
a variant of LDP called dLDP, which preserves more relative order
information of desensitized values than LDP while still providing
privacy protection. Even when strong indistinguishability is
required for close values, e.g., 𝜖 = 0.08 for value pairs with
distance 𝑡 = 1, OpBoost can still achieve accuracy close to that
without protection for both classification and regression tasks. For
example, on one classification task, OpBoost achieves 60% accuracy
where the unprotected model achieves 87%, while the LDP-based
solution is close to 10%. Meanwhile, OpBoost also retains the
advantages of LDP-based solutions over Cryptography-based solutions.
The total communication overhead of each party is about
𝑂(𝑟𝑛 log 𝑛) bits, whereas 𝑂(𝑟𝑛𝑘 log² 𝑛) bits are required in the
MPC-based solution (𝑛, 𝑘, and 𝑟 are the number of samples, the
number of bits per value, and the number of features, respectively).
Moreover, we replace the exponential mechanism with the (bounded)
Laplace mechanism to reduce the computational complexity of
desensitizing a sensitive value to 𝑂(1).
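The constant-cost claim can be sketched as follows (an illustrative, standard bounded-Laplace construction of our own, not necessarily the paper's exact sampler): a Laplace variate is drawn directly, with an expected 𝑂(1) number of rejections to stay inside the domain, instead of computing a weight for every candidate output as a naive exponential mechanism would.

```python
import math
import random

def bounded_laplace(v, eps, lo, hi):
    """Draw v + Laplace(1/eps) noise, rejecting draws outside [lo, hi].
    Expected O(1) work per value, versus O(|domain|) weight computation
    for a naively implemented exponential mechanism."""
    while True:
        # A Laplace(1/eps) variate is the difference of two Exp(eps) variates.
        noise = random.expovariate(eps) - random.expovariate(eps)
        noisy = v + noise
        if lo <= noisy <= hi:
            return round(noisy)

random.seed(0)
samples = [bounded_laplace(50, 1.0, 0, 100) for _ in range(2000)]
assert all(0 <= s <= 100 for s in samples)  # never leaves the domain
```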
Optimizing the existing dLDP definition: We also optimize the
existing dLDP definition to overcome its limitations. We divide the
data domain into several partitions of length 𝜃. Then we introduce
two privacy parameters, 𝜖_𝑝𝑟𝑡 and 𝜖_𝑛𝑒𝑟, to adjust the probability
distribution of the desensitized value falling in different
partitions and the probability distribution within one partition,
respectively. We prove that the existing definition is just a
special case in which 𝜖_𝑝𝑟𝑡 and 𝜖_𝑛𝑒𝑟 satisfy a fixed proportional
relationship. By adjusting the ratio of 𝜖_𝑝𝑟𝑡 to 𝜖_𝑛𝑒𝑟, we can
always obtain higher accuracy than the existing dLDP under the same
privacy guarantee.
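One plausible instantiation of such a two-parameter definition (a sketch under our own weighting assumption, not the paper's exact mechanism) weights each output by its partition distance under 𝜖_𝑝𝑟𝑡 and its within-partition distance under 𝜖_𝑛𝑒𝑟:

```python
import math

def two_param_pmf(domain_size, v, theta, eps_prt, eps_ner):
    """Pr[output = y] proportional to
    exp(-eps_prt*|partition(y)-partition(v)| - eps_ner*|offset(y)-offset(v)|),
    where partition(x), offset(x) = divmod(x, theta). Illustrative only."""
    pv, rv = divmod(v, theta)
    w = []
    for y in range(domain_size):
        py, ry = divmod(y, theta)
        w.append(math.exp(-eps_prt * abs(py - pv) - eps_ner * abs(ry - rv)))
    z = sum(w)
    return [x / z for x in w]

# A large eps_prt with a tiny eps_ner concentrates mass in the true
# partition while keeping the distribution inside it nearly flat --
# a shape a single-parameter dLDP mechanism cannot express.
pmf = two_param_pmf(100, 37, theta=10, eps_prt=2.0, eps_ner=0.01)
assert sum(pmf[30:40]) > 0.7                    # mass in true partition
assert max(pmf[30:40]) / min(pmf[30:40]) < 1.1  # nearly flat inside it
```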
Introducing new order-preserving metrics: In addition to
quantifying the privacy of the order-preserving desensitization
algorithms with dLDP, we introduce new theoretical and experimental
metrics to quantify the order information preserved by desensitized
values. We define the proposed order-preserving desensitization
algorithms to be probabilistically order-preserving in theory: the
probability that any pair of desensitized values preserves the
original relative order is at least 𝛾. Besides, we introduce a
weighted Kendall coefficient, weighted by distance, to evaluate the
order of desensitized feature values experimentally.
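A distance-weighted Kendall coefficient of this kind can be sketched as follows (an illustrative form of our own; the paper's exact weighting may differ): each discordant pair is penalized in proportion to how far apart the original values are.

```python
def weighted_kendall(x, y):
    """Kendall-style correlation between original values x and desensitized
    values y, with each pair (i, j) weighted by |x[i] - x[j]| so that
    inversions between far-apart values count more. Illustrative only."""
    num = den = 0.0
    n = len(x)
    for i in range(n):
        for j in range(i + 1, n):
            w = abs(x[i] - x[j])
            concordant = (x[i] - x[j]) * (y[i] - y[j]) > 0
            num += w if concordant else -w
            den += w
    return num / den if den else 0.0

x = [1, 2, 3, 10]
assert weighted_kendall(x, x) == 1.0
# Swapping a nearby pair costs little; inverting a far-apart pair costs a lot.
assert weighted_kendall(x, [2, 1, 3, 10]) > weighted_kendall(x, [10, 2, 3, 1])
```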
Comprehensive Evaluation: We conduct comprehensive theoretical and
experimental evaluations to analyze the performance of OpBoost,
including all the designed order-preserving desensitization
algorithms. We evaluate the order preservation of the desensitized
values using all introduced metrics. We also conduct experiments on
public datasets for Binary Classification, Multiple Classification,
and Regression tasks. Both GBDT and XGBoost are implemented in
OpBoost. The experimental results show that OpBoost achieves
prediction accuracy close to, and sometimes even higher than, plain
models (e.g., a 1.0003× improvement over the plain XGBoost model),
which is superior to the existing LDP approaches.
2 PRELIMINARIES
2.1 Differential Privacy
Differential privacy [28] is the de facto privacy definition for
data disclosure, preventing attempts to learn private information
about any individual in a data release. In this work, we are
interested in local differential privacy [40], which allows each
user to perturb his sensitive data using a randomization mechanism
M such that the perturbed results from different data values will
be “close”.
Definition 2.1. (Local Differential Privacy, LDP). An algorithm M
satisfies 𝜖-LDP, where 𝜖 ≥ 0, if and only if for any inputs
𝑣, 𝑣′ ∈ D and any output 𝑦 ∈ 𝑅𝑎𝑛𝑔𝑒(M), we have
Pr[M(𝑣) = 𝑦] ≤ 𝑒^𝜖 · Pr[M(𝑣′) = 𝑦].
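As a concrete example (classic binary randomized response, not a mechanism from this paper), reporting the true bit with probability 𝑒^𝜖 / (1 + 𝑒^𝜖) satisfies 𝜖-LDP, and the definition can be checked directly:

```python
import math
import random

def randomized_response(v, eps):
    """Report the true bit v in {0, 1} with probability e^eps / (1 + e^eps),
    otherwise flip it; this satisfies eps-LDP."""
    p_true = math.exp(eps) / (1.0 + math.exp(eps))
    return v if random.random() < p_true else 1 - v

# Check Definition 2.1 directly: for every output y and inputs v, v',
# Pr[M(v) = y] <= e^eps * Pr[M(v') = y].
eps = 1.0
p = math.exp(eps) / (1.0 + math.exp(eps))
for pr_v, pr_vp in [(p, 1 - p), (1 - p, p)]:
    assert pr_v <= math.exp(eps) * pr_vp + 1e-12
```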
The parameter 𝜖 above is called the privacy budget; a smaller 𝜖
means stronger privacy protection. On the other hand, since all
pairs of sensitive data must satisfy the 𝜖-privacy guarantee for
the same 𝜖, LDP may hide too much information about a dataset, such
that utility might be insufficient for certain applications.
Distance-based LDP [11, 18, 38] was proposed to improve the utility,