Coresets for Vertical Federated Learning:
Regularized Linear Regression and K-Means
Clustering
Lingxiao Huang
Nanjing University
huanglingxiao1990@126.com
Zhize Li
Carnegie Mellon University
zhizeli@cmu.edu
Jialin Sun
Fudan University
sunjl20@fudan.edu.cn
Haoyu Zhao
Princeton University
haoyu@princeton.edu
Abstract
Vertical federated learning (VFL), where data features are stored distributively across multiple parties, is an important area in machine learning. However, the communication complexity of VFL is typically very high. In this paper, we propose a unified framework for communication-efficient VFL by constructing coresets in a distributed fashion. We study two important learning tasks in the VFL setting, regularized linear regression and $k$-means clustering, and apply our coreset framework to both problems. We theoretically show that using coresets can drastically alleviate the communication complexity, while nearly maintaining the solution quality. Numerical experiments are conducted to corroborate our theoretical findings.
1 Introduction
Federated learning (FL) [54, 40, 44, 36, 68] is a learning framework where multiple clients/parties collaboratively train a machine learning model under the coordination of a central server without exposing their raw data (i.e., each party's raw data is stored locally and never transferred). There are two large categories of FL, horizontal federated learning (HFL) and vertical federated learning (VFL), based on the distribution characteristics of the data. In HFL, different parties usually hold different datasets but all datasets share the same features; in VFL, all parties use the same dataset but different parties hold different subsets of the features (see Figure 1a).
Compared to HFL, VFL [74, 50, 71] is generally harder and requires more communication: since a single party cannot observe the full features, it must communicate with other parties to compute the loss and the gradient of even a single data point. This leads to two potential problems: (i) it may require a huge amount of communication to jointly train the machine learning model when the dataset is large; and (ii) the VFL procedure transfers information about the local data and may cause privacy leakage. Most of the VFL literature focuses on the privacy issue and on designing secure training procedures for different machine learning models in the VFL setting [28, 74, 72, 10]. However, the communication efficiency of the training procedure in VFL is somewhat underexplored. For unsupervised clustering problems, Ding et al. [19] propose constant approximation schemes for $k$-means clustering, but their communication complexity is linear in the dataset size. For linear regression, although the communication complexity can be improved to sublinear via sampling, such as SGD-type uniform sampling over the dataset [50, 74], the final performance is not comparable to that obtained using the whole dataset. Thus previous algorithms usually do not scale or perform well in big-data scenarios.²
* Authors are listed in alphabetical order.
36th Conference on Neural Information Processing Systems (NeurIPS 2022).
arXiv:2210.14664v1 [cs.LG] 26 Oct 2022
This leads us to consider the following question:
How to train machine learning models using sublinear communication complexity
in terms of the dataset size without sacrificing the performance in the vertical
federated learning (VFL) setting?
In this paper, we try to answer this question, and our method is based on the notion of coresets [27, 22, 23]. Roughly speaking, a coreset can be viewed as a small data summary of the original dataset, such that a machine learning model trained on the coreset performs similarly to the model trained on the full dataset. Therefore, as long as we can obtain a coreset in the VFL setting in a communication-efficient way, we can run existing algorithms on the coreset instead of the full dataset.
Our contribution
We study communication-efficient methods for vertical federated learning with an emphasis on scalability, and design a general paradigm through the lens of coresets. Concretely, we make the following key contributions:
1. We design a unified framework for coreset construction in the vertical federated learning setting (Section 3), which helps reduce the communication complexity (Theorem 2.5).
2. We study the regularized linear regression (Definition 2.1) and $k$-means clustering (Definition 2.2) problems in the VFL setting, and apply our unified coreset construction framework to them. We show that we can obtain $\varepsilon$-approximations for these two problems using only sublinear $o(n)$ communication under mild conditions, where $n$ is the size of the dataset (Sections 4 and 5).
3. We conduct numerical experiments to validate our theoretical results. The experiments corroborate our findings that using coresets can drastically reduce the communication complexity while maintaining the quality of the solution (Section 6). Moreover, compared to uniform sampling, applying our coresets achieves a better solution with the same or smaller communication complexity.
1.1 More related works
Federated learning
Federated learning was introduced by McMahan et al. [54] and has received increasing attention in recent years. Many works study the horizontal federated learning (HFL) setting, such as algorithms with multiple local update steps [54, 17, 39, 25, 56, 77], algorithms with communication compression [38, 55, 47, 45, 24, 46, 60, 21, 76, 61, 78], and algorithms with privacy preservation [70, 30, 79, 64, 48].
Vertical federated learning
Due to the difficulties of VFL, people have designed VFL algorithms for particular machine learning models, including linear regression [50, 74], logistic regression [75, 73, 29], gradient boosting trees [63, 11, 10], and $k$-means [19]. For $k$-means, Ding et al. [19] proposed an algorithm that computes the global centers based on the product of local centers, which requires $O(nT)$ communication complexity. For linear regression, Liu et al. [50] and Yang et al. [74] used uniform sampling to obtain unbiased gradient estimates and improve the communication efficiency, but the performance may not be as good as that obtained without sampling. Yang et al. [73] also applied uniform sampling to a quasi-Newton algorithm and improved the communication complexity for logistic regression. Other settings in VFL have also been studied, e.g., how to align the data among different parties [62], how to adopt asynchronous training [9, 26], and how to defend against attacks in VFL [49, 53]. In this work, we aim to develop communication-efficient algorithms to handle large-scale VFL problems.
Coreset
Coresets have been applied to a large family of problems in machine learning and statistics, including clustering [22, 7, 31, 15, 16], regression [20, 43, 6, 13, 34, 12], low-rank approximation [14], and mixture models [52, 33]. Specifically, Chhaya et al. [12] investigated coreset construction for regularized regression with different norms. Feldman and Langberg [22] and Braverman et al. [7] proposed an importance sampling framework for coreset construction for clustering (including $k$-means). The coreset size for $k$-means clustering has been improved by several follow-up works [31, 15, 16] to $\tilde{O}(k\varepsilon^{-4})$, and Cohen-Addad et al. [16] proved a lower bound of size $\Omega(k\varepsilon^{-2})$. Due to the mergeable property of coresets, there have been studies on coreset construction in the distributed/horizontal setting [2, 58, 1, 51]. To our knowledge, we are the first to consider coreset construction in VFL.

² In our numerical experiments (Section 6), we provide some results to justify this claim.

Figure 1: Illustration of coreset construction in VFL. (a) Communication model in VFL: parties $1, \dots, T$ hold the feature blocks $X^{(1)}, \dots, X^{(T)}$ of the $n$ data points (Party $T$ also holds $y$) and communicate only with the central server. (b) General coreset framework in VFL: each party $j$ sends a sampled index set $S^{(j)}$ to the server, which forms the coreset $S = \bigcup_i S^{(i)}$.
2 Problem Formulation/Model
In this section, we formally define our problems: coresets for vertical regularized linear regression
and coresets for vertical k-means clustering (Problem 1).
Vertical federated learning model.
We first introduce the model of vertical federated learning (VFL). Let $X \subset \mathbb{R}^d$ be a dataset of size $n$ that is vertically stored across $T$ data parties ($T \geq 2$). Concretely, we represent each point $x_i \in X$ by $x_i = (x_i^{(1)}, \ldots, x_i^{(T)})$ where $x_i^{(j)} \in \mathbb{R}^{d_j}$ ($j \in [T]$), and each party $j \in [T]$ holds a local dataset $X^{(j)} = \{x_i^{(j)}\}_{i \in [n]}$. Note that $\sum_{j \in [T]} d_j = d$. Additionally, if there is a label $y_i \in \mathbb{R}$ for each point $x_i \in X$, we assume the label vector $y \in \mathbb{R}^n$ is stored in Party $T$. The objective of vertical federated learning is to collaboratively solve certain training problems in the central server with a total communication complexity as small as possible. Similar to Ding et al. [19, Figure 1], we only allow communication between the central server and each of the $T$ parties, and require the central server to hold the final solution. Note that the central server can also be replaced with any party in practice. For the communication complexity, we assume that transmitting an integer/floating-point number costs 1 unit, and consequently, transmitting a $d$-dimensional vector costs $d$ communication units. See Figure 1a for an illustration.
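To make the data layout concrete, below is a minimal simulation sketch (our own illustration, not code from the paper) of vertically partitioning a feature matrix across $T$ parties; the helper name `split_features` and the contiguous-block split are illustrative assumptions.

```python
import numpy as np

def split_features(X, y, T):
    """Vertically partition an (n, d) matrix X across T parties.

    Party j holds the feature block X^{(j)}; following the VFL model above,
    the last party (Party T) also holds the label vector y.
    """
    n, d = X.shape
    # Split the d feature columns into T contiguous blocks, so sum_j d_j = d.
    blocks = np.array_split(np.arange(d), T)
    parties = []
    for j, cols in enumerate(blocks):
        local = {"X": X[:, cols]}          # local dataset X^{(j)}, shape (n, d_j)
        if j == T - 1 and y is not None:
            local["y"] = y                 # labels stored in Party T
        parties.append(local)
    return parties

# Toy usage: n = 1000 points, d = 12 features, T = 3 parties.
rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 12))
y = X @ np.ones(12) + 0.1 * rng.normal(size=1000)
parties = split_features(X, y, T=3)
assert sum(p["X"].shape[1] for p in parties) == X.shape[1]
```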
Vertical regularized linear regression and vertical k-means clustering.
In this paper, we consider the following two important machine learning problems in the VFL model.

Definition 2.1 (Vertical regularized linear regression (VRLR)). Given a dataset $X \subset \mathbb{R}^d$ together with labels $y \in \mathbb{R}^n$ in the VFL model and a regularization function $R: \mathbb{R}^d \to \mathbb{R}_{\geq 0}$, the goal of the vertical regularized linear regression problem (VRLR) is to compute a vector $\theta \in \mathbb{R}^d$ in the server that (approximately) minimizes
$$\mathrm{cost}^R(X, \theta) := \sum_{i \in [n]} \mathrm{cost}^R_i(X, \theta) = \sum_{i \in [n]} (x_i^\top \theta - y_i)^2 + R(\theta),$$
and the total communication complexity is as small as possible.
Definition 2.2 (Vertical k-means clustering (VKMC)). Given a dataset $X \subset \mathbb{R}^d$ in the VFL model and an integer $k \geq 1$, let $\mathcal{C}$ denote the collection of all $k$-center sets $C \subset \mathbb{R}^d$ with $|C| = k$, and let $d(\cdot, \cdot)$ denote the Euclidean distance. The goal of the vertical $k$-means clustering problem (VKMC) is to compute a $k$-center set $C \in \mathcal{C}$ in the server that (approximately) minimizes
$$\mathrm{cost}^C(X, C) := \sum_{i \in [n]} \mathrm{cost}^C_i(X, C) = \sum_{i \in [n]} d(x_i, C)^2 = \sum_{i \in [n]} \min_{c \in C} d(x_i, c)^2,$$
and the total communication complexity is as small as possible.
Ding et al. [19] proposed a similar vertical $k$-means clustering problem and provided constant approximation schemes. They additionally compute an assignment from all points $x_i$ to the solution $C$, which requires a communication complexity of at least $\Omega(nT)$. Due to the huge $n$, directly solving VRLR or VKMC is a non-trivial task and may require a large communication complexity. To this end, we introduce a powerful data-reduction technique, called coresets [27, 22, 23].
Coresets for VRLR and VKMC.
Roughly speaking, a coreset is a small summary of the original dataset that approximates the learning objective for every possible choice of learning parameters. We first define coresets for offline regularized linear regression and $k$-means clustering as follows. As mentioned in Section 1.1, both problems have been well studied in the literature [22, 7, 12, 31, 15, 16].
Definition 2.3 (Coresets for offline regularized linear regression). Given a dataset $X \subset \mathbb{R}^d$ together with labels $y \in \mathbb{R}^n$ and $\varepsilon \in (0, 1)$, a subset $S \subseteq [n]$ together with a weight function $w: S \to \mathbb{R}_{\geq 0}$ is called an $\varepsilon$-coreset for offline regularized linear regression if for any $\theta \in \mathbb{R}^d$,
$$\mathrm{cost}^R(S, \theta) := \sum_{i \in S} w(i) \cdot (x_i^\top \theta - y_i)^2 + R(\theta) \in (1 \pm \varepsilon) \cdot \mathrm{cost}^R(X, \theta).$$
Definition 2.4 (Coresets for offline k-means clustering). Given a dataset $X \subset \mathbb{R}^d$, an integer $k \geq 1$, and $\varepsilon \in (0, 1)$, a subset $S \subseteq [n]$ together with a weight function $w: S \to \mathbb{R}_{\geq 0}$ is called an $\varepsilon$-coreset for offline $k$-means clustering if for any $k$-center set $C \subset \mathbb{R}^d$,
$$\mathrm{cost}^C(S, C) := \sum_{i \in S} w(i) \cdot d(x_i, C)^2 \in (1 \pm \varepsilon) \cdot \mathrm{cost}^C(X, C).$$
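As a quick numerical illustration of Definition 2.4 (our own sketch, not from the paper): the snippet below compares the weighted cost of a uniformly sampled, reweighted subset against the full $k$-means cost on a few random center sets. A genuine $\varepsilon$-coreset must satisfy the $(1 \pm \varepsilon)$ bound for every center set; this naive check only probes random ones and uses plain uniform sampling rather than the importance sampling developed in Section 3.

```python
import numpy as np

def kmeans_cost(X, centers, weights=None):
    """Weighted k-means cost: sum_i w(i) * min_{c in centers} ||x_i - c||^2."""
    d2 = ((X[:, None, :] - centers[None, :, :]) ** 2).sum(-1).min(axis=1)
    return d2.sum() if weights is None else (weights * d2).sum()

rng = np.random.default_rng(0)
X = rng.normal(size=(5000, 2))

# Uniformly sampled subset S of size m, each index weighted by n/m.
m = 200
idx = rng.choice(len(X), size=m, replace=False)
S, w = X[idx], np.full(m, len(X) / m)

# Probe the (1 +/- eps) property on a few random 5-center sets.
for _ in range(3):
    C = rng.normal(size=(5, 2))
    ratio = kmeans_cost(S, C, w) / kmeans_cost(X, C)
    print(f"cost(S, C) / cost(X, C) = {ratio:.3f}")
```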
Now we are ready to state the following main problem.

Problem 1 (Coreset construction for VRLR and VKMC). Given a dataset $X \subset \mathbb{R}^d$ (together with labels $y \in \mathbb{R}^n$) in the VFL model and $\varepsilon \in (0, 1)$, our goal is to construct an $\varepsilon$-coreset for regularized linear regression (or $k$-means clustering) in the server, with as small a communication complexity as possible. See Figure 1b for an illustration.
Note that our coreset is a subset of indices, which is slightly different from prior work [27, 22, 23], where a coreset consists of weighted points. This is because we would like to reduce data transportation from the parties to the server due to privacy considerations. Specifically, if the communication schemes for VRLR and VKMC do not need to transport data, then we can avoid data transportation altogether by first applying our coreset construction scheme and then running the communication schemes on the coreset. Moreover, we have the following theorem that shows how coresets reduce the communication complexity in the VFL model; the proof is in Section C.
Theorem 2.5 (Coresets reduce the communication complexity for VRLR and VKMC). Given $\varepsilon \in (0, 1)$, suppose there exist
1. a communication scheme $\mathcal{A}$ that, given a (weighted) dataset $X \subset \mathbb{R}^d$ together with labels $y \in \mathbb{R}^n$ in the VFL model, computes an $\alpha$-approximate solution ($\alpha \geq 1$) for VRLR (or VKMC) in the server with a communication complexity $\Lambda(n)$;
2. a communication scheme $\mathcal{A}'$ that, given a (weighted) dataset $X \subset \mathbb{R}^d$ together with labels $y \in \mathbb{R}^n$ in the VFL model, constructs an $\varepsilon$-coreset for VRLR (or VKMC respectively) of size $m$ in the server with a communication complexity $\Lambda'$.
Then there exists a communication scheme that, given a (weighted) dataset $X \subset \mathbb{R}^d$ together with labels $y \in \mathbb{R}^n$ in the VFL model, computes a $(1 + 3\varepsilon)\alpha$-approximate solution for VRLR (or VKMC respectively) in the server with a communication complexity $\Lambda' + 2mT + \Lambda(m)$.
Usually, $\Lambda(m) = \Omega(mT)$ and $\Lambda'$ is small or comparable to $Tm$ (see Theorems 4.2 and 5.2 for examples). Consequently, the total communication complexity after introducing coresets is dominated by $\Lambda(m)$, which is much smaller than $\Lambda(n)$. Hence, coresets can efficiently reduce the communication complexity with only a slight sacrifice in the approximation ratio.
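For a rough sense of scale, with purely hypothetical numbers (not figures from our experiments): if $\Lambda(n) = \Theta(nT)$ with $n = 10^6$ and $T = 4$, running the base scheme on the full dataset transmits on the order of $4 \times 10^6$ units, whereas running it on a coreset of size $m = 10^3$ costs about $\Lambda' + 2mT + \Lambda(m) \approx \Lambda' + 8 \times 10^3 + 4 \times 10^3$ units, i.e., orders of magnitude less as long as $\Lambda'$ remains comparable to $Tm$.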
3 A Unified Scheme for VFL Coresets via Importance Sampling
In this section, we propose a unified communication scheme (Algorithm 1) that will be used as a meta-algorithm for solving Problem 1. We assume each party $j \in [T]$ holds a real number $g^{(j)}_i \geq 0$ for each data point $x^{(j)}_i$ in Algorithm 1; these numbers are computed locally for both VRLR (Algorithm 2) and VKMC (Algorithm 3). There are three communication rounds in Algorithm 1. In the first round (Lines 2-4), the server learns all "local total sensitivities" $G^{(j)}$, takes $m$ samples from $[T]$ with probability proportional to $G^{(j)}$, and sends $a_j$ to each party $j$, where $a_j$ is the number of local samples party $j$ should draw in the second round. In the second round (Lines 5-6), each party samples a collection $S^{(j)} \subseteq [n]$ of size $a_j$ with probability proportional to $g^{(j)}_i$, and the server obtains the union $S = \bigcup_{j \in [T]} S^{(j)}$. In the third round (Lines 7-8), the goal is to compute weights $w(i)$ for all samples. In the end, we obtain a weighted subset $(S, w)$. We propose the following theorem to analyze the performance of Algorithm 1 and show that $(S, w)$ is a coreset when the size $m$ is large enough.
Theorem 3.1 (The performance of Algorithm 1). The communication complexity of Algorithm 1 is $O(mT)$. Let $\varepsilon, \delta \in (0, 1/2)$ and let $k \geq 1$ be an integer. We have:
• Let $\zeta = \max_{i \in [n]} \left( \sup_{\theta \in \mathbb{R}^d} \frac{\mathrm{cost}^R_i(X,\theta)}{\mathrm{cost}^R(X,\theta)} \big/ \sum_{j \in [T]} g^{(j)}_i \right)$ and $m = O\!\left(\varepsilon^{-2} \zeta G \left(d^2 \log(\zeta G) + \log(1/\delta)\right)\right)$. With probability at least $1 - \delta$, $(S, w)$ is an $\varepsilon$-coreset for offline regularized linear regression.
• Let $\zeta = \max_{i \in [n]} \left( \sup_{C \in \mathcal{C}} \frac{\mathrm{cost}^C_i(X,C)}{\mathrm{cost}^C(X,C)} \big/ \sum_{j \in [T]} g^{(j)}_i \right)$ and $m = O\!\left(\varepsilon^{-2} \zeta G \left(dk \log(\zeta G) + \log(1/\delta)\right)\right)$. With probability at least $1 - \delta$, $(S, w)$ is an $\varepsilon$-coreset for offline $k$-means clustering.
The proof can be found in Appendix D. The main idea is to show that Algorithm 1 simulates a well-known importance sampling framework for offline coreset construction by [22, 7]. The term $\sup_{\theta \in \mathbb{R}^d} \frac{\mathrm{cost}^R_i(X,\theta)}{\mathrm{cost}^R(X,\theta)}$ (or $\sup_{C \in \mathcal{C}} \frac{\mathrm{cost}^C_i(X,C)}{\mathrm{cost}^C(X,C)}$) is called the sensitivity of point $x_i$ for VRLR (or VKMC), and represents the maximum contribution of $x_i$ over all possible parameters. Algorithm 1 uses $\sum_{j \in [T]} g^{(j)}_i$ to estimate the sensitivity of $x_i$, and hence $\zeta$ represents the maximum sensitivity gap over all points. The performance of Algorithm 1 mainly depends on the quality of these estimates $\sum_{j \in [T]} g^{(j)}_i$. As both $\zeta$ and the total sum $G = \sum_{i \in [n], j \in [T]} g^{(j)}_i$ become smaller, the required size $m$ becomes smaller. Specifically, if both $\zeta$ and $G$ depend only on the parameters $k, d, T$, the coreset size $m$ is independent of $n$, as expected. Combining with Theorem 2.5, we can heavily reduce the communication complexity for VRLR and VKMC.
Algorithm 1 A unified importance sampling for coreset construction in the VFL model
Input: Each party $j \in [T]$ holds data $x^{(j)}_i$ together with a real number $g^{(j)}_i \geq 0$; an integer $m \geq 1$
Output: a weighted collection $S \subseteq [n]$ of size $|S| \leq m$
1: procedure DIS($m$, $\{g^{(j)}_i : i \in [n], j \in [T]\}$)
2:   Each party $j \in [T]$ sends $G^{(j)} \leftarrow \sum_{i \in [n]} g^{(j)}_i$ to the server. ▷ 1st round begins
3:   The server computes $G = \sum_{j \in [T]} G^{(j)}$ and samples a multiset $A \subseteq [T]$ of $m$ samples, where each sample $j \in [T]$ is selected with probability $G^{(j)}/G$.
4:   The server sends $a_j \leftarrow \#_j A$ (the multiplicity of $j$ in $A$) to each party $j \in [T]$. ▷ 1st round ends
5:   Each party $j \in [T]$ samples a multiset $S^{(j)} \subseteq [n]$ of size $a_j$, where each sample $i \in [n]$ is selected with probability $g^{(j)}_i / G^{(j)}$, and sends $S^{(j)}$ to the server. ▷ 2nd round begins
6:   The server broadcasts the multiset $S \leftarrow \bigcup_{j \in [T]} S^{(j)}$ to all parties. ▷ 2nd round ends
7:   Each party $j \in [T]$ sends $\{g^{(j)}_i : i \in S\}$ to the server. ▷ 3rd round begins
8:   The server computes weights $w(i) \leftarrow G / \big(|S| \cdot \sum_{j \in [T]} g^{(j)}_i\big)$ for each $i \in S$. ▷ 3rd round ends
9:   return $(S, w)$
10: end procedure
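The following is a minimal single-machine simulation of Algorithm 1 (a sketch for intuition, not the implementation used in our experiments): the three communication rounds are mimicked by local array operations, the scores $g^{(j)}_i$ are assumed to be given, and the function name `vfl_importance_sampling` is our own.

```python
import numpy as np

def vfl_importance_sampling(g, m, seed=None):
    """Simulate Algorithm 1. g has shape (T, n): g[j, i] is party j's score g_i^(j).

    Returns the sampled multiset S (indices into [n]) and the weights w(i).
    """
    rng = np.random.default_rng(seed)
    T, n = g.shape

    # 1st round: each party reports G^(j) = sum_i g_i^(j); the server samples m
    # party indices with probability G^(j)/G and sends back the counts a_j.
    G_j = g.sum(axis=1)
    G = G_j.sum()
    a = rng.multinomial(m, G_j / G)

    # 2nd round: party j draws a_j indices with probability g[j, i]/G^(j) and
    # sends them to the server, which forms the multiset S = union_j S^(j).
    S = np.concatenate([
        rng.choice(n, size=a[j], p=g[j] / G_j[j]) for j in range(T) if a[j] > 0
    ])

    # 3rd round: parties send g[j, i] for i in S; the server computes
    # w(i) = G / (|S| * sum_j g[j, i]).
    w = G / (len(S) * g[:, S].sum(axis=0))
    return S, w

# Toy usage: T = 3 parties, n = 1000 points, uniform scores.
g = np.ones((3, 1000))
S, w = vfl_importance_sampling(g, m=50, seed=0)
print(len(S), w[:3])   # 50 samples; with uniform scores each weight is n/m = 20
```

With uniform scores every point is sampled with probability $1/n$ per draw, so the weights reduce to $n/m$; with non-uniform sensitivity scores (as produced by Algorithms 2 and 3), points with larger $\sum_j g^{(j)}_i$ are sampled more often and receive proportionally smaller weights.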
Privacy issue.
We consider the privacy of the proposed scheme from two aspects: coreset construction and model training. For the coreset construction part (Algorithm 1), the privacy leakage comes from the "sensitivity scores" $g^{(j)}_i$ of the data points in different parties. To tackle this problem, we can use secure aggregation [5] to transmit the sum $g_i = \sum_{j=1}^{T} g^{(j)}_i$ to the server without revealing the exact values of the $g^{(j)}_i$'s (Line 7 of Algorithm 1), so that the server only learns $(S, w)$ and the $G^{(j)}$'s. For the model training part, we can apply secure VFL algorithms where they exist, e.g., using homomorphic encryption on SAGA for regression (an extension from SGD to SAGA of the scheme in [28]).