Coresets for Vertical Federated Learning:
Regularized Linear Regression and K-Means
Clustering
Lingxiao Huang
Nanjing University
huanglingxiao1990@126.com
Zhize Li
Carnegie Mellon University
zhizeli@cmu.edu
Jialin Sun
Fudan University
sunjl20@fudan.edu.cn
Haoyu Zhao
Princeton University
haoyu@princeton.edu
Abstract
Vertical federated learning (VFL), where data features are stored distributively across multiple parties, is an important area in machine learning. However, the communication complexity of VFL is typically very high. In this paper, we propose a unified framework for communication-efficient VFL by constructing coresets in a distributed fashion. We study two important learning tasks in the VFL setting, regularized linear regression and $k$-means clustering, and apply our coreset framework to both problems. We theoretically show that using coresets can drastically alleviate the communication complexity, while nearly maintaining the solution quality. Numerical experiments are conducted to corroborate our theoretical findings.
1 Introduction
Federated learning (FL) [54, 40, 44, 36, 68] is a learning framework where multiple clients/parties collaboratively train a machine learning model under the coordination of a central server without exposing their raw data (i.e., each party's raw data is stored locally and never transferred). There are two large categories of FL, horizontal federated learning (HFL) and vertical federated learning (VFL), based on the distribution characteristics of the data. In HFL, different parties usually hold different datasets but all datasets share the same features; in VFL, all parties use the same dataset but different parties hold different subsets of the features (see Figure 1a).
Compared to HFL, VFL [74, 50, 71] is generally harder and requires more communication: since a single party cannot observe the full features, it must communicate with other parties to compute the loss and the gradient of even a single data point. This leads to two potential problems: (i) it may require a huge amount of communication to jointly train the machine learning model when the dataset is large; and (ii) the VFL procedure transfers information about the local data and may cause privacy leakage. Most of the VFL literature focuses on the privacy issue and on designing secure training procedures for different machine learning models in the VFL setting [28, 74, 72, 10]. However, the communication efficiency of the training procedure in VFL is somewhat underexplored. For unsupervised clustering problems, Ding et al. [19] propose constant approximation schemes for $k$-means clustering, but their communication complexity is linear in the dataset size. For linear regression, although the communication complexity can be improved to sublinear via sampling, such as SGD-type uniform sampling over the dataset [50, 74], the final performance is not comparable to that obtained using the whole dataset. Thus previous algorithms usually do not scale or perform well in big-data scenarios.²
* Authors are listed in alphabetical order.
36th Conference on Neural Information Processing Systems (NeurIPS 2022).
arXiv:2210.14664v1 [cs.LG] 26 Oct 2022
This leads us to consider the following question:
How to train machine learning models using sublinear communication complexity
in terms of the dataset size without sacrificing the performance in the vertical
federated learning (VFL) setting?
In this paper, we try to answer this question, and our method is based on the notion of coresets [27, 22, 23]. Roughly speaking, a coreset can be viewed as a small data summary of the original dataset, such that a machine learning model trained on the coreset performs similarly to the model trained on the full dataset. Therefore, as long as we can obtain a coreset in the VFL setting in a communication-efficient way, we can run existing algorithms on the coreset instead of the full dataset.
Our contribution
We study communication-efficient methods for vertical federated learning with an emphasis on scalability, and design a general paradigm through the lens of coresets. Concretely, we make the following key contributions:
1. We design a unified framework for coreset construction in the vertical federated learning setting (Section 3), which helps reduce the communication complexity (Theorem 2.5).
2. We study the regularized linear regression (Definition 2.1) and $k$-means clustering (Definition 2.2) problems in the VFL setting, and apply our unified coreset construction framework to them. We show that we can obtain $\varepsilon$-approximations for these two problems using only sublinear $o(n)$ communication under mild conditions, where $n$ is the size of the dataset (Sections 4 and 5).
3. We conduct numerical experiments to validate our theoretical results. The experiments corroborate our findings that using coresets can drastically reduce the communication complexity while maintaining the quality of the solution (Section 6). Moreover, compared to uniform sampling, applying our coresets achieves a better solution with the same or smaller communication complexity.
1.1 More related works
Federated learning
Federated learning was introduced by McMahan et al. [54] and has received increasing attention in recent years. Many works study the horizontal federated learning (HFL) setting, such as algorithms with multiple local update steps [54, 17, 39, 25, 56, 77], algorithms with communication compression [38, 55, 47, 45, 24, 46, 60, 21, 76, 61, 78], and algorithms with privacy preservation [70, 30, 79, 64, 48].
Vertical federated learning
Due to the difficulties of VFL, people have designed VFL algorithms for particular machine learning models, including linear regression [50, 74], logistic regression [75, 73, 29], gradient boosting trees [63, 11, 10], and $k$-means [19]. For $k$-means, Ding et al. [19] proposed an algorithm that computes the global centers based on the product of local centers, which requires $O(nT)$ communication complexity. For linear regression, Liu et al. [50] and Yang et al. [74] used uniform sampling to obtain unbiased gradient estimates and improve the communication efficiency, but the performance may not be as good as that obtained without sampling. Yang et al. [73] also applied uniform sampling to a quasi-Newton algorithm and improved the communication complexity for logistic regression. Other settings in VFL have also been studied, e.g., how to align the data among different parties [62], how to adopt asynchronous training [9, 26], and how to defend against attacks in VFL [49, 53]. In this work, we aim to develop communication-efficient algorithms to handle large-scale VFL problems.
Coreset
Coresets have been applied to a large family of problems in machine learning and statistics, including clustering [22, 7, 31, 15, 16], regression [20, 43, 6, 13, 34, 12], low-rank approximation [14], and mixture models [52, 33]. Specifically, Chhaya et al. [12] investigated coreset construction for regularized regression with different norms. Feldman and Langberg [22] and Braverman et al. [7] proposed an importance sampling framework for coreset construction for clustering (including $k$-means). The coreset size for $k$-means clustering has been improved by several follow-up works [31, 15, 16] to $\tilde{O}(k\varepsilon^{-4})$, and Cohen-Addad et al. [16] proved a lower bound of size $\Omega(k\varepsilon^{-2})$. Due to the mergeable property of coresets, there have been studies on coreset construction in the distributed/horizontal setting [2, 58, 1, 51]. To our knowledge, we are the first to consider coreset construction in VFL.

² In our numerical experiments (Section 6), we provide some results to justify this claim.

Figure 1: Illustration of coreset construction in VFL. (a) Communication model in VFL: parties $1, \dots, T$ hold the feature blocks $X^{(1)}, \dots, X^{(T)}$ of the $n$ data points (Party $T$ also holds $y$) and communicate only with the central server. (b) General coreset framework in VFL: each party $j$ sends a sampled index set $S^{(j)}$ to the server, which forms the coreset $S = \bigcup_i S^{(i)}$.
2 Problem Formulation/Model
In this section, we formally define our problems: coresets for vertical regularized linear regression
and coresets for vertical k-means clustering (Problem 1).
Vertical federated learning model.
We first introduce the model of vertical federated learning (VFL). Let $X \subset \mathbb{R}^d$ be a dataset of size $n$ that is vertically stored across $T$ data parties ($T \geq 2$). Concretely, we represent each point $x_i \in X$ by $x_i = (x_i^{(1)}, \ldots, x_i^{(T)})$ where $x_i^{(j)} \in \mathbb{R}^{d_j}$ ($j \in [T]$), and each party $j \in [T]$ holds a local dataset $X^{(j)} = \{x_i^{(j)}\}_{i \in [n]}$. Note that $\sum_{j \in [T]} d_j = d$. Additionally, if there is a label $y_i \in \mathbb{R}$ for each point $x_i \in X$, we assume the label vector $y \in \mathbb{R}^n$ is stored in Party $T$. The objective of vertical federated learning is to collaboratively solve certain training problems in the central server with a total communication complexity as small as possible. Similar to Ding et al. [19, Figure 1], we only allow communication between the central server and each of the $T$ parties, and require the central server to hold the final solution. Note that the central server can also be replaced with any party in practice. For the communication complexity, we assume that transmitting an integer/floating-point number costs 1 unit, and consequently, transmitting a $d$-dimensional vector costs $d$ communication units. See Figure 1a for an illustration.
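To make the data layout concrete, below is a minimal simulation sketch (our own illustration, not code from the paper) of vertically partitioning a feature matrix across $T$ parties; the helper name `split_features` and the contiguous-block split are illustrative assumptions.

```python
import numpy as np

def split_features(X, y, T):
    """Vertically partition an (n, d) matrix X across T parties.

    Party j holds the feature block X^{(j)}; following the VFL model above,
    the last party (Party T) also holds the label vector y.
    """
    n, d = X.shape
    # Split the d feature columns into T contiguous blocks, so sum_j d_j = d.
    blocks = np.array_split(np.arange(d), T)
    parties = []
    for j, cols in enumerate(blocks):
        local = {"X": X[:, cols]}          # local dataset X^{(j)}, shape (n, d_j)
        if j == T - 1 and y is not None:
            local["y"] = y                 # labels stored in Party T
        parties.append(local)
    return parties

# Toy usage: n = 1000 points, d = 12 features, T = 3 parties.
rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 12))
y = X @ np.ones(12) + 0.1 * rng.normal(size=1000)
parties = split_features(X, y, T=3)
assert sum(p["X"].shape[1] for p in parties) == X.shape[1]
```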
Vertical regularized linear regression and vertical k-means clustering.
In this paper, we consider the following two important machine learning problems in the VFL model.

Definition 2.1 (Vertical regularized linear regression (VRLR)). Given a dataset $X \subset \mathbb{R}^d$ together with labels $y \in \mathbb{R}^n$ in the VFL model and a regularization function $R: \mathbb{R}^d \to \mathbb{R}_{\geq 0}$, the goal of the vertical regularized linear regression problem (VRLR) is to compute a vector $\theta \in \mathbb{R}^d$ in the server that (approximately) minimizes
$$\mathrm{cost}^R(X, \theta) := \sum_{i \in [n]} \mathrm{cost}^R_i(X, \theta) = \sum_{i \in [n]} (x_i^\top \theta - y_i)^2 + R(\theta),$$
and the total communication complexity is as small as possible.
Definition 2.2 (Vertical k-means clustering (VKMC)). Given a dataset $X \subset \mathbb{R}^d$ in the VFL model and an integer $k \geq 1$, let $\mathcal{C}$ denote the collection of all $k$-center sets $C \subset \mathbb{R}^d$ with $|C| = k$, and let $d(\cdot, \cdot)$ denote the Euclidean distance. The goal of the vertical $k$-means clustering problem (VKMC) is to compute a $k$-center set $C \in \mathcal{C}$ in the server that (approximately) minimizes
$$\mathrm{cost}^C(X, C) := \sum_{i \in [n]} \mathrm{cost}^C_i(X, C) = \sum_{i \in [n]} d(x_i, C)^2 = \sum_{i \in [n]} \min_{c \in C} d(x_i, c)^2,$$
and the total communication complexity is as small as possible.
Ding et al. [19] proposed a similar vertical $k$-means clustering problem and provided constant approximation schemes. They additionally compute an assignment from all points $x_i$ to the solution $C$, which requires a communication complexity of at least $\Omega(nT)$. Due to the huge $n$, directly solving VRLR or VKMC is a non-trivial task and may require a large communication complexity. To this end, we introduce a powerful data-reduction technique, called coresets [27, 22, 23].
Coresets for VRLR and VKMC.
Roughly speaking, a coreset is a small summary of the original dataset that approximates the learning objective for every possible choice of learning parameters. We first define coresets for offline regularized linear regression and $k$-means clustering as follows. As mentioned in Section 1.1, both problems have been well studied in the literature [22, 7, 12, 31, 15, 16].
Definition 2.3 (Coresets for offline regularized linear regression). Given a dataset $X \subset \mathbb{R}^d$ together with labels $y \in \mathbb{R}^n$ and $\varepsilon \in (0, 1)$, a subset $S \subseteq [n]$ together with a weight function $w: S \to \mathbb{R}_{\geq 0}$ is called an $\varepsilon$-coreset for offline regularized linear regression if for any $\theta \in \mathbb{R}^d$,
$$\mathrm{cost}^R(S, \theta) := \sum_{i \in S} w(i) \cdot (x_i^\top \theta - y_i)^2 + R(\theta) \in (1 \pm \varepsilon) \cdot \mathrm{cost}^R(X, \theta).$$
Definition 2.4 (Coresets for offline k-means clustering). Given a dataset $X \subset \mathbb{R}^d$, an integer $k \geq 1$, and $\varepsilon \in (0, 1)$, a subset $S \subseteq [n]$ together with a weight function $w: S \to \mathbb{R}_{\geq 0}$ is called an $\varepsilon$-coreset for offline $k$-means clustering if for any $k$-center set $C \subset \mathbb{R}^d$,
$$\mathrm{cost}^C(S, C) := \sum_{i \in S} w(i) \cdot d(x_i, C)^2 \in (1 \pm \varepsilon) \cdot \mathrm{cost}^C(X, C).$$
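As a quick numerical illustration of Definition 2.4 (our own sketch, not from the paper): the snippet below compares the weighted cost of a uniformly sampled, reweighted subset against the full $k$-means cost on a few random center sets. A genuine $\varepsilon$-coreset must satisfy the $(1 \pm \varepsilon)$ bound for every center set; this naive check only probes random ones and uses plain uniform sampling rather than the importance sampling developed in Section 3.

```python
import numpy as np

def kmeans_cost(X, centers, weights=None):
    """Weighted k-means cost: sum_i w(i) * min_{c in centers} ||x_i - c||^2."""
    d2 = ((X[:, None, :] - centers[None, :, :]) ** 2).sum(-1).min(axis=1)
    return d2.sum() if weights is None else (weights * d2).sum()

rng = np.random.default_rng(0)
X = rng.normal(size=(5000, 2))

# Uniformly sampled subset S of size m, each index weighted by n/m.
m = 200
idx = rng.choice(len(X), size=m, replace=False)
S, w = X[idx], np.full(m, len(X) / m)

# Probe the (1 +/- eps) property on a few random 5-center sets.
for _ in range(3):
    C = rng.normal(size=(5, 2))
    ratio = kmeans_cost(S, C, w) / kmeans_cost(X, C)
    print(f"cost(S, C) / cost(X, C) = {ratio:.3f}")
```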
Now we are ready to state the following main problem.

Problem 1 (Coreset construction for VRLR and VKMC). Given a dataset $X \subset \mathbb{R}^d$ (together with labels $y \in \mathbb{R}^n$) in the VFL model and $\varepsilon \in (0, 1)$, our goal is to construct an $\varepsilon$-coreset for regularized linear regression (or $k$-means clustering) in the server, with as small a communication complexity as possible. See Figure 1b for an illustration.
Note that our coreset is a subset of indices, which is slightly different from prior work [27, 22, 23], where a coreset consists of weighted points. This is because we would like to reduce data transportation from the parties to the server due to privacy considerations. Specifically, if the communication schemes for VRLR and VKMC do not need to transport data, then we can avoid data transportation altogether by first applying our coreset construction scheme and then running the communication schemes on the coreset. Moreover, we have the following theorem that shows how coresets reduce the communication complexity in the VFL model; the proof is in Section C.
Theorem 2.5 (Coresets reduce the communication complexity for VRLR and VKMC). Given $\varepsilon \in (0, 1)$, suppose there exist
1. a communication scheme $\mathcal{A}$ that, given a (weighted) dataset $X \subset \mathbb{R}^d$ together with labels $y \in \mathbb{R}^n$ in the VFL model, computes an $\alpha$-approximate solution ($\alpha \geq 1$) for VRLR (or VKMC) in the server with a communication complexity $\Lambda(n)$;
2. a communication scheme $\mathcal{A}'$ that, given a (weighted) dataset $X \subset \mathbb{R}^d$ together with labels $y \in \mathbb{R}^n$ in the VFL model, constructs an $\varepsilon$-coreset for VRLR (or VKMC respectively) of size $m$ in the server with a communication complexity $\Lambda'$.
Then there exists a communication scheme that, given a (weighted) dataset $X \subset \mathbb{R}^d$ together with labels $y \in \mathbb{R}^n$ in the VFL model, computes a $(1 + 3\varepsilon)\alpha$-approximate solution for VRLR (or VKMC respectively) in the server with a communication complexity $\Lambda' + 2mT + \Lambda(m)$.
Usually, $\Lambda(m) = \Omega(mT)$ and $\Lambda'$ is small or comparable to $Tm$ (see Theorems 4.2 and 5.2 for examples). Consequently, the total communication complexity after introducing coresets is dominated by $\Lambda(m)$, which is much smaller than $\Lambda(n)$. Hence, coresets can efficiently reduce the communication complexity with only a slight sacrifice in the approximation ratio.
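For a rough sense of scale, with purely hypothetical numbers (not figures from our experiments): if $\Lambda(n) = \Theta(nT)$ with $n = 10^6$ and $T = 4$, running the base scheme on the full dataset transmits on the order of $4 \times 10^6$ units, whereas running it on a coreset of size $m = 10^3$ costs about $\Lambda' + 2mT + \Lambda(m) \approx \Lambda' + 8 \times 10^3 + 4 \times 10^3$ units, i.e., orders of magnitude less as long as $\Lambda'$ remains comparable to $Tm$.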
3 A Unified Scheme for VFL Coresets via Importance Sampling
In this section, we propose a unified communication scheme (Algorithm 1) that will be used as a meta-algorithm for solving Problem 1. We assume each party $j \in [T]$ holds a real number $g^{(j)}_i \geq 0$ for each data point $x^{(j)}_i$ in Algorithm 1; these numbers are computed locally for both VRLR (Algorithm 2) and VKMC (Algorithm 3). There are three communication rounds in Algorithm 1. In the first round (Lines 2-4), the server learns all "local total sensitivities" $G^{(j)}$, takes $m$ samples from $[T]$ with probability proportional to $G^{(j)}$, and sends $a_j$ to each party $j$, where $a_j$ is the number of local samples party $j$ should draw in the second round. In the second round (Lines 5-6), each party samples a collection $S^{(j)} \subseteq [n]$ of size $a_j$ with probability proportional to $g^{(j)}_i$, and the server obtains the union $S = \bigcup_{j \in [T]} S^{(j)}$. In the third round (Lines 7-8), the goal is to compute weights $w(i)$ for all samples. In the end, we obtain a weighted subset $(S, w)$. We propose the following theorem to analyze the performance of Algorithm 1 and show that $(S, w)$ is a coreset when the size $m$ is large enough.
Theorem 3.1 (The performance of Algorithm 1). The communication complexity of Algorithm 1 is $O(mT)$. Let $\varepsilon, \delta \in (0, 1/2)$ and let $k \geq 1$ be an integer. We have:
• Let $\zeta = \max_{i \in [n]} \left( \sup_{\theta \in \mathbb{R}^d} \frac{\mathrm{cost}^R_i(X,\theta)}{\mathrm{cost}^R(X,\theta)} \big/ \sum_{j \in [T]} g^{(j)}_i \right)$ and $m = O\!\left(\varepsilon^{-2} \zeta G \left(d^2 \log(\zeta G) + \log(1/\delta)\right)\right)$. With probability at least $1 - \delta$, $(S, w)$ is an $\varepsilon$-coreset for offline regularized linear regression.
• Let $\zeta = \max_{i \in [n]} \left( \sup_{C \in \mathcal{C}} \frac{\mathrm{cost}^C_i(X,C)}{\mathrm{cost}^C(X,C)} \big/ \sum_{j \in [T]} g^{(j)}_i \right)$ and $m = O\!\left(\varepsilon^{-2} \zeta G \left(dk \log(\zeta G) + \log(1/\delta)\right)\right)$. With probability at least $1 - \delta$, $(S, w)$ is an $\varepsilon$-coreset for offline $k$-means clustering.
The proof can be found in Appendix D. The main idea is to show that Algorithm 1 simulates a well-known importance sampling framework for offline coreset construction by [22, 7]. The term $\sup_{\theta \in \mathbb{R}^d} \frac{\mathrm{cost}^R_i(X,\theta)}{\mathrm{cost}^R(X,\theta)}$ (or $\sup_{C \in \mathcal{C}} \frac{\mathrm{cost}^C_i(X,C)}{\mathrm{cost}^C(X,C)}$) is called the sensitivity of point $x_i$ for VRLR (or VKMC), and represents the maximum contribution of $x_i$ over all possible parameters. Algorithm 1 uses $\sum_{j \in [T]} g^{(j)}_i$ to estimate the sensitivity of $x_i$, and hence $\zeta$ represents the maximum sensitivity gap over all points. The performance of Algorithm 1 mainly depends on the quality of these estimates $\sum_{j \in [T]} g^{(j)}_i$. As both $\zeta$ and the total sum $G = \sum_{i \in [n], j \in [T]} g^{(j)}_i$ become smaller, the required size $m$ becomes smaller. Specifically, if both $\zeta$ and $G$ depend only on the parameters $k, d, T$, the coreset size $m$ is independent of $n$, as expected. Combining with Theorem 2.5, we can heavily reduce the communication complexity for VRLR and VKMC.
Algorithm 1 A unified importance sampling for coreset construction in the VFL model
Input: Each party $j \in [T]$ holds data $x^{(j)}_i$ together with a real number $g^{(j)}_i \geq 0$; an integer $m \geq 1$
Output: a weighted collection $S \subseteq [n]$ of size $|S| \leq m$
1: procedure DIS($m$, $\{g^{(j)}_i : i \in [n], j \in [T]\}$)
2:   Each party $j \in [T]$ sends $G^{(j)} \leftarrow \sum_{i \in [n]} g^{(j)}_i$ to the server. ▷ 1st round begins
3:   The server computes $G = \sum_{j \in [T]} G^{(j)}$ and samples a multiset $A \subseteq [T]$ of $m$ samples, where each sample $j \in [T]$ is selected with probability $G^{(j)}/G$.
4:   The server sends $a_j \leftarrow \#_j A$ (the multiplicity of $j$ in $A$) to each party $j \in [T]$. ▷ 1st round ends
5:   Each party $j \in [T]$ samples a multiset $S^{(j)} \subseteq [n]$ of size $a_j$, where each sample $i \in [n]$ is selected with probability $g^{(j)}_i / G^{(j)}$, and sends $S^{(j)}$ to the server. ▷ 2nd round begins
6:   The server broadcasts the multiset $S \leftarrow \bigcup_{j \in [T]} S^{(j)}$ to all parties. ▷ 2nd round ends
7:   Each party $j \in [T]$ sends $\{g^{(j)}_i : i \in S\}$ to the server. ▷ 3rd round begins
8:   The server computes weights $w(i) \leftarrow G / \big(|S| \cdot \sum_{j \in [T]} g^{(j)}_i\big)$ for each $i \in S$. ▷ 3rd round ends
9:   return $(S, w)$
10: end procedure
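The following is a minimal single-machine simulation of Algorithm 1 (a sketch for intuition, not the implementation used in our experiments): the three communication rounds are mimicked by local array operations, the scores $g^{(j)}_i$ are assumed to be given, and the function name `vfl_importance_sampling` is our own.

```python
import numpy as np

def vfl_importance_sampling(g, m, seed=None):
    """Simulate Algorithm 1. g has shape (T, n): g[j, i] is party j's score g_i^(j).

    Returns the sampled multiset S (indices into [n]) and the weights w(i).
    """
    rng = np.random.default_rng(seed)
    T, n = g.shape

    # 1st round: each party reports G^(j) = sum_i g_i^(j); the server samples m
    # party indices with probability G^(j)/G and sends back the counts a_j.
    G_j = g.sum(axis=1)
    G = G_j.sum()
    a = rng.multinomial(m, G_j / G)

    # 2nd round: party j draws a_j indices with probability g[j, i]/G^(j) and
    # sends them to the server, which forms the multiset S = union_j S^(j).
    S = np.concatenate([
        rng.choice(n, size=a[j], p=g[j] / G_j[j]) for j in range(T) if a[j] > 0
    ])

    # 3rd round: parties send g[j, i] for i in S; the server computes
    # w(i) = G / (|S| * sum_j g[j, i]).
    w = G / (len(S) * g[:, S].sum(axis=0))
    return S, w

# Toy usage: T = 3 parties, n = 1000 points, uniform scores.
g = np.ones((3, 1000))
S, w = vfl_importance_sampling(g, m=50, seed=0)
print(len(S), w[:3])   # 50 samples; with uniform scores each weight is n/m = 20
```

With uniform scores every point is sampled with probability $1/n$ per draw, so the weights reduce to $n/m$; with non-uniform sensitivity scores (as produced by Algorithms 2 and 3), points with larger $\sum_j g^{(j)}_i$ are sampled more often and receive proportionally smaller weights.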
Privacy issue.
We consider the privacy of the proposed scheme from two aspects: coreset construction and model training. For the coreset construction part (Algorithm 1), the privacy leakage comes from the "sensitivity scores" $g^{(j)}_i$ of the data points in different parties. To tackle this problem, we can use secure aggregation [5] to transmit the sum $g_i = \sum_{j=1}^{T} g^{(j)}_i$ to the server without revealing the exact values of the $g^{(j)}_i$'s (Line 7 of Algorithm 1), so that the server only learns $(S, w)$ and the $G^{(j)}$'s. For the model training part, we can apply secure VFL algorithms where they exist, e.g., using homomorphic encryption on SAGA for regression (an extension from SGD to SAGA of the scheme in [28]).