Depersonalized Federated Learning: Tackling
Statistical Heterogeneity by Alternating Stochastic
Gradient Descent
Yujie Zhou1,2,3, Zhidu Li1,2,3, Tong Tang1,2,3, Ruyan Wang1,2,3
1Chongqing University of Posts and Telecommunications, School of Communication and Information Engineering, China
2Advanced Network and Intelligent Interconnection Technology Key Laboratory of Chongqing Education Commission of China
3Key Laboratory of Ubiquitous Sensing and Networking in Chongqing, China
Email: lizd@cqupt.edu.cn
Abstract—Federated learning (FL), which has gained increasing attention recently, enables distributed devices to cooperatively train a common machine learning (ML) model for intelligent inference without data sharing. However, problems in practical networks, such as non-independent-and-identically-distributed (non-iid) raw data and limited bandwidth, give rise to slow and unstable convergence of the FL training process. To address these issues, we propose a new FL method that can significantly mitigate statistical heterogeneity through a depersonalization mechanism. In particular, we decouple the global and local optimization objectives by alternating stochastic gradient descent, thus reducing the accumulated variance of the local update phases and accelerating FL convergence. We then analyze the proposed method in detail and show that it converges at a sublinear speed in the general non-convex setting. Finally, experiments on public datasets are conducted to verify the effectiveness of the proposed method.
Index Terms—Federated learning, depersonalization mechanism, statistical heterogeneity, convergence analysis
I. INTRODUCTION
Due to the tremendous amount of data on edge devices, machine learning (ML), as a data-driven technology, is widely used to enhance the intelligence of applications and networks [1, 2]. However, traditional ML, which requires centralized training, is unsuitable for this scenario because of privacy concerns and the communication cost of raw data transmission. Thus, federated learning (FL), a distributed optimization paradigm, is designed to train ML models across multiple clients while keeping data decentralized.
To train ML models distributively, one can directly use classical Parallel-SGD [3]: at each iteration, every client sends its local stochastic gradient to the central server and receives the aggregated gradient in return. Nevertheless, this procedure still incurs unaffordable communication costs, especially when training large models such as deep neural networks. To reduce these costs, the popular FedAvg algorithm [4] was proposed, in which each client trains an individual model via several local SGD steps and uploads it, in place of gradients, to the central server for aggregation.

(This work was supported in part by the National Natural Science Foundation of China under grants 61901078, 61871062, 61771082 and U20A20157, in part by the Natural Science Foundation of Chongqing under grant cstc2020jcyj-zdxmX0024, and in part by the University Innovation Research Group of Chongqing under grant CXQT20017.)
Although FedAvg reduces the communication overhead of Parallel-SGD severalfold, some key challenges emerge in deploying the framework: (i) since massive numbers of clients may join an FL training process, it is impractical for the communication links to support all nodes uploading data simultaneously; (ii) since participants come from various regions, data on the clients are usually non-independent-and-identically-distributed (non-iid, known as statistical heterogeneity). Recently, some efforts have been devoted to analyzing and improving FL performance under (i) partial communication (a.k.a. client scheduling) and on (ii) non-iid data. Works [5–7] studied FedAvg convergence. Then [8–11] proposed FedAvg-based methods for incremental performance enhancement via update-rule or sampling-policy modifications. For instance, FedProx [9] introduced a proximal operator to obtain surrogate local objectives that tackle the heterogeneity problem empirically. Unlike the above works, which focus on global performance improvement, other studies [12–14] generate a group of personalized FL models in place of a single global model for all clients on non-iid data to ensure fairness and stylization. For example, the authors of [12] proposed a common personalized FL framework with inherent fairness and robustness, and [13] proposed a bi-level learning framework for extracting personalized models from the global model.
Note that extra local information is implicit in the customized FL models generated by personalized FL approaches. Utilizing this information may help reduce the negative impact caused by (i) client sampling and (ii) statistical heterogeneity. Thus, in this paper we are inspired to devise a new method that improves global FL performance by modifying the local update rule to reversely use model-customization
techniques [12–14]. To take advantage of this personalization information, we design a double-gradient-descent rule for the local update stage, in which each client generates two decoupled local models (rather than a single one) to separate the global update direction from the local one. In particular, the personalized local model is obtained by directly optimizing the local objective, while the globalized local model is obtained by subtracting the personalized local model from the original one. Therefore, each sampled client uploads its globalized model in place of the original local one to reduce the accumulated local deviations, thereby accelerating and stabilizing convergence.
We summarize the key contributions as follows:
• We propose a novel method called FedDeper that improves FL performance on non-iid data through a depersonalization update mechanism and can be widely adapted to a variety of scenarios.
• We theoretically analyze the convergence of the proposed method, for both the personalized and aggregated models, in the general non-convex setting.
• We provide experimental results that evaluate the convergence of the proposed algorithm against baselines and study the factors influencing convergence.
The remainder of this paper is organized as follows. We start
by discussing the impact of data heterogeneity on the canonical
FedAvg method in Section II. Then, we propose a new
FedDeper method in Section III and analyze its convergence in
Section IV. Next, we present and discuss experimental results
in Section V. Finally, we conclude the paper in Section VI.
II. PRELIMINARIES AND BACKGROUNDS
In an FL framework, for all participating clients (denoted by $\mathcal{N}$ with cardinality $n := |\mathcal{N}|$), we have the following optimization objective:
$$\min_{x \in \mathbb{R}^d} f(x) := \frac{1}{n}\sum_{i \in \mathcal{N}} f_i(x), \tag{1}$$
where $d$ denotes the dimension of the vector $x$, and $f_i(x) := \mathbb{E}_{\vartheta_i \sim \mathcal{D}_i}[f(x; \vartheta_i)]$ represents the local objective function on each client $i$. Here, $f_i$ is generally the loss function defined by the local ML model, and $\vartheta_i$ denotes a data sample belonging to the local dataset $\mathcal{D}_i$. In this paper, we mainly deal with Problem (1) [8–10]; all results can be extended to the weighted version by the techniques in [6, 11].
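As a toy instance of objective (1) (our own illustration, not an example from the paper), take quadratic local objectives $f_i(x) = \frac{1}{2}\|x - c_i\|^2$ with client-specific centers $c_i$; the averaged global objective is then minimized at the mean of the local minimizers:

```python
import numpy as np

# Hypothetical quadratic local objectives f_i(x) = 0.5 * ||x - c_i||^2.
# The global objective f(x) = (1/n) * sum_i f_i(x) is minimized at mean(c_i).
rng = np.random.default_rng(0)
n, d = 5, 3                         # number of clients, model dimension
centers = rng.normal(size=(n, d))   # each client's local minimizer c_i

def f_i(x, i):
    return 0.5 * np.sum((x - centers[i]) ** 2)

def f(x):
    return np.mean([f_i(x, i) for i in range(n)])

x_star = centers.mean(axis=0)       # minimizer of the averaged objective
print("f at the global minimizer:", f(x_star))
print("f at client 0's local minimizer:", f(centers[0]))
```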
We depict one round of the typical FedAvg algorithm for solving (1) in three parts. In the $k$-th round: (i) Broadcasting: the server uniformly samples a subset of $m$ clients (i.e., $\mathcal{U}^k \subseteq \mathcal{N}$ with $m := |\mathcal{U}^k| \leq n$, $k \in \{0, 1, \dots, K-1\}$ for any integer $K \geq 1$) and broadcasts the aggregated global model $x^k$ to each client $i \in \mathcal{U}^k$. (ii) Local Update: each selected client $i$ initializes its local model $v^k_{i,0}$ as $x^k$ and then trains the model by performing stochastic gradient descent (SGD) with step size $\eta$ on $f_i(\cdot)$,
$$v^k_{i,j+1} \leftarrow v^k_{i,j} - \eta\, g_i(v^k_{i,j}), \quad j \in \{0, 1, \dots, \tau-1\}, \tag{2}$$
where $v^k_{i,j}$ denotes the local model after the $j$-th SGD step and $g_i(\cdot)$ represents the stochastic gradient of $f_i(\cdot)$ w.r.t.
$v_i$. When the number of local steps reaches a certain threshold $\tau$, client $i$ uploads its local model to the server. (iii) Global Aggregation: the server aggregates all received local models to derive a new global model for the next round,
$$x^{k+1} \leftarrow \frac{1}{m}\sum_{i \in \mathcal{U}^k} v^k_{i,\tau}. \tag{3}$$
The whole process terminates when the number of communication rounds reaches the upper limit $K$, yielding a global model trained by all participating clients.

Fig. 1. Federated learning with depersonalization: (i) communication (broadcasting & aggregating), (ii) computation (local updating): (a) optimization, (b) depersonalization, and (c) initialization. Indeed, mechanism (a) integrates (b), which integrates (c).
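To make the three phases concrete, below is a minimal NumPy sketch of FedAvg on the toy quadratic objectives used earlier; the hyperparameters, losses, and sampling scheme are illustrative assumptions rather than the paper's experimental setup.

```python
import numpy as np

rng = np.random.default_rng(1)
n, m, d, tau, eta, K = 20, 5, 3, 10, 0.1, 50    # illustrative hyperparameters
centers = rng.normal(size=(n, d))               # toy non-iid local minimizers

def grad_i(v, i):
    # stochastic gradient of the toy local objective f_i(v) = 0.5 * ||v - c_i||^2
    return (v - centers[i]) + 0.01 * rng.normal(size=d)

x = np.zeros(d)                                  # global model x^0
for k in range(K):
    sampled = rng.choice(n, size=m, replace=False)   # (i) broadcast to a sampled subset U^k
    local_models = []
    for i in sampled:
        v = x.copy()                                 # initialize v^k_{i,0} = x^k
        for j in range(tau):                         # (ii) tau local SGD steps, Eq. (2)
            v = v - eta * grad_i(v, i)
        local_models.append(v)
    x = np.mean(local_models, axis=0)                # (iii) aggregation, Eq. (3)
print("final global model:", x)
```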
Note that the stochastic gradient $g_i(\cdot)$ in Process (2) can be written more precisely as $g_i(\cdot) = \nabla f(\cdot; \vartheta_i)$ with $\vartheta_i \sim \mathcal{D}_i$. Due to the high heterogeneity, the local datasets $\{\mathcal{D}_i\}_{i \in \mathcal{N}}$ follow unbalanced data distributions, and the corresponding gradients consequently differ in expectation:
$$\mathbb{E}_{\vartheta_i \sim \mathcal{D}_i}[\nabla f(\cdot; \vartheta_i)] \neq \mathbb{E}_{\vartheta_j \sim \mathcal{D}_j}[\nabla f(\cdot; \vartheta_j)], \quad \forall i, j \in \mathcal{N},\ i \neq j. \tag{4}$$
This means that performing SGD (2) under (4) leads each client to seek its local solution $v_i^*$ with $\nabla f_i(v_i^*) = 0$, which deviates from the global one $x^*$ with $\nabla f(x^*) = 0$; the optimization objectives are inconsistent in the sense that $\bigcap_{i \in \mathcal{N}} \ker \nabla f_i = \emptyset$, hence resulting in slow convergence. Moreover, in practically deployed FL frameworks, the number of participants is usually very large while the bandwidth or communication capability of the server is limited, i.e., only a small fraction of clients can be selected to join a training round: $m \ll n$. This fact (partial communication) aggravates the inconsistency of the local models, further leading to unreliable training and poor performance.
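The following toy check (our own illustration, not the paper's data) makes inequality (4) tangible: two clients whose samples are drawn around different means produce gradients that disagree in expectation at the same point, so their local solutions $v_i^*$ drift away from $x^*$.

```python
import numpy as np

rng = np.random.default_rng(2)
d = 3
# Hypothetical non-iid setup: client i's samples vartheta_i are drawn around a
# client-specific mean, and f(x; vartheta) = 0.5 * ||x - vartheta||^2.
means = {0: np.full(d, -2.0), 1: np.full(d, +2.0)}

def expected_grad(x, i, num_samples=10_000):
    samples = means[i] + rng.normal(size=(num_samples, d))
    return np.mean(x - samples, axis=0)   # Monte Carlo estimate of E[grad f(x; vartheta_i)]

x = np.zeros(d)
print(expected_grad(x, 0))   # roughly +2 in every coordinate
print(expected_grad(x, 1))   # roughly -2 in every coordinate: expectations disagree, as in (4)
```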
III. FEDERATED LEARNING WITH DEPERSONALIZATION
To alleviate the negative impact of non-iid data and partial
communication on FL, we propose a new Depersonalized
FL (FedDeper) algorithm. In brief, we aim to generate local
approximations of the global model on clients, then upload
and aggregate them in place of the original local models for
stabilization and acceleration, as shown in Fig. 1.
Fig. 2. The local update phase of FedDeper: each selected client alternately updates the globalized and personalized models in a round. The original local (personalized) update aims to reach the local optimum $v_i^*$, while the corrected update steers the SGD update towards the global optimum $x^*$ by reversely using the local update (depersonalization mechanism).
A. Decoupling Global and Local Updating
Recall that performing (2) aims to minimize the local objective $f_i(\cdot)$, which usually disagrees with the global objective (1), resulting in slow convergence. To deal with this issue, we propose a new depersonalization mechanism to decouple the two objectives. In particular, to better optimize the objective $f(\cdot)$, we introduce a more globalized local model in place of the originally uploaded one to mitigate local variance accumulation across aggregation rounds. Different from Process (2), each selected client $i$ performs SGD on the surrogate loss function
$$f_i^\rho(y_i) := f_i(y_i) + \frac{\rho}{2\eta}\,\|v_i + y_i - 2x\|^2, \tag{5}$$
where $\frac{\rho}{2\eta}$ is a constant balancing the two terms, and $v_i$, held fixed while updating $y_i$, denotes the personalized local model (the analogue of the original local model), which aims to reach the local optimum $v_i^*$ via (2). Thus, we expect to obtain two models in this phase: $v_i$ is kept locally to search for the local solution $v_i^*$, while $y_i$, which estimates the global model locally (i.e., $y_i \approx x$), is uploaded to the aggregator to accelerate FL convergence.
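A minimal sketch of the surrogate objective (5) and one alternating local step, continuing the toy quadratic setup (the local loss and the values of $\eta$, $\rho$ are illustrative assumptions):

```python
import numpy as np

eta, rho = 0.1, 0.05          # illustrative step size and penalty
d = 3
c_i = np.ones(d)              # toy local minimizer: f_i(w) = 0.5 * ||w - c_i||^2

def grad_f_i(w):
    return w - c_i

def grad_f_rho_i(y, v, x):
    # gradient of the surrogate (5): grad f_i(y) + (rho/eta) * (v + y - 2x)
    return grad_f_i(y) + (rho / eta) * (v + y - 2.0 * x)

x = np.zeros(d)               # received global model
y = x.copy()                  # globalized local model, initialized as x
v = 0.5 * c_i                 # personalized local model carried over from earlier rounds

# one alternating step: update y on the surrogate (5) and v on f_i as in (2)
y = y - eta * grad_f_rho_i(y, v, x)
v = v - eta * grad_f_i(v)
print("y after one step:", y)
print("v after one step:", v)
```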
B. Using Local Information Reversely with Regularizer
As shown in Fig. 2, the globalized model $y_i$ is updated by reversely using the personalized one $v_i$. When minimizing (5), the regularizer $\|v_i + y_i - 2x\|^2$ restricts $y_i$ to a region slightly away from the local optimum. More specifically, we regard $v_i - x$ and $y_i - x$ as two directions in the update. Since $v_i$ is a personalized solution for the client, $v_i - x$ contains abundant information about local deviations. To avoid introducing excessive variance, we penalize the term $y_i - x$ so as to reflect $y_i$ toward the direction opposite to $v_i - x$. Nevertheless, in suppressing bias with the regularizer $\|v_i + y_i - 2x\|^2$, we also eliminate the global update direction implied in $v_i - x$, which further explains the necessity of carefully tuning $\rho$ and $\eta$ (trading off variance reduction against convergence acceleration).
Algorithm 1 FedDeper: Depersonalized Federated Learning
Input: learning rate $\eta$, penalty $\rho$, mixing rate $\lambda$, local steps $\tau$, total rounds $K$, initialized models $x^0 = y^0_{0,0} = v^0_{0,0}$
1: for each round $k = 0, 1, \dots, K-1$ do
2:   sample clients $\mathcal{U}^k \subseteq \mathcal{N}$ uniformly
3:   send $x^k$ to selected clients $i \in \mathcal{U}^k$
4:   for each client $i \in \mathcal{U}^k$ in parallel do
5:     initialize $y^k_{i,0} \leftarrow x^k$
6:     for $j = 0, 1, \dots, \tau-1$ do
7:       $y^k_{i,j+1} \leftarrow y^k_{i,j} - \eta\, g^\rho_i(y^k_{i,j})$
8:       $v^k_{i,j+1} \leftarrow v^k_{i,j} - \eta\, g_i(v^k_{i,j})$
9:     end for
10:    $v^{k+1}_{i,0} \leftarrow (1-\lambda)\, v^k_{i,\tau} + \lambda\, y^k_{i,\tau}$
11:    send $y^k_{i,\tau} - x^k$ to the server
12:   end for
13:   each client $i \in \mathcal{N}\setminus\mathcal{U}^k$ updates $v^{k+1}_{i,0} \leftarrow v^k_{i,0}$
14:   $x^{k+1} \leftarrow x^k + \frac{1}{|\mathcal{U}^k|}\sum_{i \in \mathcal{U}^k}\big(y^k_{i,\tau} - x^k\big)$
15: end for
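The sketch below mirrors Algorithm 1 on the toy quadratic objectives used earlier; it is a minimal illustration under assumed hyperparameters, not the paper's reference implementation.

```python
import numpy as np

rng = np.random.default_rng(3)
n, m, d = 20, 5, 3
eta, rho, lam, tau, K = 0.1, 0.05, 0.7, 10, 50   # illustrative hyperparameters
centers = rng.normal(size=(n, d))                # toy non-iid local minimizers

def grad_i(w, i):                                # stochastic gradient of f_i
    return (w - centers[i]) + 0.01 * rng.normal(size=d)

x = np.zeros(d)                                  # global model x^0
v = np.zeros((n, d))                             # personalized models v_i, kept on the clients

for k in range(K):
    sampled = rng.choice(n, size=m, replace=False)       # Line 2: sample U^k
    deltas = []
    for i in sampled:
        y = x.copy()                                      # Line 5: y^k_{i,0} <- x^k
        vi = v[i].copy()
        for j in range(tau):                              # Lines 6-9: alternating SGD
            y = y - eta * (grad_i(y, i) + (rho / eta) * (vi + y - 2 * x))   # Line 7
            vi = vi - eta * grad_i(vi, i)                                   # Line 8
        v[i] = (1 - lam) * vi + lam * y                   # Line 10: mix for the next round
        deltas.append(y - x)                              # Line 11: send y^k_{i,tau} - x^k
    # Line 13 is implicit: unselected clients keep v[i] unchanged.
    x = x + np.mean(deltas, axis=0)                       # Line 14: server update
print("final global model:", x)
```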
C. Retaining Historical Information for Personalized Model
In the current local update stage, the model $y_i$ is initialized as the received global model $x$, while $v_i$ is initialized as the $y_i$ trained in the previous stage. The two models are then updated alternately by first-order optimizers, i.e., each selected client $i$ performs SGD on $f_i(\cdot)$ as in (2) to obtain a personalized local model and on (5) to obtain a locally approximated globalized model, respectively. However, this initialization policy makes the new $v_i$ forget all the accumulated local information contained in the previous $v_i$. To make the best use of the historical models, we let $v_i$ partially inherit its preceding value, i.e.,
$$v^{k+1}_{i,0} \leftarrow (1-\lambda)\, v^k_{i,\tau} + \lambda\, y^k_{i,\tau}, \tag{6}$$
where $v^k_{i,\tau}$ and $y^k_{i,\tau}$ are the models trained in the $k$-th round, $v^{k+1}_{i,0}$ is the initial model of the $(k+1)$-th round, and $\lambda \in [\frac{1}{2}, 1]$ is the mixing rate controlling how much local deviation information is retained. To be specific, $\lambda$ limits the distance between the initial $v_i$ and $y_i$ within a certain range to avoid the destructively large correction that $\|v_i + y_i - 2x\|^2$ would generate as the difference between $v_i$ and $x$ (e.g., $\|v_i - x\|$) grows monotonically during updating.
Remark 1. If $\lambda$ is set in the finite interval $[\frac{1}{2}, 1]$, we claim that there exist suitable $\eta$, $\rho$ that enable the global model $x$ to converge to the global optimum.
D. Procedure of FedDeper and Further Discussion
The proposed method is summarized in Algorithm 1. In Lines 7-8, we update the globalized local model $y$ and the personalized one $v$ alternately. Line 7 shows the $j$-th step of local SGD, where $v$ enters the stochastic (mini-batch) gradient $g^\rho_i$ of $f^\rho_i$. Line 8 shows a step of SGD approaching the optimum of the local objective. In Line 10, we initialize the personalized model for the next round by mixing the two trained models as in (6).