Depersonalized Federated Learning: Tackling
Statistical Heterogeneity by Alternating Stochastic
Gradient Descent
Yujie Zhou1,2,3, Zhidu Li1,2,3, Tong Tang1,2,3, Ruyan Wang1,2,3
1Chongqing University of Posts and Telecommunications, School of Communication and Information Engineering, China
2Advanced Network and Intelligent Interconnection Technology Key Laboratory of Chongqing Education Commission of China
3Key Laboratory of Ubiquitous Sensing and Networking in Chongqing, China
Email: lizd@cqupt.edu.cn
Abstract—Federated learning (FL), which has gained increasing attention recently, enables distributed devices to cooperatively train a common machine learning (ML) model for intelligent inference without data sharing. However, problems in practical networks, such as non-independent-and-identically-distributed (non-iid) raw data and limited bandwidth, give rise to slow and unstable convergence of the FL training process. To address these issues, we propose a new FL method that can significantly mitigate statistical heterogeneity through a depersonalization mechanism. In particular, we decouple the global and local optimization objectives by alternating stochastic gradient descent, thus reducing the accumulated variance of the local update phases and accelerating FL convergence. We then analyze the proposed method in detail and show that it converges at a sublinear speed in the general non-convex setting. Finally, experiments on public datasets are conducted to verify the effectiveness of the proposed method.
Index Terms—Federated learning, depersonalization mechanism, statistical heterogeneity, convergence analysis
I. INTRODUCTION
Due to the tremendous amount of data on edge devices, machine learning (ML), as a data-driven technology, is widely used to enhance the intelligence of applications and networks [1, 2]. However, traditional ML, which requires centralized training, is unsuitable for this scenario because of privacy concerns and the communication cost of raw data transmission. Thus, federated learning (FL), a distributed optimization paradigm, is designed to train ML models across multiple clients while keeping data decentralized.
To train ML models distributively, one can directly use classical Parallel-SGD [3]: at each iteration, every client sends its local stochastic gradient to the central server and receives the aggregated gradient in return. Nevertheless, this procedure still incurs unaffordable communication costs, especially when training large models such as deep neural networks. To reduce these costs, the popular FedAvg algorithm [4] was proposed, in which each client trains an individual model via several local SGD steps and uploads it, in place of gradients, to the central server for aggregation.

(This work was supported in part by the National Natural Science Foundation of China under grants 61901078, 61871062, 61771082 and U20A20157, in part by the Natural Science Foundation of Chongqing under grant cstc2020jcyj-zdxmX0024, and in part by the University Innovation Research Group of Chongqing under grant CXQT20017.)
Although FedAvg reduces the communication overhead of Parallel-SGD severalfold, some key challenges emerge in deploying the framework: (i) since massive numbers of clients may join an FL training process, it is impractical for the communication links to support all nodes uploading data simultaneously; (ii) since participants come from various regions, data on the clients are usually non-independent-and-identically-distributed (non-iid, known as statistical heterogeneity). Recently, some efforts have been devoted to analyzing and improving FL performance under (i) partial communication (a.k.a. client scheduling) and on (ii) non-iid data. Works [5–7] studied FedAvg convergence. Then [8–11] proposed FedAvg-based methods for incremental performance enhancement via update-rule or sampling-policy modifications. For instance, FedProx [9] introduced a proximal operator to obtain surrogate local objectives that tackle the heterogeneity problem empirically. Unlike the above works, which focus on global performance improvement, other studies [12–14] generate a group of personalized FL models in place of a single global model for all clients on non-iid data to ensure fairness and stylization. For example, the authors of [12] proposed a common personalized FL framework with inherent fairness and robustness, and [13] proposed a bi-level learning framework for extracting personalized models from the global model.
Note that extra local information is implicit in the customized FL models generated by personalized FL approaches. Utilizing this information may help reduce the negative impact caused by (i) client sampling and (ii) statistical heterogeneity. Thus, in this paper we are inspired to devise a new method that improves global FL performance by modifying the local update rule to reversely use model-customization
techniques [12–14]. To take advantage of this personalization information, we design a double-gradient-descent rule for the local update stage, in which each client generates two decoupled local models (rather than a single one) to separate the global update direction from the local one. In particular, the personalized local model is obtained by directly optimizing the local objective, while the globalized local model is obtained by subtracting the personalized local model from the original one. Therefore, each sampled client uploads its globalized model in place of the original local one to reduce the accumulated local deviations, thereby accelerating and stabilizing convergence.
We summarize the key contributions as follows:
• We propose a novel method called FedDeper that improves FL performance on non-iid data through a depersonalization update mechanism and can be widely adapted to a variety of scenarios.
• We theoretically analyze the convergence of the proposed method, for both the personalized and aggregated models, in the general non-convex setting.
• We provide experimental results that evaluate the convergence of the proposed algorithm against baselines and study the factors influencing convergence.
The remainder of this paper is organized as follows. We start
by discussing the impact of data heterogeneity on the canonical
FedAvg method in Section II. Then, we propose a new
FedDeper method in Section III and analyze its convergence in
Section IV. Next, we present and discuss experimental results
in Section V. Finally, we conclude the paper in Section VI.
II. PRELIMINARIES AND BACKGROUNDS
In an FL framework, for all participating clients (denoted by $\mathcal{N}$ with cardinality $n := |\mathcal{N}|$), we have the following optimization objective:
$$\min_{x \in \mathbb{R}^d} f(x) := \frac{1}{n}\sum_{i \in \mathcal{N}} f_i(x), \tag{1}$$
where $d$ denotes the dimension of the vector $x$, and $f_i(x) := \mathbb{E}_{\vartheta_i \sim \mathcal{D}_i}[f(x; \vartheta_i)]$ represents the local objective function on each client $i$. Here, $f_i$ is generally the loss function defined by the local ML model, and $\vartheta_i$ denotes a data sample belonging to the local dataset $\mathcal{D}_i$. In this paper, we mainly deal with Problem (1) [8–10]; all results can be extended to the weighted version by the techniques in [6, 11].
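As a toy instance of objective (1) (our own illustration, not an example from the paper), take quadratic local objectives $f_i(x) = \frac{1}{2}\|x - c_i\|^2$ with client-specific centers $c_i$; the averaged global objective is then minimized at the mean of the local minimizers:

```python
import numpy as np

# Hypothetical quadratic local objectives f_i(x) = 0.5 * ||x - c_i||^2.
# The global objective f(x) = (1/n) * sum_i f_i(x) is minimized at mean(c_i).
rng = np.random.default_rng(0)
n, d = 5, 3                         # number of clients, model dimension
centers = rng.normal(size=(n, d))   # each client's local minimizer c_i

def f_i(x, i):
    return 0.5 * np.sum((x - centers[i]) ** 2)

def f(x):
    return np.mean([f_i(x, i) for i in range(n)])

x_star = centers.mean(axis=0)       # minimizer of the averaged objective
print("f at the global minimizer:", f(x_star))
print("f at client 0's local minimizer:", f(centers[0]))
```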
We depict one round of the typical FedAvg algorithm for solving (1) in three parts. In the $k$-th round: (i) Broadcasting: the server uniformly samples a subset of $m$ clients (i.e., $\mathcal{U}^k \subseteq \mathcal{N}$ with $m := |\mathcal{U}^k| \leq n$, $k \in \{0, 1, \dots, K-1\}$ for any integer $K \geq 1$) and broadcasts the aggregated global model $x^k$ to each client $i \in \mathcal{U}^k$. (ii) Local Update: each selected client $i$ initializes its local model $v^k_{i,0}$ as $x^k$ and then trains the model by performing stochastic gradient descent (SGD) with step size $\eta$ on $f_i(\cdot)$,
$$v^k_{i,j+1} \leftarrow v^k_{i,j} - \eta\, g_i(v^k_{i,j}), \quad j \in \{0, 1, \dots, \tau-1\}, \tag{2}$$
where $v^k_{i,j}$ denotes the local model after the $j$-th SGD step and $g_i(\cdot)$ represents the stochastic gradient of $f_i(\cdot)$ w.r.t.
$v_i$. When the number of local steps reaches a certain threshold $\tau$, client $i$ uploads its local model to the server. (iii) Global Aggregation: the server aggregates all received local models to derive a new global model for the next round,
$$x^{k+1} \leftarrow \frac{1}{m}\sum_{i \in \mathcal{U}^k} v^k_{i,\tau}. \tag{3}$$
The whole process terminates when the number of communication rounds reaches the upper limit $K$, yielding a global model trained by all participating clients.

Fig. 1. Federated learning with depersonalization: (i) communication (broadcasting & aggregating), (ii) computation (local updating): (a) optimization, (b) depersonalization, and (c) initialization. Indeed, mechanism (a) integrates (b), which integrates (c).
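To make the three phases concrete, below is a minimal NumPy sketch of FedAvg on the toy quadratic objectives used earlier; the hyperparameters, losses, and sampling scheme are illustrative assumptions rather than the paper's experimental setup.

```python
import numpy as np

rng = np.random.default_rng(1)
n, m, d, tau, eta, K = 20, 5, 3, 10, 0.1, 50    # illustrative hyperparameters
centers = rng.normal(size=(n, d))               # toy non-iid local minimizers

def grad_i(v, i):
    # stochastic gradient of the toy local objective f_i(v) = 0.5 * ||v - c_i||^2
    return (v - centers[i]) + 0.01 * rng.normal(size=d)

x = np.zeros(d)                                  # global model x^0
for k in range(K):
    sampled = rng.choice(n, size=m, replace=False)   # (i) broadcast to a sampled subset U^k
    local_models = []
    for i in sampled:
        v = x.copy()                                 # initialize v^k_{i,0} = x^k
        for j in range(tau):                         # (ii) tau local SGD steps, Eq. (2)
            v = v - eta * grad_i(v, i)
        local_models.append(v)
    x = np.mean(local_models, axis=0)                # (iii) aggregation, Eq. (3)
print("final global model:", x)
```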
Note that the stochastic gradient $g_i(\cdot)$ in Process (2) can be written more precisely as $g_i(\cdot) = \nabla f(\cdot; \vartheta_i)$ with $\vartheta_i \sim \mathcal{D}_i$. Due to the high heterogeneity, the local datasets $\{\mathcal{D}_i\}_{i \in \mathcal{N}}$ follow unbalanced data distributions, and the corresponding gradients consequently differ in expectation:
$$\mathbb{E}_{\vartheta_i \sim \mathcal{D}_i}[\nabla f(\cdot; \vartheta_i)] \neq \mathbb{E}_{\vartheta_j \sim \mathcal{D}_j}[\nabla f(\cdot; \vartheta_j)], \quad \forall i, j \in \mathcal{N},\ i \neq j. \tag{4}$$
This means that performing SGD (2) under (4) leads each client to seek its local solution $v_i^*$ with $\nabla f_i(v_i^*) = 0$, which deviates from the global one $x^*$ with $\nabla f(x^*) = 0$; the optimization objectives are inconsistent in the sense that $\bigcap_{i \in \mathcal{N}} \ker \nabla f_i = \emptyset$, hence resulting in slow convergence. Moreover, in practically deployed FL frameworks, the number of participants is usually very large while the bandwidth or communication capability of the server is limited, i.e., only a small fraction of clients can be selected to join a training round: $m \ll n$. This fact (partial communication) aggravates the inconsistency of the local models, further leading to unreliable training and poor performance.
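The following toy check (our own illustration, not the paper's data) makes inequality (4) tangible: two clients whose samples are drawn around different means produce gradients that disagree in expectation at the same point, so their local solutions $v_i^*$ drift away from $x^*$.

```python
import numpy as np

rng = np.random.default_rng(2)
d = 3
# Hypothetical non-iid setup: client i's samples vartheta_i are drawn around a
# client-specific mean, and f(x; vartheta) = 0.5 * ||x - vartheta||^2.
means = {0: np.full(d, -2.0), 1: np.full(d, +2.0)}

def expected_grad(x, i, num_samples=10_000):
    samples = means[i] + rng.normal(size=(num_samples, d))
    return np.mean(x - samples, axis=0)   # Monte Carlo estimate of E[grad f(x; vartheta_i)]

x = np.zeros(d)
print(expected_grad(x, 0))   # roughly +2 in every coordinate
print(expected_grad(x, 1))   # roughly -2 in every coordinate: expectations disagree, as in (4)
```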
III. FEDERATED LEARNING WITH DEPERSONALIZATION
To alleviate the negative impact of non-iid data and partial
communication on FL, we propose a new Depersonalized
FL (FedDeper) algorithm. In brief, we aim to generate local
approximations of the global model on clients, then upload
and aggregate them in place of the original local models for
stabilization and acceleration, as shown in Fig. 1.
Fig. 2. The local update phase of FedDeper: each selected client alternately updates the globalized and personalized models in a round. The original local (personalized) update aims to reach the local optimum $v_i^*$, while the corrected update steers the SGD update towards the global optimum $x^*$ by reversely using the local update (depersonalization mechanism).
A. Decoupling Global and Local Updating
Recall that performing (2) aims to minimize the local objective $f_i(\cdot)$, which usually disagrees with the global objective (1), resulting in slow convergence. To deal with this issue, we propose a new depersonalization mechanism to decouple the two objectives. In particular, to better optimize the objective $f(\cdot)$, we introduce a more globalized local model in place of the originally uploaded one to mitigate local variance accumulation across aggregation rounds. Different from Process (2), each selected client $i$ performs SGD on the surrogate loss function
$$f_i^\rho(y_i) := f_i(y_i) + \frac{\rho}{2\eta}\,\|v_i + y_i - 2x\|^2, \tag{5}$$
where $\frac{\rho}{2\eta}$ is a constant balancing the two terms, and $v_i$, held fixed while updating $y_i$, denotes the personalized local model (the analogue of the original local model), which aims to reach the local optimum $v_i^*$ via (2). Thus, we expect to obtain two models in this phase: $v_i$ is kept locally to search for the local solution $v_i^*$, while $y_i$, which estimates the global model locally (i.e., $y_i \approx x$), is uploaded to the aggregator to accelerate FL convergence.
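A minimal sketch of the surrogate objective (5) and one alternating local step, continuing the toy quadratic setup (the local loss and the values of $\eta$, $\rho$ are illustrative assumptions):

```python
import numpy as np

eta, rho = 0.1, 0.05          # illustrative step size and penalty
d = 3
c_i = np.ones(d)              # toy local minimizer: f_i(w) = 0.5 * ||w - c_i||^2

def grad_f_i(w):
    return w - c_i

def grad_f_rho_i(y, v, x):
    # gradient of the surrogate (5): grad f_i(y) + (rho/eta) * (v + y - 2x)
    return grad_f_i(y) + (rho / eta) * (v + y - 2.0 * x)

x = np.zeros(d)               # received global model
y = x.copy()                  # globalized local model, initialized as x
v = 0.5 * c_i                 # personalized local model carried over from earlier rounds

# one alternating step: update y on the surrogate (5) and v on f_i as in (2)
y = y - eta * grad_f_rho_i(y, v, x)
v = v - eta * grad_f_i(v)
print("y after one step:", y)
print("v after one step:", v)
```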
B. Using Local Information Reversely with Regularizer
As shown in Fig. 2, the globalized model $y_i$ is updated by reversely using the personalized one $v_i$. When minimizing (5), the regularizer $\|v_i + y_i - 2x\|^2$ restricts $y_i$ to a region slightly away from the local optimum. More specifically, we regard $v_i - x$ and $y_i - x$ as two directions in the update. Since $v_i$ is a personalized solution for the client, $v_i - x$ contains abundant information about local deviations. To avoid introducing excessive variance, we penalize the term $y_i - x$ so as to reflect $y_i$ toward the direction opposite to $v_i - x$. Nevertheless, in suppressing bias with the regularizer $\|v_i + y_i - 2x\|^2$, we also eliminate the global update direction implied in $v_i - x$, which further explains the necessity of carefully tuning $\rho$ and $\eta$ (trading off variance reduction against convergence acceleration).
Algorithm 1 FedDeper: Depersonalized Federated Learning
Input: learning rate $\eta$, penalty $\rho$, mixing rate $\lambda$, local steps $\tau$, total rounds $K$, initialized models $x^0 = y^0_{0,0} = v^0_{0,0}$
1: for each round $k = 0, 1, \dots, K-1$ do
2:   sample clients $\mathcal{U}^k \subseteq \mathcal{N}$ uniformly
3:   send $x^k$ to selected clients $i \in \mathcal{U}^k$
4:   for each client $i \in \mathcal{U}^k$ in parallel do
5:     initialize $y^k_{i,0} \leftarrow x^k$
6:     for $j = 0, 1, \dots, \tau-1$ do
7:       $y^k_{i,j+1} \leftarrow y^k_{i,j} - \eta\, g^\rho_i(y^k_{i,j})$
8:       $v^k_{i,j+1} \leftarrow v^k_{i,j} - \eta\, g_i(v^k_{i,j})$
9:     end for
10:    $v^{k+1}_{i,0} \leftarrow (1-\lambda)\, v^k_{i,\tau} + \lambda\, y^k_{i,\tau}$
11:    send $y^k_{i,\tau} - x^k$ to the server
12:   end for
13:   each client $i \in \mathcal{N}\setminus\mathcal{U}^k$ updates $v^{k+1}_{i,0} \leftarrow v^k_{i,0}$
14:   $x^{k+1} \leftarrow x^k + \frac{1}{|\mathcal{U}^k|}\sum_{i \in \mathcal{U}^k}\big(y^k_{i,\tau} - x^k\big)$
15: end for
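The sketch below mirrors Algorithm 1 on the toy quadratic objectives used earlier; it is a minimal illustration under assumed hyperparameters, not the paper's reference implementation.

```python
import numpy as np

rng = np.random.default_rng(3)
n, m, d = 20, 5, 3
eta, rho, lam, tau, K = 0.1, 0.05, 0.7, 10, 50   # illustrative hyperparameters
centers = rng.normal(size=(n, d))                # toy non-iid local minimizers

def grad_i(w, i):                                # stochastic gradient of f_i
    return (w - centers[i]) + 0.01 * rng.normal(size=d)

x = np.zeros(d)                                  # global model x^0
v = np.zeros((n, d))                             # personalized models v_i, kept on the clients

for k in range(K):
    sampled = rng.choice(n, size=m, replace=False)       # Line 2: sample U^k
    deltas = []
    for i in sampled:
        y = x.copy()                                      # Line 5: y^k_{i,0} <- x^k
        vi = v[i].copy()
        for j in range(tau):                              # Lines 6-9: alternating SGD
            y = y - eta * (grad_i(y, i) + (rho / eta) * (vi + y - 2 * x))   # Line 7
            vi = vi - eta * grad_i(vi, i)                                   # Line 8
        v[i] = (1 - lam) * vi + lam * y                   # Line 10: mix for the next round
        deltas.append(y - x)                              # Line 11: send y^k_{i,tau} - x^k
    # Line 13 is implicit: unselected clients keep v[i] unchanged.
    x = x + np.mean(deltas, axis=0)                       # Line 14: server update
print("final global model:", x)
```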
C. Retaining Historical Information for Personalized Model
In the current local update stage, the model $y_i$ is initialized as the received global model $x$, while $v_i$ is initialized as the $y_i$ trained in the previous stage. The two models are then updated alternately by first-order optimizers, i.e., each selected client $i$ performs SGD on $f_i(\cdot)$ as in (2) to obtain a personalized local model and on (5) to obtain a locally approximated globalized model, respectively. However, this initialization policy makes the new $v_i$ forget all the accumulated local information contained in the previous $v_i$. To make the best use of the historical models, we let $v_i$ partially inherit its preceding value, i.e.,
$$v^{k+1}_{i,0} \leftarrow (1-\lambda)\, v^k_{i,\tau} + \lambda\, y^k_{i,\tau}, \tag{6}$$
where $v^k_{i,\tau}$ and $y^k_{i,\tau}$ are the models trained in the $k$-th round, $v^{k+1}_{i,0}$ is the initial model of the $(k+1)$-th round, and $\lambda \in [\frac{1}{2}, 1]$ is the mixing rate controlling how much local deviation information is retained. To be specific, $\lambda$ limits the distance between the initial $v_i$ and $y_i$ within a certain range to avoid the destructively large correction that $\|v_i + y_i - 2x\|^2$ would generate as the difference between $v_i$ and $x$ (e.g., $\|v_i - x\|$) grows monotonically during updating.
Remark 1. If $\lambda$ is set in the finite interval $[\frac{1}{2}, 1]$, we claim that there exist suitable $\eta$, $\rho$ that enable the global model $x$ to converge to the global optimum.
D. Procedure of FedDeper and Further Discussion
The proposed method is summarized in Algorithm 1. In Lines 7-8, we update the globalized local model $y$ and the personalized one $v$ alternately. Line 7 shows the $j$-th step of local SGD, where $v$ enters the stochastic (mini-batch) gradient $g^\rho_i$ of $f^\rho_i$. Line 8 shows a step of SGD approaching the optimum of the local objective. In Line 10, we initialize the personalized model for the next round by mixing the two trained models as in (6).