
techniques [12–14]. To take advantage of this personalization
information, we design a double-gradient-descent rule in the
local update stage, whereby each client maintains two decoupled
local models (rather than a single one) to separate the
global update direction from the local one. In particular, the
personalized local model is obtained by directly optimizing the
local objective, while the globalized local model is obtained by
subtracting the personalized component from the original local
model. Therefore, each sampled client uploads its globalized
model in place of the original local one, which reduces the
accumulated local deviations and thus accelerates and stabilizes
convergence.
We summarize our key contributions as follows:
• We propose a novel method, called FedDeper, that improves FL performance on non-iid data via a depersonalization update mechanism and can be widely adapted to a variety of scenarios.
• We theoretically analyze the convergence of the proposed method for both the personalized and the aggregated models in the general non-convex setting.
• We provide experimental results that compare the convergence performance of the proposed algorithm against baselines and study the factors that affect convergence.
The remainder of this paper is organized as follows. We start
by discussing the impact of data heterogeneity on the canonical
FedAvg method in Section II. Then, we propose a new
FedDeper method in Section III and analyze its convergence in
Section IV. Next, we present and discuss experimental results
in Section V. Finally, we conclude the paper in Section VI.
II. PRELIMINARIES AND BACKGROUND
In an FL framework, for all participating clients (denoted by
$\mathcal{N}$ with cardinality $n := |\mathcal{N}|$), we have the following
optimization objective:
$$\min_{x\in\mathbb{R}^d} f(x) := \frac{1}{n}\sum_{i\in\mathcal{N}} f_i(x), \tag{1}$$
where $d$ denotes the dimension of the vector $x$, and $f_i(x) := \mathbb{E}_{\vartheta_i\sim\mathcal{D}_i}[f(x;\vartheta_i)]$ represents the local objective function of
each client $i$. Here, $f_i$ is generally the loss function defined
by the local ML model, and $\vartheta_i$ denotes a data sample
belonging to the local dataset $\mathcal{D}_i$. In this paper, we mainly
deal with Problem (1) [8–10], and all the results can be
extended to the weighted version by techniques in [6, 11].
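As a quick illustration of objective (1), the following minimal sketch evaluates the global objective as the uniform average of per-client empirical losses. The squared-error losses, synthetic data, and all function names are assumptions for demonstration only; a weighted variant, as in [6, 11], would instead average with weights proportional to local dataset sizes.

```python
import numpy as np

def local_loss(x, data):
    """Empirical local objective f_i(x): mean squared error over client i's samples."""
    A, b = data  # features and targets held by one client
    return 0.5 * np.mean((A @ x - b) ** 2)

def global_objective(x, client_data):
    """Uniform average in (1); a weighted version would use weights n_i / sum_j n_j."""
    return np.mean([local_loss(x, d) for d in client_data])

# Synthetic example: n = 5 clients, model dimension d = 3.
rng = np.random.default_rng(0)
client_data = [(rng.normal(size=(20, 3)), rng.normal(size=20)) for _ in range(5)]
x = np.zeros(3)
print(global_objective(x, client_data))
```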
We depict a round of the typical FedAvg algorithm for solving
(1) in three parts. In the $k$-th round, (i) Broadcasting: The
server uniformly samples a subset of $m$ clients (i.e., $\mathcal{U}^k \subseteq \mathcal{N}$
with $m := |\mathcal{U}^k| \leq n$, $\forall k \in \{0,1,\ldots,K-1\}$ for any integer
$K \geq 1$) and broadcasts the aggregated global model $x^k$
to each client $i \in \mathcal{U}^k$. (ii) Local Update: Each selected client $i$
initializes its local model $v^k_{i,0}$ as $x^k$ and then trains the model
by performing stochastic gradient descent (SGD) with step
size $\eta$ on $f_i(\cdot)$:
$$v^k_{i,j+1} \leftarrow v^k_{i,j} - \eta\, g_i(v^k_{i,j}), \quad \forall j \in \{0,1,\ldots,\tau-1\}, \tag{2}$$
where $v^k_{i,j}$ denotes the local model after the $j$-th SGD step
and $g_i(\cdot)$ represents the stochastic gradient of $f_i(\cdot)$ w.r.t. $v_i$.
Fig. 1. Federated learning with depersonalization: (i) communication (broadcasting & aggregating), (ii) computation (local updating): (a) optimization, (b) depersonalization, and (c) initialization. Indeed, mechanism (a) integrates (b), which integrates (c), as (a) ⊃ (b) ⊃ (c).
Once the number of local steps reaches a given threshold
$\tau$, client $i$ uploads its local model to the server. (iii) Global
Aggregation: The server aggregates all received local models
to derive a new global model for the next round:
$$x^{k+1} \leftarrow \frac{1}{m}\sum_{i\in\mathcal{U}^k} v^k_{i,\tau}. \tag{3}$$
We complete the whole process when the number of commu-
nication rounds reaches the upper limit K, and obtain a global
model trained by all participating clients.
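For concreteness, the following minimal NumPy sketch mirrors one FedAvg run as described above: broadcasting to a sampled subset, $\tau$ local SGD steps per client as in (2), and uniform averaging as in (3). The quadratic local losses, synthetic client data, hyperparameter values, and function names are illustrative assumptions, not the paper's implementation.

```python
import numpy as np

rng = np.random.default_rng(1)
n, m, d, tau, eta, K = 10, 3, 5, 20, 0.05, 50  # clients, sampled per round, dim, local steps, step size, rounds

# Synthetic heterogeneous client data: least-squares losses f_i(x) = 0.5 * E||A_i x - b_i||^2.
clients = [(rng.normal(size=(30, d)), rng.normal(size=30) + i) for i in range(n)]

def stochastic_grad(x, data):
    """Stochastic gradient g_i(x) from one uniformly sampled data point."""
    A, b = data
    idx = rng.integers(len(b))
    a, y = A[idx], b[idx]
    return a * (a @ x - y)

x = np.zeros(d)                                      # global model x^k
for k in range(K):
    sampled = rng.choice(n, size=m, replace=False)   # (i) broadcasting to U^k
    local_models = []
    for i in sampled:
        v = x.copy()                                 # v^k_{i,0} = x^k
        for _ in range(tau):                         # (ii) local SGD, update (2)
            v -= eta * stochastic_grad(v, clients[i])
        local_models.append(v)
    x = np.mean(local_models, axis=0)                # (iii) aggregation, update (3)
```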
Note that the stochastic gradient $g_i(\cdot)$ in update (2) can
be written more precisely as $g_i(\cdot) = \nabla f(\cdot;\vartheta_i)$ with $\vartheta_i \sim \mathcal{D}_i$. Owing to the high heterogeneity, the local datasets $\{\mathcal{D}_i\}_{i\in\mathcal{N}}$ follow
unbalanced data distributions, so the local objectives, and hence the generated
gradients, differ in expectation:
$$\mathbb{E}_{\vartheta_i\sim\mathcal{D}_i}[f(\cdot;\vartheta_i)] \neq \mathbb{E}_{\vartheta_j\sim\mathcal{D}_j}[f(\cdot;\vartheta_j)], \quad \forall i,j \in \mathcal{N},\ i \neq j. \tag{4}$$
This means that performing SGD (2) under (4) leads each
client to drift toward its own local solution $v_i^*$ with $\nabla f_i(v_i^*) = 0$, which deviates from the global solution $x^*$ with $\nabla f(x^*) = 0$;
the optimization objectives are inconsistent in the sense that
$\bigcap_{i\in\mathcal{N}} \ker \nabla f_i = \emptyset$, hence resulting in slow convergence.
Moreover, in practically deployed FL frameworks, the
number of participants is usually very large while the bandwidth
or communication capability of the server is limited, i.e.,
only a small fraction of clients can be selected to join a training
round: $m \ll n$. This fact (partial communication) aggravates
the inconsistency among local models, thus further leading to
unreliable training and poor performance.
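The small sketch below illustrates this drift under partial communication. The two-client quadratic objectives and all parameter values are assumptions for demonstration only: each local minimizer sits away from the global one, so sampling a single client per round drags the global model toward whichever $v_i^*$ was sampled rather than keeping it at $x^*$.

```python
import numpy as np

# Two heterogeneous clients with quadratic losses f_i(x) = 0.5 * (x - c_i)^2,
# so the local minimizers are v_i* = c_i while the global minimizer is x* = 0.
centers = np.array([-2.0, 2.0])
grad = lambda x, c: x - c        # exact gradient of f_i (noise-free for clarity)

eta, tau, K = 0.1, 10, 8
x = 0.0                          # start at the global optimum x* = 0
rng = np.random.default_rng(2)

for k in range(K):
    i = rng.integers(2)          # partial communication: m = 1 out of n = 2 clients
    v = x
    for _ in range(tau):         # tau local steps pull v toward v_i* = c_i
        v -= eta * grad(v, centers[i])
    x = v                        # "aggregate" of a single sampled client
    print(f"round {k}: sampled client {i}, x = {x:+.3f}")
# x is dragged toward the sampled client's minimizer instead of staying at x* = 0.
```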
III. FEDERATED LEARNING WITH DEPERSONALIZATION
To alleviate the negative impact of non-iid data and partial
communication on FL, we propose a new Depersonalized
FL (FedDeper) algorithm. In brief, we generate local
approximations of the global model on the clients, and then upload
and aggregate these approximations in place of the original local models
for stabilization and acceleration, as shown in Fig. 1.
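To make the idea concrete, here is a minimal sketch of one client's depersonalized local phase, assuming one possible instantiation of the double-gradient-descent rule described in Section I: an ordinary local model is optimized as in FedAvg, a personalized model carrying historical information is optimized directly on the local objective, and a globalized model is formed by removing the personalized component before upload. All names (depersonalized_local_update, p_prev, etc.) and the specific combination rule are illustrative assumptions, not the paper's exact formulation.

```python
import numpy as np

def depersonalized_local_update(x_global, p_prev, data, eta=0.05, tau=20, rng=None):
    """One client's local phase (illustrative sketch, not the exact FedDeper rule):
    train an ordinary local model and a persistent personalized model, then form
    a globalized model for upload by subtracting the personalized component."""
    rng = np.random.default_rng(0) if rng is None else rng
    A, b = data

    def sgd_step(w):
        # One SGD step on the local objective f_i using a single random sample.
        idx = rng.integers(len(b))
        a, y = A[idx], b[idx]
        return w - eta * a * (a @ w - y)

    v = x_global.copy()     # local model initialized from the broadcast x^k
    p = p_prev.copy()       # personalized model carries historical info across rounds
    for _ in range(tau):
        v = sgd_step(v)     # descent on the ordinary local model
        p = sgd_step(p)     # decoupled descent on the personalized model

    # Depersonalization (assumed form): remove the personalized drift from the
    # local update so the uploaded model tracks the global direction.
    g = v - (p - x_global)
    return g, p             # g is uploaded and aggregated as in (3); p stays local
```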