
The common underlying assumption that determines the superiority of FL over individual local training is that the data points of all clients come from the same distribution, i.e., the data are homogeneous across clients. Under this assumption, FL improves the quality of empirical loss minimization when the data available on each device are limited; otherwise, each client could obtain a proper model without collaboration or communication with others. Therefore, FL¹ yields a common global model with better generalization across clients [46] than individual training. In heterogeneous data setups, where clients hold samples from non-identical data distributions, a common (global) model may perform poorly on the local data points of each client. For instance, consider the next-word prediction task on a smart keyboard [28], where each client has a unique writing style or emphasis on the vocabulary domain. In this example, the corresponding mobile application is supposed to suggest a set of words that are likely to be selected as the next word in the sentence. This scenario clearly illustrates a heterogeneous data setup with limited samples on each client's device. Thus, if each client trains a model independently, without collaborating with the other clients, the model will likely perform poorly on new data due to the limited sample size. Hence, the question arises: what happens if the clients hold data samples from similar (but not identical) distributions?
In FL with heterogeneous data, an ideal scenario is to learn a globally common model that is easily adaptable to the local data of each client, i.e., model fusion. This approach is known as Personalized Federated Learning (PFL), which strives to exploit both the shared and unshared information in the data of all clients. One solution to model fusion in PFL is to apply transfer learning [12, 73] (e.g., fine-tuning) to a model jointly trained under FL. Interestingly, the centralized version of this problem has been extensively studied in Meta-Learning [66] and Multi-Task Learning [50], where the goal is to obtain a meta (global) model that performs well on multiple tasks after (potentially) minimal adaptation. In particular, Model-Agnostic Meta-Learning (MAML) [21, 56] proposes an optimization-based formulation that aims to find an initial meta-model that performs well after one or a few steps of (stochastic) gradient descent. The key property of MAML is that it accounts for this fine-tuning step during the learning process. Multiple studies have investigated the convergence and generalization of MAML [8, 16, 18, 19, 22, 31] for various problems and setups. Fallah et al. [17] suggest the MAML formulation as a potential solution for PFL and propose the Per-FedAvg algorithm for collaborative learning with a MAML-based personalized cost function.
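To give a rough idea before the formal treatment in Section 2, the MAML-type objective underlying Per-FedAvg can be sketched as follows, with illustrative notation ($N$ clients, local losses $f_i$, step size $\alpha$) that may differ slightly from the one adopted later:
\[
\min_{w \in \mathbb{R}^d} \; F(w) := \frac{1}{N}\sum_{i=1}^{N} f_i\bigl(w - \alpha \nabla f_i(w)\bigr),
\]
where $\alpha > 0$ is the fine-tuning (adaptation) step size; each client evaluates its loss after one local gradient step from the shared model $w$.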
Dinh et al. [13] present the pFedMe algorithm for PFL by adopting a different formulation for personalization, namely Moreau Envelopes (ME). The resulting method solves a bi-level optimization problem with personalized parameters that are regularized to stay close to the global model, as sketched below. We will elaborate on these two formulations (MAML & ME) in Section 2. Additionally, several recent works have approached PFL mainly through optimization-based [5, 10, 20, 23, 26, 27, 29, 45, 46, 64, 72] or structure-based [9, 59, 65] techniques.
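As a brief, informal sketch of the ME-based formulation (again with illustrative notation that may differ from Section 2), pFedMe minimizes the average of the clients' Moreau envelopes:
\[
\min_{w \in \mathbb{R}^d} \; \frac{1}{N}\sum_{i=1}^{N} F_i(w), \qquad F_i(w) := \min_{\theta_i \in \mathbb{R}^d} \Bigl\{ f_i(\theta_i) + \frac{\lambda}{2}\,\lVert \theta_i - w \rVert^2 \Bigr\},
\]
where $\theta_i$ is the personalized model of client $i$ and the regularization weight $\lambda > 0$ controls how strongly the personalized models are pulled toward the global model $w$.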
Scalability to large-scale setups with potentially many clients is another major challenge for FL. Most algorithms proposed in this setting require synchronous communication between the server and the clients [10, 13, 17, 23, 38, 43, 47]. Such constraints impose considerable delays on the learning progress, since increasing the concurrency of synchronous updates decreases the training speed and quality. For example, limited communication bandwidth, limited computation power, and communication failures incur large delays in the training process. In cross-device FL, devices are naturally prone to
¹ We refer to Federated Learning with no personalization as FL.