
Personalized Federated Learning via Heterogeneous
Modular Networks
Tianchun Wang1, Wei Cheng2, Dongsheng Luo3, Wenchao Yu2, Jingchao Ni4, Liang Tong2,
Haifeng Chen2, Xiang Zhang1
1The Pennsylvania State University, 2NEC Laboratories America, 3Florida International University, 4AWS AI Labs, Amazon
{tkw5356, xzz89}@psu.edu, {weicheng, wyu, ltong, haifeng}@nec-labs.com, dluo@fiu.edu, jingchni@amazon.com
Abstract—Personalized Federated Learning (PFL), which collaboratively trains a federated model while adapting it to individual local clients under privacy constraints, has attracted much attention. Despite its popularity, it has been observed that existing PFL approaches result in sub-optimal solutions when the joint distributions of local clients diverge. To address this issue, we present Federated Modular Network (FedMN), a novel PFL approach that adaptively selects sub-modules from a module pool to assemble heterogeneous neural architectures for different clients. FedMN adopts a lightweight routing hypernetwork to model the joint distribution on each client and produce a personalized selection of module blocks for each client. To reduce the communication burden of existing FL methods, we develop an efficient way for the clients and the server to interact. We conduct extensive experiments on real-world test beds, and the results show both the effectiveness and efficiency of the proposed FedMN over the baselines.
Index Terms—Federated Learning, Personalized Models, Mod-
ular Networks
I. INTRODUCTION
Federated Learning (FL) has emerged as a promising solution that facilitates distributed collaborative learning without disclosing original training data while naturally complying with government regulations [1], [2]. In practice, data heterogeneity deteriorates the performance of the global FL model on individual clients due to the lack of solution personalization. To tackle this issue, researchers have focused on Personalized Federated Learning (PFL), which aims to make the global model fit the distributions on most of the devices [3], [4]. The vanilla PFL approaches first learn a global model and then locally adapt it to each client by fine-tuning the global parameters [5], [6]. In this case, the trained global model can be regarded as a meta-model ready for further personalization on each local client. To build a better meta-model, many efforts have been made to bridge FL and Model-Agnostic Meta-Learning (MAML) [7]–[9]. However, the global generalization error of these approaches typically does not decrease much [10], so the performance cannot be significantly improved. Another line of research focuses on jointly training a global model and a local model for each client to achieve personalization [11], [12]. This strategy does not perform well on clients whose local distributions are far from the average. Cluster-based PFL approaches [13] address this issue by grouping the clients into several clusters. The clients in a cluster share the same model, while those belonging to different clusters have different models. Unfortunately, the model trained in one cluster does not benefit from the knowledge of the clients in other clusters, which limits the capability to share knowledge and therefore results in sub-optimal solutions.
An alternative strategy is to adopt the Multi-Task Learning (MTL) framework to train a PFL model [4], [14]. However, most existing efforts do not consider the difference in conditional distributions between clients, which is an important issue when building a federated model. For example, labels sometimes reflect sentiment: some users may label a laptop as cheap while others label it as expensive. This conditional distribution heterogeneity makes the model inaccurate on clients whose $p(y|x)$ is far from the average. To address this problem, a recent work [10] assumes that the data distribution of each client is a mixture of $M$ underlying distributions and proposes a flexible framework in which each client learns a combination of $M$ shared components with different weights. It optimizes the varying conditional distributions $p_i(y|x)$ under the assumption that the marginal distribution $p_i(x) = p(x)$ is the same for all clients (Assumption 2 in [10]). This assumption, however, is restrictive. For instance, in handwriting recognition, users who write the same words might still have different stroke widths, slants, etc. In such cases, $p_i(x) \neq p_j(x)$ for clients $i$ and $j$. Other works [15], [16] assume either the marginal distribution $p_i(x)$ or the conditional distribution $p_i(y|x)$ to be the same across clients. In reality, the data on each client may deviate from being identically distributed, i.e., $P_i \neq P_j$ for clients $i$ and $j$. That is, the joint distribution $P_i(x, y)$ (which can be rewritten as $P_i(y|x)P_i(x)$ or $P_i(x|y)P_i(y)$) may differ across clients. We call this the "joint distribution heterogeneity" problem. Existing approaches [15], [16] fail to fully model the difference in joint distributions between clients because they assume one factor is the same while varying the other. Moreover, to accommodate different data distributions, a homogeneous model would have to be very large to reach the desired predictive power, so the communication costs between the server and the clients would be huge. In this case, communication becomes a key bottleneck when developing FL methods. To this end, it is desirable to design an effective PFL model that accommodates heterogeneous clients in an efficient manner.
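To make this notion precise, joint distribution heterogeneity can be stated as follows (a minimal formalization in the notation above, where the factorization follows from the chain rule): for some pair of clients $i \neq j$,

$P_i(x, y) = P_i(y|x)\,P_i(x) \;\neq\; P_j(y|x)\,P_j(x) = P_j(x, y)$,

where possibly both $P_i(x) \neq P_j(x)$ and $P_i(y|x) \neq P_j(y|x)$ hold simultaneously, in contrast to settings that fix one factor and vary only the other.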
To solve the aforementioned problems, in this paper, we propose a novel Federated Modular Network (FedMN) approach,