
•Lifelong learning techniques: These techniques treat the learning at each client as a separate task and learn these tasks
sequentially using a single model without forgetting the previously learned tasks [13].
Motivation: Our main observation is that these existing FL algorithms do not consider the central server as a learner, or
assume that the server has no training data. In practice, however, the server can and often does have access to some training data.
For example, the server may receive data from sensors and testing devices that do not participate in the learning process. It
may have synthetic data obtained from simulation (or emulation) and digital twins. The server may also receive some raw data
directly from the clients; this is often required to, for example, support system monitoring and diagnosis.
Consider again the example of AVs, which need ML models to recognize objects. Today, two main sources of data are used
to train and test such models. First, test vehicles are used to scout selected areas to collect real-world data. Note that this
typically raises no privacy concerns. It may, however, require large fleets of test vehicles, take years to accomplish, incur
heavy costs, and yet still fail to collect enough data to cover the vast range of possible learning needs [31]. Therefore, the AV
industry is increasingly relying on a second source of data – synthetic data, typically generated in the cloud – to extend model
training and testing scopes. Going forward, when some AVs participate in FL, a small fleet of test vehicles, which may not all
participate in FL, can still be used to collect and send data to the server to complement the data collected by the FL clients.
Sharing a common IID training dataset with all clients (so that each client will train its local model on its local data plus
this common dataset) has been shown to improve FL performance with non-IID data [13], [17], [33]. This method, which
we refer to as FL with data sharing, or simply data sharing, however, increases the clients' workload, making it less suitable for
resource-constrained devices. More importantly, it is often impractical for clients to share data with each other due to privacy
concerns, network bandwidth constraints, and latency requirements. We will show that it is unnecessary to share such common
datasets among clients, as comparable or better performance can be achieved by having the server learn from the same dataset.
Several recent works have considered server learning with some centralized data, e.g., hybrid training [27], mixed FL [1],
and FL with server learning [20]. However, [27] analyzes only the case where both clients’ data and server data are IID and
their algorithm requires all clients to participate in every round. Similarly, [1] assumes IID client data and considers the server's
role as a regularizer. In contrast, [20] focuses on FL with non-IID client data. In this paper, we build upon our work in [20]
to study the idea of using server learning to enhance FL on non-IID data and provide both analytical and experimental results
showing that this approach can be effective under certain conditions. The primary focus of our study and analysis is therefore
fundamentally different from those of [1] and [27].
Contributions: We consider a new FL algorithm that incorporates server learning to improve performance on non-IID
data. Specifically, the server collects a small amount of data, learns from it, and distills the knowledge into the global model
incrementally during the FL process. We refer to this method as Federated Learning with Server Learning (FSL). Our main
contributions can be summarized as follows:
• Through our analysis and experimental studies, we show that FSL can significantly improve both the final accuracy and the
convergence time when clients have non-IID data. Moreover, only a small amount of data is needed at the server for FSL to
improve performance, even when the server's data distribution deviates from that of the aggregate data stored at the clients.
As expected, training performance improves as this distributional divergence diminishes.
• By incorporating server learning into FL in an incremental fashion, we demonstrate that FSL significantly accelerates
the learning process when the current model is far from any (locally) optimal model.
• FSL is simple and can be tuned relatively easily, even when the server dataset is relatively small. Compared to FL,
FSL adds only a local learning component at the server and does not affect the clients. Thus, FSL has the same per-round
communication overhead as FL and, in practice, requires tuning only one additional parameter: the weight given to the
server's loss function. Our experimental studies show that the performance improvement of FSL remains significant over a
relatively large range of this weight.
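The server-learning step described above can be sketched as follows. This is an illustrative, simplified reading only, not the authors' actual implementation: it assumes FedAvg-style aggregation, a toy scalar model with squared loss, and hypothetical names (`client_update`, `fsl_round`, `gamma` for the server-loss weight).

```python
import random

def client_update(model_weights, client_data, lr=0.1, local_steps=5):
    """One client's local SGD pass on its own data (toy scalar model)."""
    w = dict(model_weights)
    for _ in range(local_steps):
        x, y = random.choice(client_data)
        # gradient of the squared loss (w*x - y)^2 with respect to w
        grad = 2 * (w["w"] * x - y) * x
        w["w"] -= lr * grad
    return w

def fsl_round(global_weights, clients, server_data, gamma=0.5,
              lr=0.1, server_steps=5):
    """One round: aggregate client models, then let the server learn
    incrementally from its own (small) dataset, with its loss scaled
    by gamma -- the single extra hyperparameter FSL introduces."""
    # 1) Clients train locally; the server averages their models (FedAvg).
    updates = [client_update(global_weights, d) for d in clients]
    avg = {"w": sum(u["w"] for u in updates) / len(updates)}
    # 2) Server-side SGD on the server's data, weighted by gamma.
    for _ in range(server_steps):
        x, y = random.choice(server_data)
        grad = 2 * (avg["w"] * x - y) * x
        avg["w"] -= gamma * lr * grad
    return avg
```

Because the server step reuses the averaged model in place, the clients see only the usual global model at the next round; this is why the per-round communication cost is unchanged relative to plain FL.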
In our experiments, FSL consistently outperforms the data sharing method in [33], suggesting that sharing common datasets
with clients might be unnecessary. We also demonstrate that by employing a small amount of data from either a few clients
or other data sources (including synthetic data) for server learning, FSL can achieve similar (and often better) performance
compared to FedDyn [7] and SCAFFOLD [14], while enjoying a significant boost in learning rate at the beginning.
Preliminary results of this paper appeared in [20], where only the main algorithm and limited experimental results using
IID server data were reported. In this paper, we provide a theoretical analysis of FSL and more extensive experimental
evaluations, including a comparison with the SCAFFOLD algorithm using non-IID server data.
The rest of the paper is organized as follows. The problem formulation and our algorithm are given in § II. Main convergence
results are presented in § III, followed by experimental evaluations in § IV. Conclusions are given in § V. All the proofs and
additional numerical results can be found in our technical report in Appendices A and B, respectively.
Notation: For each integer n > 0, we use [n] to denote the set {1, . . . , n}. For a finite set D, |D| denotes its cardinality. For
any vector x, ∥x∥ denotes its 2-norm, and ⟨x, y⟩ denotes the inner product of two vectors x and y. A function f : D → R
is said to be smooth with parameter L, or simply L-smooth, if f(x) − f(y) − ⟨∇f(y), x − y⟩ ≤ (L/2)∥x − y∥² for all x, y ∈ D.
For a random variable X, we use both E[X] and E_X to denote its expected value.