Federated Learning with Server Learning: Enhancing Performance
for Non-IID Data
Van Sy Mai, Richard J. La, Tao Zhang
Abstract
Federated Learning (FL) has emerged as a means of distributed learning using local data stored at clients with a coordinating
server. Recent studies showed that FL can suffer from poor performance and slower convergence when training data at clients
are not independent and identically distributed. Here we consider a new complementary approach to mitigating this performance
degradation by allowing the server to perform auxiliary learning from a small dataset. Our analysis and experiments show that this
new approach can achieve significant improvements in both model accuracy and convergence time even when the server dataset
is small and its distribution differs from that of the aggregated data from all clients.
I. INTRODUCTION
Federated Learning (FL) is a recent paradigm in which multiple clients collaborate under the coordination of a central server
to train machine learning (ML) models [13]. A key advantage is that clients need not send their local data to any central server
or share their data with each other. Performing learning where the data is generated (or collected) is becoming necessary as a
large and growing amount of data is created at the network edge and cannot all be forwarded to any central location due to
many factors such as network capacity constraints, latency requirements, and data privacy concerns [4].
In its basic form, FL trains a global model for all clients based on the following high-level iterative procedure. At each
global round: 1) the central server selects a subset of clients and shares the current global model with them, 2) each selected
client updates the model using only its local data and forwards the updated model to the central server, and 3) the central
server aggregates the updated local models from the clients to update the global model. This process is repeated until certain
convergence criteria are satisfied.
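To make this round structure concrete, the following minimal Python (NumPy) sketch implements one such global round for a toy least-squares problem. The linear model, the local SGD settings, and the plain unweighted averaging in step 3 are simplifying assumptions made purely for illustration; practical FL systems (including the algorithms discussed below) typically use weighted aggregation and far richer models.

```python
import numpy as np

rng = np.random.default_rng(0)

def local_update(model, data, lr=0.1, steps=5):
    """A client's local training: a few gradient steps on a least-squares loss.
    (Linear regression is only a stand-in for the client's real model and loss.)"""
    X, y = data
    w = model.copy()
    for _ in range(steps):
        grad = X.T @ (X @ w - y) / len(y)   # gradient of the mean squared error
        w -= lr * grad
    return w

def fl_round(global_model, client_data, num_selected):
    """One global round of the three-step procedure described above."""
    # 1) the server selects a subset of clients and shares the current model
    selected = rng.choice(len(client_data), size=num_selected, replace=False)
    # 2) each selected client updates the model using only its local data
    local_models = [local_update(global_model, client_data[i]) for i in selected]
    # 3) the server aggregates the local models (here: plain averaging)
    return np.mean(local_models, axis=0)

# toy run: 10 clients, each holding a small regression dataset; 3-dimensional model
clients = [(rng.normal(size=(20, 3)), rng.normal(size=20)) for _ in range(10)]
w = np.zeros(3)
for t in range(50):
    w = fl_round(w, clients, num_selected=4)
```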
Background: Conventional FL techniques, such as the well-known Federated Averaging (FedAvg) algorithm [22], carry out
model aggregation by averaging the model parameters received from the clients. This performs well when clients have access
to independent and identically distributed (IID) training samples. In practice, however, the local data available to the clients
often do not satisfy this IID assumption for different reasons. For instance, clients may collect data from different sources,
using different tools, under different conditions, or only have access to partial or biased data, which can cause the distributions
of the samples or features at different clients to differ considerably. Such divergences are also referred to as drifts or shifts,
and can take different forms [13].
Large divergences can cause conventional FL techniques to suffer from poor model performance and slow training convergence [6], [8], [12], [14], [18], [33]. For example, feature divergence, where the distributions of features differ at different
clients, may cause local models to focus on different features or even use different feature representations. Non-IID training
data can also cause clients to optimize their local models toward local optima that can differ significantly from global optima.
This can further cause the weights of clients’ local models to diverge [21], [33]. As a result, simply averaging local models
may not move the aggregated model toward a global optimum.
Recently, growing efforts have been devoted to improving FL performance for non-IID data. The following are several
representative categories of approaches.
Personalization: Clients personalize their local models to perform well on their local data [6], [11], [16], [17], [26].
Personalization can be for individual clients or groups of clients (e.g., clients that have similar training data or contribute
similar model updates to the server) [3], [9]. Many real-world applications, however, desire a common model for all clients.
For example, consider autonomous vehicles (AVs) in different geographical regions learning to recognize stop signs. The
snow-covered stop signs in the northeastern United States can look very different from those along the sunny southern country roads.
Since cars can travel anywhere, they will benefit from a model that can work well everywhere.
Changing how clients learn or contribute: Several approaches aim to better align the objectives of clients that can diverge
due to non-IID training data, e.g., [24], [30]. Clients may use Batch Normalization to alleviate local model divergence caused
by non-IID data [19]. Batch Normalization [10] has been used in deep learning to mitigate the impact of domain shifts (i.e.,
differences between training data distribution and test data distribution). Various methods have also been proposed to choose
a subset of the clients to participate in each round of FL to counterbalance distribution shifts [25], [32].
Changing how the server aggregates local models: This approach alters the aggregation method of local models based on,
e.g., their distances to an estimated global model baseline [28], or additional client states or control variates [14].
V. S. Mai and T. Zhang are with the National Institute of Standards and Technology (NIST), Gaithersburg, MD 20899, USA. Email: {vansy.mai,
tao.zhang}@nist.gov. R.J. La is with NIST and the University of Maryland, College Park, MD 20742, USA. Email: hyongla@umd.edu. Any mention of
commercial products in this paper is for information only; it does not imply any recommendation or endorsement by NIST. U.S. Government work not
protected by U.S. copyright
arXiv:2210.02614v4 [cs.LG] 15 Aug 2023
Lifelong learning techniques: These techniques treat the learning at each client as a separate task and learn these tasks
sequentially using a single model without forgetting the previously learned tasks [13].
Motivation: Our main observation is that these existing FL algorithms either do not consider the central server as a learner or assume that the server has no training data. In practice, however, the server can and often does have access to some training data.
For example, the server may receive data from sensors and testing devices that do not participate in the learning process. It
may have synthetic data obtained from simulation (or emulation) and digital twins. The server may also receive some raw data
directly from the clients; this is often required to, for example, support system monitoring and diagnosis.
Consider again AVs, as an example, which need ML models to recognize objects. Today, two main sources of data are used
to train and test such models. First, test vehicles are used to scout selected areas to collect real-world data. Note that this
typically raises no privacy concerns. It, however, may require large fleets of test vehicles, take years to accomplish, incur
heavy costs, and yet still fail to collect enough data to cover the vast range of possible learning needs [31]. Therefore, the AV
industry is increasingly relying on a second source of data – synthetic data, typically generated in the cloud – to extend model
training and testing scopes. Going forward, when some AVs participate in FL, a small fleet of test vehicles, which may not all
participate in FL, can still be used to collect and send data to the server to complement the data that the FL clients can collect.
Sharing a common IID training dataset with all clients (so that each client will train its local model on its local data plus
this common dataset) has been shown to improve FL performance with non-IID data [13], [17], [33]. But this method, which we refer to as FL with data sharing or simply data sharing, also increases clients' workload, making it less suitable for
resource-constrained devices. More importantly, it is often impractical for clients to share data with each other due to privacy
concerns, network bandwidth constraints, and latency requirements. We will show that it is unnecessary to share such common
datasets among clients, as comparable or better performance can be achieved by having the server learn from the same dataset.
Several recent works have considered server learning with some centralized data, e.g., hybrid training [27], mixed FL [1],
and FL with server learning [20]. However, [27] analyzes only the case where both clients’ data and server data are IID and
their algorithm requires all clients to participate in every round. Similarly, [1] assumes IID client data and considers the server's role as a regularizer. In contrast, [20] focuses on FL with non-IID client data. In this paper, we build upon our work in [20]
to study the idea of using server learning to enhance FL on non-IID data and provide both analytical and experimental results
showing that this approach can be effective under certain conditions. Therefore, the primary focus of our study and reported
analysis are fundamentally different from those in [1] and [27].
Contributions: We consider a new FL algorithm that incorporates server learning to improve performance on non-IID
data. Specifically, the server collects a small amount of data, learns from it, and distills the knowledge into the global model
incrementally during the FL process. We refer to this method as Federated Learning with Server Learning (FSL). Our main
contributions can be summarized as follows:
Through our analysis and experimental studies, we show that FSL can significantly improve the performance in both final
accuracy and convergence time when clients have non-IID data. Also, only a small amount of data is needed at the server for
FSL to improve performance, even when the server data distribution deviates from that of the aggregated data stored at the
clients. As expected, the training performance improves as such distribution divergence diminishes.
By incorporating server learning with FL in an incremental fashion, we will demonstrate that FSL significantly accelerates
the learning process when the current model is far from any (locally) optimal model.
FSL is simple and can be tuned relatively easily, even when the server dataset is relatively small. Compared to FL,
FSL adds only a local learning component to the server and does not affect the clients. Thus, FSL has the same per-round
communication overhead as FL while, in practice, requiring only one additional parameter to be tuned, namely the weight given to the server's loss function. Our experimental studies show that the performance improvement of FSL remains significant for a
relatively large range of this weight.
In our experiments, FSL consistently outperforms the data sharing method in [33], suggesting that sharing common datasets
with clients might be unnecessary. We also demonstrate that by employing a small amount of data from either a few clients
or other data sources (including synthetic data) for server learning, FSL can achieve similar (and often better) performance
compared to FedDyn [7] and SCAFFOLD [14], while enjoying significantly faster learning at the beginning.
Preliminary results of this paper appeared in [20], where only the main algorithm and limited experimental results using
IID server data were reported. In this paper, we provide a theoretical analysis of FSL and more extensive experimental
evaluations, including a comparison with the SCAFFOLD algorithm using non-IID server data.
The rest of the paper is organized as follows. The problem formulation and our algorithm are given in § II. Main convergence
results are presented in § III, followed by experimental evaluations in § IV. Conclusions are given in § V. All the proofs and
additional numerical results can be found in our technical report in Appendices A and B, respectively.
Notation: For each integer $n > 0$, we use $[n]$ to denote the set $\{1, \dots, n\}$. For a finite set $\mathcal{D}$, $|\mathcal{D}|$ denotes its cardinality. For any vector $x$, $\|x\|$ denotes its 2-norm. We denote by $\langle x, y \rangle$ the inner product of two vectors $x$ and $y$. A function $f : \mathcal{D} \to \mathbb{R}$ is said to be smooth with parameter $L$, or simply $L$-smooth, if $f(x) - f(y) - \langle \nabla f(y), x - y \rangle \le \frac{L}{2}\|x - y\|^2$ for all $x, y \in \mathcal{D}$. For a random variable $X$, we use both $E[X]$ and $EX$ to denote its expected value.
II. PROBLEM FORMULATION AND OUR APPROACH
In this section, we first present our problem formulation in connection with the data sharing approach, and then delineate
the FSL algorithm aimed at coping with non-IID data.
A. Problem Formulation
Consider the following ML problem in which we train a model to minimize an empirical loss:
$$\min_{x\in\mathbb{R}^d} F(x) \triangleq \frac{1}{n}\sum_{i\in[n]} \ell(x, s_i), \qquad (1)$$
where $x \in \mathbb{R}^d$ is the vector of model parameters that need to be learned, $\mathcal{D} = \{s_1, \dots, s_n\}$ is the set of training samples, and $\ell(x, s_i)$ is the loss for sample $s_i$ under model $x$.
In FL, the goal remains the same, which is to minimize the total loss, but training data are distributed at multiple clients. Suppose that there are $N$ clients and the dataset $\mathcal{D}$ is partitioned into $\{\mathcal{D}_1, \mathcal{D}_2, \dots, \mathcal{D}_N\}$, where $\mathcal{D}_i$ is the local dataset at client $i$. For each $i \in [N]$, define $n_i := |\mathcal{D}_i|$ and $f_i(x) := \frac{1}{n_i}\sum_{s\in\mathcal{D}_i} \ell(x, s)$ to be the loss function of client $i$ over its own dataset $\mathcal{D}_i$ under model $x$. Then, problem (1) can be reformulated as follows with $p_i = \frac{n_i}{n}$ for all $i \in [N]$:
$$\min_{x\in\mathbb{R}^d} F(x) = \sum_{i\in[N]} p_i f_i(x). \qquad (2)$$
Suppose that the server also has access to a dataset $\mathcal{D}_0$ with $n_0 = |\mathcal{D}_0|$. In the algorithm of [33], a subset of samples in $\mathcal{D}_0$ is shared with all clients and is not utilized by the server. Each client $i$ implements the conventional FL algorithm using the augmented dataset $\mathcal{D}'_i = \mathcal{D}_i \cup \mathcal{D}_0$.¹ Under such data sharing, the optimization problem in (1) is modified as follows to reflect the change in clients' datasets:
$$\min_{x\in\mathbb{R}^d} F'(x) = \frac{1}{n + N n_0}\Big(\sum_{s\in\mathcal{D}} \ell(x, s) + N \sum_{s\in\mathcal{D}_0} \ell(x, s)\Big).$$
Similar to (2), this problem can be rewritten using the weighted sum of clients' loss functions as follows:
$$\min_{x\in\mathbb{R}^d} F'(x) = \sum_{i\in[N]} p'_i f'_i(x), \qquad (3)$$
where $f'_i(x) = \frac{1}{n_i + n_0}\sum_{s\in\mathcal{D}'_i} \ell(x, s)$ is the modified loss of client $i$, and $p'_i = \frac{n_i + n_0}{n + N n_0}$ is the corresponding weight.
Define $f_0 := \frac{1}{n_0}\sum_{s\in\mathcal{D}_0} \ell(x, s)$ to be the loss function for the samples in $\mathcal{D}_0$. Using the definition of $F$ in (1), the new objective function $F'$ can be rewritten as
$$F' = \frac{n}{n + N n_0}\Big(F + \frac{N n_0}{n} f_0\Big). \qquad (4)$$
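Equation (4) follows directly by substituting the definitions of $p'_i$ and $f'_i$ into (3) and regrouping terms:
$$F'(x) = \sum_{i\in[N]} \frac{n_i + n_0}{n + N n_0}\cdot\frac{1}{n_i + n_0}\sum_{s\in\mathcal{D}'_i}\ell(x, s) = \frac{1}{n + N n_0}\sum_{i\in[N]}\Big(\sum_{s\in\mathcal{D}_i}\ell(x, s) + \sum_{s\in\mathcal{D}_0}\ell(x, s)\Big) = \frac{n F(x) + N n_0 f_0(x)}{n + N n_0},$$
which is (4).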
This tells us that the above data sharing method alters the objective function by adding the loss function $f_0$ for the shared samples with a weight of $\frac{N n_0}{n}$. It also suggests that the quality of the solution obtained from (3), relative to the original problem in (2), depends on how similar $F$ and $f_0$ are: when $F = f_0$, the two problems become equivalent. More importantly,
it shows that sharing the samples in $\mathcal{D}_0$ with clients may be unnecessary; instead, the server can learn from $\mathcal{D}_0$ and combine its learned model with clients' models in a federated fashion. Having the server learn, rather than sharing training samples among the clients, avoids practical issues such as extra communication overheads, long and unpredictable network delays, and privacy concerns. It also allows us to choose the weight for $f_0$, which we denote by $\gamma$, to be different from $\frac{N n_0}{n}$, based on the quality of $\mathcal{D}_0$. This leads to the following (centralized) optimization problem:
$$\min_{x\in\mathbb{R}^d} F + \gamma f_0. \qquad (5)$$
Note that our problem formulation above can be generalized to the case with expected losses as follows:
$$\min_{x\in\mathbb{R}^d} \sum_{i\in[N]} p_i f_i(x) + \gamma f_0(x), \qquad (6)$$
where $p = (p_i;\, i \in [N])$ is a probability vector, $f_i(x) = \mathbb{E}_{z\sim\mathcal{D}_i}[f_i(x; z)]$ is the expected loss function of the server ($i = 0$) and each client $i \in [N]$, and $\mathcal{D}_i$ is the corresponding data distribution. In what follows, we will use (5) to facilitate our discussion and emphasize that our analysis applies directly to (6).
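As a small illustration of the weighting in (5), the sketch below evaluates the combined objective directly from the datasets. The quadratic per-sample loss and the random "samples" in the usage lines are placeholders of our own choosing, since the formulation does not fix a particular loss or data type.

```python
import numpy as np

def empirical_loss(w, dataset, loss_fn):
    """Average per-sample loss over a dataset: f_i for a client, or f_0 for the server."""
    return float(np.mean([loss_fn(w, s) for s in dataset]))

def fsl_objective(w, client_datasets, server_dataset, loss_fn, gamma):
    """Objective in (5): F(w) + gamma * f_0(w), with F(w) = sum_i (n_i / n) f_i(w)."""
    n = sum(len(D) for D in client_datasets)
    F = sum(len(D) / n * empirical_loss(w, D, loss_fn) for D in client_datasets)
    f0 = empirical_loss(w, server_dataset, loss_fn)
    return F + gamma * f0

# usage with a placeholder quadratic loss and random 3-dimensional "samples"
rng = np.random.default_rng(0)
clients = [rng.normal(size=(m, 3)) for m in (30, 50, 20)]
server = rng.normal(size=(10, 3))
value = fsl_objective(np.zeros(3), clients, server,
                      loss_fn=lambda w, s: 0.5 * np.sum((w - s) ** 2), gamma=0.3)
```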
¹ For simplicity, we either assume that $\mathcal{D}_i \cap \mathcal{D}_0 = \emptyset$ or consider any dataset as a multiset, allowing for possible multiple instances for each of its elements. Thus, we can write $|\mathcal{D}'_i| = |\mathcal{D}_i| + |\mathcal{D}_0|$.
B. FSL Algorithm
We assume that the server has access to dataset D0and will augment FL with what the server learns over D0. As stated
earlier, we refer to this approach as Federated Learning with Server Learning or FSL.
There are several ways to incorporate server learning (SL) into FL. One is to treat the server as a regular client that
participates in every round of the FL process [29]: During each global round, the server updates the current global model using D0
and then aggregates it with the updated models reported by the clients. We call this approach non-incremental SL. One issue
with non-incremental SL is that the weight for the server would be very small when $n_0 \ll n$, which means that the server's
contributions, based on its learning from D0, to the global model will be minor. Moreover, this approach fails to exploit the
good quality of D0, especially when its distribution is close to that of D. This issue can be partially alleviated by increasing
the weight given to the server’s model in the aggregation step.
These observations motivate us to consider an incremental learning scheme in which the server performs additional learning over dataset $\mathcal{D}_0$ based on the aggregated model, as shown in Algorithm 1 below in more detail. In particular, lines 1–9 of Algorithm 1 are the same as in a conventional FL algorithm [22], where in each global round $t$, each selected client $i$ (1) receives the current global model $x_t$ from the server, (2) performs $K$ steps of the Stochastic Gradient Descent (SGD) algorithm using its local data $\mathcal{D}_i$ (LOCALSGD) with learning rate $\eta_l$, and (3) returns to the server its update $\Delta_t^{(i)}$. The server then combines its current model $x_t$ with the updates from the clients using some weight $\eta_g > 0$ (lines 8–9). It then uses the resulting updated model to learn locally by performing $K_0$ steps of LOCALSGD with learning rate $\gamma\eta_0$ (line 10). As one can see, our approach has the same computation and communication costs at the clients as the usual FL framework.
Note that our algorithm is similar to the incremental (stochastic) gradient method, which has been shown to be much faster
than the non-incremental gradient method when the model is far from a (locally) optimal point [2]. While FL with local SGD
also works in an incremental fashion, it often needs small learning rates, hence longer learning times, to ensure convergence
when the distribution of clients’ data is heterogeneous.
Before presenting a formal analysis and experimental results, let us provide some insights into FSL. First, when the distributions of $\mathcal{D}_0$ and $\mathcal{D}$ are close, the server's loss function $f_0$ will be similar to the overall loss function $F$ in (1). Consequently, if the current model is far from an optimal point, the gradient $\nabla f_0$ will track the global gradient $\nabla F$, even when individual clients' gradients $\nabla f_i$ do not follow $\nabla F$ closely. Therefore, when the updated model obtained by aggregating clients' updated models does not make (much) progress, $\nabla f_0$ will help improve the updated model. In fact, it turns out that significant improvements can still be achieved even when the distributions of $\mathcal{D}_0$ and $\mathcal{D}$ are not very similar, as long as their difference is small in relation to the non-IIDness of clients' data. We will elaborate on these points in the following section.
Algorithm 1: FSL: FL with Server Learning
Server:
1:  initialize $x_0$, $K$, $K_0$, $\eta_l$, $\eta_g$, $\gamma\eta_0$
2:  for $t = 0, \dots, T-1$ do
3:      sample a subset $\mathcal{S}$ of clients (with $|\mathcal{S}| = S$)
4:      broadcast $x_t$ to clients in $\mathcal{S}$
5:      forall clients $i \in \mathcal{S}$ do
6:          $x^{(i)}_{t,K} = \text{LOCALSGD}(x_t, \eta_l, K, \mathcal{D}_i)$
7:          upload to server: $\Delta^{(i)}_t = x^{(i)}_{t,K} - x_t$
8:      $\Delta_t = \frac{1}{|\mathcal{S}|}\sum_{i\in\mathcal{S}} \Delta^{(i)}_t$
9:      $\bar{x}_t = x_t + \eta_g \Delta_t$
10:     $x_{t+1} = \text{LOCALSGD}(\bar{x}_t, \gamma\eta_0, K_0, \mathcal{D}_0)$
LOCALSGD$(x, \eta, K, \mathcal{D}_i)$:
11: $y_0 = x$
12: for $k = 0, \dots, K-1$ do
13:     compute an unbiased estimate $g(y_k)$ of $\nabla f_i(y_k)$
14:     $y_{k+1} = y_k - \eta\, g(y_k)$
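Below is a compact NumPy sketch of Algorithm 1 under simplifying assumptions of our own: a generic per-sample gradient oracle `grad_fn`, a fixed mini-batch size inside LOCALSGD, sequential rather than parallel client updates, and arbitrary default step sizes. It is meant only to make the control flow concrete, not to serve as a reference implementation.

```python
import numpy as np

rng = np.random.default_rng(0)

def local_sgd(x, eta, K, data, grad_fn, batch_size=32):
    """LOCALSGD(x, eta, K, D): K steps of SGD with unbiased mini-batch gradients."""
    y = x.copy()
    for _ in range(K):
        idx = rng.choice(len(data), size=min(batch_size, len(data)), replace=False)
        g = np.mean([grad_fn(y, data[j]) for j in idx], axis=0)  # estimate of grad f_i(y)
        y -= eta * g
    return y

def fsl(x0, client_data, server_data, grad_fn, T, S, K, K0,
        eta_l=0.05, eta_g=1.0, gamma_eta0=0.05):
    """FSL (Algorithm 1): FedAvg-style rounds followed by incremental server learning."""
    x = x0.copy()
    for t in range(T):
        clients = rng.choice(len(client_data), size=S, replace=False)    # line 3
        deltas = []
        for i in clients:                                                # lines 5-7
            x_iK = local_sgd(x, eta_l, K, client_data[i], grad_fn)
            deltas.append(x_iK - x)
        x_bar = x + eta_g * np.mean(deltas, axis=0)                      # lines 8-9
        x = local_sgd(x_bar, gamma_eta0, K0, server_data, grad_fn)       # line 10: server learning
    return x
```

Note that the server-side call tagged "line 10" is the only addition relative to a FedAvg-style loop; the client-side code and the per-round communication pattern are unchanged, consistent with the discussion of per-round overhead above.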
III. CONVERGENCE RESULTS
We first show in subsection III-A that SL can be viewed as a correction step for FL in the case of non-IID training data.
Then the main convergence results are provided in subsection III-B.
A. SL as Corrections to FL When Far from Convergence
In order to simplify the discussion in this subsection, which provides the key intuition behind our approach, assume that the server can compute the gradient $\nabla f_0(x)$ for all $x \in \mathbb{R}^d$, and consider the usual gradient descent (GD) method for SL. First, consider a single update carried out by the server using GD, starting with some model $w_0$, i.e., $w_1 = w_0 - \eta_0 \nabla f_0(w_0)$. Suppose that $\nabla F$ is Lipschitz continuous with parameter $L$.² Then,
$$F(w_1) - F(w_0) \le \langle \nabla F(w_0),\, w_1 - w_0\rangle + 0.5 L \|w_1 - w_0\|^2 \le -\eta_0 \langle \nabla F(w_0),\, \nabla f_0(w_0)\rangle + 0.5 L \eta_0^2 \|\nabla f_0(w_0)\|^2. \qquad (7)$$
The above inequality indicates that SL can improve FL further when the second term in (7) is sufficiently negative so that
$$2\langle \nabla F(w_0),\, \nabla f_0(w_0)\rangle > L \eta_0 \|\nabla f_0(w_0)\|^2. \qquad (8)$$
This condition holds when $\nabla f_0(w_0)$ makes an acute angle with $\nabla F(w_0)$ (provided that $\|\nabla F(w_0)\| > 0$), in which case progress can be made by using a sufficiently small step size $\eta_0$. This will likely be the case when the distribution of $\mathcal{D}_0$ is similar to that of $\mathcal{D}$ and, when $w_0$ is far from a (local) minimizer, $-\nabla f_0(w_0)$ will likely be a descent direction of $F$ at $w_0$.
In order to further see the role of $\mathcal{D}_0$, let us rewrite condition (8) as follows:
$$\|\nabla F(w_0)\|^2 + (1 - L\eta_0)\|\nabla f_0(w_0)\|^2 > \|\nabla F(w_0) - \nabla f_0(w_0)\|^2. \qquad (9)$$
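To see how (9) follows from (8), expand the squared difference:
$$\|\nabla F(w_0) - \nabla f_0(w_0)\|^2 = \|\nabla F(w_0)\|^2 - 2\langle \nabla F(w_0),\, \nabla f_0(w_0)\rangle + \|\nabla f_0(w_0)\|^2,$$
so that $2\langle \nabla F(w_0), \nabla f_0(w_0)\rangle = \|\nabla F(w_0)\|^2 + \|\nabla f_0(w_0)\|^2 - \|\nabla F(w_0) - \nabla f_0(w_0)\|^2$; substituting this into (8) and rearranging yields (9).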
This implies the following. First, the error $\|\nabla F(w_0) - \nabla f_0(w_0)\|^2$ in general depends on the relationship between the server's dataset $\mathcal{D}_0$ and the aggregate dataset $\mathcal{D}$; the more dissimilar $\mathcal{D}_0$ is to $\mathcal{D}$, the larger the error and thus the smaller the improvement. In fact, SL can have a negative impact if the error is sufficiently large. This suggests that the server dataset should be selected carefully in order to maximize the benefits of SL. As an example, consider $\mathcal{D}_0$ consisting of IID samples. In this case, the error $\|\nabla F(w_0) - \nabla f_0(w_0)\|^2$ tends to decrease with the size of $\mathcal{D}_0$ according to (sampling without replacement)
$$\mathbb{E}_{\mathcal{D}_0}\|\nabla f_0(x) - \nabla F(x)\|^2 = \Big(\frac{n}{n_0} - 1\Big)\frac{\tilde{\sigma}_0^2(x)}{n - 1}, \qquad (10)$$
where $\tilde{\sigma}_0^2(x) = \frac{1}{n}\sum_{s\in\mathcal{D}} \|\nabla_x \ell(x, s) - \nabla F(x)\|^2$ is the population variance (a small numerical check of (10) is sketched below). Thus, condition (9) can be satisfied by increasing $n_0$.
Second, for a fixed $\mathcal{D}_0$ (of reasonable quality), the inequality in (9) holds when $\|\nabla F(w_0)\|$ is large, i.e., $w_0$ is far from being a stationary point, which is expected at the beginning of the training process. This is true even when $\nabla f_0(x)$ is a biased estimate of $\nabla F(x)$, as long as $\|\nabla F(w_0) - \nabla f_0(w_0)\|^2$ is strictly smaller than $\|\nabla F(w_0)\|^2 + \|\nabla f_0(w_0)\|^2$, i.e., the angle between the gradients is acute as mentioned earlier, for a sufficiently small step size $\eta_0$. Third, when $\|\nabla f_0(w_0)\|$ is sufficiently small, e.g., when overfitting happens at the server, the improvement by SL is also insignificant. Finally, when $w_0$ is near a stationary point of $F$ but far from that of $f_0$, i.e., $\|\nabla F(w_0)\| \ll \|\nabla f_0(w_0)\|$, the inequality in (9) may be reversed, in which case SL can impair FL, pushing the model toward the server's local stationary points. In this case, our algorithm does not yield exact convergence but oscillates between stationary points of $F$ and $f_0$, which is expected for an incremental gradient method [2]. Such convergence will be analyzed in detail in the next subsection.
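The finite-population identity (10) is easy to check numerically. The sketch below treats per-sample gradients at a fixed model as synthetic random vectors (an assumption purely for illustration, since no real model is trained here) and compares a Monte Carlo estimate of the left-hand side of (10) with its right-hand side.

```python
import numpy as np

rng = np.random.default_rng(0)

# Per-sample gradients at a fixed model x, represented as synthetic vectors.
# (In reality these would come from backpropagation over D; here they are random.)
n, n0, d, trials = 1000, 50, 20, 5000
G = rng.normal(size=(n, d))                      # one row per sample s in D
grad_F = G.mean(axis=0)                          # grad F(x): average over all of D
pop_var = np.sum((G - grad_F) ** 2) / n          # population variance sigma~_0^2(x)

# Monte Carlo estimate of E_{D0} || grad f0(x) - grad F(x) ||^2 when D0 is a
# uniform sample of size n0 drawn from D without replacement.
err = 0.0
for _ in range(trials):
    idx = rng.choice(n, size=n0, replace=False)
    err += np.sum((G[idx].mean(axis=0) - grad_F) ** 2)
err /= trials

rhs = (n / n0 - 1) * pop_var / (n - 1)           # right-hand side of (10)
print(f"Monte Carlo: {err:.4f}   formula (10): {rhs:.4f}")
```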
The above analysis also applies when the server performs multiple updates. In particular, suppose that the server performs $K_0$ updates of the model using the GD method with a fixed step size $\eta_0$:
$$w_{t,k+1} = w_{t,k} - \eta_0 \nabla f_0(w_{t,k}), \quad k = 0, \dots, K_0 - 1,$$
with $w_{t,0} = \bar{x}_t$ and $x_{t+1} = w_{t,K_0}$. Then, repeating the steps above and summing over the iterations, we obtain
$$F(x_{t+1}) - F(\bar{x}_t) \le -0.5\eta_0 \sum_{k=0}^{K_0-1} \big(\|\nabla F(w_{t,k})\|^2 + (1 - L\eta_0)\|\nabla f_0(w_{t,k})\|^2\big) + 0.5\eta_0 \sum_{k=0}^{K_0-1} \|\nabla F(w_{t,k}) - \nabla f_0(w_{t,k})\|^2.$$
Similarly to the single-update case discussed earlier, we can see that carrying out multiple updates at the server is beneficial when $w_{t,k}$ is far from being a stationary point of either $F$ or $f_0$, more precisely, when $\|\nabla F(w_{t,k})\|^2 + (1 - L\eta_0)\|\nabla f_0(w_{t,k})\|^2 > \|\nabla F(w_{t,k}) - \nabla f_0(w_{t,k})\|^2$. This also suggests that when learning collaboratively with clients, the server should not overfit its own data, which could happen easily when $n_0$ is small.
B. Convergence Analysis
In this subsection, we study the convergence properties of FSL. Specifically, we will prove that, under suitable conditions
on step sizes, FSL converges to a neighborhood of a stationary point of the following modified loss function
$$\tilde{F} = \frac{1}{1+\gamma} F + \frac{\gamma}{1+\gamma} f_0$$
² This assumption is standard in FL and often holds when training neural networks. We will state this assumption formally in Section III-B.