Federated Learning with Server Learning: Enhancing Performance
for Non-IID Data
Van Sy Mai, Richard J. La, Tao Zhang
Abstract
Federated Learning (FL) has emerged as a means of distributed learning using local data stored at clients with a coordinating
server. Recent studies showed that FL can suffer from poor performance and slower convergence when training data at clients
are not independent and identically distributed. Here we consider a new complementary approach to mitigating this performance
degradation by allowing the server to perform auxiliary learning from a small dataset. Our analysis and experiments show that this
new approach can achieve significant improvements in both model accuracy and convergence time even when the server dataset
is small and its distribution differs from that of the aggregated data from all clients.
I. INTRODUCTION
Federated Learning (FL) is a recent paradigm in which multiple clients collaborate under the coordination of a central server
to train machine learning (ML) models [13]. A key advantage is that clients need not send their local data to any central server
or share their data with each other. Performing learning where the data is generated (or collected) is becoming necessary as a
large and growing amount of data is created at the network edge and cannot all be forwarded to any central location due to
many factors such as network capacity constraints, latency requirements, and data privacy concerns [4].
In its basic form, FL trains a global model for all clients based on the following high-level iterative procedure. At each
global round: 1) the central server selects a subset of clients and shares the current global model with them, 2) each selected
client updates the model using only its local data and forwards the updated model to the central server, and 3) the central
server aggregates the updated local models from the clients to update the global model. This process is repeated until certain
convergence criteria are satisfied.
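To make this round structure concrete, the following minimal Python (NumPy) sketch implements one such global round for a toy least-squares problem. The linear model, the local SGD settings, and the plain unweighted averaging in step 3 are simplifying assumptions made purely for illustration; practical FL systems (including the algorithms discussed below) typically use weighted aggregation and far richer models.

```python
import numpy as np

rng = np.random.default_rng(0)

def local_update(model, data, lr=0.1, steps=5):
    """A client's local training: a few gradient steps on a least-squares loss.
    (Linear regression is only a stand-in for the client's real model and loss.)"""
    X, y = data
    w = model.copy()
    for _ in range(steps):
        grad = X.T @ (X @ w - y) / len(y)   # gradient of the mean squared error
        w -= lr * grad
    return w

def fl_round(global_model, client_data, num_selected):
    """One global round of the three-step procedure described above."""
    # 1) the server selects a subset of clients and shares the current model
    selected = rng.choice(len(client_data), size=num_selected, replace=False)
    # 2) each selected client updates the model using only its local data
    local_models = [local_update(global_model, client_data[i]) for i in selected]
    # 3) the server aggregates the local models (here: plain averaging)
    return np.mean(local_models, axis=0)

# toy run: 10 clients, each holding a small regression dataset; 3-dimensional model
clients = [(rng.normal(size=(20, 3)), rng.normal(size=20)) for _ in range(10)]
w = np.zeros(3)
for t in range(50):
    w = fl_round(w, clients, num_selected=4)
```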
Background: Conventional FL techniques, such as the well-known Federated Averaging (FedAvg) algorithm [22], carry out
model aggregation by averaging the model parameters received from the clients. This performs well when clients have access
to independent and identically distributed (IID) training samples. In practice, however, the local data available to the clients
often do not satisfy this IID assumption for different reasons. For instance, clients may collect data from different sources,
using different tools, under different conditions, or only have access to partial or biased data, which can cause the distributions
of the samples or features at different clients to differ considerably. Such divergences are also referred to as drifts or shifts,
and can take different forms [13].
Large divergences can cause conventional FL techniques to suffer from poor model performance and slow training convergence [6], [8], [12], [14], [18], [33]. For example, feature divergence, where the distributions of features differ at different
clients, may cause local models to focus on different features or even use different feature representations. Non-IID training
data can also cause clients to optimize their local models toward local optima that can differ significantly from global optima.
This can further cause the weights of clients’ local models to diverge [21], [33]. As a result, simply averaging local models
may not move the aggregated model toward a global optimum.
Recently, growing efforts have been devoted to improving FL performance for non-IID data. The following are several
representative categories of approaches.
Personalization: Clients personalize their local models to perform well on their local data [6], [11], [16], [17], [26].
Personalization can be for individual clients or groups of clients (e.g., clients that have similar training data or contribute
similar model updates to the server) [3], [9]. Many real-world applications, however, desire a common model for all clients.
For example, consider autonomous vehicles (AVs) in different geographical regions learning to recognize stop signs. The
snow-covered stop signs in the northeastern United States can look very different from those along the sunny southern country roads.
Since cars can travel anywhere, they will benefit from a model that can work well everywhere.
Changing how clients learn or contribute: Several approaches aim to better align the objectives of clients that can diverge
due to non-IID training data, e.g., [24], [30]. Clients may use Batch Normalization to alleviate local model divergence caused
by non-IID data [19]. Batch Normalization [10] has been used in deep learning to mitigate the impact of domain shifts (i.e.,
differences between training data distribution and test data distribution). Various methods have also been proposed to choose
a subset of the clients to participate in each round of FL to counterbalance distribution shifts [25], [32].
Changing how the server aggregates local models: This approach alters the aggregation method of local models based on,
e.g., their distances to an estimated global model baseline [28], or additional client states or control variates [14].
V. S. Mai and T. Zhang are with the National Institute of Standards and Technology (NIST), Gaithersburg, MD 20899, USA. Email: {vansy.mai,
tao.zhang}@nist.gov. R.J. La is with NIST and the University of Maryland, College Park, MD 20742, USA. Email: hyongla@umd.edu. Any mention of
commercial products in this paper is for information only; it does not imply any recommendation or endorsement by NIST. U.S. Government work not
protected by U.S. copyright
arXiv:2210.02614v4 [cs.LG] 15 Aug 2023
Lifelong learning techniques: These techniques treat the learning at each client as a separate task and learn these tasks
sequentially using a single model without forgetting the previously learned tasks [13].
Motivation: Our main observation is that these existing FL algorithms either do not consider the central server as a learner or assume that the server has no training data. In practice, however, the server can and often does have access to some training data.
For example, the server may receive data from sensors and testing devices that do not participate in the learning process. It
may have synthetic data obtained from simulation (or emulation) and digital twins. The server may also receive some raw data
directly from the clients; this is often required to, for example, support system monitoring and diagnosis.
Consider again AVs, as an example, which need ML models to recognize objects. Today, two main sources of data are used
to train and test such models. First, test vehicles are used to scout selected areas to collect real-world data. Note that this
typically raises no privacy concerns. It, however, may require large fleets of test vehicles, take years to accomplish, incur
heavy costs, and yet still fail to collect enough data to cover the vast range of possible learning needs [31]. Therefore, the AV
industry is increasingly relying on a second source of data – synthetic data, typically generated in the cloud – to extend model
training and testing scopes. Going forward, when some AVs participate in FL, a small fleet of test vehicles, which may not all
participate in FL, can still be used to collect and send data to the server to complement the data that the FL clients can collect.
Sharing a common IID training dataset with all clients (so that each client will train its local model on its local data plus
this common dataset) has been shown to improve FL performance with non-IID data [13], [17], [33]. But this method, which we refer to as FL with data sharing or simply data sharing, also increases clients' workload, making it less suitable for
resource-constrained devices. More importantly, it is often impractical for clients to share data with each other due to privacy
concerns, network bandwidth constraints, and latency requirements. We will show that it is unnecessary to share such common
datasets among clients, as comparable or better performance can be achieved by having the server learn from the same dataset.
Several recent works have considered server learning with some centralized data, e.g., hybrid training [27], mixed FL [1],
and FL with server learning [20]. However, [27] analyzes only the case where both clients’ data and server data are IID and
their algorithm requires all clients to participate in every round. Similarly, [1] assumes IID client data and considers the server's role as a regularizer. In contrast, [20] focuses on FL with non-IID client data. In this paper, we build upon our work in [20]
to study the idea of using server learning to enhance FL on non-IID data and provide both analytical and experimental results
showing that this approach can be effective under certain conditions. Therefore, the primary focus of our study and reported
analysis are fundamentally different from those in [1] and [27].
Contributions: We consider a new FL algorithm that incorporates server learning to improve performance on non-IID
data. Specifically, the server collects a small amount of data, learns from it, and distills the knowledge into the global model
incrementally during the FL process. We refer to this method as Federated Learning with Server Learning (FSL). Our main
contributions can be summarized as follows:
Through our analysis and experimental studies, we show that FSL can significantly improve the performance in both final
accuracy and convergence time when clients have non-IID data. Also, only a small amount of data is needed at the server for
FSL to improve performance, even when the server data distribution deviates from that of the aggregated data stored at the
clients. As expected, the training performance improves as such distribution divergence diminishes.
By incorporating server learning with FL in an incremental fashion, we will demonstrate that FSL significantly accelerates
the learning process when the current model is far from any (locally) optimal model.
FSL is simple and can be tuned relatively easily, even when the server dataset is relatively small. Compared to FL,
FSL adds only a local learning component to the server and does not affect the clients. Thus, FSL has the same per-round
communication overhead as FL while, in practice, requiring only one additional parameter to be tuned, namely the weight given to the server's loss function. Our experimental studies show that the performance improvement of FSL remains significant for a
relatively large range of this weight.
In our experiments, FSL consistently outperforms the data sharing method in [33], suggesting that sharing common datasets
with clients might be unnecessary. We also demonstrate that by employing a small amount of data from either a few clients
or other data sources (including synthetic data) for server learning, FSL can achieve similar (and often better) performance
compared to FedDyn [7] and SCAFFOLD [14], while enjoying significantly faster learning at the beginning.
Preliminary results of this paper appeared in [20], where only the main algorithm and limited experimental results using
IID server data were reported. In this paper, we provide a theoretical analysis of FSL and more extensive experimental
evaluations, including a comparison with the SCAFFOLD algorithm using non-IID server data.
The rest of the paper is organized as follows. The problem formulation and our algorithm are given in § II. Main convergence
results are presented in § III, followed by experimental evaluations in § IV. Conclusions are given in § V. All the proofs and
additional numerical results can be found in our technical report in Appendices A and B, respectively.
Notation: For each integer $n > 0$, we use $[n]$ to denote the set $\{1, \dots, n\}$. For a finite set $\mathcal{D}$, $|\mathcal{D}|$ denotes its cardinality. For any vector $x$, $\|x\|$ denotes its 2-norm. We denote by $\langle x, y \rangle$ the inner product of two vectors $x$ and $y$. A function $f : \mathcal{D} \to \mathbb{R}$ is said to be smooth with parameter $L$, or simply $L$-smooth, if $f(x) - f(y) - \langle \nabla f(y), x - y \rangle \le \frac{L}{2}\|x - y\|^2$ for all $x, y \in \mathcal{D}$. For a random variable $X$, we use both $E[X]$ and $EX$ to denote its expected value.
II. PROBLEM FORMULATION AND OUR APPROACH
In this section, we first present our problem formulation in connection with the data sharing approach, and then delineate
the FSL algorithm aimed at coping with non-IID data.
A. Problem Formulation
Consider the following ML problem in which we train a model to minimize an empirical loss:
$$\min_{x\in\mathbb{R}^d} F(x) \triangleq \frac{1}{n}\sum_{i\in[n]} \ell(x, s_i), \qquad (1)$$
where $x \in \mathbb{R}^d$ is the vector of model parameters that need to be learned, $\mathcal{D} = \{s_1, \dots, s_n\}$ is the set of training samples, and $\ell(x, s_i)$ is the loss for sample $s_i$ under model $x$.
In FL, the goal remains the same, which is to minimize the total loss, but training data are distributed at multiple clients. Suppose that there are $N$ clients and the dataset $\mathcal{D}$ is partitioned into $\{\mathcal{D}_1, \mathcal{D}_2, \dots, \mathcal{D}_N\}$, where $\mathcal{D}_i$ is the local dataset at client $i$. For each $i \in [N]$, define $n_i := |\mathcal{D}_i|$ and $f_i(x) := \frac{1}{n_i}\sum_{s\in\mathcal{D}_i} \ell(x, s)$ to be the loss function of client $i$ over its own dataset $\mathcal{D}_i$ under model $x$. Then, problem (1) can be reformulated as follows with $p_i = \frac{n_i}{n}$ for all $i \in [N]$:
$$\min_{x\in\mathbb{R}^d} F(x) = \sum_{i\in[N]} p_i f_i(x). \qquad (2)$$
Suppose that the server also has access to a dataset $\mathcal{D}_0$ with $n_0 = |\mathcal{D}_0|$. In the algorithm of [33], a subset of samples in $\mathcal{D}_0$ is shared with all clients and is not utilized by the server. Each client $i$ implements the conventional FL algorithm using the augmented dataset $\mathcal{D}'_i = \mathcal{D}_i \cup \mathcal{D}_0$.¹ Under such data sharing, the optimization problem in (1) is modified as follows to reflect the change in clients' datasets:
$$\min_{x\in\mathbb{R}^d} F'(x) = \frac{1}{n + N n_0}\Big(\sum_{s\in\mathcal{D}} \ell(x, s) + N \sum_{s\in\mathcal{D}_0} \ell(x, s)\Big).$$
Similar to (2), this problem can be rewritten using the weighted sum of clients' loss functions as follows:
$$\min_{x\in\mathbb{R}^d} F'(x) = \sum_{i\in[N]} p'_i f'_i(x), \qquad (3)$$
where $f'_i(x) = \frac{1}{n_i + n_0}\sum_{s\in\mathcal{D}'_i} \ell(x, s)$ is the modified loss of client $i$, and $p'_i = \frac{n_i + n_0}{n + N n_0}$ is the corresponding weight.
Define $f_0 := \frac{1}{n_0}\sum_{s\in\mathcal{D}_0} \ell(x, s)$ to be the loss function for the samples in $\mathcal{D}_0$. Using the definition of $F$ in (1), the new objective function $F'$ can be rewritten as
$$F' = \frac{n}{n + N n_0}\Big(F + \frac{N n_0}{n} f_0\Big). \qquad (4)$$
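Equation (4) follows directly by substituting the definitions of $p'_i$ and $f'_i$ into (3) and regrouping terms:
$$F'(x) = \sum_{i\in[N]} \frac{n_i + n_0}{n + N n_0}\cdot\frac{1}{n_i + n_0}\sum_{s\in\mathcal{D}'_i}\ell(x, s) = \frac{1}{n + N n_0}\sum_{i\in[N]}\Big(\sum_{s\in\mathcal{D}_i}\ell(x, s) + \sum_{s\in\mathcal{D}_0}\ell(x, s)\Big) = \frac{n F(x) + N n_0 f_0(x)}{n + N n_0},$$
which is (4).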
This tells us that the above data sharing method alters the objective function by adding the loss function $f_0$ for the shared samples with a weight of $\frac{N n_0}{n}$. It also suggests that the quality of the solution obtained from (3), relative to the original problem in (2), depends on how similar $F$ and $f_0$ are: when $F = f_0$, the two problems become equivalent. More importantly,
it shows that sharing the samples in $\mathcal{D}_0$ with clients may be unnecessary; instead, the server can learn from $\mathcal{D}_0$ and combine its learned model with clients' models in a federated fashion. Having the server learn, rather than sharing training samples among the clients, avoids practical issues such as extra communication overheads, long and unpredictable network delays, and privacy concerns. It also allows us to choose the weight for $f_0$, which we denote by $\gamma$, to be different from $\frac{N n_0}{n}$, based on the quality of $\mathcal{D}_0$. This leads to the following (centralized) optimization problem:
$$\min_{x\in\mathbb{R}^d} F + \gamma f_0. \qquad (5)$$
Note that our problem formulation above can be generalized to the case with expected losses as follows:
$$\min_{x\in\mathbb{R}^d} \sum_{i\in[N]} p_i f_i(x) + \gamma f_0(x), \qquad (6)$$
where $p = (p_i;\, i \in [N])$ is a probability vector, $f_i(x) = \mathbb{E}_{z\sim\mathcal{D}_i}[f_i(x; z)]$ is the expected loss function of the server ($i = 0$) and each client $i \in [N]$, and $\mathcal{D}_i$ is the corresponding data distribution. In what follows, we will use (5) to facilitate our discussion and emphasize that our analysis applies directly to (6).
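As a small illustration of the weighting in (5), the sketch below evaluates the combined objective directly from the datasets. The quadratic per-sample loss and the random "samples" in the usage lines are placeholders of our own choosing, since the formulation does not fix a particular loss or data type.

```python
import numpy as np

def empirical_loss(w, dataset, loss_fn):
    """Average per-sample loss over a dataset: f_i for a client, or f_0 for the server."""
    return float(np.mean([loss_fn(w, s) for s in dataset]))

def fsl_objective(w, client_datasets, server_dataset, loss_fn, gamma):
    """Objective in (5): F(w) + gamma * f_0(w), with F(w) = sum_i (n_i / n) f_i(w)."""
    n = sum(len(D) for D in client_datasets)
    F = sum(len(D) / n * empirical_loss(w, D, loss_fn) for D in client_datasets)
    f0 = empirical_loss(w, server_dataset, loss_fn)
    return F + gamma * f0

# usage with a placeholder quadratic loss and random 3-dimensional "samples"
rng = np.random.default_rng(0)
clients = [rng.normal(size=(m, 3)) for m in (30, 50, 20)]
server = rng.normal(size=(10, 3))
value = fsl_objective(np.zeros(3), clients, server,
                      loss_fn=lambda w, s: 0.5 * np.sum((w - s) ** 2), gamma=0.3)
```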
¹ For simplicity, we either assume that $\mathcal{D}_i \cap \mathcal{D}_0 = \emptyset$ or consider any dataset as a multiset, allowing for possible multiple instances for each of its elements. Thus, we can write $|\mathcal{D}'_i| = |\mathcal{D}_i| + |\mathcal{D}_0|$.
B. FSL Algorithm
We assume that the server has access to dataset D0and will augment FL with what the server learns over D0. As stated
earlier, we refer to this approach as Federated Learning with Server Learning or FSL.
There are several ways to incorporate server learning (SL) into FL. One is to treat the server as a regular client that
participates in every round of the FL process [29]: During each global round, the server updates the current global model using D0
and then aggregates it with the updated models reported by the clients. We call this approach non-incremental SL. One issue
with non-incremental SL is that the weight for the server would be very small when $n_0 \ll n$, which means that the server's
contributions, based on its learning from D0, to the global model will be minor. Moreover, this approach fails to exploit the
good quality of D0, especially when its distribution is close to that of D. This issue can be partially alleviated by increasing
the weight given to the server’s model in the aggregation step.
These observations motivate us to consider an incremental learning scheme in which the server performs additional learning over dataset $\mathcal{D}_0$ based on the aggregated model, as shown in Algorithm 1 below in more detail. In particular, lines 1–9 of Algorithm 1 are the same as in a conventional FL algorithm [22], where in each global round $t$, each selected client $i$ (1) receives the current global model $x_t$ from the server, (2) performs $K$ steps of the Stochastic Gradient Descent (SGD) algorithm using its local data $\mathcal{D}_i$ (LOCALSGD) with learning rate $\eta_l$, and (3) returns to the server its update $\Delta_t^{(i)}$. The server then combines its current model $x_t$ with the updates from the clients using some weight $\eta_g > 0$ (lines 8–9). It then uses the resulting updated model to learn locally by performing $K_0$ steps of LOCALSGD with learning rate $\gamma\eta_0$ (line 10). As one can see, our approach has the same computation and communication costs at the clients as the usual FL framework.
Note that our algorithm is similar to the incremental (stochastic) gradient method, which has been shown to be much faster
than the non-incremental gradient method when the model is far from a (locally) optimal point [2]. While FL with local SGD
also works in an incremental fashion, it often needs small learning rates, hence longer learning times, to ensure convergence
when the distribution of clients’ data is heterogeneous.
Before presenting a formal analysis and experimental results, let us provide some insights into FSL. First, when the distributions of $\mathcal{D}_0$ and $\mathcal{D}$ are close, the server's loss function $f_0$ will be similar to the overall loss function $F$ in (1). Consequently, if the current model is far from an optimal point, the gradient $\nabla f_0$ will track the global gradient $\nabla F$, even when individual clients' gradients $\nabla f_i$ do not follow $\nabla F$ closely. Therefore, when the updated model obtained by aggregating clients' updated models does not make (much) progress, $\nabla f_0$ will help improve the updated model. In fact, it turns out that significant improvements can still be achieved even when the distributions of $\mathcal{D}_0$ and $\mathcal{D}$ are not very similar, as long as their difference is small in relation to the non-IIDness of clients' data. We will elaborate on these points in the following section.
Algorithm 1: FSL: FL with Server Learning
Server:
1:  initialize $x_0$, $K$, $K_0$, $\eta_l$, $\eta_g$, $\gamma\eta_0$
2:  for $t = 0, \dots, T-1$ do
3:      sample a subset $\mathcal{S}$ of clients (with $|\mathcal{S}| = S$)
4:      broadcast $x_t$ to clients in $\mathcal{S}$
5:      forall clients $i \in \mathcal{S}$ do
6:          $x^{(i)}_{t,K} = \text{LOCALSGD}(x_t, \eta_l, K, \mathcal{D}_i)$
7:          upload to server: $\Delta^{(i)}_t = x^{(i)}_{t,K} - x_t$
8:      $\Delta_t = \frac{1}{|\mathcal{S}|}\sum_{i\in\mathcal{S}} \Delta^{(i)}_t$
9:      $\bar{x}_t = x_t + \eta_g \Delta_t$
10:     $x_{t+1} = \text{LOCALSGD}(\bar{x}_t, \gamma\eta_0, K_0, \mathcal{D}_0)$
LOCALSGD$(x, \eta, K, \mathcal{D}_i)$:
11: $y_0 = x$
12: for $k = 0, \dots, K-1$ do
13:     compute an unbiased estimate $g(y_k)$ of $\nabla f_i(y_k)$
14:     $y_{k+1} = y_k - \eta\, g(y_k)$
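Below is a compact NumPy sketch of Algorithm 1 under simplifying assumptions of our own: a generic per-sample gradient oracle `grad_fn`, a fixed mini-batch size inside LOCALSGD, sequential rather than parallel client updates, and arbitrary default step sizes. It is meant only to make the control flow concrete, not to serve as a reference implementation.

```python
import numpy as np

rng = np.random.default_rng(0)

def local_sgd(x, eta, K, data, grad_fn, batch_size=32):
    """LOCALSGD(x, eta, K, D): K steps of SGD with unbiased mini-batch gradients."""
    y = x.copy()
    for _ in range(K):
        idx = rng.choice(len(data), size=min(batch_size, len(data)), replace=False)
        g = np.mean([grad_fn(y, data[j]) for j in idx], axis=0)  # estimate of grad f_i(y)
        y -= eta * g
    return y

def fsl(x0, client_data, server_data, grad_fn, T, S, K, K0,
        eta_l=0.05, eta_g=1.0, gamma_eta0=0.05):
    """FSL (Algorithm 1): FedAvg-style rounds followed by incremental server learning."""
    x = x0.copy()
    for t in range(T):
        clients = rng.choice(len(client_data), size=S, replace=False)    # line 3
        deltas = []
        for i in clients:                                                # lines 5-7
            x_iK = local_sgd(x, eta_l, K, client_data[i], grad_fn)
            deltas.append(x_iK - x)
        x_bar = x + eta_g * np.mean(deltas, axis=0)                      # lines 8-9
        x = local_sgd(x_bar, gamma_eta0, K0, server_data, grad_fn)       # line 10: server learning
    return x
```

Note that the server-side call tagged "line 10" is the only addition relative to a FedAvg-style loop; the client-side code and the per-round communication pattern are unchanged, consistent with the discussion of per-round overhead above.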
III. CONVERGENCE RESULTS
We first show in subsection III-A that SL can be viewed as a correction step for FL in the case of non-IID training data.
Then the main convergence results are provided in subsection III-B.
A. SL as Corrections to FL When Far from Convergence
In order to simplify the discussion in this subsection, which provides the key intuition behind our approach, assume that the server can compute the gradient $\nabla f_0(x)$ for all $x \in \mathbb{R}^d$, and consider the usual gradient descent (GD) method for SL. First, consider a single update carried out by the server using GD, starting with some model $w_0$, i.e., $w_1 = w_0 - \eta_0 \nabla f_0(w_0)$. Suppose that $\nabla F$ is Lipschitz continuous with parameter $L$.² Then,
$$F(w_1) - F(w_0) \le \langle \nabla F(w_0),\, w_1 - w_0\rangle + 0.5 L \|w_1 - w_0\|^2 \le -\eta_0 \langle \nabla F(w_0),\, \nabla f_0(w_0)\rangle + 0.5 L \eta_0^2 \|\nabla f_0(w_0)\|^2. \qquad (7)$$
The above inequality indicates that SL can improve FL further when the second term in (7) is sufficiently negative so that
$$2\langle \nabla F(w_0),\, \nabla f_0(w_0)\rangle > L \eta_0 \|\nabla f_0(w_0)\|^2. \qquad (8)$$
This condition holds when $\nabla f_0(w_0)$ makes an acute angle with $\nabla F(w_0)$ (provided that $\|\nabla F(w_0)\| > 0$), in which case progress can be made by using a sufficiently small step size $\eta_0$. This will likely be the case when the distribution of $\mathcal{D}_0$ is similar to that of $\mathcal{D}$ and, when $w_0$ is far from a (local) minimizer, $-\nabla f_0(w_0)$ will likely be a descent direction of $F$ at $w_0$.
In order to further see the role of $\mathcal{D}_0$, let us rewrite condition (8) as follows:
$$\|\nabla F(w_0)\|^2 + (1 - L\eta_0)\|\nabla f_0(w_0)\|^2 > \|\nabla F(w_0) - \nabla f_0(w_0)\|^2. \qquad (9)$$
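To see how (9) follows from (8), expand the squared difference:
$$\|\nabla F(w_0) - \nabla f_0(w_0)\|^2 = \|\nabla F(w_0)\|^2 - 2\langle \nabla F(w_0),\, \nabla f_0(w_0)\rangle + \|\nabla f_0(w_0)\|^2,$$
so that $2\langle \nabla F(w_0), \nabla f_0(w_0)\rangle = \|\nabla F(w_0)\|^2 + \|\nabla f_0(w_0)\|^2 - \|\nabla F(w_0) - \nabla f_0(w_0)\|^2$; substituting this into (8) and rearranging yields (9).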
This implies the following. First, the error $\|\nabla F(w_0) - \nabla f_0(w_0)\|^2$ in general depends on the relationship between the server's dataset $\mathcal{D}_0$ and the aggregate dataset $\mathcal{D}$; the more dissimilar $\mathcal{D}_0$ is to $\mathcal{D}$, the larger the error and thus the smaller the improvement. In fact, SL can have a negative impact if the error is sufficiently large. This suggests that the server dataset should be selected carefully in order to maximize the benefits of SL. As an example, consider $\mathcal{D}_0$ consisting of IID samples. In this case, the error $\|\nabla F(w_0) - \nabla f_0(w_0)\|^2$ tends to decrease with the size of $\mathcal{D}_0$ according to (sampling without replacement)
$$\mathbb{E}_{\mathcal{D}_0}\|\nabla f_0(x) - \nabla F(x)\|^2 = \Big(\frac{n}{n_0} - 1\Big)\frac{\tilde{\sigma}_0^2(x)}{n - 1}, \qquad (10)$$
where $\tilde{\sigma}_0^2(x) = \frac{1}{n}\sum_{s\in\mathcal{D}} \|\nabla_x \ell(x, s) - \nabla F(x)\|^2$ is the population variance (a small numerical check of (10) is sketched below). Thus, condition (9) can be satisfied by increasing $n_0$.
Second, for a fixed $\mathcal{D}_0$ (of reasonable quality), the inequality in (9) holds when $\|\nabla F(w_0)\|$ is large, i.e., $w_0$ is far from being a stationary point, which is expected at the beginning of the training process. This is true even when $\nabla f_0(x)$ is a biased estimate of $\nabla F(x)$, as long as $\|\nabla F(w_0) - \nabla f_0(w_0)\|^2$ is strictly smaller than $\|\nabla F(w_0)\|^2 + \|\nabla f_0(w_0)\|^2$, i.e., the angle between the gradients is acute as mentioned earlier, for a sufficiently small step size $\eta_0$. Third, when $\|\nabla f_0(w_0)\|$ is sufficiently small, e.g., when overfitting happens at the server, the improvement by SL is also insignificant. Finally, when $w_0$ is near a stationary point of $F$ but far from that of $f_0$, i.e., $\|\nabla F(w_0)\| \ll \|\nabla f_0(w_0)\|$, the inequality in (9) may be reversed, in which case SL can impair FL, pushing the model toward the server's local stationary points. In this case, our algorithm does not yield exact convergence but oscillates between stationary points of $F$ and $f_0$, which is expected for an incremental gradient method [2]. Such convergence will be analyzed in detail in the next subsection.
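The finite-population identity (10) is easy to check numerically. The sketch below treats per-sample gradients at a fixed model as synthetic random vectors (an assumption purely for illustration, since no real model is trained here) and compares a Monte Carlo estimate of the left-hand side of (10) with its right-hand side.

```python
import numpy as np

rng = np.random.default_rng(0)

# Per-sample gradients at a fixed model x, represented as synthetic vectors.
# (In reality these would come from backpropagation over D; here they are random.)
n, n0, d, trials = 1000, 50, 20, 5000
G = rng.normal(size=(n, d))                      # one row per sample s in D
grad_F = G.mean(axis=0)                          # grad F(x): average over all of D
pop_var = np.sum((G - grad_F) ** 2) / n          # population variance sigma~_0^2(x)

# Monte Carlo estimate of E_{D0} || grad f0(x) - grad F(x) ||^2 when D0 is a
# uniform sample of size n0 drawn from D without replacement.
err = 0.0
for _ in range(trials):
    idx = rng.choice(n, size=n0, replace=False)
    err += np.sum((G[idx].mean(axis=0) - grad_F) ** 2)
err /= trials

rhs = (n / n0 - 1) * pop_var / (n - 1)           # right-hand side of (10)
print(f"Monte Carlo: {err:.4f}   formula (10): {rhs:.4f}")
```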
The above analysis also applies when the server performs multiple updates. In particular, suppose that the server performs $K_0$ updates of the model using the GD method with a fixed step size $\eta_0$:
$$w_{t,k+1} = w_{t,k} - \eta_0 \nabla f_0(w_{t,k}), \quad k = 0, \dots, K_0 - 1,$$
with $w_{t,0} = \bar{x}_t$ and $x_{t+1} = w_{t,K_0}$. Then, repeating the steps above and summing over the iterations, we obtain
$$F(x_{t+1}) - F(\bar{x}_t) \le -0.5\eta_0 \sum_{k=0}^{K_0-1} \big(\|\nabla F(w_{t,k})\|^2 + (1 - L\eta_0)\|\nabla f_0(w_{t,k})\|^2\big) + 0.5\eta_0 \sum_{k=0}^{K_0-1} \|\nabla F(w_{t,k}) - \nabla f_0(w_{t,k})\|^2.$$
Similarly to the single-update case discussed earlier, we can see that carrying out multiple updates at the server is beneficial when $w_{t,k}$ is far from being a stationary point of either $F$ or $f_0$, more precisely, when $\|\nabla F(w_{t,k})\|^2 + (1 - L\eta_0)\|\nabla f_0(w_{t,k})\|^2 > \|\nabla F(w_{t,k}) - \nabla f_0(w_{t,k})\|^2$. This also suggests that when learning collaboratively with clients, the server should not overfit its own data, which could happen easily when $n_0$ is small.
B. Convergence Analysis
In this subsection, we study the convergence properties of FSL. Specifically, we will prove that, under suitable conditions
on step sizes, FSL converges to a neighborhood of a stationary point of the following modified loss function
$$\tilde{F} = \frac{1}{1+\gamma} F + \frac{\gamma}{1+\gamma} f_0$$
² This assumption is standard in FL and often holds when training neural networks. We will state this assumption formally in Section III-B.