Federated Learning Using Variance Reduced Stochastic Gradient for Probabilistically Activated Agents
Mohammadreza Rostami and Solmaz S. Kia, Senior Member, IEEE
Abstract
This paper proposes an algorithm for Federated Learning (FL) with a two-layer structure that achieves both variance reduction and a faster convergence rate to an optimal solution, in the setting where each agent has an arbitrary probability of selection at each iteration. FL is a practical tool for distributed machine learning when privacy matters. However, when FL operates over irregular connections between agents (devices), reaching a trained model both economically and quickly can be demanding. The first layer of our algorithm corresponds to the propagation of the model parameters across agents, carried out by the server. In the second layer, each agent performs its local update with a stochastic, variance-reduced technique called Stochastic Variance Reduced Gradient (SVRG). We leverage the concept of variance reduction from stochastic optimization in the agents' local update step, in order to reduce the variance caused by stochastic gradient descent (SGD). We provide a convergence bound for our algorithm which improves the rate from $O(1/\sqrt{K})$ to $O(1/K)$ by using a constant step size. We demonstrate the performance of our algorithm using numerical examples.
I. INTRODUCTION
In recent years, with the technological advances in modern smart devices, each phone, tablet, or smart home system generates and stores an abundance of data which, if harvested collaboratively with other users' data, can lead to learning models that support many intelligent applications such as smart health and image classification [1], [2]. Traditional machine learning approaches require centralizing the training data on one machine, in the cloud, or in a data center. However, the data collected on modern smart devices are often of a sensitive nature, which discourages users from relying on centralized solutions. Federated Learning (FL) [3], [4] has been proposed to decouple the ability to do machine learning from the need to store the data in a centralized location. The idea of Federated Learning is to enable smart devices to collaboratively learn a shared prediction model while keeping all the training data on the device.
Figure 1 shows a schematic representation of an FL architecture. In FL, collaborative learning without data sharing is accomplished by each agent receiving the current model weights from the server. Then, each participating learner separately updates the model by implementing stochastic gradient descent (SGD) [5] on its own locally collected data set. The participating agents then send their locally computed model weights to a server/aggregator, which often combines the models through simple averaging, as in FedAvg [4], before sending the result back to the agents. The process repeats until a satisfactory model is obtained. Federated learning relies heavily on communication between learner agents (clients) and a moderating server. Engaging all the clients in the learning procedure at each time step of the algorithm results in a huge communication cost. On the other hand, poor channel quality and intermittent connectivity can completely derail training. For resource management, in the original popular FL algorithms such as FedAvg [4], at each round of the algorithm a batch of agents is selected uniformly at random to receive the updated model weights and perform local learning. FedAvg and similar FL algorithms come with convergence guarantees [6]–[9] under the assumption that the randomly selected agents are available at each round. In practice, however, due to factors such as energy and time constraints, agents' availability is not guaranteed at all times. Thus, some work has been done to address this problem via device scheduling [10]–[14]. Nevertheless, an agent's availability can be a function of unforeseen factors such as communication channel quality, and thus is not deterministic or known in advance.
The authors are with the Department of Mechanical and Aerospace Engineering, University of California Irvine, Irvine, CA
92697, {mrostam2,solmaz}@uci.edu. This work was supported by NSF, under CAREER Award ECCS-1653838.
arXiv:2210.14362v2 [cs.LG] 1 Apr 2023
Fig. 1: Federated Learning structure with non-uniform probability of agent selection in each iteration.
To understand the effect of agents' stochastic availability on FL, recent work such as [15] proposed to move from random batch selection to an FL model where the agents' availability and participation at each round are probabilistic, see Fig. 1. In this paper, we adopt this newly proposed framework and contribute an algorithm that achieves faster convergence and lower error covariance. Our focus is on incorporating a variance reduction procedure into the local SGD step of the participating learner agents at each round. The randomness in SGD algorithms induces variance in the gradient, which leads to a decaying learning rate and a sub-linear convergence rate. Thus, there has been an effort to reduce the variance of the stochastic gradient, which resulted in the so-called Stochastic Variance Reduced Gradient (SVRG) methods. It has been shown that SVRG allows using a constant learning rate and results in linear convergence in expectation.
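Concretely, for a snapshot point $\tilde{w}$ and a uniformly sampled data index $i$, the SVRG gradient estimator can be written as follows (standard SVRG notation, not necessarily the notation used later in this paper):
$$v = \nabla f_i(w) - \nabla f_i(\tilde{w}) + \nabla f(\tilde{w}), \qquad \mathbb{E}_i[v] = \nabla f(w),$$
so the estimator is unbiased, and since the two correction terms shrink together as $w$ and $\tilde{w}$ approach a minimizer, its variance vanishes, which is what permits a constant learning rate.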
In this paper, we incorporate an SVRG approach into an FL algorithm where agents' participation in the update process at each round is probabilistic and non-uniform. Through rigorous analysis, we show that the proposed algorithm has a faster convergence rate. In particular, we show that our algorithm achieves practical convergence in expectation with a rate of $O(1/K)$, which is an improvement over the sublinear rate of $O(1/\sqrt{K})$ in [15]. We demonstrate the effectiveness of our proposed algorithm through a set of numerical studies and by comparing the rate of convergence, the covariance, and the circular error probable (CEP) measure. Our results show that our algorithm drastically improves the convergence guarantees, thanks to the decrease in variance, which results in faster convergence.
Organization: Section II introduces our basic notation and presents some properties of smooth functions. Section III presents the problem formulation and the structure behind it. Section IV presents the proposed algorithm and its scheme. Section V contains our convergence analysis for the proposed algorithm and provides its convergence rate. Section VI presents simulations, and Section VII gathers our conclusions and ideas for future work. For the convenience of the reader, we provide some of the proofs in the Appendix.
II. PRELIMINARIES
In this section, we introduce the notation and terminology used throughout the paper. We let $\mathbb{R}$, $\mathbb{R}_{>0}$, and $\mathbb{R}_{\geq 0}$ denote the set of real, positive real, and non-negative real numbers, respectively. For $x \in \mathbb{R}$, $|x|$ is its absolute value. For $x \in \mathbb{R}^d$, $\|x\| = \sqrt{x^\top x}$ denotes the standard Euclidean norm. We let $\langle \cdot, \cdot \rangle$ denote the inner product of two vectors $x, y \in \mathbb{R}^d$. A differentiable function $f : \mathbb{R}^d \to \mathbb{R}$ is Lipschitz with constant $L \in \mathbb{R}_{>0}$, or simply $L$-Lipschitz, over a set $\mathcal{C} \subseteq \mathbb{R}^d$ if and only if $\|f(x) - f(y)\| \leq L\|x - y\|$ for $x, y \in \mathcal{C}$. Furthermore, if the gradient $\nabla f$ is $L$-Lipschitz, we have $f(y) \leq f(x) + \nabla f(x)^\top (y - x) + \frac{L}{2}\|y - x\|^2$ for all $x, y \in \mathcal{C}$ [16]. Lastly, we recall Jensen's inequality, which states [17]:
$$\Big\| \frac{1}{N} \sum\nolimits_{n=1}^{N} x_n \Big\|^2 \leq \frac{1}{N} \sum\nolimits_{n=1}^{N} \|x_n\|^2. \tag{1}$$
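Inequality (1) follows by applying Jensen's inequality to the convex function $\phi(x) := \|x\|^2$ and the uniform average of $x_1, \dots, x_N$:
$$\Big\|\frac{1}{N}\sum\nolimits_{n=1}^{N} x_n\Big\|^2 = \phi\Big(\frac{1}{N}\sum\nolimits_{n=1}^{N} x_n\Big) \leq \frac{1}{N}\sum\nolimits_{n=1}^{N} \phi(x_n) = \frac{1}{N}\sum\nolimits_{n=1}^{N} \|x_n\|^2.$$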
III. PROBLEM STATEMENT
Fig. 2: SVRG update steps.

This section formalizes the problem of interest. Consider a set of $N$ agents (clients) that communicate with a server to learn the parameters of a model that they want to fit to their collective data set. Each agent has its own local data, which can be distributed either uniformly or non-uniformly. The learning objective is to obtain the learning function weights $\theta \in \mathbb{R}^d$ from
$$\min_{\theta \in \mathbb{R}^d} f(\theta) := \frac{1}{N} \sum_{n=1}^{N} f_n(\theta), \qquad f_n(\theta) = \frac{1}{L_n} \sum_{i=1}^{L_n} f_{n,i}(\theta), \tag{2}$$
where $f_n$ is a possibly convex or non-convex local learning loss function. At each agent $n \in \{1, \cdots, N\}$, $f_n$ depends on the training data set $\{(q_{n,i}, \hat{y}_{n,i})\}_{i=1}^{L_n} \subset \mathbb{R}^{1 \times d} \times \mathbb{R}$ (supervised learning). Examples include [18]
• square loss: $f_{n,i}(\theta) = \|\hat{y}_{n,i} - q_{n,i}\theta\|^2$,
• log loss: $f_{n,i}(\theta) = \log(1 + e^{-\hat{y}_{n,i} q_{n,i}\theta})$.
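As a concrete illustration of (2) and the two losses above, the following is a minimal NumPy sketch; the helper names (`square_loss`, `log_loss`, `local_objective`, `global_objective`) and array shapes are our own illustrative choices, not notation from the paper.

```python
import numpy as np

def square_loss(theta, q, y_hat):
    # f_{n,i}(theta) = ||y_hat - q @ theta||^2 for one sample (q: length-d row).
    r = y_hat - q @ theta
    return float(r ** 2)

def log_loss(theta, q, y_hat):
    # f_{n,i}(theta) = log(1 + exp(-y_hat * q @ theta)), with y_hat in {-1, +1}.
    return float(np.log1p(np.exp(-y_hat * (q @ theta))))

def local_objective(theta, Q_n, Y_n, loss):
    # f_n(theta): average of the per-sample losses over agent n's L_n samples.
    return np.mean([loss(theta, q, y) for q, y in zip(Q_n, Y_n)])

def global_objective(theta, data, loss):
    # f(theta) = (1/N) * sum_n f_n(theta), with data = [(Q_n, Y_n)] over N agents.
    return np.mean([local_objective(theta, Q, Y, loss) for Q, Y in data])
```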
Assumption 1 (Assumption on L-smoothness of local cost functions). The local loss functions have $L$-Lipschitz gradients, i.e., for any agent $n \in \{1, \cdots, N\}$ we have
$$\|\nabla f_n(\theta) - \nabla f_n(\bar{\theta})\| \leq L \|\theta - \bar{\theta}\| \tag{3}$$
for any $\theta, \bar{\theta} \in \mathbb{R}^d$ and $L \in \mathbb{R}_{>0}$.
This assumption is technical and common in the literature.
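For example, the square loss above satisfies this assumption with a constant determined by the data: for a single sample $(q_{n,i}, \hat{y}_{n,i})$,
$$\nabla f_{n,i}(\theta) = -2\, q_{n,i}^\top (\hat{y}_{n,i} - q_{n,i}\theta), \qquad \|\nabla f_{n,i}(\theta) - \nabla f_{n,i}(\bar{\theta})\| = 2\,\|q_{n,i}^\top q_{n,i} (\theta - \bar{\theta})\| \leq 2\,\|q_{n,i}\|^2 \|\theta - \bar{\theta}\|,$$
so $L = 2\|q_{n,i}\|^2$ suffices for that sample.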
Problem (2) should be solved in the framework of FL, in which agents maintain their local data and only interact with the server to update their local learning parameter vector based on global feedback provided by the server. The server generates this global feedback from the local weights it receives from the agents. In our setting, at each round $k$ of the FL algorithm, each agent $n \in \{1, \cdots, N\}$ becomes active to perform local computations and connect to the server with probability $p_n^k$. We denote the active state by $\mathbb{1}_n^k \in \{0, 1\}$; thus,
$$p_n^k = \mathrm{Prob}(\mathbb{1}_n^k = 1).$$
The activation probability at each round can be different.
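As an illustration of this activation model, the active set at round $k$ can be simulated by drawing each indicator $\mathbb{1}_n^k$ as an independent Bernoulli trial; this is only a sketch, and the probabilities below are arbitrary example values.

```python
import numpy as np

rng = np.random.default_rng(0)

def sample_active_set(p_k):
    # p_k[n] = Prob(1_n^k = 1): per-agent activation probability at round k.
    indicators = rng.random(len(p_k)) < np.asarray(p_k)
    return np.flatnonzero(indicators)  # indices of agents with 1_n^k = 1

# Example: N = 5 agents with non-uniform, round-dependent probabilities.
p_k = [0.9, 0.5, 0.2, 0.7, 0.4]
A_k = sample_active_set(p_k)  # e.g., array([0, 1, 3])
```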
IV. FEDERATED LEARNING WITH VARIANCE REDUCTION
To solve (2), we design the FedAvg-SVRG Algorithm 1, which has a two-layer structure. In this algorithm, each agent has its own probability of being active or passive in each round, denoted by $p_n^k$ for agent $n$ at iteration $k$.
Algorithm 1 is initialized with $\theta^0$ by the server. We denote the number of FL iterations by $K$. At each round $k \in \{1, \cdots, K\}$, the set of active agents is denoted by $\mathcal{A}^k$ (line 5); this is the set of agents for which $\mathbb{1}_n^k = 1$. Then, each active agent receives a copy of the learning parameter $\theta^k$ from the server. Afterward, the active agents perform their local updates according to steps 7 to 18. For resource management, local updates in FL algorithms, e.g., [15], follow an SGD update. However, the SGD update suffers from high variance because of the randomized search of the algorithm. So, instead of using the SGD update step, which results in a decaying step size and slow convergence, we use the SVRG update step stated in lines 7 to 18. In the SVRG update, we calculate the full-batch gradient of each agent at certain points, referred to as snapshots. Then, between every two snapshots, each agent performs its local updates. A schematic of the SVRG update steps is shown in Fig. 2.
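To make the local update concrete, the following is a minimal sketch of one agent's SVRG loop in the spirit of lines 7 to 18 of Algorithm 1 ($S$ snapshots, $M$ inner updates per snapshot, constant step size); the function and variable names are ours, not the paper's.

```python
import numpy as np

def local_svrg(theta_k, grad_i, L_n, S, M, eta, rng):
    """One agent's local SVRG update between server rounds (sketch)."""
    # theta_k: model parameters received from the server at round k.
    # grad_i(w, i): gradient of the per-sample loss f_{n,i} at w.
    w = theta_k.copy()
    for s in range(S):
        w_snap = w.copy()  # snapshot point
        # Full-batch gradient at the snapshot (cf. line 10 of Algorithm 1).
        mu = sum(grad_i(w_snap, i) for i in range(L_n)) / L_n
        for m in range(M):
            i = rng.integers(L_n)  # sample one data point uniformly
            # Variance-reduced gradient (cf. line 12 of Algorithm 1).
            v = grad_i(w, i) - grad_i(w_snap, i) + mu
            w = w - eta * v  # constant step size eta
    return w  # local weights sent back to the server for averaging
```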
We denote the number of snapshots in our SVRG method by $S$. We let $M$ be the number of local SVRG updates between two snapshots for each active agent before aggregation. Line 10 of Algorithm 1 corresponds to computing the full-batch gradient of each agent at the snapshot points; then, in line 12, each agent performs its local update with the substituted gradient term, denoted as $v_{n,s,m}^k = \nabla f_{\#}(w_{n,s,m}^k) - \nabla f_{\#}(\tilde{w}) + \tilde{\mu}$. Note the substituted gradient term in ...