Federated Learning Using Variance Reduced Stochastic Gradient for Probabilistically Activated Agents
Mohammadreza Rostami and Solmaz S. Kia, Senior Member, IEEE
Abstract
This paper proposes an algorithm for Federated Learning (FL) with a two-layer structure that achieves both variance reduction and a faster convergence rate to an optimal solution, in the setting where each agent has an arbitrary probability of selection at each iteration. FL is a practical tool for distributed machine learning when privacy matters. However, when FL operates over irregular connections between agents (devices), reaching a trained model both economically and quickly can be demanding. The first layer of our algorithm corresponds to the propagation of the model parameters across agents, carried out by the server. In the second layer, each agent performs its local update with a stochastic, variance-reduced technique called Stochastic Variance Reduced Gradient (SVRG). We leverage the concept of variance reduction from stochastic optimization in the agents' local update step, in order to reduce the variance caused by stochastic gradient descent (SGD). We provide a convergence bound for our algorithm which improves the rate from $O(1/\sqrt{K})$ to $O(1/K)$ by using a constant step size. We demonstrate the performance of our algorithm using numerical examples.
I. INTRODUCTION
In recent years, with the technological advances in modern smart devices, each phone, tablet, or smart home system generates and stores an abundance of data which, if harvested collaboratively with other users' data, can lead to learning models that support many intelligent applications such as smart health and image classification [1], [2]. Traditional machine learning approaches require centralizing the training data on one machine, in the cloud, or in a data center. However, the data collected on modern smart devices are often of a sensitive nature, which discourages users from relying on centralized solutions. Federated Learning (FL) [3], [4] has been proposed to decouple the ability to do machine learning from the need to store the data in a centralized location. The idea of Federated Learning is to enable smart devices to collaboratively learn a shared prediction model while keeping all the training data on the device.
Figure 1 shows a schematic representation of an FL architecture. In FL, collaborative learning without data sharing is accomplished by each agent receiving the current model weights from the server. Then, each participating learner separately updates the model by implementing stochastic gradient descent (SGD) [5] on its own locally collected data set. The participating agents then send their locally computed model weights to a server/aggregator, which often combines the models through simple averaging, as in FedAvg [4], before sending the result back to the agents. The process repeats until a satisfactory model is obtained. Federated learning relies heavily on communication between learner agents (clients) and a moderating server. Engaging all the clients in the learning procedure at each time step of the algorithm results in a huge communication cost. On the other hand, poor channel quality and intermittent connectivity can completely derail training. For resource management, in the original popular FL algorithms such as FedAvg [4], at each round of the algorithm a batch of agents is selected uniformly at random to receive the updated model weights and perform local learning. FedAvg and similar FL algorithms come with convergence guarantees [6]–[9] under the assumption that the randomly selected agents are available at each round. In practice, however, due to factors such as energy and time constraints, agents' availability is not guaranteed at all times. Thus, some work has been done to address this problem via device scheduling [10]–[14]. Nevertheless, an agent's availability can be a function of unforeseen factors such as communication channel quality, and thus is not deterministic or known in advance.
The authors are with the Department of Mechanical and Aerospace Engineering, University of California Irvine, Irvine, CA
92697, {mrostam2,solmaz}@uci.edu. This work was supported by NSF, under CAREER Award ECCS-1653838.
arXiv:2210.14362v2 [cs.LG] 1 Apr 2023
Fig. 1: Federated Learning structure with non-uniform probability of agent selection in each iteration.
To understand the effect of agents' stochastic availability on FL, recent work such as [15] proposed to move from random batch selection to an FL model where the agents' availability and participation at each round are probabilistic, see Fig. 1. In this paper, we adopt this newly proposed framework and contribute an algorithm that achieves faster convergence and lower error covariance. Our focus is on incorporating a variance reduction procedure into the local SGD step of the participating learner agents at each round. The randomness in SGD algorithms induces variance in the gradient, which leads to a decaying learning rate and a sub-linear convergence rate. Thus, there has been an effort to reduce the variance of the stochastic gradient, which resulted in the so-called Stochastic Variance Reduced Gradient (SVRG) methods. It has been shown that SVRG allows using a constant learning rate and results in linear convergence in expectation.
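Concretely, for a snapshot point $\tilde{w}$ and a uniformly sampled data index $i$, the SVRG gradient estimator can be written as follows (standard SVRG notation, not necessarily the notation used later in this paper):
$$v = \nabla f_i(w) - \nabla f_i(\tilde{w}) + \nabla f(\tilde{w}), \qquad \mathbb{E}_i[v] = \nabla f(w),$$
so the estimator is unbiased, and since the two correction terms shrink together as $w$ and $\tilde{w}$ approach a minimizer, its variance vanishes, which is what permits a constant learning rate.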
In this paper, we incorporate an SVRG approach into an FL algorithm where agents' participation in the update process at each round is probabilistic and non-uniform. Through rigorous analysis, we show that the proposed algorithm has a faster convergence rate. In particular, we show that our algorithm achieves practical convergence in expectation with a rate of $O(1/K)$, which is an improvement over the sublinear rate of $O(1/\sqrt{K})$ in [15]. We demonstrate the effectiveness of our proposed algorithm through a set of numerical studies and by comparing the rate of convergence, the covariance, and the circular error probable (CEP) measure. Our results show that our algorithm drastically improves the convergence guarantees, thanks to the decrease in variance, which results in faster convergence.
Organization: Section II introduces our basic notation and presents some properties of smooth functions. Section III presents the problem formulation and the structure behind it. Section IV presents the proposed algorithm and its scheme. Section V contains our convergence analysis for the proposed algorithm and provides its convergence rate. Section VI presents simulations, and Section VII gathers our conclusions and ideas for future work. For the convenience of the reader, we provide some of the proofs in the Appendix.
II. PRELIMINARIES
In this section, we introduce the notation and terminology used throughout the paper. We let $\mathbb{R}$, $\mathbb{R}_{>0}$, and $\mathbb{R}_{\geq 0}$ denote the set of real, positive real, and non-negative real numbers, respectively. For $x \in \mathbb{R}$, $|x|$ is its absolute value. For $x \in \mathbb{R}^d$, $\|x\| = \sqrt{x^\top x}$ denotes the standard Euclidean norm. We let $\langle \cdot, \cdot \rangle$ denote the inner product of two vectors $x, y \in \mathbb{R}^d$. A differentiable function $f : \mathbb{R}^d \to \mathbb{R}$ is Lipschitz with constant $L \in \mathbb{R}_{>0}$, or simply $L$-Lipschitz, over a set $\mathcal{C} \subseteq \mathbb{R}^d$ if and only if $\|f(x) - f(y)\| \leq L\|x - y\|$ for $x, y \in \mathcal{C}$. Furthermore, if the gradient $\nabla f$ is $L$-Lipschitz, we have $f(y) \leq f(x) + \nabla f(x)^\top (y - x) + \frac{L}{2}\|y - x\|^2$ for all $x, y \in \mathcal{C}$ [16]. Lastly, we recall Jensen's inequality, which states [17]:
$$\Big\| \frac{1}{N} \sum\nolimits_{n=1}^{N} x_n \Big\|^2 \leq \frac{1}{N} \sum\nolimits_{n=1}^{N} \|x_n\|^2. \tag{1}$$
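Inequality (1) follows by applying Jensen's inequality to the convex function $\phi(x) := \|x\|^2$ and the uniform average of $x_1, \dots, x_N$:
$$\Big\|\frac{1}{N}\sum\nolimits_{n=1}^{N} x_n\Big\|^2 = \phi\Big(\frac{1}{N}\sum\nolimits_{n=1}^{N} x_n\Big) \leq \frac{1}{N}\sum\nolimits_{n=1}^{N} \phi(x_n) = \frac{1}{N}\sum\nolimits_{n=1}^{N} \|x_n\|^2.$$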
III. PROBLEM STATEMENT
Fig. 2: SVRG update steps.

This section formalizes the problem of interest. Consider a set of $N$ agents (clients) that communicate with a server to learn the parameters of a model that they want to fit to their collective data set. Each agent has its own local data, which can be distributed either uniformly or non-uniformly. The learning objective is to obtain the learning function weights $\theta \in \mathbb{R}^d$ from
$$\min_{\theta \in \mathbb{R}^d} f(\theta) := \frac{1}{N} \sum_{n=1}^{N} f_n(\theta), \qquad f_n(\theta) = \frac{1}{L_n} \sum_{i=1}^{L_n} f_{n,i}(\theta), \tag{2}$$
where $f_n$ is a possibly convex or non-convex local learning loss function. At each agent $n \in \{1, \cdots, N\}$, $f_n$ depends on the training data set $\{(q_{n,i}, \hat{y}_{n,i})\}_{i=1}^{L_n} \subset \mathbb{R}^{1 \times d} \times \mathbb{R}$ (supervised learning). Examples include [18]
• square loss: $f_{n,i}(\theta) = \|\hat{y}_{n,i} - q_{n,i}\theta\|^2$,
• log loss: $f_{n,i}(\theta) = \log(1 + e^{-\hat{y}_{n,i} q_{n,i}\theta})$.
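As a concrete illustration of (2) and the two losses above, the following is a minimal NumPy sketch; the helper names (`square_loss`, `log_loss`, `local_objective`, `global_objective`) and array shapes are our own illustrative choices, not notation from the paper.

```python
import numpy as np

def square_loss(theta, q, y_hat):
    # f_{n,i}(theta) = ||y_hat - q @ theta||^2 for one sample (q: length-d row).
    r = y_hat - q @ theta
    return float(r ** 2)

def log_loss(theta, q, y_hat):
    # f_{n,i}(theta) = log(1 + exp(-y_hat * q @ theta)), with y_hat in {-1, +1}.
    return float(np.log1p(np.exp(-y_hat * (q @ theta))))

def local_objective(theta, Q_n, Y_n, loss):
    # f_n(theta): average of the per-sample losses over agent n's L_n samples.
    return np.mean([loss(theta, q, y) for q, y in zip(Q_n, Y_n)])

def global_objective(theta, data, loss):
    # f(theta) = (1/N) * sum_n f_n(theta), with data = [(Q_n, Y_n)] over N agents.
    return np.mean([local_objective(theta, Q, Y, loss) for Q, Y in data])
```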
Assumption 1 (Assumption on L-smoothness of local cost functions). The local loss functions have $L$-Lipschitz gradients, i.e., for any agent $n \in \{1, \cdots, N\}$ we have
$$\|\nabla f_n(\theta) - \nabla f_n(\bar{\theta})\| \leq L \|\theta - \bar{\theta}\| \tag{3}$$
for any $\theta, \bar{\theta} \in \mathbb{R}^d$ and $L \in \mathbb{R}_{>0}$.
This assumption is technical and common in the literature.
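For example, the square loss above satisfies this assumption with a constant determined by the data: for a single sample $(q_{n,i}, \hat{y}_{n,i})$,
$$\nabla f_{n,i}(\theta) = -2\, q_{n,i}^\top (\hat{y}_{n,i} - q_{n,i}\theta), \qquad \|\nabla f_{n,i}(\theta) - \nabla f_{n,i}(\bar{\theta})\| = 2\,\|q_{n,i}^\top q_{n,i} (\theta - \bar{\theta})\| \leq 2\,\|q_{n,i}\|^2 \|\theta - \bar{\theta}\|,$$
so $L = 2\|q_{n,i}\|^2$ suffices for that sample.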
Problem (2) should be solved in the framework of FL, in which agents maintain their local data and only interact with the server to update their local learning parameter vector based on global feedback provided by the server. The server generates this global feedback from the local weights it receives from the agents. In our setting, at each round $k$ of the FL algorithm, each agent $n \in \{1, \cdots, N\}$ becomes active to perform local computations and connect to the server with probability $p_n^k$. We denote the active state by $\mathbb{1}_n^k \in \{0, 1\}$; thus,
$$p_n^k = \mathrm{Prob}(\mathbb{1}_n^k = 1).$$
The activation probability at each round can be different.
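As an illustration of this activation model, the active set at round $k$ can be simulated by drawing each indicator $\mathbb{1}_n^k$ as an independent Bernoulli trial; this is only a sketch, and the probabilities below are arbitrary example values.

```python
import numpy as np

rng = np.random.default_rng(0)

def sample_active_set(p_k):
    # p_k[n] = Prob(1_n^k = 1): per-agent activation probability at round k.
    indicators = rng.random(len(p_k)) < np.asarray(p_k)
    return np.flatnonzero(indicators)  # indices of agents with 1_n^k = 1

# Example: N = 5 agents with non-uniform, round-dependent probabilities.
p_k = [0.9, 0.5, 0.2, 0.7, 0.4]
A_k = sample_active_set(p_k)  # e.g., array([0, 1, 3])
```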
IV. FEDERATED LEARNING WITH VARIANCE REDUCTION
To solve (2), we design the FedAvg-SVRG Algorithm 1, which has a two-layer structure. In this algorithm, each agent has its own probability of being active or passive in each round, denoted by $p_n^k$ for agent $n$ at iteration $k$.
Algorithm 1 is initialized with $\theta^0$ by the server. We denote the number of FL iterations by $K$. At each round $k \in \{1, \cdots, K\}$, the set of active agents is denoted by $\mathcal{A}^k$ (line 5); this is the set of agents for which $\mathbb{1}_n^k = 1$. Then, each active agent receives a copy of the learning parameter $\theta^k$ from the server. Afterward, the active agents perform their local updates according to steps 7 to 18. For resource management, local updates in FL algorithms, e.g., [15], follow an SGD update. However, the SGD update suffers from high variance because of the randomized search of the algorithm. So, instead of using the SGD update step, which results in a decaying step size and slow convergence, we use the SVRG update step stated in lines 7 to 18. In the SVRG update, we calculate the full-batch gradient of each agent at certain points, referred to as snapshots. Then, between every two snapshots, each agent performs its local updates. A schematic of the SVRG update steps is shown in Fig. 2.
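To make the local update concrete, the following is a minimal sketch of one agent's SVRG loop in the spirit of lines 7 to 18 of Algorithm 1 ($S$ snapshots, $M$ inner updates per snapshot, constant step size); the function and variable names are ours, not the paper's.

```python
import numpy as np

def local_svrg(theta_k, grad_i, L_n, S, M, eta, rng):
    """One agent's local SVRG update between server rounds (sketch)."""
    # theta_k: model parameters received from the server at round k.
    # grad_i(w, i): gradient of the per-sample loss f_{n,i} at w.
    w = theta_k.copy()
    for s in range(S):
        w_snap = w.copy()  # snapshot point
        # Full-batch gradient at the snapshot (cf. line 10 of Algorithm 1).
        mu = sum(grad_i(w_snap, i) for i in range(L_n)) / L_n
        for m in range(M):
            i = rng.integers(L_n)  # sample one data point uniformly
            # Variance-reduced gradient (cf. line 12 of Algorithm 1).
            v = grad_i(w, i) - grad_i(w_snap, i) + mu
            w = w - eta * v  # constant step size eta
    return w  # local weights sent back to the server for averaging
```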
We denote the number of snapshots in our SVRG method by $S$. We let $M$ be the number of local SVRG updates between two snapshots for each active agent before aggregation. Line 10 of Algorithm 1 corresponds to computing the full-batch gradient of each agent at the snapshot points; then, in line 12, each agent performs its local update with the substituted gradient term, denoted as $v_{n,s,m}^k = \nabla f_{\#}(w_{n,s,m}^k) - \nabla f_{\#}(\tilde{w}) + \tilde{\mu}$. Note the substituted gradient term in ...