On the Performance of Gradient Tracking with Local Updates
Edward Duc Hien Nguyen, Sulaiman A. Alghunaim, Kun Yuan and César A. Uribe
arXiv:2210.04757v2 [math.OC] 13 Oct 2022
Abstract—We study the decentralized optimization problem where a network of n agents seeks to minimize the average of a set of heterogeneous non-convex cost functions distributedly. State-of-the-art decentralized algorithms like Exact Diffusion (ED) and Gradient Tracking (GT) involve communicating every iteration. However, communication is expensive, resource intensive, and slow. In this work, we analyze a locally updated GT method (LU-GT), where agents perform local recursions before interacting with their neighbors. While local updates have been shown to reduce communication overhead in practice, their theoretical influence has not been fully characterized. We show LU-GT has the same communication complexity as the Federated Learning setting but allows arbitrary network topologies. In addition, we prove that the number of local updates does not degrade the quality of the solution achieved by LU-GT. Numerical examples reveal that local updates can lower communication costs in certain regimes (e.g., well-connected graphs).
I. INTRODUCTION
Distributed optimization problems have garnered signifi-
cant interest due to the demand for efficiently solving data
processing problems [1], such as the training of deep neural
networks [2]. Nodes, processors, and computer clusters can
be abstracted as agents responsible for a partition of a large
set of data. In this work, we study the distributed multi-agent
optimization problem
$$\operatorname*{minimize}_{x \in \mathbb{R}^m} \; f(x) \triangleq \frac{1}{n} \sum_{i=1}^{n} f_i(x) \qquad (1)$$
where $f_i(\cdot) : \mathbb{R}^m \to \mathbb{R}$ is a smooth, non-convex function held privately by an agent $i \in \{1, \ldots, n\}$. To find a consensual solution x of (1), decentralized methods have the n agents cooperate according to some network topology constraints.
Many decentralized methods have been proposed to solve (1). Among the most prominent are decentralized/distributed gradient descent (DGD) [3], [4], EXTRA [5], Exact-Diffusion/D2/NIDS (ED) [6]–[9], and Gradient Tracking (GT) [10]–[13]. DGD is an algorithm wherein agents perform a local gradient step followed by a communication round. However, DGD has been shown to be suboptimal when agents' local objective functions are heterogeneous,
EDHN and CAU ({en18,cauribe}@rice.edu) are with the Department
of Electrical and Computer Engineering, Rice University, Houston, TX,
USA. SAA (sulaiman.alghunaim@ku.edu.kw) is with the Department
of Electrical Engineering, Kuwait University, Kuwait City, Kuwait. KY
(kun.yuan@alibaba-inc.com) is with Alibaba DAMO Academy, Hangzhou,
Zhejiang, China.
Edward Nguyen is supported by a training fellowship from the Gulf Coast
Consortia, on the NLM Training Program in Biomedical Informatics & Data
Science (T15LM007093). This work is supported in part by the National
Science Foundation under Grant No. 2211815.
i.e., the minimizers of the functions $f_i(\cdot)$ differ from the minimizer of $f(\cdot)$. This shortcoming has been analyzed in [14], [15], where heterogeneity causes the rate of DGD to incur an additional bias term with a magnitude directly proportional to the level of heterogeneity, slowing down the rate of convergence. Moreover, this bias term is inversely influenced by the connectivity of the network (it becomes larger for sparse networks) [8], [16].
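The DGD recursion and its heterogeneity bias can be seen in a small numerical sketch. The 4-agent ring, scalar quadratic costs $f_i(x) = (x - b_i)^2/2$, and step size below are our own illustrative choices, not the paper's:

```python
import numpy as np

# DGD on heterogeneous scalar quadratics f_i(x) = (x - b_i)^2 / 2 over a
# 4-agent ring with Metropolis weights (illustrative toy problem).
b = np.array([0.0, 1.0, 2.0, 5.0])           # heterogeneous local minimizers
W = np.array([[1/3, 1/3, 0, 1/3],
              [1/3, 1/3, 1/3, 0],
              [0, 1/3, 1/3, 1/3],
              [1/3, 0, 1/3, 1/3]])           # symmetric, doubly stochastic
eta = 0.1
x = np.zeros(4)
for _ in range(2000):
    x = W @ (x - eta * (x - b))              # DGD: local gradient step, then mixing
# The average of the iterates reaches mean(b), the global minimizer, but the
# agents do not reach exact consensus: a residual disagreement proportional to
# the step size and to the heterogeneity of b remains.
print(x.mean(), x.std())
```

The nonzero final standard deviation is exactly the heterogeneity-induced bias discussed above; it grows as the graph becomes sparser.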
EXTRA, ED, and GT employ bias-correction techniques
to account for heterogeneity. EXTRA and ED use local
updates with memory. GT methods have each agent perform
the local update with an estimate of the global gradient called
the tracking variable. The effect of these techniques is ob-
served in the analysis where the rates of these algorithms are
independent of heterogeneity, i.e., the bias term proportional
to the amount of heterogeneity found in the rate of DGD is
removed, leading to better rates [17], [18].
While EXTRA, ED, and GT tackle the bias induced by
heterogeneity, they require communication over the network
at every iteration. However, communication is expensive,
resource intensive, and slow in practice [2]. Centralized
methods in which agents communicate with a central
coordinator (i.e., server) have been developed to solve (1)
with an explicit focus on reducing the communication cost.
This has been achieved empirically by requiring agents
to perform local recursions before communicating. Among
these methods include LocalGD [19]–[23], Scaffold [24], S-
Local-GD [25], FedLin [26], and Scaffnew [27]. Analysis
on LocalGD revealed that local recursions cause agents
to drift towards their local solution rather than the global
optimal [16], [20], [28]. Scaffold, S-Local-GD, FedLin, and
Scaffnew address this issue by introducing bias-correction
techniques. However, besides Scaffnew, analysis of these
methods has failed to show communication complexity im-
provements. Scaffnew has shown that for µ-strongly-convex,
L-smooth, and deterministic functions, the communication
complexity can be improved from O(κ)(no local recur-
sions) to O(κ)if one performs κlocal recursions with
κ,L/µ.
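As a back-of-the-envelope reading of this claim (constants and log factors omitted; the condition number κ = 100 is an arbitrary example of ours):

```python
import math

kappa = 100                              # example condition number L/mu
comm_every_iter = kappa                  # O(kappa) rounds with no local recursions
comm_with_local = math.isqrt(kappa)      # O(sqrt(kappa)) rounds when sqrt(kappa)
                                         # local recursions separate communications
print(comm_every_iter, comm_with_local)  # 100 vs. 10 communication rounds
```

The total gradient-step count is unchanged; only the number of communication rounds drops.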
Local recursions in decentralized methods have been much
less studied. DGD with local recursions has been studied
in [16], but the convergence rates still have bias terms due
to heterogeneity. Additionally, the magnitude of the bias
term is proportional to the number of local recursions taken.
Scaffnew [27] has been studied in the decentralized case, but only for the strongly convex and smooth function class. In [27], for sufficiently well-connected graphs, an improvement to a communication complexity of $O(\sqrt{\kappa/(1-\lambda)})$, where $\lambda$ is the mixing rate of the mixing matrix, is shown. GT with local
updates can be seen as a special case of the time-varying
graph setting (with no connections for several iterations,
and the graph is connected periodically for one iteration).
Several works have studied GT under time-varying graphs, such as [11], [13], [29]–[31]; among these, only [11], [29], [32] considered the nonconvex setting. Different from [11], [29], [32], we provide explicit expressions that characterize
the convergence rate in terms of the problem parameters
(e.g., network topology). We note that no proofs are given
in [32]. Our analysis is also of independent interest and can
be readily extended to stochastic costs or arbitrary time-
varying networks.
In this work, we propose and study LU-GT, a locally
updated decentralized algorithm based on the bias-corrected
method GT. Our contributions are as follows:
• We analyze LU-GT under the deterministic, non-convex regime. Our analysis provides a more convenient and simpler way to analyze GT methods, building upon and extending the framework outlined in [17].
• Our analysis yields a communication complexity matching the rates of previous locally updated variants of centralized and distributed algorithms.
• We demonstrate that LU-GT retains the bias-correction properties of GT irrespective of the number of local recursions, and that the number of local recursions does not affect the quality of the solution.
• Numerical analysis shows that local recursions can reduce the communication overhead in certain regimes, e.g., well-connected graphs.
This paper is organized as follows. Section II defines relevant
notation, states the assumptions used in our analysis, intro-
duces LU-GT, and states our main result on the convergence
rate. In Section III, we provide intuition for how our analysis shows that, under LU-GT, agents reach a consensus that is also a first-order stationary point. We also cover relevant lemmas needed in the analysis of LU-GT. In Section IV, we prove the convergence rate of LU-GT. Section V presents evidence that the local recursions of LU-GT can reduce communication costs in certain regimes.
II. ASSUMPTIONS, ALGORITHM AND MAIN RESULT
We begin this section by providing some useful notation:
$$\bar{\mathbf{x}}_k = \mathbb{1}_n \otimes \bar{x}_k, \qquad \bar{\mathbf{y}}_k = \mathbb{1}_n \otimes \bar{y}_k,$$
$$f(\mathbf{x}) = \sum_{i=1}^{n} f_i(x_i), \qquad \nabla f(\mathbf{x}) = \operatorname{col}\{\nabla f_1(x_1), \ldots, \nabla f_n(x_n)\}.$$
We define $W = [w_{ij}] \in \mathbb{R}^{n \times n}$ as the symmetric mixing matrix for an undirected graph $\mathcal{G} = \{\mathcal{V}, \mathcal{E}\}$ that models the connections of a group of $n$ agents. The weight $w_{ij}$ scales the information agent $i$ receives from agent $j$. We set $w_{ij} = 0$ if $j \notin \mathcal{N}_i$, where $\mathcal{N}_i$ is the set of neighbors of agent $i$. We also define $\mathbf{W} = W \otimes I_m \in \mathbb{R}^{mn \times mn}$.
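The lifted matrix $\mathbf{W} = W \otimes I_m$ acts on the stacked vector $\mathbf{x} = \operatorname{col}\{x_1, \ldots, x_n\}$ by mixing each coordinate across agents independently. A short sketch with toy sizes and a uniform-averaging $W$ (our assumed example):

```python
import numpy as np

# Lifting an n x n mixing matrix W to act on stacked agent states in R^{mn}.
n, m = 4, 2
W = np.full((n, n), 1.0 / n)        # toy doubly stochastic mixing matrix
W_big = np.kron(W, np.eye(m))       # W (x) I_m, an mn x mn matrix
x = np.arange(n * m, dtype=float)   # agent i holds the block x[i*m:(i+1)*m]
mixed = W_big @ x                   # each coordinate is averaged across agents
print(mixed.reshape(n, m))          # every agent's block is now [3., 4.]
```

Here the uniform $W$ makes every agent hold the network average after one application; a sparse $W$ would only average over neighbors.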
The proposed method LU-GT is detailed in Algorithm 1, where $\alpha$ and $\eta$ are step-size parameters and $T_o \geq 1$ is the number of local recursions before a round of communication. The intuition behind the algorithm is to have agents perform
Algorithm 1 LU-GT
1: Input: $x_0 = 0 \in \mathbb{R}^{mn}$, $y_0 = \alpha \nabla f(x_0)$, $\alpha > 0$, $\eta > 0$, $T_o \in \mathbb{Z}_{>0}$, $K \in \mathbb{Z}_+$
2: Define: $\tau = \{0, T_o, 2T_o, 3T_o, \ldots\}$
3: for $k = 0, \ldots, K-1$ do
4:   if $k \in \tau$ then
5:     $x_{k+1} = \mathbf{W}(x_k - \eta y_k)$
6:     $y_{k+1} = \mathbf{W}(y_k + \alpha \nabla f(x_{k+1}) - \alpha \nabla f(x_k))$
7:   else
8:     $x_{k+1} = x_k - \eta y_k$
9:     $y_{k+1} = y_k + \alpha \nabla f(x_{k+1}) - \alpha \nabla f(x_k)$
10:  end if
11: end for
a descent step using a stale estimate of the global gradient for $T_o - 1$ iterations. Afterwards, agents perform a weighted average of their parameters with their neighbors and update their tracking variable.
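To illustrate Algorithm 1, here is a minimal runnable sketch on a toy problem. The quadratic costs $f_i(x) = (x - b_i)^2/2$, the 4-agent ring, and the step sizes are our own illustrative choices (η = 0.1 satisfies the bound of Theorem 1 for this graph, where λ = 1/3), not the paper's experiments:

```python
import numpy as np

# LU-GT (Algorithm 1) on scalar quadratics f_i(x) = (x - b_i)^2 / 2 over a
# 4-agent ring with Metropolis weights (illustrative toy problem).
n = 4
b = np.array([0.0, 1.0, 2.0, 5.0])
grad = lambda x: x - b                        # stacked gradients, grad f_i(x_i) = x_i - b_i
W = np.array([[1/3, 1/3, 0, 1/3],
              [1/3, 1/3, 1/3, 0],
              [0, 1/3, 1/3, 1/3],
              [1/3, 0, 1/3, 1/3]])            # symmetric, doubly stochastic

alpha, eta, T_o, K = 1.0, 0.1, 5, 3000
x = np.zeros(n)
y = alpha * grad(x)                           # y_0 = alpha * grad f(x_0)
for k in range(K):
    Wk = W if k % T_o == 0 else np.eye(n)     # communicate only when k is in tau
    x_new = Wk @ (x - eta * y)
    y = Wk @ (y + alpha * (grad(x_new) - grad(x)))
    x = x_new
# Unlike DGD, the tracking variable removes the heterogeneity bias: all agents
# converge to the minimizer of the average cost, mean(b), despite local updates.
print(x)
```

Note how the same two update lines implement both branches of Algorithm 1, with the mixing matrix replaced by the identity on local iterations.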
Remark 1 For $T_o = 1$, Algorithm 1 becomes equivalent to the vanilla ATC-GT [10] with stepsize $\bar{\eta} = \eta\alpha$. This can be seen by introducing the change of variable $g_k = (1/\alpha) y_k$. Thus, our analysis also covers the vanilla GT method.
To analyze Algorithm 1, we first introduce the following time-varying matrix:
$$\mathbf{W}_k \triangleq \begin{cases} \mathbf{W} & \text{when } k \in \tau, \\ \mathbf{I} & \text{otherwise.} \end{cases} \qquad (3)$$
Thus, we can succinctly rewrite Algorithm 1 as follows:
$$x_{k+1} = \mathbf{W}_k (x_k - \eta y_k), \qquad \text{(4a)}$$
$$y_{k+1} = \mathbf{W}_k (y_k + \alpha \nabla f(x_{k+1}) - \alpha \nabla f(x_k)). \qquad \text{(4b)}$$
We now list the assumptions used in our analysis.
Assumption 1 (Mixing matrix) The mixing matrix $W$ is doubly stochastic and symmetric.
The Metropolis-Hastings algorithm [33] can be used to construct mixing matrices satisfying Assumption 1 from an undirected graph. Moreover, from Assumption 1, the mixing matrix $W$ has a maximum eigenvalue $\lambda_1 = 1$; all other eigenvalues are denoted $\{\lambda_i\}_{i=2}^{n}$. We define the mixing rate as $\lambda := \max_{i \in \{2,\ldots,n\}} |\lambda_i|$.
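A minimal sketch of the Metropolis-Hastings construction on an assumed 4-node ring (the edge list is our illustrative choice): each edge gets weight $1/(1 + \max(\deg_i, \deg_j))$ and the remaining mass goes on the diagonal, yielding a symmetric, doubly stochastic $W$.

```python
import numpy as np

# Metropolis-Hastings mixing matrix from an undirected graph (4-node ring).
edges = [(0, 1), (1, 2), (2, 3), (3, 0)]
n = 4
deg = np.zeros(n, dtype=int)
for i, j in edges:
    deg[i] += 1
    deg[j] += 1
W = np.zeros((n, n))
for i, j in edges:
    W[i, j] = W[j, i] = 1.0 / (1 + max(deg[i], deg[j]))
W += np.diag(1.0 - W.sum(axis=1))     # self-weights make each row sum to one
eigs = np.sort(np.abs(np.linalg.eigvalsh(W)))
lam = eigs[-2]                        # mixing rate: largest |lambda_i| for i >= 2
print(lam)                            # 1/3 for this ring
```

Smaller λ (better-connected graphs) tightens the step-size bound of Theorem 1 less, which is the regime where local updates help most in our experiments.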
Assumption 2 (L-smoothness) The function $f_i : \mathbb{R}^m \to \mathbb{R}$ is $L$-smooth for each $i \in \mathcal{V}$, i.e., $\|\nabla f_i(y) - \nabla f_i(z)\| \leq L \|y - z\|$ for all $y, z \in \mathbb{R}^m$ and some $L > 0$.
Our analysis of Algorithm 1 leads to the following convergence rate.
Theorem 1 (Convergence of LU-GT) Let Assumptions 1 and 2 hold, and let $T_o \in \mathbb{Z}_{>0}$, $\eta > 0$, and $\alpha > 0$ be such that
$$\eta < \min\left\{1, \; \frac{1-\lambda}{\lambda(1+T_o)}\right\}$$