On the Performance of Gradient Tracking with Local Updates
Edward Duc Hien Nguyen, Sulaiman A. Alghunaim, Kun Yuan and César A. Uribe
arXiv:2210.04757v2 [math.OC] 13 Oct 2022
Abstract—We study the decentralized optimization problem where a network of n agents seeks to minimize the average of a set of heterogeneous non-convex cost functions distributedly. State-of-the-art decentralized algorithms like Exact Diffusion (ED) and Gradient Tracking (GT) involve communicating every iteration. However, communication is expensive, resource intensive, and slow. In this work, we analyze a locally updated GT method (LU-GT), where agents perform local recursions before interacting with their neighbors. While local updates have been shown to reduce communication overhead in practice, their theoretical influence has not been fully characterized. We show LU-GT has the same communication complexity as the Federated Learning setting but allows arbitrary network topologies. In addition, we prove that the number of local updates does not degrade the quality of the solution achieved by LU-GT. Numerical examples reveal that local updates can lower communication costs in certain regimes (e.g., well-connected graphs).
I. INTRODUCTION
Distributed optimization problems have garnered signifi-
cant interest due to the demand for efficiently solving data
processing problems [1], such as the training of deep neural
networks [2]. Nodes, processors, and computer clusters can
be abstracted as agents responsible for a partition of a large
set of data. In this work, we study the distributed multi-agent
optimization problem
$$\operatorname*{minimize}_{x \in \mathbb{R}^m} \; f(x) \triangleq \frac{1}{n} \sum_{i=1}^{n} f_i(x) \qquad (1)$$
where $f_i(\cdot) : \mathbb{R}^m \to \mathbb{R}$ is a smooth, non-convex function held privately by an agent $i \in \{1, \ldots, n\}$. To find a consensual solution x of (1), decentralized methods have the n agents cooperate according to some network topology constraints.
Many decentralized methods have been proposed to solve (1). Among the most prominent are decentralized/distributed gradient descent (DGD) [3], [4], EXTRA [5], Exact-Diffusion/D2/NIDS (ED) [6]–[9], and Gradient Tracking (GT) [10]–[13]. DGD is an algorithm wherein agents perform a local gradient step followed by a communication round. However, DGD has been shown to be suboptimal when agents' local objective functions are heterogeneous,
EDHN and CAU ({en18,cauribe}@rice.edu) are with the Department
of Electrical and Computer Engineering, Rice University, Houston, TX,
USA. SAA (sulaiman.alghunaim@ku.edu.kw) is with the Department
of Electrical Engineering, Kuwait University, Kuwait City, Kuwait. KY
(kun.yuan@alibaba-inc.com) is with Alibaba DAMO Academy, Hangzhou,
Zhejiang, China.
Edward Nguyen is supported by a training fellowship from the Gulf Coast
Consortia, on the NLM Training Program in Biomedical Informatics & Data
Science (T15LM007093). This work is supported in part by the National
Science Foundation under Grant No. 2211815.
i.e., the minimizers of the functions $f_i(\cdot)$ differ from the minimizer of $f(\cdot)$. This shortcoming has been analyzed in [14], [15], where heterogeneity causes the rate of DGD to incur an additional bias term with a magnitude directly proportional to the level of heterogeneity, slowing down the rate of convergence. Moreover, this bias term is inversely influenced by the connectivity of the network (it becomes larger for sparse networks) [8], [16].
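The DGD recursion and its heterogeneity bias can be seen in a small numerical sketch. The 4-agent ring, scalar quadratic costs $f_i(x) = (x - b_i)^2/2$, and step size below are our own illustrative choices, not the paper's:

```python
import numpy as np

# DGD on heterogeneous scalar quadratics f_i(x) = (x - b_i)^2 / 2 over a
# 4-agent ring with Metropolis weights (illustrative toy problem).
b = np.array([0.0, 1.0, 2.0, 5.0])           # heterogeneous local minimizers
W = np.array([[1/3, 1/3, 0, 1/3],
              [1/3, 1/3, 1/3, 0],
              [0, 1/3, 1/3, 1/3],
              [1/3, 0, 1/3, 1/3]])           # symmetric, doubly stochastic
eta = 0.1
x = np.zeros(4)
for _ in range(2000):
    x = W @ (x - eta * (x - b))              # DGD: local gradient step, then mixing
# The average of the iterates reaches mean(b), the global minimizer, but the
# agents do not reach exact consensus: a residual disagreement proportional to
# the step size and to the heterogeneity of b remains.
print(x.mean(), x.std())
```

The nonzero final standard deviation is exactly the heterogeneity-induced bias discussed above; it grows as the graph becomes sparser.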
EXTRA, ED, and GT employ bias-correction techniques
to account for heterogeneity. EXTRA and ED use local
updates with memory. GT methods have each agent perform
the local update with an estimate of the global gradient called
the tracking variable. The effect of these techniques is ob-
served in the analysis where the rates of these algorithms are
independent of heterogeneity, i.e., the bias term proportional
to the amount of heterogeneity found in the rate of DGD is
removed, leading to better rates [17], [18].
While EXTRA, ED, and GT tackle the bias induced by
heterogeneity, they require communication over the network
at every iteration. However, communication is expensive,
resource intensive, and slow in practice [2]. Centralized
methods in which agents communicate with a central
coordinator (i.e., server) have been developed to solve (1)
with an explicit focus on reducing the communication cost.
This has been achieved empirically by requiring agents
to perform local recursions before communicating. Among
these methods include LocalGD [19]–[23], Scaffold [24], S-
Local-GD [25], FedLin [26], and Scaffnew [27]. Analysis
on LocalGD revealed that local recursions cause agents
to drift towards their local solution rather than the global
optimal [16], [20], [28]. Scaffold, S-Local-GD, FedLin, and
Scaffnew address this issue by introducing bias-correction
techniques. However, besides Scaffnew, analysis of these
methods has failed to show communication complexity im-
provements. Scaffnew has shown that for µ-strongly-convex,
L-smooth, and deterministic functions, the communication
complexity can be improved from O(κ)(no local recur-
sions) to O(κ)if one performs κlocal recursions with
κ,L/µ.
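As a back-of-the-envelope reading of this claim (constants and log factors omitted; the condition number κ = 100 is an arbitrary example of ours):

```python
import math

kappa = 100                              # example condition number L/mu
comm_every_iter = kappa                  # O(kappa) rounds with no local recursions
comm_with_local = math.isqrt(kappa)      # O(sqrt(kappa)) rounds when sqrt(kappa)
                                         # local recursions separate communications
print(comm_every_iter, comm_with_local)  # 100 vs. 10 communication rounds
```

The total gradient-step count is unchanged; only the number of communication rounds drops.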
Local recursions in decentralized methods have been much
less studied. DGD with local recursions has been studied
in [16], but the convergence rates still have bias terms due
to heterogeneity. Additionally, the magnitude of the bias
term is proportional to the number of local recursions taken.
Scaffnew [27] has been studied in the decentralized case, but only for the strongly convex and smooth function class. In [27], for sufficiently well-connected graphs, an improvement to a communication complexity of $O(\sqrt{\kappa/(1-\lambda)})$, where $\lambda$ is the mixing rate of the mixing matrix, is shown. GT with local
updates can be seen as a special case of the time-varying
graph setting (with no connections for several iterations,
and the graph is connected periodically for one iteration).
Several works have studied GT under time-varying graphs, such as [11], [13], [29]–[31]; among these, only [11], [29], [32] considered the nonconvex setting. Different from [11], [29], [32], we provide explicit expressions that characterize
the convergence rate in terms of the problem parameters
(e.g., network topology). We note that no proofs are given
in [32]. Our analysis is also of independent interest and can
be readily extended to stochastic costs or arbitrary time-
varying networks.
In this work, we propose and study LU-GT, a locally
updated decentralized algorithm based on the bias-corrected
method GT. Our contributions are as follows:
• We analyze LU-GT under the deterministic, non-convex regime. Our analysis provides a more convenient and simpler way to analyze GT methods, building upon and extending the framework outlined in [17].
• Our analysis yields a communication complexity matching the rates of previous locally updated variants of centralized and distributed algorithms.
• We demonstrate that LU-GT retains the bias-correction properties of GT irrespective of the number of local recursions, and that the number of local recursions does not affect the quality of the solution.
• Numerical analysis shows that local recursions can reduce the communication overhead in certain regimes, e.g., well-connected graphs.
This paper is organized as follows. Section II defines relevant
notation, states the assumptions used in our analysis, intro-
duces LU-GT, and states our main result on the convergence
rate. In Section III, we provide intuition for how our analysis shows that, under LU-GT, agents reach a consensus that is also a first-order stationary point. We also cover relevant lemmas needed in the analysis of LU-GT. In Section IV, we prove the convergence rate of LU-GT. Section V presents evidence that the local recursions of LU-GT can reduce communication costs in certain regimes.
II. ASSUMPTIONS, ALGORITHM AND MAIN RESULT
We begin this section by providing some useful notation:
$$\bar{\mathbf{x}}_k = \mathbb{1}_n \otimes \bar{x}_k, \qquad \bar{\mathbf{y}}_k = \mathbb{1}_n \otimes \bar{y}_k,$$
$$f(\mathbf{x}) = \sum_{i=1}^{n} f_i(x_i), \qquad \nabla f(\mathbf{x}) = \operatorname{col}\{\nabla f_1(x_1), \ldots, \nabla f_n(x_n)\}.$$
We define $W = [w_{ij}] \in \mathbb{R}^{n \times n}$ as the symmetric mixing matrix for an undirected graph $\mathcal{G} = \{\mathcal{V}, \mathcal{E}\}$ that models the connections of a group of $n$ agents. The weight $w_{ij}$ scales the information agent $i$ receives from agent $j$. We set $w_{ij} = 0$ if $j \notin \mathcal{N}_i$, where $\mathcal{N}_i$ is the set of neighbors of agent $i$. We also define $\mathbf{W} = W \otimes I_m \in \mathbb{R}^{mn \times mn}$.
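The lifted matrix $\mathbf{W} = W \otimes I_m$ acts on the stacked vector $\mathbf{x} = \operatorname{col}\{x_1, \ldots, x_n\}$ by mixing each coordinate across agents independently. A short sketch with toy sizes and a uniform-averaging $W$ (our assumed example):

```python
import numpy as np

# Lifting an n x n mixing matrix W to act on stacked agent states in R^{mn}.
n, m = 4, 2
W = np.full((n, n), 1.0 / n)        # toy doubly stochastic mixing matrix
W_big = np.kron(W, np.eye(m))       # W (x) I_m, an mn x mn matrix
x = np.arange(n * m, dtype=float)   # agent i holds the block x[i*m:(i+1)*m]
mixed = W_big @ x                   # each coordinate is averaged across agents
print(mixed.reshape(n, m))          # every agent's block is now [3., 4.]
```

Here the uniform $W$ makes every agent hold the network average after one application; a sparse $W$ would only average over neighbors.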
The proposed method LU-GT is detailed in Algorithm 1, where $\alpha$ and $\eta$ are step-size parameters and $T_o \geq 1$ is the number of local recursions before a round of communication. The intuition behind the algorithm is to have agents perform
Algorithm 1 LU-GT
1: Input: $x_0 = 0 \in \mathbb{R}^{mn}$, $y_0 = \alpha \nabla f(x_0)$, $\alpha > 0$, $\eta > 0$, $T_o \in \mathbb{Z}_{>0}$, $K \in \mathbb{Z}_+$
2: Define: $\tau = \{0, T_o, 2T_o, 3T_o, \ldots\}$
3: for $k = 0, \ldots, K-1$ do
4:   if $k \in \tau$ then
5:     $x_{k+1} = \mathbf{W}(x_k - \eta y_k)$
6:     $y_{k+1} = \mathbf{W}(y_k + \alpha \nabla f(x_{k+1}) - \alpha \nabla f(x_k))$
7:   else
8:     $x_{k+1} = x_k - \eta y_k$
9:     $y_{k+1} = y_k + \alpha \nabla f(x_{k+1}) - \alpha \nabla f(x_k)$
10:  end if
11: end for
a descent step using a stale estimate of the global gradient for $T_o - 1$ iterations. Afterwards, agents perform a weighted average of their parameters with their neighbors and update their tracking variable.
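To illustrate Algorithm 1, here is a minimal runnable sketch on a toy problem. The quadratic costs $f_i(x) = (x - b_i)^2/2$, the 4-agent ring, and the step sizes are our own illustrative choices (η = 0.1 satisfies the bound of Theorem 1 for this graph, where λ = 1/3), not the paper's experiments:

```python
import numpy as np

# LU-GT (Algorithm 1) on scalar quadratics f_i(x) = (x - b_i)^2 / 2 over a
# 4-agent ring with Metropolis weights (illustrative toy problem).
n = 4
b = np.array([0.0, 1.0, 2.0, 5.0])
grad = lambda x: x - b                        # stacked gradients, grad f_i(x_i) = x_i - b_i
W = np.array([[1/3, 1/3, 0, 1/3],
              [1/3, 1/3, 1/3, 0],
              [0, 1/3, 1/3, 1/3],
              [1/3, 0, 1/3, 1/3]])            # symmetric, doubly stochastic

alpha, eta, T_o, K = 1.0, 0.1, 5, 3000
x = np.zeros(n)
y = alpha * grad(x)                           # y_0 = alpha * grad f(x_0)
for k in range(K):
    Wk = W if k % T_o == 0 else np.eye(n)     # communicate only when k is in tau
    x_new = Wk @ (x - eta * y)
    y = Wk @ (y + alpha * (grad(x_new) - grad(x)))
    x = x_new
# Unlike DGD, the tracking variable removes the heterogeneity bias: all agents
# converge to the minimizer of the average cost, mean(b), despite local updates.
print(x)
```

Note how the same two update lines implement both branches of Algorithm 1, with the mixing matrix replaced by the identity on local iterations.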
Remark 1 For $T_o = 1$, Algorithm 1 becomes equivalent to the vanilla ATC-GT [10] with stepsize $\bar{\eta} = \eta\alpha$. This can be seen by introducing the change of variable $g_k = (1/\alpha) y_k$. Thus, our analysis also covers the vanilla GT method.
To analyze Algorithm 1, we first introduce the following time-varying matrix:
$$\mathbf{W}_k \triangleq \begin{cases} \mathbf{W} & \text{when } k \in \tau, \\ \mathbf{I} & \text{otherwise.} \end{cases} \qquad (3)$$
Thus, we can succinctly rewrite Algorithm 1 as follows:
$$x_{k+1} = \mathbf{W}_k (x_k - \eta y_k), \qquad \text{(4a)}$$
$$y_{k+1} = \mathbf{W}_k (y_k + \alpha \nabla f(x_{k+1}) - \alpha \nabla f(x_k)). \qquad \text{(4b)}$$
We now list the assumptions used in our analysis.
Assumption 1 (Mixing matrix) The mixing matrix $W$ is doubly stochastic and symmetric.
The Metropolis-Hastings algorithm [33] can be used to construct mixing matrices satisfying Assumption 1 from an undirected graph. Moreover, from Assumption 1, the mixing matrix $W$ has a maximum eigenvalue $\lambda_1 = 1$; all other eigenvalues are denoted $\{\lambda_i\}_{i=2}^{n}$. We define the mixing rate as $\lambda := \max_{i \in \{2,\ldots,n\}} |\lambda_i|$.
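A minimal sketch of the Metropolis-Hastings construction on an assumed 4-node ring (the edge list is our illustrative choice): each edge gets weight $1/(1 + \max(\deg_i, \deg_j))$ and the remaining mass goes on the diagonal, yielding a symmetric, doubly stochastic $W$.

```python
import numpy as np

# Metropolis-Hastings mixing matrix from an undirected graph (4-node ring).
edges = [(0, 1), (1, 2), (2, 3), (3, 0)]
n = 4
deg = np.zeros(n, dtype=int)
for i, j in edges:
    deg[i] += 1
    deg[j] += 1
W = np.zeros((n, n))
for i, j in edges:
    W[i, j] = W[j, i] = 1.0 / (1 + max(deg[i], deg[j]))
W += np.diag(1.0 - W.sum(axis=1))     # self-weights make each row sum to one
eigs = np.sort(np.abs(np.linalg.eigvalsh(W)))
lam = eigs[-2]                        # mixing rate: largest |lambda_i| for i >= 2
print(lam)                            # 1/3 for this ring
```

Smaller λ (better-connected graphs) tightens the step-size bound of Theorem 1 less, which is the regime where local updates help most in our experiments.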
Assumption 2 (L-smoothness) The function $f_i : \mathbb{R}^m \to \mathbb{R}$ is $L$-smooth for each $i \in \mathcal{V}$, i.e., $\|\nabla f_i(y) - \nabla f_i(z)\| \leq L \|y - z\|$ for all $y, z \in \mathbb{R}^m$ and some $L > 0$.
Our analysis of Algorithm 1 leads to the following convergence rate.
Theorem 1 (Convergence of LU-GT) Let Assumptions 1 and 2 hold, and let $T_o \in \mathbb{Z}_{>0}$, $\eta > 0$, and $\alpha > 0$ be such that
$$\eta < \min\left\{1, \; \frac{1-\lambda}{\lambda(1+T_o)}\right\}$$