
On the Performance of Gradient Tracking with Local Updates
Edward Duc Hien Nguyen, Sulaiman A. Alghunaim, Kun Yuan, and César A. Uribe

EDHN and CAU ({en18,cauribe}@rice.edu) are with the Department of
Electrical and Computer Engineering, Rice University, Houston, TX, USA.
SAA (sulaiman.alghunaim@ku.edu.kw) is with the Department of Electrical
Engineering, Kuwait University, Kuwait City, Kuwait. KY
(kun.yuan@alibaba-inc.com) is with Alibaba DAMO Academy, Hangzhou,
Zhejiang, China.

Edward Nguyen is supported by a training fellowship from the Gulf Coast
Consortia, on the NLM Training Program in Biomedical Informatics & Data
Science (T15LM007093). This work is supported in part by the National
Science Foundation under Grant No. 2211815.
Abstract— We study the decentralized optimization problem
where a network of $n$ agents seeks to minimize the average
of a set of heterogeneous non-convex cost functions in a
distributed manner. State-of-the-art decentralized algorithms like Exact Diffu-
sion (ED) and Gradient Tracking (GT) involve communicating
every iteration. However, communication is expensive, resource
intensive, and slow. In this work, we analyze a locally updated
GT method (LU-GT), where agents perform local recursions
before interacting with their neighbors. While local updates
have been shown to reduce communication overhead in practice,
their theoretical influence has not been fully characterized.
We show that LU-GT matches the communication complexity of the
Federated Learning setting while allowing arbitrary network
topologies. In addition, we prove that the number of local
updates does not degrade the quality of the solution achieved by
LU-GT. Numerical examples reveal that local updates can lower
communication costs in certain regimes (e.g., well-connected
graphs).
I. INTRODUCTION
Distributed optimization problems have garnered signifi-
cant interest due to the demand for efficiently solving data
processing problems [1], such as the training of deep neural
networks [2]. Nodes, processors, and computer clusters can
be abstracted as agents responsible for a partition of a large
set of data. In this work, we study the distributed multi-agent
optimization problem
\[
\operatorname*{minimize}_{x \in \mathbb{R}^m} \quad f(x) \triangleq \frac{1}{n} \sum_{i=1}^{n} f_i(x) \tag{1}
\]
where $f_i(\cdot): \mathbb{R}^m \to \mathbb{R}$ is a smooth, non-convex function held
privately by agent $i \in \{1, \dots, n\}$. To find a consensual
solution $x^\star$ of (1), decentralized methods have the $n$ agents
cooperate according to some network topology constraints.
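For concreteness, the snippet below builds a minimal numerical instance of (1). The least-squares local costs and all names in it are illustrative choices of ours, not from the paper; they are convex for simplicity, although the paper's setting allows non-convex $f_i$.

```python
import numpy as np

# Toy instance of problem (1): n agents, each privately holding a smooth
# local cost f_i; the shared goal is to minimize their average f.
# (Quadratic costs are an illustrative choice; the paper allows non-convex f_i.)
n, m = 5, 3
rng = np.random.default_rng(0)
A = rng.standard_normal((n, m, m))   # agent i privately holds (A[i], B[i])
B = rng.standard_normal((n, m))

def f_i(i, x):
    """Agent i's private smooth cost ||A_i x - b_i||^2."""
    r = A[i] @ x - B[i]
    return r @ r

def f(x):
    """The global objective of (1): the average of the local costs."""
    return sum(f_i(i, x) for i in range(n)) / n

print(f(np.zeros(m)))  # evaluate the average objective at a common point
```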
Many decentralized methods have been proposed to
solve (1). Among the most prominent are decentral-
ized/distributed gradient descent (DGD) [3], [4], EX-
TRA [5], Exact-Diffusion/D2/NIDS (ED) [6]–[9], and Gradi-
ent Tracking (GT) [10]–[13]. DGD is an algorithm wherein
agents perform a local gradient step followed by a commu-
nication round. However, DGD has been shown to be suboptimal
when agents' local objective functions are heterogeneous,
i.e., when the minimizer of each function $f_i(\cdot)$ differs from the
minimizer of $f(\cdot)$. This shortcoming has been analyzed in [14],
[15], where heterogeneity causes the rate of DGD to incur an
additional bias term whose magnitude is directly proportional
to the level of heterogeneity, slowing down convergence.
Moreover, this bias term is inversely influenced by the
connectivity of the network (it becomes larger for sparser
networks) [8], [16].
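To see this bias concretely, the following sketch (our own toy setup: scalar quadratic costs $f_i(x) = \tfrac{a_i}{2}(x - b_i)^2$ on a ring graph, not an example from the paper) runs the DGD recursion $x_i^{k+1} = \sum_j w_{ij} x_j^k - \alpha \nabla f_i(x_i^k)$ with a fixed step size and measures the residual distance to the true minimizer:

```python
import numpy as np

# Illustrative setup (not from the paper): n agents on a ring with
# heterogeneous quadratics f_i(x) = 0.5 * a_i * (x - b_i)^2, so the
# minimizer of f(x) = (1/n) * sum_i f_i(x) is sum(a*b)/sum(a).
rng = np.random.default_rng(0)
n = 8
a = rng.uniform(1.0, 3.0, n)        # local curvatures
b = rng.uniform(-5.0, 5.0, n)       # local minimizers (heterogeneity)
x_star = np.sum(a * b) / np.sum(a)  # minimizer of the average objective

# Doubly stochastic mixing matrix for a ring: average with both neighbors.
W = np.zeros((n, n))
for i in range(n):
    W[i, i] = 0.5
    W[i, (i - 1) % n] = 0.25
    W[i, (i + 1) % n] = 0.25

grad = lambda x: a * (x - b)        # stacked local gradients

# DGD: combine with neighbors, then take a local gradient step.
alpha = 0.05
x = np.zeros(n)
for _ in range(5000):
    x = W @ x - alpha * grad(x)

# With a fixed step size, DGD stalls at a heterogeneity-induced bias.
print("distance to x*:", np.abs(x - x_star).max())
```

The printed distance stays bounded away from zero however long the loop runs, which is exactly the bias term described above; shrinking $\alpha$ shrinks the bias but slows convergence.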
EXTRA, ED, and GT employ bias-correction techniques
to account for heterogeneity. EXTRA and ED use local
updates with memory. GT methods have each agent perform
its local update using an estimate of the global gradient,
called the tracking variable. The effect of these techniques is
observed in the analysis, where the rates of these algorithms
are independent of heterogeneity: the bias term proportional
to the amount of heterogeneity found in the rate of DGD is
removed, leading to better rates [17], [18].
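A minimal sketch of one standard GT variant (the adapt-then-combine form), reusing the toy ring/quadratics setup from the DGD snippet above; this specific recursion is one common form of GT and is not claimed to be the exact variant analyzed in the paper:

```python
import numpy as np

# Same illustrative ring/quadratics setup as in the DGD sketch.
rng = np.random.default_rng(0)
n = 8
a = rng.uniform(1.0, 3.0, n)
b = rng.uniform(-5.0, 5.0, n)
x_star = np.sum(a * b) / np.sum(a)
W = np.zeros((n, n))
for i in range(n):
    W[i, i], W[i, (i - 1) % n], W[i, (i + 1) % n] = 0.5, 0.25, 0.25
grad = lambda x: a * (x - b)

# Gradient Tracking (adapt-then-combine): y_i tracks the global gradient
# via dynamic average consensus, removing DGD's heterogeneity bias.
alpha = 0.05
x = np.zeros(n)
y = grad(x)                      # tracking variable, initialized to local gradients
g_old = grad(x)
for _ in range(5000):
    x = W @ (x - alpha * y)      # adapt with tracked gradient, then combine
    g_new = grad(x)
    y = W @ y + g_new - g_old    # consensus step plus gradient innovation
    g_old = g_new

print("distance to x*:", np.abs(x - x_star).max())  # -> ~0 (no bias)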
While EXTRA, ED, and GT tackle the bias induced by
heterogeneity, they require communication over the network
at every iteration. However, communication is expensive,
resource intensive, and slow in practice [2]. Centralized
methods in which agents communicate with a central
coordinator (i.e., server) have been developed to solve (1)
with an explicit focus on reducing the communication cost.
This has been achieved empirically by requiring agents
to perform local recursions before communicating. These
methods include LocalGD [19]–[23], Scaffold [24], S-
Local-GD [25], FedLin [26], and Scaffnew [27]. Analysis
of LocalGD revealed that local recursions cause agents
to drift towards their local solutions rather than the global
optimum [16], [20], [28]. Scaffold, S-Local-GD, FedLin, and
Scaffnew address this issue by introducing bias-correction
techniques. However, with the exception of Scaffnew, analyses
of these methods have not shown improvements in communication
complexity. For $\mu$-strongly-convex, $L$-smooth, and
deterministic functions, Scaffnew improves the communication
complexity from $O(\kappa)$ (no local recursions) to $O(\sqrt{\kappa})$
when each agent performs $\sqrt{\kappa}$ local recursions, where
$\kappa \triangleq L/\mu$.
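The local-update pattern these methods build on is easy to state in code. Below is a minimal LocalGD-style sketch (our own toy federated setup with scalar quadratic clients; $\tau$, $\alpha$, and the client costs are illustrative assumptions), showing $\tau$ local gradient steps between averaging rounds and the resulting client-drift bias:

```python
import numpy as np

# Illustrative federated setup (our own choices): n clients with
# heterogeneous quadratics f_i(x) = 0.5 * a_i * (x - b_i)^2.
rng = np.random.default_rng(1)
n, tau, alpha = 8, 10, 0.05        # tau = local steps per communication round
a = rng.uniform(1.0, 3.0, n)
b = rng.uniform(-5.0, 5.0, n)
x_star = np.sum(a * b) / np.sum(a)

x_server = 0.0
for _ in range(200):               # communication rounds
    x = np.full(n, x_server)       # broadcast the server model
    for _ in range(tau):           # local recursions (no communication)
        x -= alpha * a * (x - b)   # each client drifts toward its own b_i
    x_server = x.mean()            # aggregation step

# With a fixed step size, LocalGD stalls at a drift-induced bias from x*.
print("distance to x*:", abs(x_server - x_star))
```

Setting tau = 1 recovers plain gradient descent on the average objective and drives the printed distance to zero, which is precisely the drift effect the bias-correction methods above are designed to remove.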
Local recursions in decentralized methods have been much
less studied. DGD with local recursions has been studied
in [16], but the convergence rates still have bias terms due
to heterogeneity. Additionally, the magnitude of the bias
term is proportional to the number of local recursions taken.
Scaffnew [27] has also been studied in the decentralized case,
but only for the strongly convex and smooth function class:
for sufficiently well-connected graphs, [27] shows an improved
communication complexity of $O(\sqrt{\kappa/(1-\lambda)})$, where $\lambda$
is the mixing rate of the mixing matrix. GT with local