
also on the network. Specifically, let $\sigma$ be the second largest singular value of the mixing matrix used in decentralized optimization and $\frac{1}{1-\sigma}$ be the condition number of the underlying communication graph. A network with a larger $\frac{1}{1-\sigma}$ has a
weaker information diffusion ability. For strongly convex and
smooth problems, the work [14] establishes the lower bounds $\Omega\big(\sqrt{\kappa_F}\log\frac{1}{\epsilon}\big)$ and $\Omega\big(\sqrt{\frac{\kappa_F}{1-\sigma}}\log\frac{1}{\epsilon}\big)$ on the computation and communication costs, respectively, for decentralized first-order algorithms to reach an $\epsilon$-optimal solution. The lower bounds
are achieved or nearly achieved in [15], [16], where multi-
step consensus is introduced to balance the computation and
communication costs.
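As a concrete illustration of this primitive, the sketch below (plain Python under our own naming; `W` is a doubly stochastic mixing matrix with second largest singular value $\sigma$, and `X` stacks the local iterates) replaces one averaging round $X \leftarrow WX$ with $m$ rounds, so the disagreement among nodes contracts by $\sigma^m$ per iteration instead of $\sigma$:

```python
import numpy as np

def multi_step_consensus(W: np.ndarray, X: np.ndarray, m: int) -> np.ndarray:
    """Apply m rounds of neighbor averaging, i.e., return W^m @ X.

    W : (n, n) doubly stochastic mixing matrix; W[i, j] != 0 only if
        nodes i and j are neighbors, so each round is one local exchange.
    X : (n, d) stack of the n local copies of the decision variable.
    """
    for _ in range(m):
        X = W @ X  # one communication round over the network
    return X
```

For instance, with $\sigma = 0.9$ a single round shrinks the consensus error by only $0.9$, whereas $m = 22$ rounds shrink it by $0.9^{22} \approx 0.1$; spending extra cheap communication rounds in this way is how [15], [16] trade communication against the more expensive gradient computations.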
B. Decentralized Second-order Methods
In the centralized setting, Newton’s method is proved to have a locally quadratic convergence rate that is independent of $\kappa_F$. However, whether there is a communication-efficient decentralized variant of Newton’s method with a $\kappa_F$-independent super-linear convergence rate under mild assumptions is still an open question. On the one hand, some
decentralized second-order methods have provably faster rates
but suffer from inexact convergence, high communication cost,
or requiring strict assumptions. The work [17] extends the
network Newton’s method in [18] for minimizing a penalized
approximation of (1) and shows that the convergence rate is
super-linear in a specific neighborhood near the optimal solu-
tion of the penalized problem. Beyond this neighborhood, the
rate becomes linear. The work [19] proposes an approximate
Newton’s method for the dual problem of (2) and establishes
a super-linear convergence rate within a neighborhood of the
primal-dual optimal solution. However, in each iteration, it
needs to solve the primal problem exactly to obtain the dual
gradient and call a solver to obtain the local Newton direction.
The work [20] proposes a decentralized adaptive Newton’s
method, which uses the communication-inefficient flooding
technique to make the global gradient and Hessian available
to each node. In this way, each node conducts exactly the
same update so that the global super-linear convergence rate
of the centralized Newton’s method with Polyak’s adaptive
step size still holds. The work [21] proposes a decentralized
Newton-type method with cubic regularization and proves
faster convergence up to statistical error under the assumption
that each local Hessian is close enough to the global Hessian.
The work [22] studies quadratic local objective functions and shows that, for a distributed Newton’s method, the computation complexity depends only logarithmically on $\kappa_F$ with the help of exchanging the entire Hessian matrices. The algorithm in [22] is close to that in [23], but the latter has no convergence rate guarantee.
On the other hand, some works are devoted to developing efficient decentralized second-order algorithms whose per-iteration computation and communication costs are similar to those of first-order algorithms. However, these methods only attain globally linear convergence rates, which are no better than those of first-order methods [24]–[30]. Here we summarize several reasons
for the lack of provably faster rates: (i) The information fusion
over the network is realized by averaging consensus, whose
convergence rate is at most linear [4]. (ii) The global Hessian is
estimated just from local Hessians [24]–[28] or from Hessian
inverse approximations constructed with local gradient approx-
imations [29], [30]. The purpose is to avoid the communication
of entire Hessian matrices, but a downside is that the nodes
are unable to fully utilize the global second-order information.
(iii) The global Hessian matrices are typically assumed to be
uniformly bounded, which simplifies the analysis but leads to
under-utilization of the curvature information [24]–[30]. (iv)
For the centralized Newton’s method, backtracking line search is vital for the convergence analysis. It adaptively selects a small step size at the early stage to guarantee global convergence from an arbitrary initialization, and settles on the unit step size once the iterate reaches a neighborhood of the optimal solution, which guarantees the locally quadratic convergence rate (see the sketch below). However, backtracking line search is not affordable in the decentralized setting, since it is expensive for all the nodes to jointly calculate the global objective function value.
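To make point (iv) concrete, the following minimal sketch (plain Python with hypothetical callables `f`, `grad`, and `hess`) implements a damped Newton step with Armijo backtracking. Every trial of the while-condition evaluates the global objective, which in the decentralized setting would require a network-wide aggregation at each candidate step size:

```python
import numpy as np

def newton_step_backtracking(f, grad, hess, x, alpha=0.3, beta=0.5):
    """One Newton step with Armijo backtracking line search.

    Far from the optimum the loop settles on a small step size, which
    gives global convergence; near the optimum the unit step passes the
    Armijo test at once, which preserves the locally quadratic rate.
    """
    g = grad(x)
    d = np.linalg.solve(hess(x), -g)   # Newton direction
    fx = f(x)
    t = 1.0                            # always try the unit step first
    # Each test below needs the GLOBAL objective value f(x + t * d);
    # decentralized nodes cannot afford to compute it jointly per trial.
    while f(x + t * d) > fx + alpha * t * (g @ d):
        t *= beta                      # shrink the step size
    return x + t * d
```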
To the best of our knowledge, there is no decentralized Newton’s method that, under mild assumptions, is not only communication-efficient but also inherits the $\kappa_F$-independent super-linear convergence rate of the centralized Newton’s method. Therefore, in this paper we aim to address the following question: Can we design a communication-efficient decentralized Newton’s method that has a provably $\kappa_F$-independent super-linear convergence rate?
C. Major Contributions
To answer this question, we propose a decentralized Newton’s method with multi-step consensus and compression and establish its convergence rate. Roughly speaking, our method proceeds as follows. In each iteration, each node moves one step along a local approximate Newton direction, followed by variable averaging to improve consensus. To construct the local approximate Newton direction, we use the dynamic average consensus (DAC) technique to obtain a gradient approximation and a Hessian approximation,
which track the global gradient and global Hessian, respec-
tively. To avoid having each node to transmit the entire local
Hessian approximation, we design a compression procedure
with error compensation to estimate the global Hessian in
a communication-efficient way. In other words, each node
is able to obtain more accurate curvature information by
exchanging the compressed local Hessian approximations with
its neighbors, without incurring a high communication cost. In
addition, to balance the computation and communication costs, we use multi-step consensus for communicating the
local copies of the decision variable and the local gradient
approximations. Multi-step consensus helps to obtain not only
a globally linear rate that is independent of the graph but also
a faster local convergence rate.
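To make this pipeline concrete, here is one possible rendering of the iteration in Python. It is a schematic sketch, not the exact algorithm: the compressor (`top_k`), the ridge term, the step size `lr`, and the update order are our own illustrative choices.

```python
import numpy as np

def top_k(M, k):
    """Keep only the k largest-magnitude entries of M (a stand-in
    compressor; the actual compression operator is not fixed here)."""
    out = np.zeros_like(M)
    keep = np.argsort(np.abs(M), axis=None)[-k:]
    out.flat[keep] = M.flat[keep]
    return out

def decentralized_newton_sketch(W, grads, hesses, x0, m=3, k=4, lr=0.5, T=50):
    """Schematic iteration: DAC trackers for the global gradient and
    Hessian, compressed Hessian exchange with error compensation, and
    multi-step consensus on the iterates and gradient trackers.

    W      : (n, n) doubly stochastic mixing matrix
    grads  : list of n callables, grads[i](x) -> (d,) local gradient
    hesses : list of n callables, hesses[i](x) -> (d, d) local Hessian
    """
    n, d = W.shape[0], x0.shape[0]
    Wm = np.linalg.matrix_power(W, m)                  # m consensus rounds
    X = np.tile(x0, (n, 1))                            # local copies of x
    G = np.stack([grads[i](X[i]) for i in range(n)])   # gradient tracker
    H = np.stack([hesses[i](X[i]) for i in range(n)])  # Hessian tracker
    E = np.zeros((n, d, d))                            # compression error memory
    for _ in range(T):
        # (1) compress tracker + remembered error, then mix one round;
        #     E stores what the compressor dropped (error compensation)
        Q = np.stack([top_k(H[i] + E[i], k) for i in range(n)])
        E = H + E - Q
        H_mix = np.einsum('ij,jab->iab', W, Q)
        # (2) local approximate Newton direction from tracked quantities
        #     (small ridge added only for numerical safety in this sketch)
        D = np.stack([np.linalg.solve(H_mix[i] + 1e-3 * np.eye(d), G[i])
                      for i in range(n)])
        # (3) Newton-like step followed by multi-step consensus on iterates
        X_new = Wm @ (X - lr * D)
        # (4) DAC updates of the gradient and Hessian trackers
        G = Wm @ G + np.stack([grads[i](X_new[i]) - grads[i](X[i])
                               for i in range(n)])
        H = H + np.stack([hesses[i](X_new[i]) - hesses[i](X[i])
                          for i in range(n)])
        X = X_new
    return X
```

The payoff of step (1) is that each node transmits only $k$ entries of a $d \times d$ matrix per round, while the error memory `E` ensures that entries dropped by the compressor are retransmitted later rather than lost.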
Theoretically, we show, with a novel analysis, that our
proposed method enjoys a provably faster convergence rate
than those of decentralized first-order methods. The conver-
gence process is split into two stages. In stage I, we use
a small step size and get globally linear convergence at
the contraction rate of $1-O\big(\frac{1}{\kappa_F}\min\big\{\frac{(1-\sigma^2)^3}{\sigma^{m-1}},\frac{1}{2}\big\}\big)$ with arbitrary initialization. Here, $\frac{1}{1-\sigma}$ is the condition number of the communication graph.
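To see why multi-step consensus makes this stage-I rate graph-independent, suppose (consistent with the description above, though this excerpt does not state it explicitly) that $m$ denotes the number of consensus rounds per iteration. The minimum then saturates at $\frac{1}{2}$ once $m$ is modestly large:

```latex
\frac{(1-\sigma^2)^3}{\sigma^{m-1}} \ge \frac{1}{2}
\quad\Longleftrightarrow\quad
m \ge 1 + \frac{3\log\frac{1}{1-\sigma^2} - \log 2}{\log\frac{1}{\sigma}},
```

in which case the contraction rate reduces to $1-O\big(\frac{1}{2\kappa_F}\big)$, i.e., a globally linear rate that depends on $\kappa_F$ but not on the graph.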