A Communication-Efficient Decentralized Newton’s
Method with Provably Faster Convergence
Huikang Liu, Jiaojiao Zhang, Anthony Man-Cho So, and Qing Ling
Abstract—In this paper, we consider a strongly convex finite-
sum minimization problem over a decentralized network and pro-
pose a communication-efficient decentralized Newton’s method
for solving it. The main challenges in designing such an algorithm
come from three aspects: (i) mismatch between local gradi-
ents/Hessians and the global ones; (ii) cost of sharing second-
order information; (iii) tradeoff between computation and com-
munication. To handle these challenges, we first apply dynamic
average consensus (DAC) so that each node is able to use a local
gradient approximation and a local Hessian approximation to
track the global gradient and Hessian, respectively. Second, since
exchanging Hessian approximations is far from communication-
efficient, we require the nodes to exchange the compressed ones
instead and then apply an error compensation mechanism to
correct for the compression noise. Third, we introduce multi-
step consensus for exchanging local variables and local gradient
approximations to balance between computation and communica-
tion. With novel analysis, we establish the globally linear (resp.,
asymptotically super-linear) convergence rate of the proposed
method when m is constant (resp., tends to infinity), where
m ≥ 1 is the number of consensus inner steps. To the best of our
knowledge, this is the first super-linear convergence result for a
communication-efficient decentralized Newton’s method. More-
over, the rate we establish is provably faster than those of first-
order methods. Our numerical results on various applications
corroborate the theoretical findings.
Index Terms—Decentralized optimization, convergence rate,
Newton’s method, compressed communication
I. INTRODUCTION
In this paper, we consider solving a finite-sum optimization
problem defined over an undirected, connected network with
n nodes:
$$x^\star = \arg\min_{x\in\mathbb{R}^d} F(x) \triangleq \frac{1}{n}\sum_{i=1}^{n} f_i(x). \qquad (1)$$
Here, $x\in\mathbb{R}^d$ is the decision variable and $f_i:\mathbb{R}^d\to\mathbb{R}$ is
a twice-continuously differentiable function privately owned
by node $i$. The entire objective function $F$ is assumed to be
strongly convex. Each node is allowed to exchange limited
Huikang Liu is with the Research Institute for Interdisciplinary Sciences,
School of Information Management and Engineering, Shanghai University of
Finance and Economics (e-mail: liuhuikang@sufe.edu.cn).
Jiaojiao Zhang is with the Division of Decision and Control Systems,
School of Electrical Engineering and Computer Science, KTH Royal Institute
of Technology (e-mail: zdhzjj@mail.ustc.edu.cn).
Anthony Man-Cho So is with the Department of Systems Engineering and
Engineering Management, The Chinese University of Hong Kong (e-mail:
manchoso@se.cuhk.edu.hk).
Qing Ling is with the School of Computer Science and Engineer-
ing and Guangdong Province Key Laboratory of Computational Sci-
ence, Sun Yat-Sen University, and also with the Pazhou Lab (e-mail:
lingqing556@mail.sysu.edu.cn).
information with its neighbors during the optimization process.
To make (1) separable across the nodes, one common way is
to introduce a local copy $x_i\in\mathbb{R}^d$ of $x$ for node $i$ and then
force all the local copies to be equal by adding consensus
constraints. This leads to the following alternative formulation
of Problem (1):
$$\mathbf{x}^\star = \arg\min_{\{x_i\}_{i=1}^{n}} \; \frac{1}{n}\sum_{i=1}^{n} f_i(x_i) \qquad (2)$$
$$\text{s.t.}\quad x_i = x_j, \;\; \forall j\in\mathcal{N}_i, \;\forall i.$$
Here, $\mathbf{x}^\star \triangleq [x^\star;\dots;x^\star]\in\mathbb{R}^{nd}$ and $\mathcal{N}_i$ is the set of neighbors
of node $i$. The equivalence between (1) and (2) holds when the
network is connected. Decentralized optimization problems in
the form of (2) appear in various applications, such as deep
learning [1], sensor networking [2], statistical learning [3], etc.
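As a concrete running illustration (a minimal sketch, not taken from the paper), problem (1) can be instantiated as decentralized ridge regression, where node $i$ privately holds data $(A_i, b_i)$; the helper names and parameter values below are assumptions introduced only for this example.

```python
import numpy as np

# Illustrative instance of problem (1): decentralized ridge regression.
# Node i privately holds (A_i, b_i), and
#   f_i(x) = 0.5 * ||A_i x - b_i||^2 + 0.5 * rho * ||x||^2,
# so F(x) = (1/n) * sum_i f_i(x) is strongly convex.

rng = np.random.default_rng(0)
n, dim, samples_per_node, rho = 8, 5, 20, 0.1
data = [(rng.standard_normal((samples_per_node, dim)),
         rng.standard_normal(samples_per_node)) for _ in range(n)]

def local_grad(i, x):
    """Gradient of f_i at x, computed only from node i's private data."""
    A, b = data[i]
    return A.T @ (A @ x - b) + rho * x

def local_hess(i, x):
    """Hessian of f_i at x (constant in x for this quadratic example)."""
    A, _ = data[i]
    return A.T @ A + rho * np.eye(dim)
```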
Decentralized algorithms for solving (2) are well studied.
All nodes cooperatively obtain the common optimal solu-
tion $x^\star$, simultaneously minimizing the objective function
and reaching consensus. Generally speaking, minimization is
realized by inexact descent on local objective functions and
consensus is realized by variable averaging with a mixing
matrix [4]. Below, we briefly review the existing first-order
and second-order decentralized algorithms for solving (2).
A. Decentralized First-order Methods
First-order methods enjoy low per-iteration computational
complexity and thus are popular. Decentralized gradient de-
scent (DGD) is studied in [5], [6], where each node updates
its local copy by a weighted average step on local copies from
its neighbors, followed by a minimization step along its local
gradient descent direction. With a fixed step size, DGD only
converges to a neighborhood of $x^\star$. This disadvantage can in
part be explained by the observation that the local gradient is
generally not a satisfactory estimate of the global one, even
though the local copies are all equal to the optimal solution
$x^\star$. To construct a better local direction, various works with
bias-correction techniques are proposed, such as primal-dual
[7], [8], exact diffusion [9], and gradient tracking [10], [11].
For example, gradient tracking replaces the local gradient in
DGD with a local gradient approximation obtained by the
dynamic average consensus (DAC) technique, which leads to
exact convergence with a fixed step size. Unified frameworks
for first-order algorithms are investigated in [12], [13].
In the centralized setting, it is well known that the convergence
of first-order algorithms suffers from a dependence on $\kappa_F$, the
condition number of the objective function $F$. In the de-
centralized setting, the dependence is not only on $\kappa_F$ but
also on the network. Specifically, let $\sigma$ be the second largest
singular value of the mixing matrix used in decentralized
optimization and $\frac{1}{1-\sigma}$ be the condition number of the under-
lying communication graph. A network with larger $\frac{1}{1-\sigma}$ has a
weaker information diffusion ability. For strongly convex and
smooth problems, the work [14] establishes the lower bounds
$O\left(\kappa_F \log\frac{1}{\epsilon}\right)$ and $O\left(\sqrt{\frac{\kappa_F}{1-\sigma}}\log\frac{1}{\epsilon}\right)$ on the computation and
communication costs for decentralized first-order algorithms
to reach an $\epsilon$-optimal solution, respectively. The lower bounds
are achieved or nearly achieved in [15], [16], where multi-
step consensus is introduced to balance the computation and
communication costs.
B. Decentralized Second-order Methods
In the centralized setting, Newton’s method is proved to
have a locally quadratic convergence rate that is indepen-
dent of κF. However, whether there is a communication-
efficient decentralized variant of Newton’s method with κF-
independent super-linear convergence rate under mild as-
sumptions is still an open question. On the one hand, some
decentralized second-order methods have provably faster rates
but suffer from inexact convergence, high communication cost,
or strict assumptions. The work [17] extends the
network Newton’s method in [18] for minimizing a penalized
approximation of (1) and shows that the convergence rate is
super-linear in a specific neighborhood near the optimal solu-
tion of the penalized problem. Beyond this neighborhood, the
rate becomes linear. The work [19] proposes an approximate
Newton’s method for the dual problem of (2) and establishes
a super-linear convergence rate within a neighborhood of the
primal-dual optimal solution. However, in each iteration, it
needs to solve the primal problem exactly to obtain the dual
gradient and call a solver to obtain the local Newton direction.
The work [20] proposes a decentralized adaptive Newton’s
method, which uses the communication-inefficient flooding
technique to make the global gradient and Hessian available
to each node. In this way, each node conducts exactly the
same update so that the global super-linear convergence rate
of the centralized Newton’s method with Polyak’s adaptive
step size still holds. The work [21] proposes a decentralized
Newton-type method with cubic regularization and proves
faster convergence up to statistical error under the assumption
that each local Hessian is close enough to the global Hessian.
The work [22] studies quadratic local objective functions and
shows that for a distributed Newton’s method, the computation
complexity depends only logarithmically on $\kappa_F$ with the help
of exchanging the entire Hessian matrices. The algorithm in
[22] is close to that in [23], but the latter has no convergence
rate guarantee.
On the other hand, some works are devoted to developing
efficient decentralized second-order algorithms whose per-
iteration computation and communication costs are similar to
those of first-order algorithms. However, these methods only
have a globally linear convergence rate, which is no better than that of first-
order methods [24]–[30]. Here we summarize several reasons
for the lack of provably faster rates: (i) The information fusion
over the network is realized by averaging consensus, whose
convergence rate is at most linear [4]. (ii) The global Hessian is
estimated just from local Hessians [24]–[28] or from Hessian
inverse approximations constructed with local gradient approx-
imations [29], [30]. The purpose is to avoid the communication
of entire Hessian matrices, but a downside is that the nodes
are unable to fully utilize the global second-order information.
(iii) The global Hessian matrices are typically assumed to be
uniformly bounded, which simplifies the analysis but leads to
under-utilization of the curvature information [24]–[30]. (iv)
For the centralized Newton’s method, backtracking line search
is vital for convergence analysis. It adaptively gives a small
step size at the early stage to guarantee global convergence
with arbitrary initialization and always gives a unit step
size after reaching a neighborhood of the optimal solution
to guarantee locally quadratic convergence rate. However,
backtracking line search is not affordable in the decentralized
setting since it is expensive for all the nodes to jointly calculate
the global objective function value.
To the best of our knowledge, there is no decentralized
Newton’s method that, under mild assumptions, is not only
communication-efficient but also inherits the κF-independent
super-linear convergence rate of the centralized Newton’s
method. Therefore, in this paper we aim to address the follow-
ing question: Can we design a communication-efficient decen-
tralized Newton’s method that has a provably κF-independent
super-linear convergence rate?
C. Major Contributions
To answer this question, we propose a decentralized New-
ton’s method with multi-step consensus and compression and
establish its convergence rate. Roughly speaking, our method
proceeds as follows. In each iteration, each node moves one
step along a local approximated Newton direction, followed by
variable averaging to improve consensus. To construct the local
approximated Newton direction, we use the DAC technique to
obtain a gradient approximation and a Hessian approximation,
which track the global gradient and global Hessian, respec-
tively. To avoid having each node to transmit the entire local
Hessian approximation, we design a compression procedure
with error compensation to estimate the global Hessian in
a communication-efficient way. In other words, each node
is able to obtain more accurate curvature information by
exchanging the compressed local Hessian approximations with
its neighbors, without incurring a high communication cost. In
addition, to balance between computation and communication
costs, we use multi-step consensus for communicating the
local copies of the decision variable and the local gradient
approximations. Multi-step consensus helps to obtain not only
a globally linear rate that is independent of the graph but also
a faster local convergence rate.
Theoretically, we show, with a novel analysis, that our
proposed method enjoys a provably faster convergence rate
than those of decentralized first-order methods. The conver-
gence process is split into two stages. In stage I, we use
a small step size and get globally linear convergence at
the contraction rate of $1 - O\left(\frac{1}{\kappa_F}\min\left\{\frac{(1-\sigma^2)^3}{\sigma^{m-1}}, \frac{1}{2}\right\}\right)$ with
arbitrary initialization. Here, $\frac{1}{1-\sigma}$ is the condition number
of the graph, $\kappa_F$ is the condition number of the objective
function, and $m$ is the number of consensus inner steps. This
globally linear rate holds for any $m \geq 1$. As a special case,
when $m \geq \frac{\log 2(1-\sigma^2)^3}{\log \sigma} + 1$, the contraction rate in stage I
becomes $1 - O\left(\frac{1}{\kappa_F}\right)$, which is independent of the graph.
When the local copies are close enough to the optimal solution,
the algorithm enters stage II, where we use a unit step size and
get the faster convergence rate of $\sigma^{\frac{m}{2}}$. This implies that we
have a $\kappa_F$-independent linear rate when $m$ is a constant and an
asymptotically super-linear rate when $m$ goes to infinity. No
matter what $m$ is, the total communication complexity in stage
II is $O\left(\frac{1}{\log \sigma}\log\frac{1}{\epsilon}\right)$. A comparison of the computational
complexity of existing decentralized first-order and second-
order methods is summarized in Table I.
TABLE I: Computational complexity to reach an $\epsilon$-optimal
solution for decentralized consensus optimization algorithms

Algorithm | Computational complexity
DLM [31] | $O\left(\max\left\{\frac{\kappa_F^2\lambda_{\max}(L_u)}{\lambda_{\min}(L_u)}, \frac{(\lambda_{\max}(L_u))^2}{\lambda_{\min}(L_u)\hat{\lambda}_{\min}(L_o)}\right\}\log\frac{1}{\epsilon}\right)$ (footnote 1)
EXTRA [7] | $O\left(\frac{\kappa_F^2}{1-\sigma}\log\frac{1}{\epsilon}\right)$ (footnote 2)
GT [32] | $O\left(\frac{\kappa_F^2}{(1-\sigma)^2}\log\frac{1}{\epsilon}\right)$
DQM [24] | $O\left(\max\left\{\left(\frac{\lambda_{\max}(L_u)}{\hat{\lambda}_{\min}(L_o)}\right)^2, \frac{\kappa_F\lambda_{\max}(L_u)}{\hat{\lambda}_{\min}(L_o)}\right\}\log\frac{1}{\epsilon}\right)$ (footnote 3)
ESOM [33] | $O\left(\frac{\kappa_F^2}{\hat{\lambda}_{\min}(I_n-W)}\log\frac{1}{\epsilon}\right)$ (footnote 4)
NT [27] | $O\left(\max\left\{\kappa_F^2+\kappa_F\kappa_g, \frac{\kappa_g^{3/2}}{\kappa_F}+\kappa_F\kappa_g\right\}\log\frac{1}{\epsilon}\right)$ (footnote 5)
Stage I | $O\left(\kappa_F\max\left\{\frac{\sigma^{m-1}}{(1-\sigma^2)^3}, 2\right\}\log\frac{1}{\epsilon}\right)$
Stage II | $O\left(\frac{1}{m\log\sigma}\log\frac{1}{\epsilon}\right)$ (footnote 6)

Footnotes to Table I:
1. Here, $L_u$ and $L_o$ are the unoriented and oriented Laplacians defined in [31], respectively. The rate is obtained when $\alpha = \frac{L_1\kappa_F}{\lambda_{\min}(L_u)}$ and the other algorithm parameter is set to $L_1\kappa_F$, with $L_1$ being the constant of the Lipschitz continuous gradient.
2. Here, $W$ is the mixing matrix, $\tilde{W} = \frac{I_n + W}{2}$, and $\alpha = \frac{0.5\,\hat{\lambda}_{\min}(\tilde{W})}{L_1\kappa_F}$.
3. Here, the convergence is local and $\alpha = \frac{L_1}{\lambda_{\max}(L_u)\hat{\lambda}_{\min}(L_o)}$.
4. Here, the convergence is local and the number of consensus inner steps goes to infinity.
5. Here, $\kappa_g = \frac{\lambda_{\max}(I_n - W)}{\hat{\lambda}_{\min}(I_n - W)}$ as defined in [27], and the convergence is local.
6. Here, $m$ is set as a constant.
Notation. We use $I_d$ to denote the $d\times d$ identity matrix, $1_n$
to denote the $n$-dimensional column vector of all ones, $\|\cdot\|$ to
denote the Euclidean norm of a vector or the largest singular
value of a matrix, $\|\cdot\|_F$ to denote the Frobenius norm, and
$\otimes$ to denote the Kronecker product. For a matrix $A$, we use
$A \geq 0$ to indicate that each entry of $A$ is non-negative. For a
symmetric matrix $A$, we use $A \succeq 0$ and $A \succ 0$ to indicate that
$A$ is positive semidefinite and positive definite, respectively.
For two matrices $A$ and $B$ of the same dimensions, we use
$A \geq B$, $A \succeq B$, and $A \succ B$ to indicate that $A - B \geq 0$,
$A - B \succeq 0$, and $A - B \succ 0$, respectively. We use $\lambda_{\max}(\cdot)$,
$\lambda_{\min}(\cdot)$, and $\hat{\lambda}_{\min}(\cdot)$ to denote the largest, smallest, and
smallest positive eigenvalues of a matrix, respectively.
For $x_1,\dots,x_n \in \mathbb{R}^d$, we define the aggregated vari-
able $\mathbf{x} = [x_1;\dots;x_n] \in \mathbb{R}^{nd}$. The aggregated variables $\mathbf{d}$
and $\mathbf{g}$ are defined similarly. We define the average variable
over all the nodes at time step $k$ as $\bar{x}^k = \frac{1}{n}\sum_{i=1}^{n} x_i^k \in \mathbb{R}^d$.
The average variables $\bar{d}^k$ and $\bar{g}^k$ are defined simi-
larly. We define the aggregated gradient at time step $k$
as $\nabla f(\mathbf{x}^k) = [\nabla f_1(x_1^k);\dots;\nabla f_n(x_n^k)] \in \mathbb{R}^{nd}$, the av-
erage of all the local gradients at the local variables as
$\overline{\nabla f}(\mathbf{x}^k) = \frac{1}{n}\sum_{i=1}^{n} \nabla f_i(x_i^k) \in \mathbb{R}^d$, and the average of all
the local gradients at the common average $\bar{x}^k$ as $\nabla F(\bar{x}^k) =
\frac{1}{n}\sum_{i=1}^{n} \nabla f_i(\bar{x}^k) \in \mathbb{R}^d$. The aggregated Hessian $\nabla^2 f(\mathbf{x}^k) \in
\mathbb{R}^{nd\times d}$, the average of all the local Hessians at the local
variables $\overline{\nabla^2 f}(\mathbf{x}^k) \in \mathbb{R}^{d\times d}$, and the average of all the local
Hessians at the common average $\nabla^2 F(\bar{x}^k) \in \mathbb{R}^{d\times d}$ are
defined similarly. Given the matrices $H_i^k \in \mathbb{R}^{d\times d}$, we define
the aggregated matrix $\mathbf{H}^k = [H_1^k;\dots;H_n^k] \in \mathbb{R}^{nd\times d}$. The
aggregated matrices $\mathbf{E}^k$, $\tilde{\mathbf{H}}^k$, and $\hat{\mathbf{H}}^k$ are defined similarly.
We define the average variable over all the nodes at time step
$k$ as $\bar{H}^k = \frac{1}{n}\sum_{i=1}^{n} H_i^k$ and $\mathrm{diag}\{H_i\} \in \mathbb{R}^{nd\times nd}$ as the block
diagonal matrix whose $i$-th block is $H_i \in \mathbb{R}^{d\times d}$. We define
$\mathbf{W} = W \otimes I_d \in \mathbb{R}^{nd\times nd}$ and $\mathbf{W}_\infty = \frac{1_n 1_n^T}{n} \otimes I_d \in \mathbb{R}^{nd\times nd}$.
II. PROBLEM SETTING AND ALGORITHM DEVELOPMENT
In this section, we give the problem setting and the basic
assumptions. Then, we propose a decentralized Newton’s
method with multi-step consensus and compression.
A. Problem Setting
We consider an undirected, connected graph $\mathcal{G} = (\mathcal{V}, \mathcal{E})$,
where $\mathcal{V} = \{1,\dots,n\}$ is the set of nodes and $\mathcal{E} \subseteq \mathcal{V}\times\mathcal{V}$ is
the set of edges. We use $(i,j)\in\mathcal{E}$ to indicate that nodes $i$ and
$j$ are neighbors, and neighbors are allowed to communicate
with each other. We use $\mathcal{N}_i$ to denote the set of neighbors of
node $i$ and itself. We introduce a mixing matrix $W\in\mathbb{R}^{n\times n}$ to
model the communication among nodes. The mixing matrix
is assumed to satisfy the following:
Assumption 1. The mixing matrix $W$ is non-negative, sym-
metric, and doubly stochastic (i.e., $w_{ij}\geq 0$ for all $i,j\in
\{1,\dots,n\}$, $W = W^T$, and $W 1_n = 1_n$), with $w_{ij} = 0$ if and
only if $j\notin\mathcal{N}_i$.
Assumption 1 implies that the null space of $I_n - W$ is
$\mathrm{span}(1_n)$, the eigenvalues of $W$ lie in $(-1, 1]$, and $1$ is an
eigenvalue of $W$ of multiplicity 1. Let $\sigma$ be the second largest
singular value of $W$. Under Assumption 1, we have
$$\sigma = \left\|W - \tfrac{1}{n}1_n 1_n^T\right\| < 1.$$
Usually, $\sigma$ is used to represent the connectedness of the graph
[5], [7]. Mixing matrices satisfying Assumption 1 are fre-
quently used in decentralized optimization over an undirected,
connected network; see, e.g., [34] for details.
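As an illustration (a minimal sketch under stated assumptions, not the authors' code), the snippet below builds a Metropolis-Hastings mixing matrix for a ring graph, which is one standard construction satisfying Assumption 1, and evaluates $\sigma$ numerically; the graph choice and helper names are assumptions introduced for this example.

```python
import numpy as np

def ring_edges(n):
    """Edges of an n-node ring (cycle) graph."""
    return [(i, (i + 1) % n) for i in range(n)]

def metropolis_weights(n, edges):
    """Metropolis-Hastings weights: symmetric, doubly stochastic, and
    non-negative, hence satisfying Assumption 1 for a connected graph."""
    W = np.zeros((n, n))
    deg = np.zeros(n, dtype=int)
    for i, j in edges:
        deg[i] += 1
        deg[j] += 1
    for i, j in edges:
        W[i, j] = W[j, i] = 1.0 / (1 + max(deg[i], deg[j]))
    np.fill_diagonal(W, 1.0 - W.sum(axis=1))
    return W

n = 10
W = metropolis_weights(n, ring_edges(n))
# Second largest singular value: sigma = ||W - (1/n) 1 1^T||.
sigma = np.linalg.norm(W - np.ones((n, n)) / n, 2)
print(f"sigma = {sigma:.4f}")  # closer to 1 means weaker information diffusion
```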
Throughout the paper, we make the following assumptions
on the local objective functions.
Assumption 2. Each $f_i$ is twice-continuously differentiable.
Both the gradient and Hessian are Lipschitz continuous, i.e.,
$$\|\nabla f_i(x) - \nabla f_i(y)\| \leq L_1\|x - y\| \qquad (3)$$
and
$$\|\nabla^2 f_i(x) - \nabla^2 f_i(y)\| \leq L_2\|x - y\| \qquad (4)$$
for all $x, y \in \mathbb{R}^d$, where $L_1 > 0$ and $L_2 > 0$ are the Lipschitz
constants of the local gradient and local Hessian, respectively.
Assumption 3. The entire objective $F$ is $\mu$-strongly convex
for some constant $\mu > 0$, i.e.,
$$\nabla^2 F(x) \succeq \mu I_d \qquad (5)$$
for all $x \in \mathbb{R}^d$, where $\mu$ is the strong convexity constant.
We should remark that in Assumption 3, we only assume the
entire objective function $F$ to be strongly convex. The local
objective function $f_i$ on each node could even be nonconvex,
which makes our analysis more general. For instance, with $n = 2$,
$f_1(x) = -x^2$, and $f_2(x) = 3x^2$, the local function $f_1$ is concave
while $F(x) = x^2$ is strongly convex.
To avoid having each node communicate the entire local
Hessian approximation, we design a compression procedure
with a deterministic contractive compression operator $Q(\cdot)$
that satisfies the following assumption.
Assumption 4. The deterministic contractive compression
operator $Q:\mathbb{R}^{d\times d}\to\mathbb{R}^{d\times d}$ satisfies
$$\|Q(A) - A\|_F \leq (1 - \delta)\|A\|_F \qquad (6)$$
for all $A\in\mathbb{R}^{d\times d}$, where $\delta\in(0,1]$ is a constant determined
by the compression operator.
We now present two concrete examples of such an operator.
Let $A = \sum_{i=1}^{d}\sigma_i u_i v_i^T$ be the singular value decomposition of
the matrix $A$, where $\sigma_i$ is the $i$-th largest singular value with $u_i$
and $v_i$ being the corresponding singular vectors. The Rank-$K$
compression operator outputs $Q(A) = \sum_{i=1}^{K}\sigma_i u_i v_i^T$, which
is a rank-$K$ approximation of $A$. The Top-$K$ compression
operator $Q(A)$ keeps the $K$ largest entries (in terms of absolute
value) of the matrix $A$ and sets the other entries to zero. For
more details on compression operators, one can refer to [35], [36].
The following proposition shows that both the Rank-$K$ and
Top-$K$ compression operators satisfy Assumption 4.
Proposition 1. For the Rank-$K$ and Top-$K$ compression
operators, Assumption 4 holds with $\delta = \frac{K}{2d}$ and $\delta = \frac{K}{2d^2}$,
respectively.
Proof. See Appendix VI-A.
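For concreteness, here is a minimal sketch (assuming NumPy; not the authors' implementation) of the Rank-$K$ and Top-$K$ operators, together with a numerical check of the contraction bound (6) using the $\delta$ values stated in Proposition 1.

```python
import numpy as np

def rank_k(A, K):
    """Rank-K compression: keep the top-K terms of the SVD of A."""
    U, s, Vt = np.linalg.svd(A, full_matrices=False)
    return (U[:, :K] * s[:K]) @ Vt[:K, :]

def top_k(A, K):
    """Top-K compression: keep the K largest-magnitude entries of A."""
    out = np.zeros_like(A)
    flat = np.argsort(np.abs(A), axis=None)[-K:]   # indices of the K largest |entries|
    idx = np.unravel_index(flat, A.shape)
    out[idx] = A[idx]
    return out

# Numerical check of (6) with delta = K/(2d) for Rank-K and K/(2d^2) for Top-K.
rng = np.random.default_rng(1)
d, K = 6, 2
A = rng.standard_normal((d, d))
for Q, delta in [(lambda M: rank_k(M, K), K / (2 * d)),
                 (lambda M: top_k(M, K), K / (2 * d ** 2))]:
    gap = np.linalg.norm(Q(A) - A, "fro")
    bound = (1 - delta) * np.linalg.norm(A, "fro")
    assert gap <= bound + 1e-12   # consistent with Proposition 1
```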
Remark 1. Different from the random compression operators
used in first-order algorithms [37], [38], we use deterministic
compression operators. This is because any realization not
satisfying (6) may lead to a non-positive semidefinite Hessian
approximation and thus to the failure of the proposed
Newton’s method.
B. Algorithm Development
In this section, we propose a decentralized Newton’s method
with multi-step consensus and compression. In iteration k,
node $i$ first conducts one minimization step along a local
approximated Newton direction $d_i$ and then communicates the
result with its neighbors for $m$ rounds to compute
$$x_i^{k+1} = \sum_{j\in\mathcal{N}_i} (W^m)_{ij}\left(x_j^k - \alpha d_j^k\right). \qquad (7)$$
Here, $\alpha > 0$ is a step size and $(W^m)_{ij}$ is the $(i,j)$-th
entry of $W^m$. Such multi-step consensus costs $m$ rounds
of communication. As we will show in the next section,
the multi-step consensus balances between computation and
communication and is vital to get a provably fast convergence
rate.
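The following sketch (illustrative only; the mixing matrix, dimensions, and parameter values are assumptions) shows how the multi-step consensus in (7) can be simulated: applying $W^m$ to the stacked local iterates is exactly $m$ rounds of one-hop averaging, i.e., $m$ rounds of communication.

```python
import numpy as np

def multi_step_consensus(W, X, m):
    """Row i of X is node i's local vector; each of the m rounds replaces it
    by a weighted average over its neighbors, so the total effect is W^m X."""
    for _ in range(m):
        X = W @ X
    return X

# Tiny usage example: a 4-node path graph with Metropolis weights, and the
# update (7) written as x^{k+1} = W^m (x^k - alpha * d^k).
W = np.array([[2/3, 1/3, 0.0, 0.0],
              [1/3, 1/3, 1/3, 0.0],
              [0.0, 1/3, 1/3, 1/3],
              [0.0, 0.0, 1/3, 2/3]])
rng = np.random.default_rng(0)
X = rng.standard_normal((4, 3))      # local copies x_i^k
D = rng.standard_normal((4, 3))      # local directions d_i^k
alpha, m = 0.1, 5
x_next = multi_step_consensus(W, X - alpha * D, m)
```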
To update the local direction, we use the DAC technique
to obtain a gradient approximation and a Hessian approxi-
mation, which track the global gradient and global Hessian,
respectively. The gradient approximation $g_i^{k+1}$ on node $i$ is
computed by
$$g_i^{k+1} = \sum_{j\in\mathcal{N}_i} (W^m)_{ij}\left(g_j^k + \nabla f_j(x_j^{k+1}) - \nabla f_j(x_j^k)\right) \qquad (8)$$
with initialization $g_i^0 = \nabla f_i(x_i^0)$. Similar to (7), we use
multi-step consensus to make $g_i^{k+1}$ a more accurate gradient
approximation.
Algorithm 1: Decentralized Newton's method
Input: $\mathbf{x}^0$, $\mathbf{d}^0$, $g_i^0 = \nabla f_i(x_i^0)$, $H_i^0 = \nabla^2 f_i(x_i^0)$, $E_i^0$, $\tilde{H}_i^0$, $\alpha$,
$\gamma$, $m$, $M$.
for $k = 0, 1, 2, \dots$ do
    $\mathbf{x}^{k+1} = \mathbf{W}^m(\mathbf{x}^k - \alpha\mathbf{d}^k)$
    $\mathbf{g}^{k+1} = \mathbf{W}^m\left(\mathbf{g}^k + \nabla f(\mathbf{x}^{k+1}) - \nabla f(\mathbf{x}^k)\right)$
    Compression Procedure:
        $\mathbf{E}^{k+1} = \mathbf{E}^k + \mathbf{H}^k - \tilde{\mathbf{H}}^k - Q(\mathbf{E}^k + \mathbf{H}^k - \tilde{\mathbf{H}}^k)$
        $\tilde{\mathbf{H}}^{k+1} = \tilde{\mathbf{H}}^k + Q(\mathbf{H}^k - \tilde{\mathbf{H}}^k)$
        $\hat{\mathbf{H}}^k = \tilde{\mathbf{H}}^k + Q(\mathbf{E}^k + \mathbf{H}^k - \tilde{\mathbf{H}}^k)$
    $\mathbf{H}^{k+1} = \mathbf{H}^k - \gamma(I_{nd} - \mathbf{W})\hat{\mathbf{H}}^k + \nabla^2 f(\mathbf{x}^{k+1}) - \nabla^2 f(\mathbf{x}^k)$
    $\mathbf{d}^{k+1} = \left(\mathrm{diag}\{H_i^{k+1}\} + M I_{nd}\right)^{-1}\mathbf{g}^{k+1}$
end for
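To make the overall procedure concrete, the sketch below simulates one iteration of Algorithm 1 on a single machine. It is an illustrative rendering rather than the authors' implementation: the compression operator, local gradient/Hessian oracles (e.g., the top_k and local_grad helpers from the earlier snippets), and all parameter values are passed in as assumptions.

```python
import numpy as np

def algorithm1_step(x, d, g, H, H_tilde, E, W,
                    alpha, gamma, m, M, compress, local_grad, local_hess):
    """One iteration of Algorithm 1, simulated centrally.
    x, d, g: (n, dim) stacked local iterates, directions, gradient trackers.
    H, H_tilde, E: (n, dim, dim) stacked Hessian trackers, compression
    states, and error-compensation terms. compress(A) is any operator
    satisfying Assumption 4 (e.g., Rank-K or Top-K)."""
    n, dim = x.shape
    Wm = np.linalg.matrix_power(W, m)          # W^m: m communication rounds

    # Variable update (7) and gradient tracking (8) with multi-step consensus.
    grad_old = np.stack([local_grad(i, x[i]) for i in range(n)])
    x_new = Wm @ (x - alpha * d)
    grad_new = np.stack([local_grad(i, x_new[i]) for i in range(n)])
    g_new = Wm @ (g + grad_new - grad_old)

    # Compression procedure with error compensation, cf. (9) below.
    H_tilde_new = np.empty_like(H_tilde)
    E_new = np.empty_like(E)
    H_hat = np.empty_like(H)
    for i in range(n):
        H_tilde_new[i] = H_tilde[i] + compress(H[i] - H_tilde[i])
        corrected = compress(E[i] + H[i] - H_tilde[i])
        E_new[i] = E[i] + H[i] - H_tilde[i] - corrected
        H_hat[i] = H_tilde[i] + corrected

    # Hessian tracking with compressed mixing:
    # H_i^{k+1} = H_i^k - gamma * sum_j w_ij (Hhat_i - Hhat_j)
    #             + Hess_i(x_i^{k+1}) - Hess_i(x_i^k).
    hess_old = np.stack([local_hess(i, x[i]) for i in range(n)])
    hess_new = np.stack([local_hess(i, x_new[i]) for i in range(n)])
    mixed = np.einsum('ij,jab->iab', W, H_hat)     # sum_j w_ij * Hhat_j
    H_new = H - gamma * (H_hat - mixed) + hess_new - hess_old

    # Local approximate Newton directions: d_i = (H_i + M I)^{-1} g_i.
    d_new = np.stack([np.linalg.solve(H_new[i] + M * np.eye(dim), g_new[i])
                      for i in range(n)])
    return x_new, d_new, g_new, H_new, H_tilde_new, E_new
```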
To obtain the Hessian approximation, we also use DAC to
mix the second-order curvature information over the network.
However, communicating the entire local Hessian
approximation incurs a high communication cost. Thus, we
design a compression procedure with error compensation to
estimate the global Hessian in a communication-efficient way.
In other words, each node is able to obtain more accurate
global curvature information by exchanging the compressed
local Hessian approximation with its neighbors, without incur-
ring a high communication cost. The Hessian approximation
$H_i^{k+1}$ on node $i$ is given by
$$\tilde{H}_i^{k+1} = \tilde{H}_i^k + Q(H_i^k - \tilde{H}_i^k),$$
$$E_i^{k+1} = E_i^k + H_i^k - \tilde{H}_i^k - Q(E_i^k + H_i^k - \tilde{H}_i^k), \qquad (9)$$
$$\hat{H}_i^k = \tilde{H}_i^k + Q(E_i^k + H_i^k - \tilde{H}_i^k),$$
$$H_i^{k+1} = H_i^k - \gamma\sum_{j\in\mathcal{N}_i} w_{ij}\left(\hat{H}_i^k - \hat{H}_j^k\right) + \nabla^2 f_i(x_i^{k+1}) - \nabla^2 f_i(x_i^k)$$
with initialization $H_i^0 = \nabla^2 f_i(x_i^0)$, where $\gamma > 0$ is a parame-
ter. Compared with DAC without compression, i.e., $H_i^{k+1} =
H_i^k - \gamma\sum_{j\in\mathcal{N}_i} w_{ij}(H_i^k - H_j^k) + \nabla^2 f_i(x_i^{k+1}) - \nabla^2 f_i(x_i^k)$,
the term $w_{ij}(H_i^k - H_j^k)$ is replaced by $w_{ij}(\hat{H}_i^k - \hat{H}_j^k)$, which
can be constructed with compressed communication. There are