A Communication-Efficient Decentralized Newton’s
Method with Provably Faster Convergence
Huikang Liu, Jiaojiao Zhang, Anthony Man-Cho So, and Qing Ling
Abstract—In this paper, we consider a strongly convex finite-
sum minimization problem over a decentralized network and pro-
pose a communication-efficient decentralized Newton’s method
for solving it. The main challenges in designing such an algorithm
come from three aspects: (i) mismatch between local gradi-
ents/Hessians and the global ones; (ii) cost of sharing second-
order information; (iii) tradeoff between computation and com-
munication. To handle these challenges, we first apply dynamic
average consensus (DAC) so that each node is able to use a local
gradient approximation and a local Hessian approximation to
track the global gradient and Hessian, respectively. Second, since
exchanging Hessian approximations is far from communication-
efficient, we require the nodes to exchange the compressed ones
instead and then apply an error compensation mechanism to
correct for the compression noise. Third, we introduce multi-
step consensus for exchanging local variables and local gradient
approximations to balance between computation and communica-
tion. With novel analysis, we establish the globally linear (resp.,
asymptotically super-linear) convergence rate of the proposed
method when m is constant (resp., tends to infinity), where
m ≥ 1 is the number of consensus inner steps. To the best of our
knowledge, this is the first super-linear convergence result for a
communication-efficient decentralized Newton’s method. More-
over, the rate we establish is provably faster than those of first-
order methods. Our numerical results on various applications
corroborate the theoretical findings.
Index Terms—Decentralized optimization, convergence rate,
Newton’s method, compressed communication
I. INTRODUCTION
In this paper, we consider solving a finite-sum optimization
problem defined over an undirected, connected network with
n nodes:
$$x^\star = \arg\min_{x\in\mathbb{R}^d} F(x) \triangleq \frac{1}{n}\sum_{i=1}^{n} f_i(x). \qquad (1)$$
Here, $x\in\mathbb{R}^d$ is the decision variable and $f_i:\mathbb{R}^d\to\mathbb{R}$ is
a twice-continuously differentiable function privately owned
by node $i$. The entire objective function $F$ is assumed to be
strongly convex. Each node is allowed to exchange limited
Huikang Liu is with the Research Institute for Interdisciplinary Sciences,
School of Information Management and Engineering, Shanghai University of
Finance and Economics (e-mail: liuhuikang@sufe.edu.cn).
Jiaojiao Zhang is with the Division of Decision and Control Systems,
School of Electrical Engineering and Computer Science, KTH Royal Institute
of Technology (e-mail: zdhzjj@mail.ustc.edu.cn).
Anthony Man-Cho So is with the Department of Systems Engineering and
Engineering Management, The Chinese University of Hong Kong (e-mail:
manchoso@se.cuhk.edu.hk).
Qing Ling is with the School of Computer Science and Engineer-
ing and Guangdong Province Key Laboratory of Computational Sci-
ence, Sun Yat-Sen University, and also with the Pazhou Lab (e-mail:
lingqing556@mail.sysu.edu.cn).
information with its neighbors during the optimization process.
To make (1) separable across the nodes, one common way is
to introduce a local copy $x_i\in\mathbb{R}^d$ of $x$ for node $i$ and then
force all the local copies to be equal by adding consensus
constraints. This leads to the following alternative formulation
of Problem (1):
$$\mathbf{x}^\star = \arg\min_{\{x_i\}_{i=1}^{n}} \; \frac{1}{n}\sum_{i=1}^{n} f_i(x_i) \qquad (2)$$
$$\text{s.t.}\quad x_i = x_j, \;\; \forall j\in\mathcal{N}_i, \;\forall i.$$
Here, $\mathbf{x}^\star \triangleq [x^\star;\dots;x^\star]\in\mathbb{R}^{nd}$ and $\mathcal{N}_i$ is the set of neighbors
of node $i$. The equivalence between (1) and (2) holds when the
network is connected. Decentralized optimization problems in
the form of (2) appear in various applications, such as deep
learning [1], sensor networking [2], statistical learning [3], etc.
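As a concrete running illustration (a minimal sketch, not taken from the paper), problem (1) can be instantiated as decentralized ridge regression, where node $i$ privately holds data $(A_i, b_i)$; the helper names and parameter values below are assumptions introduced only for this example.

```python
import numpy as np

# Illustrative instance of problem (1): decentralized ridge regression.
# Node i privately holds (A_i, b_i), and
#   f_i(x) = 0.5 * ||A_i x - b_i||^2 + 0.5 * rho * ||x||^2,
# so F(x) = (1/n) * sum_i f_i(x) is strongly convex.

rng = np.random.default_rng(0)
n, dim, samples_per_node, rho = 8, 5, 20, 0.1
data = [(rng.standard_normal((samples_per_node, dim)),
         rng.standard_normal(samples_per_node)) for _ in range(n)]

def local_grad(i, x):
    """Gradient of f_i at x, computed only from node i's private data."""
    A, b = data[i]
    return A.T @ (A @ x - b) + rho * x

def local_hess(i, x):
    """Hessian of f_i at x (constant in x for this quadratic example)."""
    A, _ = data[i]
    return A.T @ A + rho * np.eye(dim)
```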
Decentralized algorithms for solving (2) are well studied.
All nodes cooperatively obtain the common optimal solu-
tion $x^\star$, simultaneously minimizing the objective function
and reaching consensus. Generally speaking, minimization is
realized by inexact descent on local objective functions and
consensus is realized by variable averaging with a mixing
matrix [4]. Below, we briefly review the existing first-order
and second-order decentralized algorithms for solving (2).
A. Decentralized First-order Methods
First-order methods enjoy low per-iteration computational
complexity and thus are popular. Decentralized gradient de-
scent (DGD) is studied in [5], [6], where each node updates
its local copy by a weighted average step on local copies from
its neighbors, followed by a minimization step along its local
gradient descent direction. With a fixed step size, DGD only
converges to a neighborhood of $x^\star$. This disadvantage can in
part be explained by the observation that the local gradient is
generally not a satisfactory estimate of the global one, even
though the local copies are all equal to the optimal solution
$x^\star$. To construct a better local direction, various works with
bias-correction techniques are proposed, such as primal-dual
[7], [8], exact diffusion [9], and gradient tracking [10], [11].
For example, gradient tracking replaces the local gradient in
DGD with a local gradient approximation obtained by the
dynamic average consensus (DAC) technique, which leads to
exact convergence with a fixed step size. Unified frameworks
for first-order algorithms are investigated in [12], [13].
In the centralized setting, it is well known that the convergence
of first-order algorithms suffers from a dependence on $\kappa_F$, the
condition number of the objective function $F$. In the de-
centralized setting, the dependence is not only on $\kappa_F$ but
also on the network. Specifically, let $\sigma$ be the second largest
singular value of the mixing matrix used in decentralized
optimization and $\frac{1}{1-\sigma}$ be the condition number of the under-
lying communication graph. A network with larger $\frac{1}{1-\sigma}$ has a
weaker information diffusion ability. For strongly convex and
smooth problems, the work [14] establishes the lower bounds
$O\left(\kappa_F \log\frac{1}{\epsilon}\right)$ and $O\left(\sqrt{\frac{\kappa_F}{1-\sigma}}\log\frac{1}{\epsilon}\right)$ on the computation and
communication costs for decentralized first-order algorithms
to reach an $\epsilon$-optimal solution, respectively. The lower bounds
are achieved or nearly achieved in [15], [16], where multi-
step consensus is introduced to balance the computation and
communication costs.
B. Decentralized Second-order Methods
In the centralized setting, Newton’s method is proved to
have a locally quadratic convergence rate that is indepen-
dent of κF. However, whether there is a communication-
efficient decentralized variant of Newton’s method with κF-
independent super-linear convergence rate under mild as-
sumptions is still an open question. On the one hand, some
decentralized second-order methods have provably faster rates
but suffer from inexact convergence, high communication cost,
or strict assumptions. The work [17] extends the
network Newton’s method in [18] for minimizing a penalized
approximation of (1) and shows that the convergence rate is
super-linear in a specific neighborhood near the optimal solu-
tion of the penalized problem. Beyond this neighborhood, the
rate becomes linear. The work [19] proposes an approximate
Newton’s method for the dual problem of (2) and establishes
a super-linear convergence rate within a neighborhood of the
primal-dual optimal solution. However, in each iteration, it
needs to solve the primal problem exactly to obtain the dual
gradient and call a solver to obtain the local Newton direction.
The work [20] proposes a decentralized adaptive Newton’s
method, which uses the communication-inefficient flooding
technique to make the global gradient and Hessian available
to each node. In this way, each node conducts exactly the
same update so that the global super-linear convergence rate
of the centralized Newton’s method with Polyak’s adaptive
step size still holds. The work [21] proposes a decentralized
Newton-type method with cubic regularization and proves
faster convergence up to statistical error under the assumption
that each local Hessian is close enough to the global Hessian.
The work [22] studies quadratic local objective functions and
shows that for a distributed Newton’s method, the computation
complexity depends only logarithmically on $\kappa_F$ with the help
of exchanging the entire Hessian matrices. The algorithm in
[22] is close to that in [23], but the latter has no convergence
rate guarantee.
On the other hand, some works are devoted to developing
efficient decentralized second-order algorithms whose per-
iteration computation and communication costs are similar to
those of first-order algorithms. However, these methods only
have a globally linear convergence rate, which is no better than that of first-
order methods [24]–[30]. Here we summarize several reasons
for the lack of provably faster rates: (i) The information fusion
over the network is realized by averaging consensus, whose
convergence rate is at most linear [4]. (ii) The global Hessian is
estimated just from local Hessians [24]–[28] or from Hessian
inverse approximations constructed with local gradient approx-
imations [29], [30]. The purpose is to avoid the communication
of entire Hessian matrices, but a downside is that the nodes
are unable to fully utilize the global second-order information.
(iii) The global Hessian matrices are typically assumed to be
uniformly bounded, which simplifies the analysis but leads to
under-utilization of the curvature information [24]–[30]. (iv)
For the centralized Newton’s method, backtracking line search
is vital for convergence analysis. It adaptively gives a small
step size at the early stage to guarantee global convergence
with arbitrary initialization and always gives a unit step
size after reaching a neighborhood of the optimal solution
to guarantee locally quadratic convergence rate. However,
backtracking line search is not affordable in the decentralized
setting since it is expensive for all the nodes to jointly calculate
the global objective function value.
To the best of our knowledge, there is no decentralized
Newton’s method that, under mild assumptions, is not only
communication-efficient but also inherits the κF-independent
super-linear convergence rate of the centralized Newton’s
method. Therefore, in this paper we aim to address the follow-
ing question: Can we design a communication-efficient decen-
tralized Newton’s method that has a provably κF-independent
super-linear convergence rate?
C. Major Contributions
To answer this question, we propose a decentralized New-
ton’s method with multi-step consensus and compression and
establish its convergence rate. Roughly speaking, our method
proceeds as follows. In each iteration, each node moves one
step along a local approximated Newton direction, followed by
variable averaging to improve consensus. To construct the local
approximated Newton direction, we use the DAC technique to
obtain a gradient approximation and a Hessian approximation,
which track the global gradient and global Hessian, respec-
tively. To avoid having each node to transmit the entire local
Hessian approximation, we design a compression procedure
with error compensation to estimate the global Hessian in
a communication-efficient way. In other words, each node
is able to obtain more accurate curvature information by
exchanging the compressed local Hessian approximations with
its neighbors, without incurring a high communication cost. In
addition, to balance between computation and communication
costs, we use multi-step consensus for communicating the
local copies of the decision variable and the local gradient
approximations. Multi-step consensus helps to obtain not only
a globally linear rate that is independent of the graph but also
a faster local convergence rate.
Theoretically, we show, with a novel analysis, that our
proposed method enjoys a provably faster convergence rate
than those of decentralized first-order methods. The conver-
gence process is split into two stages. In stage I, we use
a small step size and get globally linear convergence at
the contraction rate of $1 - O\left(\frac{1}{\kappa_F}\min\left\{\frac{(1-\sigma^2)^3}{\sigma^{m-1}}, \frac{1}{2}\right\}\right)$ with
arbitrary initialization. Here, $\frac{1}{1-\sigma}$ is the condition number
of the graph, $\kappa_F$ is the condition number of the objective
function, and $m$ is the number of consensus inner steps. This
globally linear rate holds for any $m \geq 1$. As a special case,
when $m \geq \frac{\log 2(1-\sigma^2)^3}{\log \sigma} + 1$, the contraction rate in stage I
becomes $1 - O\left(\frac{1}{\kappa_F}\right)$, which is independent of the graph.
When the local copies are close enough to the optimal solution,
the algorithm enters stage II, where we use a unit step size and
get the faster convergence rate of $\sigma^{\frac{m}{2}}$. This implies that we
have a $\kappa_F$-independent linear rate when $m$ is a constant and an
asymptotically super-linear rate when $m$ goes to infinity. No
matter what $m$ is, the total communication complexity in stage
II is $O\left(\frac{1}{\log \sigma}\log\frac{1}{\epsilon}\right)$. A comparison of the computational
complexity of existing decentralized first-order and second-
order methods is summarized in Table I.
TABLE I: Computational complexity to reach an $\epsilon$-optimal
solution for decentralized consensus optimization algorithms

Algorithm | Computational complexity
DLM [31] | $O\left(\max\left\{\frac{\kappa_F^2\lambda_{\max}(L_u)}{\lambda_{\min}(L_u)}, \frac{(\lambda_{\max}(L_u))^2}{\lambda_{\min}(L_u)\hat{\lambda}_{\min}(L_o)}\right\}\log\frac{1}{\epsilon}\right)$ (footnote 1)
EXTRA [7] | $O\left(\frac{\kappa_F^2}{1-\sigma}\log\frac{1}{\epsilon}\right)$ (footnote 2)
GT [32] | $O\left(\frac{\kappa_F^2}{(1-\sigma)^2}\log\frac{1}{\epsilon}\right)$
DQM [24] | $O\left(\max\left\{\left(\frac{\lambda_{\max}(L_u)}{\hat{\lambda}_{\min}(L_o)}\right)^2, \frac{\kappa_F\lambda_{\max}(L_u)}{\hat{\lambda}_{\min}(L_o)}\right\}\log\frac{1}{\epsilon}\right)$ (footnote 3)
ESOM [33] | $O\left(\frac{\kappa_F^2}{\hat{\lambda}_{\min}(I_n-W)}\log\frac{1}{\epsilon}\right)$ (footnote 4)
NT [27] | $O\left(\max\left\{\kappa_F^2+\kappa_F\kappa_g, \frac{\kappa_g^{3/2}}{\kappa_F}+\kappa_F\kappa_g\right\}\log\frac{1}{\epsilon}\right)$ (footnote 5)
Stage I | $O\left(\kappa_F\max\left\{\frac{\sigma^{m-1}}{(1-\sigma^2)^3}, 2\right\}\log\frac{1}{\epsilon}\right)$
Stage II | $O\left(\frac{1}{m\log\sigma}\log\frac{1}{\epsilon}\right)$ (footnote 6)

Footnotes to Table I:
1. Here, $L_u$ and $L_o$ are the unoriented and oriented Laplacians defined in [31], respectively. The rate is obtained when $\alpha = \frac{L_1\kappa_F}{\lambda_{\min}(L_u)}$ and the other algorithm parameter is set to $L_1\kappa_F$, with $L_1$ being the constant of the Lipschitz continuous gradient.
2. Here, $W$ is the mixing matrix, $\tilde{W} = \frac{I_n + W}{2}$, and $\alpha = \frac{0.5\,\hat{\lambda}_{\min}(\tilde{W})}{L_1\kappa_F}$.
3. Here, the convergence is local and $\alpha = \frac{L_1}{\lambda_{\max}(L_u)\hat{\lambda}_{\min}(L_o)}$.
4. Here, the convergence is local and the number of consensus inner steps goes to infinity.
5. Here, $\kappa_g = \frac{\lambda_{\max}(I_n - W)}{\hat{\lambda}_{\min}(I_n - W)}$ as defined in [27], and the convergence is local.
6. Here, $m$ is set as a constant.
Notation. We use $I_d$ to denote the $d\times d$ identity matrix, $1_n$
to denote the $n$-dimensional column vector of all ones, $\|\cdot\|$ to
denote the Euclidean norm of a vector or the largest singular
value of a matrix, $\|\cdot\|_F$ to denote the Frobenius norm, and
$\otimes$ to denote the Kronecker product. For a matrix $A$, we use
$A \geq 0$ to indicate that each entry of $A$ is non-negative. For a
symmetric matrix $A$, we use $A \succeq 0$ and $A \succ 0$ to indicate that
$A$ is positive semidefinite and positive definite, respectively.
For two matrices $A$ and $B$ of the same dimensions, we use
$A \geq B$, $A \succeq B$, and $A \succ B$ to indicate that $A - B \geq 0$,
$A - B \succeq 0$, and $A - B \succ 0$, respectively. We use $\lambda_{\max}(\cdot)$,
$\lambda_{\min}(\cdot)$, and $\hat{\lambda}_{\min}(\cdot)$ to denote the largest, smallest, and
smallest positive eigenvalues of a matrix, respectively.
For $x_1,\dots,x_n \in \mathbb{R}^d$, we define the aggregated vari-
able $\mathbf{x} = [x_1;\dots;x_n] \in \mathbb{R}^{nd}$. The aggregated variables $\mathbf{d}$
and $\mathbf{g}$ are defined similarly. We define the average variable
over all the nodes at time step $k$ as $\bar{x}^k = \frac{1}{n}\sum_{i=1}^{n} x_i^k \in \mathbb{R}^d$.
The average variables $\bar{d}^k$ and $\bar{g}^k$ are defined simi-
larly. We define the aggregated gradient at time step $k$
as $\nabla f(\mathbf{x}^k) = [\nabla f_1(x_1^k);\dots;\nabla f_n(x_n^k)] \in \mathbb{R}^{nd}$, the av-
erage of all the local gradients at the local variables as
$\overline{\nabla f}(\mathbf{x}^k) = \frac{1}{n}\sum_{i=1}^{n} \nabla f_i(x_i^k) \in \mathbb{R}^d$, and the average of all
the local gradients at the common average $\bar{x}^k$ as $\nabla F(\bar{x}^k) =
\frac{1}{n}\sum_{i=1}^{n} \nabla f_i(\bar{x}^k) \in \mathbb{R}^d$. The aggregated Hessian $\nabla^2 f(\mathbf{x}^k) \in
\mathbb{R}^{nd\times d}$, the average of all the local Hessians at the local
variables $\overline{\nabla^2 f}(\mathbf{x}^k) \in \mathbb{R}^{d\times d}$, and the average of all the local
Hessians at the common average $\nabla^2 F(\bar{x}^k) \in \mathbb{R}^{d\times d}$ are
defined similarly. Given the matrices $H_i^k \in \mathbb{R}^{d\times d}$, we define
the aggregated matrix $\mathbf{H}^k = [H_1^k;\dots;H_n^k] \in \mathbb{R}^{nd\times d}$. The
aggregated matrices $\mathbf{E}^k$, $\tilde{\mathbf{H}}^k$, and $\hat{\mathbf{H}}^k$ are defined similarly.
We define the average variable over all the nodes at time step
$k$ as $\bar{H}^k = \frac{1}{n}\sum_{i=1}^{n} H_i^k$ and $\mathrm{diag}\{H_i\} \in \mathbb{R}^{nd\times nd}$ as the block
diagonal matrix whose $i$-th block is $H_i \in \mathbb{R}^{d\times d}$. We define
$\mathbf{W} = W \otimes I_d \in \mathbb{R}^{nd\times nd}$ and $\mathbf{W}_\infty = \frac{1_n 1_n^T}{n} \otimes I_d \in \mathbb{R}^{nd\times nd}$.
II. PROBLEM SETTING AND ALGORITHM DEVELOPMENT
In this section, we give the problem setting and the basic
assumptions. Then, we propose a decentralized Newton’s
method with multi-step consensus and compression.
A. Problem Setting
We consider an undirected, connected graph $\mathcal{G} = (\mathcal{V}, \mathcal{E})$,
where $\mathcal{V} = \{1,\dots,n\}$ is the set of nodes and $\mathcal{E} \subseteq \mathcal{V}\times\mathcal{V}$ is
the set of edges. We use $(i,j)\in\mathcal{E}$ to indicate that nodes $i$ and
$j$ are neighbors, and neighbors are allowed to communicate
with each other. We use $\mathcal{N}_i$ to denote the set of neighbors of
node $i$ and itself. We introduce a mixing matrix $W\in\mathbb{R}^{n\times n}$ to
model the communication among nodes. The mixing matrix
is assumed to satisfy the following:
Assumption 1. The mixing matrix $W$ is non-negative, sym-
metric, and doubly stochastic (i.e., $w_{ij}\geq 0$ for all $i,j\in
\{1,\dots,n\}$, $W = W^T$, and $W 1_n = 1_n$), with $w_{ij} = 0$ if and
only if $j\notin\mathcal{N}_i$.
Assumption 1 implies that the null space of $I_n - W$ is
$\mathrm{span}(1_n)$, the eigenvalues of $W$ lie in $(-1, 1]$, and $1$ is an
eigenvalue of $W$ of multiplicity 1. Let $\sigma$ be the second largest
singular value of $W$. Under Assumption 1, we have
$$\sigma = \left\|W - \tfrac{1}{n}1_n 1_n^T\right\| < 1.$$
Usually, $\sigma$ is used to represent the connectedness of the graph
[5], [7]. Mixing matrices satisfying Assumption 1 are fre-
quently used in decentralized optimization over an undirected,
connected network; see, e.g., [34] for details.
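As an illustration (a minimal sketch under stated assumptions, not the authors' code), the snippet below builds a Metropolis-Hastings mixing matrix for a ring graph, which is one standard construction satisfying Assumption 1, and evaluates $\sigma$ numerically; the graph choice and helper names are assumptions introduced for this example.

```python
import numpy as np

def ring_edges(n):
    """Edges of an n-node ring (cycle) graph."""
    return [(i, (i + 1) % n) for i in range(n)]

def metropolis_weights(n, edges):
    """Metropolis-Hastings weights: symmetric, doubly stochastic, and
    non-negative, hence satisfying Assumption 1 for a connected graph."""
    W = np.zeros((n, n))
    deg = np.zeros(n, dtype=int)
    for i, j in edges:
        deg[i] += 1
        deg[j] += 1
    for i, j in edges:
        W[i, j] = W[j, i] = 1.0 / (1 + max(deg[i], deg[j]))
    np.fill_diagonal(W, 1.0 - W.sum(axis=1))
    return W

n = 10
W = metropolis_weights(n, ring_edges(n))
# Second largest singular value: sigma = ||W - (1/n) 1 1^T||.
sigma = np.linalg.norm(W - np.ones((n, n)) / n, 2)
print(f"sigma = {sigma:.4f}")  # closer to 1 means weaker information diffusion
```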
Throughout the paper, we make the following assumptions
on the local objective functions.
Assumption 2. Each $f_i$ is twice-continuously differentiable.
Both the gradient and Hessian are Lipschitz continuous, i.e.,
$$\|\nabla f_i(x) - \nabla f_i(y)\| \leq L_1\|x - y\| \qquad (3)$$
and
$$\|\nabla^2 f_i(x) - \nabla^2 f_i(y)\| \leq L_2\|x - y\| \qquad (4)$$
for all $x, y \in \mathbb{R}^d$, where $L_1 > 0$ and $L_2 > 0$ are the Lipschitz
constants of the local gradient and local Hessian, respectively.
Assumption 3. The entire objective $F$ is $\mu$-strongly convex
for some constant $\mu > 0$, i.e.,
$$\nabla^2 F(x) \succeq \mu I_d \qquad (5)$$
for all $x \in \mathbb{R}^d$, where $\mu$ is the strong convexity constant.
We should remark that in Assumption 3, we only assume the
entire objective function $F$ to be strongly convex. The local
objective function $f_i$ on each node could even be nonconvex,
which makes our analysis more general. For instance, with $n = 2$,
$f_1(x) = -x^2$, and $f_2(x) = 3x^2$, the local function $f_1$ is concave
while $F(x) = x^2$ is strongly convex.
To avoid having each node communicate the entire local
Hessian approximation, we design a compression procedure
with a deterministic contractive compression operator $Q(\cdot)$
that satisfies the following assumption.
Assumption 4. The deterministic contractive compression
operator $Q:\mathbb{R}^{d\times d}\to\mathbb{R}^{d\times d}$ satisfies
$$\|Q(A) - A\|_F \leq (1 - \delta)\|A\|_F \qquad (6)$$
for all $A\in\mathbb{R}^{d\times d}$, where $\delta\in(0,1]$ is a constant determined
by the compression operator.
We now present two concrete examples of such an operator.
Let $A = \sum_{i=1}^{d}\sigma_i u_i v_i^T$ be the singular value decomposition of
the matrix $A$, where $\sigma_i$ is the $i$-th largest singular value with $u_i$
and $v_i$ being the corresponding singular vectors. The Rank-$K$
compression operator outputs $Q(A) = \sum_{i=1}^{K}\sigma_i u_i v_i^T$, which
is a rank-$K$ approximation of $A$. The Top-$K$ compression
operator $Q(A)$ keeps the $K$ largest entries (in terms of absolute
value) of the matrix $A$ and sets the other entries to zero. For
more details on compression operators, one can refer to [35], [36].
The following proposition shows that both the Rank-$K$ and
Top-$K$ compression operators satisfy Assumption 4.
Proposition 1. For the Rank-$K$ and Top-$K$ compression
operators, Assumption 4 holds with $\delta = \frac{K}{2d}$ and $\delta = \frac{K}{2d^2}$,
respectively.
Proof. See Appendix VI-A.
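For concreteness, here is a minimal sketch (assuming NumPy; not the authors' implementation) of the Rank-$K$ and Top-$K$ operators, together with a numerical check of the contraction bound (6) using the $\delta$ values stated in Proposition 1.

```python
import numpy as np

def rank_k(A, K):
    """Rank-K compression: keep the top-K terms of the SVD of A."""
    U, s, Vt = np.linalg.svd(A, full_matrices=False)
    return (U[:, :K] * s[:K]) @ Vt[:K, :]

def top_k(A, K):
    """Top-K compression: keep the K largest-magnitude entries of A."""
    out = np.zeros_like(A)
    flat = np.argsort(np.abs(A), axis=None)[-K:]   # indices of the K largest |entries|
    idx = np.unravel_index(flat, A.shape)
    out[idx] = A[idx]
    return out

# Numerical check of (6) with delta = K/(2d) for Rank-K and K/(2d^2) for Top-K.
rng = np.random.default_rng(1)
d, K = 6, 2
A = rng.standard_normal((d, d))
for Q, delta in [(lambda M: rank_k(M, K), K / (2 * d)),
                 (lambda M: top_k(M, K), K / (2 * d ** 2))]:
    gap = np.linalg.norm(Q(A) - A, "fro")
    bound = (1 - delta) * np.linalg.norm(A, "fro")
    assert gap <= bound + 1e-12   # consistent with Proposition 1
```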
Remark 1. Different from the random compression operators
used in first-order algorithms [37], [38], we use deterministic
compression operators. This is because any realization not
satisfying (6) may lead to a non-positive semidefinite Hessian
approximation and thus to the failure of the proposed
Newton’s method.
B. Algorithm Development
In this section, we propose a decentralized Newton’s method
with multi-step consensus and compression. In iteration k,
node $i$ first conducts one minimization step along a local
approximated Newton direction $d_i$ and then communicates the
result with its neighbors for $m$ rounds to compute
$$x_i^{k+1} = \sum_{j\in\mathcal{N}_i} (W^m)_{ij}\left(x_j^k - \alpha d_j^k\right). \qquad (7)$$
Here, $\alpha > 0$ is a step size and $(W^m)_{ij}$ is the $(i,j)$-th
entry of $W^m$. Such multi-step consensus costs $m$ rounds
of communication. As we will show in the next section,
the multi-step consensus balances between computation and
communication and is vital to get a provably fast convergence
rate.
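The following sketch (illustrative only; the mixing matrix, dimensions, and parameter values are assumptions) shows how the multi-step consensus in (7) can be simulated: applying $W^m$ to the stacked local iterates is exactly $m$ rounds of one-hop averaging, i.e., $m$ rounds of communication.

```python
import numpy as np

def multi_step_consensus(W, X, m):
    """Row i of X is node i's local vector; each of the m rounds replaces it
    by a weighted average over its neighbors, so the total effect is W^m X."""
    for _ in range(m):
        X = W @ X
    return X

# Tiny usage example: a 4-node path graph with Metropolis weights, and the
# update (7) written as x^{k+1} = W^m (x^k - alpha * d^k).
W = np.array([[2/3, 1/3, 0.0, 0.0],
              [1/3, 1/3, 1/3, 0.0],
              [0.0, 1/3, 1/3, 1/3],
              [0.0, 0.0, 1/3, 2/3]])
rng = np.random.default_rng(0)
X = rng.standard_normal((4, 3))      # local copies x_i^k
D = rng.standard_normal((4, 3))      # local directions d_i^k
alpha, m = 0.1, 5
x_next = multi_step_consensus(W, X - alpha * D, m)
```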
To update the local direction, we use the DAC technique
to obtain a gradient approximation and a Hessian approxi-
mation, which track the global gradient and global Hessian,
respectively. The gradient approximation $g_i^{k+1}$ on node $i$ is
computed by
$$g_i^{k+1} = \sum_{j\in\mathcal{N}_i} (W^m)_{ij}\left(g_j^k + \nabla f_j(x_j^{k+1}) - \nabla f_j(x_j^k)\right) \qquad (8)$$
with initialization $g_i^0 = \nabla f_i(x_i^0)$. Similar to (7), we use
multi-step consensus to make $g_i^{k+1}$ a more accurate gradient
approximation.
Algorithm 1: Decentralized Newton's method
Input: $\mathbf{x}^0$, $\mathbf{d}^0$, $g_i^0 = \nabla f_i(x_i^0)$, $H_i^0 = \nabla^2 f_i(x_i^0)$, $E_i^0$, $\tilde{H}_i^0$, $\alpha$,
$\gamma$, $m$, $M$.
for $k = 0, 1, 2, \dots$ do
    $\mathbf{x}^{k+1} = \mathbf{W}^m(\mathbf{x}^k - \alpha\mathbf{d}^k)$
    $\mathbf{g}^{k+1} = \mathbf{W}^m\left(\mathbf{g}^k + \nabla f(\mathbf{x}^{k+1}) - \nabla f(\mathbf{x}^k)\right)$
    Compression Procedure:
        $\mathbf{E}^{k+1} = \mathbf{E}^k + \mathbf{H}^k - \tilde{\mathbf{H}}^k - Q(\mathbf{E}^k + \mathbf{H}^k - \tilde{\mathbf{H}}^k)$
        $\tilde{\mathbf{H}}^{k+1} = \tilde{\mathbf{H}}^k + Q(\mathbf{H}^k - \tilde{\mathbf{H}}^k)$
        $\hat{\mathbf{H}}^k = \tilde{\mathbf{H}}^k + Q(\mathbf{E}^k + \mathbf{H}^k - \tilde{\mathbf{H}}^k)$
    $\mathbf{H}^{k+1} = \mathbf{H}^k - \gamma(I_{nd} - \mathbf{W})\hat{\mathbf{H}}^k + \nabla^2 f(\mathbf{x}^{k+1}) - \nabla^2 f(\mathbf{x}^k)$
    $\mathbf{d}^{k+1} = \left(\mathrm{diag}\{H_i^{k+1}\} + M I_{nd}\right)^{-1}\mathbf{g}^{k+1}$
end for
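To make the overall procedure concrete, the sketch below simulates one iteration of Algorithm 1 on a single machine. It is an illustrative rendering rather than the authors' implementation: the compression operator, local gradient/Hessian oracles (e.g., the top_k and local_grad helpers from the earlier snippets), and all parameter values are passed in as assumptions.

```python
import numpy as np

def algorithm1_step(x, d, g, H, H_tilde, E, W,
                    alpha, gamma, m, M, compress, local_grad, local_hess):
    """One iteration of Algorithm 1, simulated centrally.
    x, d, g: (n, dim) stacked local iterates, directions, gradient trackers.
    H, H_tilde, E: (n, dim, dim) stacked Hessian trackers, compression
    states, and error-compensation terms. compress(A) is any operator
    satisfying Assumption 4 (e.g., Rank-K or Top-K)."""
    n, dim = x.shape
    Wm = np.linalg.matrix_power(W, m)          # W^m: m communication rounds

    # Variable update (7) and gradient tracking (8) with multi-step consensus.
    grad_old = np.stack([local_grad(i, x[i]) for i in range(n)])
    x_new = Wm @ (x - alpha * d)
    grad_new = np.stack([local_grad(i, x_new[i]) for i in range(n)])
    g_new = Wm @ (g + grad_new - grad_old)

    # Compression procedure with error compensation, cf. (9) below.
    H_tilde_new = np.empty_like(H_tilde)
    E_new = np.empty_like(E)
    H_hat = np.empty_like(H)
    for i in range(n):
        H_tilde_new[i] = H_tilde[i] + compress(H[i] - H_tilde[i])
        corrected = compress(E[i] + H[i] - H_tilde[i])
        E_new[i] = E[i] + H[i] - H_tilde[i] - corrected
        H_hat[i] = H_tilde[i] + corrected

    # Hessian tracking with compressed mixing:
    # H_i^{k+1} = H_i^k - gamma * sum_j w_ij (Hhat_i - Hhat_j)
    #             + Hess_i(x_i^{k+1}) - Hess_i(x_i^k).
    hess_old = np.stack([local_hess(i, x[i]) for i in range(n)])
    hess_new = np.stack([local_hess(i, x_new[i]) for i in range(n)])
    mixed = np.einsum('ij,jab->iab', W, H_hat)     # sum_j w_ij * Hhat_j
    H_new = H - gamma * (H_hat - mixed) + hess_new - hess_old

    # Local approximate Newton directions: d_i = (H_i + M I)^{-1} g_i.
    d_new = np.stack([np.linalg.solve(H_new[i] + M * np.eye(dim), g_new[i])
                      for i in range(n)])
    return x_new, d_new, g_new, H_new, H_tilde_new, E_new
```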
To obtain the Hessian approximation, we also use DAC to
mix the second-order curvature information over the network.
However, communicating the entire local Hessian
approximation incurs a high communication cost. Thus, we
design a compression procedure with error compensation to
estimate the global Hessian in a communication-efficient way.
In other words, each node is able to obtain more accurate
global curvature information by exchanging the compressed
local Hessian approximation with its neighbors, without incur-
ring a high communication cost. The Hessian approximation
$H_i^{k+1}$ on node $i$ is given by
$$\tilde{H}_i^{k+1} = \tilde{H}_i^k + Q(H_i^k - \tilde{H}_i^k),$$
$$E_i^{k+1} = E_i^k + H_i^k - \tilde{H}_i^k - Q(E_i^k + H_i^k - \tilde{H}_i^k), \qquad (9)$$
$$\hat{H}_i^k = \tilde{H}_i^k + Q(E_i^k + H_i^k - \tilde{H}_i^k),$$
$$H_i^{k+1} = H_i^k - \gamma\sum_{j\in\mathcal{N}_i} w_{ij}\left(\hat{H}_i^k - \hat{H}_j^k\right) + \nabla^2 f_i(x_i^{k+1}) - \nabla^2 f_i(x_i^k)$$
with initialization $H_i^0 = \nabla^2 f_i(x_i^0)$, where $\gamma > 0$ is a parame-
ter. Compared with DAC without compression, i.e., $H_i^{k+1} =
H_i^k - \gamma\sum_{j\in\mathcal{N}_i} w_{ij}(H_i^k - H_j^k) + \nabla^2 f_i(x_i^{k+1}) - \nabla^2 f_i(x_i^k)$,
the term $w_{ij}(H_i^k - H_j^k)$ is replaced by $w_{ij}(\hat{H}_i^k - \hat{H}_j^k)$, which
can be constructed with compressed communication. There are