On the Complexity of Decentralized Smooth
Nonconvex Finite-Sum Optimization
Luo Luo1, Yunyan Bai1, Lesi Chen2, Yuxing Liu3, Haishan Ye4*
1School of Data Science, Fudan University.
2Institute for Interdisciplinary Information Sciences, Tsinghua University.
3Siebel School of Computing and Data Science, University of Illinois
Urbana-Champaign.
4School of Management, Xi’an Jiaotong University.
*Corresponding author(s). E-mail(s): yehaishan@xjtu.edu.cn;
Contributing authors: luoluo@fudan.edu.cn; yybai22@m.fudan.edu.cn;
chenlc23@mails.tsinghua.edu.cn; yuxing6@illinois.edu;
Abstract

We study the decentralized optimization problem
$$\min_{x \in \mathbb{R}^d} f(x) \triangleq \frac{1}{m}\sum_{i=1}^{m} f_i(x),$$
where the local function on the $i$-th agent has the form of $f_i(x) \triangleq \frac{1}{n}\sum_{j=1}^{n} f_{i,j}(x)$ and every individual $f_{i,j}$ is smooth but possibly nonconvex. We propose a stochastic algorithm called the DEcentralized probAbilistic Recursive gradiEnt deScenT (DEAREST+) method, which achieves an $\epsilon$-stationary point at each agent within $\tilde{\mathcal{O}}(L\epsilon^{-2}/\sqrt{\gamma})$ communication rounds, $\tilde{\mathcal{O}}(n + (L + \min\{nL, \sqrt{n/m}\,\bar{L}\})\epsilon^{-2})$ computation rounds, and $\mathcal{O}(mn + \min\{mnL, \sqrt{mn}\,\bar{L}\}\epsilon^{-2})$ local incremental first-order oracle calls, where $L$ is the smoothness parameter of the objective function, $\bar{L}$ is the mean-squared smoothness parameter of all individual functions, and $\gamma$ is the spectral gap of the mixing matrix associated with the network. We then establish lower bounds to show that the proposed method is near-optimal. Notice that the smoothness parameters $L$ and $\bar{L}$ used in our algorithm design and analysis are global, leading to sharper complexity bounds than existing results that depend on the local smoothness. We further extend DEAREST+ to solve the decentralized finite-sum optimization problem under the Polyak–Łojasiewicz condition, also achieving near-optimal complexity bounds.

Keywords: decentralized optimization, nonconvex optimization, smoothness parameter, variance reduction, Polyak–Łojasiewicz condition

The conference version of this manuscript was published at ICML 2024 [10], which contains the results under the PL condition (Section 6) in the special case of $L = \bar{L}$ and $n = 1$.
1 Introduction
We study the decentralized optimization problem
$$\min_{x \in \mathbb{R}^d} f(x) \triangleq \frac{1}{m}\sum_{i=1}^{m} f_i(x), \tag{1}$$
over a connected network with $m$ agents, where $f_i: \mathbb{R}^d \to \mathbb{R}$ is the local function on the $i$-th agent that has the finite-sum structure with $n$ individual functions as follows:
$$f_i(x) \triangleq \frac{1}{n}\sum_{j=1}^{n} f_{i,j}(x). \tag{2}$$
We suppose each individual function $f_{i,j}: \mathbb{R}^d \to \mathbb{R}$ is smooth but possibly nonconvex.
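As an illustrative special case (our own example; the paper's applications are cited next), consider decentralized empirical risk minimization: if agent $i$ holds $n$ training samples $\{(a_{i,j}, b_{i,j})\}_{j=1}^{n}$, then taking
$$f_{i,j}(x) = \ell\big(a_{i,j}^{\top}x,\, b_{i,j}\big)$$
fits formulations (1)–(2), where $\ell$ is a smooth but possibly nonconvex loss such as a sigmoid-type loss for robust binary classification.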
This formulation covers many applications in statistics [6, 91], signal processing [23, 56, 74, 75], and machine learning [4, 14, 29, 77, 82]. In the decentralized scenario, all agents aim to collaboratively solve problem (1), and each of them can only communicate with its neighbors. We focus on the complexity of achieving an approximate stationary point of the global objective at every agent.
For decentralized optimization, the limitation on the communication protocol means that local agents cannot access exact global information at each round, which creates the need for communication rounds to reduce the consensus error. Gradient tracking [55, 63, 65, 70] is a useful technique to approximate the average of the local gradients and make the local first-order estimator accurate. Directly extending (stochastic) gradient descent methods to the decentralized setting [20, 21, 43, 48, 54, 72, 76, 84] cannot take advantage of the popular finite-sum structure of the local functions. It is well known that algorithms with the stochastic recursive gradient estimator [16, 24, 42, 59, 60] can achieve the optimal incremental first-order oracle (IFO) complexity for finding an approximate stationary point of a finite-sum nonconvex function under the mean-squared smoothness assumption. Sun et al. [73] first combined the stochastic recursive gradient estimator with gradient tracking to solve the decentralized nonconvex finite-sum optimization problem by proposing Decentralized Gradient Estimation and Tracking (D-GET). Later, Xin et al. [81] and Zhan et al. [89] proposed GT-SARAH and efficient decentralized stochastic gradient descent (EDSGD), respectively, to improve the complexity in terms of the dependency on the numbers of agents and individual functions. Li et al. [39] proposed DEcentralized STochastic REcurSive gradient methodS (DESTRESS), which introduces Chebyshev acceleration [7] to achieve a tighter dependency on the spectral gap of the mixing matrix associated with the network. Metelev et al. [51] further considered the nonconvex problem over the time-varying network. Additionally, Lu and De Sa [48] and Yuan et al. [85] studied the tightness of decentralized nonconvex optimization in the online setting.
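For intuition, the gradient-tracking recursion mentioned above can be summarized in a few lines. The following is a minimal NumPy sketch of the classical template under the assumption of a doubly stochastic mixing matrix $W$; it is our illustration of the generic technique, not the DEAREST+ algorithm or any specific method from the works cited above.

```python
import numpy as np

def gradient_tracking(W, grads, X0, eta, T):
    """Classical gradient-tracking template (illustrative sketch).

    W     : (m, m) doubly stochastic mixing matrix of the network
    grads : callable mapping (m, d) local iterates to the (m, d) matrix
            whose i-th row is the gradient of f_i at the i-th local iterate
    X0    : (m, d) initial local iterates
    eta   : step size
    T     : number of rounds
    """
    X = X0.copy()
    G = grads(X)
    Y = G.copy()  # tracking variable: its column mean equals the mean local gradient
    for _ in range(T):
        X = W @ X - eta * Y        # mix with neighbors, then descend along tracked direction
        G_new = grads(X)
        Y = W @ Y + G_new - G      # tracking update; doubly stochastic W preserves column means
        G = G_new
    return X
```

Since multiplying by a doubly stochastic $W$ preserves column means, the invariant $\frac{1}{m}\sum_{i} y_i = \frac{1}{m}\sum_{i} \nabla f_i(x_i)$ holds by induction, which is what makes each local direction $y_i$ an increasingly accurate estimate of the global gradient as consensus improves.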
It is worth noting that existing works [39, 48, 51, 73, 81, 85] for decentralized nonconvex optimization only consider the local smoothness parameters, which may be arbitrarily larger than the global ones. Furthermore, their analysis of the computation complexity focuses on only one of the two measures: the overall local incremental first-order oracle (LIFO) calls or the number of computation rounds. Notice that in each computation round, a distributed algorithm can make a subset of the agents access their LIFO while allowing the other agents to skip their local computation steps [49, 52]. Therefore, the LIFO calls and the computation rounds should be addressed separately. Recently, Liu et al. [46] and Ye et al. [83] studied distributed optimization by considering the global smoothness dependency and the partial participation protocol, while their results only address the convex problem.
In this paper, we refine the setting of decentralized smooth nonconvex finite-sum optimization (1) by distinguishing the different smoothness parameters and considering the partial participation protocol. We propose a novel stochastic algorithm called DEcentralized probAbilistic Recursive gradiEnt deScenT (DEAREST+), achieving an $\epsilon$-stationary point at every agent within $\tilde{\mathcal{O}}(L\epsilon^{-2}/\sqrt{\gamma})$ communication rounds, $\tilde{\mathcal{O}}(n + (L + \min\{nL, \sqrt{n/m}\,\bar{L}\})\epsilon^{-2})$ computation rounds, and $\mathcal{O}(mn + \min\{mnL, \sqrt{mn}\,\bar{L}\}\epsilon^{-2})$ LIFO calls, where $L$ is the smoothness parameter of the global objective function $f$, $\bar{L}$ is the mean-squared smoothness parameter of all individual functions $\{f_{i,j}\}_{i,j=1}^{m,n}$, and $\gamma$ is the spectral gap of the mixing matrix associated with the network. We then establish lower complexity bounds with respect to $\epsilon$, $m$, $n$, $L$, $\bar{L}$, and $\gamma$ to show the near-optimality of our method. Notice that the smoothness parameters $L$ and $\bar{L}$ in our results are global, leading to sharper complexity bounds than existing ones that depend on the local smoothness. For the single-machine scenario (i.e., $m = 1$), our theory indicates an incremental first-order oracle (IFO) complexity of $\mathcal{O}(n + \min\{nL, \sqrt{n}\,\bar{L}\}\epsilon^{-2})$, which is a trade-off between the complexity of $\mathcal{O}(nL\epsilon^{-2})$ from vanilla gradient descent [17, 18, 58] and the complexity of $\mathcal{O}(n + \sqrt{n}\,\bar{L}\epsilon^{-2})$ from stochastic variance-reduced methods [24, 42, 61, 78, 92, 93].
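To make this trade-off concrete, the following back-of-the-envelope script (our illustration; constants hidden by the $\mathcal{O}(\cdot)$ notation are ignored and the sample values are arbitrary) compares the two single-machine IFO bounds:

```python
import math

def gd_ifo(n, L, eps):
    # vanilla gradient descent: n IFO calls per iteration, O(L / eps^2) iterations
    return n * L / eps ** 2

def vr_ifo(n, L_bar, eps):
    # stochastic variance reduction: O(n + sqrt(n) * L_bar / eps^2)
    return n + math.sqrt(n) * L_bar / eps ** 2

n, L, eps = 10_000, 1.0, 1e-3
for L_bar in (1.0, 10.0, math.sqrt(n) * L, 10.0 * math.sqrt(n) * L):
    winner = "variance reduction" if vr_ifo(n, L_bar, eps) < gd_ifo(n, L, eps) else "gradient descent"
    print(f"L_bar = {L_bar:9.1f}: GD ~ {gd_ifo(n, L, eps):.1e}, "
          f"VR ~ {vr_ifo(n, L_bar, eps):.1e} -> {winner}")
```

The crossover occurs around $\bar{L} \approx \sqrt{n}\,L$, where both bounds scale like $nL\epsilon^{-2}$; this is exactly the threshold at which the $\min\{nL, \sqrt{n}\,\bar{L}\}$ term switches branches.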
We further apply DEAREST+ to solve the decentralized finite-sum problem under the Polyak–Łojasiewicz (PL) condition [47, 62, 87], which achieves an $\epsilon$-suboptimal solution at every agent within $\tilde{\mathcal{O}}(\kappa\ln(1/\epsilon)/\sqrt{\gamma})$ communication rounds, $\tilde{\mathcal{O}}((n + \kappa + \min\{n\kappa, \sqrt{n/m}\,\bar{\kappa}\})\ln(1/\epsilon))$ computation rounds, and $\mathcal{O}((mn + \min\{mn\kappa, \sqrt{mn}\,\bar{\kappa}\})\ln(1/\epsilon))$ LIFO calls, where $\kappa \triangleq L/\mu$ is the global condition number, $\bar{\kappa} \triangleq \bar{L}/\mu$ is the mean-squared condition number, and $\mu$ is the PL parameter. We also provide lower bounds to show that the above upper complexity bounds under the PL condition are near-optimal.
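For the reader's convenience, we recall the standard form of the PL condition used above (the formal assumption for our results appears in Section 6): a differentiable function $f$ satisfies the PL condition with parameter $\mu > 0$ if
$$\frac{1}{2}\|\nabla f(x)\|^{2} \ge \mu\big(f(x) - f^{*}\big) \quad \text{for all } x \in \mathbb{R}^{d},$$
where $f^{*} = \inf_{x \in \mathbb{R}^d} f(x)$, and an $\epsilon$-suboptimal solution means a point $\hat{x}$ with $f(\hat{x}) - f^{*} \le \epsilon$ (in expectation for stochastic methods).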
2 Preliminaries
We formally introduce the notations and problem settings used in this paper.
2.1 Notations
We use bold lower-case letters for vectors and bold upper-case letters for matrices.
The notation $\|\cdot\|$ denotes the Frobenius norm of a matrix as well as the Euclidean norm of a vector, and $\|\cdot\|_{2}$ denotes the spectral norm of a matrix. We let $\mathbf{1} = [1, \cdots, 1]^{\top} \in \mathbb{R}^{m}$ and denote $\mathbf{I} \in \mathbb{R}^{m \times m}$ as the identity matrix. We define
aggregated variables for all agents as
$$\mathbf{X} = \begin{bmatrix} \mathbf{x}_1 \\ \vdots \\ \mathbf{x}_m \end{bmatrix} \in \mathbb{R}^{m \times d}, \tag{3}$$
where each $\mathbf{x}_i \in \mathbb{R}^{1 \times d}$ is the local variable on the $i$-th agent. We use a lower-case letter with a bar to represent the mean vector, such that
$$\bar{\mathbf{x}} = \frac{1}{m}\sum_{i=1}^{m} \mathbf{x}_i.$$
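Note that the mean vector can equivalently be written in the aggregated notation (a direct consequence of the definitions above) as $\bar{\mathbf{x}} = \frac{1}{m}\mathbf{1}^{\top}\mathbf{X} \in \mathbb{R}^{1 \times d}$.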
We also introduce the matrix of aggregated gradients of the local functions $f_1, \dots, f_m$ as
$$\nabla F(\mathbf{X}) = \begin{bmatrix} \nabla f_1(\mathbf{x}_1) \\ \vdots \\ \nabla f_m(\mathbf{x}_m) \end{bmatrix} \in \mathbb{R}^{m \times d}. \tag{4}$$
For ease of presentation, we allow the input of a function to also be organized as a row vector, such as $f(\bar{\mathbf{x}})$, $f_i(\mathbf{x}_i)$, and $\nabla f_i(\mathbf{x}_i)$ for some $i \in [m]$.
2.2 Problem Settings
We suppose the formulations (1)–(2) satisfy the following assumptions.
Assumption 1 (lower bounded). We suppose the objective function $f: \mathbb{R}^d \to \mathbb{R}$ is lower bounded, i.e., we have
$$f^{*} \triangleq \inf_{x \in \mathbb{R}^d} f(x) > -\infty. \tag{5}$$
Assumption 2 (global smooth). We suppose the differentiable function $f: \mathbb{R}^d \to \mathbb{R}$ is $L$-smooth for some $L > 0$, i.e.,
$$\|\nabla f(x) - \nabla f(y)\| \le L\|x - y\| \tag{6}$$
for all $x, y \in \mathbb{R}^d$.
Assumption 3 (global mean-squared smooth). We suppose the individual functions $\{f_{i,j}\}_{i,j=1}^{m,n}$ are $\bar{L}$-mean-squared smooth for some $\bar{L} > 0$, i.e., we have
$$\frac{1}{mn}\sum_{i=1}^{m}\sum_{j=1}^{n} \|\nabla f_{i,j}(x) - \nabla f_{i,j}(y)\|^{2} \le \bar{L}^{2}\|x - y\|^{2} \tag{7}$$
for all $x, y \in \mathbb{R}^d$.
We present the relationship between the smoothness parameters $L$ and $\bar{L}$ as follows.
Proposition 1. The smoothness conditions in Assumptions 2 and 3 have the following relationships:
(a) If the individual functions $\{f_{i,j}\}_{i,j=1}^{m,n}$ are $\bar{L}$-mean-squared smooth, then each of $\{f_{i,j}\}_{i,j=1}^{m,n}$ and $\{f_i\}_{i=1}^{m}$ is $\sqrt{mn}\,\bar{L}$-smooth, i.e., we have
$$\|\nabla f_{i,j}(x) - \nabla f_{i,j}(y)\| \le \sqrt{mn}\,\bar{L}\|x - y\| \tag{8}$$
and
$$\|\nabla f_i(x) - \nabla f_i(y)\| \le \sqrt{mn}\,\bar{L}\|x - y\| \tag{9}$$
for all $x, y \in \mathbb{R}^d$, $i \in [m]$, and $j \in [n]$.
(b) If the individual functions $\{f_{i,j}\}_{i,j=1}^{m,n}$ are $\bar{L}$-mean-squared smooth, then the objective function $f$ is $\bar{L}$-smooth, i.e., we have
$$\|\nabla f(x) - \nabla f(y)\| \le \bar{L}\|x - y\|$$
for all $x, y \in \mathbb{R}^d$.
(c) For any $L > 0$ and $\bar{L} > 0$ such that $\bar{L} \ge L$, there exist functions $\{f_{i,j}\}_{i,j=1}^{m,n}$ which satisfy Assumptions 2 and 3 with the tight smoothness parameters $L$ and $\bar{L}$, respectively.
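To see where statements (a) and (b) come from, note that a single summand is bounded by the whole sum in (7) (a short sketch of the standard argument, recorded here for convenience):
$$\|\nabla f_{i,j}(x) - \nabla f_{i,j}(y)\|^{2} \le \sum_{i'=1}^{m}\sum_{j'=1}^{n} \|\nabla f_{i',j'}(x) - \nabla f_{i',j'}(y)\|^{2} \le mn\bar{L}^{2}\|x - y\|^{2},$$
which yields (8) after taking square roots, while statement (b) follows from the triangle inequality and the Cauchy–Schwarz inequality:
$$\|\nabla f(x) - \nabla f(y)\| \le \frac{1}{mn}\sum_{i=1}^{m}\sum_{j=1}^{n} \|\nabla f_{i,j}(x) - \nabla f_{i,j}(y)\| \le \Bigg(\frac{1}{mn}\sum_{i=1}^{m}\sum_{j=1}^{n} \|\nabla f_{i,j}(x) - \nabla f_{i,j}(y)\|^{2}\Bigg)^{1/2} \le \bar{L}\|x - y\|.$$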
Remark 1. Statements (a) and (b) of Proposition 1 imply that upper bounds stated in terms of the tight global smoothness parameter $L$ in Assumption 2 are potentially sharper than upper bounds stated in terms of the global mean-squared smoothness parameter $\bar{L}$ in Assumption 3 (global mean-squared smooth). Statement (c) of Proposition 1 means the ratio between the tight parameters $\bar{L}$ and $L$ can be arbitrarily large, so distinguishing between $\bar{L}$ and $L$ is necessary in finite-sum nonconvex optimization.
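As a concrete toy illustration of statement (c) (our own example, included for intuition), take $m = 1$, $n = 2$, and for a constant $c \ge 1$ let
$$f_{1,1}(x) = \frac{c+1}{2}x^{2}, \qquad f_{1,2}(x) = -\frac{c-1}{2}x^{2}.$$
Then $f(x) = \frac{1}{2}x^{2}$ is $1$-smooth (so $L = 1$), while the tight mean-squared smoothness parameter is $\bar{L} = \sqrt{\frac{(c+1)^{2} + (c-1)^{2}}{2}} = \sqrt{c^{2}+1} \ge c$, so the ratio $\bar{L}/L$ grows without bound as $c$ increases.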
We also present other smoothness assumptions used in related work [39, 51, 73, 81, 89] for comparison.
Assumption 4 (local smooth). We suppose each local function $f_i: \mathbb{R}^d \to \mathbb{R}$ is $L$-smooth for some $L > 0$, i.e., we have
$$\|\nabla f_i(x) - \nabla f_i(y)\| \le L\|x - y\| \tag{10}$$
for all $i \in [m]$ and $x, y \in \mathbb{R}^d$.
Assumption 5 (local mean-squared smooth). We suppose the individual functions $\{f_{i,j}\}_{j=1}^{n}$ on each agent are $\bar{L}$-mean-squared smooth for some $\bar{L} > 0$, i.e., we have
$$\frac{1}{n}\sum_{j=1}^{n} \|\nabla f_{i,j}(x) - \nabla f_{i,j}(y)\|^{2} \le \bar{L}^{2}\|x - y\|^{2} \tag{11}$$
for all $i \in [m]$ and $x, y \in \mathbb{R}^d$.