On the Complexity of Decentralized Smooth
Nonconvex Finite-Sum Optimization
Luo Luo1, Yunyan Bai1, Lesi Chen2, Yuxing Liu3, Haishan Ye4*
1School of Data Science, Fudan University.
2Institute for Interdisciplinary Information Sciences, Tsinghua University.
3Siebel School of Computing and Data Science, University of Illinois
Urbana-Champaign.
4School of Management, Xi’an Jiaotong University.
*Corresponding author(s). E-mail(s): yehaishan@xjtu.edu.cn;
Contributing authors: luoluo@fudan.edu.cn; yybai22@m.fudan.edu.cn;
chenlc23@mails.tsinghua.edu.cn; yuxing6@illinois.edu;
Abstract

We study the decentralized optimization problem
$$\min_{x \in \mathbb{R}^d} f(x) \triangleq \frac{1}{m}\sum_{i=1}^{m} f_i(x),$$
where the local function on the $i$-th agent has the form of $f_i(x) \triangleq \frac{1}{n}\sum_{j=1}^{n} f_{i,j}(x)$ and every individual $f_{i,j}$ is smooth but possibly nonconvex. We propose a stochastic algorithm called the DEcentralized probAbilistic Recursive gradiEnt deScenT (DEAREST+) method, which achieves an $\epsilon$-stationary point at each agent within $\tilde{\mathcal{O}}(L\epsilon^{-2}/\sqrt{\gamma})$ communication rounds, $\tilde{\mathcal{O}}(n + (L + \min\{nL, \sqrt{n/m}\,\bar{L}\})\epsilon^{-2})$ computation rounds, and $\mathcal{O}(mn + \min\{mnL, \sqrt{mn}\,\bar{L}\}\epsilon^{-2})$ local incremental first-order oracle calls, where $L$ is the smoothness parameter of the objective function, $\bar{L}$ is the mean-squared smoothness parameter of all individual functions, and $\gamma$ is the spectral gap of the mixing matrix associated with the network. We then establish lower bounds to show that the proposed method is near-optimal. Notice that the smoothness parameters $L$ and $\bar{L}$ used in our algorithm design and analysis are global, leading to sharper complexity bounds than existing results that depend on the local smoothness. We further extend DEAREST+ to solve the decentralized finite-sum optimization problem under the Polyak–Łojasiewicz condition, also achieving near-optimal complexity bounds.

Keywords: decentralized optimization, nonconvex optimization, smoothness parameter, variance reduction, Polyak–Łojasiewicz condition

The conference version of this manuscript was published at ICML 2024 [10], which contains the results under the PL condition (Section 6) in the special case of $L = \bar{L}$ and $n = 1$.
1 Introduction
We study the decentralized optimization problem
$$\min_{x \in \mathbb{R}^d} f(x) \triangleq \frac{1}{m}\sum_{i=1}^{m} f_i(x), \tag{1}$$
over a connected network with $m$ agents, where $f_i: \mathbb{R}^d \to \mathbb{R}$ is the local function on the $i$-th agent that has the finite-sum structure with $n$ individual functions as follows:
$$f_i(x) \triangleq \frac{1}{n}\sum_{j=1}^{n} f_{i,j}(x). \tag{2}$$
We suppose each individual function $f_{i,j}: \mathbb{R}^d \to \mathbb{R}$ is smooth but possibly nonconvex.
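As an illustrative special case (our own example; the paper's applications are cited next), consider decentralized empirical risk minimization: if agent $i$ holds $n$ training samples $\{(a_{i,j}, b_{i,j})\}_{j=1}^{n}$, then taking
$$f_{i,j}(x) = \ell\big(a_{i,j}^{\top}x,\, b_{i,j}\big)$$
fits formulations (1)–(2), where $\ell$ is a smooth but possibly nonconvex loss such as a sigmoid-type loss for robust binary classification.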
This formulation covers many applications in statistics [6, 91], signal processing [23, 56, 74, 75], and machine learning [4, 14, 29, 77, 82]. In the decentralized scenario, all agents aim to collaboratively solve problem (1), and each of them can only communicate with its neighbors. We focus on the complexity of achieving an approximate stationary point of the global objective at every agent.
For decentralized optimization, the limitation on the communication protocol means that local agents cannot access exact global information at each round, which creates the need for communication rounds to reduce the consensus error. Gradient tracking [55, 63, 65, 70] is a useful technique to approximate the average of the local gradients and make the local first-order estimator accurate. Directly extending (stochastic) gradient descent methods to the decentralized setting [20, 21, 43, 48, 54, 72, 76, 84] cannot take advantage of the popular finite-sum structure of the local functions. It is well known that algorithms with the stochastic recursive gradient estimator [16, 24, 42, 59, 60] can achieve the optimal incremental first-order oracle (IFO) complexity for finding an approximate stationary point of a finite-sum nonconvex function under the mean-squared smoothness assumption. Sun et al. [73] first combined the stochastic recursive gradient estimator with gradient tracking to solve the decentralized nonconvex finite-sum optimization problem by proposing Decentralized Gradient Estimation and Tracking (D-GET). Later, Xin et al. [81] and Zhan et al. [89] proposed GT-SARAH and efficient decentralized stochastic gradient descent (EDSGD), respectively, to improve the complexity in terms of the dependency on the numbers of agents and individual functions. Li et al. [39] proposed DEcentralized STochastic REcurSive gradient methodS (DESTRESS), which introduces Chebyshev acceleration [7] to achieve a tighter dependency on the spectral gap of the mixing matrix associated with the network. Metelev et al. [51] further considered the nonconvex problem over the time-varying network. Additionally, Lu and De Sa [48] and Yuan et al. [85] studied the tightness of decentralized nonconvex optimization in the online setting.
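For intuition, the gradient-tracking recursion mentioned above can be summarized in a few lines. The following is a minimal NumPy sketch of the classical template under the assumption of a doubly stochastic mixing matrix $W$; it is our illustration of the generic technique, not the DEAREST+ algorithm or any specific method from the works cited above.

```python
import numpy as np

def gradient_tracking(W, grads, X0, eta, T):
    """Classical gradient-tracking template (illustrative sketch).

    W     : (m, m) doubly stochastic mixing matrix of the network
    grads : callable mapping (m, d) local iterates to the (m, d) matrix
            whose i-th row is the gradient of f_i at the i-th local iterate
    X0    : (m, d) initial local iterates
    eta   : step size
    T     : number of rounds
    """
    X = X0.copy()
    G = grads(X)
    Y = G.copy()  # tracking variable: its column mean equals the mean local gradient
    for _ in range(T):
        X = W @ X - eta * Y        # mix with neighbors, then descend along tracked direction
        G_new = grads(X)
        Y = W @ Y + G_new - G      # tracking update; doubly stochastic W preserves column means
        G = G_new
    return X
```

Since multiplying by a doubly stochastic $W$ preserves column means, the invariant $\frac{1}{m}\sum_{i} y_i = \frac{1}{m}\sum_{i} \nabla f_i(x_i)$ holds by induction, which is what makes each local direction $y_i$ an increasingly accurate estimate of the global gradient as consensus improves.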
It is worth noting that existing works [39, 48, 51, 73, 81, 85] for decentralized nonconvex optimization only consider the local smoothness parameters, which may be arbitrarily larger than the global ones. Furthermore, their analysis of the computation complexity focuses on only one of the two measures: the overall local incremental first-order oracle (LIFO) calls or the number of computation rounds. Notice that in each computation round, a distributed algorithm can make a subset of the agents access their LIFO while allowing the other agents to skip their local computation steps [49, 52]. Therefore, the LIFO calls and the computation rounds should be addressed separately. Recently, Liu et al. [46] and Ye et al. [83] studied distributed optimization by considering the global smoothness dependency and the partial participation protocol, while their results only address the convex problem.
In this paper, we refine the setting of decentralized smooth nonconvex finite-sum optimization (1) by distinguishing the different smoothness parameters and considering the partial participation protocol. We propose a novel stochastic algorithm called DEcentralized probAbilistic Recursive gradiEnt deScenT (DEAREST+), achieving an $\epsilon$-stationary point at every agent within $\tilde{\mathcal{O}}(L\epsilon^{-2}/\sqrt{\gamma})$ communication rounds, $\tilde{\mathcal{O}}(n + (L + \min\{nL, \sqrt{n/m}\,\bar{L}\})\epsilon^{-2})$ computation rounds, and $\mathcal{O}(mn + \min\{mnL, \sqrt{mn}\,\bar{L}\}\epsilon^{-2})$ LIFO calls, where $L$ is the smoothness parameter of the global objective function $f$, $\bar{L}$ is the mean-squared smoothness parameter of all individual functions $\{f_{i,j}\}_{i,j=1}^{m,n}$, and $\gamma$ is the spectral gap of the mixing matrix associated with the network. We then establish lower complexity bounds with respect to $\epsilon$, $m$, $n$, $L$, $\bar{L}$, and $\gamma$ to show the near-optimality of our method. Notice that the smoothness parameters $L$ and $\bar{L}$ in our results are global, leading to sharper complexity bounds than existing ones that depend on the local smoothness. For the single-machine scenario (i.e., $m = 1$), our theory indicates an incremental first-order oracle (IFO) complexity of $\mathcal{O}(n + \min\{nL, \sqrt{n}\,\bar{L}\}\epsilon^{-2})$, which is a trade-off between the complexity of $\mathcal{O}(nL\epsilon^{-2})$ from vanilla gradient descent [17, 18, 58] and the complexity of $\mathcal{O}(n + \sqrt{n}\,\bar{L}\epsilon^{-2})$ from stochastic variance-reduced methods [24, 42, 61, 78, 92, 93].
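To make this trade-off concrete, the following back-of-the-envelope script (our illustration; constants hidden by the $\mathcal{O}(\cdot)$ notation are ignored and the sample values are arbitrary) compares the two single-machine IFO bounds:

```python
import math

def gd_ifo(n, L, eps):
    # vanilla gradient descent: n IFO calls per iteration, O(L / eps^2) iterations
    return n * L / eps ** 2

def vr_ifo(n, L_bar, eps):
    # stochastic variance reduction: O(n + sqrt(n) * L_bar / eps^2)
    return n + math.sqrt(n) * L_bar / eps ** 2

n, L, eps = 10_000, 1.0, 1e-3
for L_bar in (1.0, 10.0, math.sqrt(n) * L, 10.0 * math.sqrt(n) * L):
    winner = "variance reduction" if vr_ifo(n, L_bar, eps) < gd_ifo(n, L, eps) else "gradient descent"
    print(f"L_bar = {L_bar:9.1f}: GD ~ {gd_ifo(n, L, eps):.1e}, "
          f"VR ~ {vr_ifo(n, L_bar, eps):.1e} -> {winner}")
```

The crossover occurs around $\bar{L} \approx \sqrt{n}\,L$, where both bounds scale like $nL\epsilon^{-2}$; this is exactly the threshold at which the $\min\{nL, \sqrt{n}\,\bar{L}\}$ term switches branches.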
We further apply DEAREST+ to solve the decentralized finite-sum problem under the Polyak–Łojasiewicz (PL) condition [47, 62, 87], which achieves an $\epsilon$-suboptimal solution at every agent within $\tilde{\mathcal{O}}(\kappa\ln(1/\epsilon)/\sqrt{\gamma})$ communication rounds, $\tilde{\mathcal{O}}((n + \kappa + \min\{n\kappa, \sqrt{n/m}\,\bar{\kappa}\})\ln(1/\epsilon))$ computation rounds, and $\mathcal{O}((mn + \min\{mn\kappa, \sqrt{mn}\,\bar{\kappa}\})\ln(1/\epsilon))$ LIFO calls, where $\kappa \triangleq L/\mu$ is the global condition number, $\bar{\kappa} \triangleq \bar{L}/\mu$ is the mean-squared condition number, and $\mu$ is the PL parameter. We also provide lower bounds to show that the above upper complexity bounds under the PL condition are near-optimal.
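For the reader's convenience, we recall the standard form of the PL condition used above (the formal assumption for our results appears in Section 6): a differentiable function $f$ satisfies the PL condition with parameter $\mu > 0$ if
$$\frac{1}{2}\|\nabla f(x)\|^{2} \ge \mu\big(f(x) - f^{*}\big) \quad \text{for all } x \in \mathbb{R}^{d},$$
where $f^{*} = \inf_{x \in \mathbb{R}^d} f(x)$, and an $\epsilon$-suboptimal solution means a point $\hat{x}$ with $f(\hat{x}) - f^{*} \le \epsilon$ (in expectation for stochastic methods).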
2 Preliminaries
We formally introduce the notations and problem settings used in this paper.
2.1 Notations
We use bold lower-case letters for vectors and bold upper-case letters for matrices.
The notation $\|\cdot\|$ denotes the Frobenius norm of a matrix as well as the Euclidean norm of a vector, and $\|\cdot\|_{2}$ denotes the spectral norm of a matrix. We let $\mathbf{1} = [1, \cdots, 1]^{\top} \in \mathbb{R}^{m}$ and denote $\mathbf{I} \in \mathbb{R}^{m \times m}$ as the identity matrix. We define
aggregated variables for all agents as
$$\mathbf{X} = \begin{bmatrix} \mathbf{x}_1 \\ \vdots \\ \mathbf{x}_m \end{bmatrix} \in \mathbb{R}^{m \times d}, \tag{3}$$
where each $\mathbf{x}_i \in \mathbb{R}^{1 \times d}$ is the local variable on the $i$-th agent. We use a lower-case letter with a bar to represent the mean vector, such that
$$\bar{\mathbf{x}} = \frac{1}{m}\sum_{i=1}^{m} \mathbf{x}_i.$$
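Note that the mean vector can equivalently be written in the aggregated notation (a direct consequence of the definitions above) as $\bar{\mathbf{x}} = \frac{1}{m}\mathbf{1}^{\top}\mathbf{X} \in \mathbb{R}^{1 \times d}$.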
We also introduce the matrix of aggregated gradients of the local functions $f_1, \dots, f_m$ as
$$\nabla F(\mathbf{X}) = \begin{bmatrix} \nabla f_1(\mathbf{x}_1) \\ \vdots \\ \nabla f_m(\mathbf{x}_m) \end{bmatrix} \in \mathbb{R}^{m \times d}. \tag{4}$$
For ease of presentation, we allow the input of a function to also be organized as a row vector, such as $f(\bar{\mathbf{x}})$, $f_i(\mathbf{x}_i)$, and $\nabla f_i(\mathbf{x}_i)$ for some $i \in [m]$.
2.2 Problem Settings
We suppose the formulations (1)–(2) satisfy the following assumptions.
Assumption 1 (lower bounded). We suppose the objective function $f: \mathbb{R}^d \to \mathbb{R}$ is lower bounded, i.e., we have
$$f^{*} \triangleq \inf_{x \in \mathbb{R}^d} f(x) > -\infty. \tag{5}$$
Assumption 2 (global smooth). We suppose the differentiable function $f: \mathbb{R}^d \to \mathbb{R}$ is $L$-smooth for some $L > 0$, i.e.,
$$\|\nabla f(x) - \nabla f(y)\| \le L\|x - y\| \tag{6}$$
for all $x, y \in \mathbb{R}^d$.
Assumption 3 (global mean-squared smooth). We suppose the individual functions $\{f_{i,j}\}_{i,j=1}^{m,n}$ are $\bar{L}$-mean-squared smooth for some $\bar{L} > 0$, i.e., we have
$$\frac{1}{mn}\sum_{i=1}^{m}\sum_{j=1}^{n} \|\nabla f_{i,j}(x) - \nabla f_{i,j}(y)\|^{2} \le \bar{L}^{2}\|x - y\|^{2} \tag{7}$$
for all $x, y \in \mathbb{R}^d$.
We present the relationship between the smoothness parameters $L$ and $\bar{L}$ as follows.
Proposition 1. The smoothness conditions in Assumptions 2 and 3 have the following relationships:
(a) If the individual functions $\{f_{i,j}\}_{i,j=1}^{m,n}$ are $\bar{L}$-mean-squared smooth, then each of $\{f_{i,j}\}_{i,j=1}^{m,n}$ and $\{f_i\}_{i=1}^{m}$ is $\sqrt{mn}\,\bar{L}$-smooth, i.e., we have
$$\|\nabla f_{i,j}(x) - \nabla f_{i,j}(y)\| \le \sqrt{mn}\,\bar{L}\|x - y\| \tag{8}$$
and
$$\|\nabla f_i(x) - \nabla f_i(y)\| \le \sqrt{mn}\,\bar{L}\|x - y\| \tag{9}$$
for all $x, y \in \mathbb{R}^d$, $i \in [m]$, and $j \in [n]$.
(b) If the individual functions $\{f_{i,j}\}_{i,j=1}^{m,n}$ are $\bar{L}$-mean-squared smooth, then the objective function $f$ is $\bar{L}$-smooth, i.e., we have
$$\|\nabla f(x) - \nabla f(y)\| \le \bar{L}\|x - y\|$$
for all $x, y \in \mathbb{R}^d$.
(c) For any $L > 0$ and $\bar{L} > 0$ such that $\bar{L} \ge L$, there exist functions $\{f_{i,j}\}_{i,j=1}^{m,n}$ which satisfy Assumptions 2 and 3 with the tight smoothness parameters $L$ and $\bar{L}$, respectively.
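To see where statements (a) and (b) come from, note that a single summand is bounded by the whole sum in (7) (a short sketch of the standard argument, recorded here for convenience):
$$\|\nabla f_{i,j}(x) - \nabla f_{i,j}(y)\|^{2} \le \sum_{i'=1}^{m}\sum_{j'=1}^{n} \|\nabla f_{i',j'}(x) - \nabla f_{i',j'}(y)\|^{2} \le mn\bar{L}^{2}\|x - y\|^{2},$$
which yields (8) after taking square roots, while statement (b) follows from the triangle inequality and the Cauchy–Schwarz inequality:
$$\|\nabla f(x) - \nabla f(y)\| \le \frac{1}{mn}\sum_{i=1}^{m}\sum_{j=1}^{n} \|\nabla f_{i,j}(x) - \nabla f_{i,j}(y)\| \le \Bigg(\frac{1}{mn}\sum_{i=1}^{m}\sum_{j=1}^{n} \|\nabla f_{i,j}(x) - \nabla f_{i,j}(y)\|^{2}\Bigg)^{1/2} \le \bar{L}\|x - y\|.$$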
Remark 1. Statements (a) and (b) of Proposition 1 imply that upper bounds stated in terms of the tight global smoothness parameter $L$ in Assumption 2 are potentially sharper than upper bounds stated in terms of the global mean-squared smoothness parameter $\bar{L}$ in Assumption 3 (global mean-squared smooth). Statement (c) of Proposition 1 means the ratio between the tight parameters $\bar{L}$ and $L$ can be arbitrarily large, so distinguishing between $\bar{L}$ and $L$ is necessary in finite-sum nonconvex optimization.
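As a concrete toy illustration of statement (c) (our own example, included for intuition), take $m = 1$, $n = 2$, and for a constant $c \ge 1$ let
$$f_{1,1}(x) = \frac{c+1}{2}x^{2}, \qquad f_{1,2}(x) = -\frac{c-1}{2}x^{2}.$$
Then $f(x) = \frac{1}{2}x^{2}$ is $1$-smooth (so $L = 1$), while the tight mean-squared smoothness parameter is $\bar{L} = \sqrt{\frac{(c+1)^{2} + (c-1)^{2}}{2}} = \sqrt{c^{2}+1} \ge c$, so the ratio $\bar{L}/L$ grows without bound as $c$ increases.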
We also present other smoothness assumptions used in related work [39, 51, 73, 81, 89] for comparison.
Assumption 4 (local smooth). We suppose each local function $f_i: \mathbb{R}^d \to \mathbb{R}$ is $L$-smooth for some $L > 0$, i.e., we have
$$\|\nabla f_i(x) - \nabla f_i(y)\| \le L\|x - y\| \tag{10}$$
for all $i \in [m]$ and $x, y \in \mathbb{R}^d$.
Assumption 5 (local mean-squared smooth). We suppose the individual functions $\{f_{i,j}\}_{j=1}^{n}$ on each agent are $\bar{L}$-mean-squared smooth for some $\bar{L} > 0$, i.e., we have
$$\frac{1}{n}\sum_{j=1}^{n} \|\nabla f_{i,j}(x) - \nabla f_{i,j}(y)\|^{2} \le \bar{L}^{2}\|x - y\|^{2} \tag{11}$$
for all $i \in [m]$ and $x, y \in \mathbb{R}^d$.