Sharp Analysis of Stochastic Optimization under
Global Kurdyka-Łojasiewicz Inequality
Ilyas Fatkhullin
ETH AI Center & ETH Zurich Jalal Etesami*
EPFL
Niao He
ETH Zurich Negar Kiyavash
EPFL
Abstract
We study the complexity of finding the global solution to stochastic nonconvex optimization when the objective function satisfies the global Kurdyka-Łojasiewicz (KŁ) inequality and the queries from stochastic gradient oracles satisfy a mild expected smoothness assumption. We first introduce a general framework to analyze Stochastic Gradient Descent (SGD) and its associated nonlinear dynamics under this setting. As a byproduct of our analysis, we obtain a sample complexity of O(ε_f^{-(4-α)/α}) for SGD when the objective satisfies the so-called α-PŁ condition, where α is the degree of gradient domination. Furthermore, we show that a modified SGD with variance reduction and restarting (PAGER) achieves an improved sample complexity of O(ε_f^{-2/α}) when the objective satisfies the average smoothness assumption. This leads to the first optimal algorithm for the important case of α = 1, which appears in applications such as policy optimization in reinforcement learning.
1 Introduction
Nonconvex optimization problems are ubiquitous in machine learning domains such as training deep
neural networks [
22
] or policy optimization in reinforcement learning [
52
]. Stochastic Gradient
Descent (SGD) and its variants are driving the practical success of machine learning approaches.
Naturally, understanding the limits of performance of SGD in the nonconvex setting has become an
important avenue of research in recent years [21, 4, 30, 44, 23, 15, 59].
We are interested in solving the unconstrained stochastic, nonconvex optimization problem of the form

    min_{x ∈ R^d} f(x) := E_{ξ∼D}[f_ξ(x)],    (1)

where f(·) is smooth and possibly nonconvex, and ξ is a random vector drawn from a distribution D. Moreover, we are interested in an important special case of (1), when the expectation can be written as the average of n smooth functions:

    min_{x ∈ R^d} [ f(x) := (1/n) ∑_{i=1}^n f_i(x) ].    (2)
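For concreteness, a minimal numerical instance of the finite-sum problem (2) can be sketched as follows. The quadratic components f_i and the mini-batch oracle here are hypothetical toy choices, not tied to any particular application in the paper:

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy finite-sum instance of problem (2): f(x) = (1/n) * sum_i f_i(x),
# with hypothetical quadratic components f_i(x) = 0.5 * ||x - a_i||^2.
n, d = 100, 5
A = rng.normal(size=(n, d))  # the points a_i are the rows of A

def f(x):
    return 0.5 * np.mean(np.sum((x - A) ** 2, axis=1))

def stochastic_grad(x, batch_size=1):
    """Unbiased estimate of grad f(x): average of grad f_i over a random batch."""
    idx = rng.integers(0, n, size=batch_size)
    return np.mean(x - A[idx], axis=0)

x = np.zeros(d)
g = stochastic_grad(x, batch_size=10)   # one stochastic oracle query
full_grad = x - A.mean(axis=0)          # exact gradient, for comparison
```

Averaging many single-sample queries recovers the full gradient, which is the unbiasedness property the oracle assumptions below build on.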
For a general nonconvex differentiable objective f : R^d → R, finding a global minimum of f is in general intractable [42, 54]. There are two common strategies to analyze optimization methods for nonconvex functions. The first one is to scale down the requirements on the solution of interest from global optimality to some relaxed version, e.g., a first-order stationary point. However, such solutions do not exclude the possibility of approaching a suboptimal local minimum or a saddle point. Another approach is to study nonconvex problems with additional structural assumptions in the hope of
First two authors have equal contribution.
École polytechnique fédérale de Lausanne
Submitted for review in May, 2022. Accepted to NeurIPS 2022 in Sep, 2022.
arXiv:2210.01748v1 [math.OC] 4 Oct 2022
convergence to global solutions. In this direction, several relaxations of convexity have been proposed and analyzed, for instance, star convexity, quasar-convexity, the error bounds condition, the restricted secant inequality, and quadratic growth [26, 24, 23]. Many of the aforementioned relaxations have limited application in real-world problems.
Recently, there has been a surge of interest in the analysis of functions satisfying the so-called Kurdyka-Łojasiewicz (KŁ) inequality [9, 10]. Of particular interest is the family of functions that satisfy the global KŁ inequality. Specifically, we say that f(·) satisfies the (global) KŁ inequality if there exists some continuous function φ(·) such that ||∇f(x)|| ≥ φ(f(x) − inf_x f(x)) for all x ∈ R^d. If this inequality is satisfied for φ(t) = √(2µ) t^{1/α}, then we say that the (global) α-PŁ condition is satisfied for f(·). The special case of the KŁ condition, 2-PŁ, often referred to as the Polyak-Łojasiewicz or PŁ condition, was originally discovered independently in the seminal works of B. Polyak, T. Ležanski and S. Łojasiewicz [48, 33, 38, 39]. Notably, this class of problems has found many interesting emerging applications in machine learning, for instance, policy gradient (PG) methods in reinforcement learning [41, 1, 56], generalized linear models [40], over-parameterized neural networks [2, 57], the linear quadratic regulator in optimal control [13, 19], and low-rank matrix recovery [7].
Despite the increased popularity of the KŁ and α-PŁ assumptions, the analysis of stochastic optimization under them remains limited, and the majority of works focus on deterministic gradient methods. Indeed, until recently, only the special case of α-PŁ with α = 2 was mainly addressed in the literature [26, 23, 30, 53, 49]. In this paper, we study the sample complexities of stochastic optimization for the broader class of nonconvex functions with the global KŁ property.
1.1 Related Works and Open Questions
Stochastic gradient descent. A plethora of existing works has studied the sample complexity of SGD and its variants for finding an ε-stationary point of a general nonconvex function f, that is, a point x̂ ∈ R^d for which E[||∇f(x̂)||] ≤ ε. For instance, [21] showed that for a smooth objective (one with Lipschitz gradients) under the bounded variance (BV) assumption, SGD with properly chosen stepsizes reaches an ε-stationary point with a sample complexity of O(ε^{−4}). Recently, [30, 56] further extended this result under a much milder expected smoothness (ES) assumption on the stochastic gradient. While this sample complexity is known to be optimal for general nonconvex functions, a naive application of this result to the function value using the α-PŁ condition would lead to a suboptimal O(ε_f^{−4/α}) sample complexity for finding an ε_f-optimal solution, i.e., E[f(x) − f*] ≤ ε_f. Recently, [20] studied SGD and established convergence rates for α-PŁ functions under the BV assumption. Their sample complexity result is O(ε_f^{−(4−α)/α}) in our notation. Later, [35] considered an SGD scheme with random reshuffling under local and global KŁ conditions and provided convergence in the iterates for α ∈ (1, 2]. We note that our proof techniques are different from [20] and [35] and are not limited merely to the case of the BV assumption. In this work, we will answer the following open question:

    What is the exact performance limit of SGD under the global KŁ condition and a more practical model of the stochastic oracle?
Variance reduction. There has been extensive research on the development of algorithms which improve the dependence on n and/or ε for both problems (1) and (2) (over simple methods such as SGD and Gradient Descent (GD)). One important family of techniques³ is variance reduction, which emerged from the seminal work of Blatt et al. [8]. The main idea of variance reduction is to make use of the stochastic gradients computed at previous iterations to construct a better gradient estimator at a relatively small computational cost. Various modifications, generalizations, and improvements of the variance reduction technique appeared in subsequent work, for instance, [50, 28, 16] to name a few.
Finite-sum case. A number of recently proposed algorithms such as SNVRG [58], SARAH [45], STORM [15], SPIDER [17], and PAGE [36] achieve the sample complexity O(n + √n ε^{−2}) when minimizing a general nonconvex function with the finite-sum structure (2). This result is also known to be optimal in this setting [36]. [27] studies SARAH in the finite-sum case under a local KŁ assumption and proves convergence in the iterates. The study in [27] is only an asymptotic analysis, and the dependence
3Another independent direction is to make use of higher order information [43, 18, 3].
on the parameters κ and n, which are important in practice for quantifying the improvement over GD and SGD, is ignored. [34] proposes an SVRG-based algorithm for KŁ functions and [55, 45, 46] study other variance reduction techniques, but they only analyze the special case α = 2. Under the 2-PŁ condition, these methods further improve to an O((n + κ√n) log(1/ε_f)) sample complexity⁴ for finding an ε_f-optimal solution. However, it is not clear if it is possible to provide any non-asymptotic guarantees for variance-reduced methods under the α-PŁ condition for any α ∈ [1, 2). In our work, we will answer the following open question:

    What is the extent of improvement any variance reduction scheme can provide under the global α-PŁ condition for finite-sum objectives of the form (2)?
Online/streaming case. While variance reduction methods were initially designed for problems of the form (2), it was later discovered that they also improve over SGD when solving (1) [32, 37]⁵. The analysis of these methods was obtained for general nonconvex functions (for minimizing the norm of the gradient, E[||∇f(x̂)||]) and later extended to 2-PŁ objectives for minimizing the function value, E[f(x) − f*] ≤ ε_f. For example, the methods in [58, 45, 17, 36] achieve O(ε^{−3}) complexity, improving over the O(ε^{−4}) complexity of SGD for finding an ε-stationary point. Under the 2-PŁ condition, these results can be extended to global convergence with O(ε_f^{−1}) sample complexity [36]. However, in contrast to the general nonconvex case, variance reduction under the 2-PŁ assumption does not show any improvement over SGD in terms of ε_f. We highlight that all existing analyses of variance reduction under the α-PŁ condition are restricted to the special case α = 2. We refer the reader to Appendix C, where we elaborate on the key difficulties in the analysis for the cases α ∈ [1, 2). Since the direct analysis for α ∈ [1, 2) is challenging, in order to obtain global convergence in this setting, one could naively translate the complexity for finding a stationary point of a general nonconvex function (which is O(ε^{−3})) to convergence in function value by using the α-PŁ condition: √(2µ) (f(x̂) − f*)^{1/α} ≤ ||∇f(x̂)||. This would result in an O(ε_f^{−3/α}) sample complexity. However, there are two serious issues with this approach. First, this complexity does not provide any improvement over SGD in the most interesting practical case α = 1 and gives a strictly worse result for all α > 1. Second, the guarantees for general nonconvex optimization hold on average, in the sense that the point x̂ is sampled uniformly from all the iterates of the algorithm. It would be more desirable to instead derive last-iterate convergence guarantees under the KŁ (α-PŁ) condition. In this work, we will address the following open question:

    Is it possible to accelerate the O(ε_f^{−(4−α)/α}) sample complexity of SGD under the global α-PŁ condition for stochastic objectives of the form (1)?
1.2 Contributions
In this work, we provide an extensive analysis of stochastic optimization under the global KŁ condition and answer all the above questions. More precisely, our contributions are as follows.

• We provide a new framework for the analysis of the dynamics of SGD under the global KŁ condition (see Section 3). It is based on the analysis of the SGD dynamics, which are governed by a recursive inequality (see Equation (6)). As a result of this analysis, we introduce a set of conditions (see Theorem 1) for designing proper stepsizes to guarantee convergence.

• Using this framework, we provide a sharp analysis of SGD under a general ES assumption (Assumption 4) and demonstrate that the sample complexity O(ε_f^{−(4−α)/α}) is tight for the dynamical system describing SGD.

• Next, we propose PAGER, a new variance reduction scheme with parameter restart. A carefully chosen sequence of parameters allows PAGER to adapt to the nonconvex geometry of the problem and establish state-of-the-art convergence guarantees for minimizing α-PŁ functions. In the online setting (1), we obtain an O(ε_f^{−2/α}) sample complexity for PAGER, which beats the O(ε_f^{−(4−α)/α}) complexity of SGD for the whole spectrum of parameters α ∈ [1, 2). In particular, for the important special case of 1-PŁ, this leads to the first optimal algorithm with O(ε_f^{−2}) sample complexity, which already matches the lower bound known for stochastic convex optimization [42].

• Furthermore, we obtain faster rates with PAGER in the finite-sum case (2), providing the first acceleration over GD and SGD under the α-PŁ condition.

⁴ κ = L/µ is the analogue of the condition number; L is defined in Assumption 6.
⁵ Under additional assumptions such as smoothness of individual functions f_ξ(·) or an even milder condition such as average L-smoothness (Assumption 6).

In Table 1, we summarize the sample complexity results for stochastic optimization under α-PŁ and BV assumptions. We also establish sharp convergence results for convergence in the iterates to the set X* of optimal points and provide a summary in Table 2 in the Appendix.
Table 1: Summary of sample complexity results for α-PŁ functions (Assumption 3) under average L-smoothness (Assumption 6) and bounded variance (Assumption 5). Quantities: α = PŁ power; µ = PŁ constant; κ = L/µ; σ² = variance. The entries of the table show the expected number of stochastic gradient calls to achieve E[f(x_k) − f*] ≤ ε_f.

Method | Finite-sum case                        | Online case
GD     | O( (1/ε_f)^{(2−α)/α} )                 | N/A
SGD    | O( (κσ²/µ) (1/ε_f)^{(4−α)/α} )         | O( (κσ²/µ) (1/ε_f)^{(4−α)/α} )
PAGER  | Õ( n + (1/ε_f)^{(2−α)/α} ) (new)       | O( (σ²/µ + κ²) (1/ε_f)^{2/α} ) (new)
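To make the gap in Table 1 concrete, the short script below (illustrative only) evaluates the 1/ε_f exponents of SGD, (4 − α)/α, and of PAGER in the online case, 2/α, over the range α ∈ [1, 2]:

```python
# Compare the 1/eps_f exponents from Table 1: SGD has (4 - alpha)/alpha,
# PAGER has 2/alpha; the gap is largest at alpha = 1 and closes as alpha -> 2.
for alpha in [1.0, 1.25, 1.5, 1.75, 2.0]:
    sgd_exp = (4 - alpha) / alpha
    pager_exp = 2 / alpha
    print(f"alpha={alpha:.2f}: SGD ~ eps_f^-{sgd_exp:.2f}  vs  PAGER ~ eps_f^-{pager_exp:.2f}")
```

At α = 1 this is ε_f^{-3} versus ε_f^{-2}, matching the claimed first optimal rate; at α = 2 both exponents equal 1.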
2 Assumptions and Discussion

In this section, we introduce the assumptions we make throughout the paper.

Assumption 1. The gradient of f(·) is Lipschitz continuous, that is, for all x and y,
||∇f(x) − ∇f(y)|| ≤ L ||x − y||, where L > 0 is referred to as the Lipschitz constant.

Furthermore, we assume that the objective function f is lower bounded, i.e., f* := inf_x f(x) > −∞, and it satisfies the following inequality.

Assumption 2 (global KŁ or global Kurdyka-Łojasiewicz). Let φ : R_+ → R_+ be a continuous function such that φ(0) = 0 and φ²(·) is convex. The function f(·) is said to satisfy the global Kurdyka-Łojasiewicz inequality if

    ||∇f(x)|| ≥ φ(f(x) − f*)   for all x ∈ R^d.    (3)

Assumption 3 (α-PŁ or Polyak-Łojasiewicz). There exist α ∈ [1, 2] and µ > 0 such that

    ||∇f(x)||^α ≥ (2µ)^{α/2} (f(x) − f*)   for all x ∈ R^d.    (4)

We refer to α as the PŁ power and µ as the PŁ constant.

It is straightforward to see that α-PŁ is a special case of KŁ with φ(t) = √(2µ) t^{1/α}.
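As a sanity check of inequality (4), the snippet below numerically verifies the 2-PŁ condition for the strongly convex quadratic f(x) = (µ/2)||x||², a hypothetical toy instance for which the inequality holds with equality:

```python
import numpy as np

rng = np.random.default_rng(1)
mu = 0.7  # hypothetical curvature parameter

def f(x):
    # f(x) = (mu/2) ||x||^2, minimized at x = 0 with f* = 0
    return 0.5 * mu * np.dot(x, x)

def grad(x):
    return mu * x

# Assumption 3 with alpha = 2: ||grad f(x)||^2 >= 2 mu (f(x) - f*).
# For this quadratic, ||grad f||^2 = mu^2 ||x||^2 = 2 mu f(x), so it is tight.
for _ in range(1000):
    x = rng.normal(size=3) * 10
    lhs = np.dot(grad(x), grad(x))
    rhs = 2 * mu * (f(x) - 0.0)
    assert lhs >= rhs - 1e-9
```

The same template can be used to probe candidate (α, µ) pairs for other objectives before attempting a proof.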
Connections with other assumptions. Another commonly adopted way to define the global KŁ property is to assume that ρ'(f(x) − f*) · ||∇f(x)|| ≥ 1 for all x ∈ R^d, where ρ(t) is called a desingularizing function and ρ'(·) denotes its derivative. Moreover, the desingularizing function satisfies the following conditions: it is continuous, concave, ρ(0) = 0, and ρ'(t) > 0 [35].

If the above assumption holds for ρ(t) := (1/θ) t^θ with θ > 0, then Assumption 3 is satisfied with PŁ power α = 1/(1 − θ). However, the α-PŁ condition is more general since it allows us to consider the case α = 1. For example, consider the function of one variable f(x) = (e^x + e^{−x})/2 − 1; then |f'(x)| ≥ f(x) for all x. Thus f(x) satisfies Assumption 3 with α = 1 and µ = 1/2. Moreover, this function is convex, but it does not satisfy the inequality ρ'(f(x) − f*) · ||∇f(x)|| ≥ 1 for any choice of ρ(t).
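The 1-PŁ claim for this example is easy to check numerically: with α = 1 and µ = 1/2, inequality (4) reads |f'(x)| ≥ f(x) − f*, i.e., |sinh(x)| ≥ cosh(x) − 1 for all x:

```python
import numpy as np

# f(x) = (e^x + e^-x)/2 - 1 = cosh(x) - 1, with f* = 0 at x = 0 and
# f'(x) = sinh(x). Check |sinh(x)| >= cosh(x) - 1 on a dense grid
# (it holds since cosh(x) - |sinh(x)| = e^{-|x|} <= 1).
xs = np.linspace(-20, 20, 10001)
lhs = np.abs(np.sinh(xs))
rhs = np.cosh(xs) - 1.0
assert np.all(lhs >= rhs - 1e-9)
```

A grid check like this is not a proof, but it is a cheap way to falsify a wrong (α, µ) guess.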
We also provide several non-convex problems for which α-PŁ holds with α ∈ [1, 2] in Appendix A. Other forms of φ(t) also appear in practice; e.g., the squared cross-entropy loss function satisfies the KŁ condition with φ(t) = min{t, √t} [51].
The intuition behind the special case α = 1 is that the function is allowed to be flat near the set of optimal points X* = arg min_x f(x).
Assumption 4 (k-ES, Expected Smoothness of order k). The stochastic gradient estimator g_k(x, ξ) is an unbiased estimate of the gradient ∇f(x) at any given point x, and its second moment satisfies

    E[||g_k(x, ξ)||²] ≤ 2A · h(f(x) − f*) + B · ||∇f(x)||² + C / b_k   for all x ∈ R^d,    (5)

where A, B, C are non-negative constants and h : R_+ → R_+ is a concave, continuously differentiable function with h'(t) ≥ 0 and h(0) = 0. The expectation is taken over the random vector ξ ∼ D. We call b_k the cost of such an estimator.
This assumption encompasses previous assumptions in the literature. For instance, it is straightforward to see that an estimator satisfies the standard bounded variance assumption [21] when h(t) = 0, B = 1, and b_k = 1 in (5). Gradient estimators with the relaxed growth assumption [6, 11] are also special cases of (5) for h(t) = 0 and b_k = 1. A closely related assumption to relaxed growth was introduced in [53], which holds when h(t) = 0 and C = 0 in (5). The expected smoothness assumption [30, 24, 56] is the closest assumption to k-ES, and it holds when h(t) = t and b_k = 1. Notably, the ES assumption is satisfied in practical scenarios such as mini-batching, importance sampling, and compressed communication [30]. More recently, it has been shown that the PG method with softmax policies and log-barrier regularization can be modeled using the ES assumption [56]. Note that due to the first term in (5), i.e., 2A · h(f(x) − f*), the second moment can be large when the objective gap at x is large. Such a property is not captured by standard bounded variance (Assumption 5). The flexibility and advantages of introducing such an additional term are elaborated in detail in the literature [24, 30, 25, 56].
We highlight two special cases for the sequence b_k: b_k = Θ(k^τ) with τ ≥ 0 and b_k = Θ(q^k) with q > 1. For example, Monte Carlo sampling and mini-batching allow us to design estimators with such {b_k}_{k≥0} sequences. When a gradient estimator satisfies Assumption 4, unless b_k is bounded for all k, it essentially means that we have a mechanism to reduce its variance. More precisely, the variance decreases according to the sequence 1/b_k. As we show in Section 4, if such an estimator exists, it results in a better convergence rate compared to vanilla SGD, and the improvement is captured by the sequence b_k. On the other hand, usually the access to such an estimator comes with a cost proportional to b_k, e.g., the mini-batch setting described in Section 4. Thus we refer to b_k as the cost of the estimator g_k(x, ξ).
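As an illustration of how mini-batching realizes such a sequence b_k (a sketch with a hypothetical noisy oracle, not the paper's construction): averaging b_k independent samples divides the variance term by b_k, at a per-query cost of b_k gradient evaluations.

```python
import numpy as np

rng = np.random.default_rng(2)

def noisy_grad(x, sigma=1.0):
    """Single-sample oracle: grad f(x) + Gaussian noise, here for f(x) = 0.5 x^2."""
    return x + sigma * rng.normal()

def batched_grad(x, b_k):
    """Estimator with cost b_k; averaging shrinks the variance to sigma^2 / b_k."""
    return np.mean([noisy_grad(x) for _ in range(b_k)])

# Empirical variance decreases roughly like 1/b_k along a b_k = Theta(q^k)
# schedule with q = 2:
x0 = 3.0
for k, b_k in enumerate([1, 2, 4, 8]):
    samples = [batched_grad(x0, b_k) for _ in range(4000)]
    print(f"b_{k} = {b_k}: empirical variance ~ {np.var(samples):.3f}")
```

This is exactly the trade-off described above: a smaller variance term C/b_k in (5), paid for by b_k oracle calls per query.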
3 Stochastic Gradient Method

Algorithm 1 summarizes the steps of a slightly modified SGD which we analyze in this work. We call this algorithm SGD with restarts.⁶ This algorithm updates the point x for T iterations within an inner loop. Note that, in the inner loop, the step-size remains unchanged and the iterates are updated via x_{t+1} = x_t − η g_k(x_t, ξ_t), where η is the step-size and {ξ_t}_{t≥0} are independent random vectors. The cost b_k of the gradient estimator g_k(x, ξ) remains the same within the inner loop of Algorithm 1.
Algorithm 1: SGD with restarts
1: Initialization: x, T, K, {η_k : k = 0, ..., K − 1}
2: for k = 0, ..., K − 1 do
3:     η ← η_k
4:     for t = 0, ..., T − 1 do
5:         x ← x − η g_k(x, ξ_t)
6: return x
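A minimal sketch of Algorithm 1, assuming a generic stochastic gradient oracle; the geometrically decaying stage step-sizes and the noisy quadratic below are hypothetical choices for illustration, not the schedule prescribed by Theorem 1:

```python
import numpy as np

rng = np.random.default_rng(3)

def sgd_with_restarts(grad_oracle, x0, T, K, etas):
    """Algorithm 1 sketch: K outer stages, each running T inner SGD steps
    with that stage's constant step-size eta_k."""
    x = np.asarray(x0, dtype=float)
    for k in range(K):
        eta = etas[k]
        for t in range(T):
            x = x - eta * grad_oracle(x)
    return x

# Toy run on f(x) = 0.5 ||x||^2 with additive Gaussian gradient noise.
def grad_oracle(x):
    return x + 0.1 * rng.normal(size=x.shape)

etas = [0.5 / 2 ** k for k in range(4)]  # hypothetical geometric decay per stage
x_final = sgd_with_restarts(grad_oracle, np.full(2, 5.0), T=200, K=4, etas=etas)
```

With K = 1 this reduces to plain constant-step SGD, as footnote 6 notes; shrinking the step-size between restarts is what lets the iterates settle below the noise floor of each stage.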
3.1 Dynamics of SGD

Let {x_t}_{t≥0} be the sequence of points generated by the inner loop of Algorithm 1, and suppose Assumptions 1, 2, and 4 are satisfied. Then the dynamics of SGD in the inner loop of Algorithm 1 are characterized by

⁶ Note that if we set K = 1, then Algorithm 1 reduces to SGD with constant step-size.