Sharp Analysis of Stochastic Optimization under
Global Kurdyka-Łojasiewicz Inequality
Ilyas Fatkhullin
ETH AI Center & ETH Zurich Jalal Etesami*
EPFL
Niao He
ETH Zurich Negar Kiyavash
EPFL
Abstract
We study the complexity of finding the global solution to stochastic nonconvex optimization when the objective function satisfies the global Kurdyka-Łojasiewicz (KŁ) inequality and the queries from stochastic gradient oracles satisfy a mild expected smoothness assumption. We first introduce a general framework to analyze Stochastic Gradient Descent (SGD) and its associated nonlinear dynamics under this setting. As a byproduct of our analysis, we obtain a sample complexity of O(ε_f^{-(4-α)/α}) for SGD when the objective satisfies the so-called α-PŁ condition, where α is the degree of gradient domination. Furthermore, we show that a modified SGD with variance reduction and restarting (PAGER) achieves an improved sample complexity of O(ε_f^{-2/α}) when the objective satisfies the average smoothness assumption. This leads to the first optimal algorithm for the important case of α = 1, which appears in applications such as policy optimization in reinforcement learning.
1 Introduction
Nonconvex optimization problems are ubiquitous in machine learning domains such as training deep
neural networks [
22
] or policy optimization in reinforcement learning [
52
]. Stochastic Gradient
Descent (SGD) and its variants are driving the practical success of machine learning approaches.
Naturally, understanding the limits of performance of SGD in the nonconvex setting has become an
important avenue of research in recent years [21, 4, 30, 44, 23, 15, 59].
We are interested in solving the unconstrained stochastic, nonconvex optimization problem of the form

    min_{x ∈ R^d} f(x) := E_{ξ∼D}[f_ξ(x)],    (1)

where f(·) is smooth and possibly nonconvex, and ξ is a random vector drawn from a distribution D. Moreover, we are interested in an important special case of (1), when the expectation can be written as the average of n smooth functions:

    min_{x ∈ R^d} [ f(x) := (1/n) ∑_{i=1}^n f_i(x) ].    (2)
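For concreteness, a minimal numerical instance of the finite-sum problem (2) can be sketched as follows. The quadratic components f_i and the mini-batch oracle here are hypothetical toy choices, not tied to any particular application in the paper:

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy finite-sum instance of problem (2): f(x) = (1/n) * sum_i f_i(x),
# with hypothetical quadratic components f_i(x) = 0.5 * ||x - a_i||^2.
n, d = 100, 5
A = rng.normal(size=(n, d))  # the points a_i are the rows of A

def f(x):
    return 0.5 * np.mean(np.sum((x - A) ** 2, axis=1))

def stochastic_grad(x, batch_size=1):
    """Unbiased estimate of grad f(x): average of grad f_i over a random batch."""
    idx = rng.integers(0, n, size=batch_size)
    return np.mean(x - A[idx], axis=0)

x = np.zeros(d)
g = stochastic_grad(x, batch_size=10)   # one stochastic oracle query
full_grad = x - A.mean(axis=0)          # exact gradient, for comparison
```

Averaging many single-sample queries recovers the full gradient, which is the unbiasedness property the oracle assumptions below build on.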
For a general nonconvex differentiable objective f : R^d → R, finding a global minimum of f is in general intractable [42, 54]. There are two common strategies to analyze optimization methods for nonconvex functions. The first one is to scale down the requirements on the solution of interest from global optimality to some relaxed version, e.g., a first-order stationary point. However, such solutions do not exclude the possibility of approaching a suboptimal local minimum or a saddle point. Another approach is to study nonconvex problems with additional structural assumptions in the hope of
First two authors have equal contribution.
École polytechnique fédérale de Lausanne
Submitted for review in May, 2022. Accepted to NeurIPS 2022 in Sep, 2022.
arXiv:2210.01748v1 [math.OC] 4 Oct 2022
convergence to global solutions. In this direction, several relaxations of convexity have been proposed and analyzed, for instance, star convexity, quasar-convexity, the error bounds condition, the restricted secant inequality, and quadratic growth [26, 24, 23]. Many of the aforementioned relaxations have limited application in real-world problems.
Recently, there has been a surge of interest in the analysis of functions satisfying the so-called Kurdyka-Łojasiewicz (KŁ) inequality [9, 10]. Of particular interest is the family of functions that satisfy the global KŁ inequality. Specifically, we say that f(·) satisfies the (global) KŁ inequality if there exists some continuous function φ(·) such that ||∇f(x)|| ≥ φ(f(x) − inf_x f(x)) for all x ∈ R^d. If this inequality is satisfied for φ(t) = √(2µ) t^{1/α}, then we say that the (global) α-PŁ condition is satisfied for f(·). The special case of the KŁ condition, 2-PŁ, often referred to as the Polyak-Łojasiewicz or PŁ condition, was originally discovered independently in the seminal works of B. Polyak, T. Ležanski and S. Łojasiewicz [48, 33, 38, 39]. Notably, this class of problems has found many interesting emerging applications in machine learning, for instance, policy gradient (PG) methods in reinforcement learning [41, 1, 56], generalized linear models [40], over-parameterized neural networks [2, 57], the linear quadratic regulator in optimal control [13, 19], and low-rank matrix recovery [7].
Despite the increased popularity of the KŁ and α-PŁ assumptions, the analysis of stochastic optimization under them remains limited, and the majority of works focus on deterministic gradient methods. Indeed, until recently, only the special case of α-PŁ with α = 2 was mainly addressed in the literature [26, 23, 30, 53, 49]. In this paper, we study the sample complexities of stochastic optimization for the broader class of nonconvex functions with the global KŁ property.
1.1 Related Works and Open Questions
Stochastic gradient descent. A plethora of existing works has studied the sample complexity of SGD and its variants for finding an ε-stationary point of a general nonconvex function f, that is, a point x̂ ∈ R^d for which E[||∇f(x̂)||] ≤ ε. For instance, [21] showed that for a smooth objective (one with Lipschitz gradients) under the bounded variance (BV) assumption, SGD with properly chosen stepsizes reaches an ε-stationary point with a sample complexity of O(ε^{−4}). Recently, [30, 56] further extended this result under a much milder expected smoothness (ES) assumption on the stochastic gradient. While this sample complexity is known to be optimal for general nonconvex functions, a naive application of this result to the function value using the α-PŁ condition would lead to a suboptimal O(ε_f^{−4/α}) sample complexity for finding an ε_f-optimal solution, i.e., E[f(x) − f*] ≤ ε_f. Recently, [20] studied SGD and established convergence rates for α-PŁ functions under the BV assumption. Their sample complexity result is O(ε_f^{−(4−α)/α}) in our notation. Later, [35] considered an SGD scheme with random reshuffling under local and global KŁ conditions and provided convergence in the iterates for α ∈ (1, 2]. We note that our proof techniques are different from [20] and [35] and are not limited merely to the case of the BV assumption. In this work, we will answer the following open question:

    What is the exact performance limit of SGD under the global KŁ condition and a more practical model of the stochastic oracle?
Variance reduction. There has been extensive research on the development of algorithms which improve the dependence on n and/or ε for both problems (1) and (2) (over simple methods such as SGD and Gradient Descent (GD)). One important family of techniques³ is variance reduction, which emerged from the seminal work of Blatt et al. [8]. The main idea of variance reduction is to make use of the stochastic gradients computed at previous iterations to construct a better gradient estimator at a relatively small computational cost. Various modifications, generalizations, and improvements of the variance reduction technique appeared in subsequent work, for instance, [50, 28, 16] to name a few.
Finite-sum case. A number of recently proposed algorithms such as SNVRG [58], SARAH [45], STORM [15], SPIDER [17], and PAGE [36] achieve the sample complexity O(n + √n ε^{−2}) when minimizing a general nonconvex function with the finite-sum structure (2). This result is also known to be optimal in this setting [36]. [27] studies SARAH in the finite-sum case under a local KŁ assumption and proves convergence in the iterates. The study in [27] is only an asymptotic analysis, and the dependence
3Another independent direction is to make use of higher order information [43, 18, 3].
on the parameters κ and n, which are important in practice for quantifying the improvement over GD and SGD, is ignored. [34] proposes an SVRG-based algorithm for KŁ functions and [55, 45, 46] study other variance reduction techniques, but they only analyze the special case α = 2. Under the 2-PŁ condition, these methods further improve to an O((n + κ√n) log(1/ε_f)) sample complexity⁴ for finding an ε_f-optimal solution. However, it is not clear if it is possible to provide any non-asymptotic guarantees for variance-reduced methods under the α-PŁ condition for any α ∈ [1, 2). In our work, we will answer the following open question:

    What is the extent of improvement any variance reduction scheme can provide under the global α-PŁ condition for finite-sum objectives of the form (2)?
Online/streaming case. While variance reduction methods were initially designed for problems of the form (2), it was later discovered that they also improve over SGD when solving (1) [32, 37]⁵. The analysis of these methods was obtained for general nonconvex functions (for minimizing the norm of the gradient, E[||∇f(x̂)||]) and later extended to 2-PŁ objectives for minimizing the function value, E[f(x) − f*] ≤ ε_f. For example, the methods in [58, 45, 17, 36] achieve O(ε^{−3}) complexity, improving over the O(ε^{−4}) complexity of SGD for finding an ε-stationary point. Under the 2-PŁ condition, these results can be extended to global convergence with O(ε_f^{−1}) sample complexity [36]. However, in contrast to the general nonconvex case, variance reduction under the 2-PŁ assumption does not show any improvement over SGD in terms of ε_f. We highlight that all existing analyses of variance reduction under the α-PŁ condition are restricted to the special case α = 2. We refer the reader to Appendix C, where we elaborate on the key difficulties in the analysis for the cases α ∈ [1, 2). Since the direct analysis for α ∈ [1, 2) is challenging, in order to obtain global convergence in this setting, one could naively translate the complexity for finding a stationary point of a general nonconvex function (which is O(ε^{−3})) to convergence in function value by using the α-PŁ condition: √(2µ) (f(x̂) − f*)^{1/α} ≤ ||∇f(x̂)||. This would result in an O(ε_f^{−3/α}) sample complexity. However, there are two serious issues with this approach. First, this complexity does not provide any improvement over SGD in the most interesting practical case α = 1 and gives a strictly worse result for all α > 1. Second, the guarantees for general nonconvex optimization hold on average, in the sense that the point x̂ is sampled uniformly from all the iterates of the algorithm. It would be more desirable to instead derive last-iterate convergence guarantees under the KŁ (α-PŁ) condition. In this work, we will address the following open question:

    Is it possible to accelerate the O(ε_f^{−(4−α)/α}) sample complexity of SGD under the global α-PŁ condition for stochastic objectives of the form (1)?
1.2 Contributions
In this work, we provide an extensive analysis of stochastic optimization under the global KŁ condition and answer all the above questions. More precisely, our contributions are as follows.

• We provide a new framework for the analysis of the dynamics of SGD under the global KŁ condition (see Section 3). It is based on the analysis of the SGD dynamics, which are governed by a recursive inequality (see Equation (6)). As a result of this analysis, we introduce a set of conditions (see Theorem 1) for designing proper stepsizes to guarantee convergence.

• Using this framework, we provide a sharp analysis of SGD under a general ES assumption (Assumption 4) and demonstrate that the sample complexity O(ε_f^{−(4−α)/α}) is tight for the dynamical system describing SGD.

• Next, we propose PAGER, a new variance reduction scheme with parameter restart. A carefully chosen sequence of parameters allows PAGER to adapt to the nonconvex geometry of the problem and establish state-of-the-art convergence guarantees for minimizing α-PŁ functions. In the online setting (1), we obtain an O(ε_f^{−2/α}) sample complexity for PAGER, which beats the O(ε_f^{−(4−α)/α}) complexity of SGD for the whole spectrum of parameters α ∈ [1, 2). In particular, for the important special case of 1-PŁ, this leads to the first optimal algorithm with O(ε_f^{−2}) sample complexity, which already matches the lower bound known for stochastic convex optimization [42].

• Furthermore, we obtain faster rates with PAGER in the finite-sum case (2), providing the first acceleration over GD and SGD under the α-PŁ condition.

⁴ κ = L/µ is the analogue of the condition number; L is defined in Assumption 6.
⁵ Under additional assumptions such as smoothness of individual functions f_ξ(·) or an even milder condition such as average L-smoothness (Assumption 6).

In Table 1, we summarize the sample complexity results for stochastic optimization under α-PŁ and BV assumptions. We also establish sharp convergence results for convergence in the iterates to the set X* of optimal points and provide a summary in Table 2 in the Appendix.
Table 1: Summary of sample complexity results for α-PŁ functions (Assumption 3) under average L-smoothness (Assumption 6) and bounded variance (Assumption 5). Quantities: α = PŁ power; µ = PŁ constant; κ = L/µ; σ² = variance. The entries of the table show the expected number of stochastic gradient calls to achieve E[f(x_k) − f*] ≤ ε_f.

Method | Finite-sum case                        | Online case
GD     | O( (1/ε_f)^{(2−α)/α} )                 | N/A
SGD    | O( (κσ²/µ) (1/ε_f)^{(4−α)/α} )         | O( (κσ²/µ) (1/ε_f)^{(4−α)/α} )
PAGER  | Õ( n + (1/ε_f)^{(2−α)/α} ) (new)       | O( (σ²/µ + κ²) (1/ε_f)^{2/α} ) (new)
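To make the gap in Table 1 concrete, the short script below (illustrative only) evaluates the 1/ε_f exponents of SGD, (4 − α)/α, and of PAGER in the online case, 2/α, over the range α ∈ [1, 2]:

```python
# Compare the 1/eps_f exponents from Table 1: SGD has (4 - alpha)/alpha,
# PAGER has 2/alpha; the gap is largest at alpha = 1 and closes as alpha -> 2.
for alpha in [1.0, 1.25, 1.5, 1.75, 2.0]:
    sgd_exp = (4 - alpha) / alpha
    pager_exp = 2 / alpha
    print(f"alpha={alpha:.2f}: SGD ~ eps_f^-{sgd_exp:.2f}  vs  PAGER ~ eps_f^-{pager_exp:.2f}")
```

At α = 1 this is ε_f^{-3} versus ε_f^{-2}, matching the claimed first optimal rate; at α = 2 both exponents equal 1.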
2 Assumptions and Discussion

In this section, we introduce the assumptions we make throughout the paper.

Assumption 1. The gradient of f(·) is Lipschitz continuous, that is, for all x and y,
||∇f(x) − ∇f(y)|| ≤ L ||x − y||, where L > 0 is referred to as the Lipschitz constant.

Furthermore, we assume that the objective function f is lower bounded, i.e., f* := inf_x f(x) > −∞, and it satisfies the following inequality.

Assumption 2 (global KŁ or global Kurdyka-Łojasiewicz). Let φ : R_+ → R_+ be a continuous function such that φ(0) = 0 and φ²(·) is convex. The function f(·) is said to satisfy the global Kurdyka-Łojasiewicz inequality if

    ||∇f(x)|| ≥ φ(f(x) − f*)   for all x ∈ R^d.    (3)

Assumption 3 (α-PŁ or Polyak-Łojasiewicz). There exist α ∈ [1, 2] and µ > 0 such that

    ||∇f(x)||^α ≥ (2µ)^{α/2} (f(x) − f*)   for all x ∈ R^d.    (4)

We refer to α as the PŁ power and µ as the PŁ constant.

It is straightforward to see that α-PŁ is a special case of KŁ with φ(t) = √(2µ) t^{1/α}.
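As a sanity check of inequality (4), the snippet below numerically verifies the 2-PŁ condition for the strongly convex quadratic f(x) = (µ/2)||x||², a hypothetical toy instance for which the inequality holds with equality:

```python
import numpy as np

rng = np.random.default_rng(1)
mu = 0.7  # hypothetical curvature parameter

def f(x):
    # f(x) = (mu/2) ||x||^2, minimized at x = 0 with f* = 0
    return 0.5 * mu * np.dot(x, x)

def grad(x):
    return mu * x

# Assumption 3 with alpha = 2: ||grad f(x)||^2 >= 2 mu (f(x) - f*).
# For this quadratic, ||grad f||^2 = mu^2 ||x||^2 = 2 mu f(x), so it is tight.
for _ in range(1000):
    x = rng.normal(size=3) * 10
    lhs = np.dot(grad(x), grad(x))
    rhs = 2 * mu * (f(x) - 0.0)
    assert lhs >= rhs - 1e-9
```

The same template can be used to probe candidate (α, µ) pairs for other objectives before attempting a proof.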
Connections with other assumptions. Another commonly adopted way to define the global KŁ property is to assume that ρ'(f(x) − f*) · ||∇f(x)|| ≥ 1 for all x ∈ R^d, where ρ(t) is called a desingularizing function and ρ'(·) denotes its derivative. Moreover, the desingularizing function satisfies the following conditions: it is continuous, concave, ρ(0) = 0, and ρ'(t) > 0 [35].

If the above assumption holds for ρ(t) := (1/θ) t^θ with θ > 0, then Assumption 3 is satisfied with PŁ power α = 1/(1 − θ). However, the α-PŁ condition is more general since it allows us to consider the case α = 1. For example, consider the function of one variable f(x) = (e^x + e^{−x})/2 − 1; then |f'(x)| ≥ f(x) for all x. Thus f(x) satisfies Assumption 3 with α = 1 and µ = 1/2. Moreover, this function is convex, but it does not satisfy the inequality ρ'(f(x) − f*) · ||∇f(x)|| ≥ 1 for any choice of ρ(t).
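The 1-PŁ claim for this example is easy to check numerically: with α = 1 and µ = 1/2, inequality (4) reads |f'(x)| ≥ f(x) − f*, i.e., |sinh(x)| ≥ cosh(x) − 1 for all x:

```python
import numpy as np

# f(x) = (e^x + e^-x)/2 - 1 = cosh(x) - 1, with f* = 0 at x = 0 and
# f'(x) = sinh(x). Check |sinh(x)| >= cosh(x) - 1 on a dense grid
# (it holds since cosh(x) - |sinh(x)| = e^{-|x|} <= 1).
xs = np.linspace(-20, 20, 10001)
lhs = np.abs(np.sinh(xs))
rhs = np.cosh(xs) - 1.0
assert np.all(lhs >= rhs - 1e-9)
```

A grid check like this is not a proof, but it is a cheap way to falsify a wrong (α, µ) guess.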
We also provide several non-convex problems for which α-PŁ holds with α ∈ [1, 2] in Appendix A. Other forms of φ(t) also appear in practice; e.g., the squared cross-entropy loss function satisfies the KŁ condition with φ(t) = min{t, √t} [51].
The intuition behind the special case α = 1 is that the function is allowed to be flat near the set of optimal points X* = arg min_x f(x).
Assumption 4 (k-ES, Expected Smoothness of order k). The stochastic gradient estimator g_k(x, ξ) is an unbiased estimate of the gradient ∇f(x) at any given point x, and its second moment satisfies

    E[||g_k(x, ξ)||²] ≤ 2A · h(f(x) − f*) + B · ||∇f(x)||² + C / b_k   for all x ∈ R^d,    (5)

where A, B, C are non-negative constants and h : R_+ → R_+ is a concave, continuously differentiable function with h'(t) ≥ 0 and h(0) = 0. The expectation is taken over the random vector ξ ∼ D. We call b_k the cost of such an estimator.
This assumption encompasses previous assumptions in the literature. For instance, it is straightforward to see that an estimator satisfies the standard bounded variance assumption [21] when h(t) = 0, B = 1, and b_k = 1 in (5). Gradient estimators with the relaxed growth assumption [6, 11] are also special cases of (5) for h(t) = 0 and b_k = 1. A closely related assumption to relaxed growth was introduced in [53], which holds when h(t) = 0 and C = 0 in (5). The expected smoothness assumption [30, 24, 56] is the closest assumption to k-ES, and it holds when h(t) = t and b_k = 1. Notably, the ES assumption is satisfied in practical scenarios such as mini-batching, importance sampling, and compressed communication [30]. More recently, it has been shown that the PG method with softmax policies and log-barrier regularization can be modeled using the ES assumption [56]. Note that due to the first term in (5), i.e., 2A · h(f(x) − f*), the second moment can be large when the objective gap at x is large. Such a property is not captured by standard bounded variance (Assumption 5). The flexibility and advantages of introducing such an additional term are elaborated in detail in the literature [24, 30, 25, 56].
We highlight two special cases for the sequence b_k: b_k = Θ(k^τ) with τ ≥ 0 and b_k = Θ(q^k) with q > 1. For example, Monte Carlo sampling and mini-batching allow us to design estimators with such {b_k}_{k≥0} sequences. When a gradient estimator satisfies Assumption 4, unless b_k is bounded for all k, it essentially means that we have a mechanism to reduce its variance. More precisely, the variance decreases according to the sequence 1/b_k. As we show in Section 4, if such an estimator exists, it results in a better convergence rate compared to vanilla SGD, and the improvement is captured by the sequence b_k. On the other hand, usually the access to such an estimator comes with a cost proportional to b_k, e.g., the mini-batch setting described in Section 4. Thus we refer to b_k as the cost of the estimator g_k(x, ξ).
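As an illustration of how mini-batching realizes such a sequence b_k (a sketch with a hypothetical noisy oracle, not the paper's construction): averaging b_k independent samples divides the variance term by b_k, at a per-query cost of b_k gradient evaluations.

```python
import numpy as np

rng = np.random.default_rng(2)

def noisy_grad(x, sigma=1.0):
    """Single-sample oracle: grad f(x) + Gaussian noise, here for f(x) = 0.5 x^2."""
    return x + sigma * rng.normal()

def batched_grad(x, b_k):
    """Estimator with cost b_k; averaging shrinks the variance to sigma^2 / b_k."""
    return np.mean([noisy_grad(x) for _ in range(b_k)])

# Empirical variance decreases roughly like 1/b_k along a b_k = Theta(q^k)
# schedule with q = 2:
x0 = 3.0
for k, b_k in enumerate([1, 2, 4, 8]):
    samples = [batched_grad(x0, b_k) for _ in range(4000)]
    print(f"b_{k} = {b_k}: empirical variance ~ {np.var(samples):.3f}")
```

This is exactly the trade-off described above: a smaller variance term C/b_k in (5), paid for by b_k oracle calls per query.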
3 Stochastic Gradient Method

Algorithm 1 summarizes the steps of a slightly modified SGD which we analyze in this work. We call this algorithm SGD with restarts.⁶ This algorithm updates the point x for T iterations within an inner loop. Note that, in the inner loop, the step-size remains unchanged and the iterates are updated via x_{t+1} = x_t − η g_k(x_t, ξ_t), where η is the step-size and {ξ_t}_{t≥0} are independent random vectors. The cost b_k of the gradient estimator g_k(x, ξ) remains the same within the inner loop of Algorithm 1.
Algorithm 1: SGD with restarts
1: Initialization: x, T, K, {η_k : k = 0, ..., K − 1}
2: for k = 0, ..., K − 1 do
3:     η ← η_k
4:     for t = 0, ..., T − 1 do
5:         x ← x − η g_k(x, ξ_t)
6: return x
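A minimal sketch of Algorithm 1, assuming a generic stochastic gradient oracle; the geometrically decaying stage step-sizes and the noisy quadratic below are hypothetical choices for illustration, not the schedule prescribed by Theorem 1:

```python
import numpy as np

rng = np.random.default_rng(3)

def sgd_with_restarts(grad_oracle, x0, T, K, etas):
    """Algorithm 1 sketch: K outer stages, each running T inner SGD steps
    with that stage's constant step-size eta_k."""
    x = np.asarray(x0, dtype=float)
    for k in range(K):
        eta = etas[k]
        for t in range(T):
            x = x - eta * grad_oracle(x)
    return x

# Toy run on f(x) = 0.5 ||x||^2 with additive Gaussian gradient noise.
def grad_oracle(x):
    return x + 0.1 * rng.normal(size=x.shape)

etas = [0.5 / 2 ** k for k in range(4)]  # hypothetical geometric decay per stage
x_final = sgd_with_restarts(grad_oracle, np.full(2, 5.0), T=200, K=4, etas=etas)
```

With K = 1 this reduces to plain constant-step SGD, as footnote 6 notes; shrinking the step-size between restarts is what lets the iterates settle below the noise floor of each stage.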
3.1 Dynamics of SGD

Let {x_t}_{t≥0} be the sequence of points generated by the inner loop of Algorithm 1, and suppose Assumptions 1, 2, and 4 are satisfied. Then the dynamics of SGD in the inner loop of Algorithm 1 are characterized by

⁶ Note that if we set K = 1, then Algorithm 1 reduces to SGD with constant step-size.