
convergence to global solutions. In this direction, several relaxations of convexity have been proposed and analyzed, for instance, star convexity, quasar-convexity, the error bounds condition, the restricted secant inequality, and quadratic growth [26, 24, 23]. Many of the aforementioned relaxations have limited application in real-world problems.
Recently, there has been a surge of interest in the analysis of functions satisfying the so-called Kurdyka-Łojasiewicz (KŁ) inequality [9, 10]. Of particular interest is the family of functions that satisfy the global KŁ inequality. Specifically, we say that $f(\cdot)$ satisfies the (global) KŁ inequality if there exists some continuous function $\varphi(\cdot)$ such that $\|\nabla f(x)\| \geq \varphi\big(f(x) - \inf_x f(x)\big)$ for all $x \in \mathbb{R}^d$.
If this inequality is satisfied for $\varphi(t) = \sqrt{2\mu}\, t^{1/\alpha}$, then we say that the (global) $\alpha$-PŁ condition is satisfied for $f(\cdot)$. The special case of the KŁ condition, $2$-PŁ, often referred to as the Polyak-Łojasiewicz or PŁ condition, was originally discovered independently in the seminal works of B. Polyak, T. Ležanski, and S. Łojasiewicz [48, 33, 38, 39]. Notably, this class of problems has found many
interesting emerging applications in machine learning, for instance, policy gradient (PG) methods in reinforcement learning [41, 1, 56], generalized linear models [40], over-parameterized neural networks [2, 57], the linear quadratic regulator in optimal control [13, 19], and low-rank matrix recovery [7].
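For intuition, a minimal worked example (our own illustration, consistent with the definition above): the strongly convex quadratic $f(x) = \frac{\mu}{2}\|x\|^2$ satisfies the $2$-PŁ condition, since
\[
\|\nabla f(x)\| = \mu\|x\| = \sqrt{2\mu}\,\Big(\tfrac{\mu}{2}\|x\|^2\Big)^{1/2} = \sqrt{2\mu}\,\big(f(x) - \inf_x f(x)\big)^{1/2},
\]
which matches $\varphi(t) = \sqrt{2\mu}\, t^{1/\alpha}$ with $\alpha = 2$.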
Despite the increased popularity of the KŁ and $\alpha$-PŁ assumptions, the analysis of stochastic optimization under them remains limited, and the majority of works focus on deterministic gradient methods. Indeed, until recently, mainly the special case of $\alpha$-PŁ with $\alpha = 2$ was addressed in the literature [26, 23, 30, 53, 49]. In this paper, we study the sample complexities of stochastic optimization for the broader class of nonconvex functions with the global KŁ property.
1.1 Related Works and Open Questions
Stochastic gradient descent.
A plethora of existing works has studied the sample complexity of SGD and its variants for finding an $\epsilon$-stationary point of a general nonconvex function $f$, that is, a point $\hat{x} \in \mathbb{R}^d$ for which $\mathbb{E}[\|\nabla f(\hat{x})\|] \leq \epsilon$. For instance, [21] showed that for a smooth objective (one with Lipschitz gradients) under the bounded variance (BV) assumption, SGD with properly chosen stepsizes reaches an $\epsilon$-stationary point with a sample complexity of $\mathcal{O}(\epsilon^{-4})$. Recently, [30, 56]
further extended this result under the much milder expected smoothness (ES) assumption on the stochastic gradient. While this sample complexity is known to be optimal for general nonconvex functions, a naive application of this result to the function value using the $\alpha$-PŁ condition would lead to a suboptimal $\mathcal{O}(\epsilon_f^{-4/\alpha})$ sample complexity for finding an $\epsilon_f$-optimal solution, i.e., $\mathbb{E}[f(x) - f^\star] \leq \epsilon_f$.
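To see where this exponent comes from (a heuristic sketch, ignoring the exchange of expectation and power): rearranging the $\alpha$-PŁ inequality $\|\nabla f(x)\| \geq \sqrt{2\mu}\,(f(x) - f^\star)^{1/\alpha}$ gives
\[
f(x) - f^\star \leq \Big(\frac{\|\nabla f(x)\|}{\sqrt{2\mu}}\Big)^{\alpha},
\]
so an $\epsilon$-stationary point with $\epsilon = \sqrt{2\mu}\,\epsilon_f^{1/\alpha}$ is $\epsilon_f$-optimal, and substituting this $\epsilon$ into the $\mathcal{O}(\epsilon^{-4})$ stationarity bound yields $\mathcal{O}(\epsilon_f^{-4/\alpha})$ samples.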
Recently, [20] studied SGD and established convergence rates for $\alpha$-PŁ functions under the BV assumption. Their sample complexity result is $\mathcal{O}(\epsilon_f^{-(4-\alpha)/\alpha})$ in our notation. Later, [35] considered an SGD scheme with random reshuffling under local and global KŁ conditions and provided convergence in the iterates for $\alpha \in (1, 2]$. We note that our proof techniques are different from those of [20] and [35] and are not limited merely to the case of the BV assumption. In this work, we will answer the following open question:
What is the exact performance limit of SGD under the global KŁ condition and a more practical model of the stochastic oracle?
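As a purely illustrative companion to the discussion above, the following minimal Python sketch runs SGD with $1/\sqrt{k}$ stepsizes under a bounded-variance oracle; the toy quadratic objective, noise level, and stepsize schedule are our own assumptions, not the paper's setting.

```python
# Minimal illustrative SGD run under a bounded-variance (BV) oracle;
# the quadratic objective and 1/sqrt(k) schedule are illustrative assumptions.
import numpy as np

rng = np.random.default_rng(0)
d = 10
x_star = rng.standard_normal(d)          # minimizer of f(x) = 0.5 * ||x - x_star||^2

def stochastic_grad(x, sigma=0.5):
    # unbiased stochastic gradient with bounded variance: E[g] = x - x_star
    return (x - x_star) + sigma * rng.standard_normal(d)

x = np.zeros(d)
for k in range(1, 10_001):
    x -= (0.5 / np.sqrt(k)) * stochastic_grad(x)

# the quantity bounded by the epsilon-stationarity target E[||grad f(x)||] <= eps
print("||grad f(x)|| =", np.linalg.norm(x - x_star))
```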
Variance reduction.
There has been extensive research on the development of algorithms which improve the dependence on $n$ and/or $\epsilon$ for both problems (1) and (2) (over simple methods such as SGD and Gradient Descent (GD)). One important family of techniques³ is variance reduction, which emerged from the seminal work of Blatt et al. [8]. The main idea of variance reduction is to make use of the stochastic gradients computed at previous iterations to construct a better gradient estimator at a relatively small computational cost. Various modifications, generalizations, and improvements of the variance reduction technique appeared in subsequent work, for instance, [50, 28, 16], to name a few.
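To make the idea concrete, here is a minimal sketch of an SVRG-style control-variate estimator, one representative variance reduction scheme; the toy least-squares objective, stepsize, and epoch length are illustrative assumptions rather than a specific method from the paper.

```python
# Minimal sketch of an SVRG-style variance-reduced gradient estimator on a
# toy finite-sum least-squares problem (illustrative assumptions throughout).
import numpy as np

rng = np.random.default_rng(0)
n, d = 200, 5
A = rng.standard_normal((n, d))
b = rng.standard_normal(n)

def grad_i(x, i):
    # gradient of the i-th component f_i(x) = 0.5 * (a_i^T x - b_i)^2
    return A[i] * (A[i] @ x - b[i])

def full_grad(x):
    # gradient of f(x) = (1 / 2n) * ||A x - b||^2
    return A.T @ (A @ x - b) / n

x, step = np.zeros(d), 0.02
for epoch in range(30):
    snapshot = x.copy()
    mu = full_grad(snapshot)             # one full gradient pass per epoch
    for _ in range(n):
        i = rng.integers(n)
        # control-variate estimator: unbiased, and its variance shrinks
        # as the iterate x approaches the snapshot point
        g = grad_i(x, i) - grad_i(snapshot, i) + mu
        x -= step * g

print("||grad f(x)|| =", np.linalg.norm(full_grad(x)))
```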
Finite-sum case.
A number of recently proposed algorithms, such as SNVRG [58], SARAH [45], STORM [15], SPIDER [17], and PAGE [36], achieve the sample complexity $\mathcal{O}(n + \sqrt{n}\,\epsilon^{-2})$ when minimizing a general nonconvex function with the finite-sum structure (2). This result is also known to be optimal in this setting [36]. [27] studies SARAH in the finite-sum case under a local KŁ assumption and proves convergence in the iterates. The study in [27] is only an asymptotic analysis and the dependence
³Another independent direction is to make use of higher order information [43, 18, 3].