
estimation, vector quantization, CVaR minimization, and regularized optimization problems involving ReLU neural networks; see, e.g., [9, 25, 47, 60].
We consider the following optimization problem:
\begin{equation}\label{eq:1}
\text{minimize} \quad \mathbb{R}^d \ni \theta \mapsto u(\theta) := \mathbb{E}[U(\theta, X)],
\end{equation}
where $U\colon \mathbb{R}^d \times \mathbb{R}^m \to \mathbb{R}$ is a measurable function, and $X$ is a given $\mathbb{R}^m$-valued random variable with probability law $\mathcal{L}(X)$. To obtain approximate minimizers of \eqref{eq:1}, one approach is to apply the stochastic gradient Langevin dynamics (SGLD) algorithm introduced in [70], which can be viewed as a variant of the Euler discretization of the Langevin SDE defined on $t \in [0,\infty)$ given by
\begin{equation}\label{eq:2}
\mathrm{d} Z_t = -h(Z_t)\,\mathrm{d} t + \sqrt{2\beta^{-1}}\,\mathrm{d} B_t, \qquad Z_0 = \theta_0,
\end{equation}
where $\theta_0$ is an $\mathbb{R}^d$-valued random variable, $h := \nabla u$, $\beta > 0$ is the inverse temperature parameter, and $(B_t)_{t \ge 0}$ is a $d$-dimensional Brownian motion. The associated stochastic gradient of the SGLD algorithm is defined as a measurable function $H\colon \mathbb{R}^d \times \mathbb{R}^m \to \mathbb{R}^d$ which satisfies $h(\theta) = \mathbb{E}[H(\theta, X)]$ for all $\theta \in \mathbb{R}^d$.
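For concreteness, a standard form of the SGLD recursion reads, with step size $\lambda > 0$, i.i.d. copies $(X_n)_{n \ge 1}$ of $X$, and i.i.d. $d$-dimensional standard Gaussian random variables $(\xi_n)_{n \ge 1}$ (this notation is introduced here for illustration):
\[
\theta_{n+1} = \theta_n - \lambda H(\theta_n, X_{n+1}) + \sqrt{2\lambda\beta^{-1}}\,\xi_{n+1}, \qquad n \in \mathbb{N}_0.
\]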
One notes that, under mild conditions, the Langevin SDE \eqref{eq:2} admits a unique invariant measure $\pi_\beta(\mathrm{d}\theta) \propto \exp(-\beta u(\theta))\,\mathrm{d}\theta$ with $\beta > 0$.
It has been shown in [36] that $\pi_\beta$ concentrates around the minimizers of $u$ when $\beta$ takes sufficiently large values. Therefore, minimizing \eqref{eq:1} is equivalent to sampling from $\pi_\beta$ with large $\beta$.
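For instance, if $u(\theta) = |\theta|^2/2$, then the invariant measure is the Gaussian
\[
\pi_\beta(\mathrm{d}\theta) \propto \exp(-\beta |\theta|^2/2)\,\mathrm{d}\theta, \quad \text{i.e.,} \quad \pi_\beta = \mathcal{N}(0, \beta^{-1} I_d),
\]
whose mass concentrates around the unique minimizer $\theta^\ast = 0$ as $\beta \to \infty$.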
The convergence properties of the SGLD algorithm to $\pi_\beta$ in suitable distances have been well studied in the literature, under the conditions that the (stochastic) gradient of $u$ is globally Lipschitz continuous and satisfies a (local) dissipativity or convexity at infinity condition; see, e.g., [10, 13, 57, 72, 75] and references therein. Recent research focuses on relaxing the global Lipschitz condition imposed on the (stochastic) gradient of $u$ so as to accommodate optimization problems involving neural networks. However, the SGLD algorithm is unstable when applied to objective functions with highly non-linear (stochastic) gradients, and the absolute moments of the approximations generated by the SGLD algorithm could diverge to infinity at a finite time point; see [34]. To address this issue, [49] proposed the tamed unadjusted stochastic Langevin algorithm (TUSLA), which is obtained by applying the taming technique, developed in, e.g., [7, 35, 62, 63], to the SGLD algorithm.
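To sketch the taming idea, a representative tamed recursion (written in our notation; the exact normalizer used in [49] may differ) replaces the stochastic gradient by a version whose contribution is controlled through the step size, e.g.
\[
\theta_{n+1} = \theta_n - \frac{\lambda H(\theta_n, X_{n+1})}{1 + \sqrt{\lambda}\,|\theta_n|^{2r}} + \sqrt{2\lambda\beta^{-1}}\,\xi_{n+1},
\]
where $r > 0$ reflects the polynomial growth of $H$ in $\theta$; this counterbalances superlinearly growing gradients and thus prevents the moment blow-up described above.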
Convergence results for TUSLA are provided in [49] under the condition that the stochastic gradient of $u$ satisfies a polynomial Lipschitz growth condition. In [47], the applicability of TUSLA is further extended to the case where the stochastic gradient of $u$ is discontinuous, and the polynomial Lipschitz condition is replaced by a more relaxed locally-Lipschitz-in-average condition. The latter condition is similar to [9, Eqn.\ (6)] and [25, H4], and accommodates well optimization problems involving ReLU neural networks. One may also refer to [9, 20, 21, 25, 50] for convergence results of Langevin dynamics based algorithms with discontinuous (stochastic) gradients.
Despite their established theoretical guarantees, TUSLA and other Langevin dynamics based algorithms are less popular in practice than adaptive learning rate methods such as ADAM and AMSGrad, especially when training deep learning models. This is due to the superior empirical performance of the latter group of algorithms in terms of test accuracy and training speed. In [46], a new class of Langevin dynamics based algorithms, namely TH$\varepsilon$O POULA, is proposed, building on advances in polygonal Euler approximations; see [43, 44]. More precisely, the design of TH$\varepsilon$O POULA relies on a combination of a componentwise taming function and a componentwise boosting function, which simultaneously address the exploding and vanishing gradient problems.
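Schematically, the $i$-th coordinate of such an update may be written as follows (an illustrative form in our notation; the precise taming and boosting functions of [46] may differ), with $\varepsilon > 0$ a small constant and $H_i$ the $i$-th component of $H$:
\[
\theta_{n+1}^{(i)} = \theta_n^{(i)} - \lambda\,\frac{H_i(\theta_n, X_{n+1})}{1 + \sqrt{\lambda}\,|H_i(\theta_n, X_{n+1})|}\left(1 + \frac{\sqrt{\lambda}}{\varepsilon + |H_i(\theta_n, X_{n+1})|}\right) + \sqrt{2\lambda\beta^{-1}}\,\xi_{n+1}^{(i)},
\]
where the first fraction tames exploding gradient components and the factor in brackets boosts vanishing ones.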
Furthermore, such a design allows TH$\varepsilon$O POULA to transition from an adaptive learning rate method to a Langevin dynamics based algorithm when approaching an optimal point, preserving the fast training speed of the former and the good generalization of the latter. In addition, [46] provides a convergence analysis of TH$\varepsilon$O POULA for non-convex regularized optimization problems. Under the condition that the (stochastic) gradient is locally Lipschitz continuous, non-asymptotic error bounds for TH$\varepsilon$O POULA in Wasserstein distances are established, and a non-asymptotic estimate for the expected excess risk is provided. However, the local Lipschitz condition fails to accommodate optimization problems with discontinuous stochastic gradients.
In this paper, we propose the algorithm e-TH$\varepsilon$O POULA, which combines the advantages of the polygonal Euler approximations underlying TH$\varepsilon$O POULA [46], which result in its superior empirical performance, with a relaxed condition on its stochastic gradient, as explained below. We aim to demonstrate, both theoretically and numerically, the applicability of e-TH$\varepsilon$O POULA to optimization problems with discontinuous stochastic gradients. From a theoretical point of view, our goal is to provide theoretical guarantees for e-TH$\varepsilon$O POULA to find approximate minimizers of $u$ with discontinuous stochastic