LANGEVIN DYNAMICS BASED ALGORITHM E-THεO POULA FOR STOCHASTIC
OPTIMIZATION PROBLEMS WITH DISCONTINUOUS STOCHASTIC GRADIENT
DONG-YOUNG LIM, ARIEL NEUFELD, SOTIRIOS SABANIS, AND YING ZHANG
ABSTRACT. We introduce a new Langevin dynamics based algorithm, called e-THεO POULA, to solve optimization problems with discontinuous stochastic gradients which naturally appear in real-world applications such as quantile estimation, vector quantization, CVaR minimization, and regularized optimization problems involving ReLU neural networks. We demonstrate both theoretically and numerically the applicability of the e-THεO POULA algorithm. More precisely, under the conditions that the stochastic gradient is locally Lipschitz in average and satisfies a certain convexity at infinity condition, we establish non-asymptotic error bounds for e-THεO POULA in Wasserstein distances and provide a non-asymptotic estimate for the expected excess risk, which can be controlled to be arbitrarily small. Three key applications in finance and insurance are provided, namely, multi-period portfolio optimization, transfer learning in multi-period portfolio optimization, and insurance claim prediction, which involve neural networks with (Leaky)-ReLU activation functions. Numerical experiments conducted using real-world datasets illustrate the superior empirical performance of e-THεO POULA compared to SGLD, TUSLA, ADAM, and AMSGrad in terms of model accuracy.
1. INTRODUCTION
A wide range of problems in economics, finance, and quantitative risk management can be represented as stochastic optimization problems. Traditional approaches to solving such problems typically face the curse of dimensionality in practical settings, which motivates researchers and practitioners to apply machine learning approaches to obtain approximate solutions. Consequently, deep learning has been widely adopted in almost all aspects of, e.g., financial applications including option pricing, implied volatility, prediction, hedging, and portfolio optimization [3, 5, 8, 11, 24, 32, 37, 48, 54, 55, 67, 68], and
applications in insurance [12, 27, 30, 31, 40, 41, 45, 52, 56, 69, 71, 73]. While the aforementioned results justify the use of deep neural networks through the universal approximation theorem, it is not a trivial problem to train a deep neural network, which is equivalent to minimizing an associated loss function, using efficient optimization algorithms. Stochastic gradient descent (SGD) and its variants are popular methods to solve such non-convex and large-scale optimization problems. However, it is well known that SGD methods are only proven to converge to a stationary point in non-convex settings. Despite the lack of theoretical guarantees for the SGD methods, the literature on deep learning in finance, insurance, and their related fields relies heavily on popular optimization methods such as SGD and its variants including, e.g., ADAM [42] and AMSGrad [58]. In [38], the author explicitly highlights the importance of research on stochastic optimization methods for problems in finance: 'The choice of optimisation engine in deep learning is vitally important in obtaining sensible results, but a topic rarely discussed (at least within the financial mathematics community)'. The aim of this paper is thus to bridge the theoretical gap and to extend the empirical understanding of training deep learning models in applications in finance and insurance. We achieve this by investigating the properties of a newly proposed algorithm, i.e., the extended Tamed Hybrid ε-Order POlygonal Unadjusted Langevin Algorithm (e-THεO POULA),
which can be applied to optimization problems with discontinuous stochastic gradients including quantile estimation, vector quantization, CVaR minimization, and regularized optimization problems involving ReLU neural networks, see, e.g., [9, 25, 47, 60].

Key words and phrases. Langevin dynamics based algorithm, discontinuous stochastic gradient, non-convex stochastic optimization, non-asymptotic convergence bound, artificial neural networks, ReLU activation function, taming technique, super-linearly growing coefficients.

Financial support by The Alan Turing Institute, London under the EPSRC grant EP/N510129/1, the MOE AcRF Tier 2 Grant MOE-T2EP20222-0013, the European Union's Horizon 2020 research and innovation programme under the Marie Skłodowska-Curie grant agreement No. 801215, the University of Edinburgh Data-Driven Innovation programme, part of the Edinburgh and South East Scotland City Region Deal, the Institute of Information & communications Technology Planning & Evaluation (IITP) grant funded by the Korea government (MSIT) (No. 2020-0-01336, Artificial Intelligence Graduate School Program (UNIST)), the National Research Foundation of Korea (NRF) grant funded by the Korea government (MSIT) (No. RS-2023-00253002), and the Guangzhou-HKUST(GZ) Joint Funding Program (No. 2024A03J0630) is gratefully acknowledged.
We consider the following optimization problem:
$$\text{minimize } \mathbb{R}^d \ni \theta \mapsto u(\theta) := \mathbb{E}[U(\theta, X)], \tag{1}$$

where $U : \mathbb{R}^d \times \mathbb{R}^m \to \mathbb{R}$ is a measurable function, and $X$ is a given $\mathbb{R}^m$-valued random variable with probability law $\mathcal{L}(X)$. To obtain approximate minimizers of (1), one of the approaches is to apply the stochastic gradient Langevin dynamics (SGLD) algorithm introduced in [70], which can be viewed as a variant of the Euler discretization of the Langevin SDE defined on $t \in [0, \infty)$ given by

$$\mathrm{d}Z_t = -h(Z_t)\,\mathrm{d}t + \sqrt{2\beta^{-1}}\,\mathrm{d}B_t, \quad Z_0 = \theta_0, \tag{2}$$
where $\theta_0$ is an $\mathbb{R}^d$-valued random variable, $h := \nabla u$, $\beta > 0$ is the inverse temperature parameter, and $(B_t)_{t \geq 0}$ is a $d$-dimensional Brownian motion. The associated stochastic gradient of the SGLD algorithm is defined as a measurable function $H : \mathbb{R}^d \times \mathbb{R}^m \to \mathbb{R}^d$ which satisfies $h(\theta) = \mathbb{E}[H(\theta, X)]$ for all $\theta \in \mathbb{R}^d$. One notes that, under mild conditions, the Langevin SDE (2) admits a unique invariant measure $\pi_\beta(\mathrm{d}\theta) \propto \exp(-\beta u(\theta))\,\mathrm{d}\theta$ with $\beta > 0$.
It has been shown in [36] that $\pi_\beta$ concentrates around the minimizers of $u$ when $\beta$ takes sufficiently large values. Therefore, minimizing (1) is equivalent to sampling from $\pi_\beta$ with large $\beta$. The convergence properties of the SGLD algorithm to $\pi_\beta$ in suitable distances have been well studied in the literature, under the conditions that the (stochastic) gradient of $u$ is globally Lipschitz continuous and satisfies a (local) dissipativity or convexity at infinity condition, see, e.g., [10, 13, 57, 72, 75] and references therein. Recent research focuses on the relaxation of the global Lipschitz condition imposed on the (stochastic) gradient of $u$ so as to accommodate optimization problems involving neural networks. However, the SGLD algorithm is unstable when applied to objective functions with highly non-linear (stochastic) gradients, and the absolute moments of the approximations generated by the SGLD algorithm could diverge to infinity at a finite time point, see [34]. To address this issue, [49] proposed a tamed unadjusted stochastic Langevin algorithm (TUSLA), which is obtained by applying the taming technique, developed in, e.g., [7, 35, 62, 63], to the SGLD algorithm. Convergence results of TUSLA are provided in [49] under the condition that the stochastic gradient of $u$ is polynomially Lipschitz growing. In [47], the applicability of TUSLA is further extended to the case where the stochastic gradient of $u$ is discontinuous, and the polynomial Lipschitz condition is replaced by a more relaxed locally Lipschitz in average condition. The latter condition is similar to [9, Eqn. (6)] and [25, H4], and well accommodates optimization problems with ReLU neural networks. One may also refer to [9, 20, 21, 25, 50] for convergence results of Langevin dynamics based algorithms with discontinuous (stochastic) gradients.
Despite their established theoretical guarantees, TUSLA and other Langevin dynamics based algorithms are less popular in practice, especially when training deep learning models, compared to adaptive learning rate methods including ADAM and AMSGrad. This is due to the superior empirical performance of the latter group of algorithms in terms of test accuracy and training speed. In [46], a new class of Langevin dynamics based algorithms, namely THεO POULA, is proposed based on the advances of polygonal Euler approximations, see [43, 44]. More precisely, the design of THεO POULA relies on a combination of a componentwise taming function and a componentwise boosting function, which simultaneously address the exploding and vanishing gradient problems. Furthermore, such a design allows THεO POULA to convert from an adaptive learning rate method to a Langevin dynamics based algorithm when approaching an optimal point, preserving the fast training speed of the former and the good generalization of the latter. In addition, [46] provides a convergence analysis of THεO POULA for non-convex regularized optimization problems. Under the condition that the (stochastic) gradient is locally Lipschitz continuous, non-asymptotic error bounds for THεO POULA in Wasserstein distances are established, and a non-asymptotic estimate for the expected excess risk is provided. However, the local Lipschitz condition fails to accommodate optimization problems with discontinuous stochastic gradients.
In this paper, we propose the algorithm e-THεO POULA, which combines the advantages of the polygonal Euler approximations underlying THεO POULA [46], responsible for its superior empirical performance, with a relaxed condition on its stochastic gradient as explained below. We aim to demonstrate both theoretically and numerically the applicability of e-THεO POULA for optimization problems with discontinuous stochastic gradients. From a theoretical point of view, our goal is to provide theoretical guarantees for e-THεO POULA to find approximate minimizers of $u$ with discontinuous stochastic gradient. More concretely, we aim to relax the local Lipschitz condition, and replace it with a local Lipschitz in average condition, see Assumption 2. In addition, [46] considers regularized optimization problems which assume a certain structure of the stochastic gradients of the corresponding objective functions. More precisely, [46] assumes that $u(\theta) := g(\theta) + \eta|\theta|^{2(r+1)}/(2(r+1))$, $\theta \in \mathbb{R}^d$, where $g : \mathbb{R}^d \to \mathbb{R}$, $\eta > 0$, and $r > 0$. The second term on the RHS of $u$ is the regularization term, and the stochastic gradient of $u$, denoted by $H : \mathbb{R}^d \times \mathbb{R}^m \to \mathbb{R}^d$, is given by $H(\theta, x) = G(\theta, x) + \eta\theta|\theta|^{2r}$, where $\nabla_\theta g(\theta) = \mathbb{E}[G(\theta, X)]$. We aim to generalize the structure of $H$ by replacing $\eta\theta|\theta|^{2r}$ with an arbitrary function $F : \mathbb{R}^d \times \mathbb{R}^m \to \mathbb{R}^d$ which satisfies a local Lipschitz condition and a convexity at infinity condition, see (7) and Assumptions 3 and 4. In our setting, the gradient of the regularization term is a particular feasible example for the choice of $F$. In addition to the aforementioned assumptions, by further imposing conditions on the initial value of e-THεO POULA and on the second argument of $H$, see Assumption 1, we establish non-asymptotic error bounds of e-THεO POULA in Wasserstein distances and a non-asymptotic upper estimate of the expected excess risk given by $\mathbb{E}[u(\hat{\theta})] - \inf_{\theta \in \mathbb{R}^d} u(\theta)$, with $\hat{\theta}$ denoting an estimator generated by e-THεO POULA, which can be controlled to be arbitrarily small. From a numerical point of view, we illustrate the powerful empirical performance of e-THεO POULA by providing key examples in finance and insurance using real-world datasets, i.e., multi-period portfolio optimization, transfer learning in multi-period portfolio optimization, and insurance claim prediction via neural network-based non-linear regression. Numerical experiments show that e-THεO POULA outperforms SGLD, TUSLA, ADAM, and AMSGrad in most cases¹ with regard to test accuracy.

¹While it performs as well as the best alternative method in the remaining cases.
We conclude this section by introducing some notation. For $a, b \in \mathbb{R}$, denote by $a \wedge b = \min\{a, b\}$ and $a \vee b = \max\{a, b\}$. Let $(\Omega, \mathcal{F}, P)$ be a probability space. We denote by $\mathbb{E}[Z]$ the expectation of a random variable $Z$. For $1 \leq p < \infty$, $L^p$ is used to denote the usual space of $p$-integrable real-valued random variables. Fix integers $d, m \geq 1$. For an $\mathbb{R}^d$-valued random variable $Z$, its law on $\mathcal{B}(\mathbb{R}^d)$, i.e. the Borel sigma-algebra of $\mathbb{R}^d$, is denoted by $\mathcal{L}(Z)$. For a positive real number $a$, we denote by $\lfloor a \rfloor$ its integer part, and $\lceil a \rceil := \lfloor a \rfloor + 1$. The Euclidean scalar product is denoted by $\langle \cdot, \cdot \rangle$, with $|\cdot|$ standing for the corresponding norm (where the dimension of the space may vary depending on the context). For any integer $q \geq 1$, let $\mathcal{P}(\mathbb{R}^q)$ denote the set of probability measures on $\mathcal{B}(\mathbb{R}^q)$. For $\mu \in \mathcal{P}(\mathbb{R}^d)$ and for a $\mu$-integrable function $f : \mathbb{R}^d \to \mathbb{R}$, the notation $\mu(f) := \int_{\mathbb{R}^d} f(\theta)\,\mu(\mathrm{d}\theta)$ is used. For $\mu, \nu \in \mathcal{P}(\mathbb{R}^d)$, let $\mathcal{C}(\mu, \nu)$ denote the set of probability measures $\zeta$ on $\mathcal{B}(\mathbb{R}^{2d})$ such that its respective marginals are $\mu, \nu$. For two Borel probability measures $\mu$ and $\nu$ defined on $\mathbb{R}^d$ with finite $p$-th moments, the Wasserstein distance of order $p \geq 1$ is defined as

$$W_p(\mu, \nu) := \left( \inf_{\zeta \in \mathcal{C}(\mu, \nu)} \int_{\mathbb{R}^d} \int_{\mathbb{R}^d} |\theta - \bar{\theta}|^p \, \zeta(\mathrm{d}\theta, \mathrm{d}\bar{\theta}) \right)^{1/p}.$$
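As a quick illustration of this definition (our own, not part of the paper's analysis): on the real line with $p = 1$, the Wasserstein distance between empirical measures can be computed exactly, and for two samples differing only by a location shift, $W_1$ approximates the absolute mean difference. A minimal sketch using SciPy:

```python
import numpy as np
from scipy.stats import wasserstein_distance

rng = np.random.default_rng(0)
a = rng.normal(0.0, 1.0, size=10_000)   # samples from mu = N(0, 1)
b = rng.normal(0.5, 1.0, size=10_000)   # samples from nu = N(0.5, 1)
# W_1 between the two empirical measures; here approximately |0.5 - 0.0| = 0.5
print(wasserstein_distance(a, b))
```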
2. E-THεO POULA: SETTING AND DEFINITION
2.1. Setting. Let $U : \mathbb{R}^d \times \mathbb{R}^m \to \mathbb{R}$ be a Borel measurable function, and let $X$ be an $\mathbb{R}^m$-valued random variable defined on the probability space $(\Omega, \mathcal{F}, P)$ with probability law $\mathcal{L}(X)$ satisfying $\mathbb{E}[|U(\theta, X)|] < \infty$ for all $\theta \in \mathbb{R}^d$. We assume that $u : \mathbb{R}^d \to \mathbb{R}$ defined by $u(\theta) := \mathbb{E}[U(\theta, X)]$, $\theta \in \mathbb{R}^d$, is a continuously differentiable function, and denote by $h := \nabla u$ its gradient. In addition, for any $\beta > 0$, we define

$$\pi_\beta(A) := \frac{\int_A e^{-\beta u(\theta)}\,\mathrm{d}\theta}{\int_{\mathbb{R}^d} e^{-\beta u(\theta)}\,\mathrm{d}\theta}, \quad A \in \mathcal{B}(\mathbb{R}^d), \tag{3}$$

where we assume $\int_{\mathbb{R}^d} e^{-\beta u(\theta)}\,\mathrm{d}\theta < \infty$.
Denote by $(\mathcal{G}_n)_{n \in \mathbb{N}_0}$ a given filtration representing the flow of past information, and denote by $\mathcal{G}_\infty := \sigma(\bigcup_{n \in \mathbb{N}_0} \mathcal{G}_n)$. Moreover, let $(X_n)_{n \in \mathbb{N}_0}$ be a $(\mathcal{G}_n)$-adapted process such that $(X_n)_{n \in \mathbb{N}_0}$ is a sequence of i.i.d. $\mathbb{R}^m$-valued random variables with probability law $\mathcal{L}(X)$. In addition, let $(\xi_n)_{n \in \mathbb{N}_0}$ be a sequence of independent standard $d$-dimensional Gaussian random variables. We assume throughout the paper that the $\mathbb{R}^d$-valued random variable $\theta_0$ (initial condition), $\mathcal{G}_\infty$, and $(\xi_n)_{n \in \mathbb{N}_0}$ are independent.
Let $H : \mathbb{R}^d \times \mathbb{R}^m \to \mathbb{R}^d$ be an unbiased estimator of $h$, i.e., $h(\theta) = \mathbb{E}[H(\theta, X_0)]$ for all $\theta \in \mathbb{R}^d$, which takes the following form: for all $\theta \in \mathbb{R}^d$, $x \in \mathbb{R}^m$,

$$H(\theta, x) := G(\theta, x) + F(\theta, x), \tag{4}$$

where $G = (G^{(1)}, \ldots, G^{(d)}) : \mathbb{R}^d \times \mathbb{R}^m \to \mathbb{R}^d$ is Borel measurable and $F = (F^{(1)}, \ldots, F^{(d)}) : \mathbb{R}^d \times \mathbb{R}^m \to \mathbb{R}^d$ is continuous.
Remark 2.1. We consider $H$ taking the form of (4) with $G$ containing discontinuities and $F$ being locally Lipschitz continuous (see also Assumptions 2 and 3 in Section 4) as it is satisfied by a wide range of real-world applications including quantile estimation, vector quantization, CVaR minimization, and regularized optimization problems involving ReLU neural networks, see, e.g., [9, 25, 47, 60]. For illustrative purposes, we provide concrete examples for each of the applications mentioned above:
(i) For quantile estimation, we aim to identify the $q$-th quantile of a given distribution $\mathcal{L}(X)$. To this end, we consider the following regularized optimization problem:

$$\text{minimize } \mathbb{R} \ni \theta \mapsto u(\theta) := \mathbb{E}[l_q(X - \theta)] + \frac{\eta}{2(r+1)}|\theta|^{2(r+1)},$$

where $0 < q < 1$, and $\eta > 0$, $r \geq 0$ are regularization and growth constants, respectively, and

$$l_q(z) = \begin{cases} qz, & z \geq 0, \\ (q-1)z, & z < 0. \end{cases}$$

Then, we have that $H(\theta, x) := G(\theta, x) + F(\theta, x)$ with, for $\theta \in \mathbb{R}$, $x \in \mathbb{R}$,

$$F(\theta, x) := \eta\theta|\theta|^{2r}, \quad G(\theta, x) := -q + \mathbb{1}_{\{x < \theta\}}.$$
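To make example (i) concrete, here is a minimal sketch (our own illustration, not from the paper; function names and parameter values are hypothetical) of this stochastic gradient, together with a plain stochastic gradient loop; for small $\eta$ the iterates settle near the $q$-th quantile of the sampled distribution:

```python
import numpy as np

def H_quantile(theta, x, q=0.9, eta=1e-4, r=0.0):
    """Stochastic gradient of example (i); G is discontinuous at x = theta."""
    G = -q + float(x < theta)                # derivative of l_q(x - theta) in theta
    F = eta * theta * abs(theta) ** (2 * r)  # gradient of the regularization term
    return G + F

rng = np.random.default_rng(0)
theta, lam = 0.0, 0.05
for n in range(200_000):
    theta -= lam * H_quantile(theta, rng.normal())
print(theta)  # close to the 0.9-quantile of N(0,1), about 1.2816
```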
(ii) For vector quantization, our aim is to optimally quantize a given $\mathbb{R}^d$-valued random vector $X$ by an $\mathbb{R}^d$-valued random vector taking at most $N \in \mathbb{N}$ values. For ease of notation, we consider the case $d = 1$. For any $\theta = (\theta^{(1)}, \ldots, \theta^{(N)}) \in \mathbb{R}^N$ we define the associated Voronoi cells as

$$V^{(i)}(\theta) := \left\{ x \in \mathbb{R} : |x - \theta^{(i)}| = \min_{j \in \{1, \ldots, N\}} |x - \theta^{(j)}| \right\}, \quad i = 1, \ldots, N.$$

Then, we quantize the values of $X$ in $V^{(i)}(\theta)$ to $\theta^{(i)}$ in the following way. We consider minimizing the mean squared quantization error:

$$\text{minimize } \mathbb{R}^N \ni \theta \mapsto u(\theta) := \sum_{i=1}^{N} \mathbb{E}\left[ |X - \theta^{(i)}|^2 \,\mathbb{1}_{V^{(i)}(\theta)}(X) \right] + \frac{\eta}{2(r+1)}|\theta|^{2(r+1)},$$

where $\eta > 0$, $r \geq 0$. This implies that $H(\theta, x) := G(\theta, x) + F(\theta, x)$ with, for $\theta \in \mathbb{R}^N$, $x \in \mathbb{R}$,

$$F(\theta, x) := \eta\theta|\theta|^{2r}, \quad G(\theta, x) := (G^{(1)}(\theta, x), \ldots, G^{(N)}(\theta, x)),$$

where, for $i = 1, \ldots, N$,

$$G^{(i)}(\theta, x) = -2(x - \theta^{(i)})\,\mathbb{1}_{V^{(i)}(\theta)}(x).$$

We note that, in the case where $X \sim \mathrm{Uniform}[0, 1]$ and $N = 2$, the Voronoi cells take the form $V^{(1)}(\theta) = [0, (\theta^{(1)} + \theta^{(2)})/2]$ and $V^{(2)}(\theta) = [(\theta^{(1)} + \theta^{(2)})/2, 1]$.
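A minimal sketch of this stochastic gradient (our own illustration; the one-hot Voronoi lookup is the only discontinuous part, and ties occur with probability zero for a continuous $\mathcal{L}(X)$):

```python
import numpy as np

def G_quantization(theta, x):
    """Example (ii): only the Voronoi cell containing x gets a nonzero entry."""
    i = np.argmin(np.abs(x - theta))   # index i with x in V^{(i)}(theta)
    g = np.zeros_like(theta)
    g[i] = -2.0 * (x - theta[i])
    return g

# For X ~ Uniform[0,1] and N = 2, the optimal codebook is (1/4, 3/4).
rng = np.random.default_rng(0)
theta, lam, eta = np.array([0.2, 0.9]), 0.02, 1e-4
for n in range(200_000):
    theta -= lam * (G_quantization(theta, rng.uniform()) + eta * theta)  # r = 0
print(theta)  # approximately [0.25, 0.75]
```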
(iii) For CVaR minimization, we consider the problem of obtaining the VaR and the optimal weights which minimize the CVaR of a given portfolio consisting of $N \in \mathbb{N}$ assets, i.e., we consider

$$\text{minimize } \mathbb{R}^{N+1} \ni \bar{\theta} \mapsto u(\bar{\theta}) := \mathbb{E}\left[ \frac{1}{1-q}\left( \sum_{i=1}^{N} g_i(w)X^{(i)} - \theta \right)^+ + \theta \right] + \frac{\eta}{2(r+1)}|\bar{\theta}|^{2(r+1)},$$

where $\bar{\theta} := (\theta, w) = (\theta, w^{(1)}, \ldots, w^{(N)}) \in \mathbb{R}^{N+1}$, for each $i = 1, \ldots, N$, $X^{(i)} \in \mathbb{R}$ denotes the loss of the $i$-th asset, $g_i : \mathbb{R}^N \to \mathbb{R}$ denotes the (parameterized) weight of the $i$-th asset with

$$g_i(w) := \frac{e^{w^{(i)}}}{\sum_{j=1}^{N} e^{w^{(j)}}} \in (0, 1),$$

$0 < q < 1$, $(x)^+ := \max\{0, x\}$ for $x \in \mathbb{R}$, $\eta > 0$, and $r \geq 0$. Then, we have that $H(\bar{\theta}, x) := G(\bar{\theta}, x) + F(\bar{\theta}, x)$ with, for $\bar{\theta} \in \mathbb{R}^{N+1}$, $x \in \mathbb{R}^N$,

$$F(\bar{\theta}, x) := \eta\bar{\theta}|\bar{\theta}|^{2r}, \quad G(\bar{\theta}, x) := (G_\theta(\bar{\theta}, x), G_{w^{(1)}}(\bar{\theta}, x), \ldots, G_{w^{(N)}}(\bar{\theta}, x)),$$

where, for $j = 1, \ldots, N$,

$$G_\theta(\bar{\theta}, x) := 1 - \frac{1}{1-q}\,\mathbb{1}_{\{\sum_{i=1}^{N} g_i(w)x^{(i)} \geq \theta\}},$$

$$G_{w^{(j)}}(\bar{\theta}, x) := \frac{1}{1-q}\sum_{i=1}^{N} \partial_{w^{(j)}}g_i(w)\,x^{(i)}\,\mathbb{1}_{\{\sum_{i=1}^{N} g_i(w)x^{(i)} \geq \theta\}}.$$
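The only non-trivial step here is the softmax Jacobian $\partial_{w^{(j)}} g_i(w) = g_i(w)(\delta_{ij} - g_j(w))$, which collapses the inner sum to $g_j(w)\,(x^{(j)} - \sum_i g_i(w)x^{(i)})$. A minimal sketch of $G$ using this identity (our own illustration, with a hypothetical default for $q$):

```python
import numpy as np

def G_cvar(theta_bar, x, q=0.95):
    """Example (iii): theta_bar = (theta, w), x = vector of the N asset losses."""
    theta, w = theta_bar[0], theta_bar[1:]
    g = np.exp(w - w.max())
    g /= g.sum()                              # softmax weights g_i(w)
    port = g @ x                              # portfolio loss sum_i g_i(w) x^{(i)}
    exceed = float(port >= theta)             # indicator at the kink of (.)^+
    G_theta = 1.0 - exceed / (1.0 - q)
    # sum_i (d g_i / d w_j) x^{(i)} = g_j * (x^{(j)} - port), by the softmax Jacobian
    G_w = g * (x - port) * exceed / (1.0 - q)
    return np.concatenate(([G_theta], G_w))
```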
(iv) For regularized optimization problems involving ReLU neural networks, we consider an example of identifying the best regularized mean-square estimator². We consider the following regularized optimization problem:

$$\text{minimize } \mathbb{R}^2 \ni \theta \mapsto u(\theta) := \mathbb{E}[(Y - \mathcal{N}(\theta, Z))^2] + \frac{\eta}{2(r+1)}|\theta|^{2(r+1)},$$

where $\mathcal{N} : \mathbb{R}^2 \times \mathbb{R} \to \mathbb{R}$ is the neural network given by

$$\mathcal{N}(\theta, z) := K_1\sigma_1(c_0 z + b_0),$$

with $K_1$ the weight parameter, $\sigma_1(y) = \max\{0, y\}$, $y \in \mathbb{R}$, the ReLU activation function, $c_0$ the fixed (pre-trained non-zero) input weight, $z$ the input data, $b_0$ the bias parameter, and where $\theta = (K_1, b_0) \in \mathbb{R}^2$ is the parameter of the optimization problem, $Y$ is the $\mathbb{R}$-valued target random variable, $Z$ is the $\mathbb{R}$-valued input random variable, and $\eta, r > 0$. Then, we have that $H(\theta, x) := G(\theta, x) + F(\theta, x)$ with, for $\theta \in \mathbb{R}^2$, $x = (y, z) \in \mathbb{R}^2$,

$$F(\theta, x) := \eta\theta|\theta|^{2r}, \quad G(\theta, x) := (G_{K_1}(\theta, x), G_{b_0}(\theta, x)),$$

where

$$G_{K_1}(\theta, x) = -2(y - \mathcal{N}(\theta, z))\,\sigma_1(c_0 z + b_0),$$

$$G_{b_0}(\theta, x) = -2(y - \mathcal{N}(\theta, z))\,K_1\,\mathbb{1}_{\{z \geq -b_0/c_0\}}.$$
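A quick way to sanity-check these formulas is to compare $G$ against a finite-difference gradient at a point away from the ReLU kink. A minimal sketch (our own illustration, assuming $c_0 > 0$ so that $\{c_0 z + b_0 \geq 0\} = \{z \geq -b_0/c_0\}$; all numbers are arbitrary):

```python
import numpy as np

def G_relu(theta, y, z, c0=1.5):
    """Example (iv): gradient of (y - N(theta, z))^2 in theta = (K1, b0)."""
    K1, b0 = theta
    hidden = max(c0 * z + b0, 0.0)        # sigma_1(c0*z + b0)
    resid = y - K1 * hidden               # y - N(theta, z)
    active = float(z >= -b0 / c0)         # discontinuous indicator
    return np.array([-2.0 * resid * hidden, -2.0 * resid * K1 * active])

theta, y, z, c0 = np.array([0.7, -0.3]), 1.0, 1.0, 1.5
def loss(t):
    return (y - t[0] * max(c0 * z + t[1], 0.0)) ** 2
eps = 1e-6
fd = np.array([(loss(theta + eps * e) - loss(theta - eps * e)) / (2 * eps)
               for e in np.eye(2)])
print(G_relu(theta, y, z, c0), fd)  # the two gradients agree away from the kink
```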
We note that all the examples (i)-(iv) satisfy Assumptions 1-4 in Section 4.1, see, e.g., [9, 25, 47, 64] for detailed proofs, and hence can be solved using e-THεO POULA with its performance backed by the theoretical results presented in Section 4.2. While examples (i)-(iii) are presented to illustrate the wide applicability of e-THεO POULA, we focus in this paper on a general case of example (iv) in Section 3.2 and demonstrate the superior empirical performance of e-THεO POULA in Section 3 compared to other alternatives including SGLD, TUSLA, ADAM, and AMSGrad.
2.2. Algorithm. We define the extended Tamed Hybrid $\varepsilon$-Order POlygonal Unadjusted Langevin Algorithm (e-THεO POULA) by

$$\theta_0^\lambda := \theta_0, \quad \theta_{n+1}^\lambda := \theta_n^\lambda - \lambda H_\lambda(\theta_n^\lambda, X_{n+1}) + \sqrt{2\lambda\beta^{-1}}\,\xi_{n+1}, \quad n \in \mathbb{N}_0, \tag{5}$$

where $\lambda > 0$ is the stepsize, $\beta > 0$ is the inverse temperature parameter, and where $H_\lambda(\theta, x)$ is defined, for all $\theta \in \mathbb{R}^d$, $x \in \mathbb{R}^m$, by

$$H_\lambda(\theta, x) := G_\lambda(\theta, x) + F_\lambda(\theta, x), \tag{6}$$

with $G_\lambda(\theta, x) = (G_\lambda^{(1)}(\theta, x), \ldots, G_\lambda^{(d)}(\theta, x))$ and $F_\lambda(\theta, x) = (F_\lambda^{(1)}(\theta, x), \ldots, F_\lambda^{(d)}(\theta, x))$ given by

$$G_\lambda^{(i)}(\theta, x) := \frac{G^{(i)}(\theta, x)}{1 + \lambda|G^{(i)}(\theta, x)|}\left(1 + \frac{\lambda}{\varepsilon + |G^{(i)}(\theta, x)|}\right), \quad F_\lambda^{(i)}(\theta, x) := \frac{F^{(i)}(\theta, x)}{1 + \lambda|\theta|^{2r}}, \tag{7}$$

for any $i = 1, \ldots, d$, with fixed $0 < \varepsilon < 1$ and $r > 0$.
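To fix ideas, here is a minimal sketch of one e-THεO POULA update step (5)-(7) (our own illustration; `G` and `F` stand for the two user-supplied parts of the stochastic gradient in (4)):

```python
import numpy as np

def e_theo_poula_step(theta, x, G, F, lam, beta, eps, r, rng):
    """One iteration of (5): theta - lam*H_lam(theta, x) + sqrt(2*lam/beta)*xi."""
    g = G(theta, x)                       # possibly discontinuous part
    f = F(theta, x)                       # locally Lipschitz part
    # componentwise taming and boosting of G, eq. (7)
    g_lam = g / (1.0 + lam * np.abs(g)) * (1.0 + lam / (eps + np.abs(g)))
    # taming of F by the factor 1 + lam*|theta|^{2r}
    f_lam = f / (1.0 + lam * np.linalg.norm(theta) ** (2 * r))
    xi = rng.standard_normal(theta.shape)
    return theta - lam * (g_lam + f_lam) + np.sqrt(2.0 * lam / beta) * xi
```

Replacing `g_lam + f_lam` by the raw sum `G(theta, x) + F(theta, x)` recovers the SGLD update (8) discussed in the following remark.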
Remark 2.2. Recall that the general form of the stochastic gradient Langevin dynamics (SGLD) algorithm is given by

$$\theta_0^{\mathrm{SGLD}} := \theta_0, \quad \theta_{n+1}^{\mathrm{SGLD}} := \theta_n^{\mathrm{SGLD}} - \lambda H(\theta_n^{\mathrm{SGLD}}, X_{n+1}) + \sqrt{2\lambda\beta^{-1}}\,\xi_{n+1}, \quad n \in \mathbb{N}_0. \tag{8}$$

Therefore, e-THεO POULA is obtained by replacing $H$ in the SGLD algorithm with $H_\lambda$ given in (6)-(7). More precisely, one part of $H_\lambda$, i.e., $F_\lambda$, is obtained by dividing $F$ by the taming factor $1 + \lambda|\theta|^{2r}$, while the other part of $H_\lambda$, i.e., $G_\lambda$, is defined by dividing $G$ componentwise by the taming factor $1 + \lambda|G^{(i)}(\theta, x)|$ and, importantly, multiplying by the boosting function $1 + \frac{\lambda}{\varepsilon + |G^{(i)}(\theta, x)|}$. One observes that, when $|G^{(i)}(\theta, x)|$ is small, the boosting function takes a large value, which, in turn, contributes to the step-size and helps prevent the vanishing gradient problem, which occurs when the stochastic gradient is extremely small, resulting in insignificant updates of the algorithm before reaching an optimal point.
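As a small numerical illustration of this mechanism (our own, with arbitrarily chosen $\lambda$ and $\varepsilon$): the effective step $\lambda|G_\lambda^{(i)}|$ is inflated for small gradients and capped for large ones.

```python
import numpy as np

lam, eps = 0.1, 1e-2
for g in [1e-4, 1.0, 1e4]:
    g_lam = g / (1 + lam * g) * (1 + lam / (eps + g))
    print(f"|G|={g:<8g} plain step={lam*g:<10.3g} tamed step={lam*g_lam:.3g}")
# small |G|: boosted by roughly 1 + lam/eps (here ~11x), fighting vanishing gradients;
# moderate |G|: nearly the plain SGLD step;
# large |G|: capped near 1, so the update cannot explode.
```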
²For the ease of presentation, we consider the case where the input and target variables are both one-dimensional. For the multi-dimensional version, we refer to Section 3.2 and the corresponding Proposition 3.1.