
estimation, vector quantization, CVaR minimization, and regularized optimization problems involving ReLU neural networks; see, e.g., [9, 25, 47, 60].
We consider the following optimization problem:
\begin{equation}\label{eq:1}
\text{minimize} \quad \mathbb{R}^d \ni \theta \mapsto u(\theta) := \mathbb{E}[U(\theta, X)],
\end{equation}
where $U\colon \mathbb{R}^d \times \mathbb{R}^m \to \mathbb{R}$ is a measurable function, and $X$ is a given $\mathbb{R}^m$-valued random variable with probability law $\mathcal{L}(X)$. To obtain approximate minimizers of \eqref{eq:1}, one approach is to apply the stochastic gradient Langevin dynamics (SGLD) algorithm introduced in [70], which can be viewed as a variant of the Euler discretization of the Langevin SDE defined on $t \in [0,\infty)$ given by
\begin{equation}\label{eq:2}
\mathrm{d} Z_t = -h(Z_t)\,\mathrm{d} t + \sqrt{2\beta^{-1}}\,\mathrm{d} B_t, \qquad Z_0 = \theta_0,
\end{equation}
where $\theta_0$ is an $\mathbb{R}^d$-valued random variable, $h := \nabla u$, $\beta > 0$ is the inverse temperature parameter, and $(B_t)_{t \ge 0}$ is a $d$-dimensional Brownian motion. The associated stochastic gradient of the SGLD algorithm is defined as a measurable function $H\colon \mathbb{R}^d \times \mathbb{R}^m \to \mathbb{R}^d$ which satisfies $h(\theta) = \mathbb{E}[H(\theta, X)]$ for all $\theta \in \mathbb{R}^d$.
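For concreteness, a standard form of the SGLD recursion reads, with step size $\lambda > 0$, i.i.d. copies $(X_n)_{n \ge 1}$ of $X$, and i.i.d. $d$-dimensional standard Gaussian random variables $(\xi_n)_{n \ge 1}$ (this notation is introduced here for illustration):
\[
\theta_{n+1} = \theta_n - \lambda H(\theta_n, X_{n+1}) + \sqrt{2\lambda\beta^{-1}}\,\xi_{n+1}, \qquad n \in \mathbb{N}_0.
\]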
One notes that, under mild conditions, the Langevin SDE \eqref{eq:2} admits a unique invariant measure $\pi_\beta(\mathrm{d}\theta) \propto \exp(-\beta u(\theta))\,\mathrm{d}\theta$ with $\beta > 0$.
It has been shown in [36] that $\pi_\beta$ concentrates around the minimizers of $u$ when $\beta$ takes sufficiently large values. Therefore, minimizing \eqref{eq:1} is equivalent to sampling from $\pi_\beta$ with large $\beta$.
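For instance, if $u(\theta) = |\theta|^2/2$, then the invariant measure is the Gaussian
\[
\pi_\beta(\mathrm{d}\theta) \propto \exp(-\beta |\theta|^2/2)\,\mathrm{d}\theta, \quad \text{i.e.,} \quad \pi_\beta = \mathcal{N}(0, \beta^{-1} I_d),
\]
whose mass concentrates around the unique minimizer $\theta^\ast = 0$ as $\beta \to \infty$.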
The convergence properties of the SGLD algorithm to $\pi_\beta$ in suitable distances have been well studied in the literature, under the conditions that the (stochastic) gradient of $u$ is globally Lipschitz continuous and satisfies a (local) dissipativity or convexity at infinity condition; see, e.g., [10, 13, 57, 72, 75] and references therein. Recent research focuses on relaxing the global Lipschitz condition imposed on the (stochastic) gradient of $u$ so as to accommodate optimization problems involving neural networks. However, the SGLD algorithm is unstable when applied to objective functions with highly non-linear (stochastic) gradients, and the absolute moments of the approximations generated by the SGLD algorithm could diverge to infinity at a finite time point; see [34]. To address this issue, [49] proposed the tamed unadjusted stochastic Langevin algorithm (TUSLA), which is obtained by applying the taming technique, developed in, e.g., [7, 35, 62, 63], to the SGLD algorithm.
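To sketch the taming idea, a representative tamed recursion (written in our notation; the exact normalizer used in [49] may differ) replaces the stochastic gradient by a version whose contribution is controlled through the step size, e.g.
\[
\theta_{n+1} = \theta_n - \frac{\lambda H(\theta_n, X_{n+1})}{1 + \sqrt{\lambda}\,|\theta_n|^{2r}} + \sqrt{2\lambda\beta^{-1}}\,\xi_{n+1},
\]
where $r > 0$ reflects the polynomial growth of $H$ in $\theta$; this counterbalances superlinearly growing gradients and thus prevents the moment blow-up described above.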
Convergence results for TUSLA are provided in [49] under the condition that the stochastic gradient of $u$ satisfies a polynomial Lipschitz growth condition. In [47], the applicability of TUSLA is further extended to the case where the stochastic gradient of $u$ is discontinuous, and the polynomial Lipschitz condition is replaced by a more relaxed locally-Lipschitz-in-average condition. The latter condition is similar to [9, Eqn.\ (6)] and [25, H4], and accommodates well optimization problems involving ReLU neural networks. One may also refer to [9, 20, 21, 25, 50] for convergence results of Langevin dynamics based algorithms with discontinuous (stochastic) gradients.
Despite their established theoretical guarantees, TUSLA and other Langevin dynamics based algorithms are less popular in practice than adaptive learning rate methods such as ADAM and AMSGrad, especially when training deep learning models. This is due to the superior empirical performance of the latter group of algorithms in terms of test accuracy and training speed. In [46], a new class of Langevin dynamics based algorithms, namely TH$\varepsilon$O POULA, is proposed, building on advances in polygonal Euler approximations; see [43, 44]. More precisely, the design of TH$\varepsilon$O POULA relies on a combination of a componentwise taming function and a componentwise boosting function, which simultaneously address the exploding and vanishing gradient problems.
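Schematically, the $i$-th coordinate of such an update may be written as follows (an illustrative form in our notation; the precise taming and boosting functions of [46] may differ), with $\varepsilon > 0$ a small constant and $H_i$ the $i$-th component of $H$:
\[
\theta_{n+1}^{(i)} = \theta_n^{(i)} - \lambda\,\frac{H_i(\theta_n, X_{n+1})}{1 + \sqrt{\lambda}\,|H_i(\theta_n, X_{n+1})|}\left(1 + \frac{\sqrt{\lambda}}{\varepsilon + |H_i(\theta_n, X_{n+1})|}\right) + \sqrt{2\lambda\beta^{-1}}\,\xi_{n+1}^{(i)},
\]
where the first fraction tames exploding gradient components and the factor in brackets boosts vanishing ones.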
Furthermore, such a design allows TH$\varepsilon$O POULA to transition from an adaptive learning rate method to a Langevin dynamics based algorithm when approaching an optimal point, preserving the fast training speed of the former and the good generalization of the latter. In addition, [46] provides a convergence analysis of TH$\varepsilon$O POULA for non-convex regularized optimization problems. Under the condition that the (stochastic) gradient is locally Lipschitz continuous, non-asymptotic error bounds for TH$\varepsilon$O POULA in Wasserstein distances are established, and a non-asymptotic estimate for the expected excess risk is provided. However, the local Lipschitz condition fails to accommodate optimization problems with discontinuous stochastic gradients.
In this paper, we propose the algorithm e-TH$\varepsilon$O POULA, which combines the advantages of the polygonal Euler approximations underlying TH$\varepsilon$O POULA [46], which result in its superior empirical performance, with a relaxed condition on its stochastic gradient, as explained below. We aim to demonstrate, both theoretically and numerically, the applicability of e-TH$\varepsilon$O POULA to optimization problems with discontinuous stochastic gradients. From a theoretical point of view, our goal is to provide theoretical guarantees for e-TH$\varepsilon$O POULA to find approximate minimizers of $u$ with discontinuous stochastic