
ADAM. The second variant proposed in [Reddi et al., 2018], called ADAMNC, requires the second-order moment hyper-parameter $\beta_2$ to increase and to satisfy several conditions. However, these conditions are hard to check. Although the authors claim that $\beta_{2,t} = 1 - 1/t$ satisfies the conditions, this case is actually AdaGrad, which is already well known to converge. Zhou et al. [Zhou et al., 2019] analyzed the divergent example in [Reddi et al., 2018], pointed out that the correlation between $v_t$ and $g_t$ causes the divergence of ADAM, and proposed a decorrelated variant of ADAM. The theoretical analysis in [Zhou et al., 2019] relies on complex assumptions, and no convergence analysis of their algorithm is provided. Several other works, such as [Guo et al., 2021, Shi et al., 2020, Wang et al., 2019, Zou et al., 2019], suggested that properly tuning the hyper-parameters of ADAM-type algorithms helps with convergence in practice.
It is empirically well known that a larger batch size reduces the variance of the loss in a stochastic optimization algorithm. [Qian and Klabjan, 2020] gave a theoretical proof that the variance of the stochastic gradient is proportional to $1/b$.
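As a simple illustration of this scaling (a standard argument assuming the mini-batch $B$ is drawn i.i.d. with replacement, not necessarily the exact statement of [Qian and Klabjan, 2020]): if $\sigma^2(x) = \mathbb{E}_i\|\nabla f_i(x) - \nabla f(x)\|^2$ denotes the per-sample gradient variance, then
\[
\mathbb{E}\Big\|\frac{1}{b}\sum_{i\in B}\nabla f_i(x) - \nabla f(x)\Big\|^2 = \frac{\sigma^2(x)}{b},
\]
so the variance of the mini-batch gradient decreases proportionally to $1/b$.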
Although several works connected the convergence of ADAM with the mini-batch size, a direct connection between convergence and variance is still lacking. For the full-batch case (i.e., where there is no variance), [De et al., 2018] showed that ADAM converges under a specific scheduling of learning rates. [Shi et al., 2020] showed the convergence of full-gradient ADAM and RMSProp with the learning rate schedule $\alpha_t = \alpha/\sqrt{t}$ and constants $\beta_1$ and $\beta_2$ satisfying $\beta_1 < \sqrt{\beta_2}$. For the stochastic setting with a fixed batch size, Zaheer et al. [Zaheer et al., 2018] proved that the expected norm of the gradient can be bounded within a neighborhood of 0 whose size is proportional to $1/b$. They suggested increasing the batch size with the number of iterations in order to establish convergence. One question is whether there exists a batch-size threshold $b^* < N$ such that any batch size larger than $b^*$ guarantees convergence. We show that even when $b = N-1$, there still exist divergent examples of ADAM. This means that although a large batch size helps tighten the optimality gap, the convergence issue is not resolved as long as the variance exists. Another possible convergence result would be to analyze convergence in expectation or with high probability under a stochastic starting point. However, our divergence result holds for any initial point, which rules out this possibility.
Without relying on the mini-batch size, we directly analyze the relationship between variance and the convergence of ADAM. We first show a motivating result which points out that the convergence of an ADAM-type algorithm can be implied by reducing the variance. Motivated by this, we propose a variance reduced version of ADAM, called VRADAM, and show that it converges. We provide two options regarding the resetting of the ADAM states during the full gradient steps, and recommend the resetting option based on the theoretical analysis herein and computational experiments. Finally, we conduct several computational experiments and show that our algorithm performs as well as the original version of ADAM.
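To make the idea concrete, the following Python sketch illustrates one natural instantiation of a variance reduced ADAM update. It is only an illustration under our own assumptions, not the exact VRADAM algorithm defined later in the paper: it plugs an SVRG-style gradient estimator (a periodic full-gradient snapshot plus a per-sample correction) into the standard ADAM moment updates, and the reset_states flag stands in for the resetting option discussed above. All function and parameter names (vradam_svrg_sketch, grad_i, inner_steps, and so on) are hypothetical.

import numpy as np

def vradam_svrg_sketch(grad_i, x0, n_samples, n_epochs=10, inner_steps=100,
                       alpha=1e-3, beta1=0.9, beta2=0.999, eps=1e-8,
                       reset_states=True, rng=None):
    """grad_i(x, i) returns the gradient of the i-th sample loss at x."""
    rng = np.random.default_rng() if rng is None else rng
    x = np.asarray(x0, dtype=float).copy()
    m = np.zeros_like(x)   # first moment estimate
    v = np.zeros_like(x)   # second moment estimate
    t = 0                  # step counter used for bias correction

    for _ in range(n_epochs):
        # Full-gradient snapshot: the anchor point of the variance reduced estimator.
        x_snap = x.copy()
        full_grad = np.mean([grad_i(x_snap, i) for i in range(n_samples)], axis=0)

        if reset_states:
            # The resetting option: clear the ADAM states at every snapshot.
            m = np.zeros_like(x)
            v = np.zeros_like(x)
            t = 0

        for _ in range(inner_steps):
            i = rng.integers(n_samples)
            # SVRG-style variance reduced gradient estimate.
            g = grad_i(x, i) - grad_i(x_snap, i) + full_grad

            # Standard ADAM moment updates with bias correction.
            t += 1
            m = beta1 * m + (1.0 - beta1) * g
            v = beta2 * v + (1.0 - beta2) * g * g
            m_hat = m / (1.0 - beta1 ** t)
            v_hat = v / (1.0 - beta2 ** t)
            x = x - alpha * m_hat / (np.sqrt(v_hat) + eps)

    return x

if __name__ == "__main__":
    # Toy usage: minimize (1/N) * sum_i 0.5 * ||x - a_i||^2, whose optimum is mean(a_i).
    a = np.random.default_rng(0).normal(size=(50, 3))
    x_hat = vradam_svrg_sketch(lambda x, i: x - a[i], np.zeros(3), n_samples=50)
    print(x_hat, a.mean(axis=0))

With reset_states=False the moments m, v and the counter t carry over across snapshots; keeping both options in the sketch mirrors the two resetting variants compared above.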
In Section 3, we show a divergent example. Arguing by contradiction, we assume that the algorithm converges and show that the expected update of the iterates is larger than a positive constant, which makes it impossible for the algorithm to converge to an optimal solution and contradicts the assumption. In Section 5, we prove the convergence of VRADAM. The main proof technique is to properly bound the difference between the estimated gradients and the true gradients. By bounding the change of the objective function in each iteration, we can further employ the strong convexity assumption and conclude convergence.
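As a generic illustration of how strong convexity closes the argument (a standard inequality, not necessarily the exact step used in Section 5): if $f$ is $\mu$-strongly convex with minimizer $x^*$, then
\[
\|\nabla f(x)\|^2 \ge 2\mu\left(f(x) - f(x^*)\right),
\]
so driving the (expected) gradient norm to zero also drives the optimality gap $f(x) - f(x^*)$ to zero.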
Our contributions are as follows.
1. We provide an unconstrained and strongly convex stochastic optimization problem on which the original ADAM diverges. We show that the divergence holds for any initial point, which rules out all possible weaker convergence results under a stochastic starting point.
2. We construct a divergent mini-batch problem with $b = N-1$, and conclude that there does not exist a convergence threshold for the mini-batch size.
3. We propose a variance reduced version of ADAM. We provide convergence results of the variance reduced version, both for strongly convex objectives (convergence to optimality) and for non-convex objectives. We show by experiments that the variance reduction does not harm the numerical performance of ADAM.
In Section 2, we review the literature on the convergence/divergence issue of ADAM and on variance reduction methods in optimization. In Section 3, we provide divergent examples for stochastic ADAM. We show that the example is divergent even for large batch sizes, which disproves the existence of a convergence threshold for the mini-batch size. In Section 4, we start from a variance reduction condition and prove the convergence of an ADAM-type algorithm under this condition. In Section 5, we propose a variance reduced version of ADAM. We show that resetting the states in the algorithm helps with the performance. We also provide a convergence result for our variance reduced ADAM. In Section 6, we conduct several numerical experiments and show the convergence and sensitivity of the proposed algorithm.
2 Literature Review
Convergence of ADAM: Reddi et al. [Reddi et al., 2018] first pointed out the convergence issue of ADAM and proposed two convergent variants: (a) AMSGrad takes the