Divergence Results and Convergence of a Variance Reduced Version of ADAM
Ruiqi Wang Diego Klabjan
Northwestern University Northwestern University
Abstract
Stochastic optimization algorithms that use exponential moving averages of past gradients, such as ADAM, RMSProp and AdaGrad, have had great success in many applications, especially in training deep neural networks. ADAM in particular stands out as efficient and robust. Despite its outstanding performance, ADAM has been proved to diverge on some specific problems. We revisit the divergence question and provide divergent examples under stronger conditions, such as in expectation or with high probability. Under a variance reduction assumption, we show that an ADAM-type algorithm converges, which means that it is the variance of the gradients that causes the divergence of the original ADAM. To this end, we propose a variance reduced version of ADAM and provide a convergence analysis of the algorithm. Numerical experiments show that the proposed algorithm performs as well as ADAM. Our work suggests a new direction for fixing the convergence issues.
1 Introduction
Stochastic optimization based on mini-batches is a common training procedure in machine learning. Suppose we have finitely many differentiable objectives $\{f_n(w)\}_{n=1}^N$ defined on $\mathbb{R}^d$, with $N$ being the size of the training set. In each iteration, a random index set $B_t$ is selected from $\{1, \dots, N\}$ and the update is made based on the mini-batch loss $F_{B_t}(w) = \frac{1}{b} \sum_{n \in B_t} f_n(w)$, where $b = |B_t|$ is the batch size. The goal is to minimize the empirical risk $\min_{w \in \mathbb{R}^d} F(w) := \frac{1}{N} \sum_{n=1}^N f_n(w)$.
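As a concrete illustration, the following is a minimal NumPy sketch of sampling $B_t$ and evaluating $\nabla F_{B_t}(w)$; the least-squares losses and all names here are our own illustrative choices, not from the paper.

```python
import numpy as np

rng = np.random.default_rng(0)
N, d, b = 1000, 5, 32
A = rng.normal(size=(N, d))                   # data defining f_n(w) = 0.5 * (a_n @ w - y_n)**2
y = rng.normal(size=N)

def minibatch_grad(w, batch):
    """Gradient of F_B(w) = (1/b) * sum over n in B of f_n(w)."""
    residual = A[batch] @ w - y[batch]
    return A[batch].T @ residual / len(batch)

batch = rng.choice(N, size=b, replace=False)  # random index set B_t
g = minibatch_grad(np.zeros(d), batch)
```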
First order methods, which make updates based on the gradient of the mini-batch loss, prevail in practice [Goodfellow et al., 2016]. A simple method is stochastic gradient descent (SGD), where the model parameters are updated in the negative direction of the mini-batch loss gradient in each iteration. Although SGD is straightforward and provably convergent, its steps near a minimum are very noisy, so it takes longer to converge. Several adaptive variants of SGD, such as AdaGrad [Duchi et al., 2011], RMSProp [Hinton et al., 2012] and ADAM [Kingma and Ba, 2015], converge faster than SGD in practice. These methods take the historical gradients into account. Specifically, instead of using a predefined learning rate schedule, they adjust the step size automatically based on information from the past mini-batch losses. AdaGrad is the earliest algorithm in the adaptive method family and performs better than SGD when gradients are sparse. Although AdaGrad has strong theoretical properties for convex losses, it does not work well in practical training. RMSProp replaces the sum-of-squares scaling in AdaGrad with an exponential moving average, which fixes the rapid decay of the learning rate in AdaGrad. ADAM-type algorithms combine exponential moving averages of both first and second order moments. The original ADAM enjoys the advantages of AdaGrad on sparse problems and of RMSProp on non-stationary problems, and became one of the most popular optimization methods in practice.
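To make the contrast concrete, here is a condensed sketch of the per-coordinate scaling used by the three methods (our own code; bias correction and other details of the published algorithms are omitted):

```python
import numpy as np

def adagrad_step(w, g, state, alpha=0.01, eps=1e-8):
    state["v"] += g * g                                      # running sum of squared gradients
    return w - alpha * g / (np.sqrt(state["v"]) + eps)

def rmsprop_step(w, g, state, alpha=0.001, beta2=0.9, eps=1e-8):
    state["v"] = beta2 * state["v"] + (1 - beta2) * g * g    # exponential moving average of g^2
    return w - alpha * g / (np.sqrt(state["v"]) + eps)

def adam_step(w, g, state, alpha=0.001, beta1=0.9, beta2=0.999, eps=1e-8):
    state["m"] = beta1 * state["m"] + (1 - beta1) * g        # first moment EMA
    state["v"] = beta2 * state["v"] + (1 - beta2) * g * g    # second moment EMA
    return w - alpha * state["m"] / (np.sqrt(state["v"]) + eps)
```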
Yet, ADAM may fail to solve some problems. Reddi et al. [Reddi et al., 2018] found a flaw in the convergence proof of [Kingma and Ba, 2015] and proposed a divergent example for online ADAM. Based on the divergent example, they pointed out that when some large, informative but rare gradients occur, the exponential moving average makes them decay quickly, which leads to the failure of convergence. To address this, Reddi et al. proposed two variants of ADAM. The first, known as AMSGrad, takes the historical maximum of the ADAM state $v_t$ in order to obtain 'long-term memory' and prevent the large, informative gradients from being forgotten. Although this helps keep the information of large gradients, it hurts the adaptability of ADAM: if the algorithm is exposed to a large gradient in early iterations, the $v_t$ parameter stays constant, so the algorithm does not automatically adapt the step size and degenerates to a momentum method. Another intuitive criticism is that keeping $v_t$ increasing is contrary to what one expects, since if the algorithm converges, the norm of the gradients should decrease and $v_{t+1} - v_t = (1 - \beta_2)(g_t^2 - v_t)$ is then more likely to be negative, where $g_t$ is the stochastic gradient at step $t$ and $\beta_2$ is a hyper-parameter.
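In code, AMSGrad's fix amounts to one extra line on top of the ADAM update in the previous sketch (again omitting bias correction):

```python
import numpy as np

def amsgrad_step(w, g, state, alpha=0.001, beta1=0.9, beta2=0.999, eps=1e-8):
    state["m"] = beta1 * state["m"] + (1 - beta1) * g
    state["v"] = beta2 * state["v"] + (1 - beta2) * g * g
    state["v_max"] = np.maximum(state["v_max"], state["v"])  # 'long-term memory': never decreases
    return w - alpha * state["m"] / (np.sqrt(state["v_max"]) + eps)
```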
Several other proposals tried to fix the divergence problem of ADAM. The second variant proposed in [Reddi et al., 2018], called ADAMNC, requires the second order moment hyper-parameter $\beta_2$ to increase and to satisfy several conditions. However, the conditions are hard to check. Although the authors claim that $\beta_{2,t} = 1 - 1/t$ satisfies the conditions, this case is actually AdaGrad, which is already well known to converge. Zhou et al. [Zhou et al., 2019] analyzed the divergent example in [Reddi et al., 2018], pointed out that the correlation of $v_t$ and $g_t$ causes the divergence of ADAM, and proposed a decorrelated variant of ADAM. The theoretical analysis in [Zhou et al., 2019] is based on complex assumptions, and they do not provide a convergence analysis of their algorithm. Several other works, such as [Guo et al., 2021, Shi et al., 2020, Wang et al., 2019, Zou et al., 2019], suggested that properly tuning the hyper-parameters of ADAM-type algorithms helps with convergence in practice.
It is empirically well known that a larger batch size reduces the variance of the loss of a stochastic optimization algorithm. [Qian and Klabjan, 2020] gave a theoretical proof that the variance of the stochastic gradient is proportional to $1/b$.
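For intuition, the dependence on $b$ is already visible in the simplest i.i.d.-sampling model (a standard computation on our part, not the finite-population statement of [Qian and Klabjan, 2020]): if the indices in $B_t$ are drawn i.i.d. and each per-sample gradient has variance $\sigma^2(w)$, then
$$\mathrm{Var}\big(\nabla F_{B_t}(w)\big) = \mathrm{Var}\Big(\frac{1}{b}\sum_{n \in B_t} \nabla f_n(w)\Big) = \frac{\sigma^2(w)}{b}.$$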
Although several works connected the convergence of ADAM with the mini-batch size, a direct connection between convergence and variance is missing. For the full-batch case (i.e., where there is no variance), [De et al., 2018] showed that ADAM converges under some specific scheduling of learning rates. [Shi et al., 2020] showed the convergence of full gradient ADAM and RMSProp with the learning rate schedule $\alpha_t = \alpha/\sqrt{t}$ and constants $\beta_1$ and $\beta_2$ satisfying $\beta_1 < \sqrt{\beta_2}$. For the stochastic setting with a fixed batch size, Zaheer et al. [Zaheer et al., 2018] proved that the expected norm of the gradient can be bounded into a neighborhood of 0 whose size is proportional to $1/b$. They suggested increasing the batch size with the number of iterations in order to establish convergence. One question is whether there exists a threshold batch size $b^* < N$ such that any batch size larger than $b^*$ guarantees convergence. We show that even when $b = N - 1$, there still exist divergent examples for ADAM. This means that although a large batch size helps tighten the optimality gap, the convergence issue is not solved as long as the variance exists. Another possible convergence result would be convergence in expectation or with high probability under a stochastic starting point. However, our divergence result holds for any initial point, which rules out this possibility.
Without relying on the mini-batch size, we make a direct analysis of the variance and the convergence of ADAM. We first show a motivating result which points out that convergence of an ADAM-type algorithm is implied by reducing the variance. Motivated by this, we propose a variance reduced version of ADAM, called VRADAM, and show that VRADAM converges. We provide two options regarding the resetting of the ADAM states during the full gradient steps, and recommend the resetting option based on a theoretical analysis herein and on computational experiments. Finally, we conduct several computational experiments and show that our algorithm performs as well as the original version of ADAM.

In Section 3, we show a divergent example. Assuming for contradiction that the algorithm converges, we show that the expected update of the iterates is larger than a positive constant, which means that it is impossible for the algorithm to converge to an optimal solution, contradicting the assumption. In Section 5, we prove the convergence of VRADAM. The main proof technique is to properly bound the difference between the estimated gradients and the true gradients. By bounding the change of the objective function in each iterate, we can further employ the strong convexity assumption and conclude convergence.
Our contributions are as follows.

1. We provide an unconstrained and strongly convex stochastic optimization problem on which the original ADAM diverges. We show that the divergence holds for any initial point, which rules out all possible weaker convergence results under a stochastic starting point.

2. We construct a divergent mini-batch problem with $b = N - 1$, and conclude that there does not exist a convergence threshold for the mini-batch size.

3. We propose a variance reduced version of ADAM. We provide convergence results of the variance reduced version, to optimality for strongly convex objectives and to stationarity for non-convex objectives. We show by experiments that the variance reduction does not harm the numerical performance of ADAM.
In Section 2, we review the literature on the convergence/divergence issue of ADAM and on variance reduction methods. In Section 3 we provide divergent examples for stochastic ADAM; we show that the examples remain divergent for large batch sizes, which disproves the existence of a convergence threshold for the mini-batch size. In Section 4 we start from a vanishing-variance condition and prove the convergence of an ADAM-type algorithm under this condition. In Section 5 we propose a variance reduced version of ADAM; we show that resetting the states in the algorithm helps with performance, and we also provide a convergence result for our variance reduced ADAM. In Section 6 we conduct several numerical experiments and show the convergence and sensitivity of the proposed algorithm.
2 Literature Review
Convergence of ADAM: Reddi et al. [Reddi et al., 2018] first pointed out the convergence issue of ADAM and proposed two convergent variants: (a) AMSGrad takes the historical maximum value of $v_t$ to keep the step size decreasing, and (b) ADAMNC requires the hyper-parameters to satisfy specific conditions. Both approaches require that $\beta_1$ vary with time, which is inconsistent with practice. Fang and Klabjan [Fang and Klabjan, 2019] gave a convergence proof for AMSGrad with constant $\beta_1$, and [Alacaoglu et al., 2020] provided a tighter bound. Enlarging the mini-batch size is another direction. [De et al., 2018] and [Shi et al., 2020] proved the convergence of ADAM for full batch gradients, and [Zaheer et al., 2018] showed the convergence of ADAM as long as the batch size is of the same order as the maximum number of iterations; one criticism is that such a setting for the batch size is very inefficient in practice, since the calculation of a large batch gradient is expensive. Several works, such as [Guo et al., 2021, Zou et al., 2019, Wang et al., 2019], proposed guidelines on setting hyper-parameters in order to obtain convergence results. [Guo et al., 2021] showed that as long as $\beta_1$ is close enough to 1, in particular $1 - \beta_{1,t} \leq 1/t$, ADAM establishes a convergence rate of $O(1/\sqrt{T})$.
However, since [Reddi et al., 2018] proposed a divergent example for any fixed $\beta_1$ and $\beta_2$ such that $\beta_1 < \sqrt{\beta_2}$, there is no hope of extending the results of [Guo et al., 2021] to constant momentum parameters. [Zou et al., 2019] also provided a series of conditions under which ADAM converges; specifically, they require the quantity $\alpha_t/\sqrt{1 - \beta_{2,t}}$ to be 'almost' non-increasing. [Wang et al., 2019] proposed to set the denominator hyper-parameter $\epsilon$ to $1/t$, and showed the convergence of ADAM for strongly convex objectives. The aforementioned works focus on setting the hyper-parameters of ADAM. In contrast, our work proposes a new algorithm that requires only basic and common conditions. We show an $O(T^{-p})$ convergence rate for $0 < p < 1$, where $p$ depends on the hyper-parameters.
Variance reduction: The computational efficiency issues of full gradient descent methods become more severe with large data sizes, but employing stochastic gradient descent may cause divergence because of the variance. One classic method for variance reduction is to use mini-batch losses with a larger batch size, which however does not drive the variance to zero. As an estimate of the full gradient, the stochastic average gradient (SAG) method [Le Roux et al., 2012] uses an average of $\nabla f_i(x_{k_i})$, where $k_i$ is the most recent step at which sample $i$ was picked. Although the convergence analysis of SAG in [Schmidt et al., 2017] shows remarkable linear convergence, the estimator of the descent direction is biased and the analysis of SAG is complicated. SAGA [Defazio et al., 2014], an unbiased variant of SAG, introduced the concept of 'covariates' and guarantees linear convergence as well. Both SAG and SAGA require $O(Nd)$ memory, which is expensive when the data set is large. SVRG [Johnson and Zhang, 2013] constructs two layers of iterations and calculates the full gradient as an auxiliary vector for variance reduction before starting each inner loop; it requires only $O(d)$ memory. Most of the literature on variance reduction focuses on the convergence rate and memory requirements of the plain SGD algorithm. Recently, [Dubois-Taine et al., 2021] combined AdaGrad with SVRG for robustness in the learning rate. Our work introduces the idea of variance reduction to the convergence analysis of ADAM, and brings the idea of a dynamic learning rate to SVRG.
3 Divergent examples for stochastic ADAM with large batch size
Several recent works [Shi et al., 2020, Zaheer et al., 2018, De et al., 2018] have suggested that increasing the mini-batch size may help with the convergence of ADAM. In particular, vanilla ADAM is convergent if the mini-batch size $b$ is equal to the size of the training set, or if it increases at the same order as the number of training iterations. An interesting question is whether there exists a threshold batch size $b^* = b^*(N)$, smaller than $N$, such that $b > b^*$ implies convergence of ADAM. If such a threshold existed, convergence could be guaranteed by a batch size that is sufficiently large, but neither increasing nor as large as the training set. Unfortunately, such a threshold does not exist. In fact, we show in this section that as long as the algorithm is not full batch, one can find a divergent example for ADAM.
Another aspect of interest is whether ADAM converges on average or with high probability. Our example establishes non-convergence for any initial point (even an optimal one). We conclude that such a probabilistic statement is impossible whether the stochasticity comes from sampling or from the initial point.
Reddi et al. first proposed a divergent example for ADAM in [Reddi et al., 2018]. The example, which is posed under the population loss minimization framework, consists of two linear functions defined on a finite interval. One drawback of this example is that the optimization problem is constrained, while training in machine learning is usually unconstrained. Under the unconstrained framework, the example proposed in [Reddi et al., 2018] has no minimizer, hence it does not satisfy the basic requirements. We therefore first propose an unconstrained problem under the population loss minimization framework.
Let a random variable $\xi$ take discrete values in the set $\{1, 2\}$, and set $P(\xi = 1) = \frac{1+\delta}{1+\delta^4}$ for some $\delta > 1$. Furthermore, we define the gradient estimator by
$$G(w; 1) = \frac{w}{\delta} + \delta^4 \quad \text{and} \quad G(w; 2) = \frac{w}{\delta} - 1,$$
which implies the stochastic optimization problem with the loss functions
$$f_1(w) = \frac{w^2}{2\delta} + \delta^4 w \quad \text{and} \quad f_2(w) = \frac{w^2}{2\delta} - w$$
with the corresponding probability distribution with respect to $\xi$. The population loss is given as $F(w) = \mathbb{E}_\xi[f_\xi(w)]$. We call this stochastic optimization problem the Original Problem ($\delta$), or OP($\delta$) for short. Note that OP($\delta$) is defined on $\mathbb{R}$, thus it is unconstrained. In addition, it is a strongly convex problem. As a divergence property of OP($\delta$), we show the following result.
Theorem 1 There exists a $\delta^* > 2$ such that for any $\delta > \delta^*$ and any initial point $w_1$, ADAM diverges in expectation on OP($\delta$), i.e., $\mathbb{E}[F(w_t)] \not\to F^*$, where $F^*$ is the optimal value of $F(w)$.
The proof of Theorem 1 is given in the appendix, where we show that for large enough $\delta$, the expected ADAM update between two consecutive iterates is always positive. As a consequence, the iterates keep drifting away from the optimal solution. The divergent example also tells us that strong convexity and the removal of constraints cannot help with the convergence of ADAM.
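The drift can be observed numerically. Below is a short simulation sketch of ADAM (with bias correction, as in [Kingma and Ba, 2015]) on OP($\delta$); the hyper-parameter values are illustrative defaults of ours, and the iterate wanders away from the minimizer $w^* = -\delta^2$, consistent with what Theorem 1 predicts in expectation:

```python
import numpy as np

def simulate_adam_on_op(delta=10.0, w=0.0, T=100_000, alpha=0.001,
                        beta1=0.9, beta2=0.999, eps=1e-8, seed=0):
    """ADAM with bias correction on OP(delta); returns the final iterate."""
    rng = np.random.default_rng(seed)
    p = (1 + delta) / (1 + delta**4)              # P(xi = 1)
    m = v = 0.0
    for t in range(1, T + 1):
        # Sample xi and evaluate G(w; xi).
        g = w / delta + (delta**4 if rng.random() < p else -1.0)
        m = beta1 * m + (1 - beta1) * g
        v = beta2 * v + (1 - beta2) * g * g
        m_hat = m / (1 - beta1**t)
        v_hat = v / (1 - beta2**t)
        w -= alpha * m_hat / (np.sqrt(v_hat) + eps)
    return w                                      # compare with w* = -delta**2
```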
Based on the construction of OP($\delta$), we can give divergent examples for any fixed mini-batch size.

Theorem 2 For any fixed $b$, there exists an $N_b$ such that for any $N > N_b$, there exists a mini-batch problem with sample size $N$ and batch size $b$ on which ADAM diverges for any initial point.
Even if the batch size is unreasonably large, say $b = N - 1$, we can still construct a divergent example based on OP($\delta$), as stated next.

Theorem 3 There exists an $N^*$ such that for any $N > N^*$, there exists a mini-batch problem with sample size $N$ and batch size $b = N - 1$ on which ADAM diverges for any initial point.
In conclusion, Theorems 2 and 3 extinguish the hope of finding a large enough batch size for stochastic ADAM to converge. Among the related works on the convergence of ADAM and the batch size, a larger batch size is always suggested, but the results in this section highlight the limitations of such approaches.
4 Motivation
In this section, we stick with the general ADAM algorithm described in Algorithm 1.

Algorithm 1 General ADAM
Require: Gradient estimator $\mathcal{G}(\cdot;\cdot)$, seed generation rule $P_\xi$, initial point $w_1$, mini-batch size $b$, learning rate $\alpha_t$, exponential decay rates $\beta_1, \beta_2 \in [0, 1)$, denominator hyper-parameter $\epsilon > 0$.
  $m_0 \leftarrow 0$, $v_0 \leftarrow 0$
  for $t = 1, \dots, T$ do
    Sample $\xi_t \sim P_\xi$
    $g_t \leftarrow \mathcal{G}(w_t; \xi_t)$
    $m_t \leftarrow \beta_1 m_{t-1} + (1 - \beta_1)\, g_t$
    $v_t \leftarrow \beta_2 v_{t-1} + (1 - \beta_2)\, g_t \odot g_t$
    $V_t \leftarrow \mathrm{diag}(v_t) + \epsilon I_d$
    $w_{t+1} \leftarrow w_t - \alpha_t V_t^{-1/2} m_t$
  end for

For the analysis, we make several assumptions on the gradient estimator and the objective.

Assumption 1 The gradient estimator $\mathcal{G}$, which maps a point $w \in \mathbb{R}^d$ and a random seed $\xi$ to $\mathbb{R}^d$, and the objective $F: \mathbb{R}^d \to \mathbb{R}$ satisfy the following:

1. $\mathcal{G}$ is unbiased, i.e., for any $w \in \mathbb{R}^d$, $\mathbb{E}_\xi[\mathcal{G}(w;\xi)] = \nabla F(w)$.

2. There exists a constant $0 < L < +\infty$ such that for any $\xi$ and $w, \bar{w} \in \mathbb{R}^d$, we have $\|\mathcal{G}(w;\xi) - \mathcal{G}(\bar{w};\xi)\|_2 \leq L \|w - \bar{w}\|_2$ and $\|\nabla F(w) - \nabla F(\bar{w})\|_2 \leq L \|w - \bar{w}\|_2$.

3. There exists a constant $0 < G < +\infty$ such that for any $\xi$ and $w \in \mathbb{R}^d$, we have $\|\mathcal{G}(w;\xi)\|_2 \leq G$ and $\|\nabla F(w)\|_2 \leq G$.
At this point, convexity is not needed. We mainly focus on the variance of the gradient estimator. The common assumptions in the literature are that the variance is bounded by a constant [Zaheer et al., 2018], or by a linear function of the squared norm of the objective's gradient, $\mathrm{Var}(\mathcal{G}_i(w;\xi)) \leq C_1 + C_2 \|\nabla F(w)\|_2^2$ [Bottou et al., 2018]. Another assumption, made in [Shi et al., 2020, Vaswani et al., 2019], is the 'strong growth condition' $\sum_{n=1}^N \|\nabla f_n(w)\|_2^2 \leq C \|\nabla F(w)\|_2^2$ for some $C > 0$. Note that for vanilla ADAM, where $\mathcal{G}(w;\xi) = \nabla F_B(w)$, the strong growth condition implies that $\nabla F_B(w) = 0$ if and only if $\nabla F(w) = 0$. As a result, the strong growth condition implies that $\mathrm{Var}(\mathcal{G}(w;\xi)) \leq 2L\, \mathbb{E}[\|w - w^*\|_2^2]$, given Lipschitz smooth gradients for the full-batch and mini-batch losses. For iterates close to a saddle point, the variance is automatically reduced, because $\|w - w^*\|_2^2$ is small. However, the strong growth condition is so strong that the majority of practical problems do not satisfy it. In fact, one observation about OP($\delta$) is that its variance is a constant, which also breaks the strong growth condition.
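To verify the last claim (a direct computation on our part): on OP($\delta$), the estimator $G(w;\xi)$ takes the two values $w/\delta + \delta^4$ and $w/\delta - 1$, whose gap $\delta^4 + 1$ does not depend on $w$, so with $p = P(\xi = 1)$,
$$\mathrm{Var}\big(G(w;\xi)\big) = p(1-p)\,(\delta^4 + 1)^2 = (1+\delta)(\delta^4 - \delta),$$
a strictly positive constant independent of $w$.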
In this section, as a motivating result, let us assume that the variance of the gradient estimator is reduced a priori. Let us denote a sequence of positive constants $\{\lambda_t\}_{t=1}^T$ such that for any $t = 1, \dots, T$ we have $\mathrm{Var}(\mathcal{G}(w_t;\xi_t)) \leq \lambda_t$. For an objective with a finite lower bound, we have the following result.
Theorem 4 Let Assumption 1 be satisfied, and assume that $F(w)$ is lower bounded by $F_{\inf} > -\infty$. Then for any initial point $w_1$, ADAM satisfies
$$\min_{1 \leq t \leq T} \mathbb{E}\left[\|\nabla F(w_t)\|_2^2\right] \leq \mathcal{O}\left(\frac{\sum_{t=1}^T \alpha_t^2}{\sum_{t=1}^T \alpha_t} + \frac{\sum_{t=1}^T \alpha_t \lambda_t}{\sum_{t=1}^T \alpha_t}\right).$$
The proof is in the appendix. Let us assume that the two common conditions $\sum_{t=1}^\infty \alpha_t = \infty$ and $\sum_{t=1}^\infty \alpha_t^2 < \infty$ are satisfied. Theorem 4 then shows that ADAM converges if $\sum_{t=1}^\infty \alpha_t \lambda_t < +\infty$. In fact, $\lambda_t \to 0$ as $t \to \infty$ already implies that $\sum_{t=1}^T \alpha_t \lambda_t / \sum_{t=1}^T \alpha_t \to 0$, and hence it leads to convergence of the algorithm.
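For completeness, the last implication follows from a standard Cesàro-type argument (ours, not the appendix proof): given $\varepsilon > 0$, pick $t_0$ with $\lambda_t \leq \varepsilon$ for all $t \geq t_0$; then
$$\frac{\sum_{t=1}^T \alpha_t \lambda_t}{\sum_{t=1}^T \alpha_t} \leq \frac{\sum_{t=1}^{t_0} \alpha_t \lambda_t}{\sum_{t=1}^T \alpha_t} + \varepsilon \longrightarrow \varepsilon \quad \text{as } T \to \infty,$$
since $\sum_t \alpha_t = \infty$; as $\varepsilon$ is arbitrary, the ratio vanishes.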
We emphasize that since the assumption on the variance is made along the algorithmic iterates $\{w_t\}_{t=1}^T$, it is very difficult to check for a specific problem in advance. However, we have shown that if the variance vanishes, an ADAM-type algorithm converges. We show next that the algorithm we propose has vanishing variance and, furthermore, is convergent.
5 Variance Reduced ADAM
Algorithm 2 Variance Reduced ADAM
Require: Loss functions $\{f_n(w)\}_{n=1}^N$, initial point $\widetilde{w}_1$, learning rate $\alpha_t$, exponential decay rates $\beta_1, \beta_2 \in [0, 1)$, denominator hyper-parameter $\epsilon > 0$, inner iteration size $m$. Initialize $m^{(0)}_m \leftarrow 0$, $v^{(0)}_m \leftarrow 0$.
  for $t = 1, \dots, T$ do
    Compute the full-batch gradient $\nabla F(\widetilde{w}_t)$
    $w^{(t)}_1 \leftarrow \widetilde{w}_t$
    Option A (Resetting): $m^{(t)}_0 \leftarrow 0$, $v^{(t)}_0 \leftarrow 0$
    Option B (No Resetting): $m^{(t)}_0 \leftarrow m^{(t-1)}_m$ and $v^{(t)}_0 \leftarrow v^{(t-1)}_m$
    for $k = 1, \dots, m$ do
      Sample $B^{(t)}_k$ from $\{1, \dots, N\}$ with $|B^{(t)}_k| = b$
      $g^{(t)}_k \leftarrow \nabla F_{B^{(t)}_k}\big(w^{(t)}_k\big) - \nabla F_{B^{(t)}_k}(\widetilde{w}_t) + \nabla F(\widetilde{w}_t)$
      $m^{(t)}_k \leftarrow \beta_1 m^{(t)}_{k-1} + (1 - \beta_1)\, g^{(t)}_k$
      $v^{(t)}_k \leftarrow \beta_2 v^{(t)}_{k-1} + (1 - \beta_2)\, g^{(t)}_k \odot g^{(t)}_k$
      Option A: $\widetilde{m}^{(t)}_k \leftarrow m^{(t)}_k / (1 - \beta_1^k)$, $\widetilde{v}^{(t)}_k \leftarrow v^{(t)}_k / (1 - \beta_2^k)$
      Option B: $\widetilde{m}^{(t)}_k \leftarrow m^{(t)}_k / (1 - \beta_1^{k + (t-1)m})$, $\widetilde{v}^{(t)}_k \leftarrow v^{(t)}_k / (1 - \beta_2^{k + (t-1)m})$
      $V^{(t)}_k \leftarrow \mathrm{diag}\big(\widetilde{v}^{(t)}_k\big) + \epsilon I_d$
      $w^{(t)}_{k+1} \leftarrow w^{(t)}_k - \alpha_t \big(V^{(t)}_k\big)^{-1/2} \widetilde{m}^{(t)}_k$
    end for
    $\widetilde{w}_{t+1} \leftarrow w^{(t)}_{m+1}$
  end for
Variance reduction for random variables is a common topic in many fields. In general, an unbiased variance-reduced version of a random variable $X$ is $\widetilde{X} = X - Y + \mathbb{E}[Y]$, which has variance $\mathrm{Var}(\widetilde{X}) = \mathrm{Var}(X) + \mathrm{Var}(Y) - 2\,\mathrm{Cov}(X, Y) < \mathrm{Var}(X)$ given $\mathrm{Cov}(X, Y) > \mathrm{Var}(Y)/2$, i.e., given that $X$ and $Y$ are positively correlated at a sufficient level.
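A five-line numerical check of this identity (our own illustration with Gaussian variables):

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(size=100_000)               # Var(X) = 1
y = x + 0.3 * rng.normal(size=100_000)     # Cov(X, Y) = 1 > Var(Y)/2 = 0.545
x_tilde = x - y + y.mean()                 # E[Y] replaced by its sample estimate
print(x.var(), x_tilde.var())              # about 1.0 vs about 0.09
```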
In the context of stochastic gradient descent, the random variable for variance reduction is $\mathcal{G}(w_t;\xi_t)$, the gradient of the mini-batch loss $F_{B_t}(w_t)$. Johnson and Zhang [Johnson and Zhang, 2013] proposed a solution for SGD: they suggested taking as the associate random variable the gradient of the same mini-batch loss at a previous iterate $\widetilde{w}$. Since the expectation of a mini-batch gradient is the full-batch gradient, the descent direction becomes $g_t = \nabla F_{B_t}(w_t) - \nabla F_{B_t}(\widetilde{w}) + \nabla F(\widetilde{w})$. Vector $\widetilde{w}$ is known as the snapshot model. Since calculating the full batch gradient at $\widetilde{w}$ is required, [Johnson and Zhang, 2013] proposed to save the snapshot model every $m$ iterations; this is the SVRG algorithm. Inspired by SVRG and motivated by the result in Section 4, we propose the combination of the variance reduction method and ADAM, called VRADAM (Algorithm 2).
An intuitive analysis of the variance of the update direction,
$$\mathrm{Var}\big(g^{(t)}_{k,i}\big) = \mathrm{Var}\Big(\nabla_i F_{B^{(t)}_k}\big(w^{(t)}_k\big) - \nabla_i F_{B^{(t)}_k}(\widetilde{w}_t)\Big) \leq \mathbb{E}\Big[\Big(\nabla_i F_{B^{(t)}_k}\big(w^{(t)}_k\big) - \nabla_i F_{B^{(t)}_k}(\widetilde{w}_t)\Big)^2\Big] \leq L^2\, \mathbb{E}\Big[\big\|w^{(t)}_k - \widetilde{w}_t\big\|_2^2\Big],$$
is that as the iterates become close to the optimal point, the variance is reduced simultaneously, which guarantees a condition on the variance similar to the strong growth condition.
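To fix ideas, here is a minimal NumPy sketch of Algorithm 2; it assumes the losses are given through per-sample gradient functions, and all names and default hyper-parameter values are our own illustrative choices:

```python
import numpy as np

def vradam(grad_fns, w, T=50, m=100, b=8, alpha=1e-3,
           beta1=0.9, beta2=0.999, eps=1e-8, reset=True, seed=0):
    """Sketch of Algorithm 2. grad_fns[n](w) returns the gradient of f_n at w (float array)."""
    rng = np.random.default_rng(seed)
    N = len(grad_fns)
    mom = vel = np.zeros_like(w)
    for t in range(1, T + 1):
        w_snap = w.copy()                                  # snapshot w~_t
        full_grad = sum(g(w_snap) for g in grad_fns) / N   # grad F(w~_t)
        if reset:                                          # Option A
            mom, vel = np.zeros_like(w), np.zeros_like(w)
        for k in range(1, m + 1):
            batch = rng.choice(N, size=b, replace=False)
            gb = sum(grad_fns[n](w) for n in batch) / b
            gs = sum(grad_fns[n](w_snap) for n in batch) / b
            g = gb - gs + full_grad                        # variance-reduced direction
            mom = beta1 * mom + (1 - beta1) * g
            vel = beta2 * vel + (1 - beta2) * g * g
            step = k if reset else k + (t - 1) * m         # bias-correction index
            m_hat = mom / (1 - beta1 ** step)
            v_hat = vel / (1 - beta2 ** step)
            w = w - alpha * m_hat / np.sqrt(v_hat + eps)
    return w
```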
5.1 Resetting/No resetting options
We provide two options with regard to the update of the ADAM states. In one option, we reinitialize the ADAM states at the beginning of each outer iteration, while the other option keeps the states throughout the whole training process. Although for the original ADAM resetting the states harms the performance of the algorithm, we found computationally that the resetting option works better in VRADAM. Intuitively, this is because in each inner loop the first step $g^{(t)}_1$ is always the full gradient direction, which makes a more efficient update than a direction adapted by previous ADAM states.
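In the NumPy sketch above, the two options differ only in the `reset` branch (whether `mom` and `vel` are zeroed at the start of each outer iteration) and in the exponent used for bias correction (`k` versus `k + (t - 1) * m`); everything else is shared.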
In order to support our argument, we provide a theoretical analysis of an example. If we fix the initial point $w_1 \in \mathbb{R}$ and the mini-batch losses $F_{B^{(1)}_1}, F_{B^{(1)}_2}, \dots, F_{B^{(1)}_m}$, the iterates are identical between the two options throughout the $t = 1$ iteration. We consider the objective values after the first update in the second outer iteration, i.e., $F\big(w^{(2)}_2\big)$. At the end of the $t = 1$ iteration, we obtain $w^{(1)}_{m+1} = w^{(2)}_1 = \widetilde{w}_2$ and the ADAM states $m^{(1)}_m$ and $v^{(1)}_m$. Then while $t = 2$, the