
ADAM. The second variant proposed in [Reddi et al., 2018], called ADAMNC, requires the second-order moment hyper-parameter $\beta_2$ to increase and to satisfy several conditions. However, these conditions are hard to check. Although the authors claim that $\beta_{2,t} = 1 - 1/t$ satisfies the conditions, this case is actually AdaGrad, which is already well known to converge. Zhou et al. [Zhou et al., 2019] analyzed the divergent example in [Reddi et al., 2018], pointed out that the correlation between $v_t$ and $g_t$ causes the divergence of ADAM, and proposed a decorrelated variant of ADAM. The theoretical analysis in [Zhou et al., 2019] relies on complex assumptions, and no convergence analysis of their algorithm is provided. Several other works, such as [Guo et al., 2021, Shi et al., 2020, Wang et al., 2019, Zou et al., 2019], suggested that properly tuning the hyper-parameters of ADAM-type algorithms helps with convergence in practice.
It is empirically well known that a larger batch size reduces the variance of the loss in a stochastic optimization algorithm. [Qian and Klabjan, 2020] gave a theoretical proof that the variance of the stochastic gradient is proportional to $1/b$.
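As a simple illustration of this scaling (a standard argument assuming the mini-batch $B$ is drawn i.i.d. with replacement, not necessarily the exact statement of [Qian and Klabjan, 2020]): if $\sigma^2(x) = \mathbb{E}_i\|\nabla f_i(x) - \nabla f(x)\|^2$ denotes the per-sample gradient variance, then
\[
\mathbb{E}\Big\|\frac{1}{b}\sum_{i\in B}\nabla f_i(x) - \nabla f(x)\Big\|^2 = \frac{\sigma^2(x)}{b},
\]
so the variance of the mini-batch gradient decreases proportionally to $1/b$.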
Although several works connected the convergence of ADAM with the mini-batch size, a direct connection between convergence and variance is still lacking. For the full-batch case (i.e., where there is no variance), [De et al., 2018] showed that ADAM converges under a specific scheduling of learning rates. [Shi et al., 2020] showed the convergence of full-gradient ADAM and RMSProp with the learning rate schedule $\alpha_t = \alpha/\sqrt{t}$ and constants $\beta_1$ and $\beta_2$ satisfying $\beta_1 < \sqrt{\beta_2}$. For the stochastic setting with a fixed batch size, Zaheer et al. [Zaheer et al., 2018] proved that the expected norm of the gradient can be bounded within a neighborhood of 0 whose size is proportional to $1/b$. They suggested increasing the batch size with the number of iterations in order to establish convergence. One question is whether there exists a batch-size threshold $b^* < N$ such that any batch size larger than $b^*$ guarantees convergence. We show that even when $b = N-1$, there still exist divergent examples of ADAM. This means that although a large batch size helps tighten the optimality gap, the convergence issue is not resolved as long as the variance exists. Another possible convergence result would be to analyze convergence in expectation or with high probability under a stochastic starting point. However, our divergence result holds for any initial point, which rules out this possibility.
Without relying on the mini-batch size, we directly analyze the relationship between variance and the convergence of ADAM. We first show a motivating result which points out that the convergence of an ADAM-type algorithm can be implied by reducing the variance. Motivated by this, we propose a variance reduced version of ADAM, called VRADAM, and show that it converges. We provide two options regarding the resetting of the ADAM states during the full gradient steps, and recommend the resetting option based on the theoretical analysis herein and computational experiments. Finally, we conduct several computational experiments and show that our algorithm performs as well as the original version of ADAM.
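To make the idea concrete, the following Python sketch illustrates one natural instantiation of a variance reduced ADAM update. It is only an illustration under our own assumptions, not the exact VRADAM algorithm defined later in the paper: it plugs an SVRG-style gradient estimator (a periodic full-gradient snapshot plus a per-sample correction) into the standard ADAM moment updates, and the reset_states flag stands in for the resetting option discussed above. All function and parameter names (vradam_svrg_sketch, grad_i, inner_steps, and so on) are hypothetical.

import numpy as np

def vradam_svrg_sketch(grad_i, x0, n_samples, n_epochs=10, inner_steps=100,
                       alpha=1e-3, beta1=0.9, beta2=0.999, eps=1e-8,
                       reset_states=True, rng=None):
    """grad_i(x, i) returns the gradient of the i-th sample loss at x."""
    rng = np.random.default_rng() if rng is None else rng
    x = np.asarray(x0, dtype=float).copy()
    m = np.zeros_like(x)   # first moment estimate
    v = np.zeros_like(x)   # second moment estimate
    t = 0                  # step counter used for bias correction

    for _ in range(n_epochs):
        # Full-gradient snapshot: the anchor point of the variance reduced estimator.
        x_snap = x.copy()
        full_grad = np.mean([grad_i(x_snap, i) for i in range(n_samples)], axis=0)

        if reset_states:
            # The resetting option: clear the ADAM states at every snapshot.
            m = np.zeros_like(x)
            v = np.zeros_like(x)
            t = 0

        for _ in range(inner_steps):
            i = rng.integers(n_samples)
            # SVRG-style variance reduced gradient estimate.
            g = grad_i(x, i) - grad_i(x_snap, i) + full_grad

            # Standard ADAM moment updates with bias correction.
            t += 1
            m = beta1 * m + (1.0 - beta1) * g
            v = beta2 * v + (1.0 - beta2) * g * g
            m_hat = m / (1.0 - beta1 ** t)
            v_hat = v / (1.0 - beta2 ** t)
            x = x - alpha * m_hat / (np.sqrt(v_hat) + eps)

    return x

if __name__ == "__main__":
    # Toy usage: minimize (1/N) * sum_i 0.5 * ||x - a_i||^2, whose optimum is mean(a_i).
    a = np.random.default_rng(0).normal(size=(50, 3))
    x_hat = vradam_svrg_sketch(lambda x, i: x - a[i], np.zeros(3), n_samples=50)
    print(x_hat, a.mean(axis=0))

With reset_states=False the moments m, v and the counter t carry over across snapshots; keeping both options in the sketch mirrors the two resetting variants compared above.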
In Section 3, we show a divergent example. Arguing by contradiction, we assume that the algorithm converges and show that the expected update of the iterates is larger than a positive constant, which makes it impossible for the algorithm to converge to an optimal solution and contradicts the assumption. In Section 5, we prove the convergence of VRADAM. The main proof technique is to properly bound the difference between the estimated gradients and the true gradients. By bounding the change of the objective function in each iteration, we can further employ the strong convexity assumption and conclude convergence.
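As a generic illustration of how strong convexity closes the argument (a standard inequality, not necessarily the exact step used in Section 5): if $f$ is $\mu$-strongly convex with minimizer $x^*$, then
\[
\|\nabla f(x)\|^2 \ge 2\mu\left(f(x) - f(x^*)\right),
\]
so driving the (expected) gradient norm to zero also drives the optimality gap $f(x) - f(x^*)$ to zero.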
Our contributions are as follows.
1. We provide an unconstrained and strongly convex stochastic optimization problem on which the original ADAM diverges. We show that the divergence holds for any initial point, which rules out all possible weaker convergence results under a stochastic starting point.
2. We construct a divergent mini-batch problem with $b = N-1$, and conclude that there does not exist a convergence threshold for the mini-batch size.
3. We propose a variance reduced version of ADAM. We provide convergence results of the variance reduced version, both for strongly convex objectives (convergence to optimality) and for non-convex objectives. We show by experiments that the variance reduction does not harm the numerical performance of ADAM.
In Section 2, we review the literature on the convergence/divergence issue of ADAM and on variance reduction methods in optimization. In Section 3, we provide divergent examples for stochastic ADAM. We show that the example is divergent even for large batch sizes, which disproves the existence of a convergence threshold for the mini-batch size. In Section 4, we start from a variance reduction condition and prove the convergence of an ADAM-type algorithm under this condition. In Section 5, we propose a variance reduced version of ADAM. We show that resetting the states in the algorithm helps with the performance. We also provide a convergence result for our variance reduced ADAM. In Section 6, we conduct several numerical experiments and show the convergence and sensitivity of the proposed algorithm.
2 Literature Review
Convergence of ADAM: Reddi et al. [Reddi et al., 2018] first pointed out the convergence issue of ADAM and proposed two convergent variants: (a) AMSGrad takes the