such as the nonlinear conjugate gradient (CG) method [24], no line-search or trust-region technique is performed in AA, which is a significant advantage in large-scale unconstrained optimization. Empirically, AA is observed to be quite successful at accelerating convergence. We refer readers to [9] for a recent survey of acceleration methods.
However, classical AA has one significant drawback: it is expensive in terms of memory as well as computational cost, especially in the nonconvex stochastic setting, where only sublinear convergence can be expected when only stochastic gradients are accessible [3]. In light of this, a number of variants of AA have been proposed that aim at improving its performance and robustness (e.g., [39, 63, 62, 56, 67]). The above-cited works focus on improving the convergence behavior of AA, but they do not consider reducing the memory cost. In machine learning, we often encounter practical situations where the number of parameters is quite large, and for this reason it is not practical to use a large number of vectors in the acceleration method. It is not clear whether the symmetric structure of the Hessian can be exploited in a scheme like AA to reduce the memory cost while still maintaining convergence guarantees. In this paper, we demonstrate how this can be accomplished with a new algorithm that is superior to AA in practice.
Our contributions.
This paper develops a nonlinear acceleration method, the nonlinear Truncated Generalized Conjugate Residual method (nlTGCR), which takes advantage of symmetry. This work is motivated by the observation that the Hessian of a nonlinear function $f$, i.e., the Jacobian of its gradient mapping, is symmetric, and that more effective, conjugate gradient-like schemes can therefore be exploited.
We demonstrate that nonlinear acceleration methods can benefit from the symmetry property of the
Hessian. In particular, we study both linear and nonlinear problems and give a systematic analysis of
TGCR and nlTGCR. We show that TGCR is efficient and optimal for linear problems. By viewing
the method from the angle of an inexact Newton approach, we also show that adding a few global convergence strategies ensures global convergence of nlTGCR.
We complement our theoretical results with numerical simulations on several different problems. The experimental results demonstrate the advantages of our methods. To the best of our knowledge, this is the first work to investigate and improve the AA dynamics by exploiting the symmetry of the Hessian.
Related work.
Designing efficient optimization methods has received much attention. Several recent works [65, 4, 40, 48, 18] consider second-order optimization methods that employ sketching or approximation techniques. Different from these approaches, our method is a first-order method that exploits the symmetry of the Hessian instead of constructing it. A variant of the inexact Newton method was proposed in [47], where the least-squares sub-problems are solved approximately by the Minimum Residual method. Similarly, a new type of quasi-Newton symmetric update [54] uses several secant equations in a least-squares sense. These approaches share the same goal as ours. However, they are more closely related to secant or multi-secant techniques; as will be argued, our approach does a better job of capturing the nonlinearity of the problem. A short-term AA algorithm was proposed in [63]; it differs from ours in that it is still based on the parameter sequence instead of the gradient sequence and does not exploit the symmetry of the Hessian.
2 Background
2.1 Extrapolation, acceleration, and the Anderson Acceleration procedure
Consider a general fixed-point problem and the associated fixed-point iteration as shown in (1). Denote by $r_j = H(x_j) - x_j$ the residual vector at the $j$-th iteration. Classical extrapolation methods, including RRE, MPE, and the vector $\epsilon$-algorithm, have been designed to accelerate the convergence of the original sequence by generating a new and independent sequence of the form $t_j^{(k)} = \sum_{i=0}^{k} \alpha_i x_{j+i}$.
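To make the setup concrete, the sketch below (not from the paper) runs a fixed-point iteration for a placeholder mapping $H$, records the residuals $r_j = H(x_j) - x_j$, and forms one extrapolated point $t_j^{(k)}$; the mapping and the coefficients $\alpha_i$ are illustrative assumptions, since methods such as RRE and MPE determine the $\alpha_i$ from the residuals.

```python
import numpy as np

# Minimal sketch (not from the paper): run the fixed-point iteration x_{j+1} = H(x_j)
# for a placeholder contractive mapping H, record the residuals r_j = H(x_j) - x_j,
# and form one extrapolated point t_j^(k) = sum_{i=0}^{k} alpha_i * x_{j+i}.
def H(x):
    # Illustrative linear contraction; any contractive mapping would do here.
    A = np.array([[0.5, 0.1], [0.0, 0.3]])
    b = np.array([1.0, -2.0])
    return A @ x + b

x = np.zeros(2)
iterates, residuals = [x], []
for j in range(6):
    residuals.append(H(x) - x)   # residual r_j = H(x_j) - x_j
    x = H(x)                     # plain fixed-point step
    iterates.append(x)

# Extrapolated combination of k+1 consecutive iterates (weights sum to one).
# These alpha_i are purely illustrative; RRE/MPE compute them from the residuals.
k, j0, alpha = 2, 0, np.array([0.2, 0.3, 0.5])
t = sum(a * iterates[j0 + i] for i, a in enumerate(alpha))
print("x_6 =", iterates[-1], " t_0^(2) =", t)
```

Note that the extrapolated point $t$ above is a side product: it is never fed back into the iteration that produces the $x_j$.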
An important characteristic of these classical extrapolation methods is that the two sequences are not mixed, in the sense that no accelerated iterate $t_j^{(k)}$ is used to produce the iterate $x_j$. These extrapolation methods must be distinguished from acceleration methods, such as the AA procedure, which aim at generating their own sequences to find a fixed point of a certain mapping $H$.
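To contrast the two families in code, the schematic below shows an AA-style step in which the combined point is fed back as the next iterate, unlike the extrapolation sketch above where $t_j^{(k)}$ never re-enters the iteration. It assumes the commonly used difference-based least-squares formulation with mixing parameter $\beta = 1$; it is a sketch of a generic AA step, not the specific procedure developed in this paper.

```python
import numpy as np

def aa_step(xs, fs):
    """Schematic AA-style update (assumed standard form, mixing beta = 1).

    xs: list of recent iterates x_i; fs: list of residuals F_i = H(x_i) - x_i.
    Chooses gamma minimizing ||F_k - dF @ gamma|| and returns the combined
    point, which becomes the next iterate (the sequences are mixed).
    """
    X = np.column_stack(xs)
    F = np.column_stack(fs)
    dX, dF = np.diff(X, axis=1), np.diff(F, axis=1)   # consecutive differences
    gamma, *_ = np.linalg.lstsq(dF, F[:, -1], rcond=None)
    x_bar = X[:, -1] - dX @ gamma                     # combined iterates
    f_bar = F[:, -1] - dF @ gamma                     # combined residuals
    return x_bar + f_bar                              # beta = 1 mixing
```

A driver would call `aa_step` on a sliding window of the last few iterates and append the result to both lists; this feedback of the combined point is precisely what distinguishes acceleration from extrapolation.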
AA was originally designed to solve a system of nonlinear equations written in the form $F(x) = H(x) - x = 0$ [1, 61, 41, 30]. Denote $F_i = F(x_i)$. AA starts with an initial $x_0$ and sets $x_1 =$