An Efficient Nonlinear Acceleration method that
Exploits Symmetry of the Hessian
Huan He
Harvard University
huan_he@hms.harvard.edu
Shifan Zhao
Emory University
szhao89@emory.edu
Ziyuan Tang
University of Minnesota
tang0389@umn.edu
Joyce C Ho
Emory University
joyce.c.ho@emory.edu
Yousef Saad
University of Minnesota
saad@umn.edu
Yuanzhe Xi
Emory University
yuanzhe.xi@emory.edu
Abstract
Nonlinear acceleration methods are powerful techniques to speed up fixed-point
iterations. However, many acceleration methods require storing a large number of
previous iterates and this can become impractical if computational resources are
limited. In this paper, we propose a nonlinear Truncated Generalized Conjugate
Residual method (nlTGCR) whose goal is to exploit the symmetry of the Hessian
to reduce memory usage. The proposed method can be interpreted as either an
inexact Newton or a quasi-Newton method. We show that, with the help of global
strategies like residual check techniques, nlTGCR can converge globally for general
nonlinear problems and that under mild conditions, nlTGCR is able to achieve
superlinear convergence. We further analyze the convergence of nlTGCR in a
stochastic setting. Numerical results demonstrate the superiority of nlTGCR when
compared with several other competitive baseline approaches on a few problems.
Our code will be available in the future.
1 Introduction
In this paper, we consider solving the fixed-point problem:
\[ \text{Find } x \in \mathbb{R}^n \text{ such that } x = H(x). \tag{1} \]
This problem has received a surge of interest due to its wide range of applications in mathematics, computational science and engineering. Most optimization algorithms are iterative, and their goal is to find a related fixed-point of the form (1), where $H: \mathbb{R}^n \rightarrow \mathbb{R}^n$ is the iteration mapping, which can potentially be nonsmooth or noncontractive. When the optimization problem is convex, $H$ is typically nonexpansive, and the solution set of the fixed-point problem is the same as that of the original optimization problem, or closely related to it. Consider the simple fixed-point iteration $x_{k+1} = H(x_k)$, which produces a sequence of iterates $\{x_0, x_1, \cdots, x_K\}$. When the iteration converges, its limit is a fixed-point, i.e., $x_* = H(x_*)$. However, an issue with fixed-point iteration is that it does not always converge, and when it does, it might reach the limit very slowly.
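To make the setting concrete, the following minimal Python sketch runs the plain fixed-point iteration above on a toy contractive map; the mapping, tolerance, and iteration cap are illustrative choices of ours, not taken from the paper.

# Minimal sketch of the plain fixed-point iteration x_{k+1} = H(x_k).
import numpy as np

def fixed_point_iteration(H, x0, tol=1e-8, max_iter=1000):
    x = np.asarray(x0, dtype=float)
    for k in range(max_iter):
        x_new = H(x)
        if np.linalg.norm(x_new - x) <= tol:   # stop when the update stalls
            return x_new, k + 1
        x = x_new
    return x, max_iter

# Example: a contractive affine map, so the iteration converges, although
# slowly when the contraction factor is close to 1.
A = 0.9 * np.eye(2)
b = np.array([1.0, -2.0])
x_star, iters = fixed_point_iteration(lambda x: A @ x + b, np.zeros(2))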
To address this issue, a number of acceleration methods have been proposed and studied over the years, such as the reduced-rank extrapolation (RRE) [58], minimal-polynomial extrapolation (MPE) [12], modified MPE (MMPE) [33], and the vector $\epsilon$-algorithms [8]. Besides these algorithms, Anderson Acceleration (AA) [1] has received enormous recent attention due to its nice properties and its success in machine learning applications [56, 22, 57, 14, 60, 44, 25, 63]. In practice, since computing the Hessian of the objective function is commonly difficult or even unavailable, AA can be seen as a practical alternative to Newton's method [34]. Also, compared with the classical iterative methods
such as the nonlinear conjugate gradient (CG) method [24], no line-search or trust-region technique is performed in AA, and this is a big advantage in large-scale unconstrained optimization. Empirically, it is observed that AA is quite successful in accelerating convergence. We refer readers to [9] for a recent survey of acceleration methods.
However, classical AA has a notable drawback: it is expensive in terms of memory as well as computational cost, especially in a nonconvex stochastic setting, where only sublinear convergence can be expected when only stochastic gradients can be accessed [3]. In light of this, a number of variants of AA have been proposed which aim at improving its performance and robustness (e.g., [39, 63, 62, 56, 67]). The above-cited works focus on improving the convergence behavior of
AA, but they do not consider reducing the memory cost. In machine learning, we often encounter
practical situations where the number of parameters is quite large and for this reason, it is not practical
to use a large number of vectors in the acceleration methods. It is not clear whether or not the
symmetric structure of the Hessian can be exploited in a scheme like AA to reduce the memory cost
while still maintaining the convergence guarantees. In this paper, we will demonstrate how this can
be accomplished with a new algorithm that is superior to AA in practice.
Our contributions.
This paper develops a nonlinear acceleration method, the nonlinear Truncated Generalized Conjugate Residual method (nlTGCR), that takes advantage of symmetry. This work is motivated by the observation that the Hessian of a nonlinear function, i.e., the Jacobian of its gradient mapping $\nabla f$, is symmetric, and therefore more effective, conjugate gradient-like schemes can be exploited.
We demonstrate that nonlinear acceleration methods can benefit from the symmetry property of the
Hessian. In particular, we study both linear and nonlinear problems and give a systematic analysis of
TGCR and nlTGCR. We show that TGCR is efficient and optimal for linear problems. By viewing
the method from the angle of an inexact Newton approach, we also show that adding a few global
convergence strategies ensures that nlTGCR can achieve global convergence guarantees.
We complement our theoretical results with numerical simulations on several different problems. The experimental results demonstrate the advantages of our methods. To the best of our knowledge, this is the first work to investigate and improve the AA dynamics by exploiting symmetry of the Hessian.
Related work.
Designing efficient optimization methods has received much attention. Several recent works [65, 4, 40, 48, 18] consider second-order optimization methods that employ sketching or approximation techniques. Different from these approaches, our method is a first-order method that utilizes symmetry of the Hessian instead of constructing it. A variant of the inexact Newton method was proposed in [47], where the least-squares sub-problems are solved approximately using the Minimum Residual method. Similarly, a new type of quasi-Newton symmetric update [54] uses several secant equations in a least-squares sense. These approaches have the same goal as ours. However, they are more closely related to secant or multi-secant techniques, whereas, as will be argued, our method does a better job of capturing the nonlinearity of the problem. [63] proposed a short-term AA algorithm that is different from ours because it is still based on the parameter sequence instead of the gradient sequence and does not exploit symmetry of the Hessian.
2 Background
2.1 Extrapolation, acceleration, and the Anderson Acceleration procedure
Consider a general fixed-point problem and the associated fixed-point iteration as shown in (1). Denote by $r_j = H(x_j) - x_j$ the residual vector at the $j$-th iteration. Classical extrapolation methods, including RRE, MPE and the vector $\epsilon$-algorithm, have been designed to accelerate the convergence of the original sequence by generating a new and independent sequence of the form $t_j^{(k)} = \sum_{i=0}^{k} \alpha_i x_{j+i}$. An important characteristic of these classical extrapolation methods is that the two sequences are not mixed, in the sense that no accelerated item $t_j^{(k)}$ is used to produce the iterate $x_j$. These extrapolation methods must be distinguished from acceleration methods such as the AA procedure, which aim at generating their own sequences to find a fixed point of a certain mapping $H$.
AA was originally designed to solve a system of nonlinear equations written in the form $F(x) = H(x) - x = 0$ [1, 61, 41, 30]. Denote $F_i = F(x_i)$. AA starts with an initial $x_0$ and sets $x_1 = H(x_0) = x_0 + \beta F_0$, where $\beta > 0$ is a parameter. At step $j > 1$ we define $X_j = [x_{j-m}, \ldots, x_{j-1}]$ and $\bar{F}_j = [F_{j-m}, \ldots, F_{j-1}]$ along with the differences:
\[ \mathcal{X}_j = [\Delta x_{j-m} \cdots \Delta x_{j-1}] \in \mathbb{R}^{n \times m}, \qquad \mathcal{F}_j = [\Delta F_{j-m} \cdots \Delta F_{j-1}] \in \mathbb{R}^{n \times m}. \tag{2} \]
We then define the next AA iterate as follows:
\[ x_{j+1} = x_j + \beta F_j - (\mathcal{X}_j + \beta \mathcal{F}_j)\,\theta^{(j)} \quad \text{where:} \tag{3} \]
\[ \theta^{(j)} = \operatorname*{argmin}_{\theta \in \mathbb{R}^m} \|F_j - \mathcal{F}_j \theta\|_2. \tag{4} \]
To define the next iterate in (3) the algorithm uses the term $F_{j+1} = F(x_{j+1})$, where $x_{j+1}$ is the current accelerated iterate. AA belongs to the class of multi-secant methods. Indeed, the approximation (3) can be written as:
\[ x_{j+1} = x_j - \left[-\beta I + (\mathcal{X}_j + \beta \mathcal{F}_j)(\mathcal{F}_j^T \mathcal{F}_j)^{-1}\mathcal{F}_j^T\right] F_j \equiv x_j - G_j F_j. \tag{5} \]
Thus, $G_j$ can be seen as an update to the (approximate) inverse Jacobian $G_{j-m} = -\beta I$:
\[ G_j = G_{j-m} + (\mathcal{X}_j - G_{j-m}\mathcal{F}_j)(\mathcal{F}_j^T \mathcal{F}_j)^{-1}\mathcal{F}_j^T, \tag{6} \]
and is the minimizer of $\|G_j + \beta I\|_F$ under the multi-secant condition of type II¹:
\[ G_j \mathcal{F}_j = \mathcal{X}_j. \tag{7} \]
This link between AA and Broyden multi-secant type updates was first unraveled by Eyert [21] and expanded upon in [46].

¹ Type I Broyden conditions involve approximations to the Jacobian, while type II conditions deal with the inverse Jacobian.
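As an illustration of the update (2)-(4), the following Python sketch performs a single AA step from the stored iterates and residuals; the function name, the least-squares solve via numpy.linalg.lstsq, and the variable names are our own choices rather than code from the paper.

# Sketch of one Anderson Acceleration step following (2)-(4): given the last
# m+1 iterates x_i and residuals F_i = H(x_i) - x_i, form the difference
# matrices and take x_{j+1} = x_j + beta*F_j - (X_j + beta*F_j) theta.
import numpy as np

def anderson_step(xs, Fs, beta=1.0):
    """xs, Fs: lists of the last m+1 iterates x_i and residuals F(x_i)."""
    X = np.column_stack([xs[i + 1] - xs[i] for i in range(len(xs) - 1)])   # (2)
    F = np.column_stack([Fs[i + 1] - Fs[i] for i in range(len(Fs) - 1)])
    Fj = Fs[-1]                                                            # current residual
    theta, *_ = np.linalg.lstsq(F, Fj, rcond=None)                         # (4)
    return xs[-1] + beta * Fj - (X + beta * F) @ theta                     # (3)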
2.2 Inexact and quasi-Newton methods
Consider a nonlinear system of equations $F(x) = 0$. Inexact Newton methods [15, 10] start with an initial guess $x_0$ and compute a sequence of iterates as follows:
\[ \text{Solve } J(x_j)\,\delta_j \approx -F(x_j) \tag{8} \]
\[ \text{Set } x_{j+1} = x_j + \delta_j \tag{9} \]
Here, $J(x_j)$ is the Jacobian of $F$ at the current iterate $x_j$. In (8) the system is solved inexactly, typically by some iterative method. In quasi-Newton methods [17, 54], the inverse of the Jacobian is approximated progressively. Because it is the inverse Jacobian that is approximated, the method is akin to Broyden's second (or type-II) update method. This method replaces Newton's iteration $x_{j+1} = x_j - DF(x_j)^{-1} F(x_j)$ with $x_{j+1} = x_j - G_j F(x_j)$, where $G_j$ approximates the inverse of the Jacobian $DF(x_j)$ at $x_j$ by the update formula $G_{j+1} = G_j + (\Delta x_j - G_j \Delta F_j)\, v_j^T$, in which $v_j$ is defined in different ways; see [46] for details.
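For concreteness, here is a minimal sketch of a quasi-Newton iteration with a type-II (inverse) Broyden update of the form given above; the particular choice $v_j = \Delta F_j / (\Delta F_j^T \Delta F_j)$ is only one of the possibilities alluded to in [46], and the function name, initial $G_0$, and defaults are our own assumptions.

# Minimal sketch of a quasi-Newton iteration with a type-II Broyden update
# G_{j+1} = G_j + (dx_j - G_j dF_j) v_j^T, using v_j = dF_j / (dF_j^T dF_j).
import numpy as np

def broyden_type2(F, x0, max_iter=100, tol=1e-10):
    x = np.asarray(x0, dtype=float)
    G = np.eye(x.size)                      # initial approximate inverse Jacobian
    Fx = F(x)
    for _ in range(max_iter):
        if np.linalg.norm(Fx) <= tol:
            break
        x_new = x - G @ Fx                  # quasi-Newton step
        Fx_new = F(x_new)
        dx, dF = x_new - x, Fx_new - Fx
        denom = np.dot(dF, dF)
        if denom == 0.0:                    # stagnation guard for this sketch
            break
        v = dF / denom                      # one common choice of v_j (see [46])
        G = G + np.outer(dx - G @ dF, v)    # type-II update: G_{j+1} dF_j = dx_j
        x, Fx = x_new, Fx_new
    return x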
3 Exploiting symmetry
In the following, we specifically consider the case where the nonlinear mapping $F$ is the gradient of some objective function $\phi: \mathbb{R}^n \rightarrow \mathbb{R}$ to be minimized, i.e.,
\[ F(x) = \nabla \phi(x). \]
In this situation, the Jacobian of $F$ becomes $\nabla^2 \phi$, the Hessian of $\phi$. An obvious observation here is that the symmetry of the Hessian is not taken into account in the approximate inverse Hessian update formula (6). This has only been considered in the literature (very) recently (e.g., [6, 55, 7]).
In a 1983 report, [53] showed that the matrix $G_j$ obtained by a multi-secant method that satisfies the secant condition (7) is symmetric iff the matrix $\mathcal{X}_j^T \mathcal{F}_j$ is symmetric. It is possible to explicitly force symmetry by employing generalizations of the symmetric versions of Broyden-type methods. Thus, the authors of [6, 7] developed a multisecant version of the Powell Symmetric Broyden (PSB) update due to Powell [45], while the article [55] proposed a symmetric multisecant method based on the popular Broyden-Fletcher-Goldfarb-Shanno (BFGS) approach as well as the Davidon-Fletcher-Powell (DFP) update. However, there are a number of issues with the symmetric versions of multisecant updates, some of which are discussed in [55].
We observe that when we are close to the limit, the condition $\mathcal{X}_j^T \mathcal{F}_j = \mathcal{F}_j^T \mathcal{X}_j$ is nearly satisfied. This is because if $x_*$ is the limit with $F(x_*) = 0$ we can write
\[ F(x_k) - F(x_{k-1}) = [F(x_k) - F(x_*)] - [F(x_{k-1}) - F(x_*)] \approx \nabla^2 \phi(x_*)(x_k - x_{k-1}). \tag{10} \]
This translates to $\mathcal{F}_j \approx \nabla^2\phi(x_*)\,\mathcal{X}_j$, from which it follows that $\mathcal{X}_j^T \mathcal{F}_j \approx \mathcal{X}_j^T \nabla^2\phi(x_*)\,\mathcal{X}_j$, which is a symmetric matrix under mild smoothness conditions on $\phi$. Therefore, the issue of symmetry can be mitigated if we are able to develop nonlinear acceleration methods that take advantage of near-symmetry.
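The following small Python experiment illustrates this near-symmetry numerically: close to a minimizer of a smooth $\phi$, the matrix $\mathcal{X}_j^T \mathcal{F}_j$ built from iterate and gradient differences is nearly symmetric. The test function, perturbation size, and dimensions are illustrative assumptions of ours.

# Illustration of the argument around (10): near a minimizer of a smooth phi,
# the matrix X^T F built from iterate and gradient differences is close to
# symmetric. Test function and step sizes are illustrative only.
import numpy as np

rng = np.random.default_rng(0)
n, m = 50, 5
Q = rng.standard_normal((n, n))
H = Q.T @ Q + np.eye(n)                                   # Hessian of phi at the minimizer
grad = lambda x: H @ x + 0.01 * x**3                      # F = grad(phi), mildly nonlinear

x_star = np.zeros(n)                                      # minimizer: grad(x_star) = 0
xs = [x_star + 1e-3 * rng.standard_normal(n) for _ in range(m + 1)]
X = np.column_stack([xs[i + 1] - xs[i] for i in range(m)])
F = np.column_stack([grad(xs[i + 1]) - grad(xs[i]) for i in range(m)])

S = X.T @ F
asym = np.linalg.norm(S - S.T) / np.linalg.norm(S)
print(f"relative asymmetry of X^T F near the limit: {asym:.2e}")   # small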
3.1 The linear case: Truncated GCR (TGCR)
We first consider solving the linear system $Ax = b$ with a general matrix $A$. The Generalized Conjugate Residual (GCR) algorithm, see, e.g., [19, 51], solves this linear system by building a sequence of search directions $p_i$, for $i = 0, \cdots, j$ at step $j$, so that the vectors $Ap_i$ are orthogonal to each other. With this property it is easy to generate iterates that minimize the residual at each step, and this leads to GCR; see [51, pp. 195-196] for details.
Next we will make two changes to GCR. First, we will develop a truncated version in which any given $Ap_j$ is orthogonal to the previous $m$ vectors $Ap_i$ only. This is dictated by practical considerations, because keeping all $Ap_i$ vectors may otherwise require too much memory. Second, we will keep a set of vectors for the $p_i$'s and another set for the vectors $v_i \equiv Ap_i$, for $i = 1, \cdots, j$ at step $j$, in order to avoid unnecessary additional products of the matrix $A$ with vectors. The Truncated GCR (TGCR) method is summarized in Algorithm 1.
Algorithm 1 TGCR(m)
1: Input: Matrix A, RHS b, initial x_0.
2: Set r_0 ≡ b − A x_0; v = A r_0;
3: v_0 = v/‖v‖; p_0 = r_0/‖v‖;
4: for j = 0, 1, 2, . . . , until convergence do
5:   α_j = (r_j, v_j)
6:   x_{j+1} = x_j + α_j p_j
7:   r_{j+1} = r_j − α_j v_j
8:   p = r_{j+1}; v = A p;
9:   i_0 = max(1, j − m + 1)
10:   for i = i_0 : j do
11:     β_ij := (v, A p_i)
12:     p := p − β_ij p_i;
13:     v := v − β_ij v_i;
14:   end for
15:   p_{j+1} := p/‖v‖; v_{j+1} := v/‖v‖;
16: end for
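A compact NumPy sketch of Algorithm 1 follows for reference; the function name tgcr, the relative-residual stopping test, and the use of a fixed-length deque to hold only the last m direction pairs are implementation choices of ours, not prescribed by the paper.

# Sketch of TGCR(m). Since p_i is stored rescaled by the same factor as
# v_i = A p_i, the stored v_i equals A p_i and (v, A p_i) = (v, v_i).
import numpy as np
from collections import deque

def tgcr(A, b, x0, m=10, tol=1e-10, max_iter=200):
    x = np.asarray(x0, dtype=float)
    r = b - A @ x                                   # line 2
    v = A @ r
    P = deque(maxlen=m)                             # last m search directions p_i
    V = deque(maxlen=m)                             # last m vectors v_i = A p_i
    P.append(r / np.linalg.norm(v))                 # line 3
    V.append(v / np.linalg.norm(v))
    for j in range(max_iter):
        alpha = np.dot(r, V[-1])                    # line 5
        x = x + alpha * P[-1]                       # line 6
        r = r - alpha * V[-1]                       # line 7
        if np.linalg.norm(r) <= tol * np.linalg.norm(b):
            break
        p, v = r.copy(), A @ r                      # line 8
        for p_i, v_i in zip(P, V):                  # lines 9-14: orthogonalize
            beta = np.dot(v, v_i)                   #   against the last m pairs
            p = p - beta * p_i
            v = v - beta * v_i
        nv = np.linalg.norm(v)
        P.append(p / nv)                            # line 15 (deque drops oldest)
        V.append(v / nv)
    return x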
With $m = \infty$ we obtain the non-restarted GCR method, which is equivalent to the non-restarted (i.e., full) GMRES. However, when $A$ is symmetric, but not necessarily symmetric positive definite, TGCR(1) is identical with TGCR(m) in exact arithmetic. This leads to big savings both in terms of memory and in computational costs.
Theorem 3.1. When the coefficient matrix $A$ is symmetric, TGCR(m) generates the same iterates as TGCR(1) for any $m > 0$. In addition, when $A$ is positive definite, the $k$-th residual vector $r_k = b - Ax_k$ satisfies the following inequality, where $\kappa$ is the spectral condition number of $A$:
\[ \|r_k\| \le 2\left(\frac{\sqrt{\kappa} - 1}{\sqrt{\kappa} + 1}\right)^{k} \|r_0\|. \tag{11} \]
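Assuming the tgcr sketch given after Algorithm 1, a quick numerical check of the first part of Theorem 3.1 can be run on a random symmetric positive definite system: TGCR(1) and TGCR(m) return essentially the same solution, reflecting identical iterates in exact arithmetic up to rounding. The problem setup below is illustrative.

# Illustration of Theorem 3.1 with the tgcr sketch above.
import numpy as np

rng = np.random.default_rng(1)
n = 200
Q = rng.standard_normal((n, n))
A = Q.T @ Q + n * np.eye(n)             # symmetric positive definite
b = rng.standard_normal(n)

x_m1 = tgcr(A, b, np.zeros(n), m=1, tol=1e-12, max_iter=300)
x_m10 = tgcr(A, b, np.zeros(n), m=10, tol=1e-12, max_iter=300)
print(np.linalg.norm(x_m1 - x_m10))                          # tiny, up to rounding
print(np.linalg.norm(b - A @ x_m1) / np.linalg.norm(b))      # small relative residual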
3.2 The nonlinear case: nlTGCR
Assume now that we want to solve the nonlinear problem $F(x) = 0$. We need to make three major changes to Algorithm 1. First, any residual is now the negative of $F(x)$, so Line 2 and Line 7 must be replaced by $r_0 = -F(x_0)$ and $r_{j+1} = -F(x_{j+1})$, respectively. In addition, we originally need to calculate the products $Ar_0$ and $Ap$ in Line 2 and Line 8, respectively. Here $A$ needs to be replaced by the Jacobian $J(x_j)$ of $F$ at the current iterate. We also use the notation $P_j \equiv [p_{i_0}, \cdots, p_j]$ and $V_j \equiv [J(x_{i_0})p_{i_0}, \cdots, J(x_j)p_j]$. The most important change is in Lines 5-6, where $\alpha_j$ of Algorithm 1 needs to be replaced by a vector $y_j$. This is because we write the linear model used in the form of an inexact Newton method:
\[ F(x_j + P_j y) \approx F(x_j) + [J]P_j\, y \quad \text{where} \quad [J]P_j \equiv [J(x_{i_0})p_{i_0}, \cdots, J(x_j)p_j] = V_j. \tag{12} \]
The projection method that minimizes the norm $\|F(x_j) + [J]P_j y\| = \|F(x_j) + V_j y\|$ of the right-hand side determines $y$ in such a way that
\[ F(x_j) + V_j y \perp \operatorname{Span}\{V_j\} \;\rightarrow\; (V_j)^T[F(x_j) + V_j y] = 0 \;\rightarrow\; y = V_j^T r_j, \tag{13} \]
where it is assumed the $v_i$'s are fully orthogonal. Note that in the linear case, it can be shown that $V_j^T r_j$ has only one nonzero component when one assumes that the vectors $J(x_i)p_i$ are fully orthogonal, i.e., that $i_0 = 1$ always. The nonlinear version of TGCR(m) is summarized in Algorithm 2, where the indication 'Use Frechet' means that the vector $v = J(x)u$ is to be computed as $v = (F(x + \epsilon u) - F(x))/\epsilon$ for some small $\epsilon$.
Algorithm 2 nlTGCR(m)
1: Input: F(x), initial x_0.
2: Set r_0 = −F(x_0).
3: Compute v = J(x_0) r_0;   (Use Frechet)
4: v_0 = v/‖v‖, p_0 = r_0/‖v‖;
5: for j = 0, 1, 2, . . . , until convergence do
6:   y_j = V_j^T r_j
7:   x_{j+1} = x_j + P_j y_j
8:   r_{j+1} = −F(x_{j+1})
9:   Set: p := r_{j+1};
10:   i_0 = max(1, j − m + 1)
11:   Compute v = J(x_{j+1}) p   (Use Frechet)
12:   for i = i_0 : j do
13:     β_ij := (v, v_i)
14:     p := p − β_ij p_i
15:     v := v − β_ij v_i
16:   end for
17:   p_{j+1} := p/‖v‖; v_{j+1} := v/‖v‖;
18: end for
Remark.
nlTGCR(m) requires two function evaluations per step: one in Line 8 and the other in Line 11. In the situation when computing the Jacobian is inexpensive, one can compute $Jp$ in Line 11 as a matrix-vector product, and this will reduce the number of function evaluations per step from 2 to 1. The inner loop in Lines 12-16 corresponds to exploiting the symmetry of the Hessian. At a given step, nlTGCR(m) attempts to approximate a Newton step: $x_{j+1} = x_j + \delta$, where $\delta$ is an approximate solution to $J(x_j)\delta + F(x_j) = 0$.
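The following NumPy sketch of Algorithm 2 uses the Frechet difference described above to approximate the Jacobian-vector products; the function name nltgcr, the fixed value of epsilon, and the stopping test are our own assumptions rather than the paper's reference implementation.

# Sketch of nlTGCR(m): Jacobian-vector products J(x)u are approximated with
# the Frechet difference (F(x + eps*u) - F(x)) / eps, as described in the text.
import numpy as np
from collections import deque

def nltgcr(F, x0, m=10, eps=1e-7, tol=1e-10, max_iter=200):
    x = np.asarray(x0, dtype=float)
    Fx = F(x)
    jvp = lambda xx, Fxx, u: (F(xx + eps * u) - Fxx) / eps   # Frechet: J(x)u
    r = -Fx                                     # line 2: residual is -F(x)
    v = jvp(x, Fx, r)                           # line 3
    P = deque(maxlen=m)                         # last m directions p_i
    V = deque(maxlen=m)                         # last m vectors v_i ~ J p_i
    P.append(r / np.linalg.norm(v))             # line 4
    V.append(v / np.linalg.norm(v))
    for j in range(max_iter):
        Vmat, Pmat = np.column_stack(list(V)), np.column_stack(list(P))
        y = Vmat.T @ r                          # line 6
        x = x + Pmat @ y                        # line 7
        Fx = F(x)
        r = -Fx                                 # line 8
        if np.linalg.norm(r) <= tol:
            break
        p = r.copy()                            # line 9
        v = jvp(x, Fx, p)                       # line 11 (second F evaluation)
        for p_i, v_i in zip(P, V):              # lines 12-16: orthogonalize
            beta = np.dot(v, v_i)
            p = p - beta * p_i
            v = v - beta * v_i
        nv = np.linalg.norm(v)
        P.append(p / nv)                        # line 17
        V.append(v / nv)
    return x

# Illustrative use on a gradient system F(x) = grad(phi)(x), e.g. a quadratic:
# x_min = nltgcr(lambda x: A @ x - b, np.zeros(n), m=2)   # A, b, n assumed defined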
High-Level Clarification.
At this point, one might ask the question: why not just use an inexact Newton method whereby the Jacobian system is solved with the linear GCR or TGCR method? This is where AA provides an interesting insight on some weaknesses of Newton-Krylov methods. A Newton-Krylov method generates a Krylov subspace $\operatorname{Span}\{r_0, Jr_0, \cdots, J^k r_0\}$ at a current iterate