such as the nonlinear conjugate gradient (CG) method [24], no line-search or trust-region technique is performed in AA, which is a significant advantage in large-scale unconstrained optimization. Empirically, AA is observed to be quite successful at accelerating convergence. We refer readers to [9] for a recent survey of acceleration methods.
However, classical AA has one significant drawback: it is expensive in terms of memory as well as computational cost, especially in the nonconvex stochastic setting, where only sublinear convergence can be expected when only stochastic gradients are accessible [3]. In light of this, a number of variants of AA have been proposed that aim at improving its performance and robustness (e.g., [39, 63, 62, 56, 67]). The above-cited works focus on improving the convergence behavior of AA, but they do not consider reducing the memory cost. In machine learning, we often encounter practical situations where the number of parameters is quite large, and for this reason it is not practical to use a large number of vectors in the acceleration method. It is not clear whether the symmetric structure of the Hessian can be exploited in a scheme like AA to reduce the memory cost while still maintaining convergence guarantees. In this paper, we demonstrate how this can be accomplished with a new algorithm that is superior to AA in practice.
Our contributions.
This paper develops a nonlinear acceleration method, the nonlinear Truncated Generalized Conjugate Residual method (nlTGCR), which takes advantage of symmetry. This work is motivated by the observation that the Hessian of a nonlinear function $f$, i.e., the Jacobian of its gradient mapping, is symmetric, and that more effective, conjugate gradient-like schemes can therefore be exploited.
We demonstrate that nonlinear acceleration methods can benefit from the symmetry property of the
Hessian. In particular, we study both linear and nonlinear problems and give a systematic analysis of
TGCR and nlTGCR. We show that TGCR is efficient and optimal for linear problems. By viewing
the method from the angle of an inexact Newton approach, we also show that adding a few global convergence strategies ensures global convergence of nlTGCR.
We complement our theoretical results with numerical simulations on several different problems. The experimental results demonstrate the advantages of our methods. To the best of our knowledge, this is the first work to investigate and improve the AA dynamics by exploiting the symmetry of the Hessian.
Related work.
Designing efficient optimization methods has received much attention. Several recent works [65, 4, 40, 48, 18] consider second-order optimization methods that employ sketching or approximation techniques. Different from these approaches, our method is a first-order method that exploits the symmetry of the Hessian instead of constructing it. A variant of the inexact Newton method was proposed in [47], where the least-squares sub-problems are solved approximately by the Minimum Residual method. Similarly, a new type of quasi-Newton symmetric update [54] uses several secant equations in a least-squares sense. These approaches share the same goal as ours. However, they are more closely related to secant or multi-secant techniques; as will be argued, our approach does a better job of capturing the nonlinearity of the problem. A short-term AA algorithm was proposed in [63]; it differs from ours in that it is still based on the parameter sequence instead of the gradient sequence and does not exploit the symmetry of the Hessian.
2 Background
2.1 Extrapolation, acceleration, and the Anderson Acceleration procedure
Consider a general fixed-point problem and the associated fixed-point iteration as shown in (1). Denote by $r_j = H(x_j) - x_j$ the residual vector at the $j$-th iteration. Classical extrapolation methods, including RRE, MPE, and the vector $\epsilon$-algorithm, have been designed to accelerate the convergence of the original sequence by generating a new and independent sequence of the form $t_j^{(k)} = \sum_{i=0}^{k} \alpha_i x_{j+i}$.
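To make the setup concrete, the sketch below (not from the paper) runs a fixed-point iteration for a placeholder mapping $H$, records the residuals $r_j = H(x_j) - x_j$, and forms one extrapolated point $t_j^{(k)}$; the mapping and the coefficients $\alpha_i$ are illustrative assumptions, since methods such as RRE and MPE determine the $\alpha_i$ from the residuals.

```python
import numpy as np

# Minimal sketch (not from the paper): run the fixed-point iteration x_{j+1} = H(x_j)
# for a placeholder contractive mapping H, record the residuals r_j = H(x_j) - x_j,
# and form one extrapolated point t_j^(k) = sum_{i=0}^{k} alpha_i * x_{j+i}.
def H(x):
    # Illustrative linear contraction; any contractive mapping would do here.
    A = np.array([[0.5, 0.1], [0.0, 0.3]])
    b = np.array([1.0, -2.0])
    return A @ x + b

x = np.zeros(2)
iterates, residuals = [x], []
for j in range(6):
    residuals.append(H(x) - x)   # residual r_j = H(x_j) - x_j
    x = H(x)                     # plain fixed-point step
    iterates.append(x)

# Extrapolated combination of k+1 consecutive iterates (weights sum to one).
# These alpha_i are purely illustrative; RRE/MPE compute them from the residuals.
k, j0, alpha = 2, 0, np.array([0.2, 0.3, 0.5])
t = sum(a * iterates[j0 + i] for i, a in enumerate(alpha))
print("x_6 =", iterates[-1], " t_0^(2) =", t)
```

Note that the extrapolated point $t$ above is a side product: it is never fed back into the iteration that produces the $x_j$.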
An important characteristic of these classical extrapolation methods is that the two sequences are not mixed, in the sense that no accelerated iterate $t_j^{(k)}$ is used to produce the iterate $x_j$. These extrapolation methods must be distinguished from acceleration methods, such as the AA procedure, which aim at generating their own sequences to find a fixed point of a certain mapping $H$.
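To contrast the two families in code, the schematic below shows an AA-style step in which the combined point is fed back as the next iterate, unlike the extrapolation sketch above where $t_j^{(k)}$ never re-enters the iteration. It assumes the commonly used difference-based least-squares formulation with mixing parameter $\beta = 1$; it is a sketch of a generic AA step, not the specific procedure developed in this paper.

```python
import numpy as np

def aa_step(xs, fs):
    """Schematic AA-style update (assumed standard form, mixing beta = 1).

    xs: list of recent iterates x_i; fs: list of residuals F_i = H(x_i) - x_i.
    Chooses gamma minimizing ||F_k - dF @ gamma|| and returns the combined
    point, which becomes the next iterate (the sequences are mixed).
    """
    X = np.column_stack(xs)
    F = np.column_stack(fs)
    dX, dF = np.diff(X, axis=1), np.diff(F, axis=1)   # consecutive differences
    gamma, *_ = np.linalg.lstsq(dF, F[:, -1], rcond=None)
    x_bar = X[:, -1] - dX @ gamma                     # combined iterates
    f_bar = F[:, -1] - dF @ gamma                     # combined residuals
    return x_bar + f_bar                              # beta = 1 mixing
```

A driver would call `aa_step` on a sliding window of the last few iterates and append the result to both lists; this feedback of the combined point is precisely what distinguishes acceleration from extrapolation.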
AA was originally designed to solve a system of nonlinear equations written in the form $F(x) = H(x) - x = 0$ [1, 61, 41, 30]. Denote $F_i = F(x_i)$. AA starts with an initial $x_0$ and sets $x_1 =$