Using second-order information in gradient
sampling methods for nonsmooth optimization
Bennet Gebken1*
1*Department of Mathematics, Technical University of Munich,
Boltzmannstr. 3, Garching b. M¨unchen, 85748, Germany.
Corresponding author(s). E-mail(s): bennet.gebken@cit.tum.de;
Abstract
In this article, we introduce a novel concept for second-order information of a
nonsmooth function inspired by the Goldstein ε-subdifferential. It comprises the
coefficients of all existing second-order Taylor expansions in an ε-ball around
a given point. Based on this concept, we define a model of the objective as
the maximum of these Taylor expansions, and derive a sampling scheme for its
approximation in practice. Minimization of this model induces a simple descent
method, for which we show convergence for the case where the objective is convex
or of max-type. While we do not prove any rate of convergence of this method,
numerical experiments suggest superlinear behavior with respect to the number
of oracle calls of the objective.
Keywords: nonsmooth optimization, nonconvex optimization, superlinear
convergence, gradient sampling, trust-region method
MSC Classification: 90C30, 90C56, 49J52
1 Introduction
Nonsmooth optimization is concerned with optimizing functions that are not necessar-
ily differentiable. Solution methods from smooth optimization, like gradient descent,
may fail to work in this case, as even when the gradient ∇f(x) of a nonsmooth
function f at a point x exists, it may not yield a stable description of the local behavior
of f around x. This issue can be overcome by considering multiple gradients at the same
time: In bundle methods [1, 2], gradients from previous iterations are collected to build
a cutting-plane model of the objective at the current point. Alternatively, in gradient
arXiv:2210.04579v4 [math.OC] 3 Jan 2025
sampling methods [3–8], ∇f(x) is replaced by (an approximation of) the Goldstein
ε-subdifferential ∂_ε f(x) [9]. For several algorithms from these classes, a linear speed of
convergence (at best) was proven [10–12]. Since for smooth optimization, Newton’s
method yields (at least) superlinear convergence by using second-order derivatives,
the question arises whether such a fast method can be constructed for the nonsmooth
case as well. It turns out that this task is quite challenging: In [13] it was called a
“wondrous grail” and it can be related to Problem 14 in [14].
To the best of our knowledge, there are currently two methods for which superlin-
ear convergence can be proven for certain classes of nonsmooth optimization problems.
The first method is the bundle-based VU-algorithm from [15]. The idea of this method
is to identify a subspace U along which f behaves smoothly and to then perform a
Newton-like step tangent to U. One of the challenges of this approach is to identify the
subspace U automatically. It appears that the only mechanism that is guaranteed to
achieve this is described in [16], and requires the objective to be piecewise twice contin-
uously differentiable with convex selection functions (with some additional technical
assumptions). The second method is the recent SuperPolyak [17]. It is based on the
subproblem of the bundle method by Polyak [18], but uses iteratively computed bun-
dle information and treats the cutting-plane inequalities as equalities. The objective
functions it is designed for are functions that possess a sharp minimum (i.e., the objec-
tive value grows linearly away from the minimum) with known optimal value. While
these are the only methods (that we are aware of) for which superlinear convergence
can be proven, there are several other methods that attempt to achieve this rate by
using some form of second-order information: In [19–21], methods were proposed that
arise by infusing second-order information into cutting-plane models. Alternatively,
there are methods that directly generalize quasi-Newton methods to the nonsmooth
case [22–28]. However, there are certain drawbacks in all of the approaches mentioned
thus far: Both the VU-algorithm and SuperPolyak require certain special structures of f.
For [19–21], a proper definition of “second-order information” of a nonsmooth function is
missing, which makes it difficult to exploit this information theoretically for proving
fast convergence. Similarly, for the generalized quasi-Newton approaches, there is no
corresponding “nonsmooth Newton’s method” from which their convergence theory
can be inherited.
The goal of this article is to construct a fast method that avoids the above
drawbacks. The basic idea is to use the maximum of existing second-order Taylor
expansions around a point as a local model of the objective function, as was already
proposed in [20]. There, whenever the Hessian matrix does not exist at a point y, “the
Hessian at some point infinitely close to y” (cf. [20], p. 2) is taken instead. In con-
trast to this, we will base our model on the second-order ε-jet, which can be seen as
a second-order version of the Goldstein ε-subdifferential. It contains the coefficients
of all existing second-order Taylor expansions (and their limits) in an ε-ball around a
given point and allows us to define the aforementioned model in a well-defined way.
Based on minimization of this model, we first derive a purely abstract descent method
(Algo. 4.1) for the case where the entire ε-jet is available at every point. To obtain a
method that is actually implementable, we then derive a sampling strategy to approx-
imate the ε-jet (Algo. 4.2), for which availability of a single element of the 0-jet at
every point (i.e., almost always simply the objective value, gradient and Hessian at
that point) is sufficient. Inserting this sampling scheme into the abstract method yields
a practical descent method (Algo. 4.3), for which we prove global convergence for the
case where the objective is convex or of max-type. We do not prove anything about
the rate of convergence of the abstract or practical method here. However, numeri-
cal experiments suggest that it is fast with respect to oracle calls (i.e., evaluations of
f and its derivatives). For reproducibility and transparency, the code for all of our
experiments is available at https://github.com/b-gebken/SOGS.
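To make the construction concrete, the model is the pointwise maximum of the second-order Taylor expansions built from the jet tuples (y, f(y), ∇f(y), ∇²f(y)). The following is a minimal sketch of evaluating this model (the model only, not the trust-region subproblem (17) that the method actually minimizes); all names are illustrative:

```python
import numpy as np

def taylor_max_model(x, jet):
    """Evaluate the max-of-Taylor-expansions model at x, given jet
    tuples (y, f(y), grad f(y), hess f(y))."""
    x = np.asarray(x, dtype=float)
    vals = []
    for y, fy, gy, Hy in jet:
        d = x - np.asarray(y, dtype=float)
        # second-order Taylor expansion of f around y, evaluated at x
        vals.append(fy + gy @ d + 0.5 * d @ Hy @ d)
    return max(vals)

# For f(x) = |x| with tuples sampled at y = -0.1 and y = 0.1, the
# two expansions are -x and x, so the model reproduces |x| exactly.
jet = [
    (np.array([-0.1]), 0.1, np.array([-1.0]), np.zeros((1, 1))),
    (np.array([0.1]), 0.1, np.array([1.0]), np.zeros((1, 1))),
]
val = taylor_max_model([0.5], jet)  # = 0.5
```

For max-type objectives such as f(x) = |x|, the sampled expansions reproduce the kink exactly, which is what makes this a better local description than any single Taylor expansion.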
It is important to note that fast convergence with respect to oracle calls does not
necessarily imply a fast runtime. For example, when comparing gradient descent and
Newton’s method in smooth optimization, the latter yields superlinear convergence
with respect to oracle calls. But every iteration of Newton’s method requires the
solution of a system of linear equations, which is more expensive than an iteration
of gradient descent, and may outweigh the fast convergence (see, e.g., [29], Section
8.6.1). In a similar vein, every iteration of our method will require the solution of a
subproblem (see (17)) which is a nonconvex quadratically constrained quadratic pro-
gram (QCQP) (see, e.g., [30], Section 4.4). We could apply several modifications to
simplify its solution, like positive definite modifications of the (potentially noncon-
vex) quadratic terms (as in [20]), summing the quadratic terms and moving them
to the objective of the subproblem (as in [21]) or using a quasi-Newton matrix for
the second-order information to avoid evaluation of the Hessian matrix. However, we
refrain from doing so, as we believe that it is important to first understand whether
superlinear convergence can be obtained in the best possible setting (with high com-
putational effort). If this is not the case, then the above techniques likely do not lead
to superlinear convergence either. To put it in another way: We first want to figure
out what Newton’s method for nonsmooth optimization could be in this article, before
trying to derive quasi-Newton variants of it in future work.
The remainder of this article is structured as follows: In Section 2 we introduce the
basics of nonsmooth analysis. In Section 3.1 we define the second-order ε-jet and derive
some of its theoretical properties. Afterwards, in Section 3.2, we define a model based
on the ε-jet and prove error estimates for the case where f is convex or of max-type. In
Section 4 we derive our descent method. We begin with the abstract version in Section
4.1 and prove its convergence. We then derive the sampling scheme for the ε-jet and
prove its termination in Section 4.2. In Section 4.3, we insert the sampling scheme
into the abstract method to obtain the practical descent method, for which we prove
convergence as well. Section 5 contains our numerical experiments. We first compare
the performance of our method to other solvers for nonsmooth optimization problems
using common test problems. Afterwards, we compare it to the VU-algorithm and
SuperPolyak, which both have superlinear speed of convergence on their respective
problem classes. Finally, in Section 6, we summarize our results and discuss possible
directions for future research.
2 Preliminaries
In this section, we introduce the basics of nonsmooth analysis that are used throughout
the article. More thorough introductions can be found in [2, 31]. To this end, let
f : ℝⁿ → ℝ be locally Lipschitz continuous and let Ω be the set of points in which f is
not differentiable. By Rademacher’s Theorem, Ω is a null set. The Clarke subdifferential
of f at x ∈ ℝⁿ is defined as
∂f(x) := conv{ ξ ∈ ℝⁿ : ∃ (x_j)_j ⊆ ℝⁿ ∖ Ω with x_j → x and ∇f(x_j) → ξ },  (1)
and its elements are called subgradients. It is nonempty and compact. A necessary
condition for a point x to be locally optimal is that 0 ∈ ∂f(x), which is referred to as
x being a critical point. As a set-valued map, x ↦ ∂f(x) is upper semicontinuous. A
more “stable” subdifferential which circumvents some of the practical issues
of the Clarke subdifferential (cf. [32], Section 3) is the Goldstein ε-subdifferential [9].
For x ∈ ℝⁿ and ε ≥ 0 it can be defined as

∂_ε f(x) := conv( ⋂_{δ>ε} cl{ ∇f(y) : y ∈ B_δ(x) ∖ Ω } ),  (2)

where B_δ(x) := { y ∈ ℝⁿ : ∥x − y∥ ≤ δ } and ∥·∥ is the Euclidean norm. Both
subdifferentials are related via

∂_ε f(x) = conv( ∂f(B_ε(x)) )  (3)

(cf. [33], Eq. (2.1)). Like the Clarke subdifferential, ∂_ε f(x) is nonempty and compact.
Furthermore, it can be used to compute descent directions of f: Let
v̄ = −arg min_{ξ ∈ ∂_ε f(x)} ∥ξ∥. Then the mean-value theorem ([31], Theorem 2.3.7) combined
with a well-known result from convex analysis (cf. [34]) yields

f(x + t v̄) − f(x) ≤ −t ∥v̄∥²  for all t ∈ (0, ε/∥v̄∥].  (4)
For k ∈ ℕ we say that a function is Cᵏ if it is k-times continuously differentiable.
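In a sampling-based implementation, ∂_ε f(x) is approximated by the convex hull of finitely many sampled gradients, and finding the minimal-norm element in (4) becomes a small quadratic program over the unit simplex. A minimal sketch (the Frank–Wolfe iteration is an assumed solver choice, not the one used in the paper; any small QP solver would do):

```python
import numpy as np

def min_norm_element(grads, iters=500):
    """Minimal-norm point of conv{g_1, ..., g_m}: minimize
    lam^T Q lam over the unit simplex, with Q the Gram matrix of
    the gradients, via a plain Frank-Wolfe iteration."""
    G = np.asarray(grads, dtype=float)          # shape (m, n)
    m = G.shape[0]
    Q = G @ G.T                                 # Gram matrix
    lam = np.full(m, 1.0 / m)                   # start at barycenter
    for _ in range(iters):
        g = Q @ lam                             # gradient of objective (up to factor 2)
        d = -lam.copy()
        d[int(np.argmin(g))] += 1.0             # direction towards best vertex e_j
        dQd = d @ Q @ d
        if dQd <= 1e-16:
            break
        t = np.clip(-(d @ g) / dQd, 0.0, 1.0)   # exact line search on [0, 1]
        lam = lam + t * d
    return lam @ G                              # the minimal-norm element itself

# f(x) = |x| near 0: sampled gradients -1 and +1; the minimal-norm
# element of their convex hull is 0, flagging (eps-)criticality.
xi = min_norm_element([[-1.0], [1.0]])
v_bar = -xi  # descent direction as in (4) (here zero)
```

The returned element ξ̄ gives the descent direction v̄ = −ξ̄ of (4); a near-zero norm signals that the current point is approximately ε-critical.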
3 Second-order ε-jet and model
In this section, we first define the second-order ε-jet and derive some of its properties.
Afterwards, we use it to define and analyze the second-order model that will later
be the foundation of our descent method. This includes error estimates for the cases
where fis convex or of max-type.
This article contains results for f belonging to different classes of functions. To
avoid confusion, each result explicitly states which assumption is required.
3.1 Second-order ε-jet
We begin by introducing the class of functions to which our approach is applicable.
To this end, let f : ℝⁿ → ℝ and let Ω₂ ⊆ ℝⁿ be the set of points in which f is not
twice differentiable.
Assumption 1. Assume that
(A1.1) f is locally Lipschitz continuous,
(A1.2) Ω₂ is a null set and
(A1.3) for all x ∈ ℝⁿ there is an open set U ⊆ ℝⁿ with x ∈ U such that
{ ∇²f(y) : y ∈ U ∖ Ω₂ } is bounded.
First of all, (A1.1) allows us to use the classical theory of nonsmooth analysis from
Section 2. Since we want to define our second-order object by evaluating the Hessian
matrix of f at different points in ℝⁿ, the set ℝⁿ ∖ Ω₂ should at least be dense. The stronger
assumption (A1.2) is made to obtain consistency with the ε-subdifferential later on
(cf. Lemma 2). Finally, (A1.3) assures boundedness. The second-order object that we
use throughout the article is now defined as follows:
Definition 1. Let x ∈ ℝⁿ and ε ≥ 0. For f satisfying Assumption 1, the set

J²_ε f(x) := ⋂_{δ>ε} cl{ (y, f(y), ∇f(y), ∇²f(y)) : y ∈ B_δ(x) ∖ Ω₂ } ⊆ ℝⁿ × ℝ × ℝⁿ × ℝⁿˣⁿ

is called the second-order ε-jet of f at x.¹
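As a rough numerical illustration, J²_ε f(x) can be approximated by collecting the 4-tuples at randomly sampled points in the ε-ball; since Ω₂ is a null set by (A1.2), sampled points almost surely avoid it. This uniform sampling is a naive stand-in for the paper's actual scheme (Algo. 4.2):

```python
import numpy as np

def approx_jet(f, grad, hess, x, eps, num=20, seed=0):
    """Naive approximation of the second-order eps-jet J^2_eps f(x):
    collect (y, f(y), grad f(y), hess f(y)) at points y drawn
    uniformly from the eps-ball around x."""
    rng = np.random.default_rng(seed)
    x = np.asarray(x, dtype=float)
    n = x.size
    jet = []
    for _ in range(num):
        u = rng.standard_normal(n)
        u /= np.linalg.norm(u)               # random direction on the sphere
        r = eps * rng.random() ** (1.0 / n)  # radius, uniform in the ball
        y = x + r * u
        jet.append((y, f(y), grad(y), hess(y)))
    return jet

# f(x) = |x| at x = 0: away from 0 the gradient is sign(x) and the
# Hessian is 0, so every sampled tuple has gradient +1 or -1 and
# zero Hessian -- exactly the data the eps-jet collects.
jet = approx_jet(
    lambda y: float(np.abs(y[0])),
    lambda y: np.sign(y),
    lambda y: np.zeros((1, 1)),
    x=[0.0], eps=0.1,
)
```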
Compared to the definition of the Goldstein ε-subdifferential (cf. (2)) there are
two modifications: Firstly, since we later want to define the model based on second-
order Taylor expansions, the gradient ∇f(y) is replaced by everything that is needed
for building these expansions, represented as the 4-tuple (y, f(y), ∇f(y), ∇²f(y)), and
Ω is replaced by Ω₂. Secondly, we omitted taking the convex hull because a convex
combination of such 4-tuples does in general not correspond to any Taylor expansion
of f. In the following, we analyze some of the properties of J²_ε f(x). First of all, like
the ε-subdifferential, J²_ε f(x) is nonempty and compact:
Lemma 1. Assume that f satisfies Assumption 1. Then J²_ε f(x) is nonempty and
compact for all x ∈ ℝⁿ and ε ≥ 0.
Proof We first show that the set

T(δ) := { (y, f(y), ∇f(y), ∇²f(y)) : y ∈ B_δ(x) ∖ Ω₂ }  (5)

is bounded for all δ > ε. Since we consider boundedness in a finite-dimensional product
space, it is equivalent to boundedness of the projections on the individual factors. Clearly,
B_δ(x) ∖ Ω₂ and f(B_δ(x) ∖ Ω₂) are bounded (due to continuity of f). Furthermore,
{ ∇f(y) : y ∈ B_δ(x) ∖ Ω₂ } is bounded as a subset of the compact set ∂_δ f(x). To see that
{ ∇²f(y) : y ∈ B_δ(x) ∖ Ω₂ } is bounded as well, let (U_i)_{i∈I} be an open cover of B_δ(x) induced by (A1.3)
¹The term “jet” is inspired by the classical notion of jets for smooth functions ([35], §2) and the related
concept of Fréchet second-order subjets in [36].