Using second-order information in gradient
sampling methods for nonsmooth optimization
Bennet Gebken1*
1*Department of Mathematics, Technical University of Munich,
Boltzmannstr. 3, Garching b. M¨unchen, 85748, Germany.
Corresponding author(s). E-mail(s): bennet.gebken@cit.tum.de;
Abstract
In this article, we introduce a novel concept for second-order information of a
nonsmooth function inspired by the Goldstein ε-subdifferential. It comprises the
coefficients of all existing second-order Taylor expansions in an ε-ball around
a given point. Based on this concept, we define a model of the objective as
the maximum of these Taylor expansions, and derive a sampling scheme for its
approximation in practice. Minimization of this model induces a simple descent
method, for which we show convergence for the case where the objective is convex
or of max-type. While we do not prove any rate of convergence of this method,
numerical experiments suggest superlinear behavior with respect to the number
of oracle calls of the objective.
Keywords: nonsmooth optimization, nonconvex optimization, superlinear
convergence, gradient sampling, trust-region method
MSC Classification: 90C30, 90C56, 49J52
1 Introduction
Nonsmooth optimization is concerned with optimizing functions that are not necessar-
ily differentiable. Solution methods from smooth optimization, like gradient descent,
may fail to work in this case, as even when the gradient ∇f(x) of a nonsmooth
function f at a point x exists, it may not yield a stable description of the local behavior
of f around x. This issue can be overcome by considering multiple gradients at the same
time: In bundle methods [1, 2], gradients from previous iterations are collected to build
a cutting-plane model of the objective at the current point. Alternatively, in gradient
arXiv:2210.04579v4 [math.OC] 3 Jan 2025
sampling methods [3–8], ∇f(x) is replaced by (an approximation of) the Goldstein
ε-subdifferential ∂_ε f(x) [9]. For several algorithms from these classes, a linear speed of
convergence (at best) was proven [10–12]. Since for smooth optimization, Newton’s
method yields (at least) superlinear convergence by using second-order derivatives,
the question arises whether such a fast method can be constructed for the nonsmooth
case as well. It turns out that this task is quite challenging: In [13] it was called a
“wondrous grail” and it can be related to Problem 14 in [14].
To the best of our knowledge, there are currently two methods for which superlin-
ear convergence can be proven for certain classes of nonsmooth optimization problems.
The first method is the bundle-based VU-algorithm from [15]. The idea of this method
is to identify a subspace U along which f behaves smoothly and to then perform a
Newton-like step tangent to U. One of the challenges of this approach is to identify the
subspace U automatically. It appears that the only mechanism that is guaranteed to
achieve this is described in [16], and requires the objective to be piecewise twice contin-
uously differentiable with convex selection functions (with some additional technical
assumptions). The second method is the recent SuperPolyak [17]. It is based on the
subproblem of the bundle method by Polyak [18], but uses iteratively computed bun-
dle information and treats the cutting-plane inequalities as equalities. The objective
functions it is designed for are functions that possess a sharp minimum (i.e., the objec-
tive value grows linearly away from the minimum) with known optimal value. While
these are the only methods (that we are aware of) for which superlinear convergence
can be proven, there are several other methods that attempt to achieve this rate by
using some form of second-order information: In [19–21], methods were proposed that
arise by infusing second-order information into cutting-plane models. Alternatively,
there are methods that directly generalize quasi-Newton methods to the nonsmooth
case [22–28]. However, there are certain drawbacks in all of the approaches mentioned
thus far: Both the VU-algorithm and SuperPolyak require certain special structures of f.
For [19–21], a proper definition of “second-order information” of a nonsmooth function is
missing, which makes it difficult to exploit this information theoretically for proving
fast convergence. Similarly, for the generalized quasi-Newton approaches, there is no
corresponding “nonsmooth Newton’s method” from which their convergence theory
can be inherited.
The goal of this article is to construct a fast method that avoids the above
drawbacks. The basic idea is to use the maximum of existing second-order Taylor
expansions around a point as a local model of the objective function, as was already
proposed in [20]. There, whenever the Hessian matrix does not exist at a point y, “the
Hessian at some point infinitely close to y” (cf. [20], p. 2) is taken instead. In con-
trast to this, we will base our model on the second-order ε-jet, which can be seen as
a second-order version of the Goldstein ε-subdifferential. It contains the coefficients
of all existing second-order Taylor expansions (and their limits) in an ε-ball around a
given point and allows us to define the aforementioned model in a well-defined way.
Based on minimization of this model, we first derive a purely abstract descent method
(Algo. 4.1) for the case where the entire ε-jet is available at every point. To obtain a
method that is actually implementable, we then derive a sampling strategy to approx-
imate the ε-jet (Algo. 4.2), for which availability of a single element of the 0-jet at
every point (i.e., almost always simply the objective value, gradient and Hessian at
that point) is sufficient. Inserting this sampling scheme into the abstract method yields
a practical descent method (Algo. 4.3), for which we prove global convergence for the
case where the objective is convex or of max-type. We do not prove anything about
the rate of convergence of the abstract or practical method here. However, numeri-
cal experiments suggest that it is fast with respect to oracle calls (i.e., evaluations of
f and its derivatives). For reproducibility and transparency, the code for all of our
experiments is available at https://github.com/b-gebken/SOGS.
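To make the construction concrete, the model is the pointwise maximum of the second-order Taylor expansions built from the jet tuples (y, f(y), ∇f(y), ∇²f(y)). The following is a minimal sketch of evaluating this model (the model only, not the trust-region subproblem (17) that the method actually minimizes); all names are illustrative:

```python
import numpy as np

def taylor_max_model(x, jet):
    """Evaluate the max-of-Taylor-expansions model at x, given jet
    tuples (y, f(y), grad f(y), hess f(y))."""
    x = np.asarray(x, dtype=float)
    vals = []
    for y, fy, gy, Hy in jet:
        d = x - np.asarray(y, dtype=float)
        # second-order Taylor expansion of f around y, evaluated at x
        vals.append(fy + gy @ d + 0.5 * d @ Hy @ d)
    return max(vals)

# For f(x) = |x| with tuples sampled at y = -0.1 and y = 0.1, the
# two expansions are -x and x, so the model reproduces |x| exactly.
jet = [
    (np.array([-0.1]), 0.1, np.array([-1.0]), np.zeros((1, 1))),
    (np.array([0.1]), 0.1, np.array([1.0]), np.zeros((1, 1))),
]
val = taylor_max_model([0.5], jet)  # = 0.5
```

For max-type objectives such as f(x) = |x|, the sampled expansions reproduce the kink exactly, which is what makes this a better local description than any single Taylor expansion.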
It is important to note that fast convergence with respect to oracle calls does not
necessarily imply a fast runtime. For example, when comparing gradient descent and
Newton’s method in smooth optimization, the latter yields superlinear convergence
with respect to oracle calls. But every iteration of Newton’s method requires the
solution of a system of linear equations, which is more expensive than an iteration
of gradient descent, and may outweigh the fast convergence (see, e.g., [29], Section
8.6.1). In a similar vein, every iteration of our method will require the solution of a
subproblem (see (17)) which is a nonconvex quadratically constrained quadratic pro-
gram (QCQP) (see, e.g., [30], Section 4.4). We could apply several modifications to
simplify its solution, like positive definite modifications of the (potentially noncon-
vex) quadratic terms (as in [20]), summing the quadratic terms and moving them
to the objective of the subproblem (as in [21]) or using a quasi-Newton matrix for
the second-order information to avoid evaluation of the Hessian matrix. However, we
refrain from doing so, as we believe that it is important to first understand whether
superlinear convergence can be obtained in the best possible setting (with high com-
putational effort). If this is not the case, then the above techniques likely do not lead
to superlinear convergence either. To put it in another way: We first want to figure
out what Newton’s method for nonsmooth optimization could be in this article, before
trying to derive quasi-Newton variants of it in future work.
The remainder of this article is structured as follows: In Section 2 we introduce the
basics of nonsmooth analysis. In Section 3.1 we define the second-order ε-jet and derive
some of its theoretical properties. Afterwards, in Section 3.2, we define a model based
on the ε-jet and prove error estimates for the case where f is convex or of max-type. In
Section 4 we derive our descent method. We begin with the abstract version in Section
4.1 and prove its convergence. We then derive the sampling scheme for the ε-jet and
prove its termination in Section 4.2. In Section 4.3, we insert the sampling scheme
into the abstract method to obtain the practical descent method, for which we prove
convergence as well. Section 5 contains our numerical experiments. We first compare
the performance of our method to other solvers for nonsmooth optimization problems
using common test problems. Afterwards, we compare it to the VU-algorithm and
SuperPolyak, which both have superlinear speed of convergence on their respective
problem classes. Finally, in Section 6, we summarize our results and discuss possible
directions for future research.
2 Preliminaries
In this section, we introduce the basics of nonsmooth analysis that are used throughout
the article. More thorough introductions can be found in [2, 31]. To this end, let
f : ℝⁿ → ℝ be locally Lipschitz continuous and let Ω be the set of points in which f is
not differentiable. By Rademacher’s Theorem, Ω is a null set. The Clarke subdifferential
of f at x ∈ ℝⁿ is defined as
∂f(x) := conv{ ξ ∈ ℝⁿ : ∃ (x_j)_j ⊆ ℝⁿ ∖ Ω with x_j → x and ∇f(x_j) → ξ },  (1)
and its elements are called subgradients. It is nonempty and compact. A necessary
condition for a point x to be locally optimal is that 0 ∈ ∂f(x), which is referred to as
x being a critical point. As a set-valued map, x ↦ ∂f(x) is upper semicontinuous. A
more “stable” subdifferential which circumvents some of the practical issues
of the Clarke subdifferential (cf. [32], Section 3) is the Goldstein ε-subdifferential [9].
For x ∈ ℝⁿ and ε ≥ 0 it can be defined as

∂_ε f(x) := conv( ⋂_{δ>ε} cl{ ∇f(y) : y ∈ B_δ(x) ∖ Ω } ),  (2)

where B_δ(x) := { y ∈ ℝⁿ : ∥x − y∥ ≤ δ } and ∥·∥ is the Euclidean norm. Both
subdifferentials are related via

∂_ε f(x) = conv( ∂f(B_ε(x)) )  (3)

(cf. [33], Eq. (2.1)). Like the Clarke subdifferential, ∂_ε f(x) is nonempty and compact.
Furthermore, it can be used to compute descent directions of f: Let
v̄ = −arg min_{ξ ∈ ∂_ε f(x)} ∥ξ∥. Then the mean-value theorem ([31], Theorem 2.3.7) combined
with a well-known result from convex analysis (cf. [34]) yields

f(x + t v̄) − f(x) ≤ −t ∥v̄∥²  for all t ∈ (0, ε/∥v̄∥].  (4)
For k ∈ ℕ we say that a function is Cᵏ if it is k-times continuously differentiable.
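In a sampling-based implementation, ∂_ε f(x) is approximated by the convex hull of finitely many sampled gradients, and finding the minimal-norm element in (4) becomes a small quadratic program over the unit simplex. A minimal sketch (the Frank–Wolfe iteration is an assumed solver choice, not the one used in the paper; any small QP solver would do):

```python
import numpy as np

def min_norm_element(grads, iters=500):
    """Minimal-norm point of conv{g_1, ..., g_m}: minimize
    lam^T Q lam over the unit simplex, with Q the Gram matrix of
    the gradients, via a plain Frank-Wolfe iteration."""
    G = np.asarray(grads, dtype=float)          # shape (m, n)
    m = G.shape[0]
    Q = G @ G.T                                 # Gram matrix
    lam = np.full(m, 1.0 / m)                   # start at barycenter
    for _ in range(iters):
        g = Q @ lam                             # gradient of objective (up to factor 2)
        d = -lam.copy()
        d[int(np.argmin(g))] += 1.0             # direction towards best vertex e_j
        dQd = d @ Q @ d
        if dQd <= 1e-16:
            break
        t = np.clip(-(d @ g) / dQd, 0.0, 1.0)   # exact line search on [0, 1]
        lam = lam + t * d
    return lam @ G                              # the minimal-norm element itself

# f(x) = |x| near 0: sampled gradients -1 and +1; the minimal-norm
# element of their convex hull is 0, flagging (eps-)criticality.
xi = min_norm_element([[-1.0], [1.0]])
v_bar = -xi  # descent direction as in (4) (here zero)
```

The returned element ξ̄ gives the descent direction v̄ = −ξ̄ of (4); a near-zero norm signals that the current point is approximately ε-critical.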
3 Second-order ε-jet and model
In this section, we first define the second-order ε-jet and derive some of its properties.
Afterwards, we use it to define and analyze the second-order model that will later
be the foundation of our descent method. This includes error estimates for the cases
where fis convex or of max-type.
This article contains results for f belonging to different classes of functions. To
avoid confusion, each result explicitly states which assumption is required.
3.1 Second-order ε-jet
We begin by introducing the class of functions to which our approach is applicable.
To this end, let f : ℝⁿ → ℝ and let Ω₂ ⊆ ℝⁿ be the set of points in which f is not
twice differentiable.
Assumption 1. Assume that
(A1.1) f is locally Lipschitz continuous,
(A1.2) Ω₂ is a null set and
(A1.3) for all x ∈ ℝⁿ there is an open set U ⊆ ℝⁿ with x ∈ U such that
{ ∇²f(y) : y ∈ U ∖ Ω₂ } is bounded.
First of all, (A1.1) allows us to use the classical theory of nonsmooth analysis from
Section 2. Since we want to define our second-order object by evaluating the Hessian
matrix of f at different points in ℝⁿ, the set ℝⁿ ∖ Ω₂ should at least be dense. The stronger
assumption (A1.2) is made to obtain consistency with the ε-subdifferential later on
(cf. Lemma 2). Finally, (A1.3) assures boundedness. The second-order object that we
use throughout the article is now defined as follows:
Definition 1. Let x ∈ ℝⁿ and ε ≥ 0. For f satisfying Assumption 1, the set

J²_ε f(x) := ⋂_{δ>ε} cl{ (y, f(y), ∇f(y), ∇²f(y)) : y ∈ B_δ(x) ∖ Ω₂ } ⊆ ℝⁿ × ℝ × ℝⁿ × ℝⁿˣⁿ

is called the second-order ε-jet of f at x.¹
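As a rough numerical illustration, J²_ε f(x) can be approximated by collecting the 4-tuples at randomly sampled points in the ε-ball; since Ω₂ is a null set by (A1.2), sampled points almost surely avoid it. This uniform sampling is a naive stand-in for the paper's actual scheme (Algo. 4.2):

```python
import numpy as np

def approx_jet(f, grad, hess, x, eps, num=20, seed=0):
    """Naive approximation of the second-order eps-jet J^2_eps f(x):
    collect (y, f(y), grad f(y), hess f(y)) at points y drawn
    uniformly from the eps-ball around x."""
    rng = np.random.default_rng(seed)
    x = np.asarray(x, dtype=float)
    n = x.size
    jet = []
    for _ in range(num):
        u = rng.standard_normal(n)
        u /= np.linalg.norm(u)               # random direction on the sphere
        r = eps * rng.random() ** (1.0 / n)  # radius, uniform in the ball
        y = x + r * u
        jet.append((y, f(y), grad(y), hess(y)))
    return jet

# f(x) = |x| at x = 0: away from 0 the gradient is sign(x) and the
# Hessian is 0, so every sampled tuple has gradient +1 or -1 and
# zero Hessian -- exactly the data the eps-jet collects.
jet = approx_jet(
    lambda y: float(np.abs(y[0])),
    lambda y: np.sign(y),
    lambda y: np.zeros((1, 1)),
    x=[0.0], eps=0.1,
)
```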
Compared to the definition of the Goldstein ε-subdifferential (cf. (2)) there are
two modifications: Firstly, since we later want to define the model based on second-
order Taylor expansions, the gradient ∇f(y) is replaced by everything that is needed
for building these expansions, represented as the 4-tuple (y, f(y), ∇f(y), ∇²f(y)), and
Ω is replaced by Ω₂. Secondly, we omitted taking the convex hull because a convex
combination of such 4-tuples does in general not correspond to any Taylor expansion
of f. In the following, we analyze some of the properties of J²_ε f(x). First of all, like
the ε-subdifferential, J²_ε f(x) is nonempty and compact:
Lemma 1. Assume that f satisfies Assumption 1. Then J²_ε f(x) is nonempty and
compact for all x ∈ ℝⁿ and ε ≥ 0.
Proof We first show that the set

T(δ) := { (y, f(y), ∇f(y), ∇²f(y)) : y ∈ B_δ(x) ∖ Ω₂ }  (5)

is bounded for all δ > ε. Since we consider boundedness in a finite-dimensional product
space, it is equivalent to boundedness of the projections on the individual factors. Clearly,
B_δ(x) ∖ Ω₂ and f(B_δ(x) ∖ Ω₂) are bounded (due to continuity of f). Furthermore,
{ ∇f(y) : y ∈ B_δ(x) ∖ Ω₂ } is bounded as a subset of the compact set ∂_δ f(x). To see that
{ ∇²f(y) : y ∈ B_δ(x) ∖ Ω₂ } is bounded as well, let (U_i)_{i∈I} be an open cover of B_δ(x) induced by (A1.3)
¹The term “jet” is inspired by the classical notion of jets for smooth functions ([35], §2) and the related
concept of Fréchet second-order subjets in [36].