Anisotropic Proximal Gradient
Emanuel Laude∗ · Panagiotis Patrinos∗

∗KU Leuven, Department of Electrical Engineering (ESAT-STADIUS), Kasteelpark Arenberg 10, 3001 Leuven, Belgium. {emanuel.laude,panos.patrinos}@esat.kuleuven.be
December 24, 2024
Abstract
This paper studies a novel algorithm for nonconvex composite minimization which can
be interpreted in terms of dual space nonlinear preconditioning for the classical proximal
gradient method. The proposed scheme can be applied to additive composite minimization
problems whose smooth part exhibits an anisotropic descent inequality relative to a refer-
ence function. It is proved that the anisotropic descent property is closed under pointwise
average if the Bregman distance generated by the conjugate reference function is jointly
convex. More specifically, for the exponential reference function we prove its closedness
under pointwise conic combinations. We analyze the method’s asymptotic convergence
and prove its linear convergence under an anisotropic proximal gradient dominance con-
dition. Applications are discussed including exponentially regularized LPs and logistic
regression with nonsmooth regularization. In numerical experiments we show significant
improvements of the proposed method over its Euclidean counterparts.
Keywords: Bregman distance · proximal gradient method · duality · nonlinear preconditioning

AMS Subject Classification: 65K05 · 49J52 · 90C30
1 Introduction
1.1 Motivation and contributions
A classical first-order approach for the minimization of an additive composite problem is the
celebrated proximal gradient algorithm. In particular, if the gradient mapping of the smooth
part of the cost function is Lipschitz continuous, the algorithm converges with a constant
step-size, which avoids a linesearch procedure that can be expensive in terms of function
evaluations. However, there are many functions which do not exhibit a globally Lipschitz
continuous gradient mapping, and thus the proximal gradient method does not converge with
a constant step-size. For example, this is the case for problems involving exponential penalties.
It is well known that the proximal gradient method can be written in terms of a majorize-minimize
procedure with a quadratic upper bound. However, in some cases more structure of the cost
function is available. This offers the opportunity to construct problem-specific and potentially
tighter upper approximations that may lead to faster convergence while preserving the simplicity
of a first-order method. In this paper we study a novel, nonlinear extension of the proximal
gradient method, called the anisotropic proximal gradient method. In the same way that
the Euclidean proximal gradient method at each step minimizes a simple quadratic model,
the proposed method minimizes a certain problem-specific upper bound obtained from an
anisotropic generalization of the celebrated descent lemma (anisotropic smoothness) [29,28].
This is conceptually similar to the notion of relative smoothness and the Bregman proximal
gradient method [10,5,30]. In fact, in the convex non-composite case the proposed method
can be seen as a certain dual counterpart thereof in which the roles of the reference function and
the cost function are flipped [32]. The approach applies to, but is not limited to, entropically
regularized LPs, logistic regression and soft-maximum-type problems, covering problems that are
not handled by the Bregman proximal gradient method. In experiments we show promising
improvements over Euclidean methods, including a recent family of linesearch-free adaptive
proximal gradient methods [27,34]. More precisely, the contributions of this work are
summarized as follows:
• We introduce the notion of anisotropic smoothness, an extension of [29,28], that is
characterized in terms of an inequality akin to (and in fact a generalization of) the well-known
Euclidean descent inequality. In particular, we develop a calculus for anisotropic
smoothness providing equivalent characterizations and invariances that allow one to
assemble more complicated functions from elementary ones. For example, in the exponential
case we show that the property is closed under pointwise conic combinations,
which distinguishes it from the notion of Euclidean Lipschitz smoothness and the notion
of relative smoothness.

• Based on the anisotropic descent inequality we propose a novel algorithm that can be
seen as a nonlinearly preconditioned version of the Euclidean proximal gradient method.
For improved practicality we also consider a linesearch extension. We show that the
method converges subsequentially using an unconventional regularized gap function as
a measure of stationarity. In particular, this guarantees a sublinear rate with respect to the
stationarity gap.

• We prove the Q-linear convergence of the anisotropic proximal gradient method under
a non-Euclidean generalization of the proximal gradient dominated condition [25]. This
is closely related to the PL-inequality. We prove that the anisotropic proximal gradient
dominated condition is implied by anisotropic strong convexity of the smooth part. We
show that relative smoothness [5,30] is a dual characterization of anisotropic strong
convexity, generalizing [29] to reference functions ϕ which are not necessarily super-coercive.
Also note that for the Bregman proximal gradient method only an R-linear
convergence result is known [30].

• We also provide an equivalent interpretation in terms of a difference of Φ-convex
approach. In particular, this allows us to employ a transfer of smoothness via a certain
double-min duality to show linear convergence if the composite part is strongly convex
rather than the smooth part.

• In numerical experiments we show that the algorithm improves over its Euclidean
counterparts. On the task of regularized logistic regression we show that for small
regularization weights the algorithm achieves better performance than a recent family of
linesearch-free adaptive proximal gradient methods [27,34].
1.2 Related work
A popular nonlinear extension of the classical proximal gradient method is the Bregman
proximal gradient algorithm [10,5,30]. Similar to our approach the algorithm applies to
composite problems whose smooth part exhibits relative smoothness (in the Bregman sense),
a generalization of Lipschitz smoothness introduced by the same authors. For instance this
allows one to consider functions whose Hessians exhibit a certain polynomial growth [30].
Yet, there exist examples, such as problems with exponential penalties, for which it is hard
to construct “simple” reference functions. Simplicity of the reference function in this context
refers to an easy-to-invert gradient mapping and simple proximal mappings, both of which are
crucial for tractability of the algorithm. As a remedy we introduce the notion of anisotropic
smoothness as an alternative generalization of Lipschitz smoothness. The notion of anisotropic
smoothness considered in this paper is an extension of [29] relaxing the super-coercivity
assumption of the reference function. Up to sign-flip, anisotropic smoothness as introduced
in [29] can also be regarded as a globalization of anisotropic prox-regularity [28] specialized
to smooth functions. Alongside relative smoothness, [30,4] also propose the notion of relative
strong convexity, which allows the authors to study the linear convergence of the Bregman
(proximal) gradient method. In particular, [30] pose the open question of the existence of
a meaningful conjugate duality for relative smoothness and strong convexity akin to the
Euclidean case. Under full domain and super-coercivity of the reference function this is
addressed in [29] where it is shown that such a duality correspondence (in the convex case)
holds between relative smoothness and anisotropic strong convexity as well as relative strong
convexity and anisotropic smoothness. However, the super-coercivity restriction limits the
practical applicability of the framework. In this paper we further generalize these results
to reference functions which are not necessarily super-coercive, which allows us to include
the important exponential reference function. Expanding on [29] we show that anisotropic
smoothness is equivalent to the existence of an infimal convolution expression, or equivalently,
to a certain notion of generalized concavity [50], which is an important concept in optimal
transport [50]. In general, however, the pointwise sum of infimal convolutions is not an infimal
convolution [29], limiting the practical applicability. Averages of infimal convolutions also
appear in the context of the Bregman proximal average [51]. In particular the authors study
the preservation of its convexity for reference functions whose conjugates generate a jointly
convex Bregman distance. Building upon these results we show that this property furnishes a
sufficient condition for closedness of anisotropic smoothness under pointwise average even in
the nonconvex case, which is important for practice. Furthermore, for the exponential reference
function, we show that anisotropic smoothness is closed under pointwise conic combinations, a
property shared with the notion of quasi self-concordance [18,1]. In this case, a second-order
characterization of anisotropic smoothness reveals a close relation to the recent notion of
(L0, L1)-smoothness [52].
1.3 Paper organization
The remainder of this paper is organized as follows:

• In Section 2 we introduce the notation and some required basic concepts such as Bregman distances.

• In Section 3 we introduce the notion of anisotropic smoothness, propose the algorithm and discuss the well-definedness of its iterates under certain constraint qualifications.

• In Section 4 we develop a calculus for anisotropic smoothness and provide examples.

• In Section 5 we analyze the algorithm.

• In Section 6 we provide an equivalent interpretation in terms of a difference of Φ-convex approach.

• In Section 7 we consider applications based on the examples developed in Section 4.
2 Notation and preliminaries
We denote by $\langle\cdot,\cdot\rangle$ the standard Euclidean inner product on $\mathbb{R}^n$ and by $\|x\| := \sqrt{\langle x, x\rangle}$ for any $x \in \mathbb{R}^n$ the standard Euclidean norm on $\mathbb{R}^n$. In accordance with [36,42] we extend the classical arithmetic on $\mathbb{R}$ to the extended real line $\overline{\mathbb{R}} := \mathbb{R} \cup \{-\infty, +\infty\}$. We define upper addition $-\infty \mathbin{\dot{+}} \infty = \infty$ and lower addition $-\infty \mathbin{\underset{.}{+}} \infty = -\infty$, and accordingly upper subtraction $\infty \mathbin{\dot{-}} \infty = \infty$ and lower subtraction $\infty \mathbin{\underset{.}{-}} \infty = -\infty$. The effective domain of an extended real-valued function $f : \mathbb{R}^n \to \overline{\mathbb{R}}$ is denoted by $\operatorname{dom} f := \{x \in \mathbb{R}^n : f(x) < \infty\}$, and we say that $f$ is proper if $\operatorname{dom} f \neq \emptyset$ and $f(x) > -\infty$ for all $x \in \mathbb{R}^n$; lower semicontinuous (lsc) if $f(\bar{x}) \leq \liminf_{x \to \bar{x}} f(x)$ for all $\bar{x} \in \mathbb{R}^n$; super-coercive if $f(x)/\|x\| \to \infty$ as $\|x\| \to \infty$. We define by $\Gamma_0(\mathbb{R}^n)$ the class of all proper, lsc convex functions $f : \mathbb{R}^n \to \overline{\mathbb{R}}$. For any functions $f : \mathbb{R}^n \to \overline{\mathbb{R}}$ and $g : \mathbb{R}^n \to \overline{\mathbb{R}}$ we define the infimal convolution or epi-addition of $f$ and $g$ as $(f \mathbin{\square} g)(x) = \inf_{y \in \mathbb{R}^n} g(x - y) + f(y)$. For a proper function $f : \mathbb{R}^n \to \overline{\mathbb{R}}$ and $\lambda \geq 0$ we define the epi-scaling $(\lambda \star f)(x) = \lambda f(\lambda^{-1} x)$ for $\lambda > 0$ and $(\lambda \star f)(x) = \delta_{\{0\}}(x)$ otherwise. We adopt the operator precedence for epi-multiplication and epi-addition from the pointwise case. We denote by $\mathbb{R}^n_+ = [0, +\infty)^n$ the nonnegative orthant and by $\mathbb{R}^n_{++} = (0, +\infty)^n$ the positive orthant. Let $g^{\vee} := g(-(\cdot))$ denote the reflection of $g$. Since convex conjugation and reflection commute we adopt the notation $(g^{\vee})^* = (g^*)^{\vee} =: g^{\vee *}$. We adopt the notions of essential smoothness, essential strict convexity and Legendre functions from [40, Section 26]: we say that a function $f \in \Gamma_0(\mathbb{R}^n)$ is essentially smooth, if $\operatorname{int}(\operatorname{dom} f) \neq \emptyset$ and $f$ is differentiable on $\operatorname{int}(\operatorname{dom} f)$ such that $\|\nabla f(x^{\nu})\| \to \infty$ whenever $\operatorname{int}(\operatorname{dom} f) \ni x^{\nu} \to x \in \operatorname{bdry} \operatorname{dom} f$; essentially strictly convex, if $f$ is strictly convex on every convex subset of $\operatorname{dom} \partial f$; and Legendre, if $f$ is both essentially smooth and essentially strictly convex. We denote by $\mathbb{N}_0 := \mathbb{N} \cup \{0\}$. Otherwise we adopt the notation from [42].
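For instance, specializing the epi-scaling to the exponential reference function $\operatorname{SumExp}(x) := \sum_{j=1}^n \exp(x_j)$ that serves as the running example below (Example 2.1), with $\lambda = 1/L$ for some $L > 0$, gives
\[
\Bigl(\tfrac{1}{L} \star \operatorname{SumExp}\Bigr)(x) = \tfrac{1}{L} \operatorname{SumExp}(L x) = \tfrac{1}{L} \sum_{j=1}^n \exp(L x_j),
\]
which is precisely the kind of expression that reappears in the anisotropic descent inequality (3.2) below.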
Our concepts and algorithm involve a reference function $\phi$ which complies with the following standing requirement which is assumed valid across the entire paper unless stated otherwise:

(A1) $\phi \in \Gamma_0(\mathbb{R}^n)$ is Legendre with $\operatorname{dom} \phi = \mathbb{R}^n$.

The dual reference function $\phi^*$ is Legendre as well [40, Theorem 26.3] but does not necessarily have full domain. Instead, thanks to a conjugate duality between super-coercivity and full domain [6, Proposition 2.16], $\phi^*$ is super-coercive. The gradient mapping $\nabla\phi : \mathbb{R}^n \to \operatorname{int} \operatorname{dom} \phi^*$ is a diffeomorphism between $\mathbb{R}^n$ and $\operatorname{int} \operatorname{dom} \phi^*$ with inverse $(\nabla\phi)^{-1} = \nabla\phi^*$ [40, Theorem 26.5]. In fact, super-coercivity is a standard assumption for the mirror-descent and Bregman (proximal) gradient algorithm as it ensures well-definedness of the mirror update [5, Lemma 2(ii)]. Since the framework that will be developed in this paper corresponds to a certain dual Bregman approach, the full-domain restriction is somewhat natural.
The running example considered in this paper is the exponential reference function:
Example 2.1. Let $\phi : \mathbb{R}^n \to \mathbb{R}$ be defined by $\phi(x) = \operatorname{SumExp}(x) := \sum_{j=1}^n \exp(x_j)$. Then $\phi \in \Gamma_0(\mathbb{R}^n)$ is Legendre with full domain and thus complies with our standing assumption. The convex conjugate $\phi^*$ is the von Neumann entropy defined by
\[
\phi^*(x) = H(x) := \begin{cases} \sum_{j=1}^n x_j \ln(x_j) - x_j & \text{if } x \in \mathbb{R}^n_+, \\ +\infty & \text{otherwise}, \end{cases}
\]
where $0 \ln(0) := 0$. For the gradients we have $\nabla\phi(x) = \operatorname{Exp}(x) := (\exp(x_j))_{j=1}^n$ and $\nabla\phi^*(x) = \operatorname{Log}(x) := (\ln(x_j))_{j=1}^n$.
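To see this, note that the supremum defining the conjugate is separable, and a coordinate-wise computation gives
\[
\sup_{y_j \in \mathbb{R}} \bigl\{ x_j y_j - \exp(y_j) \bigr\}
= \begin{cases}
x_j \ln(x_j) - x_j & \text{if } x_j > 0, \text{ attained at } y_j = \ln(x_j), \\
0 & \text{if } x_j = 0, \\
+\infty & \text{if } x_j < 0,
\end{cases}
\]
so summing over $j$ yields the stated formula for $H$; inverting $\nabla\phi = \operatorname{Exp}$ coordinate-wise likewise gives $\nabla\phi^* = \operatorname{Log}$ on $\mathbb{R}^n_{++}$.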
In the course of the paper we often consider the Bregman distance $D_{\phi^*}$ generated by the dual reference function $\phi^*$:

Definition 2.2 (Bregman distance). We define the Bregman distance generated by $\phi^*$ as:
\[
D_{\phi^*}(x, y) := \begin{cases} \phi^*(x) - \phi^*(y) - \langle \nabla\phi^*(y), x - y \rangle & \text{if } x \in \operatorname{dom} \phi^*,\ y \in \operatorname{int} \operatorname{dom} \phi^*, \\ +\infty & \text{otherwise}. \end{cases}
\]
Thanks to [6, Theorem 3.7(iv)] we have that $D_{\phi^*}(x, y) = 0$ if and only if $x = y$.

For $\phi = \operatorname{SumExp}$ being the exponential reference function as in Example 2.1 the Bregman distance $D_{\phi^*}$ is the classical KL-divergence. Thanks to [6, Theorem 3.7(v)] we have the following key relation between $D_{\phi^*}$ and $D_\phi$:

Lemma 2.3. We have that the identity
\[
D_{\phi^*}(x, y) = D_\phi\bigl(\nabla\phi^*(y), \nabla\phi^*(x)\bigr)
\]
holds true for any $x, y \in \operatorname{int} \operatorname{dom} \phi^*$.
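For instance, for $\phi = \operatorname{SumExp}$ and $x, y \in \mathbb{R}^n_{++}$ a direct computation gives the (generalized) Kullback-Leibler divergence
\[
D_{\phi^*}(x, y) = \sum_{j=1}^n x_j \ln\Bigl(\frac{x_j}{y_j}\Bigr) - x_j + y_j ,
\]
and evaluating $D_\phi(u, v) = \sum_{j=1}^n \exp(u_j) - \exp(v_j) - \exp(v_j)(u_j - v_j)$ at $u = \operatorname{Log}(y)$, $v = \operatorname{Log}(x)$ recovers the same expression, in accordance with Lemma 2.3.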
3 Anisotropic proximal gradient
3.1 Anisotropic descent inequality
The algorithm that is developed in this paper is based on the following anisotropic descent
property generalizing [29, Remark 3.10] to reference functions ϕ which are not necessarily
super-coercive. Up to sign-flip it can also be regarded as a globalization of anisotropic prox-
regularity [28, Definition 2.13] specialized to smooth functions.
Definition 3.1 (anisotropic descent inequality (anisotropic smoothness)). Let $f \in \mathcal{C}^1(\mathbb{R}^n)$ such that the following constraint qualification holds true
\[
\operatorname{rge} \nabla f \subseteq \operatorname{rge} \nabla\phi . \tag{3.1}
\]
Then we say that $f$ satisfies the anisotropic descent property (is anisotropically smooth) relative to $\phi$ with constant $L > 0$ if for all $\bar{x} \in \mathbb{R}^n$
\[
f(x) \leq f(\bar{x}) + \tfrac{1}{L} \star \phi\bigl(x - \bar{x} + L^{-1} \nabla\phi^*(\nabla f(\bar{x}))\bigr) - \tfrac{1}{L} \star \phi\bigl(L^{-1} \nabla\phi^*(\nabla f(\bar{x}))\bigr) \quad \forall x \in \mathbb{R}^n . \tag{3.2}
\]
If $L = 1$ we say that $f$ satisfies the anisotropic descent property relative to $\phi$ without further mention of $L$.
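For instance, specializing (3.2) to the exponential reference function of Example 2.1, where $\nabla\phi^* = \operatorname{Log}$ and the constraint qualification (3.1) amounts to $\operatorname{rge} \nabla f \subseteq \mathbb{R}^n_{++}$, the inequality reads
\[
f(x) \leq f(\bar{x}) + \frac{1}{L} \sum_{j=1}^n \partial_j f(\bar{x}) \bigl( \exp(L (x_j - \bar{x}_j)) - 1 \bigr) \qquad \forall x \in \mathbb{R}^n,
\]
an exponential upper model that is tangent to $f$ at $\bar{x}$ and takes the place of the quadratic model of the Euclidean descent lemma.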
The following remarks are in order: