Anisotropic Proximal Gradient
Emanuel Laude∗ · Panagiotis Patrinos∗

∗KU Leuven, Department of Electrical Engineering (ESAT-STADIUS), Kasteelpark Arenberg 10, 3001 Leuven, Belgium. {emanuel.laude,panos.patrinos}@esat.kuleuven.be
December 24, 2024
Abstract
This paper studies a novel algorithm for nonconvex composite minimization which can
be interpreted in terms of dual space nonlinear preconditioning for the classical proximal
gradient method. The proposed scheme can be applied to additive composite minimization
problems whose smooth part exhibits an anisotropic descent inequality relative to a refer-
ence function. It is proved that the anisotropic descent property is closed under pointwise
average if the Bregman distance generated by the conjugate reference function is jointly
convex. More specifically, for the exponential reference function we prove its closedness
under pointwise conic combinations. We analyze the method’s asymptotic convergence
and prove its linear convergence under an anisotropic proximal gradient dominance con-
dition. Applications are discussed including exponentially regularized LPs and logistic
regression with nonsmooth regularization. In numerical experiments we show significant
improvements of the proposed method over its Euclidean counterparts.
Keywords: Bregman distance · proximal gradient method · duality · nonlinear preconditioning

AMS Subject Classification: 65K05 · 49J52 · 90C30
1 Introduction
1.1 Motivation and contributions
A classical first-order approach for the minimization of an additive composite problem is the
celebrated proximal gradient algorithm. In particular, if the gradient mapping of the smooth
part of the cost function is Lipschitz continuous, the algorithm converges with a constant
step-size, which avoids a linesearch procedure that can be expensive in terms of function
evaluations. However, there are many functions which do not exhibit a globally Lipschitz
continuous gradient mapping, and thus the proximal gradient method does not converge with
a constant step-size. For example, this is the case for problems involving exponential penalties.
It is well known that the proximal gradient method can be written in terms of a majorize-minimize
procedure with a quadratic upper bound. However, in some cases more structure of the cost
function is available. This offers the opportunity to construct problem-specific and potentially
tighter upper approximations that may lead to faster convergence while preserving the simplicity
of a first-order method. In this paper we study a novel, nonlinear extension of the proximal
gradient method, called the anisotropic proximal gradient method. In the same way that
the Euclidean proximal gradient method at each step minimizes a simple quadratic model,
the proposed method minimizes a certain problem-specific upper bound obtained from an
anisotropic generalization of the celebrated descent lemma (anisotropic smoothness) [29,28].
This is conceptually similar to the notion of relative smoothness and the Bregman proximal
gradient method [10,5,30]. In fact, in the convex non-composite case the proposed method
can be seen as a certain dual counterpart thereof in which the roles of the reference function and
the cost function are flipped [32]. The approach applies to, but is not limited to, entropically
regularized LPs, logistic regression and soft-maximum-type problems, covering problems that are
not handled by the Bregman proximal gradient method. In experiments we show promising
improvements over Euclidean methods, including a recent family of linesearch-free adaptive
proximal gradient methods [27,34]. More precisely, the contributions of this work are
summarized as follows:
• We introduce the notion of anisotropic smoothness, an extension of [29,28], that is
characterized in terms of an inequality akin to (and in fact a generalization of) the well-known
Euclidean descent inequality. In particular, we develop a calculus for anisotropic
smoothness providing equivalent characterizations and invariances that allow one to
assemble more complicated functions from elementary ones. For example, in the exponential
case we show that the property is closed under pointwise conic combinations,
which distinguishes it from the notion of Euclidean Lipschitz smoothness and the notion
of relative smoothness.

• Based on the anisotropic descent inequality we propose a novel algorithm that can be
seen as a nonlinearly preconditioned version of the Euclidean proximal gradient method.
For improved practicality we also consider a linesearch extension. We show that the
method converges subsequentially using an unconventional regularized gap function as
a measure of stationarity. In particular, this guarantees a sublinear rate with respect to the
stationarity gap.

• We prove the Q-linear convergence of the anisotropic proximal gradient method under
a non-Euclidean generalization of the proximal gradient dominated condition [25]. This
is closely related to the PL-inequality. We prove that the anisotropic proximal gradient
dominated condition is implied by anisotropic strong convexity of the smooth part. We
show that relative smoothness [5,30] is a dual characterization of anisotropic strong
convexity, generalizing [29] to reference functions ϕ which are not necessarily super-coercive.
Also note that for the Bregman proximal gradient method only an R-linear
convergence result is known [30].

• We also provide an equivalent interpretation in terms of a difference of Φ-convex
approach. In particular, this allows us to employ a transfer of smoothness via a certain
double-min duality to show linear convergence if the composite part is strongly convex
rather than the smooth part.

• In numerical experiments we show that the algorithm improves over its Euclidean
counterparts. On the task of regularized logistic regression we show that for small
regularization weights the algorithm achieves better performance than a recent family of
linesearch-free adaptive proximal gradient methods [27,34].
1.2 Related work
A popular nonlinear extension of the classical proximal gradient method is the Bregman
proximal gradient algorithm [10,5,30]. Similar to our approach the algorithm applies to
composite problems whose smooth part exhibits relative smoothness (in the Bregman sense),
a generalization of Lipschitz smoothness introduced by the same authors. For instance this
allows one to consider functions whose Hessians exhibit a certain polynomial growth [30].
Yet, there exist examples, such as problems with exponential penalties, for which it is hard
to construct “simple” reference functions. Simplicity of the reference function in this context
refers to an easy-to-invert gradient mapping and simple proximal mappings, both of which are
crucial for tractability of the algorithm. As a remedy we introduce the notion of anisotropic
smoothness as an alternative generalization of Lipschitz smoothness. The notion of anisotropic
smoothness considered in this paper is an extension of [29] relaxing the super-coercivity
assumption of the reference function. Up to sign-flip, anisotropic smoothness as introduced
in [29] can also be regarded as a globalization of anisotropic prox-regularity [28] specialized
to smooth functions. Alongside relative smoothness, [30,4] also propose the notion of relative
strong convexity, which allows the authors to study the linear convergence of the Bregman
(proximal) gradient method. In particular, [30] pose the open question of the existence of
a meaningful conjugate duality for relative smoothness and strong convexity akin to the
Euclidean case. Under full domain and super-coercivity of the reference function this is
addressed in [29] where it is shown that such a duality correspondence (in the convex case)
holds between relative smoothness and anisotropic strong convexity as well as relative strong
convexity and anisotropic smoothness. However, the super-coercivity restriction limits the
practical applicability of the framework. In this paper we further generalize these results
to reference functions which are not necessarily super-coercive, which allows us to include
the important exponential reference function. Expanding on [29] we show that anisotropic
smoothness is equivalent to the existence of an infimal convolution expression, or equivalently,
to a certain notion of generalized concavity [50], which is an important concept in optimal
transport [50]. In general, however, the pointwise sum of infimal convolutions is not an infimal
convolution [29], limiting the practical applicability. Averages of infimal convolutions also
appear in the context of the Bregman proximal average [51]. In particular the authors study
the preservation of its convexity for reference functions whose conjugates generate a jointly
convex Bregman distance. Building upon these results we show that this property furnishes a
sufficient condition for closedness of anisotropic smoothness under pointwise average even in
the nonconvex case, which is important for practice. Furthermore, for the exponential reference
function, we show that anisotropic smoothness is closed under pointwise conic combinations, a
property shared with the notion of quasi self-concordance [18,1]. In this case, a second-order
characterization of anisotropic smoothness reveals a close relation to the recent notion of
(L0, L1)-smoothness [52].
1.3 Paper organization
The remainder of this paper is organized as follows:

• In Section 2 we introduce the notation and some required basic concepts such as Bregman distances.

• In Section 3 we introduce the notion of anisotropic smoothness, propose the algorithm and discuss the well-definedness of its iterates under certain constraint qualifications.

• In Section 4 we develop a calculus for anisotropic smoothness and provide examples.

• In Section 5 we analyze the algorithm.

• In Section 6 we provide an equivalent interpretation in terms of a difference of Φ-convex approach.

• In Section 7 we consider applications based on the examples developed in Section 4.
2 Notation and preliminaries
We denote by $\langle\cdot,\cdot\rangle$ the standard Euclidean inner product on $\mathbb{R}^n$ and by $\|x\| := \sqrt{\langle x, x\rangle}$ for any $x \in \mathbb{R}^n$ the standard Euclidean norm on $\mathbb{R}^n$. In accordance with [36,42] we extend the classical arithmetic on $\mathbb{R}$ to the extended real line $\overline{\mathbb{R}} := \mathbb{R} \cup \{-\infty, +\infty\}$. We define upper addition $-\infty \mathbin{\dot{+}} \infty = \infty$ and lower addition $-\infty \mathbin{\underset{.}{+}} \infty = -\infty$, and accordingly upper subtraction $\infty \mathbin{\dot{-}} \infty = \infty$ and lower subtraction $\infty \mathbin{\underset{.}{-}} \infty = -\infty$. The effective domain of an extended real-valued function $f : \mathbb{R}^n \to \overline{\mathbb{R}}$ is denoted by $\operatorname{dom} f := \{x \in \mathbb{R}^n : f(x) < \infty\}$, and we say that $f$ is proper if $\operatorname{dom} f \neq \emptyset$ and $f(x) > -\infty$ for all $x \in \mathbb{R}^n$; lower semicontinuous (lsc) if $f(\bar{x}) \leq \liminf_{x \to \bar{x}} f(x)$ for all $\bar{x} \in \mathbb{R}^n$; super-coercive if $f(x)/\|x\| \to \infty$ as $\|x\| \to \infty$. We define by $\Gamma_0(\mathbb{R}^n)$ the class of all proper, lsc convex functions $f : \mathbb{R}^n \to \overline{\mathbb{R}}$. For any functions $f : \mathbb{R}^n \to \overline{\mathbb{R}}$ and $g : \mathbb{R}^n \to \overline{\mathbb{R}}$ we define the infimal convolution or epi-addition of $f$ and $g$ as $(f \mathbin{\square} g)(x) = \inf_{y \in \mathbb{R}^n} g(x - y) + f(y)$. For a proper function $f : \mathbb{R}^n \to \overline{\mathbb{R}}$ and $\lambda \geq 0$ we define the epi-scaling $(\lambda \star f)(x) = \lambda f(\lambda^{-1} x)$ for $\lambda > 0$ and $(\lambda \star f)(x) = \delta_{\{0\}}(x)$ otherwise. We adopt the operator precedence for epi-multiplication and epi-addition from the pointwise case. We denote by $\mathbb{R}^n_+ = [0, +\infty)^n$ the nonnegative orthant and by $\mathbb{R}^n_{++} = (0, +\infty)^n$ the positive orthant. Let $g^{\vee} := g(-(\cdot))$ denote the reflection of $g$. Since convex conjugation and reflection commute we adopt the notation $(g^{\vee})^* = (g^*)^{\vee} =: g^{\vee *}$. We adopt the notions of essential smoothness, essential strict convexity and Legendre functions from [40, Section 26]: we say that a function $f \in \Gamma_0(\mathbb{R}^n)$ is essentially smooth, if $\operatorname{int}(\operatorname{dom} f) \neq \emptyset$ and $f$ is differentiable on $\operatorname{int}(\operatorname{dom} f)$ such that $\|\nabla f(x^{\nu})\| \to \infty$ whenever $\operatorname{int}(\operatorname{dom} f) \ni x^{\nu} \to x \in \operatorname{bdry} \operatorname{dom} f$; essentially strictly convex, if $f$ is strictly convex on every convex subset of $\operatorname{dom} \partial f$; and Legendre, if $f$ is both essentially smooth and essentially strictly convex. We denote by $\mathbb{N}_0 := \mathbb{N} \cup \{0\}$. Otherwise we adopt the notation from [42].
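For instance, specializing the epi-scaling to the exponential reference function $\operatorname{SumExp}(x) := \sum_{j=1}^n \exp(x_j)$ that serves as the running example below (Example 2.1), with $\lambda = 1/L$ for some $L > 0$, gives
\[
\Bigl(\tfrac{1}{L} \star \operatorname{SumExp}\Bigr)(x) = \tfrac{1}{L} \operatorname{SumExp}(L x) = \tfrac{1}{L} \sum_{j=1}^n \exp(L x_j),
\]
which is precisely the kind of expression that reappears in the anisotropic descent inequality (3.2) below.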
Our concepts and algorithm involve a reference function $\phi$ which complies with the following standing requirement which is assumed valid across the entire paper unless stated otherwise:

(A1) $\phi \in \Gamma_0(\mathbb{R}^n)$ is Legendre with $\operatorname{dom} \phi = \mathbb{R}^n$.

The dual reference function $\phi^*$ is Legendre as well [40, Theorem 26.3] but does not necessarily have full domain. Instead, thanks to a conjugate duality between super-coercivity and full domain [6, Proposition 2.16], $\phi^*$ is super-coercive. The gradient mapping $\nabla\phi : \mathbb{R}^n \to \operatorname{int} \operatorname{dom} \phi^*$ is a diffeomorphism between $\mathbb{R}^n$ and $\operatorname{int} \operatorname{dom} \phi^*$ with inverse $(\nabla\phi)^{-1} = \nabla\phi^*$ [40, Theorem 26.5]. In fact, super-coercivity is a standard assumption for the mirror-descent and Bregman (proximal) gradient algorithm as it ensures well-definedness of the mirror update [5, Lemma 2(ii)]. Since the framework that will be developed in this paper corresponds to a certain dual Bregman approach, the full-domain restriction is somewhat natural.
The running example considered in this paper is the exponential reference function:
Example 2.1. Let $\phi : \mathbb{R}^n \to \mathbb{R}$ be defined by $\phi(x) = \operatorname{SumExp}(x) := \sum_{j=1}^n \exp(x_j)$. Then $\phi \in \Gamma_0(\mathbb{R}^n)$ is Legendre with full domain and thus complies with our standing assumption. The convex conjugate $\phi^*$ is the von Neumann entropy defined by
\[
\phi^*(x) = H(x) := \begin{cases} \sum_{j=1}^n x_j \ln(x_j) - x_j & \text{if } x \in \mathbb{R}^n_+, \\ +\infty & \text{otherwise}, \end{cases}
\]
where $0 \ln(0) := 0$. For the gradients we have $\nabla\phi(x) = \operatorname{Exp}(x) := (\exp(x_j))_{j=1}^n$ and $\nabla\phi^*(x) = \operatorname{Log}(x) := (\ln(x_j))_{j=1}^n$.
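To see this, note that the supremum defining the conjugate is separable, and a coordinate-wise computation gives
\[
\sup_{y_j \in \mathbb{R}} \bigl\{ x_j y_j - \exp(y_j) \bigr\}
= \begin{cases}
x_j \ln(x_j) - x_j & \text{if } x_j > 0, \text{ attained at } y_j = \ln(x_j), \\
0 & \text{if } x_j = 0, \\
+\infty & \text{if } x_j < 0,
\end{cases}
\]
so summing over $j$ yields the stated formula for $H$; inverting $\nabla\phi = \operatorname{Exp}$ coordinate-wise likewise gives $\nabla\phi^* = \operatorname{Log}$ on $\mathbb{R}^n_{++}$.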
In the course of the paper we often consider the Bregman distance $D_{\phi^*}$ generated by the dual reference function $\phi^*$:

Definition 2.2 (Bregman distance). We define the Bregman distance generated by $\phi^*$ as:
\[
D_{\phi^*}(x, y) := \begin{cases} \phi^*(x) - \phi^*(y) - \langle \nabla\phi^*(y), x - y \rangle & \text{if } x \in \operatorname{dom} \phi^*,\ y \in \operatorname{int} \operatorname{dom} \phi^*, \\ +\infty & \text{otherwise}. \end{cases}
\]
Thanks to [6, Theorem 3.7(iv)] we have that $D_{\phi^*}(x, y) = 0$ if and only if $x = y$.

For $\phi = \operatorname{SumExp}$ being the exponential reference function as in Example 2.1 the Bregman distance $D_{\phi^*}$ is the classical KL-divergence. Thanks to [6, Theorem 3.7(v)] we have the following key relation between $D_{\phi^*}$ and $D_\phi$:

Lemma 2.3. We have that the identity
\[
D_{\phi^*}(x, y) = D_\phi\bigl(\nabla\phi^*(y), \nabla\phi^*(x)\bigr)
\]
holds true for any $x, y \in \operatorname{int} \operatorname{dom} \phi^*$.
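For instance, for $\phi = \operatorname{SumExp}$ and $x, y \in \mathbb{R}^n_{++}$ a direct computation gives the (generalized) Kullback-Leibler divergence
\[
D_{\phi^*}(x, y) = \sum_{j=1}^n x_j \ln\Bigl(\frac{x_j}{y_j}\Bigr) - x_j + y_j ,
\]
and evaluating $D_\phi(u, v) = \sum_{j=1}^n \exp(u_j) - \exp(v_j) - \exp(v_j)(u_j - v_j)$ at $u = \operatorname{Log}(y)$, $v = \operatorname{Log}(x)$ recovers the same expression, in accordance with Lemma 2.3.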
3 Anisotropic proximal gradient
3.1 Anisotropic descent inequality
The algorithm that is developed in this paper is based on the following anisotropic descent
property generalizing [29, Remark 3.10] to reference functions ϕ which are not necessarily
super-coercive. Up to sign-flip it can also be regarded as a globalization of anisotropic prox-
regularity [28, Definition 2.13] specialized to smooth functions.
Definition 3.1 (anisotropic descent inequality (anisotropic smoothness)). Let $f \in \mathcal{C}^1(\mathbb{R}^n)$ such that the following constraint qualification holds true
\[
\operatorname{rge} \nabla f \subseteq \operatorname{rge} \nabla\phi . \tag{3.1}
\]
Then we say that $f$ satisfies the anisotropic descent property (is anisotropically smooth) relative to $\phi$ with constant $L > 0$ if for all $\bar{x} \in \mathbb{R}^n$
\[
f(x) \leq f(\bar{x}) + \tfrac{1}{L} \star \phi\bigl(x - \bar{x} + L^{-1} \nabla\phi^*(\nabla f(\bar{x}))\bigr) - \tfrac{1}{L} \star \phi\bigl(L^{-1} \nabla\phi^*(\nabla f(\bar{x}))\bigr) \quad \forall x \in \mathbb{R}^n . \tag{3.2}
\]
If $L = 1$ we say that $f$ satisfies the anisotropic descent property relative to $\phi$ without further mention of $L$.
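For instance, specializing (3.2) to the exponential reference function of Example 2.1, where $\nabla\phi^* = \operatorname{Log}$ and the constraint qualification (3.1) amounts to $\operatorname{rge} \nabla f \subseteq \mathbb{R}^n_{++}$, the inequality reads
\[
f(x) \leq f(\bar{x}) + \frac{1}{L} \sum_{j=1}^n \partial_j f(\bar{x}) \bigl( \exp(L (x_j - \bar{x}_j)) - 1 \bigr) \qquad \forall x \in \mathbb{R}^n,
\]
an exponential upper model that is tangent to $f$ at $\bar{x}$ and takes the place of the quadratic model of the Euclidean descent lemma.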
The following remarks are in order: