Faster Projection-Free Augmented Lagrangian Methods via
Weak Proximal Oracle
Dan Garber
dangar@technion.ac.il
Tsur Livney
tsur.livney@campus.technion.ac.il
Shoham Sabach
ssabach@technion.ac.il
Technion - Israel Institute of Technology
Abstract
This paper considers a convex composite optimization problem with affine constraints,
which includes problems that take the form of minimizing a smooth convex objective func-
tion over the intersection of (simple) convex sets, or regularized with multiple (simple)
functions. Motivated by high-dimensional applications in which exact projection/proximal
computations are not tractable, we propose a projection-free augmented Lagrangian-based
method, in which primal updates are carried out using a weak proximal oracle (WPO). In an
earlier work, WPO was shown to be more powerful than the standard linear minimization
oracle (LMO) that underlies conditional gradient-based methods (aka Frank-Wolfe meth-
ods). Moreover, WPO is computationally tractable for many high-dimensional problems of
interest, including those motivated by recovery of low-rank matrices and tensors, and opti-
mization over polytopes which admit efficient LMOs. The main result of this paper shows
that under a certain curvature assumption (which is weaker than strong convexity), our
WPO-based algorithm achieves an ergodic rate of convergence of O(1/T) for both the objective
residual and feasibility gap. This result, to the best of our knowledge, improves upon
the O(1/√T) rate for existing LMO-based projection-free methods for this class of prob-
lems. Empirical experiments on a low-rank and sparse covariance matrix estimation task
and the Max Cut semidefinite relaxation demonstrate that our method can outperform
state-of-the-art LMO-based Lagrangian methods.
1 Introduction
Throughout the paper, we consider the following minimization problem:
$$\min_{x \in E_1,\, y \in E_2} \; f(x) + R_X(x) + R_Y(y) \quad \text{s.t.} \quad Ax = y, \tag{OP}$$
where $E_1$ and $E_2$ are finite Euclidean spaces, $f : E_1 \to \mathbb{R}$ is a convex and $\beta$-smooth function,
$R_X : E_1 \to (-\infty, \infty]$ and $R_Y : E_2 \to (-\infty, \infty]$ are proper, lower semicontinuous and convex
functions, and $A : E_1 \to E_2$ is a linear mapping.
Problems that fall into the model (OP) appear in many interesting and important active
research areas such as machine learning, signal processing, statistics, and more. For example,
the recovery of a matrix or a tensor which is both sparse and of low rank is useful in problems
such as covariance matrix estimation [2, 15, 38], graph and image denoising [4, 5, 45, 46] and
link prediction [32, 34, 44]. More applications of Problem (OP) can be found in brain mapping
[36, 31, 23] and multiple sequence alignment [11, 10, 28, 42].
*Authors are ordered alphabetically.
arXiv:2210.13968v2 [math.OC] 21 Feb 2023
An important family of efficient methods for solving problems in the form of model (OP)
are Lagrangian-based methods, and most notably augmented Lagrangian methods [35, 25, 37].
Starting with the classical proximal method of multipliers [39], and until more recently [40,
12, 13, 9], such methods, which are based on proximal/projection computations (due to the
nonsmooth functions $R_X$ and $R_Y$ in Problem (OP)), have been successfully developed and
corresponding provable convergence rates have been established. However, in many cases of interest
such proximal/projection computations are not tractable in high-dimensional problems, for
instance when either $R_X$ or $R_Y$ is a nuclear norm regularizer for matrices, which underlies many
recovery problems of low-rank matrices and tensors (see, for instance, [7, 6, 16]), or an indicator
function of a polytope.
Thus, with the growing interest in recent years in so-called projection-free
methods, which are mostly based on the use of linear minimization oracles (LMOs) instead
of proximal/projection oracles through the Frank-Wolfe method (aka conditional gradient; see
for instance [26]), and which are often much more efficient to implement for high-dimensional problems
(e.g., in case $R_X$ or $R_Y$ is a nuclear norm regularizer or an indicator function of a polytope
which captures some well-studied combinatorial structure; see for instance [26, 24]), such methods have
also been studied for this setting [33, 43, 41]. However, these methods suffer from slow convergence rates compared to
their proximal/projection-based counterparts, with the worst-case guaranteed convergence rate
being at best O(1/√T), where T is the number of iterations executed. This rate is not known to
be improvable even under additional standard curvature assumptions such as strong convexity
of the function f(·) in Problem (OP)¹. A recent attempt to obtain faster projection-free methods
under relatively mild assumptions has been made in [22]; however, as we discuss in detail in
the appendix (see Section A), there is a major problem with their proof which does not seem
easily fixable. We also refer the interested reader to the excellent discussions in [22] on major
issues with other previous attempts to prove faster rates for projection-free methods.
For the simpler problem of minimizing a smooth convex objective function over a convex
and compact set, and in particular in case the feasible set is either a polytope or a nuclear
norm ball of matrices or a spectrahedron (set of positive semidefinite matrices with unit trace),
several recent works showed how simple modifications of the Frank-Wolfe method can lead
to provably faster convergence rates, under standard curvature assumptions, see for instance
[19, 30, 3, 17, 1, 20]. Thus, in the context of the significantly more complex Problem (OP), our
work considers the following natural question:
Can we design a projection-free augmented Lagrangian-based method that, at least under
standard curvature assumptions, improves upon the current O(1/√T) convergence rate?
We answer this question in the affirmative by providing a projection-free method with a
rate of O(1/T), both in terms of the objective function residual and the feasibility gap of the
affine constraint in Problem (OP).
Our approach departs in two aspects from previous projection-free methods, which guarantee only a rate
of O(1/√T). First, as already suggested, we make a curvature assumption on
Problem (OP): we introduce a curvature condition we call primal quadratic gap (see definition
in the sequel). In particular, this condition holds whenever the smooth function f(·) in Problem
(OP) is strongly convex, but also holds in case f(·) is a composition of a strongly convex function
with a linear transformation (e.g., a least squares objective, which need not be strongly convex)
and $R_X$, $R_Y$ are indicators of polytopes. Second, while previous projection-free methods rely
on the availability of a linear minimization oracle (LMO), in this work we consider a slightly
stronger oracle which was already considered in recent works [1, 21, 20] (these however do
not apply to problems such as Problem (OP), which includes affine constraints), namely the
weak proximal oracle (WPO)². In a nutshell, this oracle solves a certain relaxed version of the
¹This is not surprising since it is well-known that in general, and as opposed to projection/proximal-based
methods, the Frank-Wolfe method does not benefit from strong convexity; see for instance discussions in [19, 17, 1].
²The term “weak proximal oracle” was originally coined in [20].
proximal/projection problem, which can still be much more efficient to solve than the standard
proximal/projection problem, but can provide more informative directions than those of the
LMO. Two prime examples for the efficiency of implementing the WPO are when (i) $R_X$ or
$R_Y$ is an indicator function of a polytope which admits an efficient LMO, in which case the WPO
can be implemented based on a single call to the LMO of the polytope, and (ii) $R_X$ or $R_Y$
is an indicator function/regularizer corresponding to the matrix nuclear norm and a unique
low-rank optimal solution exists, in which case implementing the WPO amounts to a low-rank SVD
computation with rank matching that of the low-rank optimal solution, which is much more
efficient than a proximal/projection computation that generally requires a full-rank SVD (see
a detailed discussion in [1, 20, 21]).
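To make example (ii) concrete, the following toy sketch (Python with NumPy; the dimensions and the dense-SVD implementation are illustrative, not from the paper) shows the rank-r truncated SVD that dominates the cost of the WPO in the nuclear norm setup. When the prox target itself has rank r, the truncated factorization already recovers it exactly, whereas an exact proximal step would require a full-rank SVD.

```python
import numpy as np

def top_r_svd(M, r):
    """Rank-r truncated SVD -- the dominant computation of the WPO in the
    nuclear-norm setup.  (Here via a full dense SVD for simplicity; in high
    dimensions one would use iterative methods that only touch r components.)"""
    U, s, Vt = np.linalg.svd(M, full_matrices=False)
    return U[:, :r], s[:r], Vt[:r, :]

# If the target matrix happens to have rank r, the rank-r SVD recovers it exactly.
rng = np.random.default_rng(1)
M = rng.standard_normal((6, 2)) @ rng.standard_normal((2, 5))  # rank-2, 6x5
U, s, Vt = top_r_svd(M, 2)
assert np.allclose(U @ np.diag(s) @ Vt, M)
```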
The combination of these two ingredients, a curvature condition and the weak proximal
oracle, to obtain faster convergence rates for projection-free methods should not come as a
surprise since it was already instrumental in achieving similar improvements for projection-free
methods in settings that do not include affine constraints as in model (OP), see for instance
[19, 30, 17, 1, 20]³. To the best of our knowledge, this is the first time such an approach is used
for a problem of the form of model (OP).
| Algorithm | Oracle | Rate | Oracle implementation in polytope setup | Oracle implementation in nuclear norm setup | Assumptions |
|---|---|---|---|---|---|
| Proximal Method of Multipliers [39, 40] | proximal | O(1/T) | projection | full-rank SVD | |
| Accelerated Primal Dual [9] | proximal | O(1/T²) | projection | full-rank SVD | strong convexity |
| Conditional Gradient Augmented Lagrangian [43] | LMO | O(1/√T) | LMO | rank-one SVD | |
| This work | weak proximal | O(1/T) | LMO plus convex quadratic opt. over simplex | rank(x*)-SVD | primal quadratic gap (weaker than strong convexity) |

Table 1: Comparison of augmented Lagrangian-based methods with different optimization or-
acles for Problem (OP). The “Rate” column specifies the convergence rate only in terms of
the number of iterations T and suppresses all other quantities. “Polytope setup” in the fourth
column refers to a setting in which $R_X$ is an indicator function of some compact and convex
polytope and $R_Y$ is some proximal-friendly function. “Nuclear norm setup” in the fifth column
refers to a setting in which $R_X$ is a matrix nuclear norm regularizer and $R_Y$ is some proxi-
mal-friendly function. x* denotes the optimal solution, which is assumed to be unique. Both
columns specify the dominating cost of implementing the appropriate optimization oracle for
the x variable.
1.1 Paper organization
In Section 2, we discuss the augmented Lagrangian approach for solving Problem (OP)
and present the Primal Quadratic Gap (PQG) property needed for our algorithm's analysis. We
also recall the notion of the Weak Proximal Oracle that will be used in our algorithm and
discuss its implementation in several important scenarios. In the Appendix (see Section 3), we
give examples of problems of interest, for which our algorithm might be appealing to use. In
³While technically [19, 30] rely on the use of a standard LMO for the feasible set (which they assume to be
a polytope), as we show in the sequel, the way they use the output of the LMO to construct the new descent
direction is very similar to the implementation of a WPO. In particular, we rely on observations from [19] to
construct an efficient WPO for polytopes.
Section 4, we develop our algorithm and prove our main rate of convergence result. In Section
5, we demonstrate the empirical performance of our algorithm.
2 Preliminaries
2.1 Notation
Throughout the paper, we will use the following notation to simplify the presentation and
developments. We use the compact notations $q := (x, y) \in E_1 \times E_2 =: E$ and
$R_Q(q) := R_X(x) + R_Y(y)$. In addition, we define the linear mapping $K := [A, -I] : E \to E_2$,
where $I$ is the identity linear mapping. This way we can compactly write the constraint $Ax = y$
as $Kq = 0$. For any finite Euclidean space $V$ of dimension $n$, we denote the standard Euclidean
inner product of any two points $x := [x_1, \dots, x_n]^\top$ and $y := [y_1, \dots, y_n]^\top \in V$ by $\langle x, y\rangle :=
\sum_{i=1}^n x_i y_i$, and we let $\|x\| := \sqrt{\langle x, x\rangle}$ and $\|x\|_1 := \sum_{i=1}^n |x_i|$ denote the standard Euclidean
norm and the $\ell_1$ norm, respectively. For any two finite Euclidean spaces $V_1$ and $V_2$, the spectral
norm of a linear mapping $\mathcal{T} : V_1 \to V_2$ is denoted by $\|\mathcal{T}\| := \max_{x \in V_1}\{\|\mathcal{T}x\| : \|x\| = 1\}$.
While both norms are denoted the same, throughout the paper it will be clear from the context
which of the norms is used in each appearance. In addition, $\|X\|_F$ denotes the Frobenius norm
of a matrix $X$, $\|X\|_{\mathrm{nuc}}$ denotes its nuclear norm, and $\mathrm{tr}(X)$ denotes its trace. We denote by
$\mathbb{S}^d_+$ the set of all positive semidefinite matrices of size $d \times d$ and by $\mathcal{S}_\tau$ the spectrahedron of all
matrices in $\mathbb{S}^d_+$ with trace $\tau$. We use the notation $\delta_C(\cdot)$ to denote the indicator function of a set
$C$.
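As a quick numerical illustration of this notation (a toy sketch with NumPy; the matrix sizes are arbitrary, not from the paper), stacking $q = (x, y)$ and forming $K = [A, -I]$ turns the affine constraint $Ax = y$ into $Kq = 0$:

```python
import numpy as np

# Hypothetical small instance: E1 = R^3, E2 = R^2.
rng = np.random.default_rng(0)
A = rng.standard_normal((2, 3))

# K = [A, -I] acts on the stacked variable q = (x, y).
K = np.hstack([A, -np.eye(2)])

x = rng.standard_normal(3)
y = A @ x                     # choose y feasible, so Ax = y
q = np.concatenate([x, y])

# Kq = Ax - y, so the affine constraint Ax = y is exactly Kq = 0.
assert np.allclose(K @ q, A @ x - y)
assert np.allclose(K @ q, np.zeros(2))
```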
2.2 The Augmented Lagrangian
The augmented Lagrangian (AL) of Problem (OP) with a multiplier (dual variable) $w \in E_2$ is
defined as
$$\mathcal{L}_\rho(q, w) \equiv \mathcal{L}_\rho(x, y, w) := f(x) + R_Q(q) + \langle w, Kq\rangle + \frac{\rho}{2}\|Kq\|^2,$$
where $\rho > 0$ is a penalty parameter associated with the linear equality constraint. Note that the
(standard) Lagrangian of Problem (OP) is recovered when $\rho = 0$.
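For concreteness, a minimal sketch of evaluating $\mathcal{L}_\rho(q, w)$ numerically (Python/NumPy; the instance, with $f(x) = \frac{1}{2}\|x\|^2$ and zero regularizers, is purely illustrative):

```python
import numpy as np

def augmented_lagrangian(x, y, w, f, R_X, R_Y, A, rho):
    """Evaluate L_rho(q, w) = f(x) + R_X(x) + R_Y(y) + <w, Kq> + (rho/2)||Kq||^2,
    where Kq = Ax - y.  (A sketch for dense NumPy data; f, R_X, R_Y are callables.)"""
    Kq = A @ x - y
    return f(x) + R_X(x) + R_Y(y) + w @ Kq + 0.5 * rho * Kq @ Kq

# Toy instance: f(x) = 0.5 * ||x||^2, no regularizers.
A = np.array([[1.0, 2.0]])
x = np.array([1.0, 1.0]); y = np.array([3.0]); w = np.array([0.5]); rho = 2.0
val = augmented_lagrangian(x, y, w, lambda x: 0.5 * x @ x,
                           lambda x: 0.0, lambda y: 0.0, A, rho)
```

At a feasible pair (Kq = 0), the multiplier and penalty terms vanish, so the AL reduces to the primal objective, here f(x) = 1.0.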
We also require the following two standard assumptions, which we assume to hold true
throughout the paper.
Assumption 1. The augmented Lagrangian has a saddle point, i.e., there exists a point
$(q^*, w^*) \in E \times E_2$ satisfying
$$\mathcal{L}_\rho(q^*, w) \le \mathcal{L}_\rho(q^*, w^*) \le \mathcal{L}_\rho(q, w^*), \tag{1}$$
for all $q \in E$ and $w \in E_2$.
Assumption 2 (Slater's Condition). There exist $x \in \mathrm{ri}(\mathrm{dom}(f) \cap \mathrm{dom}(R_X))$ and $y \in \mathrm{ri}(\mathrm{dom}(R_Y))$
such that $Ax = y$, where $\mathrm{ri}$ denotes the relative interior of a set.
We denote by $\mathcal{P}^*$ the set of optimal solutions of the primal problem (OP), and by $\mathcal{D}^*$ the
set of optimal solutions of the associated dual problem.
Under the above two assumptions, and thanks to the convexity of Problem (OP), strong
duality holds. As a result, and thanks to Proposition 2, the set of saddle points of $\mathcal{L}_\rho$ is non-
empty and corresponds to the set of all pairs $(q^*, w^*)$, where $q^* \in \mathcal{P}^*$ and $w^* \in \mathcal{D}^*$ (see also
Appendix B.3).
In order to find solutions of Problem (OP), we will solve the following equivalent saddle
point problem
$$\min_{q \in E} \max_{w \in E_2} \mathcal{L}_\rho(q, w), \tag{2}$$
whose optimal solutions are the saddle points of $\mathcal{L}_\rho$.
From now on, we will denote the optimal objective function value by $\mathcal{L}_\rho(q^*, w^*)$, for any
saddle point $(q^*, w^*)$, and for short we will write $\mathcal{L}^*_\rho$.
The smooth part of the augmented Lagrangian (SAL) of Problem (OP) is defined, for any
$q \in E$ and $w \in E_2$, by
$$S(q, w) \equiv S(x, y, w) := f(x) + \langle w, Kq\rangle + \frac{\rho}{2}\|Kq\|^2. \tag{3}$$
The following result will be essential to our developments in the sequel. Its simple proof is
deferred to the appendix.
Lemma 2.1. The function $q \mapsto S(q, w)$, for any fixed $w \in E_2$, is smooth with parameter
$\beta_S = \beta + \rho(\|A\| + 1)^2$.
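Lemma 2.1 can be spot-checked numerically: for a quadratic $f(x) = \frac{\beta}{2}\|x\|^2$, the map $q \mapsto S(q, w)$ has the constant Hessian $H = \mathrm{blkdiag}(\beta I, 0) + \rho K^\top K$, whose largest eigenvalue is the exact Lipschitz constant of $\nabla_q S$. The sketch below (NumPy; the random dimensions are arbitrary) confirms it never exceeds $\beta + \rho(\|A\| + 1)^2$.

```python
import numpy as np

# Quadratic instance f(x) = (beta/2)||x||^2 with E1 = R^3, E2 = R^2:
# grad_q S(q, w) has constant Hessian H = blkdiag(beta*I, 0) + rho * K^T K,
# with K = [A, -I].
rng = np.random.default_rng(2)
beta, rho = 1.5, 0.7
A = rng.standard_normal((2, 3))
K = np.hstack([A, -np.eye(2)])
H = np.zeros((5, 5))
H[:3, :3] = beta * np.eye(3)
H += rho * K.T @ K

smoothness = np.linalg.eigvalsh(H).max()        # exact smoothness constant of S
norm_A = np.linalg.svd(A, compute_uv=False)[0]  # spectral norm ||A||
assert smoothness <= beta + rho * (norm_A + 1) ** 2 + 1e-9
```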
2.2.1 Primal Quadratic Gap
We now study a new property of the smooth function $q \mapsto S(q, w)$, for any fixed $w \in E_2$, which
we refer to as the Primal Quadratic Gap (PQG).
Definition 1 (Primal Quadratic Gap). We say that Problem (OP) satisfies the Primal Quadratic
Gap property with a parameter $\alpha_S > 0$ if, for any $q \in \mathrm{dom}(R_Q)$, the point $q^* := \arg\min_{\bar q \in \mathcal{P}^*}\|\bar q
- q\|^2 \in \mathcal{P}^*$ satisfies, for all $w \in E_2$, the following inequality:
$$\langle q^* - q, \nabla_q S(q, w)\rangle \le S(q^*, w) - S(q, w) - \frac{\alpha_S}{2}\|q^* - q\|^2. \tag{4}$$
This property is a weaker version of strong convexity: instead of assuming that
inequality (4) holds for any two points, we only require that it holds between any point $q \in
\mathrm{dom}(R_Q)$ and the corresponding closest optimal solution of Problem (OP).
Example 1. If the function $f(x)$ is strongly convex, then the function $q \mapsto S(q, w)$, for any
fixed $w \in E_2$, is strongly convex with a certain parameter $\alpha_S$. This implies that in this case the
SAL $S(q, w)$ satisfies the primal quadratic gap property with the same parameter $\alpha_S$.
Theorem 1. Suppose that $f : E_1 \to \mathbb{R}$ is $\alpha$-strongly convex. Then, Problem (OP) admits a
unique primal optimal solution $q^*$ and it satisfies the PQG property with the parameter $\alpha_S =
\min\{\frac{\alpha}{2}, \frac{\alpha\rho}{\alpha + 2\rho\|A\|^2}\} > 0$. In particular, the function $q \mapsto S(q, w)$, for any fixed $w \in E_2$, is strongly
convex.
The proof is deferred to the Appendix (see Section B).
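As a sanity check of Theorem 1 (a numerical spot-check on one toy instance, not a proof), consider the quadratic case $f(x) = \frac{\alpha}{2}\|x\|^2$, for which $q \mapsto S(q, w)$ has the constant Hessian $\mathrm{blkdiag}(\alpha I, 0) + \rho K^\top K$; its smallest eigenvalue should then be at least the predicted modulus $\alpha_S = \min\{\frac{\alpha}{2}, \frac{\alpha\rho}{\alpha + 2\rho\|A\|^2}\}$:

```python
import numpy as np

# Toy instance f(x) = (alpha/2)||x||^2 with E1 = R^2, E2 = R:
# the SAL q -> S(q, w) has constant Hessian
#   H = blkdiag(alpha*I, 0) + rho * K^T K,   K = [A, -I],
# and Theorem 1 predicts a strong-convexity modulus of at least
#   alpha_S = min{alpha/2, alpha*rho / (alpha + 2*rho*||A||^2)}.
alpha, rho = 1.0, 1.0
A = np.array([[1.0, 2.0]])
K = np.hstack([A, -np.eye(1)])
H = np.zeros((3, 3))
H[:2, :2] = alpha * np.eye(2)
H += rho * K.T @ K

norm_A_sq = np.linalg.svd(A, compute_uv=False)[0] ** 2   # ||A||^2 = 5
alpha_S = min(alpha / 2, alpha * rho / (alpha + 2 * rho * norm_A_sq))
assert np.linalg.eigvalsh(H).min() >= alpha_S            # ~0.146 >= 1/11
```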
Example 2. If $R_Q$ is an indicator function of a polytope, we can show that the PQG property
holds true even when $f(\cdot)$ is not strongly convex. Instead, we make the following
assumption.
Assumption 3. (i) $f \equiv g \circ \mathcal{B}$, where $\mathcal{B} : E_1 \to E_3$ is a linear mapping and $g : E_3 \to \mathbb{R}$ is
$\alpha_g$-strongly convex.
(ii) $R_Q(q) := R_X(x) + R_Y(y)$ is an indicator of a compact and convex polytope $\mathcal{F} \equiv \{q \in E :
Cq \le b\}$, where $C : E \to \mathbb{R}^p$ is a linear mapping and $b \in \mathbb{R}^p$.
Theorem 2. Suppose that Assumption 3 holds true. Then, there exists a constant $\sigma > 0$ such
that if $\rho \ge \alpha_g$, then Problem (OP) satisfies the PQG property with the parameter $\alpha_S = \alpha_g \sigma^{-1}$.
The proof is deferred to the Appendix (see Section B).