Faster Projection-Free Augmented Lagrangian Methods via
Weak Proximal Oracle
Dan Garber
dangar@technion.ac.il
Tsur Livney
tsur.livney@campus.technion.ac.il
Shoham Sabach
ssabach@technion.ac.il
Technion - Israel Institute of Technology
Abstract
This paper considers a convex composite optimization problem with affine constraints,
which includes problems that take the form of minimizing a smooth convex objective func-
tion over the intersection of (simple) convex sets, or regularized with multiple (simple)
functions. Motivated by high-dimensional applications in which exact projection/proximal
computations are not tractable, we propose a projection-free augmented Lagrangian-based
method, in which primal updates are carried out using a weak proximal oracle (WPO). In an
earlier work, WPO was shown to be more powerful than the standard linear minimization
oracle (LMO) that underlies conditional gradient-based methods (aka Frank-Wolfe meth-
ods). Moreover, WPO is computationally tractable for many high-dimensional problems of
interest, including those motivated by recovery of low-rank matrices and tensors, and opti-
mization over polytopes which admit efficient LMOs. The main result of this paper shows
that under a certain curvature assumption (which is weaker than strong convexity), our
WPO-based algorithm achieves an ergodic rate of convergence of O(1/T) for both the objective
residual and feasibility gap. This result, to the best of our knowledge, improves upon
the O(1/√T) rate for existing LMO-based projection-free methods for this class of prob-
lems. Empirical experiments on a low-rank and sparse covariance matrix estimation task
and the Max Cut semidefinite relaxation demonstrate that our method can outperform
state-of-the-art LMO-based Lagrangian methods.
1 Introduction
Throughout the paper, we consider the following minimization problem:
$$\min_{x \in E_1,\, y \in E_2} \; f(x) + R_X(x) + R_Y(y) \quad \text{s.t.} \quad Ax = y, \tag{OP}$$
where $E_1$ and $E_2$ are finite Euclidean spaces, $f : E_1 \to \mathbb{R}$ is a convex and $\beta$-smooth function,
$R_X : E_1 \to (-\infty, \infty]$ and $R_Y : E_2 \to (-\infty, \infty]$ are proper, lower semicontinuous and convex
functions, and $A : E_1 \to E_2$ is a linear mapping.
Problems that fall into the model (OP) appear in many interesting and important active
research areas such as machine learning, signal processing, statistics, and more. For example,
the recovery of a matrix or a tensor which is both sparse and of low rank is useful in problems
such as covariance matrix estimation [2, 15, 38], graph and image denoising [4, 5, 45, 46] and
link prediction [32, 34, 44]. More applications of Problem (OP) can be found in brain mapping
[36, 31, 23] and multiple sequence alignment [11, 10, 28, 42].
*Authors are ordered alphabetically.
arXiv:2210.13968v2 [math.OC] 21 Feb 2023
An important family of efficient methods for solving problems in the form of model (OP)
are Lagrangian-based methods, and most notably augmented Lagrangian methods [35, 25, 37].
Starting with the classical proximal method of multipliers [39], and until more recently [40,
12, 13, 9], such methods, which are based on proximal/projection computations (due to the
nonsmooth functions $R_X$ and $R_Y$ in Problem (OP)), have been successfully developed and
corresponding provable convergence rates have been established. However, in many cases of interest
such proximal/projection computations are not tractable in high-dimensional problems, for
instance when either $R_X$ or $R_Y$ is a nuclear norm regularizer for matrices, which underlies many
recovery problems of low-rank matrices and tensors (see, for instance, [7, 6, 16]), or an indicator
function of a polytope.
Thus, with the growing interest in recent years in so-called projection-free
methods, which are mostly based on the use of linear minimization oracles (LMOs) instead
of proximal/projection oracles through the Frank-Wolfe method (aka conditional gradient; see
for instance [26]), and which are often much more efficient to implement for high-dimensional problems
(e.g., in case $R_X$ or $R_Y$ is a nuclear norm regularizer or an indicator function of a polytope
which captures some well-studied combinatorial structure; see for instance [26, 24]), such methods have
also been studied for this setting [33, 43, 41]. However, these methods suffer from slow convergence rates compared to
their proximal/projection-based counterparts, with the worst-case guaranteed convergence rate
being at best O(1/√T), where T is the number of iterations executed. This rate is not known to
be improvable even under additional standard curvature assumptions such as strong convexity
of the function f(·) in Problem (OP)¹. A recent attempt to obtain faster projection-free methods
under relatively mild assumptions has been made in [22]; however, as we discuss in detail in
the appendix (see Section A), there is a major problem with their proof which does not seem
easily fixable. We also refer the interested reader to the excellent discussions in [22] on major
issues with other previous attempts to prove faster rates for projection-free methods.
For the simpler problem of minimizing a smooth convex objective function over a convex
and compact set, and in particular in case the feasible set is either a polytope or a nuclear
norm ball of matrices or a spectrahedron (set of positive semidefinite matrices with unit trace),
several recent works showed how simple modifications of the Frank-Wolfe method can lead
to provably faster convergence rates, under standard curvature assumptions, see for instance
[19, 30, 3, 17, 1, 20]. Thus, in the context of the significantly more complex Problem (OP), our
work considers the following natural question:
Can we design a projection-free augmented Lagrangian-based method that, at least under
standard curvature assumptions, improves upon the current O(1/√T) convergence rate?
We answer this question in the affirmative by providing a projection-free method with a
rate of O(1/T), both in terms of the objective function residual and the feasibility gap of the
affine constraint in Problem (OP).
Our approach departs in two aspects from previous projection-free methods, which guarantee only a rate
of O(1/√T). First, as already suggested, we make a curvature assumption on
Problem (OP): we introduce a curvature condition we call primal quadratic gap (see definition
in the sequel). In particular, this condition holds whenever the smooth function f(·) in Problem
(OP) is strongly convex, but also holds in case f(·) is a composition of a strongly convex function
with a linear transformation (e.g., a least squares objective, which need not be strongly convex)
and $R_X$, $R_Y$ are indicators of polytopes. Second, while previous projection-free methods rely
on the availability of a linear minimization oracle (LMO), in this work we consider a slightly
stronger oracle which was already considered in recent works [1, 21, 20] (these however do
not apply to problems such as Problem (OP), which includes affine constraints), namely the
weak proximal oracle (WPO)². In a nutshell, this oracle solves a certain relaxed version of the
¹This is not surprising since it is well-known that in general, and as opposed to projection/proximal-based
methods, the Frank-Wolfe method does not benefit from strong convexity; see for instance discussions in [19, 17, 1].
²The term “weak proximal oracle” was originally coined in [20].
proximal/projection problem, which can still be much more efficient to solve than the standard
proximal/projection problem, but can provide more informative directions than those of the
LMO. Two prime examples for the efficiency of implementing the WPO are when (i) $R_X$ or
$R_Y$ is an indicator function of a polytope which admits an efficient LMO, in which case the WPO
can be implemented based on a single call to the LMO of the polytope, and (ii) $R_X$ or $R_Y$
is an indicator function/regularizer corresponding to the matrix nuclear norm and a unique
low-rank optimal solution exists, in which case implementing the WPO amounts to a low-rank SVD
computation with rank matching that of the low-rank optimal solution, which is much more
efficient than a proximal/projection computation that generally requires a full-rank SVD (see
a detailed discussion in [1, 20, 21]).
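To make example (ii) concrete, the following toy sketch (Python with NumPy; the dimensions and the dense-SVD implementation are illustrative, not from the paper) shows the rank-r truncated SVD that dominates the cost of the WPO in the nuclear norm setup. When the prox target itself has rank r, the truncated factorization already recovers it exactly, whereas an exact proximal step would require a full-rank SVD.

```python
import numpy as np

def top_r_svd(M, r):
    """Rank-r truncated SVD -- the dominant computation of the WPO in the
    nuclear-norm setup.  (Here via a full dense SVD for simplicity; in high
    dimensions one would use iterative methods that only touch r components.)"""
    U, s, Vt = np.linalg.svd(M, full_matrices=False)
    return U[:, :r], s[:r], Vt[:r, :]

# If the target matrix happens to have rank r, the rank-r SVD recovers it exactly.
rng = np.random.default_rng(1)
M = rng.standard_normal((6, 2)) @ rng.standard_normal((2, 5))  # rank-2, 6x5
U, s, Vt = top_r_svd(M, 2)
assert np.allclose(U @ np.diag(s) @ Vt, M)
```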
The combination of these two ingredients, a curvature condition and the weak proximal
oracle, to obtain faster convergence rates for projection-free methods should not come as a
surprise since it was already instrumental in achieving similar improvements for projection-free
methods in settings that do not include affine constraints as in model (OP), see for instance
[19, 30, 17, 1, 20]³. To the best of our knowledge, this is the first time such an approach is used
for a problem of the form of model (OP).
| Algorithm | Oracle | Rate | Oracle implementation in polytope setup | Oracle implementation in nuclear norm setup | Assumptions |
|---|---|---|---|---|---|
| Proximal Method of Multipliers [39, 40] | proximal | O(1/T) | projection | full-rank SVD | |
| Accelerated Primal Dual [9] | proximal | O(1/T²) | projection | full-rank SVD | strong convexity |
| Conditional Gradient Augmented Lagrangian [43] | LMO | O(1/√T) | LMO | rank-one SVD | |
| This work | weak proximal | O(1/T) | LMO plus convex quadratic opt. over simplex | rank(x*)-SVD | primal quadratic gap (weaker than strong convexity) |

Table 1: Comparison of augmented Lagrangian-based methods with different optimization or-
acles for Problem (OP). The “Rate” column specifies the convergence rate only in terms of
the number of iterations T and suppresses all other quantities. “Polytope setup” in the fourth
column refers to a setting in which $R_X$ is an indicator function of some compact and convex
polytope and $R_Y$ is some proximal-friendly function. “Nuclear norm setup” in the fifth column
refers to a setting in which $R_X$ is a matrix nuclear norm regularizer and $R_Y$ is some proxi-
mal-friendly function. x* denotes the optimal solution, which is assumed to be unique. Both
columns specify the dominating cost of implementing the appropriate optimization oracle for
the x variable.
1.1 Paper organization
In Section 2, we discuss the augmented Lagrangian approach for solving Problem (OP)
and present the Primal Quadratic Gap (PQG) property needed for our algorithm's analysis. We
also recall the notion of the Weak Proximal Oracle that will be used in our algorithm and
discuss its implementation in several important scenarios. In the Appendix (see Section 3), we
give examples of problems of interest, for which our algorithm might be appealing to use. In
³While technically [19, 30] rely on the use of a standard LMO for the feasible set (which they assume to be
a polytope), as we show in the sequel, the way they use the output of the LMO to construct the new descent
direction is very similar to the implementation of a WPO. In particular, we rely on observations from [19] to
construct an efficient WPO for polytopes.
Section 4, we develop our algorithm and prove our main rate of convergence result. In Section
5, we demonstrate the empirical performance of our algorithm.
2 Preliminaries
2.1 Notation
Throughout the paper, we will use the following notation to simplify the presentation and
developments. We use the compact notations $q := (x, y) \in E_1 \times E_2 =: E$ and
$R_Q(q) := R_X(x) + R_Y(y)$. In addition, we define the linear mapping $K := [A, -I] : E \to E_2$,
where $I$ is the identity linear mapping. This way we can compactly write the constraint $Ax = y$
as $Kq = 0$. For any finite Euclidean space $V$ of dimension $n$, we denote the standard Euclidean
inner product of any two points $x := [x_1, \dots, x_n]^\top$ and $y := [y_1, \dots, y_n]^\top \in V$ by $\langle x, y\rangle :=
\sum_{i=1}^n x_i y_i$, and we let $\|x\| := \sqrt{\langle x, x\rangle}$ and $\|x\|_1 := \sum_{i=1}^n |x_i|$ denote the standard Euclidean
norm and the $\ell_1$ norm, respectively. For any two finite Euclidean spaces $V_1$ and $V_2$, the spectral
norm of a linear mapping $\mathcal{T} : V_1 \to V_2$ is denoted by $\|\mathcal{T}\| := \max_{x \in V_1}\{\|\mathcal{T}x\| : \|x\| = 1\}$.
While both norms are denoted the same, throughout the paper it will be clear from the context
which of the norms is used in each appearance. In addition, $\|X\|_F$ denotes the Frobenius norm
of a matrix $X$, $\|X\|_{\mathrm{nuc}}$ denotes its nuclear norm, and $\mathrm{tr}(X)$ denotes its trace. We denote by
$\mathbb{S}^d_+$ the set of all positive semidefinite matrices of size $d \times d$ and by $\mathcal{S}_\tau$ the spectrahedron of all
matrices in $\mathbb{S}^d_+$ with trace $\tau$. We use the notation $\delta_C(\cdot)$ to denote the indicator function of a set
$C$.
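As a quick numerical illustration of this notation (a toy sketch with NumPy; the matrix sizes are arbitrary, not from the paper), stacking $q = (x, y)$ and forming $K = [A, -I]$ turns the affine constraint $Ax = y$ into $Kq = 0$:

```python
import numpy as np

# Hypothetical small instance: E1 = R^3, E2 = R^2.
rng = np.random.default_rng(0)
A = rng.standard_normal((2, 3))

# K = [A, -I] acts on the stacked variable q = (x, y).
K = np.hstack([A, -np.eye(2)])

x = rng.standard_normal(3)
y = A @ x                     # choose y feasible, so Ax = y
q = np.concatenate([x, y])

# Kq = Ax - y, so the affine constraint Ax = y is exactly Kq = 0.
assert np.allclose(K @ q, A @ x - y)
assert np.allclose(K @ q, np.zeros(2))
```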
2.2 The Augmented Lagrangian
The augmented Lagrangian (AL) of Problem (OP) with a multiplier (dual variable) $w \in E_2$ is
defined as
$$\mathcal{L}_\rho(q, w) \equiv \mathcal{L}_\rho(x, y, w) := f(x) + R_Q(q) + \langle w, Kq\rangle + \frac{\rho}{2}\|Kq\|^2,$$
where $\rho > 0$ is a penalty parameter associated with the linear equality constraint. Note that the
(standard) Lagrangian of Problem (OP) is recovered when $\rho = 0$.
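For concreteness, a minimal sketch of evaluating $\mathcal{L}_\rho(q, w)$ numerically (Python/NumPy; the instance, with $f(x) = \frac{1}{2}\|x\|^2$ and zero regularizers, is purely illustrative):

```python
import numpy as np

def augmented_lagrangian(x, y, w, f, R_X, R_Y, A, rho):
    """Evaluate L_rho(q, w) = f(x) + R_X(x) + R_Y(y) + <w, Kq> + (rho/2)||Kq||^2,
    where Kq = Ax - y.  (A sketch for dense NumPy data; f, R_X, R_Y are callables.)"""
    Kq = A @ x - y
    return f(x) + R_X(x) + R_Y(y) + w @ Kq + 0.5 * rho * Kq @ Kq

# Toy instance: f(x) = 0.5 * ||x||^2, no regularizers.
A = np.array([[1.0, 2.0]])
x = np.array([1.0, 1.0]); y = np.array([3.0]); w = np.array([0.5]); rho = 2.0
val = augmented_lagrangian(x, y, w, lambda x: 0.5 * x @ x,
                           lambda x: 0.0, lambda y: 0.0, A, rho)
```

At a feasible pair (Kq = 0), the multiplier and penalty terms vanish, so the AL reduces to the primal objective, here f(x) = 1.0.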
We also require the following two standard assumptions, which we assume to hold true
throughout the paper.
Assumption 1. The augmented Lagrangian has a saddle point, i.e., there exists a point
$(q^*, w^*) \in E \times E_2$ satisfying
$$\mathcal{L}_\rho(q^*, w) \le \mathcal{L}_\rho(q^*, w^*) \le \mathcal{L}_\rho(q, w^*), \tag{1}$$
for all $q \in E$ and $w \in E_2$.
Assumption 2 (Slater's Condition). There exist $x \in \mathrm{ri}(\mathrm{dom}(f) \cap \mathrm{dom}(R_X))$ and $y \in \mathrm{ri}(\mathrm{dom}(R_Y))$
such that $Ax = y$, where $\mathrm{ri}$ denotes the relative interior of a set.
We denote by $\mathcal{P}^*$ the set of optimal solutions of the primal problem (OP), and by $\mathcal{D}^*$ the
set of optimal solutions of the associated dual problem.
Under the above two assumptions, and thanks to the convexity of Problem (OP), strong
duality holds. As a result, and thanks to Proposition 2, the set of saddle points of $\mathcal{L}_\rho$ is non-
empty and corresponds to the set of all pairs $(q^*, w^*)$, where $q^* \in \mathcal{P}^*$ and $w^* \in \mathcal{D}^*$ (see also
Appendix B.3).
In order to find solutions of Problem (OP), we will solve the following equivalent saddle
point problem
$$\min_{q \in E} \max_{w \in E_2} \mathcal{L}_\rho(q, w), \tag{2}$$
whose optimal solutions are the saddle points of $\mathcal{L}_\rho$.
From now on, we will denote the optimal objective function value by $\mathcal{L}_\rho(q^*, w^*)$, for any
saddle point $(q^*, w^*)$, and for short we will write $\mathcal{L}^*_\rho$.
The smooth part of the augmented Lagrangian (SAL) of Problem (OP) is defined, for any
$q \in E$ and $w \in E_2$, by
$$S(q, w) \equiv S(x, y, w) := f(x) + \langle w, Kq\rangle + \frac{\rho}{2}\|Kq\|^2. \tag{3}$$
The following result will be essential to our developments in the sequel. Its simple proof is
deferred to the appendix.
Lemma 2.1. The function $q \mapsto S(q, w)$, for any fixed $w \in E_2$, is smooth with parameter
$\beta_S = \beta + \rho(\|A\| + 1)^2$.
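Lemma 2.1 can be spot-checked numerically: for a quadratic $f(x) = \frac{\beta}{2}\|x\|^2$, the map $q \mapsto S(q, w)$ has the constant Hessian $H = \mathrm{blkdiag}(\beta I, 0) + \rho K^\top K$, whose largest eigenvalue is the exact Lipschitz constant of $\nabla_q S$. The sketch below (NumPy; the random dimensions are arbitrary) confirms it never exceeds $\beta + \rho(\|A\| + 1)^2$.

```python
import numpy as np

# Quadratic instance f(x) = (beta/2)||x||^2 with E1 = R^3, E2 = R^2:
# grad_q S(q, w) has constant Hessian H = blkdiag(beta*I, 0) + rho * K^T K,
# with K = [A, -I].
rng = np.random.default_rng(2)
beta, rho = 1.5, 0.7
A = rng.standard_normal((2, 3))
K = np.hstack([A, -np.eye(2)])
H = np.zeros((5, 5))
H[:3, :3] = beta * np.eye(3)
H += rho * K.T @ K

smoothness = np.linalg.eigvalsh(H).max()        # exact smoothness constant of S
norm_A = np.linalg.svd(A, compute_uv=False)[0]  # spectral norm ||A||
assert smoothness <= beta + rho * (norm_A + 1) ** 2 + 1e-9
```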
2.2.1 Primal Quadratic Gap
We now study a new property of the smooth function $q \mapsto S(q, w)$, for any fixed $w \in E_2$, which
we refer to as the Primal Quadratic Gap (PQG).
Definition 1 (Primal Quadratic Gap). We say that Problem (OP) satisfies the Primal Quadratic
Gap property with a parameter $\alpha_S > 0$ if, for any $q \in \mathrm{dom}(R_Q)$, the point $q^* := \arg\min_{\bar q \in \mathcal{P}^*}\|\bar q
- q\|^2 \in \mathcal{P}^*$ satisfies, for all $w \in E_2$, the following inequality:
$$\langle q^* - q, \nabla_q S(q, w)\rangle \le S(q^*, w) - S(q, w) - \frac{\alpha_S}{2}\|q^* - q\|^2. \tag{4}$$
This property is a weaker version of strong convexity: instead of assuming that
inequality (4) holds for any two points, we only require that it holds between any point $q \in
\mathrm{dom}(R_Q)$ and the corresponding closest optimal solution of Problem (OP).
Example 1. If the function $f(x)$ is strongly convex, then the function $q \mapsto S(q, w)$, for any
fixed $w \in E_2$, is strongly convex with a certain parameter $\alpha_S$. This implies that in this case the
SAL $S(q, w)$ satisfies the primal quadratic gap property with the same parameter $\alpha_S$.
Theorem 1. Suppose that $f : E_1 \to \mathbb{R}$ is $\alpha$-strongly convex. Then, Problem (OP) admits a
unique primal optimal solution $q^*$ and it satisfies the PQG property with the parameter $\alpha_S =
\min\{\frac{\alpha}{2}, \frac{\alpha\rho}{\alpha + 2\rho\|A\|^2}\} > 0$. In particular, the function $q \mapsto S(q, w)$, for any fixed $w \in E_2$, is strongly
convex.
The proof is deferred to the Appendix (see Section B).
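As a sanity check of Theorem 1 (a numerical spot-check on one toy instance, not a proof), consider the quadratic case $f(x) = \frac{\alpha}{2}\|x\|^2$, for which $q \mapsto S(q, w)$ has the constant Hessian $\mathrm{blkdiag}(\alpha I, 0) + \rho K^\top K$; its smallest eigenvalue should then be at least the predicted modulus $\alpha_S = \min\{\frac{\alpha}{2}, \frac{\alpha\rho}{\alpha + 2\rho\|A\|^2}\}$:

```python
import numpy as np

# Toy instance f(x) = (alpha/2)||x||^2 with E1 = R^2, E2 = R:
# the SAL q -> S(q, w) has constant Hessian
#   H = blkdiag(alpha*I, 0) + rho * K^T K,   K = [A, -I],
# and Theorem 1 predicts a strong-convexity modulus of at least
#   alpha_S = min{alpha/2, alpha*rho / (alpha + 2*rho*||A||^2)}.
alpha, rho = 1.0, 1.0
A = np.array([[1.0, 2.0]])
K = np.hstack([A, -np.eye(1)])
H = np.zeros((3, 3))
H[:2, :2] = alpha * np.eye(2)
H += rho * K.T @ K

norm_A_sq = np.linalg.svd(A, compute_uv=False)[0] ** 2   # ||A||^2 = 5
alpha_S = min(alpha / 2, alpha * rho / (alpha + 2 * rho * norm_A_sq))
assert np.linalg.eigvalsh(H).min() >= alpha_S            # ~0.146 >= 1/11
```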
Example 2. If $R_Q$ is an indicator function of a polytope, we can show that the PQG property
holds true even when $f(\cdot)$ is not strongly convex. Instead, we make the following
assumption.
Assumption 3. (i) $f \equiv g \circ \mathcal{B}$, where $\mathcal{B} : E_1 \to E_3$ is a linear mapping and $g : E_3 \to \mathbb{R}$ is
$\alpha_g$-strongly convex.
(ii) $R_Q(q) := R_X(x) + R_Y(y)$ is an indicator of a compact and convex polytope $\mathcal{F} \equiv \{q \in E :
Cq \le b\}$, where $C : E \to \mathbb{R}^p$ is a linear mapping and $b \in \mathbb{R}^p$.
Theorem 2. Suppose that Assumption 3 holds true. Then, there exists a constant $\sigma > 0$ such
that if $\rho \ge \alpha_g$, then Problem (OP) satisfies the PQG property with the parameter $\alpha_S = \alpha_g \sigma^{-1}$.
The proof is deferred to the Appendix (see Section B).