Coordinate Descent for SLOPE
Johan Larsson
Department of Statistics, Lund University, Sweden
johan.larsson@stat.lu.se

Quentin Klopfenstein
Luxembourg Centre for Systems Biomedicine, University of Luxembourg, Luxembourg
quentin.klopfenstein@uni.lu

Mathurin Massias
Univ. Lyon, Inria, CNRS, ENS de Lyon, UCB Lyon 1, LIP UMR 5668, F-69342 Lyon, France
mathurin.massias@inria.fr

Jonas Wallin
Department of Statistics, Lund University, Sweden
jonas.wallin@stat.lu.se
Abstract
The lasso is the most famous sparse regression and feature selection method. One reason for its popularity is the speed at which the underlying optimization problem can be solved. Sorted L-One Penalized Estimation (SLOPE) is a generalization of the lasso with appealing statistical properties. In spite of this, the method has not yet reached widespread interest. A major reason for this is that current software packages that fit SLOPE rely on algorithms that perform poorly in high dimensions. To tackle this issue, we propose a new fast algorithm to solve the SLOPE optimization problem, which combines proximal gradient descent and proximal coordinate descent steps. We provide new results on the directional derivative of the SLOPE penalty and its related SLOPE thresholding operator, as well as provide convergence guarantees for our proposed solver. In extensive benchmarks on simulated and real data, we show that our method outperforms a long list of competing algorithms.
1 INTRODUCTION
In this paper we present a novel numerical algorithm for Sorted L-One Penalized Estimation (SLOPE, Bogdan et al. 2013; Bogdan et al. 2015; Zeng and Figueiredo 2014), which, for a design matrix $X \in \mathbb{R}^{n \times p}$ and response vector $y \in \mathbb{R}^n$, is defined as
$$\min_{\beta \in \mathbb{R}^p} P(\beta) = \frac{1}{2} \lVert y - X\beta \rVert^2 + J(\beta) \tag{1}$$
where
$$J(\beta) = \sum_{j=1}^{p} \lambda_j |\beta_{(j)}| \tag{2}$$
is the sorted $\ell_1$ norm, defined through
$$|\beta_{(1)}| \ge |\beta_{(2)}| \ge \cdots \ge |\beta_{(p)}|, \tag{3}$$
with $\lambda$ being a fixed non-increasing and non-negative sequence.
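For concreteness, $J$ can be evaluated by sorting the absolute values of $\beta$ in non-increasing order and taking an inner product with $\lambda$. The following NumPy sketch (ours, for illustration; not taken from any SLOPE package) does exactly this:

```python
import numpy as np

def sorted_l1_norm(beta, lam):
    """Evaluate J(beta) = sum_j lam_j * |beta|_(j) for non-increasing lam."""
    return np.sort(np.abs(beta))[::-1] @ lam

beta = np.array([0.5, 5.0, -4.0])
lam = np.array([3.0, 2.0, 1.0])   # fixed, non-increasing, non-negative
print(sorted_l1_norm(beta, lam))  # 3*5 + 2*4 + 1*0.5 = 23.5
```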
The sorted $\ell_1$ norm is a sparsity-enforcing penalty that has become increasingly popular due to several appealing properties, such as its ability to control false discovery rate (Bogdan et al. 2015; Kos and Bogdan 2020), cluster coefficients (Figueiredo and Nowak 2016; Schneider and Tardivel 2020), and recover sparsity and ordering patterns in the solution (Bogdan et al. 2022). Unlike other competing sparse regularization methods such as MCP (Zhang 2010) and SCAD (Fan and Li 2001), SLOPE has the advantage of being a convex problem (Bogdan et al. 2015).
In spite of the availability of predictor screening rules (Larsson, Bogdan, and Wallin 2020; Elvira and Herzet 2021), which help speed up SLOPE in the high-dimensional regime, current state-of-the-art algorithms for SLOPE perform poorly in comparison to those of more established penalization methods such as the lasso ($\ell_1$ norm regularization) and ridge regression ($\ell_2$ norm regularization). As a small illustration of this issue, we compared the speed at which the SLOPE (Larsson et al. 2022) and glmnet (Friedman et al. 2022) packages solve a SLOPE and lasso problem, respectively, for the bcTCGA data set. SLOPE takes 43 seconds to reach convergence, whilst glmnet requires only 0.14 seconds.¹

¹ See Appendix B.1 for details on this experiment.
Table 1: Example of the permutation operator $(i)$ and its inverse $(i)^-$ for $\beta = [0.5, 5, -4]^T$.

i   β_i   (i)   (i)^-
1   0.5    2      3
2   5      3      1
3   −4     1      2
This lackluster performance has hampered the applicability of SLOPE in many real-world applications. In this paper we present a remedy for this issue: an algorithm that reaches convergence in only 2.9 seconds on the same problem.²

² Note that we do not use any screening rule in the current implementation of our algorithm, unlike the SLOPE package, which uses the strong screening rule for SLOPE (Larsson, Bogdan, and Wallin 2020).
A major reason why algorithms for solving $\ell_1$-, MCP-, or SCAD-regularized problems enjoy better performance is that they use coordinate descent (Tseng 2001; Friedman, Hastie, and Tibshirani 2010; Breheny and Huang 2011). Current SLOPE solvers, on the other hand, rely on proximal gradient descent algorithms such as FISTA (Beck and Teboulle 2009) and the alternating direction method of multipliers (ADMM, Boyd et al. 2010), which have proven to be less efficient than coordinate descent in empirical benchmarks on related problems, such as the lasso (Moreau et al. 2022). In addition to FISTA and ADMM, there has also been research into Newton-based augmented Lagrangian methods to solve SLOPE (Luo et al. 2019). But this method is adapted only to the $p \gg n$ regime and, as we show in our paper, is outperformed by our method even in this scenario. Applying coordinate descent to SLOPE is not, however, straightforward, since convergence guarantees for coordinate descent require the non-smooth part of the objective to be separable, which is not the case for SLOPE. As a result, naive coordinate descent schemes can get stuck (Figure 1).
In this article we address this problem by introducing a new, highly effective algorithm for SLOPE based on a hybrid proximal gradient and coordinate descent scheme. Our method features convergence guarantees and reduces the time required to fit SLOPE by orders of magnitude in our empirical experiments.
Notation Let $(i)^-$ be the inverse of $(i)$, such that $\big((i)^-\big) = i$; see Table 1 for an example of this operator for a particular $\beta$. This means that
$$J(\beta) = \sum_{j=1}^{p} \lambda_j |\beta_{(j)}| = \sum_{j=1}^{p} \lambda_{(j)^-} |\beta_j|.$$
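To illustrate the notation, the permutation, its inverse, and the identity above can be checked numerically. A small sketch (0-based indexing, so ranks and indices are shifted by one relative to Table 1):

```python
import numpy as np

beta = np.array([0.5, 5.0, -4.0])  # the beta of Table 1
lam = np.array([3.0, 2.0, 1.0])    # any non-increasing sequence

perm = np.argsort(-np.abs(beta))   # (i): rank -> index, here [1, 2, 0]
inv = np.empty_like(perm)
inv[perm] = np.arange(beta.size)   # (i)^-: index -> rank, here [2, 0, 1]

lhs = lam @ np.abs(beta)[perm]     # sum_j lam_j |beta_(j)|
rhs = lam[inv] @ np.abs(beta)      # sum_j lam_(j)^- |beta_j|
assert np.isclose(lhs, rhs)        # both equal 23.5
```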
Figure 1: An example of standard coordinate descent getting stuck on a two-dimensional SLOPE problem. The main plot shows level curves for the primal objective (1), with the minimizer $\beta^* = [0, 0]^T$ indicated by the orange cross. The marginal plots display objective values at $\beta_1 = 0.2$ when optimizing over $\beta_2$ and vice versa. At $\beta = [0.2, 0.2]^T$, standard coordinate descent can only move in the directions indicated by the dashed lines, neither of which is a descent direction for the objective. As a result, the algorithm is stuck at a suboptimal point.
Sorted $\ell_1$ norm penalization leads to solution vectors with clustered coefficients, in which the absolute values of several coefficients are set to exactly the same value. To this end, for a fixed $\beta$ such that $|\beta_j|$ takes $m$ distinct values, we introduce $\mathcal{C}_1, \mathcal{C}_2, \ldots, \mathcal{C}_m$ and $c_1, c_2, \ldots, c_m$ for the indices and coefficients, respectively, of the $m$ clusters of $\beta$, such that $\mathcal{C}_i = \{j : |\beta_j| = c_i\}$ and $c_1 > c_2 > \cdots > c_m \ge 0$.
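The clusters and their values can be recovered from $\beta$ directly; a small sketch (our helper, 0-indexed):

```python
import numpy as np

def clusters(beta):
    """Return the values c_1 > ... > c_m of |beta| and the index sets
    C_i = {j : |beta_j| = c_i}."""
    c = np.unique(np.abs(beta))[::-1]  # distinct values, decreasing
    return c, [np.flatnonzero(np.abs(beta) == ci) for ci in c]

c, C = clusters(np.array([-3.0, 1.0, 3.0, 2.0]))
# c = [3., 2., 1.]; C = [array([0, 2]), array([3]), array([1])]
```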
For a set $\mathcal{C}$, let $\bar{\mathcal{C}}$ denote its complement. Furthermore, let $(e_i)_{i \in [d]}$ denote the canonical basis of $\mathbb{R}^d$, with $[d] = \{1, 2, \ldots, d\}$. Let $X_{i:}$ and $X_{:i}$ denote the $i$-th row and column, respectively, of the matrix $X$. Finally, let $\operatorname{sign}(x) = x / |x|$ (with the convention $0/0 = 1$) be the scalar sign, that acts entrywise on vectors.
2 COORDINATE DESCENT FOR SLOPE
Proximal coordinate descent cannot be applied to Problem (1) because the non-smooth term is not separable. If the clusters $\mathcal{C}^*_1, \ldots, \mathcal{C}^*_m$ and signs of the solution $\beta^*$ were known, however, then the values $c^*_1, \ldots, c^*_m$ taken by the clusters of $\beta^*$ could be computed by solving
$$\min_{z \in \mathbb{R}^m} \frac{1}{2} \Big\lVert y - X \sum_{i=1}^{m} \sum_{j \in \mathcal{C}^*_i} z_i \operatorname{sign}(\beta^*_j)\, e_j \Big\rVert^2 + \sum_{i=1}^{m} |z_i| \sum_{j \in \mathcal{C}^*_i} \lambda_{(j)^-}. \tag{4}$$
Conditionally on the knowledge of the clusters and the signs of the coefficients, the penalty becomes separable (Dupuis and Tardivel 2022), which means that coordinate descent could be used.
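As a sketch of this reformulation (assuming, hypothetically, that $\beta^*$ is known), Problem (4) can be assembled as a weighted lasso in $z$: each cluster contributes one aggregated column and one fixed penalty weight. The helper below is our own and 0-indexed:

```python
import numpy as np

def reduced_problem(X, beta_star, lam):
    """Aggregated design and penalty weights for Problem (4), given the
    clusters and signs of a (hypothetically known) solution beta_star."""
    p = beta_star.size
    inv = np.empty(p, dtype=int)
    inv[np.argsort(-np.abs(beta_star))] = np.arange(p)  # (j)^- for beta_star
    cols, weights = [], []
    for ci in np.unique(np.abs(beta_star))[::-1]:
        Ci = np.flatnonzero(np.abs(beta_star) == ci)
        cols.append(X[:, Ci] @ np.sign(beta_star[Ci]))  # sum_j sign(b*_j) X_:j
        weights.append(lam[inv[Ci]].sum())              # sum_j lam_(j)^-
    return np.column_stack(cols), np.array(weights)

# With (X_red, w) = reduced_problem(X, beta_star, lam), Problem (4) reads
# min_z 0.5 * ||y - X_red @ z||^2 + w @ np.abs(z): a weighted lasso,
# amenable to standard coordinate descent.
```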
Based on this idea, we derive a coordinate descent update for minimizing the SLOPE problem (1) with respect to the coefficients of a single cluster at a time (Section 2.1). Because this update is limited to updating and, possibly, merging clusters, we intertwine it with proximal gradient descent in order to correctly identify the clusters (Section 2.2). In Section 2.3, we present this hybrid strategy and show that it is guaranteed to converge. In Section 3, we show empirically that our algorithm outperforms competing alternatives for a wide range of problems.
2.1 Coordinate Descent Update
In the sequel, let $\beta$ be fixed with $m$ clusters $\mathcal{C}_1, \ldots, \mathcal{C}_m$ corresponding to values $c_1, \ldots, c_m$. In addition, let $k \in [m]$ be fixed and $s_k = \operatorname{sign}(\beta_{\mathcal{C}_k})$. We are interested in updating $\beta$ by changing only the value taken on the $k$-th cluster. To this end, we define $\beta(z) \in \mathbb{R}^p$ by:
$$\beta_i(z) = \begin{cases} \operatorname{sign}(\beta_i)\, z & \text{if } i \in \mathcal{C}_k, \\ \beta_i & \text{otherwise}. \end{cases} \tag{5}$$
Minimizing the objective in this direction amounts to solving the following one-dimensional problem:
$$\min_{z \in \mathbb{R}} G(z) = P(\beta(z)) = \frac{1}{2} \lVert y - X\beta(z) \rVert^2 + H(z), \tag{6}$$
where
$$H(z) = |z| \sum_{j \in \mathcal{C}_k} \lambda_{(j)^-_z} + \sum_{j \notin \mathcal{C}_k} |\beta_j|\, \lambda_{(j)^-_z} \tag{7}$$
is the partial sorted $\ell_1$ norm with respect to the $k$-th cluster, and where we write $\lambda_{(j)^-_z}$ to indicate that the inverse sorting permutation $(j)^-_z$ is defined with respect to $\beta(z)$. The optimality condition for Problem (6) is
$$\forall \delta \in \{-1, 1\}, \quad G'(z; \delta) \ge 0,$$
where $G'(z; \delta)$ is the directional derivative of $G$ in the direction $\delta$. Since the first part of the objective is differentiable, we have
$$G'(z; \delta) = \delta \sum_{j \in \mathcal{C}_k} \operatorname{sign}(\beta_j)\, X_{:j}^\top (X\beta(z) - y) + H'(z; \delta),$$
where $H'(z; \delta)$ is the directional derivative of $H$.
Throughout the rest of this section, we derive the solution to (6). To do so, we will introduce the directional derivative of the sorted $\ell_1$ norm with respect to the coefficient of the $k$-th cluster. First, as illustrated in Figure 2, note that $H$ is piecewise affine, with breakpoints at 0 and at all $\pm c_i$ for $i \ne k$. Hence, the partial derivative is piecewise constant, with jumps at these points; in addition, $H'(\cdot\,; 1) = -H'(\cdot\,; -1)$ except at these points.
Figure 2: Graph of the partial sorted $\ell_1$ norm with $\beta = [-3, 1, 3, 2]^T$, $k = 1$, and so $(c_1, c_2, c_3) = (3, 2, 1)$.
Let $C(z)$ be the function that returns the cluster of $\beta(z)$ corresponding to $|z|$, that is,
$$C(z) = \{j : |\beta(z)_j| = |z|\}. \tag{8}$$
Remark 2.1. Note that if $z$ is equal to some $c_i$, then $C(z) = \mathcal{C}_i \cup \mathcal{C}_k$, and otherwise $C(z) = \mathcal{C}_k$. Related to the piecewise affineness of $H$ is the fact that the permutation³ corresponding to $\beta(z)$ is
$$\begin{cases} \mathcal{C}_k, \mathcal{C}_m, \ldots, \mathcal{C}_1 & \text{if } z \in \left]0, c_m\right[, \\ \mathcal{C}_m, \ldots, \mathcal{C}_i, \mathcal{C}_k, \mathcal{C}_{i-1}, \ldots, \mathcal{C}_1 & \text{if } z \in \left]c_i, c_{i-1}\right[ \text{ and } i \in \llbracket 2, m \rrbracket, \\ \mathcal{C}_m, \ldots, \mathcal{C}_1, \mathcal{C}_k & \text{if } z \in \left]c_1, +\infty\right[, \end{cases}$$
and that this permutation also reorders $\beta(z \pm h)$ for $z \ne c_i$ ($i \ne k$) and $h$ small enough. The only change in permutation happens when $z = 0$ or $z = c_i$ ($i \ne k$). Finally, the permutations differ between $\beta(z + h)$ and $\beta(z - h)$ for arbitrarily small $h$ if and only if $z = c_i \ne 0$.

³ The permutation is in fact not unique, but this has no impact on our results. This is discussed where needed in the proofs.
We can now state the directional derivative of H.
Theorem 2.2. Let $c_{\setminus k}$ be the set containing all elements of $c$ except the $k$-th one: $c_{\setminus k} = \{c_1, \ldots, c_{k-1}, c_{k+1}, \ldots, c_m\}$. Let $\varepsilon_c > 0$ be such that
$$\varepsilon_c < |c_i - c_j|\ \ \forall i \ne j, \quad \text{and} \quad \varepsilon_c < c_m \text{ if } c_m \ne 0. \tag{9}$$
The directional derivative of the partial sorted $\ell_1$ norm with respect to the $k$-th cluster, $H$, in the direction $\delta$ is
$$H'(z; \delta) = \begin{cases} \displaystyle\sum_{j \in C(\varepsilon_c)} \lambda_{(j)^-_{\varepsilon_c}} & \text{if } z = 0, \\[2ex] \operatorname{sign}(z)\,\delta \displaystyle\sum_{j \in C(z + \varepsilon_c \delta)} \lambda_{(j)^-_{z + \varepsilon_c \delta}} & \text{if } |z| \in c_{\setminus k} \setminus \{0\}, \\[2ex] \operatorname{sign}(z)\,\delta \displaystyle\sum_{j \in C(z)} \lambda_{(j)^-_z} & \text{otherwise}. \end{cases}$$
The proof is in Appendix A.1; in Figure 3, we show an
example of the directional derivative and the objective
function.
Figure 3: The function $G$ and its directional derivative $G'(\cdot\,; \delta)$ for an example with $\beta = [-3, 1, 3, 2]^T$, $k = 1$, and consequently $c_{\setminus k} = \{1, 2\}$. The solution of Problem (6) is the value of $z$ for which $G'(z; \delta) \ge 0$ for $\delta \in \{-1, 1\}$, which holds only at $z = 2$, which must therefore be the solution.
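Since $H(z) = J(\beta(z))$, Theorem 2.2 can also be sanity-checked with one-sided finite differences. A sketch for the running example of Figures 2 and 3 (our code; the $\lambda$ sequence is chosen arbitrarily, and $h$ is a small step unrelated to $\varepsilon_c$):

```python
import numpy as np

beta = np.array([-3.0, 1.0, 3.0, 2.0])
lam = np.array([4.0, 3.0, 2.0, 1.0])      # any non-increasing sequence
Ck = np.flatnonzero(np.abs(beta) == 3.0)  # cluster k = 1, here indices {0, 2}

def H(z):
    """Partial sorted l1 norm (7), computed as H(z) = J(beta(z))."""
    b = beta.copy()
    b[Ck] = np.sign(beta[Ck]) * z
    return np.sort(np.abs(b))[::-1] @ lam

h = 1e-6
for z in [0.0, 1.0, 1.5, 2.0]:            # breakpoints and interior points
    for delta in (1.0, -1.0):
        fd = (H(z + h * delta) - H(z)) / h  # approximates H'(z; delta)
        print(f"H'({z}; {delta:+.0f}) ~ {fd:.3f}")
```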
Using the directional derivative, we can now introduce
the SLOPE thresholding operator.
Theorem 2.3 (The SLOPE Thresholding Operator). Define $S(x) = \sum_{j \in C(x)} \lambda_{(j)^-_x}$ and let
$$T(\gamma; \omega, c, \lambda) = \begin{cases} 0 & \text{if } |\gamma| \le S(\varepsilon_c), \\[1ex] \operatorname{sign}(\gamma)\, c_i & \text{if } \omega c_i + S(c_i - \varepsilon_c) \le |\gamma| \le \omega c_i + S(c_i + \varepsilon_c), \\[1ex] \dfrac{\operatorname{sign}(\gamma)}{\omega} \big(|\gamma| - S(c_i + \varepsilon_c)\big) & \text{if } \omega c_i + S(c_i + \varepsilon_c) < |\gamma| < \omega c_{i-1} + S(c_{i-1} - \varepsilon_c), \\[1ex] \dfrac{\operatorname{sign}(\gamma)}{\omega} \big(|\gamma| - S(c_1 + \varepsilon_c)\big) & \text{if } |\gamma| \ge \omega c_1 + S(c_1 + \varepsilon_c), \end{cases}$$
with $\varepsilon_c$ defined as in (9). Let $\tilde{x} = X_{\mathcal{C}_k} \operatorname{sign}(\beta_{\mathcal{C}_k})$ and $r = y - X\beta$. Then
$$T\big(c_k \lVert\tilde{x}\rVert^2 + \tilde{x}^\top r;\ \lVert\tilde{x}\rVert^2,\ c_{\setminus k},\ \lambda\big) = \operatorname*{arg\,min}_{z \in \mathbb{R}}\ G(z). \tag{10}$$
An illustration of this operator is given in Figure 4.
Remark 2.4. The minimizer is unique because $G$ is the sum of a quadratic function in one variable and a norm.

Remark 2.5. In the lasso case, where the $\lambda_i$'s are all equal, the SLOPE thresholding operator reduces to the soft thresholding operator.
In practice, it is rarely necessary to compute all of the sums in Theorem 2.3. Instead, we first check in which direction we need to search relative to the current order of the cluster, and then search in that direction until we find the solution. The complexity of this operation depends on how far we need to search and on the sizes of the current cluster and the other clusters we need to consider. In practice, the cost is typically larger at the start of optimization and becomes marginal as the algorithm approaches convergence and the cluster permutation stabilizes.
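As a reference point for Theorem 2.3, the minimizer in (10) can also be found by brute force: $G$ is convex and piecewise quadratic in $z$, with breakpoints at 0 and $\pm c_i$ for $i \ne k$, so it suffices to compare the breakpoints with the stationary point of each quadratic piece. The sketch below is our own construction, not the paper's implementation; the closed-form operator reaches the same point without the enumeration, which is what makes the update cheap.

```python
import numpy as np

def cluster_min(X, y, beta, lam, Ck):
    """Minimize G of Problem (6) over z by enumerating the pieces of the
    piecewise quadratic G; a brute-force stand-in for Theorem 2.3."""
    s = np.sign(beta[Ck])
    s[s == 0] = 1.0                            # convention sign(0) = 1

    def beta_z(z):
        b = beta.copy()
        b[Ck] = s * z
        return b

    def H(z):                                  # partial sorted l1 norm (7)
        return np.sort(np.abs(beta_z(z)))[::-1] @ lam

    def G(z):                                  # objective of Problem (6)
        return 0.5 * np.sum((y - X @ beta_z(z)) ** 2) + H(z)

    r = y - X @ beta_z(0.0)                    # residual without cluster k
    xt = X[:, Ck] @ s                          # aggregated column, x-tilde
    a = xt @ xt                                # curvature; assumed > 0 here

    others = np.setdiff1d(np.arange(beta.size), Ck)
    cs = np.unique(np.abs(beta[others]))
    bps = np.unique(np.concatenate([[0.0], cs, -cs]))  # breakpoints of H
    M = (abs(xt @ r) + lam.sum()) / a + np.abs(bps).max() + 1.0
    knots = np.concatenate([[-M], bps, [M]])

    cands = list(bps)
    for lo, hi in zip(knots[:-1], knots[1:]):  # one quadratic piece per gap
        mid, eps = 0.5 * (lo + hi), 1e-9 * (hi - lo)
        slope = (H(mid + eps) - H(mid - eps)) / (2 * eps)  # H is affine here
        z_star = (xt @ r - slope) / a          # stationary point of the piece
        if lo < z_star < hi:
            cands.append(z_star)
    return min(cands, key=G)                   # the update sets beta[Ck] = s*z
```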
2.2 Proximal Gradient Descent Update
The coordinate descent update outlined in the previous section updates the coefficients of each cluster in unison, which allows clusters to merge, but not to split. This means that the coordinate descent updates are not guaranteed to identify the clusters of the solution on their own. To circumvent this issue, we combine these coordinate descent steps with full proximal gradient steps, which enable the algorithm to identify the cluster structure (Liang, Fadili, and Peyré 2014) due to the partial smoothness property of the sorted $\ell_1$ norm that we prove in Appendix A.4. A similar idea has previously been used in Bareilles, Iutzeler, and Malick (2022), wherein Newton steps are taken on the problem structure identified after a proximal gradient descent step.
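Concretely, this full step requires the proximal operator of the sorted $\ell_1$ norm, which can be evaluated exactly with a stack-based pool-adjacent-violators scheme in the spirit of Bogdan et al. (2015). Below is a compact sketch of such a prox and of a single proximal gradient step (our own code, not the paper's implementation):

```python
import numpy as np

def prox_sorted_l1(v, lam):
    """Proximal operator of the sorted l1 norm with non-increasing lam."""
    order = np.argsort(-np.abs(v))
    z = np.abs(v)[order] - lam       # to be fit by a non-increasing sequence
    sums, sizes = [], []             # blocks kept with decreasing averages
    for x in z:
        cur_sum, cur_size = x, 1
        while sums and sums[-1] / sizes[-1] <= cur_sum / cur_size:
            cur_sum += sums.pop()    # merge adjacent violating blocks
            cur_size += sizes.pop()
        sums.append(cur_sum)
        sizes.append(cur_size)
    w = np.concatenate([np.full(n, max(s / n, 0.0))
                        for s, n in zip(sums, sizes)])
    out = np.empty_like(w)
    out[order] = w                   # undo the sort
    return np.sign(v) * out

def prox_grad_step(X, y, beta, lam):
    """One full proximal gradient step with step size 1/L, L = ||X||_2^2."""
    L = np.linalg.norm(X, 2) ** 2
    return prox_sorted_l1(beta - X.T @ (X @ beta - y) / L, lam / L)
```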