Coordinate Descent for SLOPE
Johan Larsson
Department of Statistics, Lund University, Sweden
johan.larsson@stat.lu.se

Quentin Klopfenstein
Luxembourg Centre for Systems Biomedicine, University of Luxembourg, Luxembourg
quentin.klopfenstein@uni.lu

Mathurin Massias
Univ. Lyon, Inria, CNRS, ENS de Lyon, UCB Lyon 1, LIP UMR 5668, F-69342 Lyon, France
mathurin.massias@inria.fr

Jonas Wallin
Department of Statistics, Lund University, Sweden
jonas.wallin@stat.lu.se
Abstract
The lasso is the most famous sparse regression and feature selection method. One reason for its popularity is the speed at which the underlying optimization problem can be solved. Sorted L-One Penalized Estimation (SLOPE) is a generalization of the lasso with appealing statistical properties. In spite of this, the method has not yet reached widespread interest. A major reason for this is that current software packages that fit SLOPE rely on algorithms that perform poorly in high dimensions. To tackle this issue, we propose a new fast algorithm to solve the SLOPE optimization problem, which combines proximal gradient descent and proximal coordinate descent steps. We provide new results on the directional derivative of the SLOPE penalty and its related SLOPE thresholding operator, as well as provide convergence guarantees for our proposed solver. In extensive benchmarks on simulated and real data, we show that our method outperforms a long list of competing algorithms.
1 INTRODUCTION
In this paper we present a novel numerical algorithm for Sorted L-One Penalized Estimation (SLOPE, Bogdan et al. 2013; Bogdan et al. 2015; Zeng and Figueiredo 2014), which, for a design matrix $X \in \mathbb{R}^{n \times p}$ and response vector $y \in \mathbb{R}^n$, is defined as
$$\min_{\beta \in \mathbb{R}^p} P(\beta) = \frac{1}{2} \lVert y - X\beta \rVert^2 + J(\beta) \tag{1}$$
where
$$J(\beta) = \sum_{j=1}^{p} \lambda_j |\beta_{(j)}| \tag{2}$$
is the sorted $\ell_1$ norm, defined through
$$|\beta_{(1)}| \ge |\beta_{(2)}| \ge \cdots \ge |\beta_{(p)}|, \tag{3}$$
with $\lambda$ being a fixed non-increasing and non-negative sequence.
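For concreteness, $J$ can be evaluated by sorting the absolute values of $\beta$ in non-increasing order and taking an inner product with $\lambda$. The following NumPy sketch (ours, for illustration; not taken from any SLOPE package) does exactly this:

```python
import numpy as np

def sorted_l1_norm(beta, lam):
    """Evaluate J(beta) = sum_j lam_j * |beta|_(j) for non-increasing lam."""
    return np.sort(np.abs(beta))[::-1] @ lam

beta = np.array([0.5, 5.0, -4.0])
lam = np.array([3.0, 2.0, 1.0])   # fixed, non-increasing, non-negative
print(sorted_l1_norm(beta, lam))  # 3*5 + 2*4 + 1*0.5 = 23.5
```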
The sorted $\ell_1$ norm is a sparsity-enforcing penalty that has become increasingly popular due to several appealing properties, such as its ability to control false discovery rate (Bogdan et al. 2015; Kos and Bogdan 2020), cluster coefficients (Figueiredo and Nowak 2016; Schneider and Tardivel 2020), and recover sparsity and ordering patterns in the solution (Bogdan et al. 2022). Unlike other competing sparse regularization methods such as MCP (Zhang 2010) and SCAD (Fan and Li 2001), SLOPE has the advantage of being a convex problem (Bogdan et al. 2015).
In spite of the availability of predictor screening rules (Larsson, Bogdan, and Wallin 2020; Elvira and Herzet 2021), which help speed up SLOPE in the high-dimensional regime, current state-of-the-art algorithms for SLOPE perform poorly in comparison to those of more established penalization methods such as the lasso ($\ell_1$ norm regularization) and ridge regression ($\ell_2$ norm regularization). As a small illustration of this issue, we compared the speed at which the SLOPE (Larsson et al. 2022) and glmnet (Friedman et al. 2022) packages solve a SLOPE and lasso problem, respectively, for the bcTCGA data set. SLOPE takes 43 seconds to reach convergence, whilst glmnet requires only 0.14 seconds.¹

¹ See Appendix B.1 for details on this experiment.
Table 1: Example of the permutation operator $(i)$ and its inverse $(i)^-$ for $\beta = [0.5, 5, -4]^T$.

i   β_i   (i)   (i)^-
1   0.5    2      3
2   5      3      1
3   −4     1      2
This lackluster performance has hampered the applicability of SLOPE in many real-world applications. In this paper we present a remedy for this issue: an algorithm that reaches convergence in only 2.9 seconds on the same problem.²

² Note that we do not use any screening rule in the current implementation of our algorithm, unlike the SLOPE package, which uses the strong screening rule for SLOPE (Larsson, Bogdan, and Wallin 2020).
A major reason why algorithms for solving $\ell_1$-, MCP-, or SCAD-regularized problems enjoy better performance is that they use coordinate descent (Tseng 2001; Friedman, Hastie, and Tibshirani 2010; Breheny and Huang 2011). Current SLOPE solvers, on the other hand, rely on proximal gradient descent algorithms such as FISTA (Beck and Teboulle 2009) and the alternating direction method of multipliers (ADMM, Boyd et al. 2010), which have proven to be less efficient than coordinate descent in empirical benchmarks on related problems, such as the lasso (Moreau et al. 2022). In addition to FISTA and ADMM, there has also been research into Newton-based augmented Lagrangian methods to solve SLOPE (Luo et al. 2019). But this method is adapted only to the $p \gg n$ regime and, as we show in our paper, is outperformed by our method even in this scenario. Applying coordinate descent to SLOPE is not, however, straightforward, since convergence guarantees for coordinate descent require the non-smooth part of the objective to be separable, which is not the case for SLOPE. As a result, naive coordinate descent schemes can get stuck (Figure 1).
In this article we address this problem by introducing a new, highly effective algorithm for SLOPE based on a hybrid proximal gradient and coordinate descent scheme. Our method features convergence guarantees and reduces the time required to fit SLOPE by orders of magnitude in our empirical experiments.
Notation Let $(i)^-$ be the inverse of $(i)$, such that $\big((i)^-\big) = i$; see Table 1 for an example of this operator for a particular $\beta$. This means that
$$J(\beta) = \sum_{j=1}^{p} \lambda_j |\beta_{(j)}| = \sum_{j=1}^{p} \lambda_{(j)^-} |\beta_j|.$$
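To illustrate the notation, the permutation, its inverse, and the identity above can be checked numerically. A small sketch (0-based indexing, so ranks and indices are shifted by one relative to Table 1):

```python
import numpy as np

beta = np.array([0.5, 5.0, -4.0])  # the beta of Table 1
lam = np.array([3.0, 2.0, 1.0])    # any non-increasing sequence

perm = np.argsort(-np.abs(beta))   # (i): rank -> index, here [1, 2, 0]
inv = np.empty_like(perm)
inv[perm] = np.arange(beta.size)   # (i)^-: index -> rank, here [2, 0, 1]

lhs = lam @ np.abs(beta)[perm]     # sum_j lam_j |beta_(j)|
rhs = lam[inv] @ np.abs(beta)      # sum_j lam_(j)^- |beta_j|
assert np.isclose(lhs, rhs)        # both equal 23.5
```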
Figure 1: An example of standard coordinate descent getting stuck on a two-dimensional SLOPE problem. The main plot shows level curves for the primal objective (1), with the minimizer $\beta^* = [0, 0]^T$ indicated by the orange cross. The marginal plots display objective values at $\beta_1 = 0.2$ when optimizing over $\beta_2$ and vice versa. At $\beta = [0.2, 0.2]^T$, standard coordinate descent can only move in the directions indicated by the dashed lines, neither of which is a descent direction for the objective. As a result, the algorithm is stuck at a suboptimal point.
Sorted $\ell_1$ norm penalization leads to solution vectors with clustered coefficients, in which the absolute values of several coefficients are set to exactly the same value. To this end, for a fixed $\beta$ such that $|\beta_j|$ takes $m$ distinct values, we introduce $\mathcal{C}_1, \mathcal{C}_2, \ldots, \mathcal{C}_m$ and $c_1, c_2, \ldots, c_m$ for the indices and coefficients, respectively, of the $m$ clusters of $\beta$, such that $\mathcal{C}_i = \{j : |\beta_j| = c_i\}$ and $c_1 > c_2 > \cdots > c_m \ge 0$.
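The clusters and their values can be recovered from $\beta$ directly; a small sketch (our helper, 0-indexed):

```python
import numpy as np

def clusters(beta):
    """Return the values c_1 > ... > c_m of |beta| and the index sets
    C_i = {j : |beta_j| = c_i}."""
    c = np.unique(np.abs(beta))[::-1]  # distinct values, decreasing
    return c, [np.flatnonzero(np.abs(beta) == ci) for ci in c]

c, C = clusters(np.array([-3.0, 1.0, 3.0, 2.0]))
# c = [3., 2., 1.]; C = [array([0, 2]), array([3]), array([1])]
```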
For a set $\mathcal{C}$, let $\bar{\mathcal{C}}$ denote its complement. Furthermore, let $(e_i)_{i \in [d]}$ denote the canonical basis of $\mathbb{R}^d$, with $[d] = \{1, 2, \ldots, d\}$. Let $X_{i:}$ and $X_{:i}$ denote the $i$-th row and column, respectively, of the matrix $X$. Finally, let $\operatorname{sign}(x) = x / |x|$ (with the convention $0/0 = 1$) be the scalar sign, that acts entrywise on vectors.
2 COORDINATE DESCENT FOR SLOPE
Proximal coordinate descent cannot be applied to Problem (1) because the non-smooth term is not separable. If the clusters $\mathcal{C}^*_1, \ldots, \mathcal{C}^*_m$ and signs of the solution $\beta^*$ were known, however, then the values $c^*_1, \ldots, c^*_m$ taken by the clusters of $\beta^*$ could be computed by solving
$$\min_{z \in \mathbb{R}^m} \frac{1}{2} \Big\lVert y - X \sum_{i=1}^{m} \sum_{j \in \mathcal{C}^*_i} z_i \operatorname{sign}(\beta^*_j)\, e_j \Big\rVert^2 + \sum_{i=1}^{m} |z_i| \sum_{j \in \mathcal{C}^*_i} \lambda_{(j)^-}. \tag{4}$$
Conditionally on the knowledge of the clusters and the signs of the coefficients, the penalty becomes separable (Dupuis and Tardivel 2022), which means that coordinate descent could be used.
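As a sketch of this reformulation (assuming, hypothetically, that $\beta^*$ is known), Problem (4) can be assembled as a weighted lasso in $z$: each cluster contributes one aggregated column and one fixed penalty weight. The helper below is our own and 0-indexed:

```python
import numpy as np

def reduced_problem(X, beta_star, lam):
    """Aggregated design and penalty weights for Problem (4), given the
    clusters and signs of a (hypothetically known) solution beta_star."""
    p = beta_star.size
    inv = np.empty(p, dtype=int)
    inv[np.argsort(-np.abs(beta_star))] = np.arange(p)  # (j)^- for beta_star
    cols, weights = [], []
    for ci in np.unique(np.abs(beta_star))[::-1]:
        Ci = np.flatnonzero(np.abs(beta_star) == ci)
        cols.append(X[:, Ci] @ np.sign(beta_star[Ci]))  # sum_j sign(b*_j) X_:j
        weights.append(lam[inv[Ci]].sum())              # sum_j lam_(j)^-
    return np.column_stack(cols), np.array(weights)

# With (X_red, w) = reduced_problem(X, beta_star, lam), Problem (4) reads
# min_z 0.5 * ||y - X_red @ z||^2 + w @ np.abs(z): a weighted lasso,
# amenable to standard coordinate descent.
```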
Based on this idea, we derive a coordinate descent update for minimizing the SLOPE problem (1) with respect to the coefficients of a single cluster at a time (Section 2.1). Because this update is limited to updating and, possibly, merging clusters, we intertwine it with proximal gradient descent in order to correctly identify the clusters (Section 2.2). In Section 2.3, we present this hybrid strategy and show that it is guaranteed to converge. In Section 3, we show empirically that our algorithm outperforms competing alternatives for a wide range of problems.
2.1 Coordinate Descent Update
In the sequel, let $\beta$ be fixed with $m$ clusters $\mathcal{C}_1, \ldots, \mathcal{C}_m$ corresponding to values $c_1, \ldots, c_m$. In addition, let $k \in [m]$ be fixed and $s_k = \operatorname{sign}(\beta_{\mathcal{C}_k})$. We are interested in updating $\beta$ by changing only the value taken on the $k$-th cluster. To this end, we define $\beta(z) \in \mathbb{R}^p$ by:
$$\beta_i(z) = \begin{cases} \operatorname{sign}(\beta_i)\, z & \text{if } i \in \mathcal{C}_k, \\ \beta_i & \text{otherwise}. \end{cases} \tag{5}$$
Minimizing the objective in this direction amounts to solving the following one-dimensional problem:
$$\min_{z \in \mathbb{R}} G(z) = P(\beta(z)) = \frac{1}{2} \lVert y - X\beta(z) \rVert^2 + H(z), \tag{6}$$
where
$$H(z) = |z| \sum_{j \in \mathcal{C}_k} \lambda_{(j)^-_z} + \sum_{j \notin \mathcal{C}_k} |\beta_j|\, \lambda_{(j)^-_z} \tag{7}$$
is the partial sorted $\ell_1$ norm with respect to the $k$-th cluster, and where we write $\lambda_{(j)^-_z}$ to indicate that the inverse sorting permutation $(j)^-_z$ is defined with respect to $\beta(z)$. The optimality condition for Problem (6) is
$$\forall \delta \in \{-1, 1\}, \quad G'(z; \delta) \ge 0,$$
where $G'(z; \delta)$ is the directional derivative of $G$ in the direction $\delta$. Since the first part of the objective is differentiable, we have
$$G'(z; \delta) = \delta \sum_{j \in \mathcal{C}_k} \operatorname{sign}(\beta_j)\, X_{:j}^\top (X\beta(z) - y) + H'(z; \delta),$$
where $H'(z; \delta)$ is the directional derivative of $H$.
Throughout the rest of this section, we derive the solution to (6). To do so, we will introduce the directional derivative of the sorted $\ell_1$ norm with respect to the coefficient of the $k$-th cluster. First, as illustrated in Figure 2, note that $H$ is piecewise affine, with breakpoints at 0 and at all $\pm c_i$ for $i \ne k$. Hence, the partial derivative is piecewise constant, with jumps at these points; in addition, $H'(\cdot\,; 1) = -H'(\cdot\,; -1)$ except at these points.
Figure 2: Graph of the partial sorted $\ell_1$ norm with $\beta = [-3, 1, 3, 2]^T$, $k = 1$, and so $(c_1, c_2, c_3) = (3, 2, 1)$.
Let $C(z)$ be the function that returns the cluster of $\beta(z)$ corresponding to $|z|$, that is,
$$C(z) = \{j : |\beta(z)_j| = |z|\}. \tag{8}$$
Remark 2.1. Note that if $z$ is equal to some $c_i$, then $C(z) = \mathcal{C}_i \cup \mathcal{C}_k$, and otherwise $C(z) = \mathcal{C}_k$. Related to the piecewise affineness of $H$ is the fact that the permutation³ corresponding to $\beta(z)$ is
$$\begin{cases} \mathcal{C}_k, \mathcal{C}_m, \ldots, \mathcal{C}_1 & \text{if } z \in \left]0, c_m\right[, \\ \mathcal{C}_m, \ldots, \mathcal{C}_i, \mathcal{C}_k, \mathcal{C}_{i-1}, \ldots, \mathcal{C}_1 & \text{if } z \in \left]c_i, c_{i-1}\right[ \text{ and } i \in \llbracket 2, m \rrbracket, \\ \mathcal{C}_m, \ldots, \mathcal{C}_1, \mathcal{C}_k & \text{if } z \in \left]c_1, +\infty\right[, \end{cases}$$
and that this permutation also reorders $\beta(z \pm h)$ for $z \ne c_i$ ($i \ne k$) and $h$ small enough. The only change in permutation happens when $z = 0$ or $z = c_i$ ($i \ne k$). Finally, the permutations differ between $\beta(z + h)$ and $\beta(z - h)$ for arbitrarily small $h$ if and only if $z = c_i \ne 0$.

³ The permutation is in fact not unique, but this has no impact on our results. This is discussed where needed in the proofs.
We can now state the directional derivative of H.
Theorem 2.2. Let $c_{\setminus k}$ be the set containing all elements of $c$ except the $k$-th one: $c_{\setminus k} = \{c_1, \ldots, c_{k-1}, c_{k+1}, \ldots, c_m\}$. Let $\varepsilon_c > 0$ be such that
$$\varepsilon_c < |c_i - c_j|\ \ \forall i \ne j, \quad \text{and} \quad \varepsilon_c < c_m \text{ if } c_m \ne 0. \tag{9}$$
The directional derivative of the partial sorted $\ell_1$ norm with respect to the $k$-th cluster, $H$, in the direction $\delta$ is
$$H'(z; \delta) = \begin{cases} \displaystyle\sum_{j \in C(\varepsilon_c)} \lambda_{(j)^-_{\varepsilon_c}} & \text{if } z = 0, \\[2ex] \operatorname{sign}(z)\,\delta \displaystyle\sum_{j \in C(z + \varepsilon_c \delta)} \lambda_{(j)^-_{z + \varepsilon_c \delta}} & \text{if } |z| \in c_{\setminus k} \setminus \{0\}, \\[2ex] \operatorname{sign}(z)\,\delta \displaystyle\sum_{j \in C(z)} \lambda_{(j)^-_z} & \text{otherwise}. \end{cases}$$
The proof is in Appendix A.1; in Figure 3, we show an
example of the directional derivative and the objective
function.
Figure 3: The function $G$ and its directional derivative $G'(\cdot\,; \delta)$ for an example with $\beta = [-3, 1, 3, 2]^T$, $k = 1$, and consequently $c_{\setminus k} = \{1, 2\}$. The solution of Problem (6) is the value of $z$ for which $G'(z; \delta) \ge 0$ for $\delta \in \{-1, 1\}$, which holds only at $z = 2$, which must therefore be the solution.
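Since $H(z) = J(\beta(z))$, Theorem 2.2 can also be sanity-checked with one-sided finite differences. A sketch for the running example of Figures 2 and 3 (our code; the $\lambda$ sequence is chosen arbitrarily, and $h$ is a small step unrelated to $\varepsilon_c$):

```python
import numpy as np

beta = np.array([-3.0, 1.0, 3.0, 2.0])
lam = np.array([4.0, 3.0, 2.0, 1.0])      # any non-increasing sequence
Ck = np.flatnonzero(np.abs(beta) == 3.0)  # cluster k = 1, here indices {0, 2}

def H(z):
    """Partial sorted l1 norm (7), computed as H(z) = J(beta(z))."""
    b = beta.copy()
    b[Ck] = np.sign(beta[Ck]) * z
    return np.sort(np.abs(b))[::-1] @ lam

h = 1e-6
for z in [0.0, 1.0, 1.5, 2.0]:            # breakpoints and interior points
    for delta in (1.0, -1.0):
        fd = (H(z + h * delta) - H(z)) / h  # approximates H'(z; delta)
        print(f"H'({z}; {delta:+.0f}) ~ {fd:.3f}")
```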
Using the directional derivative, we can now introduce
the SLOPE thresholding operator.
Theorem 2.3 (The SLOPE Thresholding Operator). Define $S(x) = \sum_{j \in C(x)} \lambda_{(j)^-_x}$ and let
$$T(\gamma; \omega, c, \lambda) = \begin{cases} 0 & \text{if } |\gamma| \le S(\varepsilon_c), \\[1ex] \operatorname{sign}(\gamma)\, c_i & \text{if } \omega c_i + S(c_i - \varepsilon_c) \le |\gamma| \le \omega c_i + S(c_i + \varepsilon_c), \\[1ex] \dfrac{\operatorname{sign}(\gamma)}{\omega} \big(|\gamma| - S(c_i + \varepsilon_c)\big) & \text{if } \omega c_i + S(c_i + \varepsilon_c) < |\gamma| < \omega c_{i-1} + S(c_{i-1} - \varepsilon_c), \\[1ex] \dfrac{\operatorname{sign}(\gamma)}{\omega} \big(|\gamma| - S(c_1 + \varepsilon_c)\big) & \text{if } |\gamma| \ge \omega c_1 + S(c_1 + \varepsilon_c), \end{cases}$$
with $\varepsilon_c$ defined as in (9). Let $\tilde{x} = X_{\mathcal{C}_k} \operatorname{sign}(\beta_{\mathcal{C}_k})$ and $r = y - X\beta$. Then
$$T\big(c_k \lVert\tilde{x}\rVert^2 + \tilde{x}^\top r;\ \lVert\tilde{x}\rVert^2,\ c_{\setminus k},\ \lambda\big) = \operatorname*{arg\,min}_{z \in \mathbb{R}}\ G(z). \tag{10}$$
An illustration of this operator is given in Figure 4.
Remark 2.4. The minimizer is unique because $G$ is the sum of a quadratic function in one variable and a norm.

Remark 2.5. In the lasso case, where the $\lambda_i$'s are all equal, the SLOPE thresholding operator reduces to the soft thresholding operator.
In practice, it is rarely necessary to compute all of the sums in Theorem 2.3. Instead, we first check in which direction we need to search relative to the current order of the cluster, and then search in that direction until we find the solution. The complexity of this operation depends on how far we need to search and on the sizes of the current cluster and the other clusters we need to consider. In practice, the cost is typically larger at the start of optimization and becomes marginal as the algorithm approaches convergence and the cluster permutation stabilizes.
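As a reference point for Theorem 2.3, the minimizer in (10) can also be found by brute force: $G$ is convex and piecewise quadratic in $z$, with breakpoints at 0 and $\pm c_i$ for $i \ne k$, so it suffices to compare the breakpoints with the stationary point of each quadratic piece. The sketch below is our own construction, not the paper's implementation; the closed-form operator reaches the same point without the enumeration, which is what makes the update cheap.

```python
import numpy as np

def cluster_min(X, y, beta, lam, Ck):
    """Minimize G of Problem (6) over z by enumerating the pieces of the
    piecewise quadratic G; a brute-force stand-in for Theorem 2.3."""
    s = np.sign(beta[Ck])
    s[s == 0] = 1.0                            # convention sign(0) = 1

    def beta_z(z):
        b = beta.copy()
        b[Ck] = s * z
        return b

    def H(z):                                  # partial sorted l1 norm (7)
        return np.sort(np.abs(beta_z(z)))[::-1] @ lam

    def G(z):                                  # objective of Problem (6)
        return 0.5 * np.sum((y - X @ beta_z(z)) ** 2) + H(z)

    r = y - X @ beta_z(0.0)                    # residual without cluster k
    xt = X[:, Ck] @ s                          # aggregated column, x-tilde
    a = xt @ xt                                # curvature; assumed > 0 here

    others = np.setdiff1d(np.arange(beta.size), Ck)
    cs = np.unique(np.abs(beta[others]))
    bps = np.unique(np.concatenate([[0.0], cs, -cs]))  # breakpoints of H
    M = (abs(xt @ r) + lam.sum()) / a + np.abs(bps).max() + 1.0
    knots = np.concatenate([[-M], bps, [M]])

    cands = list(bps)
    for lo, hi in zip(knots[:-1], knots[1:]):  # one quadratic piece per gap
        mid, eps = 0.5 * (lo + hi), 1e-9 * (hi - lo)
        slope = (H(mid + eps) - H(mid - eps)) / (2 * eps)  # H is affine here
        z_star = (xt @ r - slope) / a          # stationary point of the piece
        if lo < z_star < hi:
            cands.append(z_star)
    return min(cands, key=G)                   # the update sets beta[Ck] = s*z
```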
2.2 Proximal Gradient Descent Update
The coordinate descent update outlined in the previous section updates the coefficients of each cluster in unison, which allows clusters to merge, but not to split. This means that the coordinate descent updates are not guaranteed to identify the clusters of the solution on their own. To circumvent this issue, we combine these coordinate descent steps with full proximal gradient steps, which enable the algorithm to identify the cluster structure (Liang, Fadili, and Peyré 2014) due to the partial smoothness property of the sorted $\ell_1$ norm that we prove in Appendix A.4. A similar idea has previously been used in Bareilles, Iutzeler, and Malick (2022), wherein Newton steps are taken on the problem structure identified after a proximal gradient descent step.
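Concretely, this full step requires the proximal operator of the sorted $\ell_1$ norm, which can be evaluated exactly with a stack-based pool-adjacent-violators scheme in the spirit of Bogdan et al. (2015). Below is a compact sketch of such a prox and of a single proximal gradient step (our own code, not the paper's implementation):

```python
import numpy as np

def prox_sorted_l1(v, lam):
    """Proximal operator of the sorted l1 norm with non-increasing lam."""
    order = np.argsort(-np.abs(v))
    z = np.abs(v)[order] - lam       # to be fit by a non-increasing sequence
    sums, sizes = [], []             # blocks kept with decreasing averages
    for x in z:
        cur_sum, cur_size = x, 1
        while sums and sums[-1] / sizes[-1] <= cur_sum / cur_size:
            cur_sum += sums.pop()    # merge adjacent violating blocks
            cur_size += sizes.pop()
        sums.append(cur_sum)
        sizes.append(cur_size)
    w = np.concatenate([np.full(n, max(s / n, 0.0))
                        for s, n in zip(sums, sizes)])
    out = np.empty_like(w)
    out[order] = w                   # undo the sort
    return np.sign(v) * out

def prox_grad_step(X, y, beta, lam):
    """One full proximal gradient step with step size 1/L, L = ||X||_2^2."""
    L = np.linalg.norm(X, 2) ** 2
    return prox_sorted_l1(beta - X.T @ (X @ beta - y) / L, lam / L)
```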