
with $s(\cdot)$ any primitive of $r(\cdot)$, i.e., $r(t) = s'(t)$. The equivalence between the original and the stochastic
optimization problems comes from the fact that $x_*$ is a critical point of $g(\cdot)$, i.e., $\nabla g(x_*) = 0$, since, under mild
assumptions, $\nabla g(x) = \mathbf{E}\big\{\phi\big[r(\phi^T x) - r(\phi^T x_*)\big]\big\}$. Hence, as soon as $g$ has a unique minimizer (say, $g$ is strongly
convex over $X$), the solutions of both problems are identical.
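To make the equivalence explicit, recall that $\nabla_x G(x,(\phi,\eta)) = \phi\,\big(r(\phi^T x) - \eta\big)$ (see below); if, as in the standard GLM observation model, one assumes $\mathbf{E}[\eta \mid \phi] = r(\phi^T x_*)$ (an assumption spelled out here only to illustrate the "mild assumptions" above), then
\[
\nabla g(x) \;=\; \mathbf{E}\big[\phi\,\big(r(\phi^T x) - \eta\big)\big] \;=\; \mathbf{E}\big\{\phi\,\big[r(\phi^T x) - r(\phi^T x_*)\big]\big\},
\]
which indeed vanishes at $x = x_*$.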
As a consequence, we shall focus on the generic problem (2), which has already been widely tackled. For
instance, given an observation sample $(\phi_i, \eta_i)$, $i \in [N]$, one may build a Sample Average Approximation
(SAA) of the objective $g(x)$,
\[
\widehat g_N(x) \;=\; \frac{1}{N}\sum_{i=1}^N G\big(x, (\phi_i, \eta_i)\big) \;=\; \frac{1}{N}\sum_{i=1}^N \big[s(\phi_i^T x) - \phi_i^T x\,\eta_i\big] \tag{3}
\]
and then solve the resulting problem of minimizing $\widehat g_N(x)$ over sparse $x$'s. The celebrated $\ell_1$-norm minimization
approach allows one to reduce this problem to convex optimization. We will provide a new algorithm adapted to this
high-dimensional case and instantiate it on the original problem (1).
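For concreteness, a minimal numerical sketch of the SAA objective (3) and of its gradient is given below; the identity link $r(t) = t$, $s(t) = t^2/2$ (i.e., the sparse linear regression instance of (2)) is assumed purely for illustration, and the function names are ours.

```python
import numpy as np

# Minimal sketch of the SAA objective (3) and its gradient.
# Assumption (illustration only): identity link r(t) = t, s(t) = t^2 / 2,
# i.e., the sparse linear regression instance of problem (2).
def r(t):
    return t

def s(t):
    return 0.5 * t ** 2

def saa_objective(x, Phi, eta):
    """hat g_N(x) = (1/N) * sum_i [ s(phi_i^T x) - (phi_i^T x) * eta_i ]."""
    z = Phi @ x                       # z_i = phi_i^T x
    return np.mean(s(z) - z * eta)

def saa_gradient(x, Phi, eta):
    """grad hat g_N(x) = (1/N) * sum_i phi_i * ( r(phi_i^T x) - eta_i )."""
    z = Phi @ x
    return Phi.T @ (r(z) - eta) / len(eta)
```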
Existing approaches and related works.
Sparse recovery by the Lasso and the Dantzig Selector has been extensively
studied [11, 8, 5, 46, 10, 9]. The Lasso computes a solution $\widehat x_N$ to the $\ell_1$-penalized problem $\min_x \widehat g_N(x) + \lambda\|x\|_1$,
where $\lambda \ge 0$ is the algorithm parameter [35]. This delivers “good solutions”, with high probability, for sparsity level $s$ as
large as $O\big(\frac{N\kappa_\Sigma}{\ln n}\big)$, as soon as the random regressors (the $\phi_i$) are drawn independently from a normal distribution
with a covariance matrix $\Sigma$ such that $\kappa_\Sigma I \preceq \Sigma \preceq \rho\,\kappa_\Sigma I$,$^1$ for some $\kappa_\Sigma > 0$, $\rho \ge 1$. However, computing this
solution may be challenging in a very high-dimensional setting: even popular iterative algorithms, such as coordinate
descent, loop over a large number of variables. To mitigate this, randomized algorithms [3, 22], screening rules
and working sets [19, 30, 34] may be used to diminish the size of the optimization problem at hand, while
iterative thresholding [1, 7, 20, 16, 33] is a “direct” approach to enhance the sparsity of the solution.
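As an illustration of the latter, a minimal proximal-gradient (iterative soft-thresholding) sketch for the $\ell_1$-penalized SAA problem is given below; the identity-link gradient and the $1/L$ step size are illustrative choices of ours and do not reproduce the specific algorithms of the cited works.

```python
import numpy as np

def soft_threshold(v, tau):
    """Proximal operator of tau * ||.||_1 (componentwise soft-thresholding)."""
    return np.sign(v) * np.maximum(np.abs(v) - tau, 0.0)

def ista_l1(Phi, eta, lam, n_iter=500):
    """Proximal-gradient (ISTA) sketch for  min_x  hat g_N(x) + lam * ||x||_1,
    with hat g_N as in (3) for the identity link (illustrative choice)."""
    N, n = Phi.shape
    x = np.zeros(n)
    # 1/L step size, where L = ||Phi||_2^2 / N bounds the Lipschitz constant
    # of grad hat g_N in the linear case.
    L = np.linalg.norm(Phi, 2) ** 2 / N
    gamma = 1.0 / L
    for _ in range(n_iter):
        grad = Phi.T @ (Phi @ x - eta) / N    # grad hat g_N(x) for r(t) = t
        x = soft_threshold(x - gamma * grad, gamma * lam)
    return x
```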
Another approach relies on Stochastic Approximation (SA). As
$\nabla G(x, (\phi_i, \eta_i)) = \phi_i\big(r(\phi_i^T x) - \eta_i\big)$ is an
unbiased estimate of $\nabla g(x)$, the Stochastic Gradient Descent (SGD) algorithm may be used to build
approximate solutions. Unfortunately, unless the regressors $\phi$ are sparse or possess a special structure, standard SA
leads to accuracy bounds for sparse recovery proportional to the dimension $n$, which are essentially useless in
the high-dimensional setting. This motivates non-Euclidean SA procedures, such as Stochastic Mirror Descent
(SMD) [37], whose application to sparse recovery enjoys almost dimension-free convergence and has been well
studied in the literature. For instance, under bounded regressors and with sub-Gaussian noise, SMD reaches a
“slow rate” of sparse recovery of the type $g(\widehat x_N) - g_* = O\big(\sigma\sqrt{s\ln(n)/N}\big)$, where $\widehat x_N$ is the approximate solution
after $N$ iterations [44, 45]. Multistage routines may be used to improve the error estimates of SA under strong or
uniform convexity assumptions [27, 29, 18]. However, these assumptions do not always hold, as in sparse Generalized Linear
Regression, where they are replaced by Restricted Strong Convexity conditions. In that case, multistage procedures
[2, 17] based on standard SMD algorithms [24, 38] control the $\ell_2$-error $\|\widehat x_N - x_*\|_2$ at the rate $O\Big(\frac{\sigma}{\kappa_\Sigma}\sqrt{\frac{s\ln n}{N}}\Big)$
with high probability. This is the best “asymptotic” rate attainable when solving (2). However, those algorithms
have two major limitations: they both need a number of iterations proportional to
the initial error $R = \|x_* - x_0\|_1$ to reach a given accuracy, and the sparsity level $s$ must be of order $O\big(\kappa_\Sigma\sqrt{\frac{N}{\ln n}}\big)$ for sparse linear
regression. These limits may be seen as a consequence of dealing with a non-smooth objective $g(x)$. Although it
slightly restricts the scope of the corresponding algorithms, we shall consider smooth objectives and algorithms for
minimizing composite objectives (cf. [25, 32, 39]) to mitigate the aforementioned drawbacks of the multistage
algorithms from [2, 17].
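For intuition only, the sketch below implements a single plain (non-composite, single-stage) SMD pass with a standard sparsity-adapted distance-generating function $\omega(x) = \tfrac12\|x\|_p^2$, $p = 1 + 1/\ln n$; the step size, streaming order and averaging are illustrative choices of ours, and the multistage and composite refinements discussed above (in particular the CSMD procedure studied in this paper) are not reproduced here.

```python
import numpy as np

def smd_sparse(Phi, eta, gamma, r=lambda t: t):
    """Plain (non-composite, single-stage) Stochastic Mirror Descent sketch with
    the l1-adapted distance-generating function omega(x) = 0.5 * ||x||_p^2,
    p = 1 + 1/ln(n). Illustrative only; not the multistage CSMD procedure."""
    N, n = Phi.shape
    p = 1.0 + 1.0 / np.log(n)
    q = p / (p - 1.0)                        # conjugate exponent of p

    def to_dual(x):                          # mirror map: grad omega(x)
        if not np.any(x):
            return np.zeros(n)
        return np.linalg.norm(x, p) ** (2.0 - p) * np.sign(x) * np.abs(x) ** (p - 1.0)

    def to_primal(y):                        # inverse mirror map: grad omega*(y)
        if not np.any(y):
            return np.zeros(n)
        return np.linalg.norm(y, q) ** (2.0 - q) * np.sign(y) * np.abs(y) ** (q - 1.0)

    x = np.zeros(n)
    x_avg = np.zeros(n)
    for i in range(N):                       # one pass over the sample
        g = Phi[i] * (r(Phi[i] @ x) - eta[i])    # unbiased stochastic gradient
        x = to_primal(to_dual(x) - gamma * g)    # mirror descent step
        x_avg += x / N
    return x_avg                             # averaged iterate
```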
Principal contributions.
We provide a refined analysis of Composite Stochastic Mirror Descent (CSMD)
algorithms for computing sparse solutions to the Stochastic Optimization problem, leveraging smoothness of the
objective. This leads to a new “aggressive” choice of parameters in a multistage algorithm, with significantly
improved performance compared to that of [2]. We summarize below some properties of the proposed procedure
for problem (2).
$^1$We use $A \preceq B$ for two symmetric matrices $A$ and $B$ if $B - A \succeq 0$, i.e., $B - A$ is positive semidefinite.