Stochastic Mirror Descent for Large-Scale Sparse Recovery
Yannis Bekri1
yannis.bekri@univ-grenoble-alpes.fr
Sasila Ilandarideva1
sasila.ilandarideva@univ-grenoble-alpes.fr
Anatoli Juditsky1
anatoli.juditsky@univ-grenoble-alpes.fr
Vianney Perchet2
vianney.perchet@normalsup.org
1LJK, Université Grenoble Alpes, Grenoble
2CREST, ENSAE Paris and CRITEO AI Lab, Paris
Abstract
In this paper we discuss an application of Stochastic Approximation to statistical estimation of high-dimensional sparse parameters. The proposed solution reduces to solving a penalized stochastic optimization problem at each stage of a multistage algorithm, each problem being solved to a prescribed accuracy by the non-Euclidean Composite Stochastic Mirror Descent (CSMD) algorithm. Assuming that the problem objective is smooth and quadratically minorated and that the stochastic perturbations are sub-Gaussian, our analysis prescribes the method parameters which ensure fast convergence of the estimation error (the radius of a confidence ball of a given norm around the approximate solution). This convergence is linear during the first "preliminary" phase of the routine and sublinear during the second "asymptotic" phase. We consider an application of the proposed approach to the sparse Generalized Linear Regression problem. In this setting, we show that the proposed algorithm attains the optimal convergence rate of the estimation error under weak assumptions on the regressor distribution. We also present a numerical study illustrating the performance of the algorithm on high-dimensional simulation data.
1 Introduction
Our original motivation is the well known problem of (generalized) linear high-dimensional regression with random design. Formally, consider a dataset of $N$ points $(\phi_i, \eta_i)$, $i \in \{1, \dots, N\}$, where $\phi_i \in \mathbf{R}^n$ are (random) features and $\eta_i \in \mathbf{R}$ are observations, linked by the following equation
$$\eta_i = r(\phi_i^T x_*) + \sigma \xi_i, \quad i \in [N] := \{1, \dots, N\}, \qquad (1)$$
where $\xi_i \in \mathbf{R}$ are i.i.d. observation noises. The standard objective is to recover from the dataset the unknown parameter $x_* \in \mathbf{R}^n$ of the Generalized Linear Regression (1), which is assumed to belong to a given convex closed set $X$ and to be $s$-sparse, i.e., to have at most $s \ll n$ non-vanishing entries.
As mentioned before, we consider random design, where the $\phi_i$ are i.i.d. random variables, so that the estimation problem of $x_*$ can be recast as the following generic Stochastic Optimization problem:
$$g_* = \min_{x \in X} g(x), \quad \text{where } g(x) = \mathbf{E}\{G(x, (\phi, \eta))\}, \quad G(x, (\phi, \eta)) = s(\phi^T x) - \phi^T x\, \eta, \qquad (2)$$
with $s(\cdot)$ any primitive of $r(\cdot)$, i.e., $r(t) = s'(t)$. The equivalence between the original and the stochastic optimization problems comes from the fact that $x_*$ is a critical point of $g(\cdot)$, i.e., $\nabla g(x_*) = 0$, since, under mild assumptions, $\nabla g(x) = \mathbf{E}\{\phi\,[r(\phi^T x) - r(\phi^T x_*)]\}$. Hence, as soon as $g$ has a unique minimizer (say, $g$ is strongly convex over $X$), the solutions of both problems are identical.
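As a concrete illustration of model (1) and the stochastic objective (2), the following sketch simulates data in the simplest linear case ($r(t) = t$, so that $s(t) = t^2/2$) and checks numerically that the stochastic gradient $\nabla G(x, (\phi, \eta)) = \phi\,(r(\phi^T x) - \eta)$ has (approximately) zero mean at $x = x_*$. The dimensions, sparsity level and noise level are illustrative choices, not taken from the paper.

```python
import numpy as np

rng = np.random.default_rng(0)
n, s, N, sigma = 500, 10, 10_000, 0.5       # illustrative sizes, not the paper's

# s-sparse ground truth x_* and Gaussian regressors phi_i ~ N(0, I_n)
x_star = np.zeros(n)
x_star[rng.choice(n, s, replace=False)] = rng.standard_normal(s)

def r(t):          # linear link: r(t) = t, with primitive s(t) = t^2 / 2
    return t

Phi = rng.standard_normal((N, n))
eta = r(Phi @ x_star) + sigma * rng.standard_normal(N)      # model (1)

def grad_G(x, phi, eta_i):
    """Stochastic gradient of G(x, (phi, eta)) = s(phi^T x) - phi^T x * eta."""
    return phi * (r(phi @ x) - eta_i)

# Empirical check that E{grad G(x_*, .)} = 0, i.e. x_* is a critical point of g
avg_grad = np.mean([grad_G(x_star, Phi[i], eta[i]) for i in range(N)], axis=0)
print("||mean gradient at x_*||_inf:", np.abs(avg_grad).max())  # ~ sigma*sqrt(ln(n)/N)
```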
As a consequence, we shall focus on the generic problem (2), which has already been widely tackled. For instance, when given an observation sample $(\phi_i, \eta_i)$, $i \in [N]$, one may build a Sample Average Approximation (SAA) of the objective $g(x)$,
$$\widehat g_N(x) = \frac{1}{N}\sum_{i=1}^N G(x, (\phi_i, \eta_i)) = \frac{1}{N}\sum_{i=1}^N \big[s(\phi_i^T x) - \phi_i^T x\, \eta_i\big], \qquad (3)$$
and then solve the resulting problem of minimizing $\widehat g_N(x)$ over sparse $x$'s. The celebrated $\ell_1$-norm minimization approach allows one to reduce this problem to convex optimization. We will provide a new algorithm adapted to this high-dimensional case, and instantiate it on the original problem (1).
Existing approaches and related works. Sparse recovery by Lasso and Dantzig Selector has been extensively studied [11, 8, 5, 46, 10, 9]. This approach computes a solution $\widehat x_N$ to the $\ell_1$-penalized problem $\min_x \widehat g_N(x) + \lambda\|x\|_1$, where $\lambda \geq 0$ is the algorithm parameter [35]. It delivers "good solutions", with high probability, for a sparsity level $s$ as large as $O\big(\tfrac{N\kappa_\Sigma}{\ln n}\big)$, as soon as the random regressors (the $\phi_i$) are drawn independently from a normal distribution with a covariance matrix $\Sigma$ such that $\kappa_\Sigma I \preceq \Sigma \preceq \rho\kappa_\Sigma I$,¹ for some $\kappa_\Sigma > 0$, $\rho \geq 1$. However, computing this solution may be challenging in a very high-dimensional setting: even popular iterative algorithms, such as coordinate descent, loop over a large number of variables. To mitigate this, randomized algorithms [3, 22], screening rules and working sets [19, 30, 34] may be used to diminish the size of the optimization problem at hand, while iterative thresholding [1, 7, 20, 16, 33] is a "direct" approach to enhance sparsity of the solution.
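For reference, in the linear case $r(t) = t$ the $\ell_1$-penalized SAA problem $\min_x \widehat g_N(x) + \lambda\|x\|_1$ has the same minimizers as the ordinary Lasso (the two objectives differ by a constant, and the penalty only needs rescaling). The snippet below, a sketch with illustrative sizes and penalty, is only meant to fix ideas about the batch baseline that the stochastic algorithms discussed next try to avoid solving directly in very high dimension.

```python
import numpy as np
from sklearn.linear_model import Lasso

rng = np.random.default_rng(1)
n, s, N, sigma = 500, 10, 2_000, 0.5                         # illustrative sizes
x_star = np.zeros(n); x_star[:s] = rng.standard_normal(s)    # s-sparse parameter
Phi = rng.standard_normal((N, n))
eta = Phi @ x_star + sigma * rng.standard_normal(N)          # model (1) with r(t) = t

# sklearn's Lasso minimizes (1/(2N))*||eta - Phi x||_2^2 + alpha*||x||_1, which has
# the same minimizers as min_x g_N(x) + lambda*||x||_1 (the objectives differ by a constant).
lam = sigma * np.sqrt(np.log(n) / N)             # classical order of the penalty
x_hat = Lasso(alpha=lam, fit_intercept=False, max_iter=50_000).fit(Phi, eta).coef_
print("support size:", np.count_nonzero(x_hat),
      " l2 error:", np.linalg.norm(x_hat - x_star))
```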
Another approach relies on Stochastic Approximation (SA). As $\nabla G(x, (\phi_i, \eta_i)) = \phi_i\,(r(\phi_i^T x) - \eta_i)$ is an unbiased estimate of $\nabla g(x)$, the iterative Stochastic Gradient Descent (SGD) algorithm may be used to build approximate solutions. Unfortunately, unless the regressors $\phi$ are sparse or possess a special structure, standard SA leads to accuracy bounds for sparse recovery proportional to the dimension $n$, which are essentially useless in the high-dimensional setting. This motivates non-Euclidean SA procedures, such as Stochastic Mirror Descent (SMD) [37], whose application to sparse recovery enjoys almost dimension-free convergence and has been well studied in the literature. For instance, under bounded regressors and with sub-Gaussian noise, SMD reaches a "slow rate" of sparse recovery of the type $g(\widehat x_N) - g_* = O\big(\sigma\sqrt{s\ln(n)/N}\big)$, where $\widehat x_N$ is the approximate solution after $N$ iterations [44, 45]. Multistage routines may be used to improve the error estimates of SA under strong or uniform convexity assumptions [27, 29, 18]. However, these assumptions do not always hold, as in sparse Generalized Linear Regression, where they are replaced by Restricted Strong Convexity conditions. Then multistage procedures [2, 17] based on standard SMD algorithms [24, 38] control the $\ell_2$-error $\|\widehat x_N - x_*\|_2$ at the rate $O\big(\tfrac{\sigma}{\kappa_\Sigma}\sqrt{\tfrac{s\ln n}{N}}\big)$ with high probability. This is the best "asymptotic" rate attainable when solving (2). However, those algorithms have two major limitations. They both need a number of iterations proportional to the initial error $R = \|x_* - x_0\|_1$ to reach a given accuracy, and the sparsity level $s$ must be of order $O\big(\kappa_\Sigma\sqrt{N/\ln n}\big)$ for the sparse linear regression. These limits may be seen as a consequence of dealing with the non-smooth objective $g(x)$. Although it slightly restricts the scope of the corresponding algorithms, we shall consider smooth objectives and algorithms for minimizing composite objectives (cf. [25, 32, 39]) to mitigate the aforementioned drawbacks of the multistage algorithms from [2, 17].
Principal contributions. We provide a refined analysis of Composite Stochastic Mirror Descent (CSMD) algorithms for computing sparse solutions to the Stochastic Optimization problem, leveraging smoothness of the objective. This leads to a new "aggressive" choice of parameters in a multistage algorithm, with significantly improved performance compared to that of [2]. We summarize below some properties of the proposed procedure for problem (2).
¹We use $A \preceq B$ for two symmetric matrices $A$ and $B$ if $B - A \succeq 0$, i.e., $B - A$ is positive semidefinite.
Each stage of the algorithm is a specific CSMD recursion; the stages fall into two phases. During the first (preliminary) phase, the estimation error decreases linearly, with exponent proportional to $\frac{\kappa_\Sigma}{s\ln n}$. When it reaches the value $O\big(\frac{\sigma s}{\kappa_\Sigma}\big)$, the second (asymptotic) phase begins; its stages contain exponentially increasing numbers of iterations per stage, so that the estimation error decreases as $O\big(\frac{\sigma s}{\kappa_\Sigma}\sqrt{\frac{\ln n}{N}}\big)$, where $N$ is the total iteration count.
Organization and notation. The remainder of the paper is organized as follows. In Section 2, we state the general problem and present the multistage optimization routine along with its basic properties. Then, in Section 3, we discuss the properties of the method and the conditions under which it leads to "small error" solutions to sparse GLR estimation problems. Finally, a small simulation study illustrating the numerical performance of the proposed routines on a high-dimensional GLR estimation problem is presented in Section 3.3.
In the following, $E$ is a Euclidean space and $\|\cdot\|$ is a norm on $E$; we denote by $\|\cdot\|_*$ the conjugate norm (i.e., $\|x\|_* = \sup_{\|y\|\leq 1}\langle y, x\rangle$). Given a positive semidefinite matrix $\Sigma \in \mathbf{S}^n$ and $x \in \mathbf{R}^n$, we denote $\|x\|_\Sigma = \sqrt{x^T\Sigma x}$, and for any matrix $Q$ we denote $\|Q\|_\infty = \max_{i,j}|[Q]_{ij}|$. We use the generic notation $c$ and $C$ for absolute constants; the shortcut notation $a \lesssim b$ ($a \gtrsim b$) means that the ratio $a/b$ (ratio $b/a$) is bounded by an absolute constant; the symbols $\vee$, $\wedge$ and the notation $(\cdot)_+$ respectively refer to "maximum between", "minimum between" and "positive part".
2 Multistage Stochastic Mirror Descent for Sparse Stochastic Optimization
This section is dedicated to the formulation of the generic stochastic optimization problem and to the description and analysis of the generic algorithm.
2.1 Problem statement
Let $X$ be a convex closed subset of a Euclidean space $E$ and $(\Omega, P)$ a probability space. We consider a mapping $G : X \times \Omega \to \mathbf{R}$ such that, for all $\omega \in \Omega$, $G(\cdot, \omega)$ is convex on $X$ and smooth, meaning that $\nabla G(\cdot, \omega)$ is Lipschitz continuous on $X$ with a.s. bounded Lipschitz constant:
$$\forall x, x' \in X, \quad \|\nabla G(x, \omega) - \nabla G(x', \omega)\|_* \leq L(\omega)\|x - x'\|, \qquad L(\omega) \leq \nu \ \text{a.s.} \qquad (4)$$
We define $g(x) := \mathbf{E}\{G(x, \omega)\}$, where $\mathbf{E}\{\cdot\}$ stands for the expectation with respect to $\omega$ drawn from $P$. We shall assume that the mapping $g(\cdot)$ is finite, convex and differentiable on $X$, and we aim at solving the following stochastic optimization problem
$$\min_{x \in X}\ \big[g(x) = \mathbf{E}\{G(x, \omega)\}\big], \qquad (5)$$
assuming it admits an $s$-sparse optimal solution $x_*$ for some sparsity structure.
To solve this problem, a stochastic oracle can be queried: when given at input a point $x \in X$, it generates $\omega \in \Omega$ from $P$ and outputs $G(x, \omega)$ and $\nabla G(x, \omega) := \nabla_x G(x, \omega)$ (with a slight abuse of notation). We assume that the oracle is unbiased, i.e.,
$$\mathbf{E}\{\nabla G(x, \omega)\} = \nabla g(x), \quad \forall x \in X.$$
To streamline the presentation, we assume, as is often the case in applications of the stochastic optimization problem (5), that $x_*$ is unconditional, i.e., $\nabla g(x_*) = 0$, or, stated otherwise, $\mathbf{E}\{\nabla G(x_*, \omega)\} = 0$; we also suppose the sub-Gaussianity of $\nabla G(x_*, \omega)$, namely that, for some $\sigma < \infty$,
$$\mathbf{E}\Big\{\exp\Big(\tfrac{\|\nabla G(x_*, \omega)\|_*^2}{\sigma^2}\Big)\Big\} \leq \exp(1). \qquad (6)$$
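To make the oracle model concrete, here is a minimal sketch of an unbiased stochastic first-order oracle for problem (5) in the GLR example of the introduction. The interface (a `sample()` method returning $G(x, \omega)$ and $\nabla G(x, \omega)$) is our own illustrative convention, not notation from the paper.

```python
import numpy as np

class GLROracle:
    """Unbiased stochastic oracle for g(x) = E{ s(phi^T x) - phi^T x * eta }."""

    def __init__(self, x_star, sigma, r=lambda t: t, s=lambda t: 0.5 * t**2, seed=0):
        self.x_star, self.sigma, self.r, self.s = x_star, sigma, r, s
        self.rng = np.random.default_rng(seed)

    def sample(self, x):
        """Draw omega = (phi, eta) and return (G(x, omega), grad_x G(x, omega))."""
        n = self.x_star.size
        phi = self.rng.standard_normal(n)                           # regressor
        eta = self.r(phi @ self.x_star) + self.sigma * self.rng.standard_normal()
        value = self.s(phi @ x) - (phi @ x) * eta                   # G(x, omega)
        grad = phi * (self.r(phi @ x) - eta)                        # grad_x G(x, omega)
        return value, grad
```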
2.2 Composite Stochastic Mirror Descent algorithm
As mentioned in the introduction, (stochastic) optimization over the set of sparse solutions can be done through "composite" techniques. We take a similar approach here, by transforming the generic problem (5) into the following composite Stochastic Optimization problem, adapted to some norm $\|\cdot\|$ and parameterized by $\kappa \geq 0$:
$$\min_{x \in X}\ \Big\{F_\kappa(x) := \tfrac{1}{2}\,g(x) + \kappa\|x\| = \tfrac{1}{2}\,\mathbf{E}\{G(x, \omega)\} + \kappa\|x\|\Big\}. \qquad (7)$$
The purpose of this section is to derive a new (proximal) algorithm. We first provide the necessary background and notation.
Proximal setup, Bregman divergences and proximal mapping. Let $B$ be the unit ball of the norm $\|\cdot\|$ and $\theta : B \to \mathbf{R}$ be a distance-generating function (d.-g.f.) of $B$, i.e., a continuously differentiable convex function which is strongly convex with respect to the norm $\|\cdot\|$:
$$\langle \nabla\theta(x) - \nabla\theta(x'), x - x'\rangle \geq \|x - x'\|^2, \quad \forall x, x' \in X.$$
We assume w.l.o.g. that $\theta(x) \geq \theta(0) = 0$ and denote $\Theta = \max_{\|z\|\leq 1}\theta(z)$.
We now introduce a local and renormalized version of the d.-g.f. θ.
Definition 2.1 For any $x_0 \in X$, let $X_R(x_0) := \{z \in X : \|z - x_0\| \leq R\}$ be the ball of radius $R$ around $x_0$. It is equipped with the d.-g.f. $\vartheta^R_{x_0}(z) := R^2\,\theta\big((z - x_0)/R\big)$.
Note that $\vartheta^R_{x_0}(z)$ is strongly convex on $X_R(x_0)$ with modulus 1, $\vartheta^R_{x_0}(x_0) = 0$, and $\vartheta^R_{x_0}(z) \leq \Theta R^2$.
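A standard "$\ell_1$-friendly" instance of this proximal setup, classical in the mirror-descent literature but not spelled out in this excerpt, takes $\|\cdot\| = \|\cdot\|_1$ and $\theta(x) = \tfrac{c}{2}\|x\|_p^2$ with $p = 1 + 1/\ln n$ and a scaling constant $c$ of order $\ln n$, so that $\Theta = O(\ln n)$. The sketch below implements this candidate $\theta$, the rescaled local d.-g.f. of Definition 2.1, and the Bregman divergence of Definition 2.2; the specific constant $c = e\ln n$ is a common but here-unverified choice, used purely for illustration.

```python
import numpy as np

n = 500
p = 1 + 1.0 / np.log(n)       # p close to 1
c = np.e * np.log(n)          # scaling of order ln(n); exact constant is an assumption

def theta(x):
    """Candidate d.-g.f. for the l1 ball; Theta = max_{||z||_1 <= 1} theta(z) = O(ln n)."""
    return 0.5 * c * np.linalg.norm(x, p) ** 2

def grad_theta(x):
    nrm = np.linalg.norm(x, p)
    if nrm == 0.0:
        return np.zeros_like(x)
    return c * nrm ** (2 - p) * np.sign(x) * np.abs(x) ** (p - 1)

def vartheta(z, x0, R):
    """Rescaled local d.-g.f. of Definition 2.1 on X_R(x0)."""
    return R ** 2 * theta((z - x0) / R)

def bregman(x, z, x0, R):
    """Bregman divergence V_{x0}(x, z) of Definition 2.2."""
    g = R * grad_theta((x - x0) / R)        # gradient of vartheta at x
    return vartheta(z, x0, R) - vartheta(x, x0, R) - g @ (z - x)
```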
Definition 2.2 Given $x_0 \in X$ and $R > 0$, the Bregman divergence $V$ associated to $\vartheta^R_{x_0}$ is defined by
$$V_{x_0}(x, z) = \vartheta^R_{x_0}(z) - \vartheta^R_{x_0}(x) - \langle \nabla\vartheta^R_{x_0}(x), z - x\rangle, \quad x, z \in X.$$
We can now define the composite proximal mapping on $X_R(x_0)$ [39, 40] with respect to some convex and continuous mapping $h : X \to \mathbf{R}$.
Definition 2.3 The composite proximal mapping with respect to $h$ and $x$ is defined by
$$\mathrm{Prox}_{h, x_0}(\zeta, x) := \arg\min_{z \in X_R(x_0)}\big\{\langle \zeta, z\rangle + h(z) + V_{x_0}(x, z)\big\} = \arg\min_{z \in X_R(x_0)}\big\{\langle \zeta - \nabla\vartheta^R_{x_0}(x), z\rangle + h(z) + \vartheta^R_{x_0}(z)\big\}. \qquad (8)$$
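To give a sense of what (8) looks like in a simple case, the sketch below instantiates the composite proximal mapping for the Euclidean d.-g.f. $\theta(x) = \tfrac12\|x\|_2^2$ and $h(z) = \kappa\|z\|_1$, with the localization constraint $z \in X_R(x_0)$ assumed inactive (i.e., $R$ large enough); in that simplified case the mapping reduces to coordinatewise soft-thresholding. This is an illustration of the definition, not the setup analyzed in the paper.

```python
import numpy as np

def soft_threshold(v, tau):
    """Coordinatewise soft-thresholding: the prox of tau * ||.||_1."""
    return np.sign(v) * np.maximum(np.abs(v) - tau, 0.0)

def composite_prox_euclid(zeta, x, kappa):
    """Prox_{h,x0}(zeta, x) of (8) for theta(x) = 0.5*||x||_2^2 and h = kappa*||.||_1,
    with the ball constraint X_R(x0) assumed inactive:
        argmin_z  <zeta, z> + kappa*||z||_1 + 0.5*||z - x||_2^2
    """
    return soft_threshold(x - zeta, kappa)
```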
When (8) can be efficiently solved to high accuracy and $\Theta$ is "not too large" (we refer to [27, 36, 40]), the corresponding setups will be called "prox-friendly". We now introduce the main building block of our algorithm, the Composite Stochastic Mirror Descent.
Composite Stochastic Mirror Descent algorithm. Given a sequence of positive step sizes $\gamma_i > 0$, the Composite Stochastic Mirror Descent (CSMD) is defined by the following recursion:
$$x_i = \mathrm{Prox}_{\gamma_i h, x_0}\big(\gamma_i \nabla G(x_{i-1}, \omega_i),\, x_{i-1}\big), \quad x_0 \in X. \qquad (9)$$
After $m$ steps of CSMD, the final output (approximate solution) is $\widehat x_m$ defined by
$$\widehat x_m = \frac{\sum_{i=0}^{m-1}\gamma_i x_i}{\sum_{i=0}^{m-1}\gamma_i}. \qquad (10)$$
For any integer $L \in \mathbf{N}$, we can also define the $L$-minibatch CSMD. Let $\omega^{(L)}_i = [\omega^1_i, \dots, \omega^L_i]$ be i.i.d. realizations of $\omega_i$. The associated (averaged) stochastic gradient is then simply defined as
$$H\big(x_{i-1}, \omega^{(L)}_i\big) = \frac{1}{L}\sum_{\ell=1}^L \nabla G(x_{i-1}, \omega^\ell_i),$$
which yields the following recursion for the $L$-minibatch CSMD:
$$x^{(L)}_i = \mathrm{Prox}_{\gamma_i h, x_0}\big(\gamma_i H\big(x^{(L)}_{i-1}, \omega^{(L)}_i\big),\, x^{(L)}_{i-1}\big), \quad x_0 \in X, \qquad (11)$$
with its approximate solution $\widehat x^{(L)}_m = \sum_{i=0}^{m-1}\gamma_i x^{(L)}_i \big/ \sum_{i=0}^{m-1}\gamma_i$ after $m$ iterations.
From now on, we set $h(x) = \kappa\|x\|$.
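For concreteness, here is a minimal sketch of recursions (9)-(11) specialized to $\|\cdot\| = \|\cdot\|_1$, so $h(x) = \kappa\|x\|_1$, using the simplified Euclidean prox from the previous sketch and an oracle with the `GLROracle` interface introduced in Section 2.1. The constant step-size and the absence of the ball constraint $X_R(x_0)$ are simplifying assumptions of this sketch, not the setting analyzed below.

```python
import numpy as np

def csmd(oracle, x0, gamma, kappa, m, L=1):
    """L-minibatch CSMD, recursions (9)-(11), with h(x) = kappa*||x||_1.

    Sketch assumptions: Euclidean d.-g.f. (so the prox is soft-thresholding),
    constant step-size gamma, ball constraint X_R(x0) left inactive.
    `oracle.sample(x)` must return a pair (G(x, omega), grad G(x, omega)).
    """
    soft = lambda v, tau: np.sign(v) * np.maximum(np.abs(v) - tau, 0.0)
    x, xs = x0.copy(), []
    for _ in range(m):
        xs.append(x)                                    # keep x_0, ..., x_{m-1}
        # averaged stochastic gradient H(x_{i-1}, omega_i^{(L)})
        H = np.mean([oracle.sample(x)[1] for _ in range(L)], axis=0)
        x = soft(x - gamma * H, gamma * kappa)          # composite prox step (8)
    return np.mean(xs, axis=0)                          # eq. (10), constant gamma

# Illustrative use with the GLROracle sketch from Section 2.1:
#   x_hat = csmd(GLROracle(x_star, sigma), np.zeros(n), gamma=0.01, kappa=0.05, m=2_000)
```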
Proposition 2.1 If the step-sizes are constant, i.e., $\gamma_i \equiv \gamma \leq (4\nu)^{-1}$, $i = 0, 1, \dots$, and the initial point $x_0 \in X$ is such that $x_* \in X_R(x_0)$, then for any $t \gtrsim 1 + \ln m$, with probability at least $1 - 4e^{-t}$,
$$F_\kappa(\widehat x_m) - F_\kappa(x_*) \lesssim m^{-1}\big[\gamma^{-1}R^2(\Theta + t) + \kappa R + \gamma\sigma^2(m + t)\big], \qquad (12)$$
and the approximate solution $\widehat x^{(L)}_m$ of the $L$-minibatch CSMD satisfies
$$F_\kappa(\widehat x^{(L)}_m) - F_\kappa(x_*) \lesssim m^{-1}\big[\gamma^{-1}R^2(\Theta + t) + \kappa R + \gamma\sigma^2\Theta L^{-1}(m + t)\big]. \qquad (13)$$
For the sake of clarity and conciseness, we denote by CSMD($x_0, \gamma, \kappa, R, m, L$) the approximate solution $\widehat x^{(L)}_m$ computed after $m$ iterations of the $L$-minibatch CSMD algorithm with initial point $x_0$, step-size $\gamma$, and radius $R$, using recursion (11).
2.3 Main contribution: a multistage adaptive algorithm
Our approach to finding a sparse solution to the original stochastic optimization problem (5) consists in solving a sequence of auxiliary composite problems (7), with their sequence of parameters ($\kappa$, $x_0$, $R$) defined recursively. For the latter, we need to infer the quality of an approximate solution to (5). To this end, we introduce the following Reduced Strong Convexity (RSC) assumption, satisfied in the motivating example (it is discussed in the appendix for the sake of fluency):
Assumption [RSC] There exist some $\delta > 0$ and $\rho < \infty$ such that, for any feasible solution $\widehat x \in X$ to the composite problem (7) satisfying, with probability at least $1 - \varepsilon$,
$$F_\kappa(\widehat x) - F_\kappa(x_*) \leq \upsilon,$$
it holds, with probability at least $1 - \varepsilon$, that
$$\|\widehat x - x_*\| \leq \delta\rho s\kappa + \upsilon\kappa^{-1}. \qquad (14)$$
Given the different problem parameters $s, \nu, \delta, \rho, \kappa, R$ and some initial point $x_0 \in X$ such that $x_* \in X_R(x_0)$, Algorithm 1 works in stages. Each stage represents a run of the CSMD algorithm with a properly set penalty parameter $\kappa$. More precisely, at stage $k + 1$, given the approximate solution $\widehat x^k_m$ of stage $k$, a new instance of CSMD is initialized on $X_{R_{k+1}}(x^{k+1}_0)$ with $x^{k+1}_0 = \widehat x^k_m$ and $R_{k+1} = R_k/2$.
Furthermore, those stages are divided into two phases, which we refer to as preliminary and asymptotic (a code sketch of the overall multistage loop follows this description):
Preliminary phase: During this phase, the step-size $\gamma$ and the number of CSMD iterations per stage are fixed; the error of the approximate solutions converges linearly with the total number of calls to the stochastic oracle. This phase terminates when the error of the approximate solution becomes independent of the initial error of the algorithm; then the asymptotic phase begins.
Asymptotic phase: In this phase, the step-size decreases and the length of the stages increases linearly; the solution converges sublinearly, with the "standard" rate $O(N^{-1/2})$, where $N$ is the total number of oracle calls. When an expensive proximal computation (8) results in a high numerical cost of the iterative algorithm, minibatches are used to keep the number of iterations per stage fixed.
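The stage structure just described can be summarized by the following driver loop, which relies on the `csmd` sketch given after recursion (11). The per-stage parameter policy (step-sizes, stage lengths, penalties, phase switch) is a placeholder: the actual choices are those prescribed by Algorithm 1 and Proposition 2.1, so this sketch only illustrates the restart-and-halve-the-radius mechanism.

```python
import numpy as np

def multistage_csmd(oracle, x0, R0, n_stages, stage_params):
    """Multistage driver: stage k+1 restarts CSMD at the previous approximate
    solution with radius R_{k+1} = R_k / 2 (see Section 2.3).

    `stage_params(k, R)` returns (gamma, kappa, m, L) for stage k; it is a
    placeholder for the policy prescribed by Algorithm 1, not the actual one.
    """
    x, R = x0.copy(), R0
    for k in range(n_stages):
        gamma, kappa, m, L = stage_params(k, R)
        x = csmd(oracle, x, gamma, kappa, m, L)   # one CSMD stage
        R /= 2.0                                  # shrink the localization radius
    return x

# Toy policy: fixed (gamma, m) during the preliminary phase, then step-sizes
# halved and stage lengths doubled in the asymptotic phase (constants illustrative).
def toy_stage_params(k, R, k_prelim=5):
    if k < k_prelim:
        return 1e-2, 0.1 * R, 500, 1
    j = k - k_prelim
    return 1e-2 / 2**j, 0.1 * R, 500 * 2**j, 1
```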