
with $s(\cdot)$ any primitive of $r(\cdot)$, i.e., $r(t) = s'(t)$. The equivalence between the original and the stochastic
optimization problems comes from the fact that $x_*$ is a critical point of $g(\cdot)$, i.e., $\nabla g(x_*) = 0$, since, under mild
assumptions, $\nabla g(x) = \mathbf{E}\big\{\phi\big[r(\phi^T x) - r(\phi^T x_*)\big]\big\}$. Hence, as soon as $g$ has a unique minimizer (say, $g$ is strongly
convex over $X$), the solutions of both problems are identical.
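To make the equivalence explicit, recall that $\nabla_x G(x,(\phi,\eta)) = \phi\,\big(r(\phi^T x) - \eta\big)$ (see below); if, as in the standard GLM observation model, one assumes $\mathbf{E}[\eta \mid \phi] = r(\phi^T x_*)$ (an assumption spelled out here only to illustrate the "mild assumptions" above), then
\[
\nabla g(x) \;=\; \mathbf{E}\big[\phi\,\big(r(\phi^T x) - \eta\big)\big] \;=\; \mathbf{E}\big\{\phi\,\big[r(\phi^T x) - r(\phi^T x_*)\big]\big\},
\]
which indeed vanishes at $x = x_*$.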
As a consequence, we shall focus on the generic problem (2), which has already been widely tackled. For
instance, given an observation sample $(\phi_i, \eta_i)$, $i \in [N]$, one may build a Sample Average Approximation
(SAA) of the objective $g(x)$,
\[
\widehat g_N(x) \;=\; \frac{1}{N}\sum_{i=1}^N G\big(x, (\phi_i, \eta_i)\big) \;=\; \frac{1}{N}\sum_{i=1}^N \big[s(\phi_i^T x) - \phi_i^T x\,\eta_i\big] \tag{3}
\]
and then solve the resulting problem of minimizing $\widehat g_N(x)$ over sparse $x$'s. The celebrated $\ell_1$-norm minimization
approach allows one to reduce this problem to convex optimization. We will provide a new algorithm adapted to this
high-dimensional case and instantiate it on the original problem (1).
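For concreteness, a minimal numerical sketch of the SAA objective (3) and of its gradient is given below; the identity link $r(t) = t$, $s(t) = t^2/2$ (i.e., the sparse linear regression instance of (2)) is assumed purely for illustration, and the function names are ours.

```python
import numpy as np

# Minimal sketch of the SAA objective (3) and its gradient.
# Assumption (illustration only): identity link r(t) = t, s(t) = t^2 / 2,
# i.e., the sparse linear regression instance of problem (2).
def r(t):
    return t

def s(t):
    return 0.5 * t ** 2

def saa_objective(x, Phi, eta):
    """hat g_N(x) = (1/N) * sum_i [ s(phi_i^T x) - (phi_i^T x) * eta_i ]."""
    z = Phi @ x                       # z_i = phi_i^T x
    return np.mean(s(z) - z * eta)

def saa_gradient(x, Phi, eta):
    """grad hat g_N(x) = (1/N) * sum_i phi_i * ( r(phi_i^T x) - eta_i )."""
    z = Phi @ x
    return Phi.T @ (r(z) - eta) / len(eta)
```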
Existing approaches and related works.
Sparse recovery by the Lasso and the Dantzig Selector has been extensively
studied [11, 8, 5, 46, 10, 9]. The Lasso computes a solution $\widehat x_N$ to the $\ell_1$-penalized problem $\min_x \widehat g_N(x) + \lambda\|x\|_1$,
where $\lambda \ge 0$ is the algorithm parameter [35]. This delivers “good solutions”, with high probability, for sparsity level $s$ as
large as $O\big(\frac{N\kappa_\Sigma}{\ln n}\big)$, as soon as the random regressors (the $\phi_i$) are drawn independently from a normal distribution
with a covariance matrix $\Sigma$ such that $\kappa_\Sigma I \preceq \Sigma \preceq \rho\,\kappa_\Sigma I$,$^1$ for some $\kappa_\Sigma > 0$, $\rho \ge 1$. However, computing this
solution may be challenging in a very high-dimensional setting: even popular iterative algorithms, such as coordinate
descent, loop over a large number of variables. To mitigate this, randomized algorithms [3, 22], screening rules
and working sets [19, 30, 34] may be used to diminish the size of the optimization problem at hand, while
iterative thresholding [1, 7, 20, 16, 33] is a “direct” approach to enhance the sparsity of the solution.
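As an illustration of the latter, a minimal proximal-gradient (iterative soft-thresholding) sketch for the $\ell_1$-penalized SAA problem is given below; the identity-link gradient and the $1/L$ step size are illustrative choices of ours and do not reproduce the specific algorithms of the cited works.

```python
import numpy as np

def soft_threshold(v, tau):
    """Proximal operator of tau * ||.||_1 (componentwise soft-thresholding)."""
    return np.sign(v) * np.maximum(np.abs(v) - tau, 0.0)

def ista_l1(Phi, eta, lam, n_iter=500):
    """Proximal-gradient (ISTA) sketch for  min_x  hat g_N(x) + lam * ||x||_1,
    with hat g_N as in (3) for the identity link (illustrative choice)."""
    N, n = Phi.shape
    x = np.zeros(n)
    # 1/L step size, where L = ||Phi||_2^2 / N bounds the Lipschitz constant
    # of grad hat g_N in the linear case.
    L = np.linalg.norm(Phi, 2) ** 2 / N
    gamma = 1.0 / L
    for _ in range(n_iter):
        grad = Phi.T @ (Phi @ x - eta) / N    # grad hat g_N(x) for r(t) = t
        x = soft_threshold(x - gamma * grad, gamma * lam)
    return x
```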
Another approach relies on Stochastic Approximation (SA). As
$\nabla G(x, (\phi_i, \eta_i)) = \phi_i\big(r(\phi_i^T x) - \eta_i\big)$ is an
unbiased estimate of $\nabla g(x)$, the Stochastic Gradient Descent (SGD) algorithm may be used to build
approximate solutions. Unfortunately, unless the regressors $\phi$ are sparse or possess a special structure, standard SA
leads to accuracy bounds for sparse recovery proportional to the dimension $n$, which are essentially useless in
the high-dimensional setting. This motivates non-Euclidean SA procedures, such as Stochastic Mirror Descent
(SMD) [37], whose application to sparse recovery enjoys almost dimension-free convergence and has been well
studied in the literature. For instance, under bounded regressors and with sub-Gaussian noise, SMD reaches a
“slow rate” of sparse recovery of the type $g(\widehat x_N) - g_* = O\big(\sigma\sqrt{s\ln(n)/N}\big)$, where $\widehat x_N$ is the approximate solution
after $N$ iterations [44, 45]. Multistage routines may be used to improve the error estimates of SA under strong or
uniform convexity assumptions [27, 29, 18]. However, these assumptions do not always hold, as in sparse Generalized Linear
Regression, where they are replaced by Restricted Strong Convexity conditions. In that case, multistage procedures
[2, 17] based on standard SMD algorithms [24, 38] control the $\ell_2$-error $\|\widehat x_N - x_*\|_2$ at the rate $O\Big(\frac{\sigma}{\kappa_\Sigma}\sqrt{\frac{s\ln n}{N}}\Big)$
with high probability. This is the best “asymptotic” rate attainable when solving (2). However, those algorithms
have two major limitations: they both need a number of iterations proportional to
the initial error $R = \|x_* - x_0\|_1$ to reach a given accuracy, and the sparsity level $s$ must be of order $O\big(\kappa_\Sigma\sqrt{\frac{N}{\ln n}}\big)$ for sparse linear
regression. These limits may be seen as a consequence of dealing with a non-smooth objective $g(x)$. Although it
slightly restricts the scope of the corresponding algorithms, we shall consider smooth objectives and algorithms for
minimizing composite objectives (cf. [25, 32, 39]) to mitigate the aforementioned drawbacks of the multistage
algorithms from [2, 17].
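For intuition only, the sketch below implements a single plain (non-composite, single-stage) SMD pass with a standard sparsity-adapted distance-generating function $\omega(x) = \tfrac12\|x\|_p^2$, $p = 1 + 1/\ln n$; the step size, streaming order and averaging are illustrative choices of ours, and the multistage and composite refinements discussed above (in particular the CSMD procedure studied in this paper) are not reproduced here.

```python
import numpy as np

def smd_sparse(Phi, eta, gamma, r=lambda t: t):
    """Plain (non-composite, single-stage) Stochastic Mirror Descent sketch with
    the l1-adapted distance-generating function omega(x) = 0.5 * ||x||_p^2,
    p = 1 + 1/ln(n). Illustrative only; not the multistage CSMD procedure."""
    N, n = Phi.shape
    p = 1.0 + 1.0 / np.log(n)
    q = p / (p - 1.0)                        # conjugate exponent of p

    def to_dual(x):                          # mirror map: grad omega(x)
        if not np.any(x):
            return np.zeros(n)
        return np.linalg.norm(x, p) ** (2.0 - p) * np.sign(x) * np.abs(x) ** (p - 1.0)

    def to_primal(y):                        # inverse mirror map: grad omega*(y)
        if not np.any(y):
            return np.zeros(n)
        return np.linalg.norm(y, q) ** (2.0 - q) * np.sign(y) * np.abs(y) ** (q - 1.0)

    x = np.zeros(n)
    x_avg = np.zeros(n)
    for i in range(N):                       # one pass over the sample
        g = Phi[i] * (r(Phi[i] @ x) - eta[i])    # unbiased stochastic gradient
        x = to_primal(to_dual(x) - gamma * g)    # mirror descent step
        x_avg += x / N
    return x_avg                             # averaged iterate
```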
Principal contributions.
We provide a refined analysis of Composite Stochastic Mirror Descent (CSMD)
algorithms for computing sparse solutions to the Stochastic Optimization problem, leveraging smoothness of the
objective. This leads to a new “aggressive” choice of parameters in a multistage algorithm, with significantly
improved performance compared to that of [2]. We summarize below some properties of the proposed procedure
for problem (2).
$^1$We use $A \preceq B$ for two symmetric matrices $A$ and $B$ if $B - A \succeq 0$, i.e., $B - A$ is positive semidefinite.