Dictionary Learning for the Almost-Linear Sparsity Regime
Alexei Novikov NOVIKOV@PSU.EDU
Department of Mathematics
Penn State University
University Park, PA 16802 USA
Stephen White SEW347@PSU.EDU
Department of Mathematics
Penn State University
University Park, PA 16802 USA
Abstract
Dictionary learning, the problem of recovering a sparsely used matrix $\mathbf{D} \in \mathbb{R}^{M \times K}$ and $N$ $s$-sparse vectors $\mathbf{x}_i \in \mathbb{R}^K$ from samples of the form $\mathbf{y}_i = \mathbf{D}\mathbf{x}_i$, is of increasing importance to applications in signal processing and data science. When the dictionary is known, recovery of $\mathbf{x}_i$ is possible even for sparsity linear in dimension $M$, yet to date, the only algorithms which provably succeed in the linear sparsity regime are Riemannian trust-region methods, which are limited to orthogonal dictionaries, and methods based on the sum-of-squares hierarchy, which require super-polynomial time in order to obtain an error which decays in $M$. In this work, we introduce SPORADIC (Spectral Oracle Dictionary Learning), an efficient spectral method on a family of reweighted covariance matrices. We prove that in high enough dimensions, SPORADIC can recover overcomplete ($K > M$) dictionaries satisfying the well-known restricted isometry property (RIP) even when sparsity is linear in dimension up to logarithmic factors. Moreover, these accuracy guarantees have an “oracle property”: the support and signs of the unknown sparse vectors $\mathbf{x}_i$ can be recovered exactly with high probability, allowing for arbitrarily close estimation of $\mathbf{D}$ with enough samples in polynomial time. To the authors’ knowledge, SPORADIC is the first polynomial-time algorithm which provably enjoys such convergence guarantees for overcomplete RIP matrices in the near-linear sparsity regime.
Keywords: Compressed Sensing, Dictionary Learning, High-dimensional Statistics, Sparsity,
Sparse Coding
1 Introduction
In data science and machine learning, the task of discovering sparse representations for large datasets
is of central importance. Sparse representations offer distinct advantages for storing and processing
data, as well as providing valuable insights into the underlying structure of a dataset. This problem
of sparse recovery frequently involves the recovery of a sparse vector $\mathbf{x} \in \mathbb{R}^K$ from a dense sample $\mathbf{y} = \mathbf{D}\mathbf{x}$, where $\mathbf{D}$ is a sparsely-used matrix referred to as the “dictionary.” Depending on the application, this dictionary can be known from physics or hand-designed, such as the wavelet bases used in image processing (e.g., Vetterli and Kovačević, 1995). Beyond numerous further applications in signal and image processing (see Elad (2010) for a summary of developments), sparse representations have been fruitfully applied in areas including computational neuroscience (Olshausen and Field, 1996b,a, 1997) and machine learning (Argyriou et al., 2006; Ranzato et al., 2007).
©2023 Alexei Novikov and Stephen White.
License: CC-BY 4.0, see https://creativecommons.org/licenses/by/4.0/.
Figure 1: A sample $\mathbf{Y}$ as encountered in dictionary learning. $\mathbf{D}$ is an unknown and typically dense matrix, and each $\mathbf{x}_i$ is an unknown vector with at most $s$ nonzero entries.
In today’s technology-driven world, interpreting and compressing increasingly varied types of
data have become paramount tasks. This has sparked a growing interest in techniques that learn not only the fundamental sparse codes of the data but also the dictionary itself: this is the dictionary learning problem. Specifically, we aim to recover a sparsely-used matrix $\mathbf{D} \in \mathbb{R}^{M \times K}$ from $N$ samples of the form $\mathbf{y} = \mathbf{D}\mathbf{x}$, where $\mathbf{x} \in \mathbb{R}^K$ is sufficiently sparse (that is, $\mathbf{x}$ has significantly fewer than $K$ nonzero entries). Written in matrix form, we seek to recover $\mathbf{D}$ from a matrix $\mathbf{Y} = \mathbf{D}\mathbf{X} \in \mathbb{R}^{M \times N}$ with the prior knowledge that the columns of $\mathbf{X}$ are sparse. In many applications, practitioners are particularly interested in recovering overcomplete dictionaries, which satisfy $K > M$, as such dictionaries offer greater flexibility in selecting a basis and allow for a sparser representation (e.g. Chen et al., 1998; Donoho et al., 2006).
1.1 Prior Work
In the computer science literature, dictionary learning is typically formulated as a nonconvex optimization problem: find a dictionary $\mathbf{D}$ and a matrix $\mathbf{X}$ with sparse columns satisfying $\mathbf{Y} = \mathbf{D}\mathbf{X}$ such that $\mathbf{X}$ is as sparse as possible, namely:
Find $\mathbf{D}, \mathbf{X}$ minimizing $\|\mathbf{X}\|_0$ subject to $\mathbf{Y} = \mathbf{D}\mathbf{X}$.
As a nonconvex optimization, this problem is computationally challenging to solve. The most
popular heuristic for solving the dictionary learning problem is alternating minimization. Alternating
minimization algorithms rely on the fact that when $\mathbf{D}$ or $\mathbf{X}$ is known, the other can be recovered using known methods, most frequently based on $L^1$ relaxations that make the problem convex (Candès et al., 2006). Alternating minimization techniques thus alternate between a “sparse coding” step in which a guess for $\mathbf{D}$ is fixed and the algorithm solves for $\mathbf{X}$, and a “dictionary update” step in which $\mathbf{X}$ is fixed and the dictionary is updated. This process is repeated until a convergence criterion is met. Though theoretical guarantees for convergence of such methods exist, they often demand extremely sparse representations and precise initialization (Chatterji and Bartlett, 2017; Agarwal et al., 2016).
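To make the structure of such schemes concrete, the sketch below shows one illustrative alternating-minimization pass in Python, with a lasso-based sparse-coding step and a least-squares dictionary update; the function name, the regularization parameter lam, and the specific solver choices are our own assumptions and are not taken from any of the cited algorithms.

import numpy as np
from sklearn.linear_model import Lasso

def alternating_minimization_step(Y, D, lam=0.1):
    # One illustrative pass: sparse-code against the current dictionary guess D,
    # then update D by least squares. Y is (M, N); D is (M, K).
    N = Y.shape[1]
    K = D.shape[1]

    # "Sparse coding" step: with D fixed, solve a lasso problem for each column of Y.
    X = np.zeros((K, N))
    coder = Lasso(alpha=lam, fit_intercept=False, max_iter=5000)
    for i in range(N):
        coder.fit(D, Y[:, i])
        X[:, i] = coder.coef_

    # "Dictionary update" step: with X fixed, fit D by least squares, then renormalize columns.
    D_new = Y @ np.linalg.pinv(X)
    D_new /= np.maximum(np.linalg.norm(D_new, axis=0), 1e-12)
    return D_new, X

In practice such a pass would be iterated until a convergence criterion is met, as described above.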
In this paper, we are interested in dictionary learning algorithms with provable guarantees.
Initial theoretical study of provable dictionary learning focused on the case when $s$ is no greater than $\sqrt{M}$; this is a well-known recovery boundary even when the dictionary $\mathbf{D}$ is known. Spielman et al. (2012) developed an algorithm that accurately recovers the dictionary in this sparsity regime, but their algorithm does not generalize to recovery of overcomplete dictionaries. Arora et al. (2013)
and Agarwal et al. (2017) then independently introduced similar correlation-and-clustering methods, which enjoy similar theoretical guarantees for the $s \lesssim \sqrt{M}$ regime for overcomplete dictionaries.
Candès et al. (2006) showed that when the dictionary $\mathbf{D}$ is known and satisfies a restricted isometry property (RIP; see Definition 4) (Candes and Tao, 2005), it is possible to efficiently recover $\mathbf{x}_i$ from $\mathbf{y}_i = \mathbf{D}\mathbf{x}_i$ when $\mathbf{x}_i$ has linearly many (in $M$) nonzero entries. Accordingly, there was tremendous interest in determining whether recovery in this scaling regime remained possible when $\mathbf{D}$ is unknown. In (Arora et al., 2014), the authors develop provable methods for recovering dictionaries with sparsity linear in $M$ up to logarithmic factors, but their method requires quasipolynomial running time. In a pair of papers, Sun et al. develop a polynomial-time method which can provably recover invertible dictionaries with $s = O(M)$ (Sun et al., 2017a,b). However, this algorithm depends intimately on properties of orthogonal and invertible matrices and thus is limited to the case of complete dictionaries ($M = K$); moreover, their theoretical guarantees demand a very high sample complexity of $KM^8$. More recently, Zhai et al. (2020b) introduced a method based on $L^4$-norm optimization. Despite many nice properties of this approach, further elaborated by Zhai et al. (2020a), it remains limited to the complete case, and the authors prove only the accuracy of a global optimum without a guarantee of convergence in arbitrary dimension.
In the overcomplete setting, Barak et al. (2015) developed a tensor decomposition method based
on the sum-of-squares hierarchy that can recover overcomplete dictionaries with sparsity up to $s = O(M^{1-\gamma})$ for any $\gamma > 0$ in polynomial time, but this runtime tends to super-polynomial as $\gamma \to 0$ and requires a constant error as $M \to \infty$. This and related methods have generally enjoyed the best theoretical guarantees for efficient dictionary learning in the overcomplete linear sparsity regime due to their impressive generality, especially after their runtime was improved to polynomial time by Ma et al. (2016), provided that the target error remains constant. Yet the requirement of constant error is strict: with these methods, even in sublinear sparsity regimes such as $s \sim M^{0.99}$, an inverse-logarithmic decay in error requires superpolynomial time.
1.2 Our Contribution
For known overcomplete RIP matrices $\mathbf{D}$, the matrix $\mathbf{X}$ can be recovered from $\mathbf{Y}$ for sparsity linear in $M$ up to log factors by methods such as $L^1$-minimization (Candes and Tao, 2005) or the lasso (Tibshirani, 1996). While results such as those of Barak et al. (2015) provide very general guarantees for wide families of matrices in this regime, these methods are slow, owing to sum-of-squares techniques requiring multiple levels of convex relaxation. Until now, there have been no results extending dictionary learning specifically to RIP matrices, where one might expect much better recovery properties, given the success of such methods for proving guarantees in the context of compressed sensing. It is natural to ask whether these stronger recovery properties carry over to the dictionary learning setting, where $\mathbf{D}$ is unknown.
In this work, we answer this question in the affirmative, showing that matrices with good RIP constants, along with a “uniform variance” assumption, can be recovered efficiently even with sparsity linear in $M$ up to logarithmic factors. These results are global: we require no initialization for recovery to work; moreover, our guarantees have the “oracle property” that both the support and signs of the sparsity pattern matrix $\mathbf{X}$ are recovered exactly with high probability. This allows for arbitrarily close recovery given enough samples, and even for truly exact recovery of $\mathbf{D}$ if one is willing to make stronger assumptions on the distribution of $\mathbf{X}$. This is a significant improvement over the best known alternatives for the overcomplete near-linear sparsity regime, which achieve only constant error bounds (Ma et al., 2016) in polynomial time. In addition to the benefits
outlined above, our proposed algorithm SPORADIC (SParse ORAcle DICtionary Learning) is
conceptually simple, easy to implement, and easily parallelized.
1.3 Intuition
The correlation-based clustering method of Arora et al. (2014) offers an appealing intuition in the
$s \lesssim \sqrt{M}$ regime: that pairs of samples $\mathbf{y}_i, \mathbf{y}_j$ which “look similar” in the sense of being highly correlated ($|\langle \mathbf{y}_i, \mathbf{y}_j \rangle|$ is in some sense large) are likely to share support. More technically, if the dictionary $\mathbf{D}$ is incoherent (that is, $|\langle \mathbf{d}_{k_1}, \mathbf{d}_{k_2} \rangle| \le c/\sqrt{M}$ for $k_1 \ne k_2$) and the coefficients $\mathbf{x}_i, \mathbf{x}_j$ are symmetric, then if $s \ll \sqrt{M}$, as $M \to \infty$ the correlation$^1$ $\langle \mathbf{y}_i, \mathbf{y}_j \rangle$ will concentrate near zero if $\mathbf{y}_i$ and $\mathbf{y}_j$ share no support, but concentrate around $\pm 1$ if they do. As a result, thresholding $|\langle \mathbf{y}_i, \mathbf{y}_j \rangle|$ becomes a reliable indicator of whether $\mathbf{y}_i$ and $\mathbf{y}_j$ share a common dictionary element in their support.
By constructing a graph with an edge between $i$ and $j$ whenever $|\langle \mathbf{y}_i, \mathbf{y}_j \rangle| \ge \tau$ for some threshold $\tau$ (say $1/2$), one can determine which groups of $\mathbf{y}_i$'s share the same common support element by applying overlapping clustering methods. Yet correlation-based clustering cannot be performed accurately once sparsity exceeds $\sqrt{M}$, as above this threshold correlation no longer reliably indicates whether two samples share a common dictionary element. As a result, methods based on the correlation $\langle \mathbf{y}_i, \mathbf{y}_j \rangle$ have not been widely employed in subsequent attempts to solve dictionary learning in the linear sparsity regime.
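Before moving on, here is a minimal sketch of the thresholding step just described, which applies in the $s \lesssim \sqrt{M}$ regime and is not part of our own method; the function name and the default threshold are illustrative.

import numpy as np

def correlation_graph(Y, tau=0.5):
    # Y is the (M, N) sample matrix. Edge (i, j) is present whenever |<y_i, y_j>| >= tau;
    # in the incoherent, s << sqrt(M) regime described above, such an edge indicates
    # shared support with high probability.
    G = Y.T @ Y                # pairwise inner products <y_i, y_j>
    np.fill_diagonal(G, 0.0)   # ignore self-correlations
    return np.abs(G) >= tau    # boolean adjacency matrix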
We adopt a different approach which sidesteps the key challenges of these previous correlation methods, allowing us to apply correlation-based techniques successfully even in the linear sparsity regime. Our proposed algorithm SPORADIC works by a natural adaptation of the correlation-based approach of Arora et al. (2013) to the linear sparsity regime. Instead of immediately attempting to recover dictionary elements, we first pursue an intermediate step of recovering spanning subspaces, the subspaces $S_i$ spanned by the supporting dictionary elements of each sample $\mathbf{y}_i$. Once these subspaces are recovered, the individual dictionary elements can be recovered through pairwise comparison of subspaces to find their intersection.
To recover these subspaces, we employ a spectral method based on extracting the eigenvectors of
a modified covariance matrix of the samples $\mathbf{Y}$. Specifically, for each $j$ we examine the correlation-weighted covariance matrix $\hat{\boldsymbol{\Sigma}}_j$, defined as the sample covariance of the correlation-weighted samples $\langle \mathbf{y}_i, \mathbf{y}_j \rangle\, \mathbf{y}_i$. By design, these reweighted samples will have greater variance in the directions of the support elements of $\mathbf{y}_j$, meaning $\hat{\boldsymbol{\Sigma}}_j$ will have a rank-$s$ “spike” in the directions spanned by the support elements of $\mathbf{y}_j$, while a random-matrix assumption on $\mathbf{D}$ guarantees that, with high probability, there will be no comparable spikes in other directions. As a result, the $s$ leading eigenvectors of $\hat{\boldsymbol{\Sigma}}_j$ (that is, the $s$ eigenvectors corresponding to the $s$ largest eigenvalues) will reliably span a subspace close to that spanned by the support elements of $\mathbf{y}_j$. We provide theoretical guarantees that this method accurately recovers spanning subspaces with sparsity $s \lesssim M/\log^{6+\eta}(M)$ for any $\eta > 0$, which allows for recovery of the individual dictionary elements by a further subspace intersection process (see Algorithm 2).
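A minimal sketch of this spectral step, as we have just described it (and not Algorithm 1 verbatim), is the following; the function name is ours, and we omit mean-centering of the reweighted samples, which is negligible when the coefficients are symmetric.

import numpy as np

def estimate_spanning_subspace(Y, j, s):
    # Y is the (M, N) sample matrix; returns an (M, s) orthonormal basis for the
    # estimated span of the support elements of y_j.
    N = Y.shape[1]
    w = Y.T @ Y[:, j]                      # correlation weights <y_i, y_j>
    Z = Y * w                              # reweighted samples <y_i, y_j> y_i as columns
    Sigma_j = (Z @ Z.T) / N                # correlation-weighted covariance matrix
    _, eigvecs = np.linalg.eigh(Sigma_j)   # eigenvalues returned in ascending order
    return eigvecs[:, -s:]                 # the s leading eigenvectors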
Although the resulting estimate enjoys only weak convergence guarantees, of order $O(1/\mathrm{polylog}(M))$, it is accurate enough that subsequent post-processing steps can improve the estimate drastically.
After obtaining a nontrivial estimate of the dictionary which is close column-wise, we show how
these estimated columns can be used as “approximate oracles” to find a refined dictionary estimate
1. We note this is a slight abuse of notation, as the vectors $\mathbf{y}_i, \mathbf{y}_j$ may not be unit vectors.
with substantially better error bounds. These bounds are tight enough to allow recovery of the
support and signs of $\mathbf{X}$, from which an even better estimate can be gained by simple averaging. Most significantly, the error bounds for this final estimate tend to zero in the number of samples $N$,
and truly exact recovery is even possible in certain cases.
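One natural way to read the averaging step, under the additional assumptions that the nonzero entries of $\mathbf{x}$ are independent and symmetric and that the recovered supports and signs are exact, is the following back-of-the-envelope calculation (our illustration, not necessarily the exact estimator used in the post-processing). Letting $I_k$ denote the set of samples whose support contains $k$,

$\frac{1}{|I_k|} \sum_{i \in I_k} \operatorname{sign}(x_{ik})\, \mathbf{y}_i = \frac{1}{|I_k|} \sum_{i \in I_k} \Big( |x_{ik}|\, \mathbf{d}_k + \operatorname{sign}(x_{ik}) \sum_{l \ne k} x_{il}\, \mathbf{d}_l \Big) \approx \mathbb{E}\big[\, |x_{ik}| \;\big|\; k \in \operatorname{supp}(\mathbf{x}_i) \big]\, \mathbf{d}_k,$

since the cross terms average to zero under the symmetry assumption; normalizing the left-hand side then yields an estimate of $\mathbf{d}_k$ whose error decays as $|I_k|$ (and hence $N$) grows.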
The sample complexity required for SPORADIC to recover subspaces accurately depends on
the particular sparsity regime. In the most challenging linear-sparsity regime that is our main focus,
up to log factors, subspace recovery requires a sample complexity of at most $M^4$ for the initial dictionary estimate to be accurate, but in the “easier” regime $s \lesssim \sqrt{M}$, our results suggest the required sample complexity eases to $N \sim M$ (see Theorem 4.1). In the linear-sparsity case, the bottleneck is caused by approximation of a covariance matrix in the Frobenius norm, which is known to require a factor of $M$ more samples than estimation in the operator norm. We believe that in future work this step can be replaced by an approach requiring approximation only in the $L^2$ operator norm, in which case the worst-case sample complexity would be lowered to $M^3$.
2 Conventions and Data Model
We begin by stating the sparse dictionary learning problem explicitly:
Definition 1 (Sparse Dictionary Learning) Let $\mathbf{D} = (\mathbf{d}_1 \; \mathbf{d}_2 \; \cdots \; \mathbf{d}_K)$ be an (unknown) $M \times K$ matrix with unit-vector columns, called the dictionary. Let $\mathbf{x}$ be an $s$-sparse random vector, and define the random vector $\mathbf{y} = \mathbf{D}\mathbf{x}$. The sparse dictionary learning problem is:
Given $\mathbf{Y} = \mathbf{D}\mathbf{X}$, where $\mathbf{X}$ is a $K \times N$ matrix with columns $\{\mathbf{x}_i\}_{i=1}^N$ i.i.d. copies of $\mathbf{x}$, recover $\mathbf{D}$.
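As an illustration of this data model (not a construction used in our proofs), the sketch below generates a synthetic instance $\mathbf{Y} = \mathbf{D}\mathbf{X}$; the Gaussian dictionary and Rademacher nonzero coefficients are our own choices, since the definition only requires unit-norm columns and $s$-sparse coefficient vectors.

import numpy as np

def synthetic_instance(M, K, N, s, seed=None):
    # Returns (Y, D, X) with Y = D X, unit-norm dictionary columns, and s-sparse columns of X.
    rng = np.random.default_rng(seed)
    D = rng.standard_normal((M, K))
    D /= np.linalg.norm(D, axis=0)                        # unit-vector columns
    X = np.zeros((K, N))
    for i in range(N):
        support = rng.choice(K, size=s, replace=False)    # random s-element support
        X[support, i] = rng.choice([-1.0, 1.0], size=s)   # symmetric nonzero values
    return D @ X, D, X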
It is clear from the definition that $\mathbf{D}$ can only be recovered up to sign and permutation. Accordingly,
we employ the following definition for comparing two dictionaries, due to Arora et al. (2013): we
say that two dictionaries are column-wise ε-close if their columns are close in Euclidean norm after
an appropriate permutation and change of sign. In detail:
Definition 2 (Column-wise ε-close (Arora et al., 2013)) Two dictionaries $\mathbf{D} = (\mathbf{d}_1 \; \mathbf{d}_2 \; \cdots \; \mathbf{d}_K)$ and $\mathbf{D}' = (\mathbf{d}'_1 \; \mathbf{d}'_2 \; \cdots \; \mathbf{d}'_K)$ are column-wise $\varepsilon$-close if they have the same dimensions $M \times K$ and there exists a permutation $\pi$ of $\{1, \dots, K\}$ and a $K$-element sequence $\theta_k \in \{-1, 1\}$ such that for all $k = 1, \dots, K$:
$\|\mathbf{d}_k - \theta_k \mathbf{d}'_{\pi(k)}\|_2 \le \varepsilon$
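For concreteness, the following sketch computes an upper bound on the smallest such $\varepsilon$ for two given dictionaries, optimizing over signs per column pair and over permutations with a standard assignment solver; the function name and the use of the sum-minimizing (rather than exact bottleneck) matching are our own choices.

import numpy as np
from scipy.optimize import linear_sum_assignment

def columnwise_closeness(D, D_prime):
    # cost[k, k'] = min over theta in {-1, +1} of ||d_k - theta * d'_{k'}||_2
    plus = np.linalg.norm(D[:, :, None] - D_prime[:, None, :], axis=0)
    minus = np.linalg.norm(D[:, :, None] + D_prime[:, None, :], axis=0)
    cost = np.minimum(plus, minus)
    rows, cols = linear_sum_assignment(cost)   # permutation pi minimizing total matching cost
    return cost[rows, cols].max()              # eps achieved by this permutation and sign choice

Any permutation and sign choice yields an upper bound on the best achievable $\varepsilon$, so a small returned value certifies column-wise closeness.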
2.1 Notation and conventions
Vectors are represented by boldface lowercase letters, while matrices will be written as boldface
uppercase letters. Roman letters (both upper- and lowercase) will be used for both scalars and
random variables depending on context. We will use the notation |A| for the number of elements in
a finite set $A$, and $A^c$ for its complement.
We use two matrix norms at different points in the text. The standard $L^2$ operator norm will be denoted by $\|\cdot\|_2$, while the Frobenius norm will be denoted $\|\cdot\|_F$. Vector norms always refer to the standard $L^2$ (Euclidean) norm and will be denoted $\|\cdot\|_2$. We will use the notation $a \ll b$, where both $a$ and $b$ are scalars depending on $M$, to mean $\lim_{M \to \infty} |a|/|b| = 0$, where the norm in question may depend on context.