Dictionary Learning for the Almost-Linear Sparsity Regime
Alexei Novikov NOVIKOV@PSU.EDU
Department of Mathematics
Penn State University
University Park, PA 16802 USA
Stephen White SEW347@PSU.EDU
Department of Mathematics
Penn State University
University Park, PA 16802 USA
Abstract
Dictionary learning, the problem of recovering a sparsely used matrix $\mathbf{D} \in \mathbb{R}^{M \times K}$ and $N$ $s$-sparse vectors $\mathbf{x}_i \in \mathbb{R}^K$ from samples of the form $\mathbf{y}_i = \mathbf{D}\mathbf{x}_i$, is of increasing importance to applications in signal processing and data science. When the dictionary is known, recovery of $\mathbf{x}_i$ is possible even for sparsity linear in dimension $M$, yet to date, the only algorithms which provably succeed in the linear sparsity regime are Riemannian trust-region methods, which are limited to orthogonal dictionaries, and methods based on the sum-of-squares hierarchy, which require super-polynomial time in order to obtain an error which decays in $M$. In this work, we introduce SPORADIC (Spectral Oracle Dictionary Learning), an efficient spectral method on a family of reweighted covariance matrices. We prove that in high enough dimensions, SPORADIC can recover overcomplete ($K > M$) dictionaries satisfying the well-known restricted isometry property (RIP) even when sparsity is linear in dimension up to logarithmic factors. Moreover, these accuracy guarantees have an “oracle property”: the support and signs of the unknown sparse vectors $\mathbf{x}_i$ can be recovered exactly with high probability, allowing for arbitrarily close estimation of $\mathbf{D}$ with enough samples in polynomial time. To the authors’ knowledge, SPORADIC is the first polynomial-time algorithm which provably enjoys such convergence guarantees for overcomplete RIP matrices in the near-linear sparsity regime.
Keywords: Compressed Sensing, Dictionary Learning, High-dimensional Statistics, Sparsity,
Sparse Coding
1 Introduction
In data science and machine learning, the task of discovering sparse representations for large datasets
is of central importance. Sparse representations offer distinct advantages for storing and processing
data, as well as providing valuable insights into the underlying structure of a dataset. This problem
of sparse recovery frequently involves the recovery of a sparse vector $\mathbf{x} \in \mathbb{R}^K$ from a dense sample $\mathbf{y} = \mathbf{D}\mathbf{x}$, where $\mathbf{D}$ is a sparsely-used matrix referred to as the “dictionary.” Depending on the application, this dictionary can be known from physics or hand-designed, such as the wavelet bases used in image processing (e.g., Vetterli and Kovačević, 1995). Beyond numerous further applications in signal and image processing (see Elad (2010) for a summary of developments), sparse representations have been fruitfully applied in areas including computational neuroscience (Olshausen and Field, 1996b,a, 1997) and machine learning (Argyriou et al., 2006; Ranzato et al., 2007).
©2023 Alexei Novikov and Stephen White.
License: CC-BY 4.0, see https://creativecommons.org/licenses/by/4.0/.
Figure 1: A sample $\mathbf{Y}$ as encountered in dictionary learning. $\mathbf{D}$ is an unknown and typically dense matrix, and each $\mathbf{x}_i$ is an unknown vector with at most $s$ nonzero entries.
In today’s technology-driven world, interpreting and compressing increasingly varied types of
data have become paramount tasks. This has sparked a growing interest in techniques that learn not only the fundamental sparse codes of the data but also the dictionary itself: this is the dictionary learning problem. Specifically, we aim to recover a sparsely-used matrix $\mathbf{D} \in \mathbb{R}^{M \times K}$ from $N$ samples of the form $\mathbf{y} = \mathbf{D}\mathbf{x}$, where $\mathbf{x} \in \mathbb{R}^K$ is sufficiently sparse (that is, $\mathbf{x}$ has significantly fewer than $K$ nonzero entries). Written in matrix form, we seek to recover $\mathbf{D}$ from a matrix $\mathbf{Y} = \mathbf{D}\mathbf{X} \in \mathbb{R}^{M \times N}$ with the prior knowledge that the columns of $\mathbf{X}$ are sparse. In many applications, practitioners are particularly interested in recovering overcomplete dictionaries, which satisfy $K > M$, as such dictionaries offer greater flexibility in selecting a basis and allow for a sparser representation (e.g. Chen et al., 1998; Donoho et al., 2006).
1.1 Prior Work
In the computer science literature, dictionary learning is typically formulated as a nonconvex optimization problem: find a dictionary $\mathbf{D}$ and a matrix $\mathbf{X}$ with sparse columns satisfying $\mathbf{Y} = \mathbf{D}\mathbf{X}$ such that $\mathbf{X}$ is as sparse as possible, namely:
Find $\mathbf{D}, \mathbf{X}$ minimizing $\|\mathbf{X}\|_0$ subject to $\mathbf{Y} = \mathbf{D}\mathbf{X}$.
As a nonconvex optimization, this problem is computationally challenging to solve. The most
popular heuristic for solving the dictionary learning problem is alternating minimization. Alternating
minimization algorithms rely on the fact that when $\mathbf{D}$ or $\mathbf{X}$ is known, the other can be recovered using known methods, most frequently based on $L^1$ relaxations that make the problem convex (Candès et al., 2006). Alternating minimization techniques thus alternate between a “sparse coding” step in which a guess for $\mathbf{D}$ is fixed and the algorithm solves for $\mathbf{X}$, and a “dictionary update” step in which $\mathbf{X}$ is fixed and the dictionary is updated. This process is repeated until a convergence criterion is met. Though theoretical guarantees for convergence of such methods exist, they often demand extremely sparse representations and precise initialization (Chatterji and Bartlett, 2017; Agarwal et al., 2016).
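To make the structure of such schemes concrete, the sketch below shows one illustrative alternating-minimization pass in Python, with a lasso-based sparse-coding step and a least-squares dictionary update; the function name, the regularization parameter lam, and the specific solver choices are our own assumptions and are not taken from any of the cited algorithms.

import numpy as np
from sklearn.linear_model import Lasso

def alternating_minimization_step(Y, D, lam=0.1):
    # One illustrative pass: sparse-code against the current dictionary guess D,
    # then update D by least squares. Y is (M, N); D is (M, K).
    N = Y.shape[1]
    K = D.shape[1]

    # "Sparse coding" step: with D fixed, solve a lasso problem for each column of Y.
    X = np.zeros((K, N))
    coder = Lasso(alpha=lam, fit_intercept=False, max_iter=5000)
    for i in range(N):
        coder.fit(D, Y[:, i])
        X[:, i] = coder.coef_

    # "Dictionary update" step: with X fixed, fit D by least squares, then renormalize columns.
    D_new = Y @ np.linalg.pinv(X)
    D_new /= np.maximum(np.linalg.norm(D_new, axis=0), 1e-12)
    return D_new, X

In practice such a pass would be iterated until a convergence criterion is met, as described above.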
In this paper, we are interested in dictionary learning algorithms with provable guarantees.
Initial theoretical study of provable dictionary learning focused on the case when $s$ is no greater than $\sqrt{M}$; this is a well-known recovery boundary even when the dictionary $\mathbf{D}$ is known. Spielman et al. (2012) developed an algorithm that accurately recovers the dictionary in this sparsity regime, but their algorithm does not generalize to recovery of overcomplete dictionaries. Arora et al. (2013)
and Agarwal et al. (2017) then independently introduced similar correlation-and-clustering methods, which enjoy similar theoretical guarantees for the $s \lesssim \sqrt{M}$ regime for overcomplete dictionaries.
Candès et al. (2006) showed that when the dictionary $\mathbf{D}$ is known and satisfies a restricted isometry property (RIP; see Definition 4) (Candes and Tao, 2005), it is possible to efficiently recover $\mathbf{x}_i$ from $\mathbf{y}_i = \mathbf{D}\mathbf{x}_i$ when $\mathbf{x}_i$ has linearly many (in $M$) nonzero entries. Accordingly, there was tremendous interest in determining whether recovery in this scaling regime remained possible when $\mathbf{D}$ is unknown. In (Arora et al., 2014), the authors develop provable methods for recovering dictionaries with sparsity linear in $M$ up to logarithmic factors, but their method requires quasipolynomial running time. In a pair of papers, Sun et al. develop a polynomial-time method which can provably recover invertible dictionaries with $s = O(M)$ (Sun et al., 2017a,b). However, this algorithm depends intimately on properties of orthogonal and invertible matrices and thus is limited to the case of complete dictionaries ($M = K$); moreover, their theoretical guarantees demand a very high sample complexity of $KM^8$. More recently, Zhai et al. (2020b) introduced a method based on $L^4$-norm optimization. Despite many nice properties of this approach, further elaborated by Zhai et al. (2020a), it remains limited to the complete case, and the authors prove only the accuracy of a global optimum without a guarantee of convergence in arbitrary dimension.
In the overcomplete setting, Barak et al. (2015) developed a tensor decomposition method based
on the sum-of-squares hierarchy that can recover overcomplete dictionaries with sparsity up to $s = O(M^{1-\gamma})$ for any $\gamma > 0$ in polynomial time, but this runtime tends to super-polynomial as $\gamma \to 0$ and requires a constant error as $M \to \infty$. This and related methods have generally enjoyed the best theoretical guarantees for efficient dictionary learning in the overcomplete linear sparsity regime due to their impressive generality, especially after their runtime was improved to polynomial time by Ma et al. (2016), provided that the target error remains constant. Yet the requirement of constant error is strict: with these methods, even in sublinear sparsity regimes such as $s \sim M^{0.99}$, an inverse-logarithmic decay in error requires superpolynomial time.
1.2 Our Contribution
For known overcomplete RIP matrices $\mathbf{D}$, the matrix $\mathbf{X}$ can be recovered from $\mathbf{Y}$ for sparsity linear in $M$ up to log factors by methods such as $L^1$-minimization (Candes and Tao, 2005) or the lasso (Tibshirani, 1996). While results such as those of Barak et al. (2015) provide very general guarantees for wide families of matrices in this regime, these methods are slow, owing to sum-of-squares techniques requiring multiple levels of convex relaxation. Until now, there have been no results extending dictionary learning specifically to RIP matrices, where one might expect much better recovery properties, given the success of such methods for proving guarantees in the context of compressed sensing. It is natural to ask whether these stronger recovery properties carry over to the dictionary learning setting, where $\mathbf{D}$ is unknown.
In this work, we answer this question in the affirmative, showing that matrices with good RIP constants, along with a “uniform variance” assumption, can be recovered efficiently even with sparsity linear in $M$ up to logarithmic factors. These results are global: we require no initialization for recovery to work; moreover, our guarantees have the “oracle property” that both the support and signs of the sparsity pattern matrix $\mathbf{X}$ are recovered exactly with high probability. This allows for arbitrarily close recovery given enough samples, and even for truly exact recovery of $\mathbf{D}$ if one is willing to make stronger assumptions on the distribution of $\mathbf{X}$. This is a significant improvement over the best known alternatives for the overcomplete near-linear sparsity regime, which achieve only constant error bounds (Ma et al., 2016) in polynomial time. In addition to the benefits
outlined above, our proposed algorithm SPORADIC (SParse ORAcle DICtionary Learning) is
conceptually simple, easy to implement, and easily parallelized.
1.3 Intuition
The correlation-based clustering method of Arora et al. (2014) offers an appealing intuition in the
$s \lesssim \sqrt{M}$ regime: that pairs of samples $\mathbf{y}_i, \mathbf{y}_j$ which “look similar” in the sense of being highly correlated ($|\langle \mathbf{y}_i, \mathbf{y}_j \rangle|$ is in some sense large) are likely to share support. More technically, if the dictionary $\mathbf{D}$ is incoherent (that is, $|\langle \mathbf{d}_{k_1}, \mathbf{d}_{k_2} \rangle| \le c/\sqrt{M}$ for $k_1 \ne k_2$) and the coefficients $\mathbf{x}_i, \mathbf{x}_j$ are symmetric, then if $s \ll \sqrt{M}$, as $M \to \infty$ the correlation$^1$ $\langle \mathbf{y}_i, \mathbf{y}_j \rangle$ will concentrate near zero if $\mathbf{y}_i$ and $\mathbf{y}_j$ share no support, but concentrate around $\pm 1$ if they do. As a result, thresholding $|\langle \mathbf{y}_i, \mathbf{y}_j \rangle|$ becomes a reliable indicator of whether $\mathbf{y}_i$ and $\mathbf{y}_j$ share a common dictionary element in their support.
By constructing a graph with an edge between $i$ and $j$ whenever $|\langle \mathbf{y}_i, \mathbf{y}_j \rangle| \ge \tau$ for some threshold $\tau$ (say $1/2$), one can determine which groups of $\mathbf{y}_i$'s share the same common support element by applying overlapping clustering methods. Yet correlation-based clustering cannot be performed accurately once sparsity exceeds $\sqrt{M}$, as above this threshold correlation no longer reliably indicates whether two samples share a common dictionary element. As a result, methods based on the correlation $\langle \mathbf{y}_i, \mathbf{y}_j \rangle$ have not been widely employed in subsequent attempts to solve dictionary learning in the linear sparsity regime.
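Before moving on, here is a minimal sketch of the thresholding step just described, which applies in the $s \lesssim \sqrt{M}$ regime and is not part of our own method; the function name and the default threshold are illustrative.

import numpy as np

def correlation_graph(Y, tau=0.5):
    # Y is the (M, N) sample matrix. Edge (i, j) is present whenever |<y_i, y_j>| >= tau;
    # in the incoherent, s << sqrt(M) regime described above, such an edge indicates
    # shared support with high probability.
    G = Y.T @ Y                # pairwise inner products <y_i, y_j>
    np.fill_diagonal(G, 0.0)   # ignore self-correlations
    return np.abs(G) >= tau    # boolean adjacency matrix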
We adopt a different approach which sidesteps the key challenges of these previous correlation methods, allowing us to apply correlation-based techniques successfully even in the linear sparsity regime. Our proposed algorithm SPORADIC works by a natural adaptation of the correlation-based approach of Arora et al. (2013) to the linear sparsity regime. Instead of immediately attempting to recover dictionary elements, we first pursue an intermediate step of recovering spanning subspaces, the subspaces $S_i$ spanned by the supporting dictionary elements of each sample $\mathbf{y}_i$. Once these subspaces are recovered, the individual dictionary elements can be recovered through pairwise comparison of subspaces to find their intersection.
To recover these subspaces, we employ a spectral method based on extracting the eigenvectors of
a modified covariance matrix of the samples $\mathbf{Y}$. Specifically, for each $j$ we examine the correlation-weighted covariance matrix $\hat{\boldsymbol{\Sigma}}_j$, defined as the sample covariance of the correlation-weighted samples $\langle \mathbf{y}_i, \mathbf{y}_j \rangle\, \mathbf{y}_i$. By design, these reweighted samples will have greater variance in the directions of the support elements of $\mathbf{y}_j$, meaning $\hat{\boldsymbol{\Sigma}}_j$ will have a rank-$s$ “spike” in the directions spanned by the support elements of $\mathbf{y}_j$, while a random-matrix assumption on $\mathbf{D}$ guarantees that, with high probability, there will be no comparable spikes in other directions. As a result, the $s$ leading eigenvectors of $\hat{\boldsymbol{\Sigma}}_j$ (that is, the $s$ eigenvectors corresponding to the $s$ largest eigenvalues) will reliably span a subspace close to that spanned by the support elements of $\mathbf{y}_j$. We provide theoretical guarantees that this method accurately recovers spanning subspaces with sparsity $s \lesssim M/\log^{6+\eta}(M)$ for any $\eta > 0$, which allows for recovery of the individual dictionary elements by a further subspace intersection process (see Algorithm 2).
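A minimal sketch of this spectral step, as we have just described it (and not Algorithm 1 verbatim), is the following; the function name is ours, and we omit mean-centering of the reweighted samples, which is negligible when the coefficients are symmetric.

import numpy as np

def estimate_spanning_subspace(Y, j, s):
    # Y is the (M, N) sample matrix; returns an (M, s) orthonormal basis for the
    # estimated span of the support elements of y_j.
    N = Y.shape[1]
    w = Y.T @ Y[:, j]                      # correlation weights <y_i, y_j>
    Z = Y * w                              # reweighted samples <y_i, y_j> y_i as columns
    Sigma_j = (Z @ Z.T) / N                # correlation-weighted covariance matrix
    _, eigvecs = np.linalg.eigh(Sigma_j)   # eigenvalues returned in ascending order
    return eigvecs[:, -s:]                 # the s leading eigenvectors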
Although the resulting estimate enjoys only weak convergence guarantees, of order $O(1/\mathrm{polylog}(M))$, it is accurate enough that subsequent post-processing steps can improve the estimate drastically.
After obtaining a nontrivial estimate of the dictionary which is close column-wise, we show how
these estimated columns can be used as “approximate oracles” to find a refined dictionary estimate
1. We note this is a slight abuse of notation, as the vectors $\mathbf{y}_i, \mathbf{y}_j$ may not be unit vectors.
with substantially better error bounds. These bounds are tight enough to allow recovery of the
support and signs of $\mathbf{X}$, from which an even better estimate can be gained by simple averaging. Most significantly, the error bounds for this final estimate tend to zero in the number of samples $N$,
and truly exact recovery is even possible in certain cases.
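One natural way to read the averaging step, under the additional assumptions that the nonzero entries of $\mathbf{x}$ are independent and symmetric and that the recovered supports and signs are exact, is the following back-of-the-envelope calculation (our illustration, not necessarily the exact estimator used in the post-processing). Letting $I_k$ denote the set of samples whose support contains $k$,

$\frac{1}{|I_k|} \sum_{i \in I_k} \operatorname{sign}(x_{ik})\, \mathbf{y}_i = \frac{1}{|I_k|} \sum_{i \in I_k} \Big( |x_{ik}|\, \mathbf{d}_k + \operatorname{sign}(x_{ik}) \sum_{l \ne k} x_{il}\, \mathbf{d}_l \Big) \approx \mathbb{E}\big[\, |x_{ik}| \;\big|\; k \in \operatorname{supp}(\mathbf{x}_i) \big]\, \mathbf{d}_k,$

since the cross terms average to zero under the symmetry assumption; normalizing the left-hand side then yields an estimate of $\mathbf{d}_k$ whose error decays as $|I_k|$ (and hence $N$) grows.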
The sample complexity required for SPORADIC to recover subspaces accurately depends on
the particular sparsity regime. In the most challenging linear-sparsity regime that is our main focus,
up to log factors, subspace recovery requires a sample complexity of at most $M^4$ for the initial dictionary estimate to be accurate, but in the “easier” regime $s \lesssim \sqrt{M}$, our results suggest the required sample complexity eases to $N \sim M$ (see Theorem 4.1). In the linear-sparsity case, the bottleneck is caused by approximation of a covariance matrix in the Frobenius norm, which is known to require a factor of $M$ more samples than estimation in the operator norm. We believe that in future work this step can be replaced by an approach requiring approximation only in the $L^2$ operator norm, in which case the worst-case sample complexity would be lowered to $M^3$.
2 Conventions and Data Model
We begin by stating the sparse dictionary learning problem explicitly:
Definition 1 (Sparse Dictionary Learning) Let $\mathbf{D} = (\mathbf{d}_1 \; \mathbf{d}_2 \; \cdots \; \mathbf{d}_K)$ be an (unknown) $M \times K$ matrix with unit-vector columns, called the dictionary. Let $\mathbf{x}$ be an $s$-sparse random vector, and define the random vector $\mathbf{y} = \mathbf{D}\mathbf{x}$. The sparse dictionary learning problem is:
Given $\mathbf{Y} = \mathbf{D}\mathbf{X}$, where $\mathbf{X}$ is a $K \times N$ matrix with columns $\{\mathbf{x}_i\}_{i=1}^N$ i.i.d. copies of $\mathbf{x}$, recover $\mathbf{D}$.
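As an illustration of this data model (not a construction used in our proofs), the sketch below generates a synthetic instance $\mathbf{Y} = \mathbf{D}\mathbf{X}$; the Gaussian dictionary and Rademacher nonzero coefficients are our own choices, since the definition only requires unit-norm columns and $s$-sparse coefficient vectors.

import numpy as np

def synthetic_instance(M, K, N, s, seed=None):
    # Returns (Y, D, X) with Y = D X, unit-norm dictionary columns, and s-sparse columns of X.
    rng = np.random.default_rng(seed)
    D = rng.standard_normal((M, K))
    D /= np.linalg.norm(D, axis=0)                        # unit-vector columns
    X = np.zeros((K, N))
    for i in range(N):
        support = rng.choice(K, size=s, replace=False)    # random s-element support
        X[support, i] = rng.choice([-1.0, 1.0], size=s)   # symmetric nonzero values
    return D @ X, D, X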
It is clear from the definition that $\mathbf{D}$ can only be recovered up to sign and permutation. Accordingly,
we employ the following definition for comparing two dictionaries, due to Arora et al. (2013): we
say that two dictionaries are column-wise ε-close if their columns are close in Euclidean norm after
an appropriate permutation and change of sign. In detail:
Definition 2 (Column-wise ε-close (Arora et al., 2013)) Two dictionaries $\mathbf{D} = (\mathbf{d}_1 \; \mathbf{d}_2 \; \cdots \; \mathbf{d}_K)$ and $\mathbf{D}' = (\mathbf{d}'_1 \; \mathbf{d}'_2 \; \cdots \; \mathbf{d}'_K)$ are column-wise $\varepsilon$-close if they have the same dimensions $M \times K$ and there exists a permutation $\pi$ of $\{1, \dots, K\}$ and a $K$-element sequence $\theta_k \in \{-1, 1\}$ such that for all $k = 1, \dots, K$:
$\|\mathbf{d}_k - \theta_k \mathbf{d}'_{\pi(k)}\|_2 \le \varepsilon$
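For concreteness, the following sketch computes an upper bound on the smallest such $\varepsilon$ for two given dictionaries, optimizing over signs per column pair and over permutations with a standard assignment solver; the function name and the use of the sum-minimizing (rather than exact bottleneck) matching are our own choices.

import numpy as np
from scipy.optimize import linear_sum_assignment

def columnwise_closeness(D, D_prime):
    # cost[k, k'] = min over theta in {-1, +1} of ||d_k - theta * d'_{k'}||_2
    plus = np.linalg.norm(D[:, :, None] - D_prime[:, None, :], axis=0)
    minus = np.linalg.norm(D[:, :, None] + D_prime[:, None, :], axis=0)
    cost = np.minimum(plus, minus)
    rows, cols = linear_sum_assignment(cost)   # permutation pi minimizing total matching cost
    return cost[rows, cols].max()              # eps achieved by this permutation and sign choice

Any permutation and sign choice yields an upper bound on the best achievable $\varepsilon$, so a small returned value certifies column-wise closeness.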
2.1 Notation and conventions
Vectors are represented by boldface lowercase letters, while matrices will be written as boldface
uppercase letters. Roman letters (both upper- and lowercase) will be used for both scalars and
random variables depending on context. We will use the notation |A| for the number of elements in
a finite set $A$, and $A^c$ for its complement.
We use two matrix norms at different points in the text. The standard $L^2$ operator norm will be denoted by $\|\cdot\|_2$, while the Frobenius norm will be denoted $\|\cdot\|_F$. Vector norms always refer to the standard $L^2$ (Euclidean) norm and will be denoted $\|\cdot\|_2$. We will use the notation $a \ll b$, where both $a$ and $b$ are scalars depending on $M$, to mean $\lim_{M \to \infty} |a|/|b| = 0$, where the norm in question may depend on context.