Optimal Eigenvalue Shrinkage in the Semicircle Limit
David L. Donoho and Michael J. Feldman
Department of Statistics, Stanford University

Abstract
Modern datasets are trending towards ever higher dimension. In response, recent theoretical studies of
covariance estimation often assume the proportional-growth asymptotic framework, where the sample size
$n$ and dimension $p$ are comparable, with $n, p \to \infty$ and $\gamma_n = p/n \to \gamma > 0$. Yet many datasets, perhaps
most, have very different numbers of rows and columns. We consider instead the disproportional-growth
asymptotic framework, where $n, p \to \infty$ and $\gamma_n \to 0$ or $\gamma_n \to \infty$. Either disproportional limit induces
novel behavior unseen within previous proportional and fixed-$p$ analyses.
We study the spiked covariance model, with theoretical covariance a low-rank perturbation of the
identity. For each of 15 different loss functions, we exhibit in closed form new optimal shrinkage and
thresholding rules; for some losses, optimality takes the particularly strong form of unique asymptotic
admissibility. Our optimal procedures demand extensive eigenvalue shrinkage and offer substantial
performance benefits over the standard empirical covariance estimator.
Practitioners may ask whether to view their data as arising within (and apply the procedures of) the
proportional or disproportional frameworks. Conveniently, it is possible to remain framework agnostic:
one unified set of closed-form shrinkage rules (depending only on the aspect ratio $\gamma_n$ of the given data)
offers full asymptotic optimality under either framework.
At the heart of the phenomena we explore is the spiked Wigner model, in which a low-rank matrix is
perturbed by symmetric noise. The (appropriately scaled) spectral distributions of the spiked covariance
under disproportional growth and the spiked Wigner converge to a common limit—the semicircle law.
Exploiting this connection, we derive optimal eigenvalue shrinkage rules for estimation of the low-rank
component, of independent and fundamental interest. These rules visibly correspond to our formulas for
optimal shrinkage in covariance estimation.
1 Introduction
Suppose we observe $p$-dimensional Gaussian vectors $x_1, \ldots, x_n \overset{\text{i.i.d.}}{\sim} N(0, \Sigma)$, with $\Sigma \equiv \Sigma_p$ the $p$-by-$p$
theoretical covariance matrix. Traditionally, to estimate $\Sigma$, we form the empirical (sample) covariance matrix
$S \equiv S_n = \frac{1}{n} \sum_{i=1}^n x_i x_i^\top$; this is the maximum likelihood estimator. Under the classical asymptotic framework
where $p$ is fixed and $n \to \infty$, $S$ is a consistent estimator of $\Sigma$ (under any matrix norm).
In recent decades, many impressive random matrix-theoretic studies consider $p \equiv p_n$ tending to infinity
with $n$. Generally, these studies focus on proportional growth, where the sample size and dimension are
comparable:
$$n, p \to \infty, \qquad \gamma_n = \frac{p}{n} \to \gamma > 0. \tag{1.1}$$
Under this framework, certain striking mathematical phenomena are elegantly brought to light. An immediate
deliverable for statisticians particularly is the discovery that in such a high-dimensional setting, the
maximum likelihood estimator $S$ is an inconsistent estimator of $\Sigma$ (under various matrix norms).
1.1 The Empirical Covariance Matrix in the Proportional Framework
We consider proportional growth and Johnstone’s spiked covariance model, where the theoretical covariance
is a low-rank perturbation of identity. All except finitely many eigenvalues $(\ell_i)_{i=1}^p$ of $\Sigma$ are equal to one:
$$\ell_1 \geq \cdots \geq \ell_r \geq 1, \qquad \ell_{r+1} = \cdots = \ell_p = 1. \tag{1.2}$$
arXiv:2210.04488v2 [math.ST] 30 Jul 2023
The rank $r$ and the leading theoretical eigenvalues $(\ell_i)_{i=1}^r$, which we refer to as “spiked” eigenvalues, are
fixed and independent of $n$. Let $\lambda_i \equiv \lambda_{i,n}$ denote the eigenvalues of $S$, ordered decreasingly: $\lambda_1 \geq \cdots \geq \lambda_p$.
Inconsistency of $S$ under proportional growth stems from several phenomena absent under classical fixed-$p$
large-$n$ asymptotic studies. Their discovery is due to Marchenko and Pastur [28], Baik, Ben Arous, and
Péché [6], Baik and Silverstein [5], and Paul [31].
1. Eigenvalue spreading. In the standard normal case $\Sigma = I$, where $I \equiv I_p$ denotes the $p$-dimensional
identity matrix, the empirical spectral measure of $S$ converges under (1.1) weakly almost surely to
the Marchenko-Pastur distribution with parameter $\gamma$. For $\gamma \in (0, 1]$, this distribution, or bulk, is
non-degenerate, absolutely continuous, and has support $[(1 - \sqrt{\gamma})^2, (1 + \sqrt{\gamma})^2] = [\lambda_-(\gamma), \lambda_+(\gamma)]$.
Intuitively, empirical eigenvalues, rather than concentrating near their theoretical counterparts (which
in this case are all simply 1), spread out across a fixed-size interval, preventing consistency of $S$ for $\Sigma$.
2. Eigenvalue bias. As it turns out, the leading empirical eigenvalues $(\lambda_i)_{i=1}^r$ do not converge to their
theoretical counterparts $(\ell_i)_{i=1}^r$; rather, they are biased upwards. Under (1.1) and (1.2), for fixed $i \geq 1$,
$$\lambda_i \xrightarrow{\text{a.s.}} \lambda(\ell_i), \tag{1.3}$$
where $\lambda(\ell) \equiv \lambda(\ell, \gamma)$ is the “eigenvalue mapping” function, given piecewise by
$$\lambda(\ell) = \begin{cases} \ell + \dfrac{\gamma \ell}{\ell - 1} & \ell > 1 + \sqrt{\gamma} \\ (1 + \sqrt{\gamma})^2 & \ell \leq 1 + \sqrt{\gamma} \end{cases}. \tag{1.4}$$
The transition point $\ell_+(\gamma) = 1 + \sqrt{\gamma}$ between the two behaviors is known as the Baik-Ben Arous-Péché
(BBP) transition. Below the transition, $1 < \ell \leq \ell_+(\gamma)$, “weak signal” leads to a limiting eigenvalue
independent of $\ell$. For fixed $i$ such that $\ell_i \leq \ell_+(\gamma)$, $\lambda_i$ tends to $\lambda_+(\gamma) = (1 + \sqrt{\gamma})^2$, the upper bulk-edge
of the Marchenko-Pastur distribution with parameter $\gamma$.
Above the transition, $\ell > \ell_+(\gamma)$, “strong signal” produces an empirical eigenvalue dependent on $\ell$,
though biased upwards. For fixed $i$ such that $\ell_i > \ell_+(\gamma)$, $\lambda_i$ “emerges from the bulk,” approaching a
limit $\lambda(\ell_i) > \ell_i$. This asymptotic bias in extreme eigenvalues is a further cause of inconsistency of $S$
in several loss measures, including operator norm loss.
3. Eigenvector inconsistency. The eigenvectors $v_1, \ldots, v_p$ of $S$ do not align asymptotically with the
corresponding eigenvectors $u_1, \ldots, u_p$ of $\Sigma$. Under (1.1) and (1.2), assuming supercritical spiked
eigenvalues (those with $\ell_i > \ell_+(\gamma)$) are distinct, the limiting angles are deterministic and obey
$$|\langle u_i, v_j \rangle| \xrightarrow{\text{a.s.}} \delta_{ij} \cdot c(\ell_i), \qquad 1 \leq i, j \leq r; \tag{1.5}$$
here the “cosine” function $c(\ell) \equiv c(\ell, \gamma)$ is given piecewise by
$$c^2(\ell) = \begin{cases} \dfrac{1 - \gamma/(\ell - 1)^2}{1 + \gamma/(\ell - 1)} & \ell > 1 + \sqrt{\gamma} \\ 0 & \ell \leq 1 + \sqrt{\gamma} \end{cases}. \tag{1.6}$$
Again, a phase transition occurs at $\ell_+(\gamma)$. This misalignment of empirical and theoretical eigenvectors
further contributes to inconsistency; this is easiest to see for Frobenius loss.
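The limits (1.3)-(1.6) can be checked numerically. The following is a minimal sketch, assuming a single spike placed along the first coordinate axis; the function names `eigenvalue_map`, `cosine_sq`, and `simulate_top_pair` and the parameter values are illustrative choices, not part of the paper.

```python
import numpy as np

def eigenvalue_map(ell, gamma):
    """Limiting empirical eigenvalue lambda(ell) from (1.4)."""
    ell = np.asarray(ell, dtype=float)
    above = ell > 1 + np.sqrt(gamma)
    safe = np.where(above, ell, 2.0)  # dummy value avoids dividing by zero below the transition
    return np.where(above, safe + gamma * safe / (safe - 1), (1 + np.sqrt(gamma)) ** 2)

def cosine_sq(ell, gamma):
    """Limiting squared cosine c^2(ell) from (1.6)."""
    ell = np.asarray(ell, dtype=float)
    above = ell > 1 + np.sqrt(gamma)
    safe = np.where(above, ell, 2.0)
    return np.where(above, (1 - gamma / (safe - 1) ** 2) / (1 + gamma / (safe - 1)), 0.0)

def simulate_top_pair(ell, gamma, n, seed=0):
    """Sample from N(0, diag(ell, 1, ..., 1)) with p = gamma * n (illustrative setup).

    Returns the top eigenvalue of S and the squared cosine between the top
    empirical eigenvector and the spike direction e_1.
    """
    rng = np.random.default_rng(seed)
    p = int(round(gamma * n))
    X = rng.standard_normal((p, n))
    X[0] *= np.sqrt(ell)               # first coordinate carries the spike
    S = X @ X.T / n
    evals, evecs = np.linalg.eigh(S)   # eigenvalues in ascending order
    return evals[-1], evecs[0, -1] ** 2

if __name__ == "__main__":
    gamma, ell = 0.5, 4.0
    lam_hat, cos_hat = simulate_top_pair(ell, gamma, n=2000)
    print("top eigenvalue:", lam_hat, "limit:", float(eigenvalue_map(ell, gamma)))
    print("squared cosine:", cos_hat, "limit:", float(cosine_sq(ell, gamma)))
```

At finite $n$ the simulated values fluctuate around the limits, so agreement should only be expected up to sampling error.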
1.2 Shrinkage Estimation
Charles Stein proposed eigenvalue shrinkage as an alternative to traditional covariance estimation [35, 36].
Let $S = V \Lambda V^\top$ be an eigendecomposition, where $V$ is orthogonal and $\Lambda = \mathrm{diag}(\lambda_1, \ldots, \lambda_p)$. Let
$\eta : [0, \infty) \to [0, \infty)$ denote a scalar “rule” or “nonlinearity” or “shrinker,” and adopt the convention
$\eta(\Lambda) \equiv \mathrm{diag}(\eta(\lambda_1), \ldots, \eta(\lambda_p))$.¹ Estimators of the form $\hat{\Sigma}_\eta = V \eta(\Lambda) V^\top$ are studied in hundreds of papers; see the
works of Donoho, Gavish, and Johnstone [16] (and the extensive references therein) and Ledoit and Wolf
[24, 25]. Note that despite possible ambiguities in the choice of eigenvectors $V$, $\hat{\Sigma}_\eta$ is well defined.²
¹ These are common synonyms in shrinkage literature. Note that a nonlinearity may in fact act linearly and a shrinker may
not act as a contraction.
² The signs of eigenvectors are arbitrary. In the case of degenerate eigenvalues, there is additional eigenvector ambiguity.
The standard empirical covariance estimator $S$ results from the identity rule, $\eta(\lambda) = \lambda$; we will see
that under various losses, rules acting as contractions are beneficial, obeying $|\eta(\lambda) - 1| < |\lambda - 1|$. In
the spiked model, a well-chosen shrinker mitigates the estimation errors induced by eigenvalue bias and
eigenvector inconsistency. Working under the proportional framework, the authors of [16] examine dozens
of loss functions $L$ and derive for each an asymptotically unique admissible shrinker $\eta(\cdot \,|\, L)$, in many cases
far outperforming $S$.
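As a concrete illustration of an estimator of the form $V \eta(\Lambda) V^\top$, here is a minimal sketch. The rule `debias_rule` is a hypothetical example obtained by inverting the eigenvalue mapping (1.4) above the bulk edge and sending every other eigenvalue to 1; it is not one of the loss-specific optimal shrinkers of [16] or of this paper, and all names are assumptions made for the example.

```python
import numpy as np

def shrinkage_estimator(S, eta):
    """Return V eta(Lambda) V^T for a symmetric matrix S and a scalar rule eta."""
    evals, V = np.linalg.eigh(S)
    return (V * eta(evals)) @ V.T      # scales column j of V by eta(evals[j])

def debias_rule(gamma):
    """Illustrative rule (an assumption, not the paper's): invert (1.4).

    Solving lam = ell + gamma * ell / (ell - 1) for ell gives the larger root of
    ell^2 - (lam + 1 - gamma) * ell + lam = 0. Eigenvalues at or below the bulk
    edge (1 + sqrt(gamma))^2 are mapped to 1.
    """
    edge = (1 + np.sqrt(gamma)) ** 2

    def eta(lam):
        lam = np.asarray(lam, dtype=float)
        b = lam + 1 - gamma
        disc = np.maximum(b ** 2 - 4 * lam, 0.0)
        return np.where(lam > edge, (b + np.sqrt(disc)) / 2, 1.0)

    return eta

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    n, p = 400, 200
    X = rng.standard_normal((p, n))
    X[0] *= 2.0                        # one spike with ell = 4
    S = X @ X.T / n
    Sigma_hat = shrinkage_estimator(S, debias_rule(p / n))
    print(np.linalg.eigvalsh(Sigma_hat)[-3:])
```

Passing the identity rule `lambda lam: lam` returns $S$ itself, which gives a quick sanity check of the eigendecomposition step.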
1.3 Which Choice of Asymptotic Framework?
The modern “big data” explosion exhibits all manner of ratios of dimension to sample size. Indeed, there
are internet traffic datasets with billions of samples and thousands of dimensions, and computational biology
datasets with thousands of samples and millions of dimensions. To consider only asymptotic frameworks
where row and column counts are roughly balanced, as they are under proportional growth, is a restriction,
and perhaps, even an obstacle.
Although proportional-growth analysis has yielded many valuable insights, practitioners have expressed
doubts about its applicability. In a given application, with a single dataset of size $(n_{\text{data}}, p_{\text{data}})$, is the
proportional-growth model relevant? No infinite sequence of dataset sizes is visible.
Implicit in the choice of asymptotic framework is an assumption on how this one dataset embeds in a
sequence of growing datasets. Should one view the data as arising within the fixed-$p$ asymptotic framework
$(n, p_{\text{data}})$ with only $n$ varying? If so, long tradition recommends estimating $\Sigma$ by $S$. On the other hand, if one
views the dataset size as arising from a sequence of proportionally-growing datasets of sizes
$(n, (p_{\text{data}}/n_{\text{data}}) \cdot n)$, with constant aspect ratio $\gamma = p_{\text{data}}/n_{\text{data}}$, recent trends in the theoretical literature recommend
applying eigenvalue shrinkage. Current theory offers little guidance on the choice of asymptotic framework,
which dictates whether and how much to shrink. Moreover, there are many possible asymptotic frameworks
containing $(n_{\text{data}}, p_{\text{data}})$.
1.4 Disproportional Growth
Within the full spectrum of power law scalings $p \asymp n^\alpha$, $\alpha \geq 0$, the much-studied proportional-growth limit
corresponds to the single case $\alpha = 1$. The classical $p$-fixed, $n$ growing relation again corresponds to the
single case $\alpha = 0$. This paper considers disproportional growth, encompassing everything else:
$$n, p \to \infty, \qquad \gamma_n = p/n \to 0 \ \text{ or } \ \infty.$$
Note that all power law scalings $0 < \alpha < \infty$, $\alpha \neq 1$ are included, as well as non-power law scalings, such
as $p = \log n$ or $p = e^n$. The disproportional-growth framework splits naturally into instances; to describe
them, we use terminology that assumes the underlying data matrices $X \equiv X_n$ are $p \times n$.
1. The “wide matrix” disproportional limit obeys:
$$n, p \to \infty, \qquad \gamma_n = p/n \to 0. \tag{1.7}$$
In this limit, which includes power laws with $\alpha \in (0, 1)$, $n$ is much larger than $p$, and yet we are outside
the classical, fixed-$p$ large-$n$ setting.
2. The “tall matrix” disproportional limit involves arrays with many more rows than columns; formally:
$$n, p \to \infty, \qquad \gamma_n = p/n \to \infty. \tag{1.8}$$
This limit, including power laws with $\alpha \in (1, \infty)$, admits many additional scalings of numbers of rows
to columns.
Properties of covariance matrices in the two disproportionate limits are closely linked. Indeed, the non-zero
eigenvalues of $X X^\top$ and $X^\top X$ are equal. For any sequence of tall datasets with $\gamma_n \to \infty$, there is an
accompanying sequence of wide datasets with $\gamma_n \to 0$ and related spectral properties.
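The link between the two limits rests on the shared non-zero spectrum of $X X^\top$ and $X^\top X$. A small numerical check, with arbitrary illustrative dimensions, might look as follows.

```python
import numpy as np

rng = np.random.default_rng(0)
p, n = 6, 40                                # illustrative sizes, chosen arbitrarily
X = rng.standard_normal((p, n))

eig_small = np.linalg.eigvalsh(X @ X.T)     # p eigenvalues of the p-by-p matrix
eig_large = np.linalg.eigvalsh(X.T @ X)     # n eigenvalues of the n-by-n matrix

# The p largest eigenvalues of X^T X match those of X X^T; the other n - p vanish.
print(np.allclose(eig_small, eig_large[-p:]))
print(np.max(np.abs(eig_large[: n - p])))

# In normalized form: the non-zero eigenvalues of X^T X / p equal those of
# S = X X^T / n divided by gamma = p / n.
gamma = p / n
eig_S = np.linalg.eigvalsh(X @ X.T / n)
eig_companion = np.linalg.eigvalsh(X.T @ X / p)[-p:]
print(np.allclose(eig_companion, eig_S / gamma))
```

The last comparison shows how a spectrum computed with aspect ratio $\gamma$ is rescaled to obtain the spectrum of the transposed problem with aspect ratio $1/\gamma$.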
1.5 The $\gamma_n \to 0$ Asymptotic Framework
The $\gamma_n \to 0$ regime seems, at first glance, very different from the proportional case, $\gamma_n \to \gamma > 0$. Neither
eigenvalue spreading nor eigenvalue bias are apparent: under (1.2), empirical eigenvalues converge to their
theoretical counterparts, $\lambda_i \xrightarrow{\text{a.s.}} \ell_i$, $1 \leq i \leq p$. Moreover, the leading eigenvectors of $S$ consistently estimate
the corresponding eigenvectors of $\Sigma$: $|\langle u_i, v_j \rangle| \xrightarrow{\text{a.s.}} \delta_{ij}$, $1 \leq i, j \leq r$. Eigenvalue shrinkage therefore seems
irrelevant as $S$ itself is a consistent estimator of $\Sigma$ in Frobenius and operator norms. To the contrary, we
introduce an asymptotic framework in which well-designed shrinkage rules confer substantial relative gains
over the identity rule, paralleling gains seen earlier under proportional growth.
As $\gamma_n \to 0$, the empirical spectral measure of $S$ has support with width approximately $4\sqrt{\gamma_n}$. Accordingly,
we study spiked eigenvalues varying with $n$,
$$\ell_i \equiv \ell_{i,n} = 1 + \tilde{\ell}_i \sqrt{\gamma_n}\,(1 + o(1)),$$
where $(\tilde{\ell}_i)_{i=1}^r$ are new parameters held constant. This scale, we shall see, is the critical scale under which
eigenvalue bias and eigenvector inconsistency occur. Analogs of (1.3)-(1.6) as $\gamma_n \to 0$ are given by simple
expressions involving $\tilde{\ell}$ and normalized empirical eigenvalues $\tilde{\lambda} = (\lambda - 1 - \gamma_n)/\sqrt{\gamma_n}$, with a phase transition
occurring precisely at $\tilde{\ell} = 1$. Above the transition, $\tilde{\ell} > 1$, (1) $\tilde{\lambda}$ approaches a limit dependent on $\tilde{\ell}$, though
biased upwards, and (2) the angles between the leading eigenvectors of $S$ and corresponding eigenvectors of
$\Sigma$ tend to nonzero limits.
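To see why $\sqrt{\gamma_n}$ is the natural scale, one can formally substitute a spike of the form $\ell = 1 + \tilde{\ell}\sqrt{\gamma}$ into (1.4) and (1.6), writing $\tilde{\ell}$ for the rescaled spike and $\gamma$ for $\gamma_n$. This is only a heuristic calculation under that substitution (it ignores the $1 + o(1)$ factor and treats the proportional-regime formulas as if they applied uniformly as $\gamma \to 0$); it is not the paper's proof. For $\tilde{\ell} > 1$, which is equivalent to $\ell > 1 + \sqrt{\gamma}$:

```latex
\begin{align*}
\lambda(\ell)
  &= \ell + \frac{\gamma \ell}{\ell - 1}
   = 1 + \tilde{\ell}\sqrt{\gamma}
     + \frac{\gamma\,(1 + \tilde{\ell}\sqrt{\gamma})}{\tilde{\ell}\sqrt{\gamma}}
   = 1 + \gamma + \Big(\tilde{\ell} + \frac{1}{\tilde{\ell}}\Big)\sqrt{\gamma}, \\[4pt]
\frac{\lambda(\ell) - 1 - \gamma}{\sqrt{\gamma}}
  &= \tilde{\ell} + \frac{1}{\tilde{\ell}}, \\[4pt]
c^2(\ell)
  &= \frac{1 - \gamma/(\tilde{\ell}^2 \gamma)}{1 + \gamma/(\tilde{\ell}\sqrt{\gamma})}
   = \frac{1 - 1/\tilde{\ell}^2}{1 + \sqrt{\gamma}/\tilde{\ell}}
   \;\longrightarrow\; 1 - \frac{1}{\tilde{\ell}^2}
   \qquad (\gamma \to 0).
\end{align*}
```

Under the same normalization the bulk edge $(1 + \sqrt{\gamma})^2$ maps to $2$, and $\tilde{\ell} + 1/\tilde{\ell} > 2$ whenever $\tilde{\ell} > 1$, which is consistent with a normalized eigenvalue that emerges from the bulk and is biased upwards relative to $\tilde{\ell}$.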
The consequences of such high-dimensional phenomena are similar to yet distinct from those uncovered in
the proportional setting. For many choices of loss function, $S$ is outperformed substantially by well-designed
shrinkage rules, particularly near the phase transition at $\ell_+(\gamma_n)$. We will consider a range of loss functions
$L$, deriving for each a shrinker $\eta(\cdot \,|\, L)$ which is optimal as $\gamma_n \to 0$. Analogous results hold as $\gamma_n \to \infty$.
1.6 Estimation in the Spiked Wigner Model
At the heart of our analysis is a connection to the spiked Wigner model. Let $W = W_n$ denote a Wigner
matrix, a real symmetric matrix of size $n \times n$ with independent entries on the upper triangle distributed as
$N(0, 1)$. Let $\Theta = \Theta_n$ denote a symmetric $n \times n$ “signal” matrix of fixed rank $r$; under the spiked Wigner
model, observed data $Y = Y_n$ obeys
$$Y = \Theta + \frac{1}{\sqrt{n}} W. \tag{1.9}$$
Let $\theta_1 \geq \cdots \geq \theta_{r_+} > 0 > \theta_{r_+ + 1} \geq \cdots \geq \theta_r$ denote the non-zero eigenvalues of $\Theta$, so there are $r_+$ positive
values and $r_- = r - r_+$ negative.
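A standard approach to recovering $\Theta$ from noisy data $Y$ uses the eigenvalues of $Y$, $\lambda_1(Y) \geq \cdots \geq \lambda_n(Y)$,
and the associated eigenvectors $v_1, \ldots, v_n$:
$$\hat{\Theta}^r = \sum_{i=1}^{r_+} \lambda_i(Y)\, v_i v_i^\top + \sum_{i=n-r_-+1}^{n} \lambda_i(Y)\, v_i v_i^\top.$$
The rank-aware estimator $\hat{\Theta}^r$ can be improved upon substantially by estimators of the form
$$\hat{\Theta}_\eta = \sum_{i=1}^{n} \eta(\lambda_i(Y))\, v_i v_i^\top, \tag{1.10}$$
with $\eta : \mathbb{R}_+ \to \mathbb{R}_+$ a well-chosen shrinkage rule.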
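The two estimators above can be compared in a short simulation. This is a minimal sketch for a single positive spike; the rule `eta_illustrative` is an assumption made for the example, not a formula quoted from this text. It uses two standard large-$n$ facts about rank-one spiked Wigner matrices taken from the general literature: for $\theta > 1$ the top eigenvalue of $Y$ tends to $\theta + 1/\theta$, and the squared cosine between top empirical and signal eigenvectors tends to $1 - 1/\theta^2$. Minimizing the Frobenius error along one eigen-direction under those limits gives $\eta(\lambda) = \sqrt{(\lambda^2 - 4)_+}$.

```python
import numpy as np

def wigner(n, rng):
    """Symmetric matrix with independent N(0, 1) entries on and above the diagonal."""
    G = rng.standard_normal((n, n))
    return np.triu(G) + np.triu(G, 1).T

def rank_aware(Y, r_plus, r_minus):
    """Keep the r_plus largest and r_minus smallest eigenvalues of Y unchanged."""
    evals, V = np.linalg.eigh(Y)            # ascending order
    n = evals.size
    keep = np.zeros(n, dtype=bool)
    keep[n - r_plus:] = True
    keep[:r_minus] = True
    return (V[:, keep] * evals[keep]) @ V[:, keep].T

def shrink(Y, eta):
    """Estimator of the form (1.10): apply eta to every eigenvalue of Y."""
    evals, V = np.linalg.eigh(Y)
    return (V * eta(evals)) @ V.T

def eta_illustrative(lam):
    """Hypothetical rule for positive spikes: sqrt((lam^2 - 4)_+) above 2, else 0."""
    lam = np.asarray(lam, dtype=float)
    return np.where(lam > 2, np.sqrt(np.maximum(lam ** 2 - 4, 0.0)), 0.0)

def demo(n=400, theta=2.0, seed=0):
    """Return Frobenius errors (rank-aware, shrinkage) for one rank-one draw."""
    rng = np.random.default_rng(seed)
    u = rng.standard_normal(n)
    u /= np.linalg.norm(u)
    Theta = theta * np.outer(u, u)
    Y = Theta + wigner(n, rng) / np.sqrt(n)
    err_rank = np.linalg.norm(rank_aware(Y, 1, 0) - Theta)
    err_shrink = np.linalg.norm(shrink(Y, eta_illustrative) - Theta)
    return err_rank, err_shrink

if __name__ == "__main__":
    print(demo())
```

Because the rule is continuous at the bulk edge $\lambda = 2$, a noise eigenvalue that lands slightly above 2 contributes only a small term to the error.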
Optimal formulas for $\eta$ under the spiked Wigner model appear below; they are identical, after appropriate
formal substitutions, to optimal formulas for covariance estimation in the disproportionate, $\gamma_n \to 0$ limit.
Moreover, the driving theoretical quantities in each setting (leading eigenvalue bias, eigenvector inconsistency,
optimal shrinkers, and losses) are all “isomorphic.” These equivalencies stem from the following two
important limit theorems, which, although they concern quite different sequences of matrices, set forth
identical limiting distributions.
Theorem 1.1 (Wigner [38, 39], Arnold [1]). The empirical spectral measure of $W/\sqrt{n}$ converges weakly
almost surely to the semicircle law, with density $\omega(x) = (2\pi)^{-1} \sqrt{(4 - x^2)_+}$.
Wigner proved convergence in probability of the empirical spectral measure; this was strengthened to
almost sure convergence by Arnold. By Cauchy’s interlacing theorem, the conclusion of Theorem 1.1 applies
as well to spiked Wigner matrices $Y$ following model (1.9).
Theorem 1.2 (Bai and Yin [3]). As $\gamma_n \to 0$, the spectral measure of $\gamma_n^{-1/2}(S - I)$ converges weakly almost
surely to the semicircle law, that is, to the same limit as in Theorem 1.1.
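Both theorems can be illustrated numerically by comparing simple summaries of the two spectra with those of the semicircle law, which has support $[-2, 2]$ and second moment 1. The sketch below uses arbitrary illustrative sizes and hypothetical helper names; it is a finite-sample sanity check, not a proof of convergence.

```python
import numpy as np

def semicircle_density(x):
    """Density omega(x) = sqrt((4 - x^2)_+) / (2 pi) from Theorem 1.1."""
    x = np.asarray(x, dtype=float)
    return np.sqrt(np.maximum(4 - x ** 2, 0.0)) / (2 * np.pi)

def wigner_spectrum(n, rng):
    """Eigenvalues of W / sqrt(n), with N(0, 1) entries on and above the diagonal."""
    G = rng.standard_normal((n, n))
    W = np.triu(G) + np.triu(G, 1).T
    return np.linalg.eigvalsh(W / np.sqrt(n))

def covariance_spectrum(n, p, rng):
    """Eigenvalues of gamma_n^{-1/2} (S - I) for Sigma = I, with gamma_n = p / n."""
    X = rng.standard_normal((p, n))
    S = X @ X.T / n
    return np.linalg.eigvalsh(S - np.eye(p)) / np.sqrt(p / n)

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    ew = wigner_spectrum(300, rng)
    ec = covariance_spectrum(10000, 100, rng)   # gamma_n = 0.01
    for name, e in [("Wigner", ew), ("covariance", ec)]:
        print(name, "range:", e.min(), e.max(), "second moment:", np.mean(e ** 2))
```

At $\gamma_n = 0.01$ the rescaled covariance spectrum is still visibly shifted (roughly by $\sqrt{\gamma_n}$) relative to $[-2, 2]$; the shift shrinks as $\gamma_n$ decreases.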
1.7 Our Contributions
Given this background, we now state our contributions:
1. We study the disproportional $\gamma_n \to 0$ framework with an eye towards developing analogs of (1.3)-(1.6).
In the critical scaling of this regime, spiked eigenvalues decay towards one as $1 + \tilde{\ell}\sqrt{\gamma_n}$, where $\tilde{\ell}$ is a
new formal parameter. Analogs of (1.3)-(1.6) as a function of $\tilde{\ell}$ are presented in Lemma 3.1 below.
On this scale, the analog of the BBP phase transition (the critical spike strength above which leading
eigenvectors of $S$ correlate with those of $\Sigma$) now occurs at $\tilde{\ell} = 1$. While equivalent formulas are given
by Bloemendal et al. [11], we work under weaker assumptions, allowing general rates at which $n, p \to \infty$
while $\gamma_n \to 0$, and giving a simple, direct argument. Analogous results hold as $\gamma_n \to \infty$, explored in
later sections.
2. From the disproportional analogs of (1.3)-(1.6), we derive new optimal rules for shrinkage of leading
eigenvalues under fifteen canonical loss functions. Optimal shrinkage provides improvement by multiplicative
factors; e.g., Table 2 indicates relative loss improvements over the standard covariance of 50%
or higher, when $\tilde{\ell}$ is not large. Furthermore, for some losses, we obtain unique asymptotic admissibility
(see Definition 3.5): within this framework, no other rule is better under any set of spiked eigenvalue
parameters. We derive closed forms for the relative gain of optimal shrinkage over the empirical
covariance matrix. In addition, we find optimal hard thresholding levels under each loss.
3. Remarkably, the $n, p \to \infty$, $\gamma_n \to 0$ limit is dissimilar to classical fixed-$p$ statistics: for any rate $\gamma_n \to 0$,
non-trivial eigenvalue shrinkage is optimal, and for two sets of loss functions, uniquely asymptotically
admissible.
4. Our optimal rules and losses are the limits, in the disproportional framework, of proportional-regime
optimal rules and losses. Consequently, we obtain framework-agnostic shrinkage rules that achieve optimal
performance across the proportional and disproportional ($\gamma_n \to 0$ or $\gamma_n \to \infty$) asymptotics. Given a
dataset of size $(n_{\text{data}}, p_{\text{data}})$, there is a single shrinkage rule depending only on $\gamma_{\text{data}} = p_{\text{data}}/n_{\text{data}}$ (and
the loss function of choice) with optimal performance in any asymptotic embedding of $(n_{\text{data}}, p_{\text{data}})$.
5. We obtain asymptotically optimal rules and losses for the spiked Wigner model, which are formally
identical to optimal rules and losses of the bilateral spiked covariance model (where spiked eigenvalues
may be elevated above or depressed below one).
6. We consider extensions of shrinkage to divergent spiked eigenvalues (where spiked eigenvalues,
previously bounded, may now diverge). Divergent spikes are motivated by applications in which the leading
eigenvalues of the covariance matrix are orders of magnitude greater than the median eigenvalue.
Eigenvalue bias and eigenvector inconsistency do not occur appreciably under such strong signals, yet
optimal shrinkage remains provably beneficial.
Our results offer several key takeaways. Firstly, we directly face a widespread criticism of prior theoretical
work, that row and column counts are assumed proportional; such criticism is based on the empirical
observation that many (if not most) modern datasets have highly asymmetric numbers of rows and columns.
Secondly, we show that nontrivial leading eigenvalue shrinkage is beneficial under any of the discussed
post-classical frameworks, proportional or disproportional growth, and any of a variety of loss functions.
Finally, we resolve the following “framework conundrum.” In view of theoretical studies under various
asymptotic frameworks, a practitioner might well think as follows: