

Optimal Eigenvalue Shrinkage in the Semicircle Limit

David L. Donoho and Michael J. Feldman

Department of Statistics, Stanford University

Abstract

Modern datasets are trending towards ever higher dimension. In response, recent theoretical studies of covariance estimation often assume the proportional-growth asymptotic framework, where the sample size $n$ and dimension $p$ are comparable, with $n, p \to \infty$ and $\gamma_n = p/n \to \gamma > 0$. Yet many datasets, perhaps most, have very different numbers of rows and columns. We consider instead the disproportional-growth asymptotic framework, where $n, p \to \infty$ and $\gamma_n \to 0$ or $\gamma_n \to \infty$. Either disproportional limit induces novel behavior unseen within previous proportional and fixed-$p$ analyses.

We study the spiked covariance model, with theoretical covariance a low-rank perturbation of the identity. For each of 15 different loss functions, we exhibit in closed form new optimal shrinkage and thresholding rules; for some losses, optimality takes the particularly strong form of unique asymptotic admissibility. Our optimal procedures demand extensive eigenvalue shrinkage and offer substantial performance benefits over the standard empirical covariance estimator.

Practitioners may ask whether to view their data as arising within (and apply the procedures of) the proportional or disproportional frameworks. Conveniently, it is possible to remain framework agnostic: one unified set of closed-form shrinkage rules (depending only on the aspect ratio $\gamma_n$ of the given data) offers full asymptotic optimality under either framework.

At the heart of the phenomena we explore is the spiked Wigner model, in which a low-rank matrix is perturbed by symmetric noise. The (appropriately scaled) spectral distributions of the spiked covariance under disproportional growth and the spiked Wigner converge to a common limit, the semicircle law. Exploiting this connection, we derive optimal eigenvalue shrinkage rules for estimation of the low-rank component, of independent and fundamental interest. These rules visibly correspond to our formulas for optimal shrinkage in covariance estimation.

1 Introduction

Suppose we observe $p$-dimensional Gaussian vectors $x_1, \ldots, x_n \overset{\text{i.i.d.}}{\sim} N(0, \Sigma)$, with $\Sigma \equiv \Sigma_p$ the $p$-by-$p$ theoretical covariance matrix. Traditionally, to estimate $\Sigma$, we form the empirical (sample) covariance matrix

\[ S \equiv S_n = \frac{1}{n} \sum_{i=1}^{n} x_i x_i' ; \]

this is the maximum likelihood estimator. Under the classical asymptotic framework where $p$ is fixed and $n \to \infty$, $S$ is a consistent estimator of $\Sigma$ (under any matrix norm).

In recent decades, many impressive random matrix-theoretic studies consider $p \equiv p_n$ tending to infinity with $n$. Generally, these studies focus on proportional growth, where the sample size and dimension are comparable:

\[ n, p \to \infty, \qquad \gamma_n = \frac{p}{n} \to \gamma > 0. \tag{1.1} \]

Under this framework, certain striking mathematical phenomena are elegantly brought to light. An immediate deliverable for statisticians in particular is the discovery that in such a high-dimensional setting, the maximum likelihood estimator $S$ is an inconsistent estimator of $\Sigma$ (under various matrix norms).
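The contrast between the classical and proportional regimes can be seen numerically. The following is a minimal sketch, assuming $\Sigma = I$ and NumPy; the function name `operator_norm_error` and the chosen sizes are illustrative and do not come from the paper. It measures the operator-norm distance between $S$ and $\Sigma$ for a given $(n, p)$.

```python
import numpy as np


def operator_norm_error(n, p, seed=0):
    """Operator-norm error ||S - I|| for n Gaussian samples in dimension p.

    Assumes the true covariance is the identity (an illustrative choice).
    """
    rng = np.random.default_rng(seed)
    x = rng.standard_normal((p, n))        # columns are the samples x_1..x_n
    s = x @ x.T / n                        # S = (1/n) sum_i x_i x_i'
    eig = np.linalg.eigvalsh(s - np.eye(p))  # ascending eigenvalues of S - I
    return max(abs(eig[0]), abs(eig[-1]))
```

With $p$ small relative to $n$ (for example `operator_norm_error(4000, 10)`) the error should be small, while with $p/n = 1/2$ (for example `operator_norm_error(800, 400)`) it stays bounded away from zero however large $n$ is taken, in line with the Marchenko-Pastur bulk edge discussed in Section 1.1.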

1.1 The Empirical Covariance Matrix in the Proportional Framework

We consider proportional growth and Johnstone's spiked covariance model, where the theoretical covariance is a low-rank perturbation of identity. All except finitely many eigenvalues $(\ell_i)_{i=1}^{p}$ of $\Sigma$ are equal to one:

\[ \ell_1 \ge \cdots \ge \ell_r \ge 1, \qquad \ell_{r+1} = \cdots = \ell_p = 1. \tag{1.2} \]

arXiv:2210.04488v2 [math.ST] 30 Jul 2023

The rank $r$ and the leading theoretical eigenvalues $(\ell_i)_{i=1}^{r}$, which we refer to as "spiked" eigenvalues, are fixed and independent of $n$. Let $\lambda_i \equiv \lambda_{i,n}$ denote the eigenvalues of $S$, ordered decreasingly: $\lambda_1 \ge \cdots \ge \lambda_p$. Inconsistency of $S$ under proportional growth stems from several phenomena absent under classical fixed-$p$, large-$n$ asymptotic studies. Their discovery is due to Marchenko and Pastur [28], Baik, Ben Arous, and Péché [6], Baik and Silverstein [5], and Paul [31].

1. Eigenvalue spreading. In the standard normal case $\Sigma = I$, where $I \equiv I_p$ denotes the $p$-dimensional identity matrix, the empirical spectral measure of $S$ converges under (1.1) weakly almost surely to the Marchenko-Pastur distribution with parameter $\gamma$. For $\gamma \in (0, 1]$, this distribution, or bulk, is non-degenerate, absolutely continuous, and has support $[(1 - \sqrt{\gamma})^2, (1 + \sqrt{\gamma})^2] = [\lambda_-(\gamma), \lambda_+(\gamma)]$. Intuitively, empirical eigenvalues, rather than concentrating near their theoretical counterparts (which in this case are all simply 1), spread out across a fixed-size interval, preventing consistency of $S$ for $\Sigma$.

2. Eigenvalue bias. As it turns out, the leading empirical eigenvalues $(\lambda_i)_{i=1}^{r}$ do not converge to their theoretical counterparts $(\ell_i)_{i=1}^{r}$; rather, they are biased upwards. Under (1.1) and (1.2), for fixed $i \ge 1$,

\[ \lambda_i \xrightarrow{\text{a.s.}} \lambda(\ell_i), \tag{1.3} \]

where $\lambda(\ell) \equiv \lambda(\ell, \gamma)$ is the "eigenvalue mapping" function, given piecewise by

\[ \lambda(\ell) = \begin{cases} \ell + \dfrac{\gamma \ell}{\ell - 1}, & \ell > 1 + \sqrt{\gamma}, \\[1ex] (1 + \sqrt{\gamma})^2, & \ell \le 1 + \sqrt{\gamma}. \end{cases} \tag{1.4} \]

The transition point $\ell_+(\gamma) = 1 + \sqrt{\gamma}$ between the two behaviors is known as the Baik-Ben Arous-Péché (BBP) transition. Below the transition, $1 < \ell \le \ell_+(\gamma)$, "weak signal" leads to a limiting eigenvalue independent of $\ell$. For fixed $i$ such that $\ell_i \le \ell_+(\gamma)$, $\lambda_i$ tends to $\lambda_+(\gamma) = (1 + \sqrt{\gamma})^2$, the upper bulk edge of the Marchenko-Pastur distribution with parameter $\gamma$.

Above the transition, $\ell > \ell_+(\gamma)$, "strong signal" produces an empirical eigenvalue dependent on $\ell$, though biased upwards. For fixed $i$ such that $\ell_i > \ell_+(\gamma)$, $\lambda_i$ "emerges from the bulk," approaching a limit $\lambda(\ell_i) > \ell_i$. This asymptotic bias in extreme eigenvalues is a further cause of inconsistency of $S$ in several loss measures, including operator norm loss.

3. Eigenvector inconsistency. The eigenvectors $v_1, \ldots, v_p$ of $S$ do not align asymptotically with the corresponding eigenvectors $u_1, \ldots, u_p$ of $\Sigma$. Under (1.1) and (1.2), assuming supercritical spiked eigenvalues (those with $\ell_i > \ell_+(\gamma)$) are distinct, the limiting angles are deterministic and obey

\[ |\langle u_i, v_j \rangle| \xrightarrow{\text{a.s.}} \delta_{ij} \cdot c(\ell_i), \qquad 1 \le i, j \le r; \tag{1.5} \]

here the "cosine" function $c(\ell) \equiv c(\ell, \gamma)$ is given piecewise by

\[ c^2(\ell) = \begin{cases} \dfrac{1 - \gamma/(\ell - 1)^2}{1 + \gamma/(\ell - 1)}, & \ell > 1 + \sqrt{\gamma}, \\[1ex] 0, & \ell \le 1 + \sqrt{\gamma}. \end{cases} \tag{1.6} \]

Again, a phase transition occurs at $\ell_+(\gamma)$. This misalignment of empirical and theoretical eigenvectors further contributes to inconsistency; this is easiest to see for Frobenius loss.
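The eigenvalue mapping (1.4) and cosine function (1.6) translate directly into code, and a small simulation gives a rough check of the limits (1.3) and (1.5). This is a sketch under stated assumptions: the function names are hypothetical, the simulation uses a single spike with $\Sigma = \mathrm{diag}(\ell, 1, \ldots, 1)$ so that $u_1$ is the first coordinate vector, and finite-sample values will only approximate the limits.

```python
import numpy as np


def eigenvalue_map(ell, gamma):
    """Limit lambda(ell, gamma) of a leading sample eigenvalue, as in (1.4)."""
    edge = 1.0 + np.sqrt(gamma)            # BBP transition ell_+(gamma)
    if ell > edge:
        return ell + gamma * ell / (ell - 1.0)
    return edge ** 2                        # upper bulk edge lambda_+(gamma)


def cosine_sq(ell, gamma):
    """Limiting squared cosine c^2(ell, gamma), as in (1.6)."""
    edge = 1.0 + np.sqrt(gamma)
    if ell > edge:
        return (1.0 - gamma / (ell - 1.0) ** 2) / (1.0 + gamma / (ell - 1.0))
    return 0.0


def simulate_top_eigenpair(ell, n, p, seed=0):
    """Top eigenvalue of S and squared cosine between v_1 and u_1 = e_1.

    Illustrative setup: Sigma = diag(ell, 1, ..., 1), a single spike.
    """
    rng = np.random.default_rng(seed)
    x = rng.standard_normal((p, n))
    x[0, :] *= np.sqrt(ell)                 # first coordinate has variance ell
    s = x @ x.T / n
    lam, v = np.linalg.eigh(s)              # ascending order
    return lam[-1], v[0, -1] ** 2
```

For example, with $\ell = 4$ and $\gamma = 1/2$ the formulas give $\lambda(\ell) = 4 + 2/3$, and `simulate_top_eigenpair(4.0, n=1000, p=500)` should return values near `eigenvalue_map(4.0, 0.5)` and `cosine_sq(4.0, 0.5)`, up to sampling fluctuation.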

1.2 Shrinkage Estimation

Charles Stein proposed eigenvalue shrinkage as an alternative to traditional covariance estimation [35, 36]. Let $S = V \Lambda V'$ be an eigendecomposition, where $V$ is orthogonal and $\Lambda = \mathrm{diag}(\lambda_1, \ldots, \lambda_p)$. Let $\eta : [0, \infty) \to [0, \infty)$ denote a scalar "rule" or "nonlinearity" or "shrinker," and adopt the convention $\eta(\Lambda) \equiv \mathrm{diag}(\eta(\lambda_1), \ldots, \eta(\lambda_p))$.^1 Estimators of the form $\hat{\Sigma}_\eta = V \eta(\Lambda) V'$ are studied in hundreds of papers; see the works of Donoho, Gavish, and Johnstone [16] (and the extensive references therein) and Ledoit and Wolf [24, 25]. Note that despite possible ambiguities in the choice of eigenvectors $V$, $\hat{\Sigma}_\eta$ is well defined.^2

^1 These are common synonyms in shrinkage literature. Note that a nonlinearity may in fact act linearly and a shrinker may act not as a contraction.

^2 The signs of eigenvectors are arbitrary. In the case of degenerate eigenvalues, there is additional eigenvector ambiguity.

The standard empirical covariance estimator $S$ results from the identity rule, $\eta(\lambda) = \lambda$; we will see that under various losses, rules acting as contractions are beneficial, obeying $|\eta(\lambda) - 1| < |\lambda - 1|$. In the spiked model, a well-chosen shrinker mitigates the estimation errors induced by eigenvalue bias and eigenvector inconsistency. Working under the proportional framework, the authors of [16] examine dozens of loss functions $L$ and derive for each an asymptotically unique admissible shrinker $\eta^*(\cdot \,|\, L)$, in many cases far outperforming $S$.
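The estimator $\hat{\Sigma}_\eta = V \eta(\Lambda) V'$ is straightforward to compute from an eigendecomposition. The sketch below assumes NumPy; `shrink_covariance` and `halfway_rule` are hypothetical names, and `halfway_rule` is only a toy contraction satisfying $|\eta(\lambda) - 1| < |\lambda - 1|$, not one of the optimal rules derived in the paper or in [16].

```python
import numpy as np


def shrink_covariance(s, eta):
    """Apply a scalar rule eta to the eigenvalues of a symmetric matrix s.

    Returns V eta(Lambda) V', where s = V Lambda V'.
    """
    lam, v = np.linalg.eigh(s)              # s = v @ diag(lam) @ v.T
    return (v * eta(lam)) @ v.T             # scales column j of v by eta(lam_j)


def halfway_rule(lam):
    """Toy contraction toward 1 (illustrative only, not an optimal shrinker)."""
    return 1.0 + 0.5 * (lam - 1.0)
```

The identity rule `lambda lam: lam` reproduces $S$ itself, and because `halfway_rule` is affine, `shrink_covariance(s, halfway_rule)` equals $\tfrac{1}{2} I + \tfrac{1}{2} S$, which gives a quick sanity check of the implementation.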

1.3 Which Choice of Asymptotic Framework?

The modern "big data" explosion exhibits all manner of ratios of dimension to sample size. Indeed, there are internet traffic datasets with billions of samples and thousands of dimensions, and computational biology datasets with thousands of samples and millions of dimensions. To consider only asymptotic frameworks where row and column counts are roughly balanced, as they are under proportional growth, is a restriction, and perhaps even an obstacle.

Although proportional-growth analysis has yielded many valuable insights, practitioners have expressed doubts about its applicability. In a given application, with a single dataset of size $(n_{\text{data}}, p_{\text{data}})$, is the proportional-growth model relevant? No infinite sequence of dataset sizes is visible.

Implicit in the choice of asymptotic framework is an assumption on how this one dataset embeds in a sequence of growing datasets. Should one view the data as arising within the fixed-$p$ asymptotic framework $(n, p_{\text{data}})$ with only $n$ varying? If so, long tradition recommends estimating $\Sigma$ by $S$. On the other hand, if one views the dataset as arising from a sequence of proportionally growing datasets of sizes $(n, (p_{\text{data}}/n_{\text{data}}) \cdot n)$, with constant aspect ratio $\gamma = p_{\text{data}}/n_{\text{data}}$, recent trends in the theoretical literature recommend applying eigenvalue shrinkage. Current theory offers little guidance on the choice of asymptotic framework, which dictates whether and how much to shrink. Moreover, there are many possible asymptotic frameworks containing $(n_{\text{data}}, p_{\text{data}})$.

1.4 Disproportional Growth

Within the full spectrum of power law scalings $p \asymp n^{\alpha}$, $\alpha \ge 0$, the much-studied proportional-growth limit corresponds to the single case $\alpha = 1$. The classical $p$-fixed, $n$-growing relation again corresponds to the single case $\alpha = 0$. This paper considers disproportional growth, encompassing everything else:

\[ n, p \to \infty, \qquad \gamma_n = p/n \to 0 \ \text{or} \ \infty. \]

Note that all power law scalings $0 < \alpha < \infty$, $\alpha \ne 1$ are included, as well as non-power law scalings, such as $p = \log n$ or $p = e^n$. The disproportional-growth framework splits naturally into two instances; to describe them, we use terminology that assumes the underlying data matrices $X \equiv X_n$ are $p \times n$.

1. The "wide matrix" disproportional limit obeys:

\[ n, p \to \infty, \qquad \gamma_n = p/n \to 0. \tag{1.7} \]

In this limit, which includes power laws with $\alpha \in (0, 1)$, $n$ is much larger than $p$, and yet we are outside the classical, fixed-$p$, large-$n$ setting.

2. The "tall matrix" disproportional limit involves arrays with many more rows than columns; formally:

\[ n, p \to \infty, \qquad \gamma_n = p/n \to \infty. \tag{1.8} \]

This limit, including power laws with $\alpha \in (1, \infty)$, admits many additional scalings of numbers of rows to columns.

Properties of covariance matrices in the two disproportional limits are closely linked. Indeed, the non-zero eigenvalues of $XX'$ and $X'X$ are equal. For any sequence of tall datasets with $\gamma_n \to \infty$, there is an accompanying sequence of wide datasets with $\gamma_n \to 0$ and related spectral properties.
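The link between the two limits rests on the shared non-zero spectrum of $XX'$ and $X'X$, which is easy to confirm numerically. The sketch below assumes NumPy and a Gaussian $p \times n$ matrix with $p < n$; the function name `gram_spectra` is illustrative.

```python
import numpy as np


def gram_spectra(p, n, seed=0):
    """Eigenvalues (ascending) of X X' (p x p) and X' X (n x n) for X of size p x n."""
    rng = np.random.default_rng(seed)
    x = rng.standard_normal((p, n))
    small = np.linalg.eigvalsh(x @ x.T)     # p eigenvalues
    big = np.linalg.eigvalsh(x.T @ x)       # n eigenvalues, n - p of them zero
    return small, big
```

For `small, big = gram_spectra(5, 40)`, the five largest entries of `big` match `small` to numerical precision and the remaining 35 entries are zero up to rounding, so transposing a wide data matrix yields a tall one with the same non-zero spectrum.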

1.5 The $\gamma_n \to 0$ Asymptotic Framework

The $\gamma_n \to 0$ regime seems, at first glance, very different from the proportional case, $\gamma_n \to \gamma > 0$. Neither eigenvalue spreading nor eigenvalue bias is apparent: under (1.2), empirical eigenvalues converge to their theoretical counterparts, $\lambda_i \xrightarrow{\text{a.s.}} \ell_i$, $1 \le i \le p$. Moreover, the leading eigenvectors of $S$ consistently estimate the corresponding eigenvectors of $\Sigma$: $|\langle u_i, v_j \rangle| \xrightarrow{\text{a.s.}} \delta_{ij}$, $1 \le i, j \le r$. Eigenvalue shrinkage therefore seems irrelevant, as $S$ itself is a consistent estimator of $\Sigma$ in Frobenius and operator norms. To the contrary, we introduce an asymptotic framework in which well-designed shrinkage rules confer substantial relative gains over the identity rule, paralleling gains seen earlier under proportional growth.

As $\gamma_n \to 0$, the empirical spectral measure of $S$ has support with width approximately $4\sqrt{\gamma_n}$. Accordingly, we study spiked eigenvalues varying with $n$,

\[ \ell_i \equiv \ell_{i,n} = 1 + \overset{\leftharpoonup}{\ell}_i \sqrt{\gamma_n}\,(1 + o(1)), \]

where $(\overset{\leftharpoonup}{\ell}_i)_{i=1}^{r}$ are new parameters held constant. This scale, we shall see, is the critical scale under which eigenvalue bias and eigenvector inconsistency occur. Analogs of (1.3)-(1.6) as $\gamma_n \to 0$ are given by simple expressions involving $\overset{\leftharpoonup}{\ell}$ and normalized empirical eigenvalues $\overset{\leftharpoonup}{\lambda} = (\lambda - 1 - \gamma_n)/\sqrt{\gamma_n}$, with a phase transition occurring precisely at $\overset{\leftharpoonup}{\ell} = 1$. Above the transition, $\overset{\leftharpoonup}{\ell} > 1$, (1) $\overset{\leftharpoonup}{\lambda}$ approaches a limit dependent on $\overset{\leftharpoonup}{\ell}$, though biased upwards, and (2) the angles between the leading eigenvectors of $S$ and corresponding eigenvectors of $\Sigma$ tend to nonzero limits.
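One way to see why this normalization is natural is to substitute $\ell = 1 + t\sqrt{\gamma}$ formally into the proportional-regime formulas (1.4) and (1.6). Algebraically, for $t > 1$ the normalized value $(\lambda(\ell) - 1 - \gamma)/\sqrt{\gamma}$ equals $t + 1/t$ exactly, and $c^2(\ell) = (1 - 1/t^2)/(1 + \sqrt{\gamma}/t)$, which tends to $1 - 1/t^2$ as $\gamma \to 0$. The sketch below checks this arithmetic numerically. It is a heuristic consistency check only: (1.3)-(1.6) are stated for fixed $\gamma > 0$, so this substitution is not the paper's argument, and the symbol `t` (standing in for the normalized spike parameter) and the function names are assumptions of the example.

```python
import math


def eigenvalue_map(ell, gamma):
    """Eigenvalue mapping lambda(ell, gamma) from (1.4)."""
    edge = 1.0 + math.sqrt(gamma)
    return ell + gamma * ell / (ell - 1.0) if ell > edge else edge ** 2


def cosine_sq(ell, gamma):
    """Squared cosine c^2(ell, gamma) from (1.6)."""
    edge = 1.0 + math.sqrt(gamma)
    if ell > edge:
        return (1.0 - gamma / (ell - 1.0) ** 2) / (1.0 + gamma / (ell - 1.0))
    return 0.0


def normalized_limit(t, gamma):
    """(lambda(ell) - 1 - gamma) / sqrt(gamma) with ell = 1 + t * sqrt(gamma)."""
    ell = 1.0 + t * math.sqrt(gamma)
    return (eigenvalue_map(ell, gamma) - 1.0 - gamma) / math.sqrt(gamma)
```

Evaluating `normalized_limit(t, gamma)` for several small `gamma` returns `t + 1/t` when `t > 1` and `2` when `t <= 1`, so in normalized coordinates the transition sits at `t = 1` and the bulk edge at `2`, consistent with a bulk of width $4\sqrt{\gamma_n}$.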

The consequences of such high-dimensional phenomena are similar to yet distinct from those uncovered in the proportional setting. For many choices of loss function, $S$ is outperformed substantially by well-designed shrinkage rules, particularly near the phase transition at $\ell_+(\gamma_n)$. We will consider a range of loss functions $L$, deriving for each a shrinker $\eta^*(\cdot \,|\, L)$ which is optimal as $\gamma_n \to 0$. Analogous results hold as $\gamma_n \to \infty$.

1.6 Estimation in the Spiked Wigner Model

At the heart of our analysis is a connection to the spiked Wigner model. Let $W = W_n$ denote a Wigner matrix, a real symmetric matrix of size $n \times n$ with independent entries on the upper triangle distributed as $N(0, 1)$. Let $\Theta = \Theta_n$ denote a symmetric $n \times n$ "signal" matrix of fixed rank $r$; under the spiked Wigner model, observed data $Y = Y_n$ obeys

\[ Y = \Theta + \frac{1}{\sqrt{n}} W. \tag{1.9} \]

Let $\theta_1 \ge \cdots \ge \theta_{r_+} > 0 > \theta_{r_+ + 1} \ge \cdots \ge \theta_r$ denote the non-zero eigenvalues of $\Theta$, so there are $r_+$ positive values and $r_- = r - r_+$ negative.

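A standard approach to recovering $\Theta$ from noisy data $Y$ uses the eigenvalues of $Y$, $\lambda_1(Y) \ge \cdots \ge \lambda_n(Y)$, and the associated eigenvectors $v_1, \ldots, v_n$:

\[ \hat{\Theta}^{r} = \sum_{i=1}^{r_+} \lambda_i(Y) v_i v_i' + \sum_{i=n-r_-+1}^{n} \lambda_i(Y) v_i v_i'. \]

The rank-aware estimator $\hat{\Theta}^{r}$ can be improved upon substantially by estimators of the form

\[ \hat{\Theta}_\eta = \sum_{i=1}^{n} \eta(\lambda_i(Y)) v_i v_i', \tag{1.10} \]

with $\eta : \mathbb{R}_+ \to \mathbb{R}_+$ a well-chosen shrinkage rule.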
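Both estimators can be written in a few lines given an eigendecomposition of $Y$. The following is a minimal sketch assuming NumPy; the function names are hypothetical, and the rule passed to `shrinkage_estimate` is left to the caller (the paper's optimal rules are not reproduced here). A convenient sanity check uses the standard fact that, for a rank-one signal with $\theta > 1$, the top eigenvalue of $Y$ concentrates near $\theta + 1/\theta$.

```python
import numpy as np


def wigner(n, rng):
    """Symmetric n x n matrix with independent N(0, 1) entries on the upper triangle."""
    g = rng.standard_normal((n, n))
    return np.triu(g) + np.triu(g, 1).T


def spiked_wigner(theta, rng):
    """Observation Y = Theta + W / sqrt(n), as in model (1.9)."""
    n = theta.shape[0]
    return theta + wigner(n, rng) / np.sqrt(n)


def rank_aware_estimate(y, r_plus, r_minus):
    """Keep the r_plus largest and r_minus smallest eigenpairs of y unchanged."""
    lam, v = np.linalg.eigh(y)              # ascending order
    lam, v = lam[::-1], v[:, ::-1]          # descending, matching lambda_1 >= ...
    n = lam.size
    keep = np.zeros(n, dtype=bool)
    keep[:r_plus] = True
    keep[n - r_minus:] = True               # empty slice when r_minus == 0
    return (v[:, keep] * lam[keep]) @ v[:, keep].T


def shrinkage_estimate(y, eta):
    """Estimator of the form (1.10): apply eta to every eigenvalue of y."""
    lam, v = np.linalg.eigh(y)
    return (v * eta(lam)) @ v.T
```

As a usage example, a hard-thresholding rule such as `lambda lam: np.where(np.abs(lam) > 2.5, lam, 0.0)` can be passed as `eta`; the level 2.5 is an arbitrary illustrative value chosen to sit above the bulk edge at 2, not an optimal threshold.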

Optimal formulas for $\eta$ under the spiked Wigner model appear below; they are identical, after appropriate formal substitutions, to optimal formulas for covariance estimation in the disproportional, $\gamma_n \to 0$ limit. Moreover, the driving theoretical quantities in each setting (leading eigenvalue bias, eigenvector inconsistency, optimal shrinkers, and losses) are all "isomorphic." These equivalencies stem from the following two important limit theorems, which, although they concern quite different sequences of matrices, set forth identical limiting distributions.

Theorem 1.1 (Wigner [38, 39], Arnold [1]). The empirical spectral measure of $W/\sqrt{n}$ converges weakly almost surely to the semicircle law, with density $\omega(x) = (2\pi)^{-1} \sqrt{(4 - x^2)_+}$.

Wigner proved convergence in probability of the empirical spectral measure; this was strengthened to almost sure convergence by Arnold. By Cauchy's interlacing theorem, the conclusion of Theorem 1.1 applies as well to spiked Wigner matrices $Y$ following model (1.9).

Theorem 1.2 (Bai and Yin [3]). As $\gamma_n \to 0$, the spectral measure of $\gamma_n^{-1/2}(S - I)$ converges weakly almost surely to the semicircle law, that is, to the same limit as in Theorem 1.1.
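The common limit in Theorems 1.1 and 1.2 can be probed with a quick simulation. The sketch below assumes NumPy and $\Sigma = I$; the function names and matrix sizes are illustrative. It computes the spectrum of $W/\sqrt{n}$ and of $\gamma_n^{-1/2}(S - I)$ so that both can be compared against the semicircle density, whose second moment is 1 and whose support is $[-2, 2]$.

```python
import numpy as np


def semicircle_density(x):
    """Semicircle density omega(x) = sqrt((4 - x^2)_+) / (2 pi)."""
    x = np.asarray(x, dtype=float)
    return np.sqrt(np.maximum(4.0 - x ** 2, 0.0)) / (2.0 * np.pi)


def wigner_spectrum(n, seed=0):
    """Eigenvalues of W / sqrt(n) for a Gaussian Wigner matrix W."""
    rng = np.random.default_rng(seed)
    g = rng.standard_normal((n, n))
    w = np.triu(g) + np.triu(g, 1).T        # symmetric, N(0, 1) upper triangle
    return np.linalg.eigvalsh(w / np.sqrt(n))


def normalized_cov_spectrum(n, p, seed=0):
    """Eigenvalues of gamma_n^{-1/2} (S - I) with Sigma = I and gamma_n = p / n."""
    rng = np.random.default_rng(seed)
    x = rng.standard_normal((p, n))
    s = x @ x.T / n
    return np.linalg.eigvalsh((s - np.eye(p)) / np.sqrt(p / n))
```

With, say, `wigner_spectrum(400)` and `normalized_cov_spectrum(n=10000, p=100)`, histograms of the two spectra should both resemble `semicircle_density`, and the mean of the squared eigenvalues should be close to 1 in each case, up to finite-size effects.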

1.7 Our Contributions

Given this background, we now state our contributions:

1. We study the disproportional $\gamma_n \to 0$ framework with an eye towards developing analogs of (1.3)-(1.6). In the critical scaling of this regime, spiked eigenvalues decay towards one as $1 + \overset{\leftharpoonup}{\ell} \sqrt{\gamma_n}$, where $\overset{\leftharpoonup}{\ell}$ is a new formal parameter. Analogs of (1.3)-(1.6) as a function of $\overset{\leftharpoonup}{\ell}$ are presented in Lemma 3.1 below. On this scale, the analog of the BBP phase transition (the critical spike strength above which leading eigenvectors of $S$ correlate with those of $\Sigma$) now occurs at $\overset{\leftharpoonup}{\ell} = 1$. While equivalent formulas are given by Bloemendal et al. [11], we work under weaker assumptions, allowing general rates at which $n, p \to \infty$ while $\gamma_n \to 0$, and giving a simple, direct argument. Analogous results hold as $\gamma_n \to \infty$, explored in later sections.

2. From the disproportional analogs of (1.3)-(1.6), we derive new optimal rules for shrinkage of leading eigenvalues under fifteen canonical loss functions. Optimal shrinkage provides improvement by multiplicative factors; e.g., Table 2 indicates relative loss improvements over the standard empirical covariance of 50% or higher when $\overset{\leftharpoonup}{\ell}$ is not large. Furthermore, for some losses, we obtain unique asymptotic admissibility (see Definition 3.5): within this framework, no other rule is better under any set of spiked eigenvalue parameters. We derive closed forms for the relative gain of optimal shrinkage over the empirical covariance matrix. In addition, we find optimal hard thresholding levels under each loss.

3. Remarkably, the $n, p \to \infty$, $\gamma_n \to 0$ limit is dissimilar to classical fixed-$p$ statistics: for any rate $\gamma_n \to 0$, non-trivial eigenvalue shrinkage is optimal, and for two sets of loss functions, uniquely asymptotically admissible.

4. Our optimal rules and losses are the limits, in the disproportional framework, of proportional-regime optimal rules and losses. Consequently, we obtain framework-agnostic shrinkage rules that achieve optimal performance across the proportional and disproportional ($\gamma_n \to 0$ or $\gamma_n \to \infty$) asymptotics. Given a dataset of size $(n_{\text{data}}, p_{\text{data}})$, there is a single shrinkage rule depending only on $\gamma_{\text{data}} = p_{\text{data}}/n_{\text{data}}$ (and the loss function of choice) with optimal performance in any asymptotic embedding of $(n_{\text{data}}, p_{\text{data}})$.

5. We obtain asymptotically optimal rules and losses for the spiked Wigner model, which are formally identical to optimal rules and losses of the bilateral spiked covariance model (where spiked eigenvalues may be elevated above or depressed below one).

6. We consider extensions of shrinkage to divergent spiked eigenvalues (where spiked eigenvalues, previously bounded, may now diverge). Divergent spikes are motivated by applications in which the leading eigenvalues of the covariance matrix are orders of magnitude greater than the median eigenvalue. Eigenvalue bias and eigenvector inconsistency do not occur appreciably under such strong signals, yet optimal shrinkage remains provably beneficial.

Our results offer several key takeaways. Firstly, we directly face a widespread criticism of prior theoretical work, that row and column counts are assumed proportional; such criticism is based on the empirical observation that many (if not most) modern datasets have highly asymmetric numbers of rows and columns. Secondly, we show that nontrivial leading eigenvalue shrinkage is beneficial under any of the discussed post-classical frameworks, proportional or disproportional growth, and any of a variety of loss functions.

Finally, we resolve the following "framework conundrum." In view of theoretical studies under various asymptotic frameworks, a practitioner might well think as follows:
