Fundamental limits in structured PCA,
and how to reach them
Jean Barbier†,∗, Francesco Camilli†,∗, Marco Mondelli⋄, Manuel Sáenz▽
† Abdus Salam International Center for Theoretical Physics, Italy
⋄ Institute of Science and Technology, Austria
▽ Universidad de La República, Uruguay
Abstract
How do statistical dependencies in measurement noise influence high-dimensional inference? To answer this, we study the paradigmatic spiked matrix model of principal components analysis (PCA), where a rank-one matrix is corrupted by additive noise. We go beyond the usual independence assumption on the noise entries, by drawing the noise from a low-order polynomial orthogonal matrix ensemble. The resulting noise correlations make the setting relevant for applications but analytically challenging. We provide the first characterization of the Bayes-optimal limits of inference in this model. If the spike is rotation-invariant, we show that standard spectral PCA is optimal. However, for more general priors, both PCA and the existing approximate message passing algorithm (AMP) fall short of achieving the information-theoretic limits, which we compute using the replica method from statistical mechanics. We thus propose a novel AMP, inspired by the theory of Adaptive Thouless-Anderson-Palmer equations, which saturates the theoretical limit. This AMP comes with a rigorous state evolution analysis tracking its performance. Although we focus on specific noise distributions, our methodology can be generalized to a wide class of trace matrix ensembles at the cost of more involved expressions. Finally, despite the seemingly strong assumption of rotation-invariant noise, our theory empirically predicts algorithmic performance on real data, pointing at remarkable universality properties.
Corresponding authors: jbarbier@ictp.it, fcamilli@ictp.it
arXiv:2210.01237v2 [cs.IT] 2 Jun 2023
Contents
I Main part 3
1 Introduction 4
2 Setting and main results 6
2.1 Result 1: Information-theoretical limits . . . . . . . . . . . . . . . . . . . . . . . 7
2.2 Result 2: Optimality of PCA for rotationally invariant priors . . . . . . . . . . . 8
2.3 Result 3a: Optimal pre-processing of the data . . . . . . . . . . . . . . . . . . . 8
2.4 Result 3b: Bayes-optimal AMP . . . . . . . . . . . . . . . . . . . . . . . . . . . 9
3 Numerical results and discussion 11
3.1 BAMP vs the replica prediction . . . . . . . . . . . . . . . . . . . . . . . . . . . 11
3.2 Universality of the rotational invariance assumption . . . . . . . . . . . . . . . . 12
4 Methods 12
4.1 Outline of the replica computation . . . . . . . . . . . . . . . . . . . . . . . . . 12
4.2 Auxiliary AMP and Onsager coefficients . . . . . . . . . . . . . . . . . . . . . . 14
II Supplementary Information 16
5 Introduction, problem setting and main results 16
5.1 Introduction and related works . . . . . . . . . . . . . . . . . . . . . . . . . . . 16
5.2 Probabilistic model of PCA with structured noise . . . . . . . . . . . . . . . . . 19
5.3 Concrete examples: the quartic and sestic ensembles . . . . . . . . . . . . . . . . 21
5.4 Main results . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 23
6 The inhomogeneous spherical integral 29
6.1 Definition and variational characterization . . . . . . . . . . . . . . . . . . . . . 30
6.2 Special cases . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 31
6.2.1 Low-rank HCIZ integral . . . . . . . . . . . . . . . . . . . . . . . . . . . 31
6.2.2 Low-rank spherical integral with external field and diagonal replica coupling 31
6.2.3 Low-rank spherical integral with non-diagonal replica coupling and replica
symmetric overlap . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 31
6.3 Derivation of the variational formula . . . . . . . . . . . . . . . . . . . . . . . . 33
7 Information-theoretic analysis by the replica method 35
7.1 An equivalent quadratic model . . . . . . . . . . . . . . . . . . . . . . . . . . . . 35
7.2 Replica symmetric free entropy using the inhomogeneous spherical integral . . . 38
7.3 Replica saddle point equations . . . . . . . . . . . . . . . . . . . . . . . . . . . . 41
7.4 The replica formula for the pure sestic potential . . . . . . . . . . . . . . . . . . 44
7.5 Replica saddle point equations for the pure sestic potential . . . . . . . . . . . . 46
7.6 Spectral PCA is optimal for rotation-invariant signals . . . . . . . . . . . . . . . 49
8 Sub-optimality of the previously proposed AMP 52
8.1 Analysis of the one-step AMP fixed point performance . . . . . . . . . . . . . . 53
8.2 Analysis of the replica Bayes-optimal fixed point . . . . . . . . . . . . . . . . . . 54
8.3 What is this sub-optimal AMP actually doing?
Mismatched estimation with Gaussian likelihood . . . . . . . . . . . . . . . . . . 56
9 Towards an optimal AMP: AdaTAP formalism 62
9.1 The AdaTAP single-instance free entropy . . . . . . . . . . . . . . . . . . . . . . 62
9.2 Saddle point: reduction to an Ising model, AdaTAP equations and optimal pre-
processing of the data . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 64
9.3 Optimal pre-processing for the order 6 potential . . . . . . . . . . . . . . . . . . 65
9.4 Simplifying the AdaTAP equations by self-averaging of the Onsager reaction term 67
10 Approximate message passing, optimally 70
10.1 BAMP: Bayes-optimal AMP . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 71
10.2 Onsager coefficients and state evolution recursion . . . . . . . . . . . . . . . . . 72
10.3 AMP-AP: AMP with Alternating Posteriors . . . . . . . . . . . . . . . . . . . . 75
11 Numerics 76
11.1 Spectral properties of the pre-processed matrix J(Y) . . . . . . . . . . . . . . . 76
11.2 BAMP and AMP-AP improve over the existing AMP and match the replica
prediction for the MMSE, and empirical universality of the rotational invariance
assumption . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 77
A Approximation of non-polynomial potentials 83
B Learning the optimal pre-processing J(Y) 86
C Proofs for BAMP 88
C.1 Auxiliary AMP . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 88
C.2 State evolution of auxiliary AMP . . . . . . . . . . . . . . . . . . . . . . . . . . 88
C.3 Proof of Theorem 2 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 90
Part I
Main part
1 Introduction
The success of inference and learning algorithms depends strongly on the structure of the high-dimensional noisy data they process. Consequently, quantifying how this structure helps algorithms overcome the curse of dimensionality has become a central topic in statistics and machine learning. Classical examples include sparsity in compressed sensing [39], low-rank structure in matrix recovery [32], or community structure in community detection [1]. In all these models, structure is usually assumed only at the signal's level. But the decomposition of the data into "signal" (the component considered of interest) and "noise" (the rest) is often arbitrary and application-dependent. E.g., in the classification of "dogs/cats", the training images contain a lot of information unrelated to dogs and cats, e.g., on the notions of "inside/outside", "day/night", etc. Yet, this highly structured potential source of information is discarded as random noise (independent, Gaussian, etc.). Most of the research effort has thus focused on understanding how the signal structure alone helps in inferring it. In contrast, much less is known about the role of the noise structure and how to exploit it to improve inference.
Given their ubiquitous appearance in the statistics literature, spiked matrix models, which were originally formulated as models for probabilistic principal components analysis (PCA) [62], are now a paradigm in high-dimensional inference. Thanks to their universality features, they and their generalizations find numerous applications in other central problems, including community detection [1], group synchronization [99] and sub-matrix localization or high-dimensional clustering [72]. They thus offer the perfect benchmark to quantify the influence of noise structure. In this paper we focus on the following estimation problem: a statistician needs to extract a rank-one matrix (the spike) P := XX^⊺, X ∈ R^N, from the data

    Y = √(λ/N) P + Z ∈ R^{N×N},    (1)

with "noise" Z and signal-to-noise ratio (SNR) λ ≥ 0.
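As a concrete illustration, the following minimal sketch generates data from model (1) and runs the spectral (PCA) estimator on it. It uses the simplest unstructured (Wigner) baseline noise rather than the paper's structured ensembles, and the dimension N, SNR lam and ±1 signal prior are arbitrary choices for the demo, in a normalization where the spectral transition sits at λ = 1.

```python
import numpy as np

# Minimal sketch of model (1) with the simplest unstructured (Wigner) noise
# as a baseline; N, lam and the +/-1 prior are illustrative choices only.
rng = np.random.default_rng(0)
N, lam = 2000, 4.0
X = rng.choice([-1.0, 1.0], size=N)          # signal with ||X||^2 = N
G = rng.normal(size=(N, N))
Z = (G + G.T) / np.sqrt(2.0)                 # symmetric noise, Z_ij ~ N(0, 1)
Y = np.sqrt(lam / N) * np.outer(X, X) + Z    # data matrix as in Eq. (1)

# Spectral PCA: the top eigenvector of Y, rescaled, estimates X up to a sign.
vals, vecs = np.linalg.eigh(Y)               # eigenvalues in ascending order
x_hat = np.sqrt(N) * vecs[:, -1]
overlap = abs(x_hat @ X) / N                 # normalized correlation with X
```

In this normalization, for λ above the transition the top eigenvalue of Y/√N separates from the bulk edge at 2, and the overlap concentrates around √(1 − 1/λ); both effects are visible already at N = 2000.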
The spectral properties of finite-rank perturbations of large random matrices like (1) were intensively investigated in random matrix theory (see, e.g., [8, 9, 26]), showing the presence of a threshold phenomenon coined the BBP transition (in reference to the authors of [8]): when λ is large enough, the top eigenvalue of Y detaches from the bulk of eigenvalues. Its corresponding eigenvector then has a non-trivial projection onto the sought ground truth X, and can be used as its estimator. The problem has also been approached from the angle of Bayesian inference [68, 37, 12, 71]. In particular, besides the previous spectral estimator, there exists a whole family of iterative algorithms, known as approximate message passing (AMP), that can be tailored to take further advantage of prior structural information about the signal and noise. AMP algorithms were first proposed for estimation in linear models [63, 38], but have since been applied to a range of statistical estimation problems, including generalized linear models [16, 103] and low-rank matrix estimation [37, 86]. An attractive feature of AMP is that its performance in the high-dimensional limit can often be characterized by a succinct recursion called state evolution [24, 29]. Using the state evolution analysis, it has been proved that AMP achieves Bayes-optimal performance for some models [37, 86, 16], and a conjecture posits that for a wide range of estimation problems, AMP is optimal among polynomial-time algorithms [87].
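To make the notion of state evolution concrete, here is a sketch of the scalar recursion for Bayes-optimal AMP in the classical spiked Wigner model with i.i.d. Gaussian noise and a ±1 prior — a standard special case, not the structured-noise algorithm developed in this paper. The parameter gamma_t tracks the effective SNR of the AMP iterate; q_t = gamma_t / lam is the overlap parameter, with per-coordinate error 1 − q_t.

```python
import numpy as np

# State evolution for Bayes-AMP on the classical spiked Wigner model
# (i.i.d. Gaussian noise, +/-1 prior): gamma_{t+1} = lam * E[tanh^2(gamma_t
# + sqrt(gamma_t) G)], G ~ N(0,1). The expectation is estimated by Monte Carlo.
def state_evolution(lam, iters=50, n_mc=200_000, seed=0):
    g = np.random.default_rng(seed).normal(size=n_mc)
    gamma = 1e-3                      # small positive initialization
    for _ in range(iters):
        gamma = lam * np.mean(np.tanh(gamma + np.sqrt(gamma) * g) ** 2)
    return gamma
```

Below the transition (λ < 1 in this normalization) the recursion collapses to zero, so AMP achieves no overlap with the signal; above it, gamma converges to a non-trivial fixed point.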
The references mentioned above rely on the assumption of independent and identically distributed (i.i.d.) noise, often taken Gaussian, Z_ij = Z_ji ∼ N(0,1), under which (1) is the well-known spiked Wigner model [62]. This independence, or "absence of structure", in the noise greatly simplifies the analysis. In order to relax this property, we may seek inspiration from the statistical physics literature on disordered systems. An idea first brought forth in [23, 95] for the Sherrington-Kirkpatrick model, and later imported into high-dimensional inference [5, 57], is that of giving an inhomogeneous variance profile to the noise matrix elements (this idea is similar in spirit to the earlier notion of "spatially coupled systems" [47, 70] in coding theory; see [12] for its use in the present context). The procedure makes the (Z_ij) no longer identically distributed, but it leaves them independent. This is an important step towards more structure in the noise. Yet, the independence assumption is a rather strong one. In fact, [57] showed that a broad class of observation models, as long as the independence assumption holds, are information-theoretically equivalent to models with independent Gaussian noise.
One way to go beyond is to consider noise belonging to the wider class of rotationally invariant matrices. Since the appearance of the seminal works [80, 81, 97], there has been a remarkable development in this direction, as evidenced by the rapidly growing number of papers on spin glasses [92, 94, 77] and inference [53, 45, 110, 109] that take structured disorder into account, including the present one. Indeed, we hereby consider a spiked model in which the noise Z is drawn from an orthogonal matrix ensemble different from the Gaussian orthogonal ensemble (the only one with independent entries). Intuitively, the presence of dependencies in the noise should be an advantage for an algorithm sharp enough to see patterns within it and use them to retrieve the sought low-rank matrix. Going in that direction, [45] proposed a version of AMP designed for rotationally invariant noise (using earlier ideas of [94, 92]). Furthermore, in a recent work [14], some of the authors analysed a Bayes estimator and an AMP, both assuming Gaussian noise, whereas the actual noise in the data was drawn from a generic orthogonal matrix ensemble. However, besides intuition and the mentioned works, to the best of our knowledge there is little theoretical understanding of the true role played by noise structure in spiked matrix estimation, and more generally in inference. In particular, prior to our work there was no theoretical prediction of optimal performance against which to benchmark practical inference algorithms.
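A rotationally invariant noise matrix can be sketched as Z = O diag(d) O^⊺ with O Haar-distributed, which makes the entries of Z statistically dependent as soon as the spectrum is non-Gaussian. The eigenvalue law used below (uniform) is a placeholder for illustration; the paper's ensembles instead fix the spectral density through a low-order polynomial matrix potential.

```python
import numpy as np

# Sketch: sampling structured, rotationally invariant noise Z = O diag(d) O^T.
def haar_orthogonal(n, rng):
    # QR of a Gaussian matrix, with a sign fix on R's diagonal,
    # yields a Haar-distributed orthogonal matrix.
    q, r = np.linalg.qr(rng.normal(size=(n, n)))
    return q * np.sign(np.diag(r))

rng = np.random.default_rng(1)
N = 500
O = haar_orthogonal(N, rng)
d = rng.uniform(-2.0, 2.0, size=N)   # placeholder spectrum, not a paper ensemble
Z = (O * d) @ O.T                    # symmetric; entries are now correlated
```

Only when d is a semicircle law does this construction reduce to the Gaussian orthogonal ensemble with independent entries; any other spectrum induces dependencies across the matrix elements.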