Subspace Recovery from Heterogeneous Data with Non-isotropic Noise

John Duchi (Stanford University), Vitaly Feldman (Apple), Lunjia Hu* (Stanford University), Kunal Talwar (Apple)

arXiv:2210.13497v1 [cs.LG] 24 Oct 2022
Abstract
Recovering linear subspaces from data is a fundamental and important task in statistics and machine learning. Motivated by heterogeneity in Federated Learning settings, we study a basic formulation of this problem: principal component analysis (PCA), with a focus on dealing with irregular noise. Our data come from $n$ users with user $i$ contributing data samples from a $d$-dimensional distribution with mean $\mu_i$. Our goal is to recover the linear subspace shared by $\mu_1, \ldots, \mu_n$ using the data points from all users, where every data point from user $i$ is formed by adding an independent mean-zero noise vector to $\mu_i$. If we only have one data point from every user, subspace recovery is information-theoretically impossible when the covariance matrices of the noise vectors can be non-spherical, necessitating additional restrictive assumptions in previous work. We avoid these assumptions by leveraging at least two data points from each user, which allows us to design an efficiently computable estimator under non-spherical and user-dependent noise. We prove an upper bound for the estimation error of our estimator in general scenarios where the number of data points and the amount of noise can vary across users, and prove an information-theoretic error lower bound that not only matches the upper bound up to a constant factor, but also holds even for spherical Gaussian noise. This implies that our estimator does not introduce additional estimation error (up to a constant factor) due to irregularity in the noise. We show additional results for a linear regression problem in a similar setup.
1 Introduction
We study the problem of learning low-dimensional structure amongst data distributions, given multiple samples from each distribution. This problem arises naturally in settings such as federated learning, where we want to learn from data coming from a set of individuals, each of whom has samples from their own distribution. These distributions, however, are related to each other, and in this work we consider the setting where these distributions have means lying in a low-dimensional subspace. The goal is to learn this subspace, even when the distributions may have different (and potentially non-spherical) variances. This heterogeneity can manifest itself in practice as differing numbers of samples per user, or as the variance differing across individuals, possibly depending on their mean. Recovery of the subspace containing the means can in turn help better estimate individual means. In other words, this can allow for learning good estimators for all individual means by leveraging information from all the individuals.

*Part of this work was performed while LH was interning at Apple. LH is also supported by Omer Reingold's NSF Award IIS-1908774, Omer Reingold's Simons Foundation Investigators Award 689988, and Moses Charikar's Simons Foundation Investigators Award.
The irregularity of the noise makes this task challenging even when we have sufficiently many individual distributions. For example, suppose we have $n$ individuals and for every $i = 1, \ldots, n$, an unknown $\mu_i \in \mathbb{R}^d$. For simplicity, suppose that $\mu_1, \ldots, \mu_n$ are distributed independently as $N(0, \sigma^2 uu^T)$ for $\sigma \in \mathbb{R}_{\ge 0}$ and an unknown unit vector $u \in \mathbb{R}^d$. In this setting, our goal is to recover the one-dimensional subspace, equivalently the vector $u$. For every $i$, we have a data point $x_i = \mu_i + z_i$ where $z_i \in \mathbb{R}^d$ is a mean-zero noise vector. If $z_i$ is drawn independently from a spherical Gaussian $N(0, \alpha^2 I)$, we can recover the unknown subspace with arbitrary accuracy as $n$ grows to infinity because $\frac{1}{n}\sum_i x_i x_i^T$ concentrates to $\mathbb{E}[x_i x_i^T] = \sigma^2 uu^T + \alpha^2 I$, whose top eigenvector is $\pm u$. However, if the noise $z_i$ is drawn from a non-spherical distribution, the top eigenvector of $\frac{1}{n}\sum_i x_i x_i^T$ can deviate from $\pm u$ significantly, and to make things worse, if the noise $z_i$ is drawn independently from the non-spherical Gaussian $N(0, \sigma^2(I - uu^T) + \alpha^2 I)$, then our data points $x_i = \mu_i + z_i$ are distributed independently as $N(0, (\sigma^2 + \alpha^2)I)$, giving no information about the vector $u$.¹

¹This information-theoretic impossibility naturally extends to recovering $k$-dimensional subspaces for $k > 1$ by replacing the unit vector $u \in \mathbb{R}^d$ with a matrix $U \in \mathbb{R}^{d \times k}$ with orthonormal columns.
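To make this failure mode concrete, here is a small simulation (ours, not from the paper) of the example above: with spherical noise the top eigenvector of the empirical second moment aligns with $u$, while with the non-spherical noise $N(0, \sigma^2(I - uu^T) + \alpha^2 I)$ it carries essentially no information about $u$. All parameter values are arbitrary illustrative choices.

```python
import numpy as np

rng = np.random.default_rng(0)
n, d, sigma, alpha = 20000, 20, 1.0, 1.0   # arbitrary illustrative values

u = rng.standard_normal(d)
u /= np.linalg.norm(u)                            # unknown unit vector
mu = rng.standard_normal((n, 1)) * sigma * u      # mu_i ~ N(0, sigma^2 u u^T)

def alignment_of_top_eigvec(noise_cov):
    """One data point per user; return |<top eigenvector of (1/n) sum x_i x_i^T, u>|."""
    z = rng.multivariate_normal(np.zeros(d), noise_cov, size=n)
    x = mu + z
    second_moment = x.T @ x / n
    _, vecs = np.linalg.eigh(second_moment)       # eigenvalues in ascending order
    return abs(vecs[:, -1] @ u)

spherical = alpha ** 2 * np.eye(d)
non_spherical = sigma ** 2 * (np.eye(d) - np.outer(u, u)) + alpha ** 2 * np.eye(d)

print("spherical noise:    ", alignment_of_top_eigvec(spherical))      # close to 1
print("non-spherical noise:", alignment_of_top_eigvec(non_spherical))  # far from 1: no signal about u
```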
The information-theoretic impossibility in this example, however, disappears as soon as one has at least two samples from each distribution. Indeed, given two data points $x_{i1} = \mu_i + z_{i1}$ and $x_{i2} = \mu_i + z_{i2}$ from user $i$, as long as the noise vectors $z_{i1}, z_{i2}$ are independent and have zero mean, we always have $\mathbb{E}[x_{i1} x_{i2}^T] = \sigma^2 uu^T$ regardless of the specific distributions of $z_{i1}$ and $z_{i2}$. This allows us to recover the subspace in this example, as long as we have sufficiently many users each contributing at least two examples.
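Continuing the simulation above (again ours, for illustration only), the symmetrized cross-moment $\frac{1}{n}\sum_i \frac{1}{2}(x_{i1} x_{i2}^T + x_{i2} x_{i1}^T)$ recovers $u$ even under the non-spherical noise; for two points per user and uniform weights this coincides with the estimator $A$ defined in equation (1) of Section 1.1.

```python
import numpy as np

rng = np.random.default_rng(1)
n, d, sigma, alpha = 20000, 20, 1.0, 1.0   # same illustrative values as above

u = rng.standard_normal(d)
u /= np.linalg.norm(u)
mu = rng.standard_normal((n, 1)) * sigma * u      # mu_i ~ N(0, sigma^2 u u^T)

non_spherical = sigma ** 2 * (np.eye(d) - np.outer(u, u)) + alpha ** 2 * np.eye(d)
z1 = rng.multivariate_normal(np.zeros(d), non_spherical, size=n)
z2 = rng.multivariate_normal(np.zeros(d), non_spherical, size=n)
x1, x2 = mu + z1, mu + z2                         # two data points per user

# (1/n) sum_i (x_{i1} x_{i2}^T + x_{i2} x_{i1}^T) / 2, whose expectation is sigma^2 u u^T
cross_moment = (x1.T @ x2 + x2.T @ x1) / (2 * n)
_, vecs = np.linalg.eigh(cross_moment)
print("alignment with u:", abs(vecs[:, -1] @ u))  # close to 1 despite the non-spherical noise
```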
As this is commonly the case in our motivating examples, we make this assumption of multiple data points per user, and show that this intuition extends well beyond this particular example. We design efficiently computable estimators for this subspace recovery problem given samples from multiple heteroscedastic distributions (see Section 1.1 for details). We prove upper bounds on the error of our estimator measured in the maximum principal angle (see Section 2 for the definition). We also prove an information-theoretic error lower bound, showing that our estimator achieves the optimal error up to a constant factor in general scenarios where the number of data points and the amount of noise can vary across users. Somewhat surprisingly, our lower bound holds even when the noise is distributed as a spherical Gaussian. Thus, non-spherical noise in our setting does not lead to increased error.
We then show that our techniques extend beyond the mean estimation problem to a linear regression setting where, for each $\mu_i$, we get (at least two) samples $(x_{ij}, x_{ij}^T \mu_i + z_{ij})$, where $z_{ij}$ is zero-mean noise from some noise distribution that depends on $i$ and $x_{ij}$. This turns out to be a model that was recently studied in the meta-learning literature under more restrictive assumptions (e.g., $z_{ij}$ is independent of $x_{ij}$) [Kong et al., 2020, Tripuraneni et al., 2021, Collins et al., 2021, Thekumparampil et al., 2021]. We show a simple estimator achieving an error upper bound matching the ones in prior work without making these restrictive assumptions.
1.1 Our contributions
PCA with heterogeneous and non-isotropic noise: Upper Bounds. In the PCA setting, the data points from each user $i$ are drawn from a user-specific distribution with mean $\mu_i \in \mathbb{R}^d$, and we assume that $\mu_1, \ldots, \mu_n$ lie in a shared $k$-dimensional subspace that we want to recover. Specifically, we have $m_i$ data points $x_{ij} \in \mathbb{R}^d$ from user $i$ for $j = 1, \ldots, m_i$, and each data point is determined by $x_{ij} = \mu_i + z_{ij}$, where $z_{ij} \in \mathbb{R}^d$ is a noise vector drawn independently from a mean-zero distribution. We allow the distribution of $z_{ij}$ to be non-spherical and non-identical across different pairs $(i, j)$. We use $\eta_i \in \mathbb{R}_{\ge 0}$ to quantify the amount of noise in user $i$'s data points by assuming that $z_{ij}$ is an $\eta_i$-sub-Gaussian random variable.
As mentioned earlier, if we only have a single data point from each user, it is information-theoretically impossible to recover the subspace. Thus, we focus on the case where $m_i \ge 2$ for every $i = 1, \ldots, n$. In this setting, for appropriate weights $w_1, \ldots, w_n \in \mathbb{R}_{\ge 0}$, we compute a matrix $A$:

$$A = \sum_{i=1}^n \frac{w_i}{m_i(m_i - 1)} \sum_{j_1 \ne j_2} x_{ij_1} x_{ij_2}^T, \qquad (1)$$

where the inner summation is over all pairs $j_1, j_2 \in \{1, \ldots, m_i\}$ satisfying $j_1 \ne j_2$. Our estimator is then defined by the subspace spanned by the top-$k$ eigenvectors of $A$. Although the inner summation is over $m_i(m_i - 1)$ terms, the time complexity for computing it need not grow quadratically with $m_i$ because of the following equation:

$$\sum_{j_1 \ne j_2} x_{ij_1} x_{ij_2}^T = \left(\sum_{j=1}^{m_i} x_{ij}\right)\left(\sum_{j=1}^{m_i} x_{ij}\right)^T - \sum_{j=1}^{m_i} x_{ij} x_{ij}^T.$$
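The following sketch (ours, not the authors' code) implements the estimator defined in (1) using the identity above, so the per-user cost is $O(m_i d^2)$ rather than $O(m_i^2 d^2)$. The input format (a list of per-user arrays and a list of weights) is an assumption made for illustration.

```python
import numpy as np

def pca_subspace_estimator(X, w, k):
    """Top-k eigenvectors of the matrix A in (1).

    X: list of n arrays, X[i] of shape (m_i, d) with m_i >= 2 (user i's data points).
    w: list of n nonnegative weights.
    k: dimension of the subspace to recover.
    """
    d = X[0].shape[1]
    A = np.zeros((d, d))
    for Xi, wi in zip(X, w):
        mi = Xi.shape[0]
        s = Xi.sum(axis=0)                         # sum_j x_ij
        pair_sum = np.outer(s, s) - Xi.T @ Xi      # sum over j1 != j2 of x_{ij1} x_{ij2}^T
        A += wi / (mi * (mi - 1)) * pair_sum
    A = (A + A.T) / 2                              # symmetric up to floating-point error
    _, vecs = np.linalg.eigh(A)                    # eigenvalues in ascending order
    return vecs[:, -k:]                            # orthonormal basis of the estimated subspace
```

In the homogeneous special case discussed next, one would pass `w = [1/n] * n`.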
The flexibility in the weights $w_1, \ldots, w_n$ allows us to deal with variations in $m_i$ and $\eta_i$ for different users $i$. In the special case where $\eta_1 = \cdots = \eta_n = \eta$ and $m_1 = \cdots = m_n = m$, we choose $w_1 = \cdots = w_n = 1/n$ and we show that our estimator achieves the following error upper bound with success probability at least $1 - \delta$:

$$\sin\theta = O\left(\left(\frac{\eta\sigma_1}{\sigma_k^2\sqrt{m}} + \frac{\eta^2}{\sigma_k^2 m}\right)\sqrt{\frac{d + \log(1/\delta)}{n}}\right).$$

Here, $\theta$ is the maximum principal angle between our estimator and the true subspace shared by $\mu_1, \ldots, \mu_n$, and $\sigma_\ell^2$ is the $\ell$-th largest eigenvalue of $\frac{1}{n}\sum_{i=1}^n \mu_i \mu_i^T$. Our error upper bound for general $m_i, \eta_i, w_i$ is given in Theorem 3.1.
We instantiate our error upper bound in the case where $\mu_1, \ldots, \mu_n$ are drawn i.i.d. from a Gaussian distribution $N(0, \sigma^2 UU^T)$, where the columns of $U \in \mathbb{R}^{d \times k}$ form an orthonormal basis of the subspace containing $\mu_1, \ldots, \mu_n$. By choosing the weights $w_1, \ldots, w_n$ according to $m_1, \ldots, m_n$ and $\eta_1, \ldots, \eta_n$, our estimator achieves the error upper bound

$$\sin\theta \le O\left(\sqrt{\frac{d + \log(1/\delta)}{\sum_{i=1}^n \gamma'_i}}\right) \qquad (2)$$

under a mild assumption (Assumption 3.2), where $\gamma'_i$ is defined in Definition 3.1 and often equals $\left(\frac{\eta_i^2}{\sigma^2 m_i} + \frac{\eta_i^4}{\sigma^4 m_i^2}\right)^{-1}$.
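As a quick sanity check (ours, not a claim made in the paper), specializing (2) to the homogeneous case $\eta_i = \eta$, $m_i = m$, in which all of $\sigma_1, \ldots, \sigma_k$ are of order $\sigma$, recovers the earlier bound up to a constant factor, using $\sqrt{a + b} \asymp \sqrt{a} + \sqrt{b}$ for $a, b \ge 0$:

$$\sqrt{\frac{d + \log(1/\delta)}{\sum_{i=1}^n \gamma'_i}} = \sqrt{\left(\frac{\eta^2}{\sigma^2 m} + \frac{\eta^4}{\sigma^4 m^2}\right)\cdot\frac{d + \log(1/\delta)}{n}} \asymp \left(\frac{\eta}{\sigma\sqrt{m}} + \frac{\eta^2}{\sigma^2 m}\right)\sqrt{\frac{d + \log(1/\delta)}{n}}.$$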
PCA: Lower Bounds. We show that the error upper bound (2) is optimal up to a constant factor by proving a matching information-theoretic lower bound (Theorem 3.7). Our lower bound holds for general $m_i$ and $\eta_i$ that can vary among users $i$, and it holds even when the noise vectors $z_{ij}$ are drawn from spherical Gaussians, showing that our estimator essentially pays no additional cost in error or sample complexity due to non-isotropic noise.

We prove the lower bound using Fano's method on a local packing over the Grassmannian manifold. We carefully select a non-trivial hard distribution so that the strength of our lower bound is not affected by a group of fewer than $k$ users each having a huge number of data points with little noise.
Linear Models. While the PCA setting is the main focus of our paper, we extend our study to a related linear models setting that has recently been well studied in the meta-learning and federated learning literature [Kong et al., 2020, Tripuraneni et al., 2021, Collins et al., 2021, Thekumparampil et al., 2021]. Here, the user-specific distribution of each user $i$ is parameterized by $\beta_i \in \mathbb{R}^d$, and we again assume that $\beta_1, \ldots, \beta_n$ lie in a linear subspace that we want to recover. From each user $i$ we observe $m_i$ data points $(x_{ij}, y_{ij}) \in \mathbb{R}^d \times \mathbb{R}$ for $j = 1, \ldots, m_i$ drawn from the user-specific distribution satisfying $y_{ij} = x_{ij}^T \beta_i + z_{ij}$ for an $O(1)$-sub-Gaussian measurement vector $x_{ij} \in \mathbb{R}^d$ with zero mean and identity covariance and an $\eta_i$-sub-Gaussian mean-zero noise term $z_{ij} \in \mathbb{R}$. While it may seem that non-isotropic noise is less of a challenge in this setting since each noise term $z_{ij}$ is a scalar, our goal is to handle a challenging scenario where the variances of the noise terms $z_{ij}$ can depend on the realized measurements $x_{ij}$, which is a more general and widely applicable setting compared to those in prior work. Similarly to the PCA setting, our relaxed assumptions on the noise make it information-theoretically impossible to do subspace recovery if we only have one data point from each user (see Section 4), and thus we assume each user contributes at least two data points. We use the subspace spanned by the top-$k$ eigenvectors of the following matrix $A$ as our estimator:

$$A = \sum_{i=1}^n \frac{w_i}{m_i(m_i - 1)} \sum_{j_1 \ne j_2} (x_{ij_1} y_{ij_1})(x_{ij_2} y_{ij_2})^T. \qquad (3)$$
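Below is a sketch of the estimator in (3), analogous to the PCA sketch above; again this is our illustration rather than the authors' code, and the input format is an assumption.

```python
import numpy as np

def linear_model_subspace_estimator(X, y, w, k):
    """Top-k eigenvectors of the matrix A in (3).

    X: list of n arrays, X[i] of shape (m_i, d) with m_i >= 2 (user i's measurements).
    y: list of n arrays, y[i] of shape (m_i,) (user i's responses).
    w: list of n nonnegative weights; k: dimension of the subspace to recover.
    """
    d = X[0].shape[1]
    A = np.zeros((d, d))
    for Xi, yi, wi in zip(X, y, w):
        mi = Xi.shape[0]
        V = Xi * yi[:, None]                       # rows are x_ij * y_ij
        s = V.sum(axis=0)
        pair_sum = np.outer(s, s) - V.T @ V        # sum over j1 != j2 of (x_{ij1} y_{ij1})(x_{ij2} y_{ij2})^T
        A += wi / (mi * (mi - 1)) * pair_sum
    A = (A + A.T) / 2                              # symmetric up to floating-point error
    _, vecs = np.linalg.eigh(A)
    return vecs[:, -k:]                            # orthonormal basis of the estimated subspace
```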
In the special case where $\eta_1 = \cdots = \eta_n = \eta$, $m_1 = \cdots = m_n = m$, and $\|\beta_i\|_2 \le r$ for all $i$, our estimator achieves the following error upper bound:

$$\sin\theta \le O\left(\log^3(nd/\delta)\sqrt{\frac{d(r^4 + r^2\eta^2 + \eta^4/m)}{mn\sigma_k^4}}\right), \qquad (4)$$

where $\sigma_k^2$ is the $k$-th largest eigenvalue of $\frac{1}{n}\sum_{i=1}^n \beta_i \beta_i^T$ (Corollary L.2). Our error upper bound extends smoothly to more general cases where $\eta_i$ and $m_i$ vary among users (Theorem L.1). Moreover, our upper bound matches the ones in prior work [e.g., Tripuraneni et al., 2021, Theorem 3] despite requiring less restrictive assumptions.
1.2 Related Work
Principal component analysis under non-isotropic noise has been studied by Vaswani and Narayanamurthy [2017], Zhang et al. [2018], and Narayanamurthy and Vaswani [2020]. When translated to our setting, these papers focus on having only one data point from each user and thus they require additional assumptions: either the level of non-isotropy is low, or the noise is coordinate-wise independent and the subspace is incoherent. The estimation error guarantees in these papers depend crucially on how well these additional assumptions are satisfied. Zhu et al. [2019] and Cai et al. [2021] study PCA with noise and missing data, and Chen et al. [2021] and Cheng et al. [2021] study eigenvalue and eigenvector estimation under heteroscedastic noise. These four papers all assume that the noise is coordinate-wise independent and the subspace/eigenspace is incoherent.
The linear models setting we consider has recently been studied as a basic setting of meta-learning and federated learning by Kong et al. [2020], Tripuraneni et al. [2021], Collins et al. [2021], and Thekumparampil et al. [2021]. These papers all make the assumption that the noise terms $z_{ij}$ are independent of the measurements $x_{ij}$, an assumption that we relax in this paper. Collins et al. [2021] and Thekumparampil et al. [2021] make improvements in sample complexity and error guarantees compared to earlier work by Kong et al. [2020] and Tripuraneni et al. [2021], but Collins et al. [2021] focus on the noiseless setting ($z_{ij} = 0$) and Thekumparampil et al. [2021] require at least $\Omega(k^2)$ examples per user. Tripuraneni et al. [2021] and Thekumparampil et al. [2021] assume that the measurements $x_{ij}$ are drawn from the standard (multivariate) Gaussian distribution, whereas Kong et al. [2020], Collins et al. [2021], and our work make the relaxed assumption that the $x_{ij}$ are sub-Gaussian with identity covariance, which, in particular, allows the fourth-order moments of $x_{ij}$ to be non-isotropic. There is a large body of prior work on meta-learning beyond the linear setting [see e.g. Maurer et al., 2016, Tripuraneni et al., 2020, Du et al., 2020].
When collecting data from users, it is often important to ensure that private information about
users is not revealed through the release of the learned estimator. Many recent works proposed and
analyzed estimators that achieve user-level differential privacy in settings including mean estimation
[Levy et al., 2021, Esfandiari et al., 2021], meta-learning [Jain et al., 2021] and PAC learning [Ghazi
et al., 2021]. Recently, Cummings et al. [2021] study one-dimensional mean estimation in a setting
similar to ours, under a differential privacy constraint.
The matrix $A$ we define in (1) is a weighted sum of $A_i := \frac{1}{m_i(m_i - 1)}\sum_{j_1 \ne j_2} x_{ij_1} x_{ij_2}^T$ over users $i = 1, \ldots, n$, and each $A_i$ has the form of a $U$-statistic [Halmos, 1946, Hoeffding, 1948]. $U$-statistics have been applied to many statistical tasks including tensor completion [Xia and Yuan, 2019] and various testing problems [Zhong and Chen, 2011, He et al., 2021, Schrab et al., 2022]. In our definition of $A_i$, we do not make the assumption that the distributions of $x_{i1}, \ldots, x_{im_i}$ are identical, although this assumption is commonly used in applications of $U$-statistics. The matrix $A$ in (3) is also a weighted sum of $U$-statistics where we again do not make the assumption of identical distribution.
1.3 Paper Organization
In Section 2, we formally define the maximum principal angle and other notions we use throughout
the paper. Our results in the PCA setting and the linear models setting are presented in Sections 3
and 4, respectively. We defer most technical proofs to the appendices.
2 Preliminaries
We use $\|A\|$ to denote the spectral norm of a matrix $A$, and $\|u\|_2$ to denote the $\ell_2$ norm of a vector $u$. For positive integers $k \le d$, we use $\mathcal{O}_{d,k}$ to denote the set of matrices $A \in \mathbb{R}^{d \times k}$ satisfying $A^T A = I_k$, where $I_k$ is the $k \times k$ identity matrix. We use $\mathcal{O}_d$ to denote $\mathcal{O}_{d,d}$, which is the set of $d \times d$ orthogonal matrices. We use $\mathrm{col}(A)$ to denote the linear subspace spanned by the columns of a matrix $A$. We use the base-$e$ logarithm throughout the paper.
Maximum Principal Angle. Let $U, \hat{U} \in \mathcal{O}_d$ be two orthogonal matrices. Suppose the columns of $U$ and $\hat{U}$ are partitioned as $U = [U_1 \; U_2]$ and $\hat{U} = [\hat{U}_1 \; \hat{U}_2]$, where $U_1, \hat{U}_1 \in \mathcal{O}_{d,k}$ for an integer $k$ satisfying