Subspace Recovery from Heterogeneous Data with Non-isotropic Noise

John Duchi (Stanford University), Vitaly Feldman (Apple), Lunjia Hu* (Stanford University), Kunal Talwar (Apple)

arXiv:2210.13497v1 [cs.LG] 24 Oct 2022
Abstract
Recovering linear subspaces from data is a fundamental and important task in statistics and machine learning. Motivated by heterogeneity in Federated Learning settings, we study a basic formulation of this problem: principal component analysis (PCA), with a focus on dealing with irregular noise. Our data come from $n$ users with user $i$ contributing data samples from a $d$-dimensional distribution with mean $\mu_i$. Our goal is to recover the linear subspace shared by $\mu_1, \ldots, \mu_n$ using the data points from all users, where every data point from user $i$ is formed by adding an independent mean-zero noise vector to $\mu_i$. If we only have one data point from every user, subspace recovery is information-theoretically impossible when the covariance matrices of the noise vectors can be non-spherical, necessitating additional restrictive assumptions in previous work. We avoid these assumptions by leveraging at least two data points from each user, which allows us to design an efficiently computable estimator under non-spherical and user-dependent noise. We prove an upper bound for the estimation error of our estimator in general scenarios where the number of data points and the amount of noise can vary across users, and prove an information-theoretic error lower bound that not only matches the upper bound up to a constant factor, but also holds even for spherical Gaussian noise. This implies that our estimator does not introduce additional estimation error (up to a constant factor) due to irregularity in the noise. We show additional results for a linear regression problem in a similar setup.
1 Introduction
We study the problem of learning low-dimensional structure amongst data distributions, given multiple samples from each distribution. This problem arises naturally in settings such as federated learning, where we want to learn from data coming from a set of individuals, each of whom has samples from their own distribution. These distributions, however, are related to each other, and in this work we consider the setting where these distributions have means lying in a low-dimensional subspace. The goal is to learn this subspace, even when the distributions may have different (and potentially non-spherical) variances. This heterogeneity can manifest itself in practice as differing numbers of samples per user, or as the variance differing across individuals, possibly depending on their mean. Recovery of the subspace containing the means can in turn help better estimate individual means. In other words, this can allow for learning good estimators for all individual means by leveraging information from all the individuals.

*Part of this work was performed while LH was interning at Apple. LH is also supported by Omer Reingold's NSF Award IIS-1908774, Omer Reingold's Simons Foundation Investigators Award 689988, and Moses Charikar's Simons Foundation Investigators Award.
The irregularity of the noise makes this task challenging even when we have sufficiently many individual distributions. For example, suppose we have $n$ individuals and for every $i = 1, \ldots, n$, an unknown $\mu_i \in \mathbb{R}^d$. For simplicity, suppose that $\mu_1, \ldots, \mu_n$ are distributed independently as $N(0, \sigma^2 uu^T)$ for $\sigma \in \mathbb{R}_{\ge 0}$ and an unknown unit vector $u \in \mathbb{R}^d$. In this setting, our goal is to recover the one-dimensional subspace, equivalently the vector $u$. For every $i$, we have a data point $x_i = \mu_i + z_i$ where $z_i \in \mathbb{R}^d$ is a mean-zero noise vector. If $z_i$ is drawn independently from a spherical Gaussian $N(0, \alpha^2 I)$, we can recover the unknown subspace with arbitrary accuracy as $n$ grows to infinity because $\frac{1}{n}\sum_i x_i x_i^T$ concentrates to $\mathbb{E}[x_i x_i^T] = \sigma^2 uu^T + \alpha^2 I$, whose top eigenvector is $\pm u$. However, if the noise $z_i$ is drawn from a non-spherical distribution, the top eigenvector of $\frac{1}{n}\sum_i x_i x_i^T$ can deviate from $\pm u$ significantly, and to make things worse, if the noise $z_i$ is drawn independently from the non-spherical Gaussian $N(0, \sigma^2(I - uu^T) + \alpha^2 I)$, then our data points $x_i = \mu_i + z_i$ are distributed independently as $N(0, (\sigma^2 + \alpha^2)I)$, giving no information about the vector $u$.¹

¹This information-theoretic impossibility naturally extends to recovering $k$-dimensional subspaces for $k > 1$ by replacing the unit vector $u \in \mathbb{R}^d$ with a matrix $U \in \mathbb{R}^{d \times k}$ with orthonormal columns.
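To make this failure mode concrete, here is a small simulation (ours, not from the paper) of the example above: with spherical noise the top eigenvector of the empirical second moment aligns with $u$, while with the non-spherical noise $N(0, \sigma^2(I - uu^T) + \alpha^2 I)$ it carries essentially no information about $u$. All parameter values are arbitrary illustrative choices.

```python
import numpy as np

rng = np.random.default_rng(0)
n, d, sigma, alpha = 20000, 20, 1.0, 1.0   # arbitrary illustrative values

u = rng.standard_normal(d)
u /= np.linalg.norm(u)                            # unknown unit vector
mu = rng.standard_normal((n, 1)) * sigma * u      # mu_i ~ N(0, sigma^2 u u^T)

def alignment_of_top_eigvec(noise_cov):
    """One data point per user; return |<top eigenvector of (1/n) sum x_i x_i^T, u>|."""
    z = rng.multivariate_normal(np.zeros(d), noise_cov, size=n)
    x = mu + z
    second_moment = x.T @ x / n
    _, vecs = np.linalg.eigh(second_moment)       # eigenvalues in ascending order
    return abs(vecs[:, -1] @ u)

spherical = alpha ** 2 * np.eye(d)
non_spherical = sigma ** 2 * (np.eye(d) - np.outer(u, u)) + alpha ** 2 * np.eye(d)

print("spherical noise:    ", alignment_of_top_eigvec(spherical))      # close to 1
print("non-spherical noise:", alignment_of_top_eigvec(non_spherical))  # far from 1: no signal about u
```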
The information-theoretic impossibility in this example, however, disappears as soon as one has at least two samples from each distribution. Indeed, given two data points $x_{i1} = \mu_i + z_{i1}$ and $x_{i2} = \mu_i + z_{i2}$ from user $i$, as long as the noise vectors $z_{i1}, z_{i2}$ are independent and have zero mean, we always have $\mathbb{E}[x_{i1} x_{i2}^T] = \sigma^2 uu^T$ regardless of the specific distributions of $z_{i1}$ and $z_{i2}$. This allows us to recover the subspace in this example, as long as we have sufficiently many users each contributing at least two examples.
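Continuing the simulation above (again ours, for illustration only), the symmetrized cross-moment $\frac{1}{n}\sum_i \frac{1}{2}(x_{i1} x_{i2}^T + x_{i2} x_{i1}^T)$ recovers $u$ even under the non-spherical noise; for two points per user and uniform weights this coincides with the estimator $A$ defined in equation (1) of Section 1.1.

```python
import numpy as np

rng = np.random.default_rng(1)
n, d, sigma, alpha = 20000, 20, 1.0, 1.0   # same illustrative values as above

u = rng.standard_normal(d)
u /= np.linalg.norm(u)
mu = rng.standard_normal((n, 1)) * sigma * u      # mu_i ~ N(0, sigma^2 u u^T)

non_spherical = sigma ** 2 * (np.eye(d) - np.outer(u, u)) + alpha ** 2 * np.eye(d)
z1 = rng.multivariate_normal(np.zeros(d), non_spherical, size=n)
z2 = rng.multivariate_normal(np.zeros(d), non_spherical, size=n)
x1, x2 = mu + z1, mu + z2                         # two data points per user

# (1/n) sum_i (x_{i1} x_{i2}^T + x_{i2} x_{i1}^T) / 2, whose expectation is sigma^2 u u^T
cross_moment = (x1.T @ x2 + x2.T @ x1) / (2 * n)
_, vecs = np.linalg.eigh(cross_moment)
print("alignment with u:", abs(vecs[:, -1] @ u))  # close to 1 despite the non-spherical noise
```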
As this is commonly the case in our motivating examples, we make this assumption of multiple data points per user, and show that this intuition extends well beyond this particular example. We design efficiently computable estimators for this subspace recovery problem given samples from multiple heteroscedastic distributions (see Section 1.1 for details). We prove upper bounds on the error of our estimator measured in the maximum principal angle (see Section 2 for the definition). We also prove an information-theoretic error lower bound, showing that our estimator achieves the optimal error up to a constant factor in general scenarios where the number of data points and the amount of noise can vary across users. Somewhat surprisingly, our lower bound holds even when the noise is distributed as a spherical Gaussian. Thus, non-spherical noise in our setting does not lead to increased error.
We then show that our techniques extend beyond the mean estimation problem to a linear regression setting where, for each $\mu_i$, we get (at least two) samples $(x_{ij}, x_{ij}^T \mu_i + z_{ij})$, where $z_{ij}$ is zero-mean noise from some noise distribution that depends on $i$ and $x_{ij}$. This turns out to be a model that was recently studied in the meta-learning literature under more restrictive assumptions (e.g., $z_{ij}$ is independent of $x_{ij}$) [Kong et al., 2020, Tripuraneni et al., 2021, Collins et al., 2021, Thekumparampil et al., 2021]. We show a simple estimator achieving an error upper bound matching the ones in prior work without making these restrictive assumptions.
1.1 Our contributions
PCA with heterogeneous and non-isotropic noise: Upper Bounds. In the PCA setting, the data points from each user $i$ are drawn from a user-specific distribution with mean $\mu_i \in \mathbb{R}^d$, and we assume that $\mu_1, \ldots, \mu_n$ lie in a shared $k$-dimensional subspace that we want to recover. Specifically, we have $m_i$ data points $x_{ij} \in \mathbb{R}^d$ from user $i$ for $j = 1, \ldots, m_i$, and each data point is determined by $x_{ij} = \mu_i + z_{ij}$, where $z_{ij} \in \mathbb{R}^d$ is a noise vector drawn independently from a mean-zero distribution. We allow the distribution of $z_{ij}$ to be non-spherical and non-identical across different pairs $(i, j)$. We use $\eta_i \in \mathbb{R}_{\ge 0}$ to quantify the amount of noise in user $i$'s data points by assuming that $z_{ij}$ is an $\eta_i$-sub-Gaussian random variable.
As mentioned earlier, if we only have a single data point from each user, it is information-theoretically impossible to recover the subspace. Thus, we focus on the case where $m_i \ge 2$ for every $i = 1, \ldots, n$. In this setting, for appropriate weights $w_1, \ldots, w_n \in \mathbb{R}_{\ge 0}$, we compute a matrix $A$:

$$A = \sum_{i=1}^n \frac{w_i}{m_i(m_i - 1)} \sum_{j_1 \ne j_2} x_{ij_1} x_{ij_2}^T, \qquad (1)$$

where the inner summation is over all pairs $j_1, j_2 \in \{1, \ldots, m_i\}$ satisfying $j_1 \ne j_2$. Our estimator is then defined by the subspace spanned by the top-$k$ eigenvectors of $A$. Although the inner summation is over $m_i(m_i - 1)$ terms, the time complexity for computing it need not grow quadratically with $m_i$ because of the following equation:

$$\sum_{j_1 \ne j_2} x_{ij_1} x_{ij_2}^T = \left(\sum_{j=1}^{m_i} x_{ij}\right)\left(\sum_{j=1}^{m_i} x_{ij}\right)^T - \sum_{j=1}^{m_i} x_{ij} x_{ij}^T.$$
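The following sketch (ours, not the authors' code) implements the estimator defined in (1) using the identity above, so the per-user cost is $O(m_i d^2)$ rather than $O(m_i^2 d^2)$. The input format (a list of per-user arrays and a list of weights) is an assumption made for illustration.

```python
import numpy as np

def pca_subspace_estimator(X, w, k):
    """Top-k eigenvectors of the matrix A in (1).

    X: list of n arrays, X[i] of shape (m_i, d) with m_i >= 2 (user i's data points).
    w: list of n nonnegative weights.
    k: dimension of the subspace to recover.
    """
    d = X[0].shape[1]
    A = np.zeros((d, d))
    for Xi, wi in zip(X, w):
        mi = Xi.shape[0]
        s = Xi.sum(axis=0)                         # sum_j x_ij
        pair_sum = np.outer(s, s) - Xi.T @ Xi      # sum over j1 != j2 of x_{ij1} x_{ij2}^T
        A += wi / (mi * (mi - 1)) * pair_sum
    A = (A + A.T) / 2                              # symmetric up to floating-point error
    _, vecs = np.linalg.eigh(A)                    # eigenvalues in ascending order
    return vecs[:, -k:]                            # orthonormal basis of the estimated subspace
```

In the homogeneous special case discussed next, one would pass `w = [1/n] * n`.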
The flexibility in the weights $w_1, \ldots, w_n$ allows us to deal with variations in $m_i$ and $\eta_i$ for different users $i$. In the special case where $\eta_1 = \cdots = \eta_n = \eta$ and $m_1 = \cdots = m_n = m$, we choose $w_1 = \cdots = w_n = 1/n$ and we show that our estimator achieves the following error upper bound with success probability at least $1 - \delta$:

$$\sin\theta = O\left(\left(\frac{\eta\sigma_1}{\sigma_k^2\sqrt{m}} + \frac{\eta^2}{\sigma_k^2 m}\right)\sqrt{\frac{d + \log(1/\delta)}{n}}\right).$$

Here, $\theta$ is the maximum principal angle between our estimator and the true subspace shared by $\mu_1, \ldots, \mu_n$, and $\sigma_\ell^2$ is the $\ell$-th largest eigenvalue of $\frac{1}{n}\sum_{i=1}^n \mu_i \mu_i^T$. Our error upper bound for general $m_i, \eta_i, w_i$ is given in Theorem 3.1.
We instantiate our error upper bound in the case where $\mu_1, \ldots, \mu_n$ are drawn i.i.d. from a Gaussian distribution $N(0, \sigma^2 UU^T)$, where the columns of $U \in \mathbb{R}^{d \times k}$ form an orthonormal basis of the subspace containing $\mu_1, \ldots, \mu_n$. By choosing the weights $w_1, \ldots, w_n$ according to $m_1, \ldots, m_n$ and $\eta_1, \ldots, \eta_n$, our estimator achieves the error upper bound

$$\sin\theta \le O\left(\sqrt{\frac{d + \log(1/\delta)}{\sum_{i=1}^n \gamma'_i}}\right) \qquad (2)$$

under a mild assumption (Assumption 3.2), where $\gamma'_i$ is defined in Definition 3.1 and often equals $\left(\frac{\eta_i^2}{\sigma^2 m_i} + \frac{\eta_i^4}{\sigma^4 m_i^2}\right)^{-1}$.
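As a quick sanity check (ours, not a claim made in the paper), specializing (2) to the homogeneous case $\eta_i = \eta$, $m_i = m$, in which all of $\sigma_1, \ldots, \sigma_k$ are of order $\sigma$, recovers the earlier bound up to a constant factor, using $\sqrt{a + b} \asymp \sqrt{a} + \sqrt{b}$ for $a, b \ge 0$:

$$\sqrt{\frac{d + \log(1/\delta)}{\sum_{i=1}^n \gamma'_i}} = \sqrt{\left(\frac{\eta^2}{\sigma^2 m} + \frac{\eta^4}{\sigma^4 m^2}\right)\cdot\frac{d + \log(1/\delta)}{n}} \asymp \left(\frac{\eta}{\sigma\sqrt{m}} + \frac{\eta^2}{\sigma^2 m}\right)\sqrt{\frac{d + \log(1/\delta)}{n}}.$$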
PCA: Lower Bounds. We show that the error upper bound (2) is optimal up to a constant factor by proving a matching information-theoretic lower bound (Theorem 3.7). Our lower bound holds for general $m_i$ and $\eta_i$ that can vary among users $i$, and it holds even when the noise vectors $z_{ij}$ are drawn from spherical Gaussians, showing that our estimator essentially pays no additional cost in error or sample complexity due to non-isotropic noise.

We prove the lower bound using Fano's method on a local packing over the Grassmannian manifold. We carefully select a non-trivial hard distribution so that the strength of our lower bound is not affected by a group of fewer than $k$ users each having a huge number of data points with little noise.
Linear Models. While the PCA setting is the main focus of our paper, we extend our study to a related linear models setting that has recently been well studied in the meta-learning and federated learning literature [Kong et al., 2020, Tripuraneni et al., 2021, Collins et al., 2021, Thekumparampil et al., 2021]. Here, the user-specific distribution of each user $i$ is parameterized by $\beta_i \in \mathbb{R}^d$, and we again assume that $\beta_1, \ldots, \beta_n$ lie in a linear subspace that we want to recover. From each user $i$ we observe $m_i$ data points $(x_{ij}, y_{ij}) \in \mathbb{R}^d \times \mathbb{R}$ for $j = 1, \ldots, m_i$ drawn from the user-specific distribution satisfying $y_{ij} = x_{ij}^T \beta_i + z_{ij}$ for an $O(1)$-sub-Gaussian measurement vector $x_{ij} \in \mathbb{R}^d$ with zero mean and identity covariance and an $\eta_i$-sub-Gaussian mean-zero noise term $z_{ij} \in \mathbb{R}$. While it may seem that non-isotropic noise is less of a challenge in this setting since each noise term $z_{ij}$ is a scalar, our goal is to handle a challenging scenario where the variances of the noise terms $z_{ij}$ can depend on the realized measurements $x_{ij}$, which is a more general and widely applicable setting compared to those in prior work. Similarly to the PCA setting, our relaxed assumptions on the noise make it information-theoretically impossible to do subspace recovery if we only have one data point from each user (see Section 4), and thus we assume each user contributes at least two data points. We use the subspace spanned by the top-$k$ eigenvectors of the following matrix $A$ as our estimator:

$$A = \sum_{i=1}^n \frac{w_i}{m_i(m_i - 1)} \sum_{j_1 \ne j_2} (x_{ij_1} y_{ij_1})(x_{ij_2} y_{ij_2})^T. \qquad (3)$$
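Below is a sketch of the estimator in (3), analogous to the PCA sketch above; again this is our illustration rather than the authors' code, and the input format is an assumption.

```python
import numpy as np

def linear_model_subspace_estimator(X, y, w, k):
    """Top-k eigenvectors of the matrix A in (3).

    X: list of n arrays, X[i] of shape (m_i, d) with m_i >= 2 (user i's measurements).
    y: list of n arrays, y[i] of shape (m_i,) (user i's responses).
    w: list of n nonnegative weights; k: dimension of the subspace to recover.
    """
    d = X[0].shape[1]
    A = np.zeros((d, d))
    for Xi, yi, wi in zip(X, y, w):
        mi = Xi.shape[0]
        V = Xi * yi[:, None]                       # rows are x_ij * y_ij
        s = V.sum(axis=0)
        pair_sum = np.outer(s, s) - V.T @ V        # sum over j1 != j2 of (x_{ij1} y_{ij1})(x_{ij2} y_{ij2})^T
        A += wi / (mi * (mi - 1)) * pair_sum
    A = (A + A.T) / 2                              # symmetric up to floating-point error
    _, vecs = np.linalg.eigh(A)
    return vecs[:, -k:]                            # orthonormal basis of the estimated subspace
```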
In the special case where $\eta_1 = \cdots = \eta_n = \eta$, $m_1 = \cdots = m_n = m$, and $\|\beta_i\|_2 \le r$ for all $i$, our estimator achieves the following error upper bound:

$$\sin\theta \le O\left(\log^3(nd/\delta)\sqrt{\frac{d(r^4 + r^2\eta^2 + \eta^4/m)}{mn\sigma_k^4}}\right), \qquad (4)$$

where $\sigma_k^2$ is the $k$-th largest eigenvalue of $\frac{1}{n}\sum_{i=1}^n \beta_i \beta_i^T$ (Corollary L.2). Our error upper bound extends smoothly to more general cases where $\eta_i$ and $m_i$ vary among users (Theorem L.1). Moreover, our upper bound matches the ones in prior work [e.g., Tripuraneni et al., 2021, Theorem 3] despite requiring less restrictive assumptions.
1.2 Related Work
Principal component analysis under non-isotropic noise has been studied by Vaswani and Narayanamurthy [2017], Zhang et al. [2018], and Narayanamurthy and Vaswani [2020]. When translated to our setting, these papers focus on having only one data point from each user and thus they require additional assumptions: either the level of non-isotropy is low, or the noise is coordinate-wise independent and the subspace is incoherent. The estimation error guarantees in these papers depend crucially on how well these additional assumptions are satisfied. Zhu et al. [2019] and Cai et al. [2021] study PCA with noise and missing data, and Chen et al. [2021] and Cheng et al. [2021] study eigenvalue and eigenvector estimation under heteroscedastic noise. These four papers all assume that the noise is coordinate-wise independent and the subspace/eigenspace is incoherent.
The linear models setting we consider has recently been studied as a basic setting of meta-learning and federated learning by Kong et al. [2020], Tripuraneni et al. [2021], Collins et al. [2021], and Thekumparampil et al. [2021]. These papers all make the assumption that the noise terms $z_{ij}$ are independent of the measurements $x_{ij}$, an assumption that we relax in this paper. Collins et al. [2021] and Thekumparampil et al. [2021] make improvements in sample complexity and error guarantees compared to earlier work by Kong et al. [2020] and Tripuraneni et al. [2021], but Collins et al. [2021] focus on the noiseless setting ($z_{ij} = 0$) and Thekumparampil et al. [2021] require at least $\Omega(k^2)$ examples per user. Tripuraneni et al. [2021] and Thekumparampil et al. [2021] assume that the measurements $x_{ij}$ are drawn from the standard (multivariate) Gaussian distribution, whereas Kong et al. [2020], Collins et al. [2021], and our work make the relaxed assumption that the $x_{ij}$ are sub-Gaussian with identity covariance, which, in particular, allows the fourth-order moments of $x_{ij}$ to be non-isotropic. There is a large body of prior work on meta-learning beyond the linear setting [see e.g. Maurer et al., 2016, Tripuraneni et al., 2020, Du et al., 2020].
When collecting data from users, it is often important to ensure that private information about
users is not revealed through the release of the learned estimator. Many recent works proposed and
analyzed estimators that achieve user-level differential privacy in settings including mean estimation
[Levy et al., 2021, Esfandiari et al., 2021], meta-learning [Jain et al., 2021] and PAC learning [Ghazi
et al., 2021]. Recently, Cummings et al. [2021] study one-dimensional mean estimation in a setting
similar to ours, under a differential privacy constraint.
The matrix $A$ we define in (1) is a weighted sum of $A_i := \frac{1}{m_i(m_i - 1)}\sum_{j_1 \ne j_2} x_{ij_1} x_{ij_2}^T$ over users $i = 1, \ldots, n$, and each $A_i$ has the form of a $U$-statistic [Halmos, 1946, Hoeffding, 1948]. $U$-statistics have been applied to many statistical tasks including tensor completion [Xia and Yuan, 2019] and various testing problems [Zhong and Chen, 2011, He et al., 2021, Schrab et al., 2022]. In our definition of $A_i$, we do not make the assumption that the distributions of $x_{i1}, \ldots, x_{im_i}$ are identical, although this assumption is commonly used in applications of $U$-statistics. The matrix $A$ in (3) is also a weighted sum of $U$-statistics where we again do not make the assumption of identical distribution.
1.3 Paper Organization
In Section 2, we formally define the maximum principal angle and other notions we use throughout
the paper. Our results in the PCA setting and the linear models setting are presented in Sections 3
and 4, respectively. We defer most technical proofs to the appendices.
2 Preliminaries
We use $\|A\|$ to denote the spectral norm of a matrix $A$, and $\|u\|_2$ to denote the $\ell_2$ norm of a vector $u$. For positive integers $k \le d$, we use $\mathcal{O}_{d,k}$ to denote the set of matrices $A \in \mathbb{R}^{d \times k}$ satisfying $A^T A = I_k$, where $I_k$ is the $k \times k$ identity matrix. We use $\mathcal{O}_d$ to denote $\mathcal{O}_{d,d}$, which is the set of $d \times d$ orthogonal matrices. We use $\mathrm{col}(A)$ to denote the linear subspace spanned by the columns of a matrix $A$. We use the base-$e$ logarithm throughout the paper.
Maximum Principal Angle. Let $U, \hat{U} \in \mathcal{O}_d$ be two orthogonal matrices. Suppose the columns of $U$ and $\hat{U}$ are partitioned as $U = [U_1 \; U_2]$ and $\hat{U} = [\hat{U}_1 \; \hat{U}_2]$, where $U_1, \hat{U}_1 \in \mathcal{O}_{d,k}$ for an integer $k$ satisfying