Naturally, the prompt learning approach is limited because the global model might not be able to capture all the variations across users/domains unless it is of extremely large size, which in turn leads to large inference/training costs. Moreover, user-descriptive feature vectors, limited by privacy concerns, might not be able to capture user taste explicitly.
At the other extreme, such user-specific personalized models can be learned by fine-tuning a global model for each user. Vanilla approaches for fine-tuning can be categorized as: 1) Neighborhood Models: these methods learn a global model, which is then entirely "fine-tuned" to specific tasks (Guo et al., 2020; Howard and Ruder, 2018; Zaken et al., 2021), and 2) Representation Learning: these methods learn a low-dimensional representation of points which can be used to train task-specific learners (Javed and White, 2019; Raghu et al., 2019; Lee et al., 2019; Bertinetto et al., 2018; Hu et al., 2021). Neighborhood fine-tuning techniques have two key limitations: I) the infrastructure cost of hosting such models is prohibitive; for example, if a pre-trained 1GB model is fine-tuned for 1M users, the total storage cost alone is 1PB, and II) they typically work for a small number of data-rich users/domains/tasks, whereas the key regime for personalization in our work is a long tail of data-starved users. In this regime, neighborhood fine-tuning might also lead to over-fitting. Simple fixes like fine-tuning only the last layer often lead to significantly worse performance (Chen et al., 2020a; Salman et al., 2020). Further, note that representation learning techniques can learn only the (low-dimensional) common information across users but cannot capture user-level peculiarities, as they share a fixed, smaller representation space. These limitations make model personalization for a large number of data-starved users a challenging problem from the perspective of both storage and inference.
Recently, with the advent of large ML models, there has been a huge push towards practical Parameter-Efficient Fine-Tuning (PEFT) with a variety of techniques (see the survey Lialin et al. (2023) and references therein). However, as described in Lialin et al. (2023), rigorous theoretical study of such approaches that can scale to a large number of tasks (users) has been limited. In this work, for the multi-user personalization problem, our first contribution is to introduce the LRS (Low Rank and Sparse) framework that combines representation learning and sparse fine-tuning (two distinct classes of PEFT methods) with two main goals: 1) propose a practical/efficient algorithm for learning a model for each user with minimal memory/parameter overhead, and 2) theoretically establish strong generalization properties despite having a limited amount of data for many users.
Let us describe the LRS framework for multiple users in its full generality for any parameterized class of functions $\mathcal{F}$. Let $x \in \mathbb{R}^d$ be the input point, and let $\hat{y} = f(x; \Theta) \in \mathbb{R}$ be the predicted label using a function $f \in \mathcal{F}$ parameterized by $\Theta$. For simplicity, assume that $\Theta$, the set of parameters, is represented in the form of a high-dimensional vector. Let there be $t$ users, each with a small amount of user-specific data and each associated with a set of learnable model parameters $\Theta^{(i)}$. Then, the goal is to learn $\Theta^{(i)}$ for each user $i \in [t]$ such that (a) $\Theta^{(i)}$ does not over-fit despite a small number of data-points generated specifically from user $i$, and (b) $\{\Theta^{(i)},\, 1 \le i \le t\}$ can be stored and used for fast inference even for large $t$.
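For concreteness, the sketch below (our own illustration, not code from this work; the function `f`, the dictionary `naive_store`, and the chosen sizes are hypothetical) shows the naive baseline that this setup rules out: storing a fully fine-tuned parameter vector for every user, so that memory grows as $t \times p$ where $p$ is the number of model parameters.

```python
import numpy as np

# Illustrative multi-user setup (hypothetical sizes, not the paper's code).
p = 10_000   # total number of model parameters
t = 100      # number of users
d = 64       # input dimension

rng = np.random.default_rng(0)

def f(x: np.ndarray, theta: np.ndarray) -> float:
    """Stand-in for y_hat = f(x; Theta): a linear head on the first d entries."""
    return float(x @ theta[:d])

# Naive per-user fine-tuning stores a full copy of Theta for every user,
# so memory grows as t * p floats -- exactly what requirement (b) rules out.
naive_store = {i: rng.standard_normal(p) for i in range(t)}

x = rng.standard_normal(d)
y_hat_user3 = f(x, naive_store[3])
```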
Our method LRS attempts to address these requirements using a simple low-rank + sparse approach: we model each $\Theta^{(i)}$ as $\Theta^{(i)} := U^\star \cdot w^{\star(i)} + b^{\star(i)}$, where $U^\star$ is an orthonormal matrix representing a global set of shared parameters (across users) corresponding to a low (say $r$)-dimensional subspace, $w^{\star(i)}$ is an $r$-dimensional vector, and $b^{\star(i)}$ is a $k$-sparse vector. Thus, we represent $\Theta^{(i)} := \Theta^{(i,1)} + \Theta^{(i,2)}$, where the first term $\Theta^{(i,1)} = U^\star \cdot w^{\star(i)}$ denotes the low-rank component of $\Theta^{(i)}$, that is, the model parameters of the $i$th user lying in a low-dimensional subspace (common to all users), and the second term $\Theta^{(i,2)} = b^{\star(i)}$ is restricted to be sparse. Hence, for each user, we only need to store the weights of the $r$-dimensional vector $w^{\star(i)}$ and the non-zero weights of the $k$-sparse $b^{\star(i)}$. Therefore, if $r$ and $k$ are small, then the memory overhead per user is small, allowing efficient deployment. Moreover, we can expect that a small amount of user-specific data should be sufficient to train these few additional parameters.
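The following sketch (our own illustration under assumed names and sizes; NumPy and the helpers `per_user` and `reconstruct_theta` are not part of this work's implementation) shows how the LRS parameterization keeps the per-user overhead small: only the $r$-dimensional $w^{\star(i)}$ and the indices/values of the $k$-sparse $b^{\star(i)}$ are stored per user, while the orthonormal $U^\star$ is shared by all users.

```python
import numpy as np

rng = np.random.default_rng(0)

p, r, k, t = 10_000, 16, 32, 100   # hypothetical sizes: params, rank, sparsity, users

# Shared global component: a p x r orthonormal basis U* (stored once for all users).
U, _ = np.linalg.qr(rng.standard_normal((p, r)))

# Per-user components: an r-dim w*(i) and a k-sparse b*(i), kept as
# (indices, values), so the overhead is only r + 2k numbers per user.
per_user = {}
for i in range(t):
    w = rng.standard_normal(r)
    idx = rng.choice(p, size=k, replace=False)
    val = rng.standard_normal(k)
    per_user[i] = (w, idx, val)

def reconstruct_theta(i: int) -> np.ndarray:
    """Materialize Theta^(i) = U* w*(i) + b*(i) on demand (e.g. at inference time)."""
    w, idx, val = per_user[i]
    theta = U @ w                 # low-rank component Theta^(i,1)
    theta[idx] += val             # sparse component Theta^(i,2)
    return theta

theta_user7 = reconstruct_theta(7)
print("per-user numbers stored:", r + 2 * k, "vs full model size:", p)
```

Here the per-user cost is $r + 2k$ numbers instead of $p$, which is what makes deployment to a long tail of users feasible in this framework.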