Sample-Efficient Personalization: Modeling User
Parameters as Low Rank Plus Sparse Components
Soumyabrata Pal1, Prateek Varshney1, Prateek Jain1, Abhradeep Guha Thakurta2, Gagan Madan1,
Gaurav Aggarwal1, Pradeep Shenoy1, and Gaurav Srivastava1
1Google Research India
2Google Brain
1,2{soumyabrata,vprateek,prajain,athakurta,gaganmadan,
gauravaggarwal,shenoypradeep,gasrivastava}@google.com
Abstract
Personalization of machine learning (ML) predictions for individual users/domains/enterprises is
critical for practical recommendation systems. Standard personalization approaches involve learning
a user/domain-specific embedding that is fed into a fixed global model, which can be limiting. On the
other hand, personalizing/fine-tuning the model itself for each user/domain – a.k.a. meta-learning – has
high storage/infrastructure cost. Moreover, rigorous theoretical studies of scalable personalization
approaches have been very limited. To address the above issues, we propose a novel meta-learning style
approach that models network weights as a sum of low-rank and sparse components. This captures
common information from multiple individuals/users together in the low-rank part, while the sparse part
captures user-specific idiosyncrasies. We then study the framework in the linear setting, where the
problem reduces to that of estimating the sum of a rank-$r$ and a $k$-column-sparse matrix using a small
number of linear measurements. We propose a computationally efficient alternating minimization
method with iterative hard thresholding – AMHT-LRS – to learn the low-rank and sparse parts.
Theoretically, for the realizable Gaussian data setting, we show that AMHT-LRS solves the problem
efficiently with nearly optimal sample complexity. Finally, a significant challenge in personalization is
ensuring privacy of each user’s sensitive data. We alleviate this problem by proposing a differentially
private variant of our method that is also equipped with strong generalization guarantees.
Keywords: Personalization · Recommendation · Alternating Minimization · Low Rank · Sparsity · Differential Privacy
1 Introduction
Typical industrial recommendation systems cater to a large number of users/domains/enterprises with a
small amount of user-specific data Wang et al. (2016); Zhang et al. (2020). For instance, YouTube has
1.3 billion unique monthly active users, while the average number of likes per user is small. So, personalizing
predictions for each user corresponds to the challenging task of training ML models with a few data-points.
Typically, the personalization literature has approached this problem via a collaborative filtering Schafer et al. (1999)
or content filtering Burke (2003) approach. Both of these approaches, in some sense, learn a user embedding
or user-specific feature vector which is then consumed by a global model to provide personalized
predictions. This can also be seen as a variation of the popular prompt learning approach Lester et al. (2021).
Naturally, the prompt learning approach is limiting because the global model might not be able to
capture all the variations across users/domains unless it is of extremely large size, which in turn leads to
large inference/training cost. Moreover, user-descriptive feature vectors, limited by privacy concerns,
might not be able to capture user tastes explicitly.
On the other extreme, such user-specific personalized models can be learned by fine-tuning a global
model for each user. Vanilla approaches for fine-tuning can be categorized as: 1) Neighborhood Models:
these methods learn a global model, which is then entirely "fine-tuned" to specific tasks (Guo et al.,
2020; Howard and Ruder, 2018; Zaken et al., 2021); and 2) Representation Learning: these methods learn a
low-dimensional representation of points which can be used to train task-specific learners (Javed and
White, 2019; Raghu et al., 2019; Lee et al., 2019; Bertinetto et al., 2018; Hu et al., 2021). Neighborhood
fine-tuning techniques have two key limitations: I) the infrastructure cost of hosting such models is
prohibitive. For example, if a pre-trained 1GB model is fine-tuned separately for 1M users, then the
total storage cost alone is 1PB; and II) they typically work for a small number
of data-rich users/domains/tasks, but the key regime for personalization in our work is a long tail of
data-starved users. In this regime, neighborhood fine-tuning might lead to over-fitting as well. Simple
fixes like fine-tuning only the last layer often lead to significantly worse performance (Chen et al.,
2020a; Salman et al., 2020). Further, note that representation learning techniques can learn only the
(low dimensional) common information across the users but cannot capture the user-level peculiarities as
they share a fixed smaller representation space. These limitations make model personalization for a large
number of data-starved users a challenging problem from the perspective of both storage and inference.
Recently, with the advent of large ML models, there has been a huge push towards practical Parameter
Efficient Fine-Tuning (PEFT) with a variety of techniques (see the survey Lialin et al. (2023) and
references therein). However, as described in Lialin et al. (2023), a rigorous theoretical study of such
approaches that can be scaled to a large number of tasks (users) has been limited. In this work, for the
multi-user personalization problem, our first contribution is to introduce the LRS (Low Rank and Sparse)
framework that combines representation learning and sparse fine-tuning (two distinct classes of methods
for PEFT) with two main goals: 1) propose a practical/efficient algorithm for learning a model for each
user with minimal memory/parameter overhead, and 2) theoretically establish strong generalization
properties despite having a limited amount of data for many users.
Let us describe the LRS framework for multiple users in its full generality for any parameterized class of
functions $\mathcal{F}$. Let $x \in \mathbb{R}^d$ be the input point, and let $\hat{y} = f(x; \Theta) \in \mathbb{R}$ be the predicted label using a
function $f \in \mathcal{F}$ parameterized by $\Theta$. For simplicity, assume that $\Theta$, the set of parameters, is represented
in the form of a high-dimensional vector. Let there be $t$ users, each with a small amount of user-specific
data and each associated with a set of learnable model parameters $\Theta^{(i)}$. Then, the goal is to learn $\Theta^{(i)}$
for each user $i \in [t]$ such that (a) $\Theta^{(i)}$ does not over-fit despite a small number of data-points generated
specifically from user $i$, and (b) $\{\Theta^{(i)}, 1 \le i \le t\}$ can be stored and used for fast inference even for large $t$.
Our method LRS attempts to address these requirements using a simple low-rank + sparse
approach: we model each $\Theta^{(i)}$ as $\Theta^{(i)} := U \cdot w^{(i)} + b^{(i)}$, where $U$ is an orthonormal matrix representing
a global set of shared parameters (across users) corresponding to a low (say $r$)-dimensional subspace,
$w^{(i)}$ is an $r$-dimensional vector, and $b^{(i)}$ is a $k$-sparse vector. Thus, we represent $\Theta^{(i)} := \Theta^{(i,1)} + \Theta^{(i,2)}$,
where the first term $\Theta^{(i,1)} = U \cdot w^{(i)}$ denotes the low-rank component of $\Theta^{(i)}$, that is, the model
parameters of the $i$-th user lying in a low-dimensional subspace (common to all users), and the second
term $\Theta^{(i,2)} = b^{(i)}$ is restricted to be sparse. Hence, for each user, we only need to store the weights of
the $r$-dimensional vector $w^{(i)}$ and the non-zero weights of the $k$-sparse $b^{(i)}$. Therefore, if $r$ and $k$ are
small, then the memory overhead per user is small, thus allowing efficient deployment. Moreover, we
can expect that a small amount of user-specific data should be sufficient to train these few additional
parameters per user without over-fitting. Finally, our framework provides users with the flexibility to
either contribute to a central model using privacy preserving bill-board models (see Section 3) or learn
their parameters locally without contributing to the central model. This allows for greater customization
and adaptability based on individual user preferences and requirements.
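For concreteness, the following is a minimal NumPy sketch of this parameterization in the linear case; the class and variable names (e.g., LRSUser) are illustrative only and not part of any released implementation. It simply demonstrates that per-user storage reduces to the $r$ weights of $w^{(i)}$ plus the $k$ indices and values of $b^{(i)}$.

```python
import numpy as np

class LRSUser:
    """Per-user parameters in the LRS framework: Theta_i = U @ w_i + b_i, where
    U (d x r) is shared across users, w_i is r-dimensional and b_i is k-sparse."""

    def __init__(self, w, b_idx, b_val):
        self.w = np.asarray(w)            # user-specific low-rank weights (length r)
        self.b_idx = np.asarray(b_idx)    # support of the k-sparse component
        self.b_val = np.asarray(b_val)    # non-zero values on that support

    def theta(self, U):
        """Materialize the full d-dimensional parameter vector Theta_i."""
        b = np.zeros(U.shape[0])
        b[self.b_idx] = self.b_val
        return U @ self.w + b

    def predict(self, U, x):
        """Linear prediction <x, Theta_i> without densifying the sparse part."""
        return x @ (U @ self.w) + x[self.b_idx] @ self.b_val

# Example with d = 1000, r = 5, k = 10: per-user storage is r + 2k numbers, not d.
d, r, k = 1000, 5, 10
U = np.linalg.qr(np.random.randn(d, r))[0]          # shared orthonormal subspace
user = LRSUser(w=np.random.randn(r),
               b_idx=np.random.choice(d, k, replace=False),
               b_val=np.random.randn(k))
x = np.random.randn(d)
assert np.isclose(user.predict(U, x), x @ user.theta(U))
```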
Theoretically speaking, a recent line of work (Thekumparampil et al., 2021; Du et al., 2020; Tripuraneni
et al., 2021; Boursier et al., 2022; Jain et al., 2021) analyzes a framework based exclusively on representation
learning. They model $\Theta^{(i)} := U \cdot w^{(i)}$ as the parameters associated with the $i$-th user and provide
theoretical guarantees in the linear model setting. However, their algorithms/analyses do not extend to
our case since their model does not capture the additional sparsity component in $\Theta^{(i)}$; this additional
non-convex constraint introduces several technical challenges in the analysis. Moreover, these works do
not explore the personalization framework with privacy constraints.
As our second and primary contribution, we analyze the LRS framework through a theoretical lens.
Specifically, we consider the instantiation of LRS in the context of linear models. In
this case, the training data for each user corresponds to a few linear measurements of the underlying
user-specific model parameter $\Theta^{(i)}$. Our objective is to estimate the individual components of $\Theta^{(i)}$.
Training a model in the LRS framework involves learning a global set of parameters characterizing the
low dimensional subspace and the local user-specific parameters jointly. To this end, we propose a
simple alternating minimization (AM) style iterative technique. Our method AMHT-LRS alternatingly
estimates the global parameters and user-specific parameters independently for each user. To ensure
sparsity of the $b^{(i)}$'s, we use an iterative hard-thresholding style estimator (Jain et al., 2014). For linear
models, even estimating the global parameters, i.e., the low-rank subspace induced by $U$, is an NP-hard
problem Thekumparampil et al. (2021). Therefore, similar to Thekumparampil et al. (2021), we consider
the realizable setting where the data is generated from a Gaussian distribution. In this case, we provide a
novel analysis demonstrating the efficient convergence of our method, AMHT-LRS, towards the optimal
solution. We believe our analysis techniques are of independent interest as we track entry-wise error of
different parameter estimates across iterations of AMHT-LRS. Below, we state our main result (Thm.
1) informally:
Theorem (Informal). Suppose we are given $m \cdot t$ samples from $t$ linear regression training tasks (each
corresponding to a user) of dimension $d$, with $m$ samples each. In the LRS framework, the goal is to learn
a new regression task's parameters using $m$ samples, i.e., learn the $r$-dimensional weight vector defining
the user-specific low-rank representation and the user-specific $k$-sparse vector. Then AMHT-LRS with
total $m \cdot t = \Omega(k d r^4)$ samples and $m = \Omega(\max(k, r^3))$ samples per user can recover all the parameters
exactly and in time nearly linear in $m \cdot t$.
That is, in the linear instantiation of LRS, AMHT-LRS is able to estimate the underlying model
parameters up to desired precision as long as the total number of users in the recommendation system
is large enough. Moreover, we show that the sufficient sample complexity per user for our estimation
guarantees scales only linearly in $k$ and cubically in $r$ (nearly optimal); recall that $r$ and $k$ are much smaller
than the ambient dimension $d$. The detailed analysis of Theorem 1 is provided in Appendices D and E.
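To make the algorithmic template concrete, the sketch below illustrates one possible alternating-minimization loop with per-user hard thresholding for the linear setting of Section 2. It is a simplified illustration only: the actual AMHT-LRS updates, step sizes, thresholds, and initialization are specified in Section 2 and the appendices, and the helper names here are ours.

```python
import numpy as np

def hard_threshold_top_k(v, k):
    """Keep the k largest-magnitude entries of v, zero out the rest."""
    out = np.zeros_like(v)
    if k > 0:
        idx = np.argsort(np.abs(v))[-k:]
        out[idx] = v[idx]
    return out

def am_sketch(X, y, r, k, n_iters=25, lam=1e-3, seed=0):
    """Toy alternating minimization for y_i ~ X_i (U w_i + b_i) with k-sparse b_i.
    X: list of (m, d) covariate matrices; y: list of length-m response vectors."""
    t, d = len(X), X[0].shape[1]
    rng = np.random.default_rng(seed)
    U = np.linalg.qr(rng.standard_normal((d, r)))[0]   # random orthonormal init
    W, B = np.zeros((t, r)), np.zeros((t, d))
    for _ in range(n_iters):
        # Step 1: per-user updates with U fixed.
        for i in range(t):
            Z = X[i] @ U
            W[i] = np.linalg.lstsq(Z, y[i] - X[i] @ B[i], rcond=None)[0]
            grad = X[i].T @ (y[i] - Z @ W[i] - X[i] @ B[i]) / X[i].shape[0]
            B[i] = hard_threshold_top_k(B[i] + grad, k)  # one IHT-style step
        # Step 2: shared-subspace update with (w_i, b_i) fixed (ridge on vec(U);
        # forming the (dr x dr) system explicitly is for illustration only).
        A, g = lam * np.eye(d * r), np.zeros(d * r)
        for i in range(t):
            M = np.kron(W[i][None, :], X[i])             # M @ vec(U) = X_i U w_i
            A += M.T @ M
            g += M.T @ (y[i] - X[i] @ B[i])
        U = np.linalg.qr(np.linalg.solve(A, g).reshape(d, r, order="F"))[0]
    return U, W, B
```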
Furthermore, using the billboard model of $(\epsilon, \delta)$-differential privacy (DP) (Jain et al., 2021; Chien et al.,
2021; Kearns et al., 2014), we can extend AMHT-LRS to preserve the privacy of each individual user. For
a similar sample complexity as in the above theorem, albeit with a slightly worse dependence on $r$, we can
guarantee strong generalization error up to a standard error term due to privacy (see Theorem 3 and
Appendix D).
Finally, to validate our theoretical contributions, we demonstrate experimental results on synthetic and
real-world datasets using linear models. We also experiment with our framework applied to neural
network architectures (see Appendices A and B). Our experiments demonstrate the advantage/efficacy of our
framework compared to natural baseline frameworks applied to the same model architecture.
1.1 Other Related Work
Comparison with Hu et al. (2021): LORA (Low Rank Adaptation of Large Language Models) was
proposed by Hu et al. (2021) for meta-learning with a large number of tasks at scale. Although the authors
demonstrate promising experimental results, LORA only allows a central model (in a low-dimensional
manifold) and does not incorporate sparse fine-tuning. Hence, LORA becomes ineffective when the output
dimension is small. Moreover, LORA does not have any theoretical guarantees even in simple settings.
Comparison with Prompt-based and Batch-norm Fine-tuning: Another popular approach for
personalization is to use prompt-based or batch-norm based fine-tuning (Wang et al., 2022; Liu et al.,
2021; Lester et al., 2021); this usually involves a task-based feature embedding concatenated with the
covariate. Note that in a linear model, such an approach will only lead to an additional scalar bias
which can be easily modeled in our framework; thus our framework is richer and more expressive with a
smaller number of parameters.
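To spell out this claim in the linear case (an illustrative calculation in our own notation, where $e^{(i)}$ denotes a fixed per-user embedding and $(\theta_x, \theta_e)$ a global linear model over the concatenated input):
$$\hat{y}^{(i)}(x) = \big\langle (\theta_x, \theta_e),\ (x, e^{(i)}) \big\rangle = \langle \theta_x, x \rangle + \underbrace{\langle \theta_e, e^{(i)} \rangle}_{\text{user-specific scalar } c^{(i)}},$$
so every user shares the same weight vector $\theta_x$ and differs only by the scalar offset $c^{(i)}$, whereas in LRS the entire weight vector varies across users through $U w^{(i)} + b^{(i)}$.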
1.2 Preliminaries - Private Personalization
For model personalization in recommendation systems, where we wish to have a personalized model
for each user, privacy guarantees are of utmost importance. Due to the sensitivity of user data, we want
to preserve the privacy of each user, for which we use user-level $(\epsilon, \delta)$-DP as the privacy notion (see
Definition 1). In this setting, each user $i \in [t]$ holds a set of data samples $\mathcal{D}^{(i)} = \{x^{(i)}_j, 1 \le j \le m\}$.
Furthermore, users interact via a central algorithm that maintains the common representation matrix
$U$, which is guaranteed to be differentially private with respect to all the data samples of any single user.
The central algorithm publishes the current $U$ to all the users (a.k.a. on a billboard) and obtains further
updates from the users. It has been shown in prior works (Jain et al., 2021; Chien et al., 2021; Thakkar
et al., 2019) that such a billboard mechanism allows for significantly more accurate privacy-preserving
methods while ensuring user-level privacy. In particular, it allows learning $U$ effectively, while each
user keeps a part of the model that is personal to them, for example, the $w^{(i)}$'s and $b^{(i)}$'s in our
context. See Section 3 of Jain et al. (2021) for more details about the billboard model in the personalization
setting. Traditionally, such a model of private computation is called the billboard model of DP and is a
subclass of joint DP (Kearns et al., 2014).
Definition 1 (Differential Privacy Dwork et al. (2006a;b); Bun and Steinke (2016)). A randomized
algorithm $\mathcal{A}$ is $(\varepsilon, \delta)$-differentially private (DP) if for any pair of datasets $\mathcal{D}$ and $\mathcal{D}'$ that differ in one
user (i.e., $|\mathcal{D} \triangle \mathcal{D}'| = 1$), and for all $S$ in the output range of $\mathcal{A}$, we have
$$\Pr[\mathcal{A}(\mathcal{D}) \in S] \le e^{\varepsilon} \cdot \Pr[\mathcal{A}(\mathcal{D}') \in S] + \delta,$$
where the probability is over the randomness of $\mathcal{A}$. Similarly, an algorithm $\mathcal{A}$ is $\rho$-zero-Concentrated DP
(zCDP) if $D_{\alpha}(\mathcal{A}(\mathcal{D}) \,\|\, \mathcal{A}(\mathcal{D}')) \le \alpha \rho$ for all $\alpha \in (1, \infty)$, where $D_{\alpha}$ is the Rényi divergence of order $\alpha$.
In Definition 1, when we define the notion of neighborhood, we define it with respect to the addition
(removal) of a single user (i.e., the addition/removal of all the data samples $\mathcal{D}^{(i)}$ of any user $i \in [t]$). In the
literature Dwork and Roth (2014), this definition is referred to as user-level DP.
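As a standard illustrative example (not the paper's specific mechanism) of how such guarantees are typically obtained, the Gaussian mechanism adds noise calibrated to the $L_2$-sensitivity of a user-level aggregate: releasing $q(\mathcal{D}) + \mathcal{N}(0, \sigma^2 I)$ with $\sigma = \Delta/\sqrt{2\rho}$ satisfies $\rho$-zCDP when $q$ has user-level $L_2$-sensitivity $\Delta$ (Bun and Steinke, 2016). A minimal sketch, with illustrative names:

```python
import numpy as np

def gaussian_mechanism_zcdp(value, l2_sensitivity, rho, rng=None):
    """Release `value` with rho-zCDP: add N(0, sigma^2) noise per coordinate,
    where sigma = l2_sensitivity / sqrt(2 * rho)."""
    rng = np.random.default_rng() if rng is None else rng
    sigma = l2_sensitivity / np.sqrt(2.0 * rho)
    return value + rng.normal(0.0, sigma, size=np.shape(value))

# Illustration: privatize a sum of per-user contributions, each clipped to norm <= C,
# so that adding/removing one user changes the sum by at most C in L2 norm.
C, rho = 1.0, 0.05
contributions = [np.random.randn(16) for _ in range(1000)]
clipped = [c * min(1.0, C / np.linalg.norm(c)) for c in contributions]
private_sum = gaussian_mechanism_zcdp(np.sum(clipped, axis=0), l2_sensitivity=C, rho=rho)
```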
2 Linear LRS with Gaussian Data
2.1 Problem Statement and Algorithm AMHT-LRS
Notations: $[m]$ denotes the set $\{1, 2, \ldots, m\}$. For a matrix $A$, $A_i$ denotes the $i$-th row of $A$. For a vector
$x$, $x_i$ denotes the $i$-th element of $x$. We sometimes use $x_j$ to denote an indexed vector; in this case $x_{j,i}$
denotes the $i$-th element of $x_j$. $\|\cdot\|_2$ denotes the Euclidean norm of a vector and the operator norm of a
matrix. $\|\cdot\|_{\infty}$ and $\|\cdot\|_0$ denote the $\ell_{\infty}$ and $\ell_0$ norms of a vector respectively. $\|A\|_{2,\infty} = \max_i \|A_i\|_2$
and $\|A\|_F = \sqrt{\sum_i \|A_i\|_2^2}$ denote the $L_{2,\infty}$ and Frobenius norms of a matrix respectively. For a sparse
vector $v \in \mathbb{R}^d$, we define the support $\mathsf{supp}(v) \subseteq [d]$ to be the set of indices $\{i \in [d] \mid v_i \neq 0\}$. We use $I$
to denote the identity matrix. The $\widetilde{O}(\cdot)$ and $\widetilde{\Omega}(\cdot)$ notations subsume logarithmic factors. For a matrix $V \in \mathbb{R}^{d \times r}$,
$\mathsf{vec}(V) \in \mathbb{R}^{dr}$ vectorizes the matrix $V$ by stacking its columns sequentially. Similarly, $\mathsf{vec}^{-1}_{d \times r}(v)$ inverts the
$\mathsf{vec}(\cdot)$ operation by reconstructing a $d \times r$ matrix from a vector of dimension $dr$, i.e., for the matrix $V$,
$\mathsf{vec}^{-1}_{d \times r}(\mathsf{vec}(V)) = V$. Finally, let $\mathsf{HT}: \mathbb{R}^d \times \mathbb{R} \to \mathbb{R}^d$ be a hard thresholding function that takes a vector
$v \in \mathbb{R}^d$ and a threshold parameter as input and returns a vector $v' \in \mathbb{R}^d$ such that $v'_i = v_i$ if $|v_i|$ exceeds the
threshold and $0$ otherwise.
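The following is a direct NumPy transcription of the $\mathsf{HT}$, $\mathsf{vec}$, and $\mathsf{vec}^{-1}$ operations defined above (illustrative only; function names are ours).

```python
import numpy as np

def HT(v, threshold):
    """Hard thresholding: keep entries of v whose magnitude exceeds the threshold, zero the rest."""
    v = np.asarray(v, dtype=float)
    return np.where(np.abs(v) > threshold, v, 0.0)

def vec(V):
    """Stack the columns of a d x r matrix V into a vector of length d*r."""
    return V.reshape(-1, order="F")

def vec_inv(v, d, r):
    """Inverse of vec: rebuild the d x r matrix from a length-(d*r) vector."""
    return np.asarray(v).reshape(d, r, order="F")

# Quick check of the stated identity vec_inv(vec(V)) = V.
V = np.arange(6, dtype=float).reshape(3, 2)
assert np.allclose(vec_inv(vec(V), 3, 2), V)
print(HT(np.array([0.2, -1.5, 0.7]), threshold=0.5))  # -> [ 0.  -1.5  0.7]
```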
In this section, we describe our LRS framework for the linear setting with Gaussian data, provide an
efficient algorithm for parameter estimation, and provide a rigorous analysis under the realizable setting.
Formally speaking, consider $t$ $d$-dimensional linear regression tasks indexed by $i \in [t]$, where the $i$-th
training task is associated with the $i$-th user. Recall that, according to the definition of the LRS framework,
every user/task $i \in [t]$ is associated with a set of unknown learnable parameters $\Theta^{(i)} \in \mathbb{R}^d$ that can be
decomposed as $\Theta^{(i)} = U w^{(i)} + b^{(i)}$. Here, $U \in \mathbb{R}^{d \times r}$ (satisfying $U^{\mathsf{T}} U = I$) is the global shared set
of parameters (across users) corresponding to the orthonormal basis vectors of an $r$-dimensional subspace.
For the $i$-th user, the $r$-dimensional user-specific parameter $w^{(i)} \in \mathbb{R}^r$ corresponds to the weights of the
basis vectors of the low-dimensional subspace defined by the columns of $U$; similarly, the user-specific $k$-sparse
vector $b^{(i)} \in \mathbb{R}^d$ with $\|b^{(i)}\|_0 = k$ corresponds to the sparse component of the unknown parameters of
the $i$-th user. Henceforth, we will refer to the problem of estimating the unknown parameters of user
$i \in [t]$ as the $i$-th training task.
For each task $i \in [t]$, $m \ll d$ samples $\{(x^{(i)}_j, y^{(i)}_j)\}_{j=1}^{m} \in (\mathbb{R}^d \times \mathbb{R})^m$ are provided, labelled as task $i$. Next,
we assume the following generative model for the data: for all $i \in [t]$ and for all $j \in [m]$, the covariates
$\{x^{(i)}_j\}_{i,j}$ are independently generated from a $d$-dimensional Gaussian with identity covariance (denoted as
$\mathcal{N}(0, I_d)$) and the expected response is a linear function of the corresponding covariate. More precisely,
we have:
$$x^{(i)}_j \sim \mathcal{N}(0, I_d) \quad \text{and} \quad y^{(i)}_j \mid x^{(i)}_j = \big\langle x^{(i)}_j,\ U w^{(i)} + b^{(i)} \big\rangle + z^{(i)}_j \qquad \forall\ i \in [t],\ j \in [m], \qquad (1)$$
where $z^{(i)}_j \sim \mathcal{N}(0, \sigma^2)$ are zero-mean Gaussian random variables with variance $\sigma^2$. Furthermore,
$\{x^{(i)}_j, z^{(i)}_j\}_{i,j}$ are independent random variables. We use $X^{(i)} \in \mathbb{R}^{m \times d}$ to represent the matrix of
covariates for the $i$-th task s.t. $X^{(i)}_j = (x^{(i)}_j)^{\mathsf{T}}$. Similarly, we write $y^{(i)}, z^{(i)} \in \mathbb{R}^m$ to represent the
user-specific response vector and noise vector respectively.
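For reference, the following is a minimal sketch for sampling synthetic data according to the generative model in (1); the function name and parameter choices are illustrative.

```python
import numpy as np

def generate_lrs_data(t, m, d, r, k, sigma, seed=0):
    """Sample synthetic data from the generative model in Eq. (1):
    x ~ N(0, I_d), y = <x, U w_i + b_i> + z with z ~ N(0, sigma^2)."""
    rng = np.random.default_rng(seed)
    U = np.linalg.qr(rng.standard_normal((d, r)))[0]   # orthonormal d x r basis
    W = rng.standard_normal((t, r))                    # user-specific low-rank weights
    B = np.zeros((t, d))                               # user-specific k-sparse components
    for i in range(t):
        support = rng.choice(d, size=k, replace=False)
        B[i, support] = rng.standard_normal(k)
    X = rng.standard_normal((t, m, d))                 # covariates
    Theta = W @ U.T + B                                # row i equals U w_i + b_i
    Y = np.einsum("imd,id->im", X, Theta) + sigma * rng.standard_normal((t, m))
    return X, Y, U, W, B
```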
Therefore, given the dataset $\{(x^{(i)}_j, y^{(i)}_j)\}_{i,j}$, the problem reduces to that of designing statistically and
computationally efficient algorithms to estimate the common representation learning parameter $U$ as
well as the task-specific parameters $\{w^{(i)}\}_{i \in [t]}$ and $\{b^{(i)}\}_{i \in [t]}$
. The ERM (Empirical Risk Minimizer) for this problem minimizes the total empirical squared loss
$\sum_{i \in [t]} \|y^{(i)} - X^{(i)}(U w^{(i)} + b^{(i)})\|_2^2$ jointly over $U$ and $\{w^{(i)}, b^{(i)}\}_{i \in [t]}$, subject to each $b^{(i)}$ being $k$-sparse.