Naturally, the prompt learning approach is limited because the global model might not be able to capture all the variations across users/domains unless it is of extremely large size, which in turn leads to large inference/training costs. Moreover, user-descriptive feature vectors, limited by privacy concerns, might not be able to capture user taste explicitly.
At the other extreme, such user-specific personalized models can be learned by fine-tuning a global model for each user. Vanilla approaches for fine-tuning can be categorized as: 1) Neighborhood Models: these methods learn a global model, which is then entirely "fine-tuned" to specific tasks (Guo et al., 2020; Howard and Ruder, 2018; Zaken et al., 2021), and 2) Representation Learning: these methods learn a low-dimensional representation of points which can be used to train task-specific learners (Javed and White, 2019; Raghu et al., 2019; Lee et al., 2019; Bertinetto et al., 2018; Hu et al., 2021). Neighborhood fine-tuning techniques have two key limitations: I) the infrastructure cost of hosting such models is prohibitive; for example, if a pre-trained 1GB model is fine-tuned for 1M users, the total storage cost alone is 1PB, and II) they typically work for a small number of data-rich users/domains/tasks, whereas the key regime for personalization in our work is a long tail of data-starved users. In this regime, neighborhood fine-tuning might also lead to over-fitting. Simple fixes like fine-tuning only the last layer often lead to significantly worse performance (Chen et al., 2020a; Salman et al., 2020). Further, note that representation learning techniques can learn only the (low-dimensional) common information across users but cannot capture user-level peculiarities, as they share a fixed, smaller representation space. These limitations make model personalization for a large number of data-starved users a challenging problem from the perspective of both storage and inference.
Recently, with the advent of large ML models, there has been a huge push towards practical Parameter-Efficient Fine-Tuning (PEFT) with a variety of techniques (see the survey Lialin et al. (2023) and references therein). However, as described in Lialin et al. (2023), rigorous theoretical study of such approaches that can scale to a large number of tasks (users) has been limited. In this work, for the multi-user personalization problem, our first contribution is to introduce the LRS (Low Rank and Sparse) framework that combines representation learning and sparse fine-tuning (two distinct classes of PEFT methods) with two main goals: 1) propose a practical/efficient algorithm for learning a model for each user with minimal memory/parameter overhead, and 2) theoretically establish strong generalization properties despite having a limited amount of data for many users.
Let us describe the LRS framework for multiple users in its full generality for any parameterized class of functions $\mathcal{F}$. Let $x \in \mathbb{R}^d$ be the input point, and let $\hat{y} = f(x; \Theta) \in \mathbb{R}$ be the predicted label using a function $f \in \mathcal{F}$ parameterized by $\Theta$. For simplicity, assume that $\Theta$, the set of parameters, is represented in the form of a high-dimensional vector. Let there be $t$ users, each with a small amount of user-specific data and each associated with a set of learnable model parameters $\Theta^{(i)}$. Then, the goal is to learn $\Theta^{(i)}$ for each user $i \in [t]$ such that (a) $\Theta^{(i)}$ does not over-fit despite a small number of data-points generated specifically from user $i$, and (b) $\{\Theta^{(i)},\, 1 \le i \le t\}$ can be stored and used for fast inference even for large $t$.
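For concreteness, the sketch below (our own illustration, not code from this work; the function `f`, the dictionary `naive_store`, and the chosen sizes are hypothetical) shows the naive baseline that this setup rules out: storing a fully fine-tuned parameter vector for every user, so that memory grows as $t \times p$ where $p$ is the number of model parameters.

```python
import numpy as np

# Illustrative multi-user setup (hypothetical sizes, not the paper's code).
p = 10_000   # total number of model parameters
t = 100      # number of users
d = 64       # input dimension

rng = np.random.default_rng(0)

def f(x: np.ndarray, theta: np.ndarray) -> float:
    """Stand-in for y_hat = f(x; Theta): a linear head on the first d entries."""
    return float(x @ theta[:d])

# Naive per-user fine-tuning stores a full copy of Theta for every user,
# so memory grows as t * p floats -- exactly what requirement (b) rules out.
naive_store = {i: rng.standard_normal(p) for i in range(t)}

x = rng.standard_normal(d)
y_hat_user3 = f(x, naive_store[3])
```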
Our method LRS attempts to address these requirements using a simple low-rank + sparse approach: we model each $\Theta^{(i)}$ as $\Theta^{(i)} := U^\star \cdot w^{\star(i)} + b^{\star(i)}$, where $U^\star$ is an orthonormal matrix representing a global set of shared parameters (across users) corresponding to a low (say $r$)-dimensional subspace, $w^{\star(i)}$ is an $r$-dimensional vector, and $b^{\star(i)}$ is a $k$-sparse vector. Thus, we represent $\Theta^{(i)} := \Theta^{(i,1)} + \Theta^{(i,2)}$, where the first term $\Theta^{(i,1)} = U^\star \cdot w^{\star(i)}$ denotes the low-rank component of $\Theta^{(i)}$, that is, the model parameters of the $i$th user lying in a low-dimensional subspace (common to all users), and the second term $\Theta^{(i,2)} = b^{\star(i)}$ is restricted to be sparse. Hence, for each user, we only need to store the weights of the $r$-dimensional vector $w^{\star(i)}$ and the non-zero weights of the $k$-sparse $b^{\star(i)}$. Therefore, if $r$ and $k$ are small, then the memory overhead per user is small, allowing efficient deployment. Moreover, we can expect that a small amount of user-specific data should be sufficient to train these few additional parameters.
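The following sketch (our own illustration under assumed names and sizes; NumPy and the helpers `per_user` and `reconstruct_theta` are not part of this work's implementation) shows how the LRS parameterization keeps the per-user overhead small: only the $r$-dimensional $w^{\star(i)}$ and the indices/values of the $k$-sparse $b^{\star(i)}$ are stored per user, while the orthonormal $U^\star$ is shared by all users.

```python
import numpy as np

rng = np.random.default_rng(0)

p, r, k, t = 10_000, 16, 32, 100   # hypothetical sizes: params, rank, sparsity, users

# Shared global component: a p x r orthonormal basis U* (stored once for all users).
U, _ = np.linalg.qr(rng.standard_normal((p, r)))

# Per-user components: an r-dim w*(i) and a k-sparse b*(i), kept as
# (indices, values), so the overhead is only r + 2k numbers per user.
per_user = {}
for i in range(t):
    w = rng.standard_normal(r)
    idx = rng.choice(p, size=k, replace=False)
    val = rng.standard_normal(k)
    per_user[i] = (w, idx, val)

def reconstruct_theta(i: int) -> np.ndarray:
    """Materialize Theta^(i) = U* w*(i) + b*(i) on demand (e.g. at inference time)."""
    w, idx, val = per_user[i]
    theta = U @ w                 # low-rank component Theta^(i,1)
    theta[idx] += val             # sparse component Theta^(i,2)
    return theta

theta_user7 = reconstruct_theta(7)
print("per-user numbers stored:", r + 2 * k, "vs full model size:", p)
```

Here the per-user cost is $r + 2k$ numbers instead of $p$, which is what makes deployment to a long tail of users feasible in this framework.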