A Kernel-Based View of Language Model Fine-Tuning
Sadhika Malladi 1  Alexander Wettig 1  Dingli Yu 1  Danqi Chen 1  Sanjeev Arora 1
Abstract
It has become standard to solve NLP tasks by fine-tuning pre-trained language models (LMs), especially in low-data settings. There is minimal theoretical understanding of empirical success, e.g., why fine-tuning a model with 10^8 or more parameters on a couple dozen training points does not result in overfitting. We investigate whether the Neural Tangent Kernel (NTK)—which originated as a model to study the gradient descent dynamics of infinitely wide networks with suitable random initialization—describes fine-tuning of pre-trained LMs. This study was inspired by the decent performance of NTK for computer vision tasks (Wei et al., 2022). We extend the NTK formalism to Adam and use Tensor Programs (Yang, 2020b) to characterize conditions under which the NTK lens may describe fine-tuning updates to pre-trained language models. Extensive experiments on 14 NLP tasks validate our theory and show that formulating the downstream task as a masked word prediction problem through prompting often induces kernel-based dynamics during fine-tuning. Finally, we use this kernel view to propose an explanation for the success of parameter-efficient subspace-based fine-tuning methods.1
1. Introduction
It is now customary to solve most supervised natural language processing (NLP) tasks such as topic classification and textual entailment by fine-tuning a pre-trained language model (e.g., (Devlin et al., 2019; Liu et al., 2020b; Clark et al., 2020; Raffel et al., 2020; Joshi et al., 2020)). We lack theoretical understanding of this fine-tuning paradigm. Why do we not see over-fitting when fine-tuning a very large language model using a couple dozen instances of the supervised task? Why is fine-tuning so sensitive to details such as whether or not we include a prompt (e.g., adding "It was [great/terrible]" for sentiment analysis (Schick & Schütze, 2021; Gao et al., 2021))? Why does restricting optimization to a low-rank subspace of model parameters (Hu et al., 2021; Li et al., 2018; Aghajanyan et al., 2021) still result in performance comparable to full fine-tuning? Answering such questions requires understanding how the sequence of parameter updates changes in various scenarios, e.g., the addition of a prompt, or the introduction of randomly initialized parameters. The current theory of deep learning, at first sight, seems too primitive to address such questions, especially since fine-tuning has to start from a parameter initialization inherited from pre-training.

1 Department of Computer Science, Princeton University, Princeton, NJ, USA. Correspondence to: Sadhika Malladi <smalladi@cs.princeton.edu>. Proceedings of the 40th International Conference on Machine Learning, Honolulu, Hawaii, USA. PMLR 202, 2023. Copyright 2023 by the author(s).

1 Our code and pre-computed kernels are publicly available at https://github.com/princeton-nlp/LM-Kernel-FT.
Recently, Wei et al. (2022) suggested replacing fine-tuning with the Neural Tangent Kernel (NTK), an idea invented for the study of infinite-width deep neural networks (Jacot et al., 2018; Du et al., 2019a) and previously applied to solving vision tasks with infinitely wide ConvNets (Arora et al., 2019b). They note that the NTK can be defined for any neural model f and any initialization θ_0 by representing an input ξ by the gradient it induces, ∇f(ξ; θ_0), which yields a kernel matrix:

K(ξ, ξ′) = ⟨∇f(ξ; θ_0), ∇f(ξ′; θ_0)⟩.    (1)

This kernel is well-defined for any parameter vector θ_0. However, for an infinite-width network initialized with θ_0 sampled from a suitably scaled Gaussian, it can be shown that the kernel matrix is unchanged during gradient descent, which turns the classification task into a form of kernel regression with respect to this kernel (Jacot et al., 2018).

In the fine-tuning setting, however, the initialization θ_0 is inherited from the pre-trained network, and not sampled from the Gaussian distribution. Nevertheless, Wei et al. (2022) found that kernel regression using this "empirical NTK" (eNTK) defined with the inherited θ_0 performs well, achieving classification accuracy within 6% absolute of actual fine-tuning on several image recognition tasks. In other words, their work hints that mathematical understanding of the fine-tuning phenomenon (e.g., its sample efficiency) could go via the theory of kernel classifiers.
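To make Equation (1) concrete, the sketch below computes a single eNTK entry by taking per-example gradients of a scalar model output at the pre-trained parameters θ_0. It is a minimal illustration of the definition under a toy stand-in model, not the paper's released kernel-computation pipeline; the model f and inputs are placeholders.

```python
import torch

def entk_entry(f, params, xi, xi_prime):
    """Empirical NTK entry K(xi, xi') = <grad f(xi; theta_0), grad f(xi'; theta_0)>.

    f: callable mapping (input, params) -> scalar model output
    params: list of tensors holding the pre-trained parameters theta_0 (require grad)
    """
    def flat_grad(x):
        out = f(x, params)                                   # scalar output f(x; theta_0)
        grads = torch.autograd.grad(out, params)
        return torch.cat([g.reshape(-1) for g in grads])

    return torch.dot(flat_grad(xi), flat_grad(xi_prime)).item()

# Toy usage with a tiny two-layer network standing in for the pre-trained model.
torch.manual_seed(0)
W1 = torch.randn(16, 8, requires_grad=True)
W2 = torch.randn(16, requires_grad=True)
f = lambda x, p: p[1] @ torch.tanh(p[0] @ x)                 # scalar "logit" for input x
x_a, x_b = torch.randn(8), torch.randn(8)
print(entk_entry(f, [W1, W2], x_a, x_b))
```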
The current paper furthers an empirical and theoretical
understanding of the pre-training and fine-tuning (FT)
paradigm for NLP tasks. Our contributions are:
1. We formally extend the standard NTK theory developed for gradient descent to characterize kernel-based dynamics when training with Adam. We propose and rigorously prove the correctness of a new kernel formula relying on the sign of the gradient to describe early-stage training (e.g., fine-tuning) with Adam (Section 4).

2. We formally extend infinite-width analysis to account for a pre-trained initialization and characterize conditions under which fine-tuning can exhibit kernel behavior. Using insights into the importance of prompting, we formally prove the existence of a rigorous mechanism through which prompt-based FT of complex architectures (e.g., Transformers) can exhibit kernel behavior (Section 5). Analysis proceeds in the context of networks whose widths go to infinity (i.e., through the Tensor Programs framework), but unlike standard NTK theory, it allows a non-random initialization (i.e., one that results from pre-training).

3. We perform an extensive empirical analysis on 14 diverse NLP tasks to reveal when and to what extent fine-tuning exhibits kernel behavior. We find that using a meaningful prompt is crucial for the eNTK to achieve good performance, suggesting that prompting induces a well-characterized optimization benefit for fine-tuning. Further experiments reveal that the trajectory of prompt-based FT can often be described by kernel-based dynamics when the eNTK succeeds (Section 6).

4. We straightforwardly apply the kernel view of FT dynamics to formally analyze the success of fine-tuning methods that update in a low-rank subspace of model parameters (e.g., LoRA; Hu et al., 2021). These results in Section 7 highlight how a kernel-based understanding of FT can aid in the practical design and theoretical analysis of efficient variants.
2. Related Work
Kernel view of training. The infinite-width limit is a well-studied theoretical model for deep network optimization. Jacot et al. (2018) introduced NTK to capture training a deep and infinitely wide neural network from a random initialization. Subsequent experiments showed that the kernels underperformed for standard tasks (Arora et al., 2019b) but performed well on small datasets (i.e., hundreds of examples) (Arora et al., 2020). Many works (Allen-Zhu et al., 2019a;b; Arora et al., 2019a; Du et al., 2019b;a; Li & Liang, 2018; Zou et al., 2018; Cao & Gu, 2019) have since applied this lens to understand the optimization and generalization behavior of deep networks. However, such analyses do not directly apply to the pre-training and fine-tuning framework because (1) the network trained during FT is inherited and non-random; and (2) LMs are often trained with Adam, and the NTK formula only describes training an infinitely wide network with SGD. In this work, we handle a non-random (i.e., pre-trained) initialization by assuming that the pre-training task is sufficiently related to the downstream task (Definition 5.3), and we derive new kernels to model early-stage training with Adam (Section 4).
Theory of self-supervised learning and transfer learning. Several existing theoretical works on transfer learning study the performance of linear probing on a representation to provide guarantees on various tasks related to the original training data (Du et al., 2021; Tripuraneni et al., 2020; Wu et al., 2020). Chua et al. (2021) show that regularized fine-tuning in a meta-learning setting exhibits kernel behavior if the pre-training and downstream tasks are closely related. Along similar lines, Mu et al. (2020); Maddox et al. (2021); Achille et al. (2021) suggest through experiments and theory that gradient-based features, corresponding to a linearization of fine-tuning, can perform well on visual downstream tasks. We characterize when kernel dynamics describe fine-tuning a pre-trained masked language model on downstream language understanding tasks.

Saunshi et al. (2021) study autoregressive language models to rigorously characterize why prompting can improve zero-shot task performance, but their analysis precludes an investigation of FT. We focus on the masked language model pre-training objective, but it is worth noting that there are many works (Saunshi et al., 2019; Tosh et al., 2021a;b; Lee et al., 2021; Tsai et al., 2021) studying transfer when pre-training with a contrastive objective. However, experiments on language modeling (Abnar et al., 2021) and contrastive learning (Saunshi et al., 2022) recently demonstrated that properties of transfer between self-supervised pre-training and supervised FT cannot be fully captured by model-agnostic analyses that directly relate the pre-training and downstream task errors. Kernel theory provides a principled optimization- and architecture-aware framework to analyze FT.
Optimization of Transformers. Several works (Zhang et al., 2020; Liu et al., 2020a; Li et al., 2022) have documented issues with optimizing Transformer-based architectures with SGD instead of Adam. To study the unique properties of optimizing Transformers with Adam, we derive a new kernel formula (Theorem 4.3) to capture early-stage training with Adam. Table 2 compares the performance of this kernel to FT with Adam and SGD.
Variants of fine-tuning methods. A standard way of fine-tuning pre-trained LMs as introduced in (Radford et al., 2018; Devlin et al., 2019) is to add a linear classifier on top of a pre-trained encoder and update all the parameters together. Subsequent work (Schick & Schütze, 2021; Gao et al., 2021) formulated downstream tasks as a language modeling problem (i.e., prompt-based FT) and demonstrated empirical success in low-data scenarios (see Liu et al. (2022) for a comprehensive survey). Another line of research studies parameter-efficient fine-tuning methods in which only a subset of model parameters are updated (Lester et al., 2021; Ben Zaken et al., 2022; Li & Liang, 2021) or the parameter updates are restricted to a low-dimensional subspace (Hu et al., 2021; Aghajanyan et al., 2021). We show that good eNTK performance arises only when studying prompt-based FT in Section 6 (Figure 1), and we later show in Section 7 that subspace-based FT methods such as LoRA (Hu et al., 2021) have a simple interpretation through the kernel.
3. Preliminaries
3.1. Pre-Training and Fine-Tuning Paradigm
We focus our attention on masked language models (MLMs), such as BERT (Devlin et al., 2019) and RoBERTa (Liu et al., 2020b), which are trained to minimize the cross-entropy loss on independently predicting masked tokens (i.e., a |V|-way classification task, where V is the vocabulary). Given a text input s of length T from the pre-training distribution S_PT, replace a small percentage (e.g., 15%) of tokens with [MASK] tokens. This masked input is then fed into the representation function h : S_PT → R^{T×n} (e.g., a Transformer encoder) to produce a low-dimensional contextual embedding for each position in the input. The contextual embeddings are independently multiplied by a classifier head (i.e., word embeddings) Φ ∈ R^{n×|V|} to produce logits that will be used to compute the probability of a token filling each masked position.

Using a pre-trained model to solve downstream tasks effectively has been a highly active area of research. We focus on fine-tuning (FT) methods, which adapt the pre-trained model to a new input distribution S_FT using additional training on the C-way downstream classification task.
1. Standard FT (Devlin et al., 2019; Liu et al., 2020b): To solve a C-way downstream classification task, initialize and learn² a new classifier head Γ : R^n → R^C on top of the contextual [CLS] embedding, denoted h_[CLS]. In this case, the model output f : S_FT → R^C for the eNTK construction is f(s) = Γ(h_[CLS](s)).

2. Prompt-based FT (Schick & Schütze, 2021; Gao et al., 2021): Add a natural language prompt (e.g., "This is [MASK].") in addition to the downstream task input, and use the pre-trained MLM to fill in the masked token. Compute the logits over task-relevant words (e.g., "great" and "terrible") using the corresponding columns of Φ, denoted Φ̃ ∈ R^{n×C}. These logits will serve as surrogates to solve the downstream task. In this case, the model output f : S_FT → R^C for the eNTK construction is f(s) = Φ̃^⊤ h_[MASK](s) (both outputs are sketched in code below).

² In our experiments, Standard FT corresponds to initializing Γ at the linear probing solution (i.e., training Γ on the downstream task while freezing all other layers) and then performing FT. We do this because when FT exhibits kernel behavior (Definition 3.2), it finds solutions close to initialization, and we hypothesize that the Γ learned during FT is closer to the linear probing solution than a random initialization.
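The sketch below illustrates the two model outputs used for the eNTK construction. It uses a small placeholder encoder and randomly initialized head in place of a real pre-trained MLM, so the shapes and label-word indices are illustrative assumptions rather than the exact pipeline used in our experiments.

```python
import torch
import torch.nn as nn

# Placeholder encoder standing in for a pre-trained Transformer: it maps a batch of
# token ids (B, T) to contextual embeddings (B, T, n). Hidden size n and vocab |V|
# are kept tiny purely for illustration.
n, vocab_size, C = 32, 100, 2
encoder = nn.Sequential(
    nn.Embedding(vocab_size, n),
    nn.TransformerEncoderLayer(n, nhead=4, batch_first=True),
)
Phi = torch.randn(n, vocab_size)              # stand-in for the word-embedding head (n, |V|)

def standard_ft_output(input_ids, Gamma):
    """Standard FT: f(s) = Gamma(h_[CLS](s)), with Gamma a new C-way classifier head."""
    h = encoder(input_ids)                    # (B, T, n)
    return Gamma(h[:, 0])                     # assume position 0 holds the [CLS] token

def prompt_ft_output(input_ids, mask_pos, label_word_ids):
    """Prompt-based FT: f(s) = Phi_tilde^T h_[MASK](s), logits over task label words."""
    h = encoder(input_ids)                    # (B, T, n)
    h_mask = h[torch.arange(h.size(0)), mask_pos]   # (B, n), embedding at [MASK]
    Phi_tilde = Phi[:, label_word_ids]        # (n, C) columns, e.g. "great"/"terrible"
    return h_mask @ Phi_tilde                 # (B, C)

# Toy usage.
ids = torch.randint(0, vocab_size, (1, 8))
Gamma = nn.Linear(n, C)
print(standard_ft_output(ids, Gamma).shape)                              # torch.Size([1, 2])
print(prompt_ft_output(ids, torch.tensor([3]), torch.tensor([7, 8])).shape)
```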
3.2. Kernel Behavior
We consider a neural network f(ξ; θ) that takes input ξ and computes a scalar output³ using θ as the parameters. Gradient-based updates to the model parameters involve computing a loss function ℓ and its gradient ∇_θ ℓ, which can be decomposed by the chain rule as (∂ℓ/∂f) · (∂f/∂θ). The first term is defined as the output derivative (Definition 3.1), and the second term is used to define kernel behavior (Definition 3.2).

Definition 3.1 (Output Derivative). The output derivative χ(ξ, y, θ) for a network f with parameters θ, loss function ℓ, and input ξ with label y is defined as χ(ξ, y, θ) = ∂ℓ(f(ξ; θ), y) / ∂f. We also define the output derivative applied at time t as χ_t = χ(ξ_t, y_t, θ_{t−1}), where ξ_t is the input at time t with label y_t. For ease of notation, we often absorb y into ξ and write χ(ξ, θ) and χ(ξ, f) interchangeably.
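As a concrete instance (an illustration, not part of the definition): for the softmax cross-entropy loss used in C-way classification, the output derivative is the softmax probability vector minus the one-hot label, which the snippet below verifies numerically against autograd.

```python
import torch
import torch.nn.functional as F

# Output derivative chi = d loss / d f for softmax cross-entropy, checked against autograd.
logits = torch.randn(3, requires_grad=True)            # f(xi; theta) for a 3-way task
y = torch.tensor(1)                                    # label

loss = F.cross_entropy(logits.unsqueeze(0), y.unsqueeze(0))
loss.backward()

chi_closed_form = F.softmax(logits.detach(), dim=-1) - F.one_hot(y, num_classes=3)
print(torch.allclose(logits.grad, chi_closed_form, atol=1e-6))   # True
```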
Below, we adapt the definition of kernel-based learning (i.e., the lazy regime in Woodworth et al. (2020)) to an arbitrary initialization.

Definition 3.2 (Kernel Behavior). Let θ_t be the parameters after t steps of training by a gradient-based optimization algorithm, and let ξ be an arbitrary fixed input. We say this training process of the network demonstrates kernel behavior if the following properties are satisfied.

1. Linearization: The change of the network can be well-approximated by its first order Taylor expansion, i.e., f(ξ; θ_t) − f(ξ; θ_{t−1}) ≈ ⟨∇f(ξ; θ_{t−1}), θ_t − θ_{t−1}⟩;

2. Fixed Features: The gradient at step t is approximately the same as before training, i.e., ∇f(ξ; θ_t) ≈ ∇f(ξ; θ_0).

∇f denotes the gradient of f w.r.t. θ. "Closeness to kernel behavior" is quantified using the difference in the quantities on the two sides of the ≈ symbol. We formalize the approximations in Definition C.3.

³ Note that for C-way classification, f outputs a vector in R^C. We say f exhibits kernel behavior if the Linearization and Fixed Features properties hold for every component of f. The subsequent definition of a kernel analog also generalizes to a vector output, where ν_t is a vector in R^C and K^(A)(ξ, ξ_t) is a matrix in R^{C×C}.
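One simple way to probe the two properties empirically (a diagnostic sketch of our own, not the formal measurement in Definition C.3) is to compare the actual function change with its first-order prediction and to track how similar the current gradient is to the initial one. The helper below assumes a scalar-output callable f(input, params) whose parameters require gradients, such as the toy model in the eNTK sketch above.

```python
import torch

def flat_grad(f, params, xi):
    """Flattened gradient of the scalar output f(xi; params) w.r.t. all parameters."""
    grads = torch.autograd.grad(f(xi, params), params)
    return torch.cat([g.reshape(-1) for g in grads])

def kernel_behavior_diagnostics(f, params_prev, params_curr, params_init, xi):
    """Returns (linearization error, gradient cosine similarity) for one input xi."""
    delta = torch.cat([(c - p).reshape(-1) for c, p in zip(params_curr, params_prev)])
    g_prev = flat_grad(f, params_prev, xi)
    actual_change = f(xi, params_curr) - f(xi, params_prev)
    predicted_change = torch.dot(g_prev, delta)            # first-order Taylor term
    lin_err = (actual_change - predicted_change).abs().item()

    g_curr = flat_grad(f, params_curr, xi)
    g_init = flat_grad(f, params_init, xi)
    cos = torch.nn.functional.cosine_similarity(g_curr, g_init, dim=0).item()
    return lin_err, cos
```

Small linearization error and cosine similarity near 1 across training steps correspond to the Linearization and Fixed Features properties, respectively.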
Past work has shown that if gradient-based training exhibits kernel behavior, then the function change can be expressed in terms of a fixed kernel (i.e., the kernel analog).

Definition 3.3 (Kernel Analog). Suppose optimization of the parameters θ of a model f using the gradient-based update algorithm A exhibits kernel behavior (Definition 3.2). Then, we say that a kernel K^(A) is the kernel analog of the optimization algorithm A if for every t > 0, there exists ν_t such that for any input ξ,

f(ξ; θ_t) − f(ξ; θ_{t−1}) ≈ −ν_t K^(A)(ξ, ξ_t)    (2)

where ξ_t is the training input⁴ of step t, and θ_t is the parameter after step t.

⁴ For simplicity, we assume the batch size is 1.
We illustrate the connection between the kernel analog and kernel behavior when using SGD. If SGD exhibits kernel behavior, then, for a fixed input ξ, we can write

f(ξ; θ_t) − f(ξ; θ_{t−1}) ≈ ⟨∇f(ξ; θ_{t−1}), θ_t − θ_{t−1}⟩
                         = ⟨∇f(ξ; θ_{t−1}), −η χ_t ∇f(ξ_t; θ_{t−1})⟩
                         ≈ −η χ_t K^(SGD)(ξ, ξ_t)

where the approximations follow from the Linearization and Fixed Features properties, respectively, η is the learning rate, χ_t is the output derivative (Definition 3.1), and K^(SGD) is the kernel analog of SGD with ν_t = η χ_t. Notably, K^(SGD) is the well-known neural tangent kernel (NTK) formula derived in Jacot et al. (2018), which represents an input ξ as the resulting gradient ∇f(ξ; θ_0).

Definition 3.4 (Neural Tangent Kernel K^(SGD)).

K^(SGD)(ξ, ξ′) = ⟨∇f(ξ; θ_0), ∇f(ξ′; θ_0)⟩
Given a kernel K, one can solve a classification task by learning α_i to minimize the empirical risk of Σ_i α_i K(·, ξ_i), where {ξ_i} is the training data (Appendix A). If training exhibits kernel behavior and K is the kernel analog for the optimizer, then solving the kernel regression problem is equivalent to training the network (Jacot et al., 2018).
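As a concrete illustration of this step, here is a minimal kernel ridge regression sketch, assuming a precomputed symmetric kernel matrix and one-hot targets; our actual eNTK solver and regularization choices are described in Appendix A.

```python
import torch

def kernel_ridge_fit(K_train, Y_train, reg=1e-3):
    """Solve for coefficients alpha in the kernel predictor sum_i alpha_i K(., xi_i).

    K_train: (N, N) kernel matrix over training inputs
    Y_train: (N, C) one-hot targets
    """
    N = K_train.size(0)
    return torch.linalg.solve(K_train + reg * torch.eye(N), Y_train)     # (N, C)

def kernel_predict(K_test_train, alpha):
    """K_test_train: (M, N) kernel values between test and training inputs."""
    return K_test_train @ alpha                                          # (M, C) scores

# Toy usage with a random positive semi-definite kernel.
G = torch.randn(10, 5)
K = G @ G.T                                   # stand-in for an eNTK over 10 examples
Y = torch.nn.functional.one_hot(torch.randint(0, 2, (10,)), 2).float()
alpha = kernel_ridge_fit(K, Y)
print(kernel_predict(K, alpha).argmax(dim=-1))
```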
In Section 4, we derive the kernel analog for SignGD (i.e., an early-stage approximation of Adam), and in Section 6, we compare its eNTK performance against Adam FT. The eNTK computation relies on two design choices for the setting: (1) what the model output f(ξ; θ) is, and (2) which optimizer A is used. We choose f based on the FT setting (Section 3.1) and A as SGD or Adam.
4. Kernel Derivation for Adam
Computing the eNTK requires using the kernel analog (Definition 3.3) of the chosen optimization algorithm A. However, it is difficult to construct a long-term kernel analog for Adam, because the adaptivity causes each update to depend on the entire gradient history. Previous work has shown that in the early stages of training, full-batch (Ma et al., 2022) and mini-batch (Malladi et al., 2022) Adam with a small learning rate compute the moving averages for the moment estimates in a small neighborhood, so the Adam update reduces to coordinate-wise normalization of the gradient. This optimization algorithm is called SignGD.

Definition 4.1 (SignGD). SignGD is a gradient-based optimization algorithm that updates parameters as θ_t = θ_{t−1} − η · sign(∇ℓ(ξ_t; θ_{t−1})), where sign is applied element-wise.
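A minimal sketch of the SignGD update from Definition 4.1 as an optimizer step (our illustration; a real run would wrap this in the usual training loop):

```python
import torch

@torch.no_grad()
def signgd_step(params, lr=1e-3):
    """Apply one SignGD update: theta <- theta - lr * sign(grad), element-wise.

    Assumes .grad has already been populated by loss.backward().
    """
    for p in params:
        if p.grad is not None:
            p.add_(p.grad.sign(), alpha=-lr)

# Toy usage on a linear model.
w = torch.randn(4, requires_grad=True)
x, y = torch.randn(4), torch.tensor(1.0)
loss = (w @ x - y) ** 2
loss.backward()
signgd_step([w])
```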
In Table 10, we provide empirical evidence that fine-tuning with SignGD yields comparable performance to Adam.⁵ We define the sign-based kernel below and prove it to be the correct kernel analog for SignGD.

Definition 4.2 (Asymmetric SignGD Kernel).

K^(A-SignGD)(ξ, ξ′) = ⟨∇f(ξ; θ_0), sign(∇f(ξ′; θ_0))⟩.

Theorem 4.3 (Informal version of Theorem C.4). If a network is trained with SignGD and exhibits kernel behavior (Definition 3.2), then the training dynamics follow

f(ξ; θ_t) − f(ξ; θ_{t−1}) ≈ −η sign(χ_t) K^(A-SignGD)(ξ, ξ_t),

where χ_t is the output derivative (Definition 3.1).
Proof sketch. The Linearization property in Definition 3.2 implies that

f(ξ; θ_t) − f(ξ; θ_{t−1}) ≈ ⟨∇f(ξ; θ_{t−1}), θ_t − θ_{t−1}⟩
                         = −η sign(χ_t) ⟨∇f(ξ; θ_{t−1}), sign(∇f(ξ_t; θ_{t−1}))⟩.

Then, by the Fixed Features property in Definition 3.2,

⟨∇f(ξ; θ_{t−1}), sign(∇f(ξ_t; θ_{t−1}))⟩ ≈ ⟨∇f(ξ; θ_0), sign(∇f(ξ_t; θ_0))⟩ = K^(A-SignGD)(ξ, ξ_t).
We solve the asymmetric kernel regression as suggested in He et al. (2022), but the difficulties of solving the kernel regression problem with an asymmetric kernel (Appendix A.3) motivate us to also use the symmetric SignGD kernel.

Definition 4.4 (SignGD Kernel).

K^(SignGD)(ξ, ξ′) = ⟨sign(∇f(ξ; θ_0)), sign(∇f(ξ′; θ_0))⟩

Unlike the standard NTK formula for SGD, the kernel analog for Adam uses the sign function because early-stage Adam dynamics are agnostic to the scales of the gradients. Concurrent work in Littwin & Yang (2023) more formally extends the Tensor Programs framework and finds that no kernel can describe general (e.g., late-stage) Adam training when the batch size is large.

⁵ Sign-based optimizers have also shown success in vision tasks (Chen et al., 2022).
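The following sketch evaluates the two sign-based kernels from Definitions 4.2 and 4.4, reusing the flattened-gradient idea from the eNTK sketch above. It is illustrative only; the released kernels are computed with a more scalable pipeline.

```python
import torch

def flat_grad(f, params, xi):
    """Flattened gradient of the scalar output f(xi; params) w.r.t. the parameters."""
    grads = torch.autograd.grad(f(xi, params), params)
    return torch.cat([g.reshape(-1) for g in grads])

def a_signgd_kernel(f, params, xi, xi_prime):
    """Asymmetric SignGD kernel: <grad f(xi), sign(grad f(xi'))> at theta_0."""
    return torch.dot(flat_grad(f, params, xi), flat_grad(f, params, xi_prime).sign())

def signgd_kernel(f, params, xi, xi_prime):
    """Symmetric SignGD kernel: <sign(grad f(xi)), sign(grad f(xi'))> at theta_0."""
    return torch.dot(flat_grad(f, params, xi).sign(), flat_grad(f, params, xi_prime).sign())
```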
5. Theory: Prompt-Based Fine-Tuning Can
Exhibit Kernel Behavior
We give a plausible mechanism for how prompt-based FT
can exhibit kernel behavior (Definition 3.2) as the network
width grows large. We start by formalizing how changing
the architecture width impacts pre-training.
Definition 5.1 (Pre-Training Scheme). A pre-training scheme (X, A, F_n) with width n contains the dataset X, optimizer A and its hyperparameters, and a model architecture F_n. Let f_n ← (X, A, F_n) denote a model resulting from training the architecture F_n on the dataset X with optimizer A.

Remark 5.2. The reliance of the architecture on the width is given by Tensor Programs (Yang, 2020a): for example, in a Transformer, n corresponds to the embedding dimension.
We now connect pre-training to the downstream task. Analogous to Saunshi et al. (2021), we reason that prompting transforms the downstream task into a fill-in-the-blank problem, and thus the downstream task can be viewed as a sub-case of the pre-training task. We then assume that a wider pre-trained network will be better at filling in masked tokens and that an infinitely wide pre-trained network can solve the downstream task perfectly when using a suitable prompt.

Definition 5.3 (Natural Task in the Infinite-Width Limit). A downstream task Ξ is natural with respect to a pre-training scheme (X, A, F_n) if, for any pre-trained model f_n ← (X, A, F_n) and any downstream example (ξ, y) ∈ Ξ,

lim_{n→∞} χ(ξ, y, f_n) = 0,    (3)

where χ is the output derivative (Definition 3.1).
Remark 5.4. Experiments in Section 6 and Appendix B.2 suggest that the FT optimization dynamics depend on the choice of prompt. In the above notation, the prompt is included in the downstream task dataset Ξ. Only tasks with a well-suited prompt can be natural in the infinite-width limit. Tasks solved by FT using a randomly initialized head cannot satisfy the condition, since χ will not vanish even for an infinitely wide pre-trained network at the start of FT.
Although Definition 5.3 is asymptotic, we design a cheap empirical test using two models of different widths n_1 ≠ n_2 and the same depth resulting from otherwise identical pre-training schemes: f_{n_1} ← (X, A, F_{n_1}) and f_{n_2} ← (X, A, F_{n_2}). We measure if χ decreases with width for every downstream example (ξ, y) ∈ Ξ without making any gradient updates. This is necessary but not sufficient for the task to be natural in the infinite-width limit. See Appendix B.1.
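A sketch of this check under our simplified reading (the concrete measurement protocol is in Appendix B.1): compare the magnitude of the output derivative χ across the two widths on each downstream example, with no parameter updates. The model_small and model_large callables, which map a prompted input to C logits, and the examples iterator are placeholders.

```python
import torch
import torch.nn.functional as F

def output_derivative(logits, label):
    """chi = d cross-entropy / d logits = softmax(logits) - one_hot(label)."""
    return F.softmax(logits, dim=-1) - F.one_hot(label, logits.size(-1)).float()

@torch.no_grad()
def width_test(model_small, model_large, examples):
    """Check whether |chi| shrinks with width on every downstream example."""
    shrinks = []
    for xi, y in examples:                   # prompted downstream inputs and labels
        chi_small = output_derivative(model_small(xi), y).norm()
        chi_large = output_derivative(model_large(xi), y).norm()
        shrinks.append(bool(chi_large < chi_small))
    return all(shrinks)                      # necessary, not sufficient, for "natural"
```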
To study the behavior of FT, one also needs to make assumptions about parameters that resulted from pre-training. We assume that the network can be written as a Tensor Program (Yang, 2019; 2020a;b), which is sufficiently general to allow our theory to describe many complex architectures (e.g., Transformers). To allow the analysis to proceed by way of Tensor Programs, the network must be (1) stable: its output does not grow with width (i.e., the infinite-width limit is meaningful), and (2) non-trivial: its output can be updated during fine-tuning (i.e., learning can occur).

Theorem 5.5 (Informal version of Theorem C.5). Assume the downstream task Ξ is natural in the infinite-width limit with respect to a pre-training scheme (X, A, F_n), and the model f ← (X, A, F_n) is stable, non-trivial, and can be written as a Tensor Program. Then prompt-based FT of f will exhibit the Linearization and Fixed Features properties of kernel behavior (Definition 3.2).

The theorem formalizes the intuition that if the pre-trained network is already decent at solving the downstream task, the network needs to only mildly adapt to solve the downstream task. Notably, we extend standard NTK theory to account for an arbitrary initialization and to characterize early-stage training with Adam using results from Section 4. Our theoretical results in this section and Section 4 apply to autoregressive and masked language models (MLMs), but we limit our fine-tuning experiments to MLMs as they are known to perform better after fine-tuning.
6. Experiments
We compute the eNTK as described in Section 3for differ-
ent optimization algorithms and FT settings. eNTK perfor-
mance being comparable to FT performance is a necessary
but not sufficient condition for FT to exhibit kernel behavior
(Definition 3.2), so we also directly measure if the Lin-
earization and Fixed Features properties hold (Section 6.2).
If the eNTK can solve the task, then eNTK regression pro-
vides an alternate method
6
to use the pre-trained model to
solve a downstream task, but the kernel lens only admits
a theoretical analysis of FT optimization dynamics if both
properties of kernel behavior are satisfied (Definition 3.2;
see Section 6.2). For tasks that the eNTK cannot solve,
we conjecture that the prompt is not well-designed for the
task (in the sense of Definition 5.3), forcing the pre-trained
model to adapt more during FT.
Our experiments are in the few-shot setting with manual
prompt templates from Gao et al. (2021). We consider 14
NLP tasks, divided into 8 single sentence and 6 sentence pair
datasets, which cover: sentiment analysis (SST-2, SST-5,
6
The eNTK is not as susceptible to noisy gradients as FT is,
because the learned kernel coefficients can downweight anomalous
examples. This stability sometimes allows the kernel to outper-
form FT, especially in the few-shot setting (see MR and Subj in
Table 2a).