
paradigm for NLP tasks. Our contributions are:
1. We formally extend the standard NTK theory developed for gradient descent to characterize kernel-based dynamics when training with Adam. We propose, and rigorously prove the correctness of, a new kernel formula relying on the sign of the gradient to describe early-stage training (e.g., fine-tuning) with Adam (Section 4); an illustrative sketch follows this list.
2. We formally extend infinite-width analysis to account for a pre-trained initialization and characterize conditions under which fine-tuning can exhibit kernel behavior. Using insights into the importance of prompting, we formally prove the existence of a rigorous mechanism through which prompt-based FT of complex architectures (e.g., Transformers) can exhibit kernel behavior (Section 5). The analysis proceeds in the infinite-width limit (via the Tensor Programs framework), but unlike standard NTK theory, it allows a non-random initialization (i.e., one that results from pre-training).
3. We perform an extensive empirical analysis on 14 diverse NLP tasks to reveal when and to what extent fine-tuning exhibits kernel behavior. We find that using a meaningful prompt is crucial for the eNTK to achieve good performance, suggesting that prompting induces a well-characterized optimization benefit for fine-tuning. Further experiments reveal that the trajectory of prompt-based FT can often be described by kernel-based dynamics when the eNTK succeeds (Section 6).
4.
We straightforwardly apply the kernel view of FT
dynamics to formally analyze the success of fine-
tuning methods that update in a low-rank subspace
of model parameters (e.g., LoRA, (Hu et al.,2021)).
These results in Section 7highlight how a kernel-
based understanding of FT can aid in the practical
design and theoretical analysis of efficient variants.
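To make these kernels concrete, the following is a minimal sketch in JAX of how one can compute entries of the standard empirical NTK, i.e., inner products of parameter gradients corresponding to SGD dynamics, together with a sign-of-the-gradient variant in the spirit of the Adam analysis above. The toy model, its parameters, and the exact sign-based form are illustrative assumptions for exposition, not the definitions used in Section 4.

```python
# Minimal sketch (for illustration only): empirical-NTK-style kernel entries for a
# scalar-output model f(params, x). The toy model, parameters, and the exact
# sign-based form are assumptions; the kernel analyzed in Section 4 is defined there.
import jax
import jax.numpy as jnp


def flat_grad(f, params, x):
    """Gradient of the scalar output f(params, x) w.r.t. all parameters, flattened."""
    grads = jax.grad(lambda p: f(p, x))(params)
    return jnp.concatenate([jnp.ravel(g) for g in jax.tree_util.tree_leaves(grads)])


def entk(f, params, x1, x2):
    """Standard empirical NTK entry: inner product of parameter gradients (SGD dynamics)."""
    return flat_grad(f, params, x1) @ flat_grad(f, params, x2)


def sign_entk(f, params, x1, x2):
    """A sign-of-the-gradient variant (asymmetric in its two arguments)."""
    return flat_grad(f, params, x1) @ jnp.sign(flat_grad(f, params, x2))


if __name__ == "__main__":
    # Toy two-layer model purely for demonstration.
    params = {"W": 0.1 * jnp.ones((4, 8)), "v": 0.1 * jnp.ones((8,))}
    f = lambda p, x: p["v"] @ jnp.tanh(x @ p["W"])
    x1, x2 = jnp.ones(4), jnp.arange(4.0)
    print(float(entk(f, params, x1, x2)), float(sign_entk(f, params, x1, x2)))
```

Note that, in this construction, replacing one gradient with its sign makes the kernel asymmetric, unlike the standard eNTK.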
2. Related Work
Kernel view of training. The infinite-width limit is a well-studied theoretical model for deep network optimization. Jacot et al. (2018) introduced the NTK to capture training a deep and infinitely wide neural network from a random initialization. Subsequent experiments showed that the kernels underperformed on standard tasks (Arora et al., 2019b) but performed well on small datasets (i.e., hundreds of examples) (Arora et al., 2020). Many works (Allen-Zhu et al., 2019a;b; Arora et al., 2019a; Du et al., 2019a;b; Li & Liang, 2018; Zou et al., 2018; Cao & Gu, 2019) have since applied this lens to understand the optimization and generalization behavior of deep networks. However, such analyses do not directly apply to the pre-training and fine-tuning framework because (1) the network trained during FT is inherited and non-random; and (2) LMs are often trained with Adam, whereas the NTK formula only describes training an infinitely wide network with SGD. In this work, we handle a non-random (i.e., pre-trained) initialization by assuming that the pre-training task is sufficiently related to the downstream task (Definition 5.3), and we derive new kernels to model early-stage training with Adam (Section 4).
Theory of self-supervised learning and transfer learning. Several existing theoretical works on transfer learning study the performance of linear probing on a representation to provide guarantees on various tasks related to the original training data (Du et al., 2021; Tripuraneni et al., 2020; Wu et al., 2020). Chua et al. (2021) show that regularized fine-tuning in a meta-learning setting exhibits kernel behavior if the pre-training and downstream tasks are closely related. Along similar lines, Mu et al. (2020); Maddox et al. (2021); Achille et al. (2021) suggest through experiments and theory that gradient-based features, corresponding to a linearization of fine-tuning, can perform well on visual downstream tasks. We characterize when kernel dynamics describe fine-tuning a pre-trained masked language model on downstream language understanding tasks.
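For concreteness (notation ours, not from the works above), the linearization referred to here is the first-order expansion of the model output around the pre-trained parameters $\theta_0$, under which the parameter gradient at $\theta_0$ acts as a fixed feature map and fine-tuning with a convex loss reduces to a kernel method with the empirical NTK:
$$
f(\xi; \theta) \approx f(\xi; \theta_0) + \langle \nabla_\theta f(\xi; \theta_0),\, \theta - \theta_0 \rangle,
\qquad
K(\xi, \xi') = \langle \nabla_\theta f(\xi; \theta_0),\, \nabla_\theta f(\xi'; \theta_0) \rangle .
$$
This is the sense in which gradient-based features can serve as a proxy for fine-tuning.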
Saunshi et al. (2021) study autoregressive language models to rigorously characterize why prompting can improve zero-shot task performance, but their analysis precludes an investigation of FT. We focus on the masked language model pre-training objective, but it is worth noting that there are many works (Saunshi et al., 2019; Tosh et al., 2021a;b; Lee et al., 2021; Tsai et al., 2021) studying transfer when pre-training with a contrastive objective. However, experiments on language modeling (Abnar et al., 2021) and contrastive learning (Saunshi et al., 2022) recently demonstrated that properties of transfer between self-supervised pre-training and supervised FT cannot be fully captured by model-agnostic analyses that directly relate the pre-training and downstream task errors. Kernel theory provides a principled optimization- and architecture-aware framework to analyze FT.
Optimization of Transformers. Several works (Zhang et al., 2020; Liu et al., 2020a; Li et al., 2022) have documented issues with optimizing Transformer-based architectures with SGD instead of Adam. To study the unique properties of optimizing Transformers with Adam, we derive a new kernel formula (Theorem 4.3) to capture early-stage training with Adam. Table 2 compares the performance of this kernel to FT with Adam and SGD.
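As rough intuition for why early Adam steps admit a sign-based description (the precise statement is Theorem 4.3), consider the first Adam update with moment estimates initialized to zero: after bias correction, $\hat m_1 = g_1$ and $\hat v_1 = g_1^2$ coordinate-wise, so
$$
\theta_1 = \theta_0 - \eta\, \frac{\hat m_1}{\sqrt{\hat v_1} + \epsilon}
        = \theta_0 - \eta\, \frac{g_1}{|g_1| + \epsilon}
        \approx \theta_0 - \eta\, \operatorname{sign}(g_1)
$$
for small $\epsilon$; that is, early updates depend on the gradient mainly through its sign, which motivates replacing the gradient inner product of the SGD kernel with a sign-based quantity.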
Variants of fine-tuning methods. A standard way of fine-tuning pre-trained LMs, as introduced in Radford et al. (2018) and Devlin et al. (2019), is to add a linear classifier on top of a pre-trained encoder and update all the parameters