
paradigm for NLP tasks. Our contributions are:
1. We formally extend the standard NTK theory developed for gradient descent to characterize kernel-based dynamics when training with Adam. We propose, and rigorously prove the correctness of, a new kernel formula relying on the sign of the gradient to describe early-stage training (e.g., fine-tuning) with Adam (Section 4); an illustrative sketch follows this list.
2. We formally extend infinite-width analysis to account for a pre-trained initialization and characterize conditions under which fine-tuning can exhibit kernel behavior. Using insights into the importance of prompting, we formally prove the existence of a rigorous mechanism through which prompt-based FT of complex architectures (e.g., Transformers) can exhibit kernel behavior (Section 5). The analysis proceeds in the infinite-width limit (via the Tensor Programs framework), but unlike standard NTK theory, it allows a non-random initialization (i.e., one that results from pre-training).
3. We perform an extensive empirical analysis on 14 diverse NLP tasks to reveal when and to what extent fine-tuning exhibits kernel behavior. We find that using a meaningful prompt is crucial for the eNTK to achieve good performance, suggesting that prompting induces a well-characterized optimization benefit for fine-tuning. Further experiments reveal that the trajectory of prompt-based FT can often be described by kernel-based dynamics when the eNTK succeeds (Section 6).
4.
We straightforwardly apply the kernel view of FT
dynamics to formally analyze the success of fine-
tuning methods that update in a low-rank subspace
of model parameters (e.g., LoRA, (Hu et al.,2021)).
These results in Section 7highlight how a kernel-
based understanding of FT can aid in the practical
design and theoretical analysis of efficient variants.
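To make these kernels concrete, the following is a minimal sketch in JAX of how one can compute entries of the standard empirical NTK, i.e., inner products of parameter gradients corresponding to SGD dynamics, together with a sign-of-the-gradient variant in the spirit of the Adam analysis above. The toy model, its parameters, and the exact sign-based form are illustrative assumptions for exposition, not the definitions used in Section 4.

```python
# Minimal sketch (for illustration only): empirical-NTK-style kernel entries for a
# scalar-output model f(params, x). The toy model, parameters, and the exact
# sign-based form are assumptions; the kernel analyzed in Section 4 is defined there.
import jax
import jax.numpy as jnp


def flat_grad(f, params, x):
    """Gradient of the scalar output f(params, x) w.r.t. all parameters, flattened."""
    grads = jax.grad(lambda p: f(p, x))(params)
    return jnp.concatenate([jnp.ravel(g) for g in jax.tree_util.tree_leaves(grads)])


def entk(f, params, x1, x2):
    """Standard empirical NTK entry: inner product of parameter gradients (SGD dynamics)."""
    return flat_grad(f, params, x1) @ flat_grad(f, params, x2)


def sign_entk(f, params, x1, x2):
    """A sign-of-the-gradient variant (asymmetric in its two arguments)."""
    return flat_grad(f, params, x1) @ jnp.sign(flat_grad(f, params, x2))


if __name__ == "__main__":
    # Toy two-layer model purely for demonstration.
    params = {"W": 0.1 * jnp.ones((4, 8)), "v": 0.1 * jnp.ones((8,))}
    f = lambda p, x: p["v"] @ jnp.tanh(x @ p["W"])
    x1, x2 = jnp.ones(4), jnp.arange(4.0)
    print(float(entk(f, params, x1, x2)), float(sign_entk(f, params, x1, x2)))
```

Note that, in this construction, replacing one gradient with its sign makes the kernel asymmetric, unlike the standard eNTK.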
2. Related Work
Kernel view of training. The infinite-width limit is a well-studied theoretical model for deep network optimization. Jacot et al. (2018) introduced the NTK to capture training a deep and infinitely wide neural network from a random initialization. Subsequent experiments showed that the kernels underperformed on standard tasks (Arora et al., 2019b) but performed well on small datasets (i.e., hundreds of examples) (Arora et al., 2020). Many works (Allen-Zhu et al., 2019a;b; Arora et al., 2019a; Du et al., 2019a;b; Li & Liang, 2018; Zou et al., 2018; Cao & Gu, 2019) have since applied this lens to understand the optimization and generalization behavior of deep networks. However, such analyses do not directly apply to the pre-training and fine-tuning framework because (1) the network trained during FT is inherited and non-random; and (2) LMs are often trained with Adam, whereas the NTK formula only describes training an infinitely wide network with SGD. In this work, we handle a non-random (i.e., pre-trained) initialization by assuming that the pre-training task is sufficiently related to the downstream task (Definition 5.3), and we derive new kernels to model early-stage training with Adam (Section 4).
Theory of self-supervised learning and transfer learning. Several existing theoretical works on transfer learning study the performance of linear probing on a representation to provide guarantees on various tasks related to the original training data (Du et al., 2021; Tripuraneni et al., 2020; Wu et al., 2020). Chua et al. (2021) show that regularized fine-tuning in a meta-learning setting exhibits kernel behavior if the pre-training and downstream tasks are closely related. Along similar lines, Mu et al. (2020); Maddox et al. (2021); Achille et al. (2021) suggest through experiments and theory that gradient-based features, corresponding to a linearization of fine-tuning, can perform well on visual downstream tasks. We characterize when kernel dynamics describe fine-tuning a pre-trained masked language model on downstream language understanding tasks.
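For concreteness (notation ours, not from the works above), the linearization referred to here is the first-order expansion of the model output around the pre-trained parameters $\theta_0$, under which the parameter gradient at $\theta_0$ acts as a fixed feature map and fine-tuning with a convex loss reduces to a kernel method with the empirical NTK:
$$
f(\xi; \theta) \approx f(\xi; \theta_0) + \langle \nabla_\theta f(\xi; \theta_0),\, \theta - \theta_0 \rangle,
\qquad
K(\xi, \xi') = \langle \nabla_\theta f(\xi; \theta_0),\, \nabla_\theta f(\xi'; \theta_0) \rangle .
$$
This is the sense in which gradient-based features can serve as a proxy for fine-tuning.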
Saunshi et al. (2021) study autoregressive language models to rigorously characterize why prompting can improve zero-shot task performance, but their analysis precludes an investigation of FT. We focus on the masked language model pre-training objective, but it is worth noting that there are many works (Saunshi et al., 2019; Tosh et al., 2021a;b; Lee et al., 2021; Tsai et al., 2021) studying transfer when pre-training with a contrastive objective. However, experiments on language modeling (Abnar et al., 2021) and contrastive learning (Saunshi et al., 2022) recently demonstrated that properties of transfer between self-supervised pre-training and supervised FT cannot be fully captured by model-agnostic analyses that directly relate the pre-training and downstream task errors. Kernel theory provides a principled optimization- and architecture-aware framework to analyze FT.
Optimization of Transformers. Several works (Zhang et al., 2020; Liu et al., 2020a; Li et al., 2022) have documented issues with optimizing Transformer-based architectures with SGD instead of Adam. To study the unique properties of optimizing Transformers with Adam, we derive a new kernel formula (Theorem 4.3) to capture early-stage training with Adam. Table 2 compares the performance of this kernel to FT with Adam and SGD.
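As rough intuition for why early Adam steps admit a sign-based description (the precise statement is Theorem 4.3), consider the first Adam update with moment estimates initialized to zero: after bias correction, $\hat m_1 = g_1$ and $\hat v_1 = g_1^2$ coordinate-wise, so
$$
\theta_1 = \theta_0 - \eta\, \frac{\hat m_1}{\sqrt{\hat v_1} + \epsilon}
        = \theta_0 - \eta\, \frac{g_1}{|g_1| + \epsilon}
        \approx \theta_0 - \eta\, \operatorname{sign}(g_1)
$$
for small $\epsilon$; that is, early updates depend on the gradient mainly through its sign, which motivates replacing the gradient inner product of the SGD kernel with a sign-based quantity.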
Variants of fine-tuning methods. A standard way of fine-tuning pre-trained LMs, as introduced in Radford et al. (2018) and Devlin et al. (2019), is to add a linear classifier on top of a pre-trained encoder and update all the parameters