Same Pre-training Loss, Better Downstream:
Implicit Bias Matters for Language Models
Hong Liu Sang Michael Xie Zhiyuan Li Tengyu Ma
Stanford University
{hliu99, sxie, zhiyuanli, tengyuma}@stanford.edu
October 26, 2022
Abstract
Language modeling on large-scale datasets leads to impressive performance gains on various downstream
language tasks. The validation pre-training loss (or perplexity in autoregressive language modeling) is
often used as the evaluation metric when developing language models since the pre-training loss tends to
be well-correlated with downstream performance (which is itself difficult to evaluate comprehensively).
Contrary to this conventional wisdom, this paper shows that 1) pre-training loss cannot fully explain
downstream performance and 2) flatness of the model is well-correlated with downstream performance
where pre-training loss is not. On simplified datasets, we identify three ways to produce models with the
same (statistically optimal) pre-training loss but different downstream performance: continuing pre-training
after convergence, increasing the model size, and changing the training algorithm. These experiments
demonstrate the existence of implicit bias of pre-training algorithms/optimizers—among models with the
same minimal pre-training loss, they implicitly prefer more transferable ones. Toward understanding this
implicit bias, we prove that SGD with standard mini-batch noise implicitly prefers flatter minima in language
models, and empirically observe a strong correlation between flatness and downstream performance among
models with the same minimal pre-training loss. We also prove in a synthetic language setting that among
the models with the minimal pre-training loss, the flattest model transfers to downstream tasks.
1 Introduction
Large language models (LLMs) pre-trained on internet-scale data have improved performance on a wide array
of downstream tasks [12, 64, 42, 43, 7]. These models are trained with a language modeling pre-training loss to
“fill in the blanks”—either predicting the next token/word (autoregressive language modeling loss, or perplexity)
or masked tokens (masked language modeling (MLM) loss).
In common practice, the validation pre-training loss is used to monitor the training process [7, 66] and to
compare different models, since the pre-training loss is generally strongly correlated with downstream performance [20].
Moreover, theoretical works on understanding LLMs also focus on how the pre-training loss affects
downstream performance. Saunshi et al. [46], Wei et al. [59], Xie et al. [62] show that good pre-training loss,
or fitting the language modeling conditional probability well, is a main reason for the downstream success of LLMs.
Their analyses generally treat the language models as black boxes and do not take into account how the models
represent the conditional probability.
In this paper, we question the conventional wisdom on the correlation between the validation pre-training loss
and downstream performance for language modeling. Recent works have demonstrated that models with different
architectures may have the same pre-training loss but different performance [47, 50]. Due to the expressivity of
modern neural nets, many parameter configurations even within the same architecture can still have the same pre-
training loss. A priori, it is unclear why all these configurations should have the same downstream performance.
We find that different parameter configurations with the same pre-training loss can indeed have different
downstream performance, especially when the pre-training loss reaches a near-optimal level. Concretely, using
simplified text datasets, we find three situations that demonstrate such a phenomenon:
• Even after the pre-training loss converges, models at a later time step still tend to perform better.
• Models trained by standard algorithms have better performance than adversarially trained models with the same pre-training loss.
• Larger models perform better downstream than smaller models, even if they have the same pre-training loss.
These situations are most prominent in the saturation regime, where the models are close to the minimal
possible pre-training loss (aka the entropy of the conditional probability, which can be estimated in our simplified
datasets). In the saturation regime, the pre-training loss of all models is almost the same, but the transferability
to downstream tasks varies. Interestingly, this phenomenon also holds when linear probing on contextualized
representations is used for evaluating downstream performance instead of finetuning. Thus, even though the
predicted conditional probabilities of two models are the same (and correct), the contextualized representations
can behave differently.
In each of the first two cases above, we find two models with the same pre-training loss and the same
architecture, but one has better performance than the other. They differ only in the training algorithms
used to produce them. Therefore, this suggests that the training algorithms have an implicit bias toward
one of these models—standard algorithms with more training steps bias toward parameter configurations
that transfer better to downstream tasks. The third case has a more subtle but similar interpretation. There
exists a hypothetical large model that represents the smaller model with worse downstream performance (by
filling zeros in the weights or replicating the weights of the smaller model). The training algorithm on the large
architecture could have chosen it, but did not. This suggests the algorithm has an implicit bias against the
hypothetical model (which has an equally good loss).
In supervised settings, optimizers are known to have an implicit bias toward selecting generalizable models
among all models with small empirical loss. E.g., see Damian et al. [11], Li et al. [34], which show that
SGD implicitly biases toward flatter minima, and references therein. However, the role of implicit bias in
self-supervised learning has not been studied and is conceptually different. Unlike in supervised learning, the
gap between empirical and population self-supervised losses is typically small, and thus implicit bias does
not seem to contribute to bridging this gap. Instead, the implicit bias selects local minima of the population
self-supervised loss that transfer better to downstream tasks.
Why do the algorithms bias toward some types of models? In Section 4, we provide a first-cut theoretical analysis
of the implicit bias in language modeling. Fortunately, despite the conceptual differences, mathematical tools
from supervised settings can be straightforwardly adapted to language modeling settings. We prove that mini-batch
SGD prefers flatter minima of the population pre-training loss among all minima in the saturation regime.
Interestingly, we obtain cleaner theoretical results for standard mini-batch SGD, without the artificial label noise
introduced in prior works [11, 34], partly because the mini-batch noise for LLMs does not vanish even at convergence.
We corroborate our theory with empirical evidence in Section 5. We show that for models with the same
pre-training loss in the three situations above, the flatness of the model (measured by the trace of the Hessian of
the loss, as predicted by the theory) strongly correlates with downstream performance.
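Concretely, the trace of the Hessian can be estimated without forming the Hessian explicitly, e.g., with Hutchinson's estimator via Hessian-vector products. The sketch below is a minimal illustration of this standard estimator, not the authors' evaluation code; `loss_fn` and `batch` are hypothetical placeholders.

```python
import torch

def hessian_trace(model, loss_fn, batch, n_samples=100):
    """Hutchinson's estimator: E_v[v^T H v] = tr(H) for Rademacher v."""
    params = [p for p in model.parameters() if p.requires_grad]
    loss = loss_fn(model, batch)
    # Keep the graph so we can differentiate the gradients again (HVPs).
    grads = torch.autograd.grad(loss, params, create_graph=True)

    estimates = []
    for _ in range(n_samples):
        # Rademacher probe vectors (entries +1 or -1 with equal probability).
        vs = [torch.randint_like(p, high=2) * 2.0 - 1.0 for p in params]
        # Hessian-vector product: gradient of <grads, v> w.r.t. the parameters.
        hvps = torch.autograd.grad(grads, params, grad_outputs=vs, retain_graph=True)
        estimates.append(sum((v * hvp).sum() for v, hvp in zip(vs, hvps)))
    return torch.stack(estimates).mean().item()
```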
Finally, to complement the theory and experiments above, we also rigorously formalize the connection
between flatness and downstream performance in a simplified Dyck language setting in Section 6. In this
setting, we prove that there are many models with good MLM pre-training loss; among them, the flattest
model learns the most useful features for downstream tasks. Here, results from the supervised setting cannot
be readily adapted since they are obtained (partially) via generalization bounds [55, 56], which do not apply
to the language modeling setting where the implicit bias is not related to the gap between the empirical and
population loss. Proving the correlation between flatness and downstream performance in more general settings
likely requires highly non-trivial and novel theoretical tools, and we hope to motivate future work on this topic.
Figure 1: Models at a later time step perform better, even after the pre-training loss converges. (a) A model
with 41M parameters pre-trained on the PCFG-generated dataset and evaluated on task C. (b) A model with
235M parameters pre-trained on the OPT-generated dataset and evaluated on QNLI. The error bars in (a)
show the standard deviation over 5 random seeds for the linear probe. We do not provide error bars in (b) due to
the limitation of computational resources. Also note that the pre-training loss approaches its minimal value
(3.196 for the PCFG-generated dataset and 1.865 for the OPT-generated dataset) as we increase the number of
steps. In Section A.6, we further provide an evaluation of the pre-training loss with the KL divergence.
2 Related Work
Language modeling and downstream adaptation.
Large language modeling has revolutionized the NLP field. Starting from Devlin et al. [12], a line of works
improves the downstream performance on a wide range of tasks with increasing model size and data amount
[64, 42, 43, 7]. LLMs even exhibit unexpected emergent behaviors, such as in-context learning [62, 39],
step-by-step reasoning [60], and zero-shot learning [7]. Kaplan et al. [27], Hernandez et al. [20] study the
behavior of language models with increasing size, and find that the pre-training loss is typically correlated
with downstream performance as model size increases. In practice, the pre-training loss is used as an
evaluation metric for language models. A notable example is the efficient transformer line of works, which
benchmarks the pre-training loss given the same computation constraint [10, 54, 9, 49, 35].
Understanding large language models.
Empirical works on understanding MLM find that the representations of language models encode rich semantic
and syntactic information [41, 23, 21, 38]. Theoretical works show that good LM loss, or the ability to fit
the LM conditional probability, is a sufficient condition for good performance on downstream tasks. Zhang and
Hashimoto [67] show that MLM representations recover latent variables in graphical models. Saunshi et al. [46]
introduce the natural assumption, which states that downstream tasks can be solved linearly with the true
conditional probability. Wei et al. [59] instantiate MLM on datasets generated by HMMs and show that a linear
probe on top of MLM models solves downstream tasks. In contrast, our empirical evidence indicates that other
factors related to the architecture and optimization also contribute to the performance beyond the natural
assumption—somewhat surprisingly, we show that a linear probe on top of the features of language models is
better than a linear probe on top of the true conditional probability.
Recent works also provide empirical evidence that good pre-training loss cannot fully explain the downstream
success. Tay et al. [50] find that a narrow but deep transformer is better than a wide but shallow transformer
with the same pre-training loss. Zhang et al. [68] demonstrate that ALBERT [29] generalizes better to OOD tasks
than BERT on a synthetic reasoning task. These works indicate that the architecture is an important factor for
good downstream performance beyond pre-training loss. This paper discovers the implicit bias in language
modeling on models with the same architecture and the same pre-training loss. Similar to our findings, Xie et al.
[62] also observe that despite similar perplexity, larger models are better than smaller models for in-context
learning, while in this paper, we focus on the standard fine-tuning and linear probe evaluation of language models,
and provide a novel understanding of the mechanism behind the superiority of large models over small models.
Understanding self-supervised learning.
Our work is also related to the broader theoretical self-supervised learning literature. This line of works
studies why a seemingly unrelated self-supervised objective helps improve the performance on downstream tasks.
Arora et al. [2] prove that contrastive learning representations work on downstream linear classification tasks.
Lee et al. [31] study reconstruction-based self-supervised learning algorithms and show that a linear probe on
top of the self-supervised representations solves downstream tasks. HaoChen et al. [18] show that the
contrastive learning loss can be viewed as a principled spectral clustering objective. With the spectral
contrastive loss, self-supervised representations recover the cluster structure in the augmentation graph.
Recently, Saunshi et al. [47] introduce the disjoint augmentation regime, where the minimizer of the contrastive
learning loss can perform poorly on downstream tasks. Empirically, they find that subtracting the mean of the
representations of each class makes self-supervised models perform worse on downstream tasks, and that
ResNet [19] can have better downstream performance than ViT [13] and MLP-Mixer [51] on modified images.
This indicates that pre-training loss is not all that matters for good downstream performance in self-supervised learning.
Implicit bias in supervised learning.
The training algorithm chooses solutions with certain properties, and usually leads to better generalization
[16, 48, 32, 25, 1, 37, 33, 61, 17]. Recently, Blanc et al. [6], Damian et al. [11], Li et al. [34] demonstrate
that label noise SGD biases the models toward flatter minima. However, the setting of implicit bias in
supervised learning is different from language modeling. In language modeling, we have access to a gigantic
corpus and cannot interpolate the pre-training dataset. Moreover, we care about the adaptability of the
solution on downstream tasks instead of in-distribution generalization.
3 The Existence of Implicit Bias in Language Modeling
In this section, we systematically investigate the relationship between pre-training loss and downstream
performance with experiments. We find that models with the same pre-training loss but different training
procedures can have different downstream performance.
3.1 Formulations
Masked language modeling.
Consider a vocabulary $\mathcal{W} = \{0, 1, \dots, c\}$, where 0 is a special token for the mask. Let
$x = [x_1, \dots, x_T]$ denote the input sequence of length $T$, and let
$x_{-t} = [x_1, \dots, x_{t-1}, 0, x_{t+1}, \dots, x_T]$ denote the masked sentence, where $t$ is sampled
uniformly at random and independently from $[T]$.¹ The MLM conditional probability refers to the probability
of $x_t$ given the rest of the sequence, $\Pr(x_t \mid x_{-t})$. We use $\Pr(\cdot \mid x_{-t})$ to denote the
$c$-dimensional probability vector
$\Pr(\cdot \mid x_{-t}) := [\Pr(x_t = 1 \mid x_{-t}), \dots, \Pr(x_t = c \mid x_{-t})] \in \mathbb{R}^c$.
In MLM pre-training, the model $f_\theta(\cdot)$ (parameterized by $\theta$) outputs the predicted MLM
conditional probability vector $f_\theta(x_{-t}) \in \mathbb{R}^c$. The model is trained to predict the masked
token $x_t$ given the rest of the sentence $x_{-t}$ with the cross-entropy loss,
$$L(\theta) = \mathbb{E}_{x,t}\big[\ell(f_\theta(x_{-t}), x_t)\big] = \mathbb{E}_{x,t}\big[-\log\,[f_\theta(x_{-t})]_{x_t}\big].$$
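To make this loss concrete, here is a minimal PyTorch-style sketch (not the authors' code) of the empirical MLM objective when exactly one position per sentence is masked; `model`, the batch layout, and the use of token id 0 as the mask follow the definitions above, while the helper name is hypothetical.

```python
import torch
import torch.nn.functional as F

def mlm_loss(model, x, t):
    """x: (B, T) token ids; t: (B,) index of the masked position per sentence.

    Replaces x_t with the mask token (id 0), runs the model, and returns the
    cross-entropy at position t, i.e., an empirical estimate of
    E_{x,t}[-log [f_theta(x_{-t})]_{x_t}] (the model here outputs logits
    rather than normalized probabilities).
    """
    batch_idx = torch.arange(x.size(0))
    targets = x[batch_idx, t]            # the masked-out tokens x_t
    x_masked = x.clone()
    x_masked[batch_idx, t] = 0           # token 0 is the mask
    logits = model(x_masked)             # (B, T, vocab) unnormalized scores
    logits_at_t = logits[batch_idx, t]   # (B, vocab) predictions at the mask
    return F.cross_entropy(logits_at_t, targets)
```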
Downstream evaluation.
The language model $f_\theta$ is composed of a feature extractor $h_\psi$, which outputs a sequence of
contextual representations, and a linear classifier that outputs the conditional probability at every position.
On downstream tasks, we use a randomly initialized $g_\phi$ on top of the pre-trained $h_\psi$. In fine-tuning,
both $g_\phi$ and $h_\psi$ are trained, while in linear probe, only $g_\phi$ is updated. For fine-tuning, we use
the contextual representation of the cls token. For linear probe, we concatenate the contextual representations
of all the tokens together.
Saturation regime.
To study models with the same pre-training loss, we introduce the saturation regime in this paper, where the
model output equals the true conditional probability, $f_\theta(x_{-t}) = \Pr(\cdot \mid x_{-t})$. In the
saturation regime, the MLM loss equals the entropy of the true conditional probability,
$$L(\theta) = \mathbb{E}_{x,t}\big[-\log \Pr(x_t \mid x_{-t})\big] = \frac{1}{T}\sum_{t=1}^{T} H(x_t \mid x_{-t}),$$
which is also the optimal pre-training loss. Thus, all models in the saturation regime have the same, optimal
pre-training loss, and we will show that they behave differently on downstream tasks.
¹For simplicity, we only consider masking out one token in each sentence.
Figure 2: Larger models perform better downstream than smaller models, even with almost the same pre-training
loss. (a) Pre-train on the PCFG-generated dataset and evaluate on task B. (b) Pre-train on the HMM-generated
dataset and evaluate on task-10. (c) Pre-train on the OPT-generated dataset and evaluate on QNLI. See
Section 3 and Section A for details.
Our experiments use expressive enough architectures such that there are multiple parameter configurations in the
saturation regime for our simplified datasets. For real large-scale data, it is currently computationally challenging
to arrive at the saturation regime. However, we hope that our experiments can provide insights for even larger
models in the future and for other regimes where pre-training loss does not explain downstream performance.
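Why the entropy is the optimal pre-training loss follows from the standard decomposition of the cross-entropy into an entropy term and a KL term; the short derivation below simply restates this fact in the notation introduced above.

\begin{align*}
L(\theta)
&= \mathbb{E}_{x,t}\big[-\log\,[f_\theta(x_{-t})]_{x_t}\big] \\
&= \mathbb{E}_{x_{-t},t}\Big[ H\big(\Pr(\cdot \mid x_{-t})\big)
   + \mathrm{KL}\big(\Pr(\cdot \mid x_{-t}) \,\big\|\, f_\theta(x_{-t})\big)\Big] \\
&\ge \frac{1}{T}\sum_{t=1}^{T} H(x_t \mid x_{-t}),
\end{align*}

with equality if and only if $f_\theta(x_{-t}) = \Pr(\cdot \mid x_{-t})$ almost surely, i.e., exactly in the saturation regime.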
3.2 Experimental Setup
We design controlled experiments to study the correlation between pre-training loss and downstream performance.
In particular, we will find a set of models with almost the same pre-training loss. We effectively use the
same architecture family so that the main difference between the models stems only from the training algorithms.
More details are provided in Section A.
Datasets.
We introduce three generative models to produce simplified datasets, with which we can study
various factors systematically. With knowledge of the true generative models that produce the data, we
can compute the true conditional probability and scale up the models until they approach the saturation regime
to ensure they have almost the same pre-training loss. Moreover, we can generate an unlimited amount of text
for pre-training to avoid overfitting to the empirical pre-training loss.
1) PCFG-generated dataset. A PCFG [8] generates sentences with probabilistic trees and is widely used to
understand natural language [26, 45, 28, 63]. We randomly generate production rules that satisfy
Chomsky Normal Form [8]. The non-terminal symbols in the parse tree can be viewed as intrinsic quantities
associated with the sentence, such as sentiment and syntax. Thus, we design three downstream tasks A, B,
and C to classify non-terminal symbols at different positions of the parse trees.
2) HMM-generated dataset. An HMM samples the hidden variables from the transition probabilities and the tokens
from the emission probabilities; [59, 62] also analyze the properties of pre-trained language models with
HMMs. We generate the transition and emission probabilities as random block-diagonal stochastic matrices
(a sampling sketch is given after this list). The downstream task is to classify a hidden variable in the
sentence; we use task-k to refer to classifying the k-th hidden variable.
3) OPT-generated dataset. We also introduce a more realistic pre-training dataset generated by the OPT
models [66]. Starting from the bos token, we sample each token from the conditional probability output by
the OPT model. For computational feasibility, we restrict generation to the top-2000 most frequent tokens
in the OPT vocabulary. We use QNLI and SST-2 from GLUE [53] as downstream tasks.
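To make the HMM-generated dataset (item 2 above) concrete, the following is a minimal NumPy sketch of sampling a sentence from an HMM with block-diagonal transition and emission matrices; the block construction, sizes, and uniform initial state are illustrative assumptions rather than the paper's exact configuration.

```python
import numpy as np

def block_diag_stochastic(n_blocks, block_size, rng):
    """Random block-diagonal row-stochastic matrix of size (n, n)."""
    n = n_blocks * block_size
    mat = np.zeros((n, n))
    for b in range(n_blocks):
        s = slice(b * block_size, (b + 1) * block_size)
        block = rng.random((block_size, block_size))
        mat[s, s] = block / block.sum(axis=1, keepdims=True)
    return mat

def sample_sentence(trans, emit, length, rng):
    """Sample hidden states h_1..h_T and tokens x_1..x_T from the HMM."""
    h = rng.integers(trans.shape[0])              # uniform initial hidden state
    tokens, hiddens = [], []
    for _ in range(length):
        tokens.append(rng.choice(emit.shape[1], p=emit[h]))
        hiddens.append(h)
        h = rng.choice(trans.shape[0], p=trans[h])
    return np.array(tokens), np.array(hiddens)    # hiddens provide the task-k labels

rng = np.random.default_rng(0)
trans = block_diag_stochastic(n_blocks=4, block_size=8, rng=rng)  # hidden -> hidden
emit = block_diag_stochastic(n_blocks=4, block_size=8, rng=rng)   # hidden -> token (square for simplicity)
tokens, hiddens = sample_sentence(trans, emit, length=16, rng=rng)
```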
Note that the true conditional probability can be computed efficiently for the three datasets given the
knowledge of the generative models. For PCFG- and HMM-generated datasets, we can compute the true conditional