Same Pre-training Loss, Better Downstream:
Implicit Bias Matters for Language Models
Hong Liu Sang Michael Xie Zhiyuan Li Tengyu Ma
Stanford University
{hliu99, sxie, zhiyuanli, tengyuma}@stanford.edu
October 26, 2022
Abstract
Language modeling on large-scale datasets leads to impressive performance gains on various downstream
language tasks. The validation pre-training loss (or perplexity in autoregressive language modeling) is
often used as the evaluation metric when developing language models since the pre-training loss tends to
be well-correlated with downstream performance (which is itself difficult to evaluate comprehensively).
Contrary to this conventional wisdom, this paper shows that 1) pre-training loss cannot fully explain
downstream performance and 2) flatness of the model is well-correlated with downstream performance
where pre-training loss is not. On simplified datasets, we identify three ways to produce models with the
same (statistically optimal) pre-training loss but different downstream performance: continuing pre-training
after convergence, increasing the model size, and changing the training algorithm. These experiments
demonstrate the existence of implicit bias of pre-training algorithms/optimizers—among models with the
same minimal pre-training loss, they implicitly prefer more transferable ones. Toward understanding this
implicit bias, we prove that SGD with standard mini-batch noise implicitly prefers flatter minima in language
models, and empirically observe a strong correlation between flatness and downstream performance among
models with the same minimal pre-training loss. We also prove in a synthetic language setting that among
the models with the minimal pre-training loss, the flattest model transfers to downstream tasks.
1 Introduction
Large language models (LLMs) pre-trained on internet-scale data have improved performance on a wide array
of downstream tasks [12, 64, 42, 43, 7]. These models are trained with a language modeling pre-training loss to
“fill in the blanks”—either predicting the next token/word (autoregressive language modeling loss, or perplexity)
or masked tokens (masked language modeling (MLM) loss).
In common practice, the validation pre-training loss is used to monitor the training process [7, 66] and to
compare different models, since the pre-training loss is generally strongly correlated with downstream performance [20].
Moreover, theoretical works on understanding LLMs also focus on how the pre-training loss affects
downstream performance. Saunshi et al. [46], Wei et al. [59], Xie et al. [62] show that good pre-training loss,
or fitting the language modeling conditional probability well, is a main reason for the downstream success of LLMs.
Their analyses generally treat the language models as black boxes and do not take into account how the models
represent the conditional probability.
In this paper, we question the conventional wisdom on the correlation between the validation pre-training loss
and downstream performance for language modeling. Recent works have demonstrated that models with different
architectures may have the same pre-training loss but different performance [47, 50]. Due to the expressivity of
modern neural nets, many parameter configurations even within the same architecture can still have the same pre-
training loss. A priori, it is unclear why all these configurations should have the same downstream performance.
We find that different parameter configurations with the same pre-training loss can indeed have different
downstream performance, especially when the pre-training loss reaches a near-optimal level. Concretely, using
simplified text datasets, we find three situations that demonstrate such a phenomenon:
• Even after the pre-training loss converges, models at a later time step still tend to perform better.
• Models trained by standard algorithms have better performance than adversarially trained models with the same pre-training loss.
• Larger models perform better downstream than smaller models, even if they have the same pre-training loss.
These situations are most prominent in the saturation regime, where the models are close to the minimal
possible pre-training loss (aka the entropy of the conditional probability, which can be estimated in our simplified
datasets). In the saturation regime, the pre-training loss of all models is almost the same, but the transferability
to downstream tasks varies. Interestingly, this phenomenon also holds when linear probing on contextualized
representations is used for evaluating downstream performance instead of finetuning. Thus, even though the
predicted conditional probabilities of two models are the same (and correct), the contextualized representations
can behave differently.
In each of the first two cases above, we find two models with the same pre-training loss and the same
architecture, but one has better performance than the other. They differ only in the training algorithms
used to produce them. Therefore, this suggests that the training algorithms have an implicit bias toward
one of these models—standard algorithms with more training steps bias toward parameter configurations
that transfer better to downstream tasks. The third case has a more subtle but similar interpretation. There
exists a hypothetical large model that represents the smaller model with worse downstream performance (by
filling zeros in the weights or replicating the weights of the smaller model). The training algorithm on the large
architecture could have chosen it, but did not. This suggests the algorithm has an implicit bias against the
hypothetical model (which has an equally good loss).
In supervised settings, optimizers are known to have an implicit bias toward selecting generalizable models
among all models with small empirical loss. E.g., see Damian et al. [11], Li et al. [34], which show that
SGD implicitly biases toward flatter minima, and references therein. However, the role of implicit bias in
self-supervised learning has not been studied and is conceptually different. Unlike in supervised learning, the
gap between empirical and population self-supervised losses is typically small, and thus implicit bias does
not seem to contribute to bridging this gap. Instead, the implicit bias selects local minima of the population
self-supervised loss that transfer better to downstream tasks.
Why do the algorithms bias toward some types of models? In Section 4, we provide a first-cut theoretical analysis
of the implicit bias in language modeling. Fortunately, despite the conceptual differences, mathematical tools
from supervised settings can be straightforwardly adapted to language modeling settings. We prove that mini-batch
SGD prefers flatter minima of the population pre-training loss among all minima in the saturation regime.
Interestingly, we obtain cleaner theoretical results for standard mini-batch SGD, without the artificial label noise
introduced in prior works [11, 34], partly because the mini-batch noise for LLMs does not vanish even at convergence.
We corroborate our theory with empirical evidence in Section 5. We show that for models with the same
pre-training loss in the three situations above, the flatness of the model (measured by the trace of the Hessian of
the loss, as predicted by the theory) strongly correlates with downstream performance.
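Concretely, the trace of the Hessian can be estimated without forming the Hessian explicitly, e.g., with Hutchinson's estimator via Hessian-vector products. The sketch below is a minimal illustration of this standard estimator, not the authors' evaluation code; `loss_fn` and `batch` are hypothetical placeholders.

```python
import torch

def hessian_trace(model, loss_fn, batch, n_samples=100):
    """Hutchinson's estimator: E_v[v^T H v] = tr(H) for Rademacher v."""
    params = [p for p in model.parameters() if p.requires_grad]
    loss = loss_fn(model, batch)
    # Keep the graph so we can differentiate the gradients again (HVPs).
    grads = torch.autograd.grad(loss, params, create_graph=True)

    estimates = []
    for _ in range(n_samples):
        # Rademacher probe vectors (entries +1 or -1 with equal probability).
        vs = [torch.randint_like(p, high=2) * 2.0 - 1.0 for p in params]
        # Hessian-vector product: gradient of <grads, v> w.r.t. the parameters.
        hvps = torch.autograd.grad(grads, params, grad_outputs=vs, retain_graph=True)
        estimates.append(sum((v * hvp).sum() for v, hvp in zip(vs, hvps)))
    return torch.stack(estimates).mean().item()
```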
Finally, to complement the theory and experiments above, we also rigorously formalize the connection
between flatness and downstream performance in a simplified Dyck language setting in Section 6. In this
setting, we prove that there are many models with good MLM pre-training loss; among them, the flattest
model learns the most useful features for downstream tasks. Here, results from the supervised setting cannot
be readily adapted since they are obtained (partially) via generalization bounds [55, 56], which do not apply
to the language modeling setting where the implicit bias is not related to the gap between the empirical and
population loss. Proving the correlation between flatness and downstream performance in more general settings
likely requires highly non-trivial and novel theoretical tools, and we hope to motivate future work on this topic.
Figure 1: Models at a later time step perform better, even after the pre-training loss converges. (a) A model
with 41M parameters pre-trained on the PCFG-generated dataset and evaluated on task C. (b) A model with
235M parameters pre-trained on the OPT-generated dataset and evaluated on QNLI. The error bars in (a)
show the standard deviation over 5 random seeds for the linear probe. We do not provide error bars in (b) due to
the limitation of computational resources. Also note that the pre-training loss approaches its minimal value
(3.196 for the PCFG-generated dataset and 1.865 for the OPT-generated dataset) as we increase the number of
steps. In Section A.6, we further provide an evaluation of the pre-training loss with the KL divergence.
2 Related Work
Language modeling and downstream adaptation.
Large language modeling has revolutionized the NLP field. Starting from Devlin et al. [12], a line of works
improves the downstream performance on a wide range of tasks with increasing model size and data amount
[64, 42, 43, 7]. LLMs even exhibit unexpected emergent behaviors, such as in-context learning [62, 39],
step-by-step reasoning [60], and zero-shot learning [7]. Kaplan et al. [27], Hernandez et al. [20] study the
behavior of language models with increasing size, and find that the pre-training loss is typically correlated
with downstream performance as model size increases. In practice, the pre-training loss is used as an
evaluation metric for language models. A notable example is the efficient transformer line of works, which
benchmarks the pre-training loss given the same computation constraint [10, 54, 9, 49, 35].
Understanding large language models.
Empirical works on understanding MLM find that the representations of language models encode rich semantic
and syntactic information [41, 23, 21, 38]. Theoretical works show that good LM loss, or the ability to fit
the LM conditional probability, is a sufficient condition for good performance on downstream tasks. Zhang and
Hashimoto [67] show that MLM representations recover latent variables in graphical models. Saunshi et al. [46]
introduce the natural assumption, which states that downstream tasks can be solved linearly with the true
conditional probability. Wei et al. [59] instantiate MLM on datasets generated by HMMs and show that a linear
probe on top of MLM models solves downstream tasks. In contrast, our empirical evidence indicates that other
factors related to the architecture and optimization also contribute to the performance beyond the natural
assumption—somewhat surprisingly, we show that a linear probe on top of the features of language models is
better than a linear probe on top of the true conditional probability.
Recent works also provide empirical evidence that good pre-training loss cannot fully explain the downstream
success. Tay et al. [50] find that a narrow but deep transformer is better than a wide but shallow transformer
with the same pre-training loss. Zhang et al. [68] demonstrate that ALBERT [29] generalizes better to OOD tasks
than BERT on a synthetic reasoning task. These works indicate that the architecture is an important factor for
good downstream performance beyond pre-training loss. This paper discovers the implicit bias in language
modeling on models with the same architecture and the same pre-training loss. Similar to our findings, Xie et al.
[62] also observe that despite similar perplexity, larger models are better than smaller models for in-context
learning, while in this paper, we focus on the standard fine-tuning and linear probe evaluation of language models,
and provide a novel understanding of the mechanism behind the superiority of large models over small models.
Understanding self-supervised learning.
Our work is also related to the broader theoretical self-supervised learning literature. This line of works
studies why a seemingly unrelated self-supervised objective helps improve the performance on downstream tasks.
Arora et al. [2] prove that contrastive learning representations work on downstream linear classification tasks.
Lee et al. [31] study reconstruction-based self-supervised learning algorithms and show that a linear probe on
top of the self-supervised representations solves downstream tasks. HaoChen et al. [18] show that the
contrastive learning loss can be viewed as a principled spectral clustering objective. With the spectral
contrastive loss, self-supervised representations recover the cluster structure in the augmentation graph.
Recently, Saunshi et al. [47] introduce the disjoint augmentation regime, where the minimizer of the contrastive
learning loss can perform poorly on downstream tasks. Empirically, they find that subtracting the mean of the
representations of each class makes self-supervised models perform worse on downstream tasks, and that
ResNet [19] can have better downstream performance than ViT [13] and MLP-Mixer [51] on modified images.
This indicates that pre-training loss is not all that matters for good downstream performance in self-supervised learning.
Implicit bias in supervised learning.
The training algorithm chooses solutions with certain properties, and usually leads to better generalization
[16, 48, 32, 25, 1, 37, 33, 61, 17]. Recently, Blanc et al. [6], Damian et al. [11], Li et al. [34] demonstrate
that label noise SGD biases the models toward flatter minima. However, the setting of implicit bias in
supervised learning is different from language modeling. In language modeling, we have access to a gigantic
corpus and cannot interpolate the pre-training dataset. Moreover, we care about the adaptability of the
solution on downstream tasks instead of in-distribution generalization.
3 The Existence of Implicit Bias in Language Modeling
In this section, we systematically investigate the relationship between pre-training loss and downstream
performance with experiments. We find that models with the same pre-training loss but different training
procedures can have different downstream performance.
3.1 Formulations
Masked language modeling.
Consider a vocabulary $\mathcal{W} = \{0, 1, \dots, c\}$, where 0 is a special token for the mask. Let
$x = [x_1, \dots, x_T]$ denote the input sequence of length $T$, and let
$x_{-t} = [x_1, \dots, x_{t-1}, 0, x_{t+1}, \dots, x_T]$ denote the masked sentence, where $t$ is sampled
uniformly at random and independently from $[T]$.¹ The MLM conditional probability refers to the probability
of $x_t$ given the rest of the sequence, $\Pr(x_t \mid x_{-t})$. We use $\Pr(\cdot \mid x_{-t})$ to denote the
$c$-dimensional probability vector
$\Pr(\cdot \mid x_{-t}) := [\Pr(x_t = 1 \mid x_{-t}), \dots, \Pr(x_t = c \mid x_{-t})] \in \mathbb{R}^c$.
In MLM pre-training, the model $f_\theta(\cdot)$ (parameterized by $\theta$) outputs the predicted MLM
conditional probability vector $f_\theta(x_{-t}) \in \mathbb{R}^c$. The model is trained to predict the masked
token $x_t$ given the rest of the sentence $x_{-t}$ with the cross-entropy loss,
$$L(\theta) = \mathbb{E}_{x,t}\big[\ell(f_\theta(x_{-t}), x_t)\big] = \mathbb{E}_{x,t}\big[-\log\,[f_\theta(x_{-t})]_{x_t}\big].$$
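To make this loss concrete, here is a minimal PyTorch-style sketch (not the authors' code) of the empirical MLM objective when exactly one position per sentence is masked; `model`, the batch layout, and the use of token id 0 as the mask follow the definitions above, while the helper name is hypothetical.

```python
import torch
import torch.nn.functional as F

def mlm_loss(model, x, t):
    """x: (B, T) token ids; t: (B,) index of the masked position per sentence.

    Replaces x_t with the mask token (id 0), runs the model, and returns the
    cross-entropy at position t, i.e., an empirical estimate of
    E_{x,t}[-log [f_theta(x_{-t})]_{x_t}] (the model here outputs logits
    rather than normalized probabilities).
    """
    batch_idx = torch.arange(x.size(0))
    targets = x[batch_idx, t]            # the masked-out tokens x_t
    x_masked = x.clone()
    x_masked[batch_idx, t] = 0           # token 0 is the mask
    logits = model(x_masked)             # (B, T, vocab) unnormalized scores
    logits_at_t = logits[batch_idx, t]   # (B, vocab) predictions at the mask
    return F.cross_entropy(logits_at_t, targets)
```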
Downstream evaluation.
The language model $f_\theta$ is composed of a feature extractor $h_\psi$, which outputs a sequence of
contextual representations, and a linear classifier that outputs the conditional probability at every position.
On downstream tasks, we use a randomly initialized $g_\phi$ on top of the pre-trained $h_\psi$. In fine-tuning,
both $g_\phi$ and $h_\psi$ are trained, while in linear probe, only $g_\phi$ is updated. For fine-tuning, we use
the contextual representation of the cls token. For linear probe, we concatenate the contextual representations
of all the tokens together.
Saturation regime.
To study models with the same pre-training loss, we introduce the saturation regime in this paper, where the
model output equals the true conditional probability, $f_\theta(x_{-t}) = \Pr(\cdot \mid x_{-t})$. In the
saturation regime, the MLM loss equals the entropy of the true conditional probability,
$$L(\theta) = \mathbb{E}_{x,t}\big[-\log \Pr(x_t \mid x_{-t})\big] = \frac{1}{T}\sum_{t=1}^{T} H(x_t \mid x_{-t}),$$
which is also the optimal pre-training loss. Thus, all models in the saturation regime have the same, optimal
pre-training loss, and we will show that they behave differently on downstream tasks.
¹For simplicity, we only consider masking out one token in each sentence.
Figure 2: Larger models perform better downstream than smaller models, even with almost the same pre-training
loss. (a) Pre-train on the PCFG-generated dataset and evaluate on task B. (b) Pre-train on the HMM-generated
dataset and evaluate on task-10. (c) Pre-train on the OPT-generated dataset and evaluate on QNLI. See
Section 3 and Section A for details.
Our experiments use expressive enough architectures such that there are multiple parameter configurations in the
saturation regime for our simplified datasets. For real large-scale data, it is currently computationally challenging
to arrive at the saturation regime. However, we hope that our experiments can provide insights for even larger
models in the future and for other regimes where pre-training loss does not explain downstream performance.
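Why the entropy is the optimal pre-training loss follows from the standard decomposition of the cross-entropy into an entropy term and a KL term; the short derivation below simply restates this fact in the notation introduced above.

\begin{align*}
L(\theta)
&= \mathbb{E}_{x,t}\big[-\log\,[f_\theta(x_{-t})]_{x_t}\big] \\
&= \mathbb{E}_{x_{-t},t}\Big[ H\big(\Pr(\cdot \mid x_{-t})\big)
   + \mathrm{KL}\big(\Pr(\cdot \mid x_{-t}) \,\big\|\, f_\theta(x_{-t})\big)\Big] \\
&\ge \frac{1}{T}\sum_{t=1}^{T} H(x_t \mid x_{-t}),
\end{align*}

with equality if and only if $f_\theta(x_{-t}) = \Pr(\cdot \mid x_{-t})$ almost surely, i.e., exactly in the saturation regime.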
3.2 Experimental Setup
We design controlled experiments to study the correlation between pre-training loss and downstream performance.
In particular, we will find a set of models with almost the same pre-training loss. We effectively use the
same architecture family so that the main difference between the models stems only from the training algorithms.
More details are provided in Section A.
Datasets.
We introduce three generative models to produce simplified datasets, with which we can study
various factors systematically. With knowledge of the true generative models that produce the data, we
can compute the true conditional probability and scale up the models until they approach the saturation regime
to ensure they have almost the same pre-training loss. Moreover, we can generate an unlimited amount of text
for pre-training to avoid overfitting to the empirical pre-training loss.
1) PCFG-generated dataset. A PCFG [8] generates sentences with probabilistic trees and is widely used to
understand natural language [26, 45, 28, 63]. We randomly generate production rules that satisfy
Chomsky Normal Form [8]. The non-terminal symbols in the parse tree can be viewed as intrinsic quantities
associated with the sentence, such as sentiment and syntax. Thus, we design three downstream tasks A, B,
and C to classify non-terminal symbols at different positions of the parse trees.
2) HMM-generated dataset. An HMM samples the hidden variables from the transition probabilities and the tokens
from the emission probabilities; [59, 62] also analyze the properties of pre-trained language models with
HMMs. We generate the transition and emission probabilities as random block-diagonal stochastic matrices
(a sampling sketch is given after this list). The downstream task is to classify a hidden variable in the
sentence; we use task-k to refer to classifying the k-th hidden variable.
3) OPT-generated dataset. We also introduce a more realistic pre-training dataset generated by the OPT
models [66]. Starting from the bos token, we sample each token from the conditional probability output by
the OPT model. For computational feasibility, we restrict generation to the top-2000 most frequent tokens
in the OPT vocabulary. We use QNLI and SST-2 from GLUE [53] as downstream tasks.
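To make the HMM-generated dataset (item 2 above) concrete, the following is a minimal NumPy sketch of sampling a sentence from an HMM with block-diagonal transition and emission matrices; the block construction, sizes, and uniform initial state are illustrative assumptions rather than the paper's exact configuration.

```python
import numpy as np

def block_diag_stochastic(n_blocks, block_size, rng):
    """Random block-diagonal row-stochastic matrix of size (n, n)."""
    n = n_blocks * block_size
    mat = np.zeros((n, n))
    for b in range(n_blocks):
        s = slice(b * block_size, (b + 1) * block_size)
        block = rng.random((block_size, block_size))
        mat[s, s] = block / block.sum(axis=1, keepdims=True)
    return mat

def sample_sentence(trans, emit, length, rng):
    """Sample hidden states h_1..h_T and tokens x_1..x_T from the HMM."""
    h = rng.integers(trans.shape[0])              # uniform initial hidden state
    tokens, hiddens = [], []
    for _ in range(length):
        tokens.append(rng.choice(emit.shape[1], p=emit[h]))
        hiddens.append(h)
        h = rng.choice(trans.shape[0], p=trans[h])
    return np.array(tokens), np.array(hiddens)    # hiddens provide the task-k labels

rng = np.random.default_rng(0)
trans = block_diag_stochastic(n_blocks=4, block_size=8, rng=rng)  # hidden -> hidden
emit = block_diag_stochastic(n_blocks=4, block_size=8, rng=rng)   # hidden -> token (square for simplicity)
tokens, hiddens = sample_sentence(trans, emit, length=16, rng=rng)
```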
Note that the true conditional probability can be computed efficiently for the three datasets given the
knowledge of the generative models. For PCFG- and HMM-generated datasets, we can compute the true conditional