
Understanding self-supervised learning.
Our work is also related to the broader theoretical self-supervised learning literature. This line of work studies why a seemingly unrelated self-supervised objective improves performance on downstream tasks. Arora et al. [2] prove that contrastive learning representations work on downstream linear classification tasks. Lee et al. [31] study reconstruction-based self-supervised learning algorithms and show that a linear probe on top of the self-supervised representations solves downstream tasks. HaoChen et al. [18] show that the contrastive learning loss can be viewed as a principled spectral clustering objective; with the spectral contrastive loss, self-supervised representations recover the cluster structure in the augmentation graph. Recently, Saunshi et al. [47] introduce the disjoint augmentation regime, where the minimizer of the contrastive learning loss can perform poorly on downstream tasks. Empirically, they find that subtracting the per-class mean of the representations makes self-supervised models perform worse on downstream tasks, and that ResNet [19] can achieve better downstream performance than ViT [13] and MLP-Mixer [51] on modified images. This indicates that the pre-training loss is not all that matters for good downstream performance in self-supervised learning.
Implicit bias in supervised learning.
The training algorithm chooses solutions with certain properties, which usually leads to better generalization [16, 48, 32, 25, 1, 37, 33, 61, 17]. Recently, Blanc et al. [6], Damian et al. [11], and Li et al. [34] demonstrate that label-noise SGD biases models toward flatter minima. However, the setting of implicit bias in supervised learning differs from language modeling. In language modeling, we have access to a gigantic corpus and cannot interpolate the pre-training dataset. Moreover, we care about the adaptability of the solution to downstream tasks instead of in-distribution generalization.
3 The Existence of Implicit Bias in Language Modeling
In this section, we systematically investigate the relationship between pre-training loss and downstream performance with experiments. We find that models with the same pre-training loss but different training procedures can have different downstream performance.
3.1 Formulations
Masked language modeling.
Consider a vocabulary W = {0, 1, ..., c}, where 0 is a special token for the mask. Let x = [x_1, ..., x_T] denote the input sequence of length T, and x_{-t} = [x_1, ..., x_{t-1}, 0, x_{t+1}, ..., x_T] denote the masked sentence, where t is sampled uniformly at random from [T].¹ The MLM conditional probability refers to the probability of x_t given the rest of the sequence, Pr(x_t | x_{-t}). We use Pr(· | x_{-t}) to denote the c-dimensional probability vector Pr(· | x_{-t}) := [Pr(x_t = 1 | x_{-t}), ..., Pr(x_t = c | x_{-t})] ∈ R^c. In MLM pre-training, the model f_θ(·) (parameterized by θ) outputs the predicted MLM conditional probability vector f_θ(x_{-t}) ∈ R^c. The model is trained to predict the masked token x_t given the rest of the sentence x_{-t} with the cross-entropy loss,
L(θ) = E_{x,t}[ℓ(f_θ(x_{-t}), x_t)] = E_{x,t}[−log([f_θ(x_{-t})]_{x_t})].
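For concreteness, the following is a minimal PyTorch-style sketch of this objective; the model interface (a module mapping a masked sequence to per-position logits over the c non-mask tokens) and the batch format are illustrative assumptions rather than our actual implementation.

```python
import torch
import torch.nn.functional as F

MASK_ID = 0  # token 0 is the special mask symbol, as in the vocabulary W = {0, 1, ..., c}

def mlm_loss(model, x, t):
    """Cross-entropy MLM loss L(theta) = E_{x,t}[-log [f_theta(x_{-t})]_{x_t}].

    x: LongTensor of shape (batch, T), token ids in {1, ..., c}
    t: LongTensor of shape (batch,), one uniformly sampled mask position per sequence
    model: maps a masked sequence to logits of shape (batch, T, c)  [illustrative interface]
    """
    batch = torch.arange(x.size(0))
    target = x[batch, t]                 # the masked-out tokens x_t
    x_masked = x.clone()
    x_masked[batch, t] = MASK_ID         # build x_{-t}
    logits = model(x_masked)             # predicted conditional distributions at every position
    logits_at_t = logits[batch, t]       # only the masked position is scored
    # targets live in {1, ..., c} while logit indices are 0-based, hence the shift
    return F.cross_entropy(logits_at_t, target - 1)
```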
Downstream evaluation.
The language model f_θ is composed of a feature extractor h_ψ, which outputs a sequence of contextual representations, and a linear classifier that outputs the conditional probability at every position. On downstream tasks, we use a randomly initialized g_φ on top of the pre-trained h_ψ. In fine-tuning, both g_φ and h_ψ are trained, while in linear probe, only g_φ is updated. For fine-tuning, we use the contextual representation of the cls token. For linear probe, we concatenate the contextual representations of all the tokens together.
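As an illustration, the sketch below shows how the two protocols could differ in which parameters are updated and which representations feed the head g_φ; the module names, optimizer, and learning rate are illustrative assumptions.

```python
import torch

def downstream_optimizer(h_psi, g_phi, mode):
    """Select trainable parameters for the two evaluation protocols.

    Fine-tuning updates both the pre-trained extractor h_psi and the new
    head g_phi; linear probe freezes h_psi and updates only g_phi.
    """
    if mode == "finetune":
        for p in h_psi.parameters():
            p.requires_grad = True
        params = list(h_psi.parameters()) + list(g_phi.parameters())
    elif mode == "linear_probe":
        for p in h_psi.parameters():
            p.requires_grad = False      # contextual representations stay fixed
        params = list(g_phi.parameters())
    else:
        raise ValueError(mode)
    return torch.optim.AdamW(params, lr=1e-4)  # optimizer and lr are illustrative

def finetune_features(reps, cls_pos=0):
    # fine-tuning: use the cls token's contextual representation; reps: (batch, T, d)
    return reps[:, cls_pos]

def probe_features(reps):
    # linear probe: concatenate the representations of all T tokens into one vector
    return reps.flatten(1)               # (batch, T, d) -> (batch, T * d)
```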
Saturation regime.
To study models with the same pre-training loss, we introduce the saturation regime in this paper, where the model output equals the true conditional probability, f_θ(x_{-t}) = Pr(· | x_{-t}). In the saturation regime, the MLM loss equals the entropy of the true conditional probability, L(θ) = E_{x,t}[−log(Pr(x_t | x_{-t}))] = (1/T) ∑_{t=1}^T H(x_t | x_{-t}), which is also the optimal pre-training loss, since the cross-entropy is minimized exactly when the prediction matches the true conditional distribution (a toy numerical check of this equality is given below). Thus, all models in the saturation regime have the same, optimal pre-training loss, and we will show that they behave differently on downstream tasks. Our
¹For simplicity, we consider masking out only one token in each sentence.
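As a toy numerical check of the equality above, the sketch below constructs a small joint distribution over length-2 sequences with c = 3, forms the true conditionals Pr(· | x_{-t}), and verifies that a saturated model's loss equals the conditional entropy; the setup is purely illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)
c = 3                                    # vocabulary {1, ..., c}; 0 is reserved for the mask
joint = rng.random((c, c))
joint /= joint.sum()                     # joint distribution over sequences x = [x_1, x_2]

loss = entropy = 0.0
for t in (0, 1):                         # masked position, uniform over [T] with T = 2
    for i in range(c):                   # value of the revealed (unmasked) token
        marg = joint[:, i] if t == 0 else joint[i, :]
        cond = marg / marg.sum()         # true conditional Pr(x_t = . | x_{-t})
        for j in range(c):               # value of the masked token x_t
            p = joint[j, i] if t == 0 else joint[i, j]
            loss += 0.5 * p * -np.log(cond[j])                       # E_{x,t}[-log Pr(x_t | x_{-t})]
        entropy += 0.5 * marg.sum() * -(cond * np.log(cond)).sum()   # (1/T) sum_t H(x_t | x_{-t})

assert np.isclose(loss, entropy)         # the saturated model attains the entropy lower bound
```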