
learning, as shown in our SuperGLUE finetuning experimental results, where our method improves the 1B-parameter PaLM model's finetuning performance from 67.0 to 68.7, and improves the 8B-parameter PaLM model's finetuning performance on all 8 SuperGLUE tasks, raising the score from 80.7 to 83.1. We also propose an extension of our method called Two-Pass FCM (T-FCM), which applies FCM twice on a replicated input sequence. In doing so, T-FCM effectively lets the causal language model see bidirectional context without altering the sequence ordering. While this adds extra computation cost during training, T-FCM further boosts finetuning performance without hurting few-shot results, improving the score from 80.7 to 87.8 (8B) and from 67.0 to 73.5 (1B).
Contributions. We highlight the contributions of our paper
below:
• We present FCM, a simple and scalable pre-training methodology for causal language modeling, and provide an empirical evaluation of FCM on a suite of few-shot and finetuning benchmarks.
• We show that FCM is highly effective at improving zero-shot and few-shot learning results, outperforming strong baselines including PaLM and UL2: it improves the average SuperGLUE score of the 8-billion-parameter PaLM from 61.6 to 64.0, and improves PaLM on a wide range of 19 NLP tasks.
• In addition to few-shot learning, we demonstrate that FCM significantly helps with finetuning on downstream tasks, improving the performance of the 8-billion-parameter PaLM on 8 out of 8 SuperGLUE tasks and raising the average SuperGLUE score from 80.7 to 83.1.
• We propose Two-Pass FCM (T-FCM), a simple yet effective extension of FCM that introduces bidirectional context to causal language models without altering the sequence order. We observe that T-FCM further boosts the finetuned SuperGLUE score from 80.7 to 87.8 without affecting few-shot learning performance.
2. Method
2.1. Pre-training Objective
Forgetful Causal Masking (FCM). FCM uses a standard causal, decoder-only Transformer model architecture (Vaswani et al., 2017), i.e., each timestep can only attend to itself and past timesteps. We illustrate FCM in Figure 2. Given an input text $x = [x_1, \cdots, x_n]$, the standard causal language modeling objective is defined to maximize the log-likelihood of $x$ autoregressively:
$$\log p(x) = \log \prod_{i=1}^{n} p(x_i \mid x_1, x_2, \ldots, x_{i-1}) = \log \prod_{i=1}^{n} p(x_i \mid x_{<i}) := \log \prod_{i=1}^{n} p\big(x_i \mid [x_j]_{j=0}^{i-1}\big). \qquad (1)$$
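As a minimal illustration of Eq. (1), the sketch below computes the sequence log-likelihood from per-position decoder outputs. The function name, array shapes, and the convention that the logits at position $i$ predict $x_i$ from $x_{<i}$ are illustrative assumptions on our part, not part of the method.

```python
# Minimal sketch (ours, not the paper's code) of the log-likelihood in Eq. (1).
# Assumes logits[i] are the decoder's predictions for x_i conditioned on x_{<i}.
import jax
import jax.numpy as jnp

def causal_lm_log_likelihood(logits, tokens):
    # logits: [seq_len, vocab_size]; tokens: [seq_len] integer token ids
    log_probs = jax.nn.log_softmax(logits, axis=-1)
    # Gather log p(x_i | x_{<i}) at every position and sum over the sequence.
    token_log_probs = jnp.take_along_axis(log_probs, tokens[:, None], axis=-1)[:, 0]
    return jnp.sum(token_log_probs)
```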
In FCM, we randomly sample a mask ratio $m$ uniformly from $[0, \eta]$, where $\eta \in [0, 1]$ is a fixed maximum mask ratio. We use $\eta = 0.15$ throughout the experiments unless otherwise mentioned. The model is asked to predict each token $x_i \in x$, and can only attend to tokens in $x_{<i}$ that are not sampled for masking. Concretely, the FCM objective is given by:
$$\log p(x) = \log \prod_{i=1}^{n} p\Big(x_i \,\Big|\, \big[\mathbb{I}[m_j > \eta] \cdot x_j\big]_{j=0}^{i-1}\Big), \qquad (2)$$
where $m_j \sim \mathcal{U}(0, 1)$. This can be efficiently implemented by combining it with the causal attention mask. While applying random masking to the token sequence, we always exclude the special BOS (‘beginning of sentence’) token at the beginning of each sequence, so that the model is aware of the beginning of a sentence. Moreover, keeping the BOS token unmasked helps with training stability because it ensures that there is at least one unmasked token, without changing the semantic meaning of the sequence. For example, when predicting token $x_t$ for small $t$, it is possible that all tokens $[x_1, \ldots, x_{t-1}]$ are masked, which can cause instability in the training loss. We found that this technique enables us to train with arbitrarily high mask ratios without incurring instability.
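To illustrate how the random masking can be folded into the attention mask, the following sketch builds a joint FCM/causal mask. The function name, shapes, and the choice to always keep the diagonal visible are our own assumptions rather than the paper's reference implementation.

```python
# Illustrative FCM mask construction (our sketch), assuming position 0 holds BOS.
import jax
import jax.numpy as jnp

def fcm_attention_mask(rng, seq_len, max_mask_ratio=0.15):
    rng_ratio, rng_tokens = jax.random.split(rng)
    # Sample the mask ratio m ~ U(0, eta), then drop each token with probability m.
    mask_ratio = jax.random.uniform(rng_ratio, (), minval=0.0, maxval=max_mask_ratio)
    keep = jax.random.uniform(rng_tokens, (seq_len,)) > mask_ratio  # True = visible
    keep = keep.at[0].set(True)  # the BOS token is never masked
    # Standard causal mask: query i may attend to key positions j <= i.
    causal = jnp.tril(jnp.ones((seq_len, seq_len), dtype=bool))
    # Dropped keys are hidden from all queries; we keep the diagonal so every
    # position can still attend to itself (an assumption on our part).
    return (causal & keep[None, :]) | jnp.eye(seq_len, dtype=bool)
```

In a standard decoder, such a boolean mask would simply be applied to the attention logits (e.g., by adding a large negative value at disallowed positions), so no change to the model architecture is required.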
Two-Pass FCM (T-FCM). Prior work has found that masked language models achieve better finetuning performance than similarly sized or larger causal language models (see, e.g., Wang et al., 2022; Tay et al., 2022, inter alia). One hypothesis for this performance gap is that masked language models can use bidirectional context during training, while causal language models cannot. To bridge this gap, we propose Two-Pass FCM (T-FCM) to introduce bidirectional context into causal language models during training. T-FCM simply duplicates the input sequence: the first pass is identical to causal language modeling, and the second pass is similar to masked language modeling, since the model can attend to masked future tokens by looking at the first pass.
During training, T-FCM introduces an additional sentinel token [copy] to let the model know that a copied sequence begins. In practice, we found it important not to apply the loss to the prediction of [bos] from the [copy] token; otherwise, training is destabilized. The reason is that the position of [copy] is arbitrary, hence predicting [bos] from [copy] is not well-defined.
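As a concrete sketch of this input construction, the snippet below duplicates a tokenized sequence around a [copy] sentinel and masks out the loss at the single position that would predict [bos] from [copy]. The token ids, function name, and next-token target convention are illustrative assumptions on our part.

```python
# Illustrative T-FCM example construction (our sketch, not the paper's code).
import jax.numpy as jnp

BOS_ID, COPY_ID = 1, 2  # hypothetical special-token ids

def make_tfcm_example(tokens):
    # tokens: [n] integer ids, with tokens[0] == BOS_ID
    n = tokens.shape[0]
    inputs = jnp.concatenate([tokens, jnp.array([COPY_ID]), tokens])
    # Next-token targets; position n predicts [bos] from [copy], so its loss is
    # masked out because the location of [copy] is arbitrary.
    targets = inputs[1:]
    loss_mask = jnp.ones_like(targets, dtype=bool).at[n].set(False)
    return inputs[:-1], targets, loss_mask
```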