Towards Better Few-Shot and Finetuning Performance
with Forgetful Causal Language Models
Hao Liu *1 2  Xinyang Geng *1 2  Lisa Lee 2  Igor Mordatch 2  Sergey Levine 1  Sharan Narang 2  Pieter Abbeel 1
*Equal contribution. 1UC Berkeley, 2Google Research, Brain Team. Correspondence to: Hao Liu <hao.liu@berkeley.edu>, Xinyang Geng <young.geng@berkeley.edu>.
Preliminary work. Under review by the International Conference on Machine Learning (ICML).
Abstract
Large language models (LLMs) trained using the next-token-prediction objective, such as GPT-3 and PaLM, have revolutionized natural language processing in recent years by showing impressive zero-shot and few-shot capabilities across a wide range of tasks. In this work, we propose a simple technique that significantly boosts the performance of LLMs without adding computational cost. Our key observation is that, by performing the next-token-prediction task with randomly selected past tokens masked out, we can improve the quality of the learned representations for downstream language understanding tasks. We hypothesize that randomly masking past tokens prevents over-attending to recent tokens and encourages attention to tokens in the distant past. We find that our method, Forgetful Causal Masking (FCM), significantly improves both the few-shot and finetuning performance of PaLM. We further consider a simple extension, T-FCM, which introduces bidirectional context to causal language models without altering the sequence order, and further improves finetuning performance.
1. Introduction
Language model (LM) pre-training has substantially advanced the state of the art across a variety of natural language processing tasks (Peters et al., 2018; Devlin et al., 2018; Brown et al., 2020; Chowdhery et al., 2022) and related fields including image generation, reasoning, and code generation (Alayrac et al., 2022; Lewkowycz et al., 2022; Saharia et al., 2022; Chen et al., 2021). Prior work on pre-training has explored different combinations of architecture (e.g., encoder-only, decoder-only, or encoder-decoder) and objective function (e.g., masked or causal language modeling). For example, masked encoder-only models such as BERT (Devlin et al., 2018) and RoBERTa (Liu et al., 2019) excel in discriminative finetuning tasks such as classification. Similarly, masked encoder-decoder models such as BART (Lewis et al., 2019) and T5 (Roberts et al., 2019) perform well on both discriminative and generative finetuning. While masked language modeling is effective for finetuning and removes the need for task-specific architectures, its major limitation is that it still requires task-specific datasets and task-specific finetuning. Decoder-only causal language models, on the other hand, remove this limitation: they are capable of zero-shot and few-shot adaptation without finetuning, simply by prompting the model with appropriate strings to control the generated outputs, as shown by GPT-3 (Brown et al., 2020) and PaLM (Chowdhery et al., 2022).
Driven by their impressive zero-shot and few-shot abilities, there has been more work on scaling causal decoder-only architectures (Zhang et al., 2022; Black et al., 2022; Brown et al., 2020; Chowdhery et al., 2022) than on encoder-based architectures, and there has been significant interest in studying such models in various contexts (Hoffmann et al., 2022; Wei et al., 2022b; Li & Liang, 2021; Ahn et al., 2022; Chen et al., 2021). However, decoder-only models are still limited by their imperfect zero-shot and few-shot adaptation compared to human performance, and by their relatively inferior finetuning performance compared to masked language modeling.
To address the above challenges, prior work has proposed combining masked modeling with causal language modeling (Dong et al., 2019; Wang et al., 2022; Tay et al., 2022; Du et al., 2022) to bring the benefits of masked modeling to causal language models while retaining their zero-shot ability. However, such approaches typically introduce extra computation and parameters, or require a sophisticated attention-masking strategy that hinders practical use (Yang et al., 2019; Tay et al., 2022). Moreover, they typically train encoder-decoder models, which are less naturally suited to zero- and few-shot inference than decoder-only causal language models and are still outperformed by them (Sanh et al., 2022; Brown et al., 2020; Chowdhery et al., 2022).
[Figure 1: bar charts of average SuperGLUE score for zero-shot, one-shot, and few-shot performance across models (PaLM, FCM, and T-FCM at 1B and 8B; GPT-3, T5-XXL, UL2, and ST-MoE baselines for zero-shot), and per-task SuperGLUE finetuning scores (BoolQ, CB, RTE, ReCoRD, WiC, COPA, MultiRC, WSC, Avg) for the 1B and 8B models.]
Figure 1. FCM and T-FCM outperform PaLM in zero-shot and few-shot as well as finetuning tasks. We report the averaged scores in each category; scores are averaged over 3 evaluation random seeds. Top: We compare zero-shot, one-shot, and five-shot average performance on the SuperGLUE benchmark for different model sizes and dataset sizes. PaLM* 8B-780B HQ denotes the published results of the 8B model trained on 780B tokens from high-quality datasets, PaLM 8B-180B denotes the same setup but trained on 180B tokens from the C4 dataset, and FCM 8B-180B denotes the same 8B model trained on 180B tokens from the C4 dataset using FCM as the objective. Bottom: We compare finetuning performance on SuperGLUE for the 1B model size (left) and the 8B model size (right). T-FCM, a simple extension of FCM, further boosts finetuning performance significantly while achieving similar few-shot performance to FCM.
In order to further improve the few-shot abilities of causal language models, some works have proposed better prompt engineering methods (Liu et al., 2021; Lester et al., 2021; Ling et al., 2017; Wei et al., 2022b; Li & Liang, 2021) or better finetuning methods (Mishra et al., 2022; Wei et al., 2022a; Sanh et al., 2022). Prompt-based methods are sensitive to prompt design (Lester et al., 2021; Liu et al., 2021), while finetuning-based approaches typically require a large amount of supervision to work well, as shown in Sanh et al. (2022). In addition, such methods can only improve an already pre-trained model and cannot improve pre-training itself.
In this work, we propose a pre-training approach that improves the few-shot and zero-shot performance, as well as the representation learning, of causal language models, without incurring any extra computation cost or parameters. Our key observation is that, by performing the next-token-prediction task with randomly selected past tokens masked out, we can improve the quality of the learned representations for downstream language understanding tasks. Our method, Forgetful Causal Masking (FCM), can be efficiently implemented by randomly masking input tokens in the causal language model. Applying our method to PaLM (Chowdhery et al., 2022), a state-of-the-art causal language model, we see significant improvements on the SuperGLUE (Sarlin et al., 2020) benchmark: our method improves the 1B-model-size PaLM's zero-shot performance from 55.7 to 59.2 and the 8B-model-size PaLM's zero-shot performance from 61.6 to 64.0. We further evaluate FCM on a diverse suite of NLP tasks from Brown et al. (2020) and observe improvements in few-shot learning on most tasks. In addition, FCM improves representation learning, as shown in our SuperGLUE finetuning experiments: our method improves the 1B-parameter PaLM model's finetuning performance from 67.0 to 68.7, and improves the 8B-parameter PaLM model's finetuning performance on all 8 SuperGLUE tasks, raising the average score from 80.7 to 83.1. We also propose an extension of our method called Two-Pass FCM (T-FCM), which applies FCM twice on a replicated input sequence. In doing so, T-FCM effectively lets a causal language model see bidirectional context without altering the sequence ordering. While this adds extra computation cost during training, T-FCM further boosts finetuning performance without hurting few-shot results, improving the score from 80.7 to 87.8 (8B) and from 67.0 to 73.5 (1B).
Contributions. We highlight the contributions of our paper
below:
• We present FCM, a simple and scalable pre-training methodology for causal language modeling, and provide an empirical evaluation of FCM on a suite of few-shot and finetuning benchmarks.
• We show that FCM is highly effective at improving zero-shot and few-shot learning, outperforming strong baselines including PaLM and UL2: FCM improves the average SuperGLUE score of the 8-billion-parameter PaLM from 61.6 to 64.0 and improves PaLM on a wide range of 19 NLP tasks.
• In addition to few-shot learning, we demonstrate that FCM significantly helps with finetuning on downstream tasks, improving the performance of the 8-billion-parameter PaLM on 8 out of 8 SuperGLUE tasks and raising the average SuperGLUE score from 80.7 to 83.1.
• We propose Two-Pass FCM (T-FCM), a simple yet effective extension of FCM that introduces bidirectional context to causal language models without altering the sequence order. We observe that T-FCM further boosts the finetuning SuperGLUE score from 80.7 to 87.8 without affecting few-shot learning performance.
2. Method
2.1. Pre-training Objective
Forgetful Causal Masking (FCM). FCM uses a standard causal, decoder-only Transformer model architecture (Vaswani et al., 2017), i.e., each timestep can only attend to itself and past timesteps. We illustrate FCM in Figure 2. Given an input text $x = [x_1, \cdots, x_n]$, the standard causal language modeling objective maximizes the log-likelihood of $x$ autoregressively:
the log likelihood of xautoregressively:
log p(x) = log
n
Y
i=1
p(xi|x1, x2, . . . , xi1)
= log
n
Y
i=1
p(xi|x<i) := log
n
Y
i=1
p(xi|[xj]i1
j=0).
(1)
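For concreteness, the objective in Eq. (1) can be computed from per-position model outputs roughly as in the following JAX sketch. This is a minimal illustration under our own assumptions (a single unbatched sequence, with `logits[i]` produced from the tokens up to position `i` and predicting the next one); it is not the PaLM training code.

```python
import jax
import jax.numpy as jnp

def causal_lm_log_likelihood(logits, tokens):
    """Sum of next-token log-probabilities, i.e. log p(x) as in Eq. (1).

    logits: [seq_len - 1, vocab] -- logits[i] is produced from tokens[: i + 1].
    tokens: [seq_len] integer token ids, with tokens[0] being the BOS token.
    """
    targets = tokens[1:]                              # token to predict at each step
    log_probs = jax.nn.log_softmax(logits, axis=-1)   # normalize over the vocabulary
    token_lp = jnp.take_along_axis(log_probs, targets[:, None], axis=-1)[:, 0]
    return token_lp.sum()
```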
In FCM, we randomly sample a mask ratio $m \in [0, \eta]$, where $\eta \in [0, 1]$ is a fixed maximum mask ratio; we use $\eta = 0.15$ throughout the experiments unless otherwise mentioned. The model is asked to predict each token $x_i \in x$, and can only attend to tokens in $x_{<i}$ that are not sampled for masking. Concretely, the FCM objective is given by:
$$\log p(x) = \log \prod_{i=1}^{n} p\!\left(x_i \mid \left[\mathbb{1}[m_j > \eta] \cdot x_j\right]_{j=0}^{i-1}\right), \tag{2}$$

where $m_j \sim \mathcal{U}(0, 1)$.
This can be efficiently implemented by combining it with the causal attention mask. While applying random masking to the token sequence, we always exclude the special BOS ('beginning of sentence') token at the beginning of each sequence, so that the model is aware of the beginning of a sentence. Moreover, keeping the BOS token unmasked helps with training stability, because it ensures that at least one token remains unmasked without changing the semantic meaning of the sequence. For example, when predicting token $x_t$ for small $t$, it is possible that all tokens $[x_1, \ldots, x_{t-1}]$ are masked, which can cause instability in the training loss. We found that this technique enables us to train with arbitrarily high mask ratios without incurring instability.
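The following JAX sketch shows one way the FCM mask could be combined with the causal attention mask, following the prose description above (sample a per-sequence mask ratio from $[0, \eta]$, drop each past token with that probability, and never drop the BOS token at position 0). The function name, the shapes, and the choice to hide a dropped token from every query are our own illustrative assumptions, not the authors' implementation.

```python
import jax
import jax.numpy as jnp

def fcm_attention_mask(rng, seq_len, max_mask_ratio=0.15):
    """Boolean [seq_len, seq_len] mask: True where query i may attend to key j."""
    ratio_rng, drop_rng = jax.random.split(rng)
    # Per-sequence mask ratio sampled from [0, max_mask_ratio] (eta in the text).
    mask_ratio = jax.random.uniform(ratio_rng, (), minval=0.0, maxval=max_mask_ratio)
    # Independently decide which tokens remain visible as attention keys.
    keep = jax.random.uniform(drop_rng, (seq_len,)) >= mask_ratio
    keep = keep.at[0].set(True)  # position 0 (BOS) is never masked
    causal = jnp.tril(jnp.ones((seq_len, seq_len), dtype=bool))
    # Dropped tokens are hidden from all queries; the causal structure is unchanged,
    # so this only modifies the existing attention mask and adds no extra computation.
    return causal & keep[None, :]
```

In this sketch, the returned mask simply replaces the plain causal mask during pre-training; at inference time the ordinary causal mask would be used.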
Two-Pass FCM (T-FCM). Prior work has observed that masked language models attain better finetuning performance than causal language models of similar or larger size (see, e.g., Wang et al., 2022; Tay et al., 2022, inter alia). One hypothesis for this performance gap is that masked language models can use bidirectional context during training, while causal language models cannot. To bridge this gap, we propose Two-Pass FCM (T-FCM), which introduces bidirectional context into causal language models during training. T-FCM simply replicates the input sequence, so that the first pass is identical to causal language modeling, while the second pass resembles masked language modeling, since the model can attend to masked future tokens by looking at the first pass.
During training, T-FCM introduces an additional sentinel token, [copy], to let the model know that a copied sequence begins. In practice, we found it important not to apply the loss when predicting [bos] from the [copy] token, as doing so destabilizes training: the position of [copy] is arbitrary, hence predicting [bos] from [copy] is not well-defined.
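To make the two-pass construction concrete, the sketch below builds the doubled token sequence and a per-position loss weight that skips the [bos]-after-[copy] prediction. It is a minimal JAX illustration under our own assumptions (the `copy_id` sentinel value and the 0/1 loss-weight convention are ours); the attention masking over the doubled sequence is handled separately by FCM.

```python
import jax.numpy as jnp

def make_tfcm_inputs(tokens, copy_id):
    """Illustrative T-FCM input construction (token ids and loss weights only).

    tokens: [seq_len] integer ids starting with [bos]; copy_id: id of the
    [copy] sentinel. Returns shifted inputs, targets, and per-target weights.
    """
    doubled = jnp.concatenate([tokens, jnp.array([copy_id]), tokens])
    # Standard next-token setup: each input position predicts the following token.
    inputs, targets = doubled[:-1], doubled[1:]
    loss_weights = jnp.ones_like(targets, dtype=jnp.float32)
    # The position whose input is [copy] would have to predict the replicated
    # [bos]; exclude it from the loss because the placement of [copy] is arbitrary.
    loss_weights = jnp.where(inputs == copy_id, 0.0, loss_weights)
    return inputs, targets, loss_weights
```

With this construction, a causal (or FCM) attention mask over the doubled sequence already lets second-pass positions attend to the entire first pass, which is how the second pass sees bidirectional context.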