
learning, as shown in our SuperGLUE finetuning experimental results, where our method improves the 1B-parameter PaLM model's finetuning performance from 67.0 to 68.7, and improves the 8B-parameter PaLM model's finetuning performance on all 8 SuperGLUE tasks, raising the score from 80.7 to 83.1. We also propose an extension of our method called Two-Pass FCM (T-FCM), which applies FCM twice on a replicated input sequence. In doing so, T-FCM effectively lets the causal language model see bidirectional context without altering the sequence ordering. While this adds extra computation cost during training, T-FCM further boosts finetuning performance without hurting few-shot results, improving the score from 80.7 to 87.8 (8B) and from 67.0 to 73.5 (1B).
Contributions. We highlight the contributions of our paper
below:
• We present FCM, a simple and scalable pre-training methodology for causal language modeling, and provide an empirical evaluation of FCM on a suite of few-shot and finetuning benchmarks.
• We show that FCM is highly effective at improving zero-shot and few-shot learning results, outperforming strong baselines including PaLM and UL2: it improves the average SuperGLUE score of the 8-billion-parameter PaLM from 61.6 to 64.0, and improves PaLM on a wide range of 19 NLP tasks.
• In addition to few-shot learning, we demonstrate that FCM significantly helps with finetuning on downstream tasks, improving the performance of the 8-billion-parameter PaLM on 8 out of 8 SuperGLUE tasks and raising the average SuperGLUE score from 80.7 to 83.1.
• We propose Two-Pass FCM (T-FCM), a simple yet effective extension of FCM that introduces bidirectional context to causal language models without altering the sequence order. We observe that T-FCM further boosts the finetuned SuperGLUE score from 80.7 to 87.8 without affecting few-shot learning performance.
2. Method
2.1. Pre-training Objective
Forgetful Causal Masking (FCM). FCM uses a standard causal, decoder-only Transformer model architecture (Vaswani et al., 2017), i.e., each timestep can only attend to itself and past timesteps. We illustrate FCM in Figure 2. Given an input text $x = [x_1, \cdots, x_n]$, the standard causal language modeling objective is defined to maximize the log-likelihood of $x$ autoregressively:
$$\log p(x) = \log \prod_{i=1}^{n} p(x_i \mid x_1, x_2, \ldots, x_{i-1}) = \log \prod_{i=1}^{n} p(x_i \mid x_{<i}) := \log \prod_{i=1}^{n} p\big(x_i \mid [x_j]_{j=0}^{i-1}\big). \qquad (1)$$
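As a minimal illustration of Eq. (1), the sketch below computes the sequence log-likelihood from per-position decoder outputs. The function name, array shapes, and the convention that the logits at position $i$ predict $x_i$ from $x_{<i}$ are illustrative assumptions on our part, not part of the method.

```python
# Minimal sketch (ours, not the paper's code) of the log-likelihood in Eq. (1).
# Assumes logits[i] are the decoder's predictions for x_i conditioned on x_{<i}.
import jax
import jax.numpy as jnp

def causal_lm_log_likelihood(logits, tokens):
    # logits: [seq_len, vocab_size]; tokens: [seq_len] integer token ids
    log_probs = jax.nn.log_softmax(logits, axis=-1)
    # Gather log p(x_i | x_{<i}) at every position and sum over the sequence.
    token_log_probs = jnp.take_along_axis(log_probs, tokens[:, None], axis=-1)[:, 0]
    return jnp.sum(token_log_probs)
```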
In FCM, we randomly sample a mask ratio $m$ uniformly from $[0, \eta]$, where $\eta \in [0, 1]$ is a fixed maximum mask ratio. We use $\eta = 0.15$ throughout the experiments unless otherwise mentioned. The model is asked to predict each token $x_i \in x$, and can only attend to tokens in $x_{<i}$ that are not sampled for masking. Concretely, the FCM objective is given by:
$$\log p(x) = \log \prod_{i=1}^{n} p\Big(x_i \,\Big|\, \big[\mathbb{I}[m_j > \eta] \cdot x_j\big]_{j=0}^{i-1}\Big), \qquad (2)$$
where $m_j \sim \mathcal{U}(0, 1)$. This can be efficiently implemented by combining it with the causal attention mask. While applying random masking to the token sequence, we always exclude the special BOS (‘beginning of sentence’) token at the beginning of each sequence, so that the model is aware of the beginning of a sentence. Moreover, keeping the BOS token unmasked helps with training stability because it ensures that there is at least one unmasked token, without changing the semantic meaning of the sequence. For example, when predicting token $x_t$ for small $t$, it is possible that all tokens $[x_1, \ldots, x_{t-1}]$ are masked, which can cause instability in the training loss. We found that this technique enables us to train with arbitrarily high mask ratios without incurring instability.
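To illustrate how the random masking can be folded into the attention mask, the following sketch builds a joint FCM/causal mask. The function name, shapes, and the choice to always keep the diagonal visible are our own assumptions rather than the paper's reference implementation.

```python
# Illustrative FCM mask construction (our sketch), assuming position 0 holds BOS.
import jax
import jax.numpy as jnp

def fcm_attention_mask(rng, seq_len, max_mask_ratio=0.15):
    rng_ratio, rng_tokens = jax.random.split(rng)
    # Sample the mask ratio m ~ U(0, eta), then drop each token with probability m.
    mask_ratio = jax.random.uniform(rng_ratio, (), minval=0.0, maxval=max_mask_ratio)
    keep = jax.random.uniform(rng_tokens, (seq_len,)) > mask_ratio  # True = visible
    keep = keep.at[0].set(True)  # the BOS token is never masked
    # Standard causal mask: query i may attend to key positions j <= i.
    causal = jnp.tril(jnp.ones((seq_len, seq_len), dtype=bool))
    # Dropped keys are hidden from all queries; we keep the diagonal so every
    # position can still attend to itself (an assumption on our part).
    return (causal & keep[None, :]) | jnp.eye(seq_len, dtype=bool)
```

In a standard decoder, such a boolean mask would simply be applied to the attention logits (e.g., by adding a large negative value at disallowed positions), so no change to the model architecture is required.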
Two-Pass FCM (T-FCM). Prior work has found that masked language models achieve better finetuning performance than similarly sized or larger causal language models (see, e.g., Wang et al., 2022; Tay et al., 2022, inter alia). One hypothesis for this performance gap is that masked language models can use bidirectional context during training, while causal language models cannot. To bridge this gap, we propose Two-Pass FCM (T-FCM) to introduce bidirectional context into causal language models during training. T-FCM simply duplicates the input sequence: the first pass is identical to causal language modeling, and the second pass is similar to masked language modeling, since the model can attend to masked future tokens by looking at the first pass.
During training, T-FCM introduces an additional sentinel token [copy] to let the model know that a copied sequence begins. In practice, we found it important not to apply the loss to the prediction of [bos] from the [copy] token; otherwise, training is destabilized. The reason is that the position of [copy] is arbitrary, hence predicting [bos] from [copy] is not well-defined.
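As a concrete sketch of this input construction, the snippet below duplicates a tokenized sequence around a [copy] sentinel and masks out the loss at the single position that would predict [bos] from [copy]. The token ids, function name, and next-token target convention are illustrative assumptions on our part.

```python
# Illustrative T-FCM example construction (our sketch, not the paper's code).
import jax.numpy as jnp

BOS_ID, COPY_ID = 1, 2  # hypothetical special-token ids

def make_tfcm_example(tokens):
    # tokens: [n] integer ids, with tokens[0] == BOS_ID
    n = tokens.shape[0]
    inputs = jnp.concatenate([tokens, jnp.array([COPY_ID]), tokens])
    # Next-token targets; position n predicts [bos] from [copy], so its loss is
    # masked out because the location of [copy] is arbitrary.
    targets = inputs[1:]
    loss_mask = jnp.ones_like(targets, dtype=bool).at[n].set(False)
    return inputs[:-1], targets, loss_mask
```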