Transcending Scaling Laws with 0.1% Extra Compute
Yi Tay Jason Wei Hyung Won Chung Vinh Q. Tran David R. So Siamak Shakeri
Xavier Garcia Huaixiu Steven Zheng Jinfeng Rao Aakanksha Chowdhery
Denny Zhou Donald Metzler Slav Petrov Neil Houlsby
Quoc V. Le Mostafa Dehghani
Google
Abstract
Scaling language models improves performance but comes with significant computational costs. This paper proposes UL2R, a method that substantially improves existing language models and their scaling curves with a relatively tiny amount of extra compute. The key idea is to continue training a state-of-the-art large language model (e.g., PaLM) for a few more steps with UL2's mixture-of-denoisers objective. We show that, with almost negligible extra computational cost and no new sources of data, we are able to substantially improve the scaling properties of large language models on downstream metrics. In this paper, we continue training PaLM with UL2R, introducing a new set of models at 8B, 62B, and 540B scale which we call U-PaLM. Impressively, at 540B scale we observe approximately 2x computational savings: U-PaLM achieves the same performance as the final PaLM 540B model at around half its computational budget (i.e., saving approximately 4.4 million TPUv4 hours).
We further show that this improved scaling curve leads to "emergent abilities" on challenging BIG-Bench tasks; for instance, U-PaLM does much better than PaLM on some tasks, or demonstrates better quality at a much smaller scale (62B as opposed to 540B). Overall, we show that U-PaLM outperforms PaLM on many few-shot setups, including English NLP tasks (e.g., commonsense reasoning, question answering), reasoning tasks with chain-of-thought (e.g., GSM8K), multilingual tasks (MGSM, TydiQA), MMLU, and challenging BIG-Bench tasks. Finally, we provide qualitative examples showing the new capabilities of U-PaLM for single- and multi-span infilling.
Figure 1: Compute (training FLOPs) versus quality (average over 20+ NLP zero- and few-shot tasks listed in Appendix 7.1). The black dotted line shows the path of initializing from a PaLM checkpoint and training further with UL2R.
1 Introduction
There has been significant interest in scaling language models (Rae et al., 2021; Chowdhery et al., 2022; Brown et al., 2020). Scaling has inspired new research across multiple fronts, e.g., scaling laws (Kaplan et al., 2020; Hoffmann et al., 2022; Tay et al., 2022a), emergent abilities (Wei et al., 2022a; Ganguli et al., 2022), and reasoning capabilities (Wei et al., 2022b; Lewkowycz et al., 2022), inter alia. Generally, scaling laws predict a continued improvement in language model quality as we continue to scale up the computational budget (e.g., bigger models or more data). To date, most large language models that form the basis of scaling law research are trained almost exclusively as left-to-right causal language models (Kaplan et al., 2020; Hoffmann et al., 2022).
This paper proposes a new method to dramatically improve the scaling curves of large language models on downstream performance with a relatively tiny amount of additional computation cost. The key idea is to continue training an existing causal language model (Chowdhery et al., 2022) with a mixture of new objectives—specifically, the UL2 training objective mixture (Tay et al., 2022b). This restoration is expected to only cost roughly 0.1% to 1% of the original training FLOPs and requires no new data sources, making it highly efficient and convenient. We call this approach UL2R or UL2Restore.
The UL2 objective combines prefix language modeling and long-short span corruption (e.g., infilling) tasks (Raffel et al., 2019) that can be controlled at inference time using a mode-switching prompt. Training a large language model with UL2 can be interpreted as teaching it to leverage bidirectional attention (i.e., PrefixLM) or to leverage infilling-style pretraining, which has been the foundation of language understanding (e.g., T5 (Raffel et al., 2019)). To this end, we postulate that imbuing a state-of-the-art large language model such as PaLM (Chowdhery et al., 2022) with these diverse pretraining schemes, as a complement to the original language modeling objective, enables the model to perform significantly better. Moreover, the UL2 objective enables new prompting capabilities in PaLM, allowing it to perform infilling-based prompting.
We show that adapting PaLM with UL2R not only results in significantly better scaling laws on well-established few-shot NLP tasks, but is also markedly more compute-efficient: in our scaling experiments on downstream few-shot tasks, UL2R yields approximately 2x computational savings at 540B scale, reaching the performance of the final PaLM 540B model with only half the computation and saving up to 4.4 million TPUv4 hours.
In addition to competitive performance across a range of well-established NLP (Wang et al., 2019), multilingual (Clark et al., 2020a; Shi et al., 2022), and reasoning (Cobbe et al., 2021) benchmarks, we also study the impact of UL2R on a suite of challenging BIG-Bench tasks from Wei et al. (2022a). Notably, a subset of these tasks is described as 'emergent' because PaLM's performance remains flat up to a model scale of 62B and only becomes better than random at 540B scale. On this set of tasks, we find that UL2R (1) yields significantly better performance on tasks that PaLM struggles with (e.g., navigate, geometric shapes, hyperbaton) and (2) elicits emergent behavior at a smaller scale such as 62B or 8B (e.g., crass ai, vitaminc fact verification). On top of that, U-PaLM strongly outperforms PaLM on some challenging BIG-Bench tasks.
Emergence within the context of large language models is a nascent research area. As the Nobel prize-winning physicist Philip Anderson put it, 'More is different' (Anderson, 1972), describing the unpredictable phenomena that arise at different scales. In our context, with the mixture-of-denoisers in UL2, we like to think of this phenomenon as 'more is different, but different can also be more', since different pretraining objectives can improve language model quality or elicit new emergent abilities. This work shows that diversity and richer training paradigms can be key to learning new capabilities that were previously hard to acquire with only causal language modeling.
Finally, in addition to emergent task performance and overall improved scaling curves, we show that U-PaLM is also practically more useful since it is equipped with a secondary mode of prompting, namely bidirectional infilling. Specifically, UL2R gives U-PaLM a secondary prompting capability that can be used to fill in more than one blank in the input prompt. Interestingly, we find that only a small amount of UL2R (e.g., 0.1% of tokens or FLOPs) is sufficient to imbue the model with this new capability.
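As a purely illustrative sketch of what multi-span infilling prompting looks like (the blank markers, helper name, and example text below are hypothetical placeholders in the spirit of T5-style sentinels, not U-PaLM's actual vocabulary or prompt format):

# Hypothetical illustration of a multi-span infilling prompt.
def build_infilling_prompt(segments, blank_token="<extra_id_{}>"):
    """Interleave text segments with numbered blank markers for the model to fill."""
    parts = []
    for i, segment in enumerate(segments):
        parts.append(segment)
        if i < len(segments) - 1:
            parts.append(blank_token.format(i))   # one blank between consecutive segments
    return " ".join(parts)

prompt = build_infilling_prompt([
    "The research paper proposes",
    "which improves scaling curves with only",
    "extra compute.",
])
print(prompt)
# The research paper proposes <extra_id_0> which improves scaling
# curves with only <extra_id_1> extra compute.

The paper's qualitative examples (Section on infilling) show the actual prompting behavior; this snippet only conveys the idea of asking the model to generate more than one missing span.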
2 Related Work
Large language models
Scaling and improving large language models is one of the most impactful research areas in modern artificial intelligence (Chowdhery et al., 2022). To this end, large language models not only continue to improve as we scale in terms of data or computational budget (Hoffmann et al., 2022; Kaplan et al., 2020) but also acquire new abilities (Wei et al., 2022a). The impact of large language models has been ubiquitous and pervasive, unlocking breakthroughs across many fields, e.g., reasoning (Wei et al., 2022b; Wang et al., 2022b; Zhou et al., 2022; Drozdov et al., 2022), math (Lewkowycz et al., 2022), dialog (Thoppilan et al., 2022), multimodal applications (Yu et al., 2022), retrieval (Tay et al., 2022c), inter alia.
While many paradigms and self-supervision methods have been proposed to train these models (Devlin et al., 2018; Clark et al., 2020b; Yang et al., 2019; Raffel et al., 2019), to date most large language models (i.e., more than 100B parameters) are trained as decoder-only causal language models. For example, flagship large language models such as GPT-3 (Brown et al., 2020), Gopher (Rae et al., 2021) and PaLM (Chowdhery et al., 2022) are all trained as causal language models. Meanwhile, bidirectional models (e.g., BERT (Devlin et al., 2018), T5 (Raffel et al., 2019), ST-MoE (Zoph et al., 2022)) have also been very popular as the go-to models of choice, especially in smaller computational regimes (e.g., fewer than 30B parameters and often in the range of hundreds of millions of parameters).
Scaling laws of large language models
Kaplan et al. (2020) investigated scaling laws of Transformer language models and first showed that scaling laws are predictive of future performance. The authors found that model size (and not shape) correlates strongly with model quality, i.e., upstream cross entropy. Tay et al. (2021) studied the scaling properties of encoder-decoder models and their impact on upstream and downstream finetuning tasks. Generally, Tay et al. (2021) found that upstream perplexity and downstream quality do not always correlate. As a follow-up, Tay et al. (2022a) studied the scaling laws of different model architectures and found that inductive bias does significantly impact the scaling behavior of the model. Finally, Hoffmann et al. (2022) proposed compute-optimal models, popularizing the 'Chinchilla' scaling laws, an approach that aims to predict the optimal amount of data given the number of model parameters. In this work, we mainly consider scaling laws over downstream performance, largely because this is more reflective of a language model's usability. Since downstream performance is more important than upstream cross entropy, we advocate that future scaling studies always incorporate downstream evaluation (and metrics) rather than only using cross entropy loss.
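For background, the compute-optimal ('Chinchilla') analysis of Hoffmann et al. (2022) fits upstream loss with a simple parametric form in the number of parameters N and training tokens D; the constants E, A, B, alpha, and beta are fitted empirically in that work and are quoted here only as context, not as results of this paper:

% Parametric loss fit from the compute-optimal ("Chinchilla") analysis.
% N = number of model parameters, D = number of training tokens.
L(N, D) = E + \frac{A}{N^{\alpha}} + \frac{B}{D^{\beta}}

Note that this form predicts upstream cross entropy only, which is precisely the gap that motivates our emphasis on downstream evaluation.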
Emergent Abilities
New behaviors that arise from scaling language models have increasingly been referred to as emergent abilities (Steinhardt, 2022; Ganguli et al., 2022; Wei et al., 2022a). For instance, Wei et al. (2022a) define emergent abilities as "abilities that are not present in smaller models but are present in larger models." For a few-shot prompted task, this looks like a flat scaling curve (random performance) until a certain critical threshold, at which point performance increases to substantially above random. This type of phenomenon has been observed across dozens of tasks in the BIG-Bench benchmark (Srivastava et al., 2022). Although such emergent abilities are typically observed as a function of scale, increasing model scale to induce emergent abilities is computationally expensive. In this paper we show how UL2R unlocks emergence without increasing the number of model parameters.
Continued Training of Language Models
The paradigm of continuing to train (or finetune) a language model on more data or tasks is commonly known as adaptation. A range of prior work has shown that finetuning language models on a collection of NLP tasks can improve downstream performance on a broad range of downstream tasks (Aghajanyan et al., 2021; Aribandi et al., 2022; Wei et al., 2021; Sanh et al., 2022; Ouyang et al., 2022, inter alia). The majority of this prior work, however, requires additional data, such as aggregating dozens or hundreds of NLP datasets (Raffel et al., 2019; Aghajanyan et al., 2021; Aribandi et al., 2022), writing additional templates of instructions (Wei et al., 2021; Sanh et al., 2022), or finetuning on human-labeled annotations (Ouyang et al., 2022). UL2R does not require new data since it simply re-uses the pre-training data, which makes it orthogonal to continued training methods that leverage large collections of
NLP datasets. Adapting a pretrained language model with a new self-supervised objective has also been explored. For example, a model trained with a language modeling objective can be adapted by further training with a masked language modeling objective (Wang et al., 2022a). The other direction is also possible: a model trained with a masked language modeling objective can be adapted with a causal language modeling objective (Wang et al., 2022a; Lester et al., 2021). UL2R follows a similar idea but uptrains a language model with a set of diverse new pretraining tasks from the mixture-of-denoisers, even after vast amounts of standard pretraining, and demonstrates rapid improvements on a variety of setups and tasks.
Unified language learner (UL2)
The UL2 model (Tay et al., 2022b) is a state-of-the-art model that bridges generative causal language models and bidirectional language models. UL2 proposes a mixture-of-denoisers objective that mixes prefix (non-causal) language modeling and infilling (span corruption) within the same model and leverages mode prompts to switch between modes during downstream tasks. UL2 is architecture-agnostic: the authors argue that the choice of decoder-only versus encoder-decoder models is largely an efficiency trade-off. In Tay et al. (2022b), the final UL2 model was trained as a 20B encoder-decoder model, which achieved very compelling performance on both finetuning and in-context learning.
3 U-PaLM
This section introduces the technical details of U-PaLM (i.e., PaLM + UL2R). U-PaLM is initialized from PaLM and uses the same architecture. We describe the UL2R training procedure and how it is applied to continue training PaLM.
3.1 Training Data
To keep things consistent, we train this model with the same data mixture as PaLM and do not rely on
additional sources of data (labeled or unlabeled).
There are three main reasons for this choice. First, we did not want to introduce new tokens into our training process, which could confound our findings. Second, we did not want to over-index on scaling studies that only measure impact on upstream cross entropy (Hernandez et al., 2022), which claim that repeating data in small quantities could be disproportionately harmful. Since the empirical results we obtain are strong, we postulate that repeating tokens may not be harmful in small quantities after all. This is also supported by the continued training of PaLM 62B in Chowdhery et al. (2022), which showed that repeated data can yield small gains, albeit not as strong as fresh tokens. Third, we consider our data transformation (via UL2) on the training data sufficiently distinct, which prevents us from explicitly training on the same data with the exact same objective and mitigates memorization concerns.
3.2 Prefix Language Model Architecture
We train U-PaLM using the prefix language model (PrefixLM) architecture, also sometimes known as a
non-causal decoder-only model. The PrefixLM architecture keeps a non-causal mask in its prefix (or inputs)
and applies bidirectional attention to input tokens.
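To make the attention pattern concrete, the following is a minimal NumPy sketch of a PrefixLM attention mask (our illustration under stated assumptions, not PaLM's implementation): positions inside the prefix attend to the whole prefix bidirectionally, while remaining positions attend causally.

import numpy as np

def prefix_lm_mask(seq_len, prefix_len):
    """Boolean mask where True means position i may attend to position j."""
    causal = np.tril(np.ones((seq_len, seq_len), dtype=bool))    # standard causal mask
    bidirectional_prefix = np.zeros((seq_len, seq_len), dtype=bool)
    bidirectional_prefix[:, :prefix_len] = True                   # every position sees the full prefix
    return causal | bidirectional_prefix

print(prefix_lm_mask(seq_len=6, prefix_len=3).astype(int))
# Rows 0-2 (prefix) see all three prefix tokens and no targets;
# rows 3-5 (targets) see the prefix plus earlier targets, i.e., causal attention.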
In this architecture, we use a total combined sequence length of 2048 (i.e., PaLM's sequence length), which is then split into 1024 input tokens and 1024 target tokens. In the original UL2 paper and infrastructure, an artifact of its preprocessing pipeline applies padding tokens first before combining inputs and targets. For decoder-only language models, this is inefficient since we would end up with a concatenation of [prefix] [prefix's padding] [target].
In this work, we optimize the prefix padding by concatenating the prefix and target before applying any additional padding; packing, trimming, and padding are applied only after the prefix has been concatenated with the targets. Through this prefix optimization, we improve the example-level sample efficiency of the model.
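A minimal sketch of the difference (illustrative Python; the pad value, helper names, and omission of trimming are our simplifications): the original UL2 pipeline pads the inputs before concatenation, whereas the optimization described above concatenates the prefix with its targets first and pads only once at the end.

PAD = 0  # placeholder pad token id

def ul2_style_pack(prefix, target, input_len=1024, target_len=1024):
    # Original pipeline (inefficient for decoder-only models):
    # pad the prefix to its fixed length first, then append the target,
    # producing [prefix] [prefix padding] [target].
    padded_prefix = prefix + [PAD] * (input_len - len(prefix))
    return padded_prefix + target + [PAD] * (target_len - len(target))

def optimized_pack(prefix, target, total_len=2048):
    # Prefix optimization: concatenate prefix and target first, then trim/pad
    # the combined sequence once, so no padding sits between prefix and targets.
    combined = (prefix + target)[:total_len]
    return combined + [PAD] * (total_len - len(combined))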
3.3 Loss Objectives
This section describes the settings of the UL2 mixture-of-denoisers that we use in UL2R. The UL2 mixture-of-denoisers objective comprises three types of denoisers.
Regular denoising
whereby the noise is sampled as spans that are replaced with sentinel tokens. This is the standard span corruption task used in Raffel et al. (2019). Spans are typically uniformly sampled with a mean length of 3 and a corruption rate of 15%.
Extreme denoising
whereby the noise is increased to relatively 'extreme' amounts, either by corrupting a large percentage of the original text or by using very long spans. Spans are typically uniformly sampled with a mean length of 32 or a corruption rate of up to 50%.
Sequential denoising
whereby the noise is always sampled from the start of the text to a randomly sampled point in the text. This is also known as the PrefixLM objective (not to be confused with the architecture).
We kept this setup simple since many ablations were already explored in Tay et al. (2022b). We kept the original 7 denoisers as the initial version but later found that a mixture of only three tasks, i.e., 50% PrefixLM, 25% long (extreme) span corruption, and 25% regular span corruption, is quite simple and efficient for the setup of continued training. We kept the original mode prompting tokens from the original UL2 design, using [S2S] for S-denoisers (PrefixLM), [NLU] for R-denoisers, and [NLG] for X-denoisers. The 540B U-PaLM model was mainly trained with 50% S-denoiser (PrefixLM), 25% R-denoisers, and 25% X-denoisers.
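The sketch below spells out this final mixture as a configuration and shows how a mode token could be prepended when sampling an example (a simplified illustration under the proportions and mode tokens stated above; the helper names are ours and the corruption routines themselves are omitted).

import random

# Final UL2R mixture used for continued training (proportions from the text above).
DENOISER_MIX = [
    # (mode token, denoiser name, sampling weight, illustrative span settings)
    ("[S2S]", "sequential_prefix_lm",     0.50, None),
    ("[NLU]", "regular_span_corruption",  0.25, {"mean_span": 3,  "corrupt_rate": 0.15}),
    # X-denoiser: long spans (mean 32) and/or high corruption rates (up to 0.50).
    ("[NLG]", "extreme_span_corruption",  0.25, {"mean_span": 32, "corrupt_rate": 0.50}),
]

def sample_mode_token(rng=random):
    """Pick a denoiser according to the mixture weights and return its mode token."""
    modes, _, weights, _ = zip(*DENOISER_MIX)
    return rng.choices(modes, weights=weights, k=1)[0]

def format_example(text, rng=random):
    """Prepend the sampled mode token; the actual corruption step is omitted here."""
    return f"{sample_mode_token(rng)} {text}"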
3.4 Training
We train the 540B model for a total of 20k steps with a batch size of 32. We mildly ablated these settings in early experiments with 62B and 8B models but kept them within a certain ballpark (e.g., batch size 128 for 50k steps). As a result, this is more similar to finetuning than to full pretraining. The number of additional tokens is therefore negligible compared to the original pretraining run, often coming in at around or less than 0.1% additional compute. The total number of extra tokens we train on for the 540B model is approximately 1.3 billion, which constitutes about 0.16% extra computation. We use a cosine learning rate decay schedule that anneals the learning rate from 10^-4 to 10^-6. Notably, we also tried a low constant learning rate and found it to perform nearly identically. Our U-PaLM 8B and 62B models are trained using 64 TPUv4 chips. Training the U-PaLM 540B model consumes only 512 TPUv4 chips and finishes in about 5 days, which is considered lightweight.
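As a back-of-the-envelope check and an illustration of the schedule (a minimal sketch: the 780B-token figure for PaLM's original pretraining is taken from Chowdhery et al. (2022), the exact denominator behind the 0.16% figure above is our assumption, and the cosine function below is a generic formulation, not the production implementation):

import math

# Extra UL2R tokens relative to PaLM's original pretraining run.
extra_tokens = 1.3e9      # ~1.3B additional tokens (this section)
palm_tokens = 780e9       # PaLM 540B pretraining tokens (Chowdhery et al., 2022)
print(f"extra fraction: {extra_tokens / palm_tokens:.2%}")   # ~0.17%, in line with the ~0.16% quoted above

def cosine_lr(step, total_steps=20_000, lr_max=1e-4, lr_min=1e-6):
    """Generic cosine decay from lr_max to lr_min over the UL2R training steps."""
    progress = min(step, total_steps) / total_steps
    return lr_min + 0.5 * (lr_max - lr_min) * (1.0 + math.cos(math.pi * progress))

print(cosine_lr(0), cosine_lr(10_000), cosine_lr(20_000))    # 1e-4, ~5e-5, 1e-6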
4 Experiments
This section reports the experimental results of U-PaLM.
4.1 Improved Scaling Properties on Few-shot Learning
In this experiment, we show improved scaling curves from small amounts of UL2R training on top of both
PaLM 8B and PaLM 540B. We use downstream metrics and few-shot evaluation since (1) this is closer to