Transcending Scaling Laws with 0.1% Extra Compute
Yi Tay Jason Wei Hyung Won Chung Vinh Q. Tran David R. So Siamak Shakeri
Xavier Garcia Huaixiu Steven Zheng Jinfeng Rao Aakanksha Chowdhery
Denny Zhou Donald Metzler Slav Petrov Neil Houlsby
Quoc V. Le Mostafa Dehghani
Google
Abstract
Scaling language models improves performance but comes with significant computational costs. This paper proposes UL2R, a method that substantially improves existing language models and their scaling curves with a relatively tiny amount of extra compute. The key idea is to continue training a state-of-the-art large language model (e.g., PaLM) for a few more steps with UL2's mixture-of-denoisers objective. We show that, with almost negligible extra computational cost and no new sources of data, we are able to substantially improve the scaling properties of large language models on downstream metrics. In this paper, we continue training PaLM with UL2R, introducing a new set of models at 8B, 62B, and 540B scale which we call U-PaLM. Impressively, at 540B scale we observe approximately 2x computational savings: U-PaLM achieves the same performance as the final PaLM 540B model at around half its computational budget (i.e., saving approximately 4.4 million TPUv4 hours).
We further show that this improved scaling curve leads to "emergent abilities" on challenging BIG-Bench tasks; for instance, U-PaLM does much better than PaLM on some tasks, or demonstrates better quality at a much smaller scale (62B as opposed to 540B). Overall, we show that U-PaLM outperforms PaLM on many few-shot setups, including English NLP tasks (e.g., commonsense reasoning, question answering), reasoning tasks with chain-of-thought (e.g., GSM8K), multilingual tasks (MGSM, TydiQA), MMLU, and challenging BIG-Bench tasks. Finally, we provide qualitative examples showing the new capabilities of U-PaLM for single- and multi-span infilling.
Figure 1: Compute (training FLOPs) versus quality (average over 20+ NLP zero- and few-shot tasks listed in Appendix 7.1). The black dotted line shows the path of initializing from a PaLM checkpoint and training further with UL2R.
1 Introduction
There has been significant interest in scaling language models (Rae et al., 2021; Chowdhery et al., 2022; Brown et al., 2020). Scaling has inspired new research across multiple fronts, e.g., scaling laws (Kaplan et al., 2020; Hoffmann et al., 2022; Tay et al., 2022a), emergent abilities (Wei et al., 2022a; Ganguli et al., 2022), and reasoning capabilities (Wei et al., 2022b; Lewkowycz et al., 2022), inter alia. Generally, scaling laws predict a continued improvement in language model quality as we continue to scale up the computational budget (e.g., bigger models or more data). To date, most large language models that form the basis of scaling law research are trained almost exclusively as left-to-right causal language models (Kaplan et al., 2020; Hoffmann et al., 2022).
This paper proposes a new method to dramatically improve the scaling curves of large language models on downstream performance with a relatively tiny amount of additional computation cost. The key idea is to continue training an existing causal language model (Chowdhery et al., 2022) with a mixture of new objectives—specifically, the UL2 training objective mixture (Tay et al., 2022b). This restoration is expected to only cost roughly 0.1% to 1% of the original training FLOPs and requires no new data sources, making it highly efficient and convenient. We call this approach UL2R or UL2Restore.
The UL2 objective combines prefix language modeling and long-short span corruption (e.g., infilling) tasks (Raffel et al., 2019) that can be controlled at inference time using a mode-switching prompt. Training a large language model with UL2 can be interpreted as teaching it to leverage bidirectional attention (i.e., PrefixLM) or to leverage infilling-style pretraining, which has been the foundation of language understanding (e.g., T5 (Raffel et al., 2019)). To this end, we postulate that imbuing a state-of-the-art large language model such as PaLM (Chowdhery et al., 2022) with these diverse pretraining schemes, as a complement to the original language modeling objective, enables the model to perform significantly better. Moreover, the UL2 objective enables new prompting capabilities in PaLM, allowing it to perform infilling-based prompting.
We show that adapting PaLM with UL2R not only results in significantly better scaling laws on well-established few-shot NLP tasks, but is also markedly more compute-efficient: in our scaling experiments on downstream few-shot tasks, UL2R yields approximately 2x computational savings at 540B scale, reaching the performance of the final PaLM 540B model with only half the computation and saving up to 4.4 million TPUv4 hours.
In addition to competitive performance across a range of well-established NLP (Wang et al., 2019), multilingual (Clark et al., 2020a; Shi et al., 2022), and reasoning (Cobbe et al., 2021) benchmarks, we also study the impact of UL2R on a suite of challenging BIG-Bench tasks from Wei et al. (2022a). Notably, a subset of these tasks is described as 'emergent' because PaLM's performance remains flat up to a model scale of 62B and only becomes better than random at 540B scale. On this set of tasks, we find that UL2R (1) yields significantly better performance on tasks that PaLM struggles with (e.g., navigate, geometric shapes, hyperbaton) and (2) elicits emergent behavior at a smaller scale such as 62B or 8B (e.g., crass ai, vitaminc fact verification). On top of that, U-PaLM strongly outperforms PaLM on some challenging BIG-Bench tasks.
Emergence within the context of large language models is a nascent research area. As the Nobel prize-winning physicist Philip Anderson put it, 'More is different' (Anderson, 1972), describing the unpredictable phenomena that arise at different scales. In our context, with the mixture-of-denoisers in UL2, we like to think of this phenomenon as 'more is different, but different can also be more', since different pretraining objectives can improve language model quality or elicit new emergent abilities. This work shows that diversity and richer training paradigms can be key to learning new capabilities that were previously hard to acquire with only causal language modeling.
Finally, in addition to emergent task performance and overall improved scaling curves, we show that U-PaLM is also practically more useful since it is equipped with a secondary mode of prompting, namely bidirectional infilling. Specifically, UL2R gives U-PaLM a secondary prompting capability that can be used to fill in more than one blank in the input prompt. Interestingly, we find that only a small amount of UL2R (e.g., 0.1% of tokens or FLOPs) is sufficient to imbue the model with this new capability.
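As a purely illustrative sketch of what multi-span infilling prompting looks like (the blank markers, helper name, and example text below are hypothetical placeholders in the spirit of T5-style sentinels, not U-PaLM's actual vocabulary or prompt format):

# Hypothetical illustration of a multi-span infilling prompt.
def build_infilling_prompt(segments, blank_token="<extra_id_{}>"):
    """Interleave text segments with numbered blank markers for the model to fill."""
    parts = []
    for i, segment in enumerate(segments):
        parts.append(segment)
        if i < len(segments) - 1:
            parts.append(blank_token.format(i))   # one blank between consecutive segments
    return " ".join(parts)

prompt = build_infilling_prompt([
    "The research paper proposes",
    "which improves scaling curves with only",
    "extra compute.",
])
print(prompt)
# The research paper proposes <extra_id_0> which improves scaling
# curves with only <extra_id_1> extra compute.

The paper's qualitative examples (Section on infilling) show the actual prompting behavior; this snippet only conveys the idea of asking the model to generate more than one missing span.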
2 Related Work
Large language models
Scaling and improving large language models is one of the most impactful research areas in modern artificial intelligence (Chowdhery et al., 2022). To this end, large language models not only continue to improve as we scale in terms of data or computational budget (Hoffmann et al., 2022; Kaplan et al., 2020) but also acquire new abilities (Wei et al., 2022a). The impact of large language models has been ubiquitous and pervasive, unlocking breakthroughs across many fields, e.g., reasoning (Wei et al., 2022b; Wang et al., 2022b; Zhou et al., 2022; Drozdov et al., 2022), math (Lewkowycz et al., 2022), dialog (Thoppilan et al., 2022), multimodal applications (Yu et al., 2022), retrieval (Tay et al., 2022c), inter alia.
While many paradigms and self-supervision methods have been proposed to train these models (Devlin et al., 2018; Clark et al., 2020b; Yang et al., 2019; Raffel et al., 2019), to date most large language models (i.e., more than 100B parameters) are trained as decoder-only causal language models. For example, flagship large language models such as GPT-3 (Brown et al., 2020), Gopher (Rae et al., 2021) and PaLM (Chowdhery et al., 2022) are all trained as causal language models. Meanwhile, bidirectional models (e.g., BERT (Devlin et al., 2018), T5 (Raffel et al., 2019), ST-MoE (Zoph et al., 2022)) have also been very popular as the go-to models of choice, especially in smaller computational regimes (e.g., fewer than 30B parameters and often in the range of hundreds of millions of parameters).
Scaling laws of large language models
Kaplan et al. (2020) investigated scaling laws of Transformer language models and first showed that scaling laws are predictive of future performance. The authors found that model size (and not shape) correlates strongly with model quality, i.e., upstream cross entropy. Tay et al. (2021) studied the scaling properties of encoder-decoder models and their impact on upstream and downstream finetuning tasks. Generally, Tay et al. (2021) found that upstream perplexity and downstream quality do not always correlate. As a follow-up, Tay et al. (2022a) studied the scaling laws of different model architectures and found that inductive bias does significantly impact the scaling behavior of the model. Finally, Hoffmann et al. (2022) proposed compute-optimal models, popularizing the 'Chinchilla' scaling laws, an approach that aims to predict the optimal amount of data given the number of model parameters. In this work, we mainly consider scaling laws over downstream performance, largely because this is more reflective of a language model's usability. Since downstream performance is more important than upstream cross entropy, we advocate that future scaling studies always incorporate downstream evaluation (and metrics) rather than only using cross entropy loss.
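For background, the compute-optimal ('Chinchilla') analysis of Hoffmann et al. (2022) fits upstream loss with a simple parametric form in the number of parameters N and training tokens D; the constants E, A, B, alpha, and beta are fitted empirically in that work and are quoted here only as context, not as results of this paper:

% Parametric loss fit from the compute-optimal ("Chinchilla") analysis.
% N = number of model parameters, D = number of training tokens.
L(N, D) = E + \frac{A}{N^{\alpha}} + \frac{B}{D^{\beta}}

Note that this form predicts upstream cross entropy only, which is precisely the gap that motivates our emphasis on downstream evaluation.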
Emergent Abilities
New behaviors that arise from scaling language models have increasingly been referred to as emergent abilities (Steinhardt, 2022; Ganguli et al., 2022; Wei et al., 2022a). For instance, Wei et al. (2022a) define emergent abilities as "abilities that are not present in smaller models but are present in larger models." For a few-shot prompted task, this looks like a flat scaling curve (random performance) until a certain critical threshold, at which point performance increases to substantially above random. This type of phenomenon has been observed across dozens of tasks in the BIG-Bench benchmark (Srivastava et al., 2022). Although such emergent abilities are typically observed as a function of scale, increasing model scale to induce emergent abilities is computationally expensive. In this paper we show how UL2R unlocks emergence without increasing the number of model parameters.
Continued Training of Language Models
The paradigm of continuing to train (or finetune) a language model on more data or tasks is commonly known as adaptation. A range of prior work has shown that finetuning language models on a collection of NLP tasks can improve downstream performance on a broad range of downstream tasks (Aghajanyan et al., 2021; Aribandi et al., 2022; Wei et al., 2021; Sanh et al., 2022; Ouyang et al., 2022, inter alia). The majority of this prior work, however, requires additional data, such as aggregating dozens or hundreds of NLP datasets (Raffel et al., 2019; Aghajanyan et al., 2021; Aribandi et al., 2022), writing additional templates of instructions (Wei et al., 2021; Sanh et al., 2022), or finetuning on human-labeled annotations (Ouyang et al., 2022). UL2R does not require new data since it simply re-uses the pre-training data, which makes it orthogonal to continued training methods that leverage large collections of
NLP datasets. Adapting a pretrained language model with a new self-supervised objective has also been explored. For example, a model trained with a language modeling objective can be adapted by further training with a masked language modeling objective (Wang et al., 2022a). The other direction is also possible: a model trained with a masked language modeling objective can be adapted with a causal language modeling objective (Wang et al., 2022a; Lester et al., 2021). UL2R follows a similar idea but uptrains a language model with a set of diverse new pretraining tasks from the mixture-of-denoisers, even after vast amounts of standard pretraining, and demonstrates rapid improvements on a variety of setups and tasks.
Unified language learner (UL2)
The UL2 model (Tay et al., 2022b) is a state-of-the-art model that bridges generative causal language models and bidirectional language models. UL2 proposes a mixture-of-denoisers objective that mixes prefix (non-causal) language modeling and infilling (span corruption) within the same model and leverages mode prompts to switch between modes during downstream tasks. UL2 is architecture-agnostic: the authors argue that the choice of decoder-only versus encoder-decoder models is largely an efficiency trade-off. In Tay et al. (2022b), the final UL2 model was trained as a 20B encoder-decoder model, which achieved very compelling performance on both finetuning and in-context learning.
3 U-PaLM
This section introduces the technical details of U-PaLM (i.e., PaLM + UL2R). U-PaLM is initialized from PaLM and uses the same architecture. We describe the UL2R training procedure and how it is applied to continue training PaLM.
3.1 Training Data
To keep things consistent, we train this model with the same data mixture as PaLM and do not rely on
additional sources of data (labeled or unlabeled).
There are three main reasons for this choice. First, we did not want to introduce new tokens into our training process, which could confound our findings. Second, we did not want to over-index on scaling studies that only measure impact on upstream cross entropy (Hernandez et al., 2022), which claim that repeating data in small quantities could be disproportionately harmful. Since the empirical results we obtain are strong, we postulate that repeating tokens may not be harmful in small quantities after all. This is also supported by the continued training of PaLM 62B in Chowdhery et al. (2022), which showed that repeated data can yield small gains, albeit not as strong as fresh tokens. Third, we consider our data transformation (via UL2) on the training data sufficiently distinct, which prevents us from explicitly training on the same data with the exact same objective and mitigates memorization concerns.
3.2 Prefix Language Model Architecture
We train U-PaLM using the prefix language model (PrefixLM) architecture, also sometimes known as a
non-causal decoder-only model. The PrefixLM architecture keeps a non-causal mask in its prefix (or inputs)
and applies bidirectional attention to input tokens.
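To make the attention pattern concrete, the following is a minimal NumPy sketch of a PrefixLM attention mask (our illustration under stated assumptions, not PaLM's implementation): positions inside the prefix attend to the whole prefix bidirectionally, while remaining positions attend causally.

import numpy as np

def prefix_lm_mask(seq_len, prefix_len):
    """Boolean mask where True means position i may attend to position j."""
    causal = np.tril(np.ones((seq_len, seq_len), dtype=bool))    # standard causal mask
    bidirectional_prefix = np.zeros((seq_len, seq_len), dtype=bool)
    bidirectional_prefix[:, :prefix_len] = True                   # every position sees the full prefix
    return causal | bidirectional_prefix

print(prefix_lm_mask(seq_len=6, prefix_len=3).astype(int))
# Rows 0-2 (prefix) see all three prefix tokens and no targets;
# rows 3-5 (targets) see the prefix plus earlier targets, i.e., causal attention.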
In this architecture, we use a total combined sequence length of 2048 (i.e., PaLM's sequence length), which is then split into 1024 input tokens and 1024 target tokens. In the original UL2 paper and infrastructure, an artifact of its preprocessing pipeline applies padding tokens first before combining inputs and targets. For decoder-only language models, this is inefficient since we would end up with a concatenation of [prefix] [prefix's padding] [target].
In this work, we optimize the prefix padding by concatenating the prefix and target before applying any additional padding; packing, trimming, and padding are applied only after the prefix has been concatenated with the targets. Through this prefix optimization, we improve the example-level sample efficiency of the model.
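A minimal sketch of the difference (illustrative Python; the pad value, helper names, and omission of trimming are our simplifications): the original UL2 pipeline pads the inputs before concatenation, whereas the optimization described above concatenates the prefix with its targets first and pads only once at the end.

PAD = 0  # placeholder pad token id

def ul2_style_pack(prefix, target, input_len=1024, target_len=1024):
    # Original pipeline (inefficient for decoder-only models):
    # pad the prefix to its fixed length first, then append the target,
    # producing [prefix] [prefix padding] [target].
    padded_prefix = prefix + [PAD] * (input_len - len(prefix))
    return padded_prefix + target + [PAD] * (target_len - len(target))

def optimized_pack(prefix, target, total_len=2048):
    # Prefix optimization: concatenate prefix and target first, then trim/pad
    # the combined sequence once, so no padding sits between prefix and targets.
    combined = (prefix + target)[:total_len]
    return combined + [PAD] * (total_len - len(combined))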
3.3 Loss Objectives
This section describes the settings of the UL2 mixture-of-denoisers that we use in UL2R. The UL2 mixture-of-denoisers objective comprises three types of denoisers.
Regular denoising
whereby the noise is sampled as spans that are replaced with sentinel tokens. This is the standard span corruption task used in Raffel et al. (2019). Spans are typically uniformly sampled with a mean length of 3 and a corruption rate of 15%.
Extreme denoising
whereby the noise is increased to relatively 'extreme' amounts, either by corrupting a large percentage of the original text or by using very long spans. Spans are typically uniformly sampled with a mean length of 32 or a corruption rate of up to 50%.
Sequential denoising
whereby the noise is always sampled from the start of the text to a randomly sampled point in the text. This is also known as the PrefixLM objective (not to be confused with the architecture).
We kept this setup simple since many ablations were already explored in Tay et al. (2022b). We kept the original 7 denoisers as the initial version but later found that a mixture of only three tasks, i.e., 50% PrefixLM, 25% long (extreme) span corruption, and 25% regular span corruption, is quite simple and efficient for the setup of continued training. We kept the original mode prompting tokens from the original UL2 design, using [S2S] for S-denoisers (PrefixLM), [NLU] for R-denoisers, and [NLG] for X-denoisers. The 540B U-PaLM model was mainly trained with 50% S-denoiser (PrefixLM), 25% R-denoisers, and 25% X-denoisers.
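The sketch below spells out this final mixture as a configuration and shows how a mode token could be prepended when sampling an example (a simplified illustration under the proportions and mode tokens stated above; the helper names are ours and the corruption routines themselves are omitted).

import random

# Final UL2R mixture used for continued training (proportions from the text above).
DENOISER_MIX = [
    # (mode token, denoiser name, sampling weight, illustrative span settings)
    ("[S2S]", "sequential_prefix_lm",     0.50, None),
    ("[NLU]", "regular_span_corruption",  0.25, {"mean_span": 3,  "corrupt_rate": 0.15}),
    # X-denoiser: long spans (mean 32) and/or high corruption rates (up to 0.50).
    ("[NLG]", "extreme_span_corruption",  0.25, {"mean_span": 32, "corrupt_rate": 0.50}),
]

def sample_mode_token(rng=random):
    """Pick a denoiser according to the mixture weights and return its mode token."""
    modes, _, weights, _ = zip(*DENOISER_MIX)
    return rng.choices(modes, weights=weights, k=1)[0]

def format_example(text, rng=random):
    """Prepend the sampled mode token; the actual corruption step is omitted here."""
    return f"{sample_mode_token(rng)} {text}"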
3.4 Training
We train the 540B model for a total of 20k steps with a batch size of 32. We mildly ablated these settings in early experiments with 62B and 8B models but kept them within a certain ballpark (e.g., batch size 128 for 50k steps). As a result, this is more similar to finetuning than to full pretraining. The number of additional tokens is therefore negligible compared to the original pretraining run, often coming in at around or less than 0.1% additional compute. The total number of extra tokens we train on for the 540B model is approximately 1.3 billion, which constitutes about 0.16% extra computation. We use a cosine learning rate decay schedule that anneals the learning rate from 10^-4 to 10^-6. Notably, we also tried a low constant learning rate and found it to perform nearly identically. Our U-PaLM 8B and 62B models are trained using 64 TPUv4 chips. Training the U-PaLM 540B model consumes only 512 TPUv4 chips and finishes in about 5 days, which is considered lightweight.
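As a back-of-the-envelope check and an illustration of the schedule (a minimal sketch: the 780B-token figure for PaLM's original pretraining is taken from Chowdhery et al. (2022), the exact denominator behind the 0.16% figure above is our assumption, and the cosine function below is a generic formulation, not the production implementation):

import math

# Extra UL2R tokens relative to PaLM's original pretraining run.
extra_tokens = 1.3e9      # ~1.3B additional tokens (this section)
palm_tokens = 780e9       # PaLM 540B pretraining tokens (Chowdhery et al., 2022)
print(f"extra fraction: {extra_tokens / palm_tokens:.2%}")   # ~0.17%, in line with the ~0.16% quoted above

def cosine_lr(step, total_steps=20_000, lr_max=1e-4, lr_min=1e-6):
    """Generic cosine decay from lr_max to lr_min over the UL2R training steps."""
    progress = min(step, total_steps) / total_steps
    return lr_min + 0.5 * (lr_max - lr_min) * (1.0 + math.cos(math.pi * progress))

print(cosine_lr(0), cosine_lr(10_000), cosine_lr(20_000))    # 1e-4, ~5e-5, 1e-6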
4 Experiments
This section reports the experimental results of U-PaLM.
4.1 Improved Scaling Properties on Few-shot Learning
In this experiment, we show improved scaling curves from small amounts of UL2R training on top of both
PaLM 8B and PaLM 540B. We use downstream metrics and few-shot evaluation since (1) this is closer to