What Language Model to Train if You Have One Million GPU Hours?

The BigScience Architecture & Scaling Group

Teven Le Scao¹  Thomas Wang¹  Daniel Hesslow²  Lucile Saulnier¹  Stas Bekman¹
M Saiful Bari³  Stella Biderman⁴,⁵  Hady Elsahar⁶  Niklas Muennighoff¹  Jason Phang⁷  Ofir Press⁸
Colin Raffel¹  Victor Sanh¹  Sheng Shen⁹  Lintang Sutawika¹⁰  Jaesung Tae¹  Zheng Xin Yong¹¹
Julien Launay²,¹²  Iz Beltagy¹³

¹Hugging Face  ²LightOn  ³NTU, Singapore  ⁴Booz Allen Hamilton  ⁵EleutherAI  ⁶Naver Labs Europe  ⁷New York University
⁸University of Washington  ⁹UC Berkeley  ¹⁰BigScience  ¹¹Brown University  ¹²LPENS  ¹³Allen Institute for AI
Abstract

The crystallization of modeling methods around the Transformer architecture has been a boon for practitioners. Simple, well-motivated architectural variations can transfer across tasks and scale, increasing the impact of modeling research. However, with the emergence of state-of-the-art 100B+ parameter models, large language models are increasingly expensive to accurately design and train. Notably, it can be difficult to evaluate how modeling decisions may impact emergent capabilities, given that these capabilities arise mainly from sheer scale alone. In the process of building BLOOM, the BigScience Large Open-science Open-access Multilingual language model, our goal is to identify an architecture and training setup that makes the best use of our 1,000,000 A100-GPU-hours budget. Specifically, we perform an ablation study at the billion-parameter scale comparing different modeling practices and their impact on zero-shot generalization. In addition, we study the impact of various popular pretraining corpora on zero-shot generalization. We also study the performance of a multilingual model and how it compares to the English-only one. Finally, we consider the scaling behaviour of Transformers to choose the target model size, shape, and training setup. All our models and code are open-sourced at https://huggingface.co/bigscience.
1 Introduction

Recent years have seen the advent of large language models characterized by emergent capabilities (e.g., zero-shot generalization) arising from sheer scale alone (Radford et al., 2019; Brown et al., 2020). Scaling LLMs results in a predictable increase in performance: simple scaling laws connect the number of parameters, pretraining dataset size, and compute budget (Kaplan et al., 2020; Ganguli et al., 2022; Hoffmann et al., 2022), providing a clear path towards more capable models.
Figure 1: Smooth scaling of language modeling loss as compute budget (in PF-days) and model size (125M to 13B parameters) increase. We observe a power-law coefficient αC ≈ 0.046, in line with Kaplan et al. (2020). We use this fit to estimate the optimal size and number of tokens to train on for the final model given the available budget.
This paradigm shift has been fueled by the wide adoption of the Transformer (Vaswani et al., 2017), providing a scalable basis for practitioners to build upon.
In this paper, we design an architecture and training setup for a multilingual 100B+ parameter model (BLOOM, BigScience Workshop (2022)), seeking to best use a fixed 1,000,000 A100-hours budget. Because of the costs involved with training large language models, we cannot exhaustively explore the landscape of possible models. Instead, we position ourselves as practitioners exploring "off-the-shelf" solutions. We thus test promising additions to the Transformer and attempt to reproduce their findings in a controlled, large-scale setting. Although our main goal was to prepare the architecture and training setup of BLOOM, our findings are also valuable for practitioners building models in the 1-10B range, as they equally improve the performance of such smaller models. In contrast with major works on large language models, we also
make a significant effort towards reproducibility and openness: all of our pretrained models, code, and notes from our weekly meetings are made available. See Appendix A for the relevant links.
Contributions. We first study the impact of pretraining corpora, positional embeddings, activation functions, and embedding norm on zero-shot generalization. We base our study on the popular GPT-2 architecture (Radford et al., 2019), with experiments at the 1.3B parameter scale. We then consider the impact of massive multilinguality, showing language-specific scaling laws in a multilingual setting for the first time. Finally, we describe our approach to drafting an architecture for the final 176B parameter BLOOM model.
2 Methods
We first justify our choice to base our model on
the popular recipe of combining a decoder-only
model with an autoregressive language modeling
objective, and introduce our experimental setup.
We then discuss our evaluation benchmarks, and
motivate our choice of zero-shot generalization as
our key metric. Finally, we introduce the baselines
we compare to throughout the paper.
2.1 Architecture and Pretraining Objective

In this paper, we base all models on a decoder-only Transformer pretrained with an autoregressive language modeling objective. This is a popular choice for large language models (Brown et al., 2020; Rae et al., 2021; Thoppilan et al., 2022), possibly because it lends itself to zero-shot application to many downstream tasks (Radford et al., 2019). Alternatives include encoder-decoder models trained with a span-corruption objective (e.g., T5; Raffel et al., 2019), as well as non-causal decoder models with visibility over a prefix (so-called Prefix LMs; Liu et al., 2018; Dong et al., 2019).
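To make the "non-causal decoder" alternative concrete, the sketch below builds the attention mask of a prefix LM: tokens inside the conditioning prefix attend to each other bidirectionally, while later tokens remain causal. This is an illustrative sketch written by us; the function name and the fixed `prefix_len` argument are assumptions, not code from any of the cited models.

```python
import torch

def prefix_lm_mask(seq_len: int, prefix_len: int) -> torch.Tensor:
    """Attention mask for a non-causal decoder (prefix LM).

    mask[i, j] == True means position i may attend to position j.
    Positions within the prefix see each other bidirectionally;
    positions after the prefix attend causally, as in a standard decoder.
    """
    # Start from a standard causal (lower-triangular) mask ...
    mask = torch.tril(torch.ones(seq_len, seq_len, dtype=torch.bool))
    # ... then open up full visibility inside the prefix block.
    mask[:prefix_len, :prefix_len] = True
    return mask

# Example: an 8-token sequence whose first 3 tokens form the prefix.
print(prefix_lm_mask(8, 3).int())
```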
Our decision is motivated by the findings of Wang et al. (2022), which showed that decoder-only models combined with an autoregressive language modeling objective provide the best zero-shot generalization abilities immediately after pretraining. Although multitask finetuning (Sanh et al., 2021; Wei et al., 2021) instead favors an encoder-decoder with span corruption for best zero-shot generalization, Wang et al. (2022) found a compromise between these two practices: following autoregressive pretraining, decoder-only models can be efficiently adapted into non-causal decoders simply by extending pretraining with span corruption. This adaptation produces a second model, which can provide excellent zero-shot generalization after multitask finetuning. Accordingly, we follow their recommendation and train an autoregressive decoder-only model first, which we will later consider adapting and finetuning.
2.2 Experimental Setup

We follow the architecture of GPT-2 (Radford et al., 2019) and the hyperparameters of GPT-3 (Brown et al., 2020). For the learning rate, we use a maximum value of 2×10⁻⁴, with a linear warm-up over 375M tokens, followed by cosine decay to a minimum value of 1×10⁻⁵. We use a 1M-token batch size, with linear ramp-up over the first 4B tokens, and a sequence length of 2,048. We use the Adam optimizer (Kingma and Ba, 2014) with β₁ = 0.9, β₂ = 0.999, ε = 1×10⁻⁸, weight decay 0.1, and gradient clipping at 1.0. We also tie the word embedding and softmax matrices (Press and Wolf, 2017). Unless noted otherwise, we conduct our experiments with 1.3B parameter models, pretraining on 112B tokens.
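As a concrete reading of the schedule above, here is a minimal sketch of a token-based learning-rate function with linear warm-up and cosine decay. The function name and the assumption that the decay reaches its minimum exactly at the end of a 112B-token run are ours; the actual training code implements its own schedule.

```python
import math

def learning_rate(tokens_seen: float,
                  max_lr: float = 2e-4,
                  min_lr: float = 1e-5,
                  warmup_tokens: float = 375e6,
                  decay_end_tokens: float = 112e9) -> float:
    """Linear warm-up over 375M tokens, then cosine decay from max_lr to min_lr."""
    if tokens_seen < warmup_tokens:
        # Linear warm-up from 0 to max_lr.
        return max_lr * tokens_seen / warmup_tokens
    # Cosine decay over the remaining token budget, held at min_lr afterwards.
    progress = (tokens_seen - warmup_tokens) / (decay_end_tokens - warmup_tokens)
    progress = min(progress, 1.0)
    return min_lr + 0.5 * (max_lr - min_lr) * (1.0 + math.cos(math.pi * progress))

# Example: learning rate after 10B tokens of a 112B-token run.
print(f"{learning_rate(10e9):.2e}")
```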
We picked this model size and dataset size as a compromise between compute cost and the likelihood that our conclusions would transfer to the target 100B+ model. Notably, we needed to be able to reliably measure zero-shot generalization above random chance. We note that training 1.3B parameter models for 112B tokens brings them significantly above the optimality threshold of Kaplan et al. (2020) and of Hoffmann et al. (2022).
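For intuition on how far past the optimum this puts the ablation models, the quick check below uses the widely quoted rule of thumb of roughly 20 tokens per parameter derived from Hoffmann et al. (2022); the 20x ratio is an approximation we introduce here, not a figure from this paper.

```python
# Rough check against the ~20 tokens-per-parameter rule of thumb often
# derived from Hoffmann et al. (2022); the exact ratio is an assumption.
params = 1.3e9
tokens_trained = 112e9
compute_optimal_tokens = 20 * params  # ~26B tokens for a 1.3B model
ratio = tokens_trained / compute_optimal_tokens
print(f"trained for {ratio:.1f}x the compute-optimal token count")
# -> roughly 4x, i.e. well past the optimality threshold discussed above.
```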
The main architectural difference with GPT-3 is that all our layers use full attention, while GPT-3 uses alternating sparse attention layers (Child et al., 2019). The main value of sparse attention layers is to save compute at long sequence lengths. However, at the 100B+ scale, sparse attention layers provide negligible compute savings, as the vast majority of the compute is spent on the large feed-forward layers. Kaplan et al. (2020) estimated the amount of compute per token to be:

C_forward = 2 × (12 · n_layer · d² + n_layer · n_ctx · d),

where C_forward is the cost of the forward pass, n_layer is the number of layers, d is the hidden dimension, and n_ctx is the sequence length. This means that if 12d ≫ n_ctx, the second term n_layer · n_ctx · d is negligible, which is the case for our final model, where d > 10,000 and n_ctx = 2,048.
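To make the negligible-savings argument concrete, the snippet below plugs a BLOOM-scale shape into the formula above. The specific values n_layer = 70 and d = 14336 are assumed from the final BLOOM configuration for illustration; this section only states that d > 10,000.

```python
# Share of per-token forward compute attributable to the attention term
# n_layer * n_ctx * d, using the formula above. The shape below is an
# assumed BLOOM-scale configuration used purely for illustration.
n_layer, d, n_ctx = 70, 14336, 2048
dense_term = 12 * n_layer * d ** 2       # feed-forward and projection matmuls
attention_term = n_layer * n_ctx * d     # attention over the context
c_forward = 2 * (dense_term + attention_term)
print(f"attention share: {attention_term / (dense_term + attention_term):.1%}")
# -> about 1.2%, so sparsifying attention would save very little compute here.
```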
| Model | Parameters | Dataset | 112B tokens | 250B tokens | 300B tokens |
|---|---|---|---|---|---|
| OpenAI — Curie | 6.7B | | | | 49.28 |
| OpenAI — Babbage | 1.3B | | | | 45.30 |
| EleutherAI — GPT-Neo | 1.3B | The Pile | | | 42.94 |
| Ours | 13B | OSCAR v1 | | | 47.09 |
| Ours | 1.3B | The Pile | 42.79 | 43.12 | 43.46 |
| Ours | 1.3B | C4 | 42.77 | | |
| Ours | 1.3B | OSCAR v1 | 41.72 | | |

Table 1: Pretraining datasets with diverse cross-domain high-quality data improve zero-shot generalization. Average accuracy on the EAI harness (higher is better) using different pretraining corpora, with a comparison to baseline models; the last three columns give accuracy after the indicated number of pretraining tokens. Bold in the original marks the best 1.3B model for a given number of tokens seen; underline marks the best overall.
What is a FLOP exactly? We report throughput per GPU in FLOPS and total budgets in PF-days (i.e., one PFLOPS sustained for a day). It is important to highlight that FLOPS are never directly measured, but always estimated, with widely different practices across papers. We refer to as model FLOP the estimates based on the C = 6ND formula from Kaplan et al. (2020), where C is the total compute, N the model size, and D the number of tokens processed. These are the FLOP actually used to train the model, and the ones used for scaling laws. We refer to as hardware FLOP the estimates reported by our codebase, using the formula from Narayanan et al. (2021). This notably includes gradient checkpointing, which trades additional computation for reduced memory needs, and a more thorough accounting of operations.
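As a worked example of the model-FLOP accounting, the snippet below applies C = 6ND to one of the 1.3B-parameter, 112B-token ablation runs and converts the result to PF-days; the resulting ~10 PF-days figure is our own back-of-the-envelope estimate, not a number reported in the paper.

```python
# Model-FLOP estimate C = 6*N*D for a 1.3B-parameter ablation trained on
# 112B tokens, converted to PF-days (1 PF-day = 1e15 FLOPS for 24 hours).
N = 1.3e9                   # parameters
D = 112e9                   # pretraining tokens
C = 6 * N * D               # total training compute in FLOP
pf_day = 1e15 * 24 * 3600   # FLOP in one PF-day
print(f"C = {C:.2e} FLOP = {C / pf_day:.1f} PF-days")
# -> C = 8.74e+20 FLOP, roughly 10 PF-days of model FLOP per ablation run.
```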
2.3 Evaluation Benchmarks

We measure upstream performance using the language modeling loss on a held-out sample of the pretraining dataset. However, it is not always possible to compare losses across objectives and tokenizers. Moreover, as upstream performance is not always aligned with task performance (Tay et al., 2021), we must also measure downstream performance explicitly, through zero/few-shot generalization, with or without task-specific finetuning.

Specifically, we choose to measure zero-shot generalization on a diverse set of tasks. Few-shot and zero-shot results are strongly correlated: we found a Pearson correlation coefficient of 0.93 between zero-shot and few-shot performance across model sizes in Brown et al. (2020). We do not rely on finetuning, as it is not how the final model is likely to be used, given its size and the challenges associated with finetuning at the 100B+ scale.
We use the popular EleutherAI Language Model Evaluation Harness (EAI harness, Gao et al. (2021)), evaluating models across 27 diverse tasks that are similar to those used in Brown et al. (2020) (see Appendix C for a list of tasks). Overall, the random baseline on our benchmark sits at 33.3%.
2.4 Baselines

We use GPT-Neo (Black et al., 2021), a 1.3B decoder-only autoregressive language model trained on the Pile (Gao et al., 2020), and GPT-3 (Brown et al., 2020), accessed via the OpenAI API. We evaluate two of its models, Babbage and Curie.¹ Based on Gao (2021) and on how close our computed results are to those reported in the original paper, we assume Babbage corresponds to the 1.3B model and Curie to the 6.7B one. However, as details of the OpenAI API are kept secret, there is no way to verify that these models are actually the ones described in Brown et al. (2020); the number of pretraining tokens reported in Table 1 is thus to be taken cautiously.
3 Impact of Pretraining Data

We first study the impact of pretraining data on zero-shot generalization. More diverse pretraining data, ideally curated from a cross-domain collection of high-quality datasets, has been suggested to help with downstream task performance and zero-shot generalization (Rosset, 2020; Gao et al., 2020).
¹ These models are now referred to as text-babbage-001 and text-curie-001.
3.1 Corpora

We evaluate three possible corpora, all commonly used to train large language models:

- OSCAR v1 (Ortiz Suárez et al., 2019)², a multilingual, filtered version of Common Crawl;
- C4 (Raffel et al., 2019), specifically its replication by AllenAI, a processed and filtered version of Common Crawl;
- The Pile (Gao et al., 2020), a diverse pretraining corpus that contains webscrapes from Common Crawl in addition to high-quality data from cross-domain sources such as academic texts and source code.
For each pretraining corpus, we train a 1.3B parameter model for 112B tokens. For the Pile specifically, motivated by good early results at 112B tokens, we train up to 300B tokens, to compare with the GPT-3 models and validate against GPT-Neo.
3.2 Results

Evaluation results are outlined in Table 1. We find that training on the Pile produces models that are better at zero-shot generalization, with C4 a close second and OSCAR significantly behind.

Importantly, this finding transfers to larger scales: as part of engineering test runs, a 13B model was trained on OSCAR for 300B tokens. We found this 13B model to underperform the 6.7B model from the OpenAI API, which we attribute to the low quality of the English data in OSCAR.

We also note that our model trained on the Pile outperforms the 1.3B GPT-Neo trained on the same dataset. Finally, our 1.3B model still underperforms the 1.3B model from the OpenAI API by 1.6%. The difference most likely comes down to data, but we cannot investigate this further, as the GPT-3 training dataset is neither publicly available nor reproducible.
Finding 1. Diverse cross-domain pretraining data combining web crawls with curated high-quality sources improves zero-shot generalization over pretraining datasets constructed from Common Crawl only.
² The recent release of OSCAR v2 is a better dataset, but it wasn't available when we started this project.
4 Architecture Ablations
We now consider ablation studies to better identify
the best positional embedding, activation function,
and embedding normalization placement.
4.1 Positional Embeddings

Background. Originally, both static sinusoidal position embeddings and learned position embeddings were proposed to capture positional information; the latter are popular in large language models (Brown et al., 2020). Su et al. (2021) proposed rotary embeddings, where the query and key representations inside the self-attention mechanism are modified such that the attention captures relative distances between them. Recently, Press et al. (2022) introduced ALiBi, which does not use embeddings at all, instead directly attenuating the attention scores based on how far apart the keys and queries are.
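The sketch below illustrates the linear-bias idea behind ALiBi: each head adds a distance-proportional penalty to its pre-softmax attention scores, with head-specific slopes forming a geometric sequence. This is a simplified illustration written by us, assuming a power-of-two number of heads; it is not the implementation used to train our models.

```python
import torch

def alibi_bias(n_heads: int, seq_len: int) -> torch.Tensor:
    """Per-head linear distance penalties to add to pre-softmax attention scores.

    Simplified sketch of Press et al. (2022): head h uses slope 2^(-8(h+1)/n_heads)
    (exact when n_heads is a power of two) and penalizes attending to keys that
    lie further in the past; no position embeddings are added to the inputs.
    """
    slopes = torch.tensor([2.0 ** (-8.0 * (h + 1) / n_heads) for h in range(n_heads)])
    positions = torch.arange(seq_len)
    # distance[i, j] = j - i, clamped so future positions (masked anyway) add no bias.
    distance = (positions[None, :] - positions[:, None]).clamp(max=0)
    # Shape (n_heads, seq_len, seq_len); add this to the attention logits before softmax.
    return slopes[:, None, None] * distance

# Example: biases for 8 heads over a 16-token sequence.
print(alibi_bias(8, 16).shape)  # torch.Size([8, 16, 16])
```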
Results. We compare learned, rotary, and ALiBi position embeddings, and include a baseline without position embeddings. Our results are presented in Table 2. Although learned positional embeddings outperform rotary embeddings, ALiBi yields significantly better results than all alternatives. We also confirm the findings of Biderman (2021): a baseline with no positional information exhibits competitive performance. While bidirectional models require positional embeddings to determine the location of tokens, we find autoregressive models can simply leverage the causal attention mask. We also confirm the ability of ALiBi to extrapolate to sequences longer than those seen during training in Figure 2. Note that the results in Table 2 do not use any extrapolation: ALiBi embeddings are a better choice even without taking into account their ability to extrapolate.
Finding 2. ALiBi positional embeddings significantly outperform other embeddings for zero-shot generalization.

| Positional Embedding | Average EAI Results |
|---|---|
| None | 41.23 |
| Learned | 41.71 |
| Rotary | 41.46 |
| ALiBi | 43.70 |

Table 2: ALiBi significantly outperforms other embeddings for zero-shot generalization. All models are trained on the OSCAR dataset for 112 billion tokens.