make a significant effort towards reproducibility and openness: all of our pretrained models, code, and notes from our weekly meetings are made available. See Appendix A for the relevant links.
Contributions.
We first study the impact of pretraining corpora, positional embeddings, activation functions, and embedding norm on zero-shot generalization. We base our study on the popular GPT-2 architecture (Radford et al., 2019), with experiments at the 1.3B-parameter scale. We then consider the impact of massive multilinguality, showing language-specific scaling laws in a multilingual setting for the first time. Finally, we describe our approach to drafting an architecture for the final 176B-parameter BLOOM model.
2 Methods
We first justify our choice to base our model on
the popular recipe of combining a decoder-only
model with an autoregressive language modeling
objective, and introduce our experimental setup.
We then discuss our evaluation benchmarks, and
motivate our choice of zero-shot generalization as
our key metric. Finally, we introduce the baselines
we compare to throughout the paper.
2.1 Architecture and Pretraining Objective
In this paper, we base all models on a decoder-only Transformer pretrained with an autoregressive language modeling objective. This is a popular choice for large language models (Brown et al., 2020; Rae et al., 2021; Thoppilan et al., 2022), possibly because it lends itself to zero-shot application to many downstream tasks (Radford et al., 2019). Alternatives include encoder-decoder models trained with a span-corruption objective (e.g., T5; Raffel et al., 2019), as well as non-causal decoder models with visibility over a prefix (so-called prefix LMs; Liu et al., 2018; Dong et al., 2019).
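To make this distinction concrete, the short NumPy sketch below (our own illustration, not code from this work) contrasts the attention mask of a causal decoder-only model with that of a prefix LM, in which the prefix tokens attend to each other bidirectionally while the continuation remains causal.

```python
import numpy as np

def causal_mask(seq_len: int) -> np.ndarray:
    """Decoder-only (autoregressive) mask: token i may attend to tokens j <= i."""
    return np.tril(np.ones((seq_len, seq_len), dtype=bool))

def prefix_lm_mask(seq_len: int, prefix_len: int) -> np.ndarray:
    """Non-causal decoder (prefix LM) mask: the first `prefix_len` tokens see
    each other bidirectionally; the remaining tokens stay causal."""
    mask = causal_mask(seq_len)
    mask[:prefix_len, :prefix_len] = True  # full visibility over the prefix
    return mask

print(causal_mask(4).astype(int))
print(prefix_lm_mask(4, prefix_len=2).astype(int))
```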
Our decision is motivated by the findings
of Wang et al. (2022), which showed that decoder-
only models combined with an autoregressive lan-
guage modeling objective provide the best zero-
shot generalization abilities immediately after pre-
training. Although multitask finetuning (Sanh et al., 2021; Wei et al., 2021) instead favors an encoder-decoder trained with span corruption for best zero-shot generalization, Wang et al. (2022) found a compromise between these two practices. Following autoregressive pretraining, decoder-only models can be efficiently adapted into non-causal decoders simply by extending pretraining with span corruption. This adaptation produces a second model, which can provide excellent zero-shot generalization after multitask finetuning. Accordingly, we follow their recommendation and first train an autoregressive decoder-only model, which we will later consider adapting and finetuning.
2.2 Experimental Setup
We follow the architecture of GPT-2 (Radford et al., 2019) and the hyperparameters of GPT-3 (Brown et al., 2020). For the learning rate, we use a maximum value of $2 \times 10^{-4}$, with a linear warm-up over 375M tokens, followed by cosine decay to a minimum value of $1 \times 10^{-5}$. We use a 1M-token batch size, with linear ramp-up over the first 4B tokens, and a sequence length of 2,048. We use the Adam optimizer (Kingma and Ba, 2014), with $\beta_1 = 0.9$, $\beta_2 = 0.999$, $\epsilon = 1 \times 10^{-8}$, weight decay of 0.1, and gradient clipping at 1.0. We also tie the word embedding and softmax matrices (Press and Wolf, 2017). Unless noted otherwise, we conduct our experiments with 1.3B-parameter models pretrained on 112B tokens.
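As an illustration of this schedule, the sketch below re-implements the warm-up and decay described above. It is our own minimal version, and the assumption that the cosine decay runs over the full 112B-token budget is ours rather than a detail stated in this section.

```python
import math

def learning_rate(tokens_seen: float,
                  lr_max: float = 2e-4,
                  lr_min: float = 1e-5,
                  warmup_tokens: float = 375e6,
                  total_tokens: float = 112e9) -> float:
    """Linear warm-up over 375M tokens, then cosine decay to the minimum value."""
    if tokens_seen < warmup_tokens:
        return lr_max * tokens_seen / warmup_tokens
    progress = (tokens_seen - warmup_tokens) / (total_tokens - warmup_tokens)
    progress = min(progress, 1.0)
    return lr_min + 0.5 * (lr_max - lr_min) * (1.0 + math.cos(math.pi * progress))

print(learning_rate(100e6))   # mid warm-up
print(learning_rate(50e9))    # partway through the cosine decay
print(learning_rate(112e9))   # end of training: reaches lr_min
```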
We picked this model size and dataset size as a compromise between compute cost and the likelihood that our conclusions would transfer to the target 100B+ model. Notably, we needed to be able to reliably measure zero-shot generalization above random chance. We note that training 1.3B-parameter models for 112B tokens brings them significantly above the optimality thresholds of Kaplan et al. (2020) and of Hoffmann et al. (2022).
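As a rough sanity check on this claim (our own arithmetic, using the approximately 20-tokens-per-parameter compute-optimal ratio commonly associated with Hoffmann et al. (2022), which is not a figure quoted in this section):

```python
params = 1.3e9
tokens = 112e9
approx_optimal_tokens = 20 * params        # ~26B tokens under the rule of thumb

print(tokens / params)                     # ~86 tokens per parameter
print(tokens / approx_optimal_tokens)      # ~4.3x past the approximate optimum
```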
The main architectural difference with GPT-3 is
that all our layers use full attention, while GPT-3
uses alternating sparse attention layers (Child et al.,
2019). The main value of sparse attention layers is
to save compute with long sequence lengths. How-
ever, at the 100B+ scale, sparse attention layers
provide negligible compute savings, as the vast
majority of the compute is spent on the large feed-
forward layers. Kaplan et al. (2020) estimated the
amount of compute per token to be:
$$C_\text{forward} = 2 \times \left(12\, n_\text{layer} d^2 + n_\text{layer} n_\text{ctx} d\right),$$
where $C_\text{forward}$ is the cost of the forward pass, $n_\text{layer}$ is the number of layers, $d$ is the hidden dimension, and $n_\text{ctx}$ is the sequence length. This means that if $12d \gg n_\text{ctx}$, the second term $n_\text{layer} n_\text{ctx} d$ is negligible, which is the case for our final model, where $d > 10{,}000$ and $n_\text{ctx} = 2048$.
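A quick back-of-the-envelope check (our own script; $n_\text{layer} = 70$ is a placeholder rather than a value reported here) evaluates both terms at the lower bound $d = 10{,}000$ quoted above. The attention term's share, $n_\text{ctx} / (12d + n_\text{ctx})$, does not depend on the number of layers and comes out below 2%.

```python
def forward_flops_per_token(n_layer: int, d: int, n_ctx: int) -> tuple[int, int]:
    """Split the Kaplan et al. (2020) per-token forward cost into its two terms:
    2 * 12 * n_layer * d^2 (dense projections) and 2 * n_layer * n_ctx * d (attention over the context)."""
    dense = 2 * 12 * n_layer * d ** 2
    attn = 2 * n_layer * n_ctx * d
    return dense, attn

# Illustrative values: d = 10,000 is the lower bound quoted above, n_ctx = 2048;
# n_layer = 70 is a placeholder and cancels out of the ratio anyway.
dense, attn = forward_flops_per_token(n_layer=70, d=10_000, n_ctx=2048)
print(f"attention term share of forward compute: {attn / (dense + attn):.2%}")  # ~1.68%
```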