make a significant effort towards reproducibility and openness: all of our pretrained models, code, and notes from our weekly meetings are made available. See Appendix A for the relevant links.
Contributions.
We first study the impact of pretraining corpora, positional embeddings, activation functions, and embedding norm on zero-shot generalization. We base our study on the popular GPT-2 architecture (Radford et al., 2019), with experiments at the 1.3B-parameter scale. We then consider the impact of massive multilinguality, showing language-specific scaling laws in a multilingual setting for the first time. Finally, we describe our approach to drafting an architecture for the final 176B-parameter BLOOM model.
2 Methods
We first justify our choice to base our model on
the popular recipe of combining a decoder-only
model with an autoregressive language modeling
objective, and introduce our experimental setup.
We then discuss our evaluation benchmarks, and
motivate our choice of zero-shot generalization as
our key metric. Finally, we introduce the baselines
we compare to throughout the paper.
2.1 Architecture and Pretraining Objective
In this paper, we base all models on a decoder-only Transformer pretrained with an autoregressive language modeling objective. This is a popular choice for large language models (Brown et al., 2020; Rae et al., 2021; Thoppilan et al., 2022), possibly because it lends itself to zero-shot application to many downstream tasks (Radford et al., 2019). Alternatives include encoder-decoder models trained with a span-corruption objective (e.g., T5; Raffel et al., 2019), as well as non-causal decoder models with visibility over a prefix (so-called prefix LMs; Liu et al., 2018; Dong et al., 2019).
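To make this distinction concrete, the short NumPy sketch below (our own illustration, not code from this work) contrasts the attention mask of a causal decoder-only model with that of a prefix LM, in which the prefix tokens attend to each other bidirectionally while the continuation remains causal.

```python
import numpy as np

def causal_mask(seq_len: int) -> np.ndarray:
    """Decoder-only (autoregressive) mask: token i may attend to tokens j <= i."""
    return np.tril(np.ones((seq_len, seq_len), dtype=bool))

def prefix_lm_mask(seq_len: int, prefix_len: int) -> np.ndarray:
    """Non-causal decoder (prefix LM) mask: the first `prefix_len` tokens see
    each other bidirectionally; the remaining tokens stay causal."""
    mask = causal_mask(seq_len)
    mask[:prefix_len, :prefix_len] = True  # full visibility over the prefix
    return mask

print(causal_mask(4).astype(int))
print(prefix_lm_mask(4, prefix_len=2).astype(int))
```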
Our decision is motivated by the findings
of Wang et al. (2022), which showed that decoder-
only models combined with an autoregressive lan-
guage modeling objective provide the best zero-
shot generalization abilities immediately after pre-
training. Although multitask finetuning (Sanh et al., 2021; Wei et al., 2021) instead favors an encoder-decoder trained with span corruption for best zero-shot generalization, Wang et al. (2022) found a compromise between these two practices. Following autoregressive pretraining, decoder-only models can be efficiently adapted into non-causal decoders simply by extending pretraining with span corruption. This adaptation produces a second model, which can provide excellent zero-shot generalization after multitask finetuning. Accordingly, we follow their recommendation and first train an autoregressive decoder-only model, which we will later consider adapting and finetuning.
2.2 Experimental Setup
We follow the architecture of GPT-2 (Radford et al., 2019) and the hyperparameters of GPT-3 (Brown et al., 2020). For the learning rate, we use a maximum value of $2 \times 10^{-4}$, with a linear warm-up over 375M tokens, followed by cosine decay to a minimum value of $1 \times 10^{-5}$. We use a 1M-token batch size, with linear ramp-up over the first 4B tokens, and a sequence length of 2,048. We use the Adam optimizer (Kingma and Ba, 2014), with $\beta_1 = 0.9$, $\beta_2 = 0.999$, $\epsilon = 1 \times 10^{-8}$, weight decay of 0.1, and gradient clipping at 1.0. We also tie the word embedding and softmax matrices (Press and Wolf, 2017). Unless noted otherwise, we conduct our experiments with 1.3B-parameter models pretrained on 112B tokens.
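As an illustration of this schedule, the sketch below re-implements the warm-up and decay described above. It is our own minimal version, and the assumption that the cosine decay runs over the full 112B-token budget is ours rather than a detail stated in this section.

```python
import math

def learning_rate(tokens_seen: float,
                  lr_max: float = 2e-4,
                  lr_min: float = 1e-5,
                  warmup_tokens: float = 375e6,
                  total_tokens: float = 112e9) -> float:
    """Linear warm-up over 375M tokens, then cosine decay to the minimum value."""
    if tokens_seen < warmup_tokens:
        return lr_max * tokens_seen / warmup_tokens
    progress = (tokens_seen - warmup_tokens) / (total_tokens - warmup_tokens)
    progress = min(progress, 1.0)
    return lr_min + 0.5 * (lr_max - lr_min) * (1.0 + math.cos(math.pi * progress))

print(learning_rate(100e6))   # mid warm-up
print(learning_rate(50e9))    # partway through the cosine decay
print(learning_rate(112e9))   # end of training: reaches lr_min
```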
We picked this model size and dataset size as a compromise between compute cost and the likelihood that our conclusions would transfer to the target 100B+ model. Notably, we needed to be able to reliably measure zero-shot generalization above random chance. We note that training 1.3B-parameter models for 112B tokens brings them significantly above the optimality thresholds of Kaplan et al. (2020) and of Hoffmann et al. (2022).
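As a rough sanity check on this claim (our own arithmetic, using the approximately 20-tokens-per-parameter compute-optimal ratio commonly associated with Hoffmann et al. (2022), which is not a figure quoted in this section):

```python
params = 1.3e9
tokens = 112e9
approx_optimal_tokens = 20 * params        # ~26B tokens under the rule of thumb

print(tokens / params)                     # ~86 tokens per parameter
print(tokens / approx_optimal_tokens)      # ~4.3x past the approximate optimum
```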
The main architectural difference with GPT-3 is
that all our layers use full attention, while GPT-3
uses alternating sparse attention layers (Child et al.,
2019). The main value of sparse attention layers is
to save compute with long sequence lengths. How-
ever, at the 100B+ scale, sparse attention layers
provide negligible compute savings, as the vast
majority of the compute is spent on the large feed-
forward layers. Kaplan et al. (2020) estimated the
amount of compute per token to be:
$$C_\text{forward} = 2 \times \left(12\, n_\text{layer} d^2 + n_\text{layer} n_\text{ctx} d\right),$$
where $C_\text{forward}$ is the cost of the forward pass, $n_\text{layer}$ is the number of layers, $d$ is the hidden dimension, and $n_\text{ctx}$ is the sequence length. This means that if $12d \gg n_\text{ctx}$, the second term $n_\text{layer} n_\text{ctx} d$ is negligible, which is the case for our final model, where $d > 10{,}000$ and $n_\text{ctx} = 2048$.
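A quick back-of-the-envelope check (our own script; $n_\text{layer} = 70$ is a placeholder rather than a value reported here) evaluates both terms at the lower bound $d = 10{,}000$ quoted above. The attention term's share, $n_\text{ctx} / (12d + n_\text{ctx})$, does not depend on the number of layers and comes out below 2%.

```python
def forward_flops_per_token(n_layer: int, d: int, n_ctx: int) -> tuple[int, int]:
    """Split the Kaplan et al. (2020) per-token forward cost into its two terms:
    2 * 12 * n_layer * d^2 (dense projections) and 2 * n_layer * n_ctx * d (attention over the context)."""
    dense = 2 * 12 * n_layer * d ** 2
    attn = 2 * n_layer * n_ctx * d
    return dense, attn

# Illustrative values: d = 10,000 is the lower bound quoted above, n_ctx = 2048;
# n_layer = 70 is a placeholder and cancels out of the ratio anyway.
dense, attn = forward_flops_per_token(n_layer=70, d=10_000, n_ctx=2048)
print(f"attention term share of forward compute: {attn / (dense + attn):.2%}")  # ~1.68%
```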