2 Related Work
Large language models
Scaling and improving large language models is one of the most impactful research
areas in modern artificial intelligence (Chowdhery et al., 2022). To this end, large language models not only
continue to improve as we scale data or computational budget (Hoffmann et al., 2022; Kaplan
et al., 2020) but also acquire new abilities (Wei et al., 2022a). The impact of large language models has been
pervasive, unlocking breakthroughs across many fields, e.g., reasoning (Wei et al., 2022b;
Wang et al., 2022b; Zhou et al., 2022; Drozdov et al., 2022), math (Lewkowycz et al., 2022), dialog (Thoppilan
et al., 2022), multimodal applications (Yu et al., 2022), and retrieval (Tay et al., 2022c), inter alia.
While many paradigms and self-supervision methods have been proposed to train these models (Devlin
et al., 2018; Clark et al., 2020b; Yang et al., 2019; Raffel et al., 2019), to date most large language models
(i.e., more than 100B parameters) are trained as decoder-only causal language models. For example, flagship
large language models such as GPT-3 (Brown et al., 2020), Gopher (Rae et al., 2021), and PaLM (Chowdhery
et al., 2022) are all trained as causal language models. Meanwhile, bidirectional models (e.g., BERT (Devlin
et al., 2018), T5 (Raffel et al., 2019), ST-MoE (Zoph et al., 2022)) have also been very popular as the go-to
model of choice, especially in smaller computational regimes (e.g., fewer than 30B parameters and often
in the range of hundreds of millions of parameters).
Scaling laws of large language models
Kaplan et al. (2020) investigated scaling laws of Transformer
language models and first showed that scaling laws are predictive of future performance. The authors found
that model size (and not shape) correlates strongly with model quality, i.e., upstream cross-entropy. Tay
et al. (2021) studied the scaling properties of encoder-decoder models and their impact on upstream and
downstream finetuning tasks. Generally, Tay et al. (2021) found that upstream perplexity and downstream
quality do not always correlate. As a follow-up, Tay et al. (2022a) studied the scaling laws of different
model architectures and found that inductive bias does significantly impact the scaling behavior of the model.
Finally, Hoffmann et al. (2022) proposed compute-optimal models that popularized the ‘Chinchilla’ scaling
laws, an approach that aims to predict the optimal amount of training data given the number of model
parameters. In this work, we mainly consider scaling over downstream performance, largely because this
is more reflective of a language model’s usability. Since downstream performance is more important than
upstream cross-entropy, we advocate for future scaling studies to always incorporate downstream evaluation
(and metrics) as opposed to only using cross-entropy loss.
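For concreteness, one of the analyses in Hoffmann et al. (2022) fits the final pre-training loss as a parametric function of the number of model parameters $N$ and training tokens $D$ (the constants $E$, $A$, $B$, $\alpha$, and $\beta$ are fitted empirically and are not reproduced here):
$$ \hat{L}(N, D) = E + \frac{A}{N^{\alpha}} + \frac{B}{D^{\beta}}. $$
Minimizing this loss under a fixed compute budget (approximately $6ND$ FLOPs for dense Transformers) yields the prescription that model size and training data should be scaled in roughly equal proportion.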
Emergent Abilities
New behaviors that arise from scaling language models have been increasingly
referred to as emergent abilities (Steinhardt, 2022; Ganguli et al., 2022; Wei et al., 2022a). For instance, Wei et al.
(2022a) define emergent abilities as “abilities that are not present in smaller models but are present in larger
models.” For a few-shot prompted task, this looks like a flat scaling curve (random performance) until
a certain critical threshold, after which performance increases to substantially above random. This type of
phenomenon has been observed across dozens of tasks in the BIG-Bench benchmark (Srivastava et al., 2022).
Although such emergent abilities are typically observed as a function of scale, increasing model scale to
induce emergent abilities is computationally expensive. In this paper we show how UL2R unlocks emergence
without increasing the number of model parameters.
Continued Training of Language Models
The paradigm of continuing to train (or finetune) a language
model on more data or tasks is commonly known as adaptation. A range of prior work has shown that
finetuning language models on a collection of NLP tasks can improve performance on a broad
range of downstream tasks (Aghajanyan et al., 2021; Aribandi et al., 2022; Wei et al., 2021; Sanh et al., 2022;
Ouyang et al., 2022, inter alia). The majority of this prior work, however, requires additional data, such as
aggregating dozens or hundreds of NLP datasets (Raffel et al., 2019; Aghajanyan et al., 2021; Aribandi et al.,
2022), writing additional templates of instructions (Wei et al., 2021; Sanh et al., 2022), or finetuning on
human-labeled annotations (Ouyang et al., 2022). UL2R does not require new data since it simply re-uses the
pre-training data, which makes it orthogonal to continued training methods that leverage large collections of
additional supervised data.