
| Params | Model | Architecture | Pre-training Objective | Pre-train FLOPs | Finetune FLOPs | % Finetune Compute |
|--------|-------|--------------|------------------------|-----------------|----------------|--------------------|
| 80M | Flan-T5-Small | encoder-decoder | span corruption | 1.8E+20 | 2.9E+18 | 1.6% |
| 250M | Flan-T5-Base | encoder-decoder | span corruption | 6.6E+20 | 9.1E+18 | 1.4% |
| 780M | Flan-T5-Large | encoder-decoder | span corruption | 2.3E+21 | 2.4E+19 | 1.1% |
| 3B | Flan-T5-XL | encoder-decoder | span corruption | 9.0E+21 | 5.6E+19 | 0.6% |
| 11B | Flan-T5-XXL | encoder-decoder | span corruption | 3.3E+22 | 7.6E+19 | 0.2% |
| 8B | Flan-PaLM | decoder-only | causal LM | 3.7E+22 | 1.6E+20 | 0.4% |
| 62B | Flan-PaLM | decoder-only | causal LM | 2.9E+23 | 1.2E+21 | 0.4% |
| 540B | Flan-PaLM | decoder-only | causal LM | 2.5E+24 | 5.6E+21 | 0.2% |
| 62B | Flan-cont-PaLM | decoder-only | causal LM | 4.8E+23 | 1.8E+21 | 0.4% |
| 540B | Flan-U-PaLM | decoder-only | prefix LM + span corruption | 2.5E+23 | 5.6E+21 | 0.2% |
Table 2: Across several models, instruction finetuning only costs a small amount of compute relative to
pre-training. T5: Raffel et al. (2020). PaLM and cont-PaLM (also known as PaLM 62B at 1.3T tokens):
Chowdhery et al. (2022). U-PaLM: Tay et al. (2022b).
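As a sanity check, the "% Finetune Compute" column can be reproduced directly from the FLOP counts in Table 2; the short sketch below does exactly that for a few rows (values copied from the table, not from any released code).

```python
# Reproduce the "% Finetune Compute" column of Table 2 from the raw FLOP counts.
# Values are copied from the table; this is an illustrative check, not the authors' code.
models = [
    # (name, pre-train FLOPs, finetune FLOPs)
    ("Flan-T5-Small", 1.8e20, 2.9e18),
    ("Flan-T5-XXL", 3.3e22, 7.6e19),
    ("Flan-PaLM 540B", 2.5e24, 5.6e21),
]

for name, pretrain_flops, finetune_flops in models:
    pct = 100 * finetune_flops / pretrain_flops
    print(f"{name}: finetuning uses {pct:.1f}% of pre-training compute")
# Flan-T5-Small: 1.6%, Flan-T5-XXL: 0.2%, Flan-PaLM 540B: 0.2%
```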
2.3 Evaluation protocol
Evaluation benchmarks. We focus on performance on held-out tasks which were not included as part of the
finetuning data. We are interested in Flan-PaLM’s overall capabilities on world knowledge and reasoning
tasks. Thus, we evaluate the model on a range of different benchmarks, including multilingual ones. We
do not use the evaluation set from Brown et al. (2020) since almost all of those tasks have training sets that
are included in our finetuning mixture. Instead, we use the following challenging benchmarks, for which
current language models still perform well below expert human raters.
(1) MMLU (Hendrycks et al., 2020) includes exam questions from 57 tasks such as mathematics, history, law, and medicine. (2) BBH includes 23 challenging tasks from BIG-Bench (Srivastava et al., 2022) for which PaLM performs below an average human rater (Suzgun et al., 2022). (3) TyDiQA (Clark et al., 2020) is a question-answering benchmark across 8 typologically diverse languages. (4) MGSM (Shi et al., 2022) is a multilingual benchmark of math word problems from Cobbe et al. (2021) manually translated into 10 languages. These benchmarks were also used in the PaLM paper (Chowdhery et al., 2022), which did not find any meaningful data contamination with pre-training data, consistent with data contamination analyses in previous work (Brown et al., 2020; Wei et al., 2021; Du et al., 2022). Responsible AI evaluations are discussed in Appendix C.
Evaluation methods and metrics.
For MMLU and BBH, we evaluate the ability to predict the answer both via direct prompting, where the model directly gives the answer (Brown et al., 2020; Srivastava et al., 2022), and via chain-of-thought (CoT) prompting, where the model must provide a reasoning chain before giving the final answer (Wei et al., 2022b). For TyDiQA, we only measure direct prompting exact-match score, since highlighting the portion of a passage with the correct answer may not require sophisticated reasoning. For MGSM, we only measure CoT prompting accuracy, since direct prompting has very low performance. For all benchmarks we use the given few-shot exemplars, with the number of exemplars following prior work: five-shot for MMLU, three-shot for BBH, one-shot for TyDiQA, and eight-shot for MGSM.
For a given model we also report a single "normalized average" metric, following the "normalized preferred metric" in BIG-Bench (Srivastava et al., 2022).⁵ Our normalized average metric is the macro-average over six normalized scores: MMLU-Direct, MMLU-CoT, BBH-Direct, BBH-CoT, TyDiQA-Direct, and MGSM-CoT.
Results for all tasks in each benchmark are reported in Appendix D. Some Responsible AI benchmarks for generative tasks use additional methods, described in Appendix C.
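To make the two prompting modes concrete, a minimal sketch of how few-shot direct and chain-of-thought prompts could be assembled is shown below; the exemplar format and function names are illustrative assumptions, not the exact templates used for these benchmarks.

```python
# Minimal sketch of assembling few-shot prompts for the two evaluation modes.
# The exemplar format is a simplification; each benchmark ships its own templates.

def direct_prompt(exemplars, question):
    """Few-shot direct prompting: each exemplar maps a question straight to its answer."""
    shots = "\n\n".join(f"Q: {q}\nA: {a}" for q, a, _ in exemplars)
    return f"{shots}\n\nQ: {question}\nA:"

def cot_prompt(exemplars, question):
    """Few-shot chain-of-thought prompting: each exemplar shows a reasoning chain
    before the final answer, prompting the model to do the same."""
    shots = "\n\n".join(
        f"Q: {q}\nA: {rationale} The answer is {a}." for q, a, rationale in exemplars
    )
    return f"{shots}\n\nQ: {question}\nA:"

# Example with a single (question, answer, rationale) exemplar:
exemplars = [("What is 3 + 4?", "7", "3 plus 4 equals 7.")]
print(direct_prompt(exemplars, "What is 6 + 5?"))
print(cot_prompt(exemplars, "What is 6 + 5?"))
```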
⁵ A normalized metric scales an evaluation number with respect to a task-specific lower bound, such as the random-guessing baseline for a multiple-choice question. For example, if random guessing produces 50% accuracy and the maximum accuracy is 100%, then a raw accuracy of 55% would be normalized to 10%, and a raw accuracy of 45% would be normalized to -10%, since it is worse than random.
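Concretely, this normalization and the macro-average described above can be sketched as follows; the per-benchmark raw scores and lower bounds in the sketch are placeholder values for illustration, not results from the paper.

```python
# Sketch of the normalized-average computation: each benchmark score is rescaled
# against a task-specific lower bound (e.g., random-guessing accuracy), then the
# six normalized scores are macro-averaged. All numbers below are placeholders.

def normalize(raw, lower_bound, maximum=100.0):
    """Rescale so that the lower bound maps to 0 and the maximum to 100."""
    return 100.0 * (raw - lower_bound) / (maximum - lower_bound)

# Footnote example: 50% random-guessing baseline, 100% maximum accuracy.
assert normalize(55.0, 50.0) == 10.0
assert normalize(45.0, 50.0) == -10.0

# Hypothetical per-benchmark raw scores and lower bounds (illustrative only).
scores = {
    "MMLU-Direct": (70.0, 25.0),   # 4-way multiple choice -> 25% baseline (assumed)
    "MMLU-CoT": (65.0, 25.0),
    "BBH-Direct": (55.0, 25.0),
    "BBH-CoT": (60.0, 25.0),
    "TyDiQA-Direct": (68.0, 0.0),  # exact match -> 0% baseline (assumed)
    "MGSM-CoT": (57.0, 0.0),
}

normalized_average = sum(normalize(raw, lb) for raw, lb in scores.values()) / len(scores)
print(f"Normalized average: {normalized_average:.1f}")
```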