Scaling Instruction-Finetuned Language Models
Hyung Won Chung*  Le Hou*  Shayne Longpre*  Barret Zoph†  Yi Tay†
William Fedus†  Yunxuan Li  Xuezhi Wang  Mostafa Dehghani  Siddhartha Brahma
Albert Webson Shixiang Shane Gu Zhuyun Dai Mirac Suzgun Xinyun Chen
Aakanksha Chowdhery Alex Castro-Ros Marie Pellat Kevin Robinson
Dasha Valter Sharan Narang Gaurav Mishra Adams Yu Vincent Zhao
Yanping Huang Andrew Dai Hongkun Yu Slav Petrov Ed H. Chi
Jeff Dean Jacob Devlin Adam Roberts Denny Zhou Quoc V. Le
Jason Wei
Google
Abstract
Finetuning language models on a collection of datasets phrased as instructions has been shown to improve
model performance and generalization to unseen tasks. In this paper we explore instruction finetuning
with a particular focus on (1) scaling the number of tasks, (2) scaling the model size, and (3) finetuning on
chain-of-thought data. We find that instruction finetuning with the above aspects dramatically improves
performance on a variety of model classes (PaLM, T5, U-PaLM), prompting setups (zero-shot, few-shot, CoT),
and evaluation benchmarks (MMLU, BBH, TyDiQA, MGSM, open-ended generation, RealToxicityPrompts).
For instance, Flan-PaLM 540B instruction-finetuned on 1.8K tasks outperforms PaLM 540B by a large margin
(+9.4% on average). Flan-PaLM 540B achieves state-of-the-art performance on several benchmarks, such as
75.2% on five-shot MMLU. We also publicly release Flan-T5 checkpoints,¹ which achieve strong few-shot
performance even compared to much larger models, such as PaLM 62B. Overall, instruction finetuning is a
general method for improving the performance and usability of pretrained language models.
[Figure 1 diagram: a language model is multi-task instruction-finetuned on 1.8K tasks and then evaluated on unseen tasks. Illustrated panels:
- Instruction finetuning: "Please answer the following question. What is the boiling point of Nitrogen?" → "-320.4F"
- Chain-of-thought finetuning: "Answer the following question by reasoning step-by-step. The cafeteria had 23 apples. If they used 20 for lunch and bought 6 more, how many apples do they have?" → "The cafeteria had 23 apples originally. They used 20 to make lunch. So they had 23 - 20 = 3. They bought 6 more apples, so they have 3 + 6 = 9."
- Inference, generalization to unseen tasks: "Q: Can Geoffrey Hinton have a conversation with George Washington? Give the rationale before answering." → "Geoffrey Hinton is a British-Canadian computer scientist born in 1947. George Washington died in 1799. Thus, they could not have had a conversation together. So the answer is 'no'."]
Figure 1: We finetune various language models on 1.8K tasks phrased as instructions, and evaluate them on unseen tasks.
We finetune both with and without exemplars (i.e., zero-shot and few-shot) and with and without chain-of-thought,
enabling generalization across a range of evaluation scenarios.
* Equal contribution. Correspondence: lehou@google.com.
† Core contributor.
¹ Public checkpoints: https://github.com/google-research/t5x/blob/main/docs/models.md#flan-t5-checkpoints.
arXiv:2210.11416v5 [cs.LG] 6 Dec 2022
1 Introduction
An important goal of artificial intelligence is to develop models that can generalize to unseen tasks. In natural
language processing (NLP), pretrained language models have made significant progress towards this goal,
as they can perform tasks given natural language descriptions (Brown et al., 2020, inter alia). Further progress
has been made by finetuning language models on a collection of tasks phrased as instructions, which enables
models to respond better to instructions and reduces the need for few-shot exemplars (Ouyang et al., 2022;
Wei et al., 2021; Sanh et al., 2021, inter alia).
In this paper we advance instruction finetuning in several ways. First, we study the impact of scaling
on instruction finetuning. Our experiments show that instruction finetuning does scale well with the
number of tasks and the size of the model. Their respective scaling behaviors suggest that future research
should scale up the number of tasks and the size of the model even further. Second, we study the effect of
finetuning on the ability of the models to perform reasoning tasks. Our experiments show that whereas
prior instruction finetuning methods that do not include chain-of-thought (CoT; Wei et al., 2022b) severely
degrade performance on CoT evaluations, adding just nine CoT datasets into the finetuning mixture enables
better performance on all evaluations.
Based on these findings, we train Flan-PaLM by using a 540B-parameter model, increasing the number of
finetuning tasks to 1.8K, and including CoT data. Flan-PaLM outperforms PaLM, achieving new state-of-the-
art on several benchmarks. For instance, Flan-PaLM’s improved reasoning abilities enable it to leverage CoT
and self-consistency (Wang et al., 2022c) to achieve 75.2% on Massive Multi-task Language Understanding
(MMLU; Hendrycks et al., 2020). Flan-PaLM also has improved multilingual abilities compared to PaLM, such
as a 14.9% absolute improvement on one-shot TyDiQA (Clark et al., 2020) and 8.1% on arithmetic reasoning
in under-represented languages (Shi et al., 2022). In human rater evaluations, Flan-PaLM substantially
outperforms PaLM on a challenging set of open-ended generation questions, suggesting improved usability.
Moreover, we find that instruction finetuning also improves performance across several responsible AI
evaluation benchmarks.
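
Flan-PaLM's 75.2% MMLU score combines CoT prompting with self-consistency, which samples several reasoning paths and majority-votes over their final answers instead of trusting a single greedy chain. A minimal sketch, assuming a hypothetical sample_cot helper that draws one sampled chain of thought from the model and returns its final answer:

```python
from collections import Counter

def self_consistency(prompt, sample_cot, num_paths=40):
    """Majority vote over final answers from several sampled CoT paths.

    `sample_cot` is a hypothetical stand-in for temperature-based sampling
    from the language model; it returns the final answer of one sampled
    reasoning chain for `prompt`.
    """
    answers = [sample_cot(prompt) for _ in range(num_paths)]
    # The most common final answer across reasoning paths wins.
    return Counter(answers).most_common(1)[0][0]
```

The vote makes the prediction robust to individual faulty reasoning chains, which is the intuition behind its gains on multi-step reasoning benchmarks.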
In addition, we instruction-finetune Flan-T5 models (80M to 11B parameters). These checkpoints have strong zero-
shot, few-shot, and CoT abilities, outperforming prior public checkpoints such as T5 (Raffel et al., 2020). For
example, Flan-T5 11B outperforms T5 11B by double-digit margins and even outperforms PaLM 62B on
some challenging BIG-Bench tasks (Srivastava et al., 2022). Overall, our results underscore how instruction
finetuning can improve performance across a range of models, prompting setups, and evaluation tasks.
Date       Model                        MMLU (%)
-          Random                       25.0
-          Average human rater          34.5
May 2020   GPT-3 5-shot                 43.9
Mar. 2022  Chinchilla 5-shot            67.6
Apr. 2022  PaLM 5-shot                  69.3
Oct. 2022  Flan-PaLM 5-shot             72.2
-          Flan-PaLM 5-shot: CoT + SC   75.2
-          Average human expert         89.8
Jun. 2023  forecast (Hypermind)         73.2
Jun. 2024  forecast (Hypermind)         75.0
Jun. 2023  forecast (Metaculus)         82.7
Jun. 2024  forecast (Metaculus)         87.6
Table 1: Average 5-shot MMLU scores (%) for 57 tasks with model and human accuracy comparisons
(Hendrycks et al., 2020). Forecasts were made in July 2022 by competitive human forecasters, regarding a
single model (Steinhardt, 2021); see https://prod.hypermind.com/ngdp/en/showcase2/showcase.html?sc=JSAI
and https://www.metaculus.com/questions/11676/mmlu-sota-in-2023-2025/. CoT + SC: chain-of-thought
prompting with self-consistency (Wang et al., 2022b).
[Figure 2 diagram. Finetuning tasks:
- Muffin (69 datasets, 27 categories, 80 tasks): natural language inference, closed-book QA, code instruction generation, conversational QA, program synthesis, code repair, dialog context generation, ...
- T0-SF (55 datasets, 14 categories, 193 tasks): commonsense reasoning, question generation, closed-book QA, adversarial QA, extractive QA, title/context generation, topic classification, struct-to-text, ...
- CoT (Reasoning) (9 datasets, 1 category, 9 tasks): arithmetic reasoning, explanation generation, commonsense reasoning, sentence composition, implicit reasoning, ...
- Natural Instructions v2 (372 datasets, 108 categories, 1,554 tasks): cause-effect classification, commonsense reasoning, named entity recognition, toxic language detection, question answering, question generation, program execution, text categorization, ...

A Dataset is an original data source (e.g., SQuAD). A Task Category is a unique task setup (e.g., the SQuAD dataset is configurable for multiple task categories such as extractive question answering, query generation, and context generation). A Task is a unique <dataset, task category> pair, with any number of templates which preserve the task category (e.g., query generation on the SQuAD dataset).

Held-out tasks:
- MMLU (57 tasks): abstract algebra, sociology, college medicine, philosophy, professional law, ...
- BBH (27 tasks): Boolean expressions, navigate, tracking shuffled objects, word sorting, Dyck languages, ...
- TyDiQA: information-seeking QA in 8 languages
- MGSM: grade-school math problems in 10 languages]
Figure 2: Our finetuning data comprises 473 datasets, 146 task categories, and 1,836 total tasks. Details for
the tasks used in this paper are given in Appendix F.
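
The dataset / task category / task hierarchy that Figure 2 defines can be made concrete with a small data structure. A minimal sketch with hypothetical names:

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Task:
    """A unique <dataset, task category> pair, as defined in Figure 2."""
    dataset: str            # original data source, e.g. "squad"
    task_category: str      # unique task setup, e.g. "query_generation"
    templates: tuple = ()   # any number of templates preserving the category

# Example: query generation on the SQuAD dataset (template text is illustrative).
squad_query_gen = Task("squad", "query_generation",
                       templates=("Write a question about: {context}",))
```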
2 Flan Finetuning
We instruction-finetune on a collection of data sources (Figure 2) with a variety of instruction template
types (Figure 3). We call this finetuning procedure Flan (Finetuning language models; Wei et al., 2021) and
prepend "Flan" to the resulting finetuned models (e.g., Flan-PaLM).² We show that Flan works across several
model sizes and architectures (Table 2).
2.1 Finetuning Data
Task mixtures. Prior literature has shown that increasing the number of tasks in finetuning with instructions
improves generalization to unseen tasks (Wei et al., 2021; Sanh et al., 2021, inter alia). In this paper we scale
to 1,836 finetuning tasks by combining four mixtures from prior work: Muffin, T0-SF, NIV2, and CoT, as
summarized in Figure 2. Muffin³ (80 tasks) comprises 62 tasks from Wei et al. (2021) and 26 new tasks that
we added in this work, including dialog data (Byrne et al., 2019; Anantha et al., 2021; Dai et al., 2022) and
program synthesis data (Yasunaga and Liang, 2020; Li et al., 2022). T0-SF (193 tasks) comprises tasks from
T0 (Sanh et al., 2021) that do not overlap with the data used in Muffin (SF stands for "sans Flan"). NIV2
(1,554 tasks) comprises tasks from Wang et al. (2022c).⁴
² We use "Flan" to refer to our finetuning procedure. "FLAN" is a model in Wei et al. (2021).
³ Multi-task finetuning with instructions.
⁴ We removed 44 tasks related to MMLU (Hendrycks et al., 2020), since MMLU is used for evaluation.
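
For intuition, combining the four mixtures can be pictured as sampling each training example first by mixture and then by task within that mixture. A minimal sketch; the weights below are illustrative placeholders, not the paper's actual mixing proportions:

```python
import random

# Illustrative mixture weights only; the real proportions differ.
MIXTURE_WEIGHTS = {"Muffin": 0.4, "T0-SF": 0.3, "NIV2": 0.2, "CoT": 0.1}

def sample_training_task(tasks_by_mixture):
    """Pick a mixture by weight, then a task uniformly within that mixture.

    `tasks_by_mixture` maps a mixture name to its list of task names.
    """
    names, weights = zip(*MIXTURE_WEIGHTS.items())
    mixture = random.choices(names, weights=weights, k=1)[0]
    return random.choice(tasks_by_mixture[mixture])
```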
[Figure 3 diagram: the four finetuning data formats.
- Instruction without exemplars, without chain-of-thought: "Answer the following yes/no question. Can you write a whole Haiku in a single tweet?" → "yes"
- Instruction without exemplars, with chain-of-thought: "Answer the following yes/no question by reasoning step-by-step. Can you write a whole Haiku in a single tweet?" → "A haiku is a Japanese three-line poem. That is short enough to fit in 280 characters. The answer is yes."
- Instruction with exemplars, without chain-of-thought: "Q: Answer the following yes/no question. Could a dandelion suffer from hepatitis? A: no. Q: Answer the following yes/no question. Can you write a whole Haiku in a single tweet? A:" → "yes"
- Instruction with exemplars, with chain-of-thought: "Q: Answer the following yes/no question by reasoning step-by-step. Could a dandelion suffer from hepatitis? A: Hepatitis only affects organisms with livers. Dandelions don't have a liver. The answer is no. Q: Answer the following yes/no question by reasoning step-by-step. Can you write a whole Haiku in a single tweet? A:" → "A haiku is a Japanese three-line poem. That is short enough to fit in 280 characters. The answer is yes."]
Figure 3: Combinations of finetuning data formats in this work. We finetune with and without exemplars,
and also with and without chain-of-thought. In addition, we have some data formats without instructions
but with few-shot exemplars only, like in Min et al. (2022) (not shown in the figure). Note that only nine
chain-of-thought (CoT) datasets use the CoT formats.
Chain-of-thought finetuning mixture. The fourth finetuning data mixture (reasoning) involves CoT
annotations, which we use to explore whether finetuning on CoT annotations improves performance on unseen
reasoning tasks. We create a new mixture of nine datasets from prior work for which human raters manually
wrote CoT annotations for a training corpus. These nine datasets include tasks such as arithmetic reasoning
(Cobbe et al., 2021), multi-hop reasoning (Geva et al., 2021), and natural language inference (Camburu et al.,
2020). We manually compose ten instruction templates per task. A data card is in Appendix F.
Templates and formatting. For Muffin, T0-SF, and NIV2, we use the instructional templates for each task as
given by the creators of the mixtures. For CoT, we manually write around ten instruction templates for each of
the nine datasets. To create few-shot templates, we write a variety of exemplar delimiters (e.g., "Q:"/"A:") and
apply them randomly at the example level; a minimal sketch of this randomization follows. Examples of
formatting with and without exemplars, as well as with and without CoT, are shown in Figure 3.
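
The sketch below illustrates the delimiter randomization; the delimiter pairs beyond "Q:"/"A:" are hypothetical examples:

```python
import random

# Candidate exemplar delimiters, applied randomly at the example level.
DELIMITERS = [("Q:", "A:"), ("Question:", "Answer:"), ("Input:", "Output:")]

def format_few_shot(exemplars, query):
    """Render (input, target) exemplars plus a final query into one prompt."""
    q_prefix, a_prefix = random.choice(DELIMITERS)
    parts = [f"{q_prefix} {x}\n{a_prefix} {y}" for x, y in exemplars]
    parts.append(f"{q_prefix} {query}\n{a_prefix}")  # model completes the answer
    return "\n\n".join(parts)
```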
2.2 Finetuning procedure
In this paper, we apply instruction finetuning across a broad range of model families, including T5 (Raffel
et al., 2020), PaLM (Chowdhery et al., 2022), and U-PaLM (Tay et al., 2022b). These model families span a
range of sizes, from Flan-T5-small (80M parameters) to PaLM and U-PaLM (540B parameters). For each
model, we apply the same training procedure, except for a few hyperparameters: learning rate, batch size,
dropout, and finetuning steps. We use a constant learning rate schedule and finetune using the Adafactor
optimizer (Shazeer and Stern, 2018). We use packing (Raffel et al., 2020) to combine multiple training
examples into a single sequence, separating inputs from targets with an end-of-sequence token; masking is
applied so that tokens cannot attend across packed example boundaries (see the sketch below). The number
of finetuning steps, learning rate, batch size, and dropout for each model are given in Appendix E. For
each model, we use a single checkpoint for all evaluations; the optimal step was chosen based on periodic
evaluations (every 2k to 10k steps, depending on the model size) of the held-out tasks, and we used the same
number of checkpoint steps across all ablation runs for a given model. Notably, the amount of compute used
for finetuning is only a small fraction of the pre-training compute, as shown in Table 2. For example, we
use only 0.2% of the pre-training compute to instruction-finetune Flan-PaLM 540B (approximately 512
TPU v4 chips for 37 hours). We use the JAX-based T5X framework (Bradbury et al., 2018; Roberts et al., 2022).
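
A minimal sketch of packing with boundary masking, assuming tokenization has already happened; segment IDs record which packed example each token came from, and the mask blocks attention between segments:

```python
import numpy as np

EOS = 1  # assumed end-of-sequence token ID used as the separator

def pack_examples(token_lists, max_len):
    """Greedily concatenate token lists into one packed sequence.

    Returns packed tokens plus per-token segment IDs; padding and the
    input/target split within each example are omitted for brevity.
    """
    tokens, segment_ids = [], []
    for seg, ids in enumerate(token_lists, start=1):
        if len(tokens) + len(ids) + 1 > max_len:
            break  # a real pipeline would start the next packed sequence here
        tokens.extend(ids + [EOS])
        segment_ids.extend([seg] * (len(ids) + 1))
    return tokens, segment_ids

def packed_attention_mask(segment_ids):
    """Boolean mask: a token may attend only to tokens in its own example.

    Decoder-side attention would additionally intersect this with a causal mask.
    """
    seg = np.asarray(segment_ids)
    return seg[:, None] == seg[None, :]
```

For example, packing two 3-token examples yields segment IDs [1, 1, 1, 1, 2, 2, 2, 2] and a block-diagonal 8x8 mask.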
Params  Model           Architecture     Pre-training Objective       Pre-train FLOPs  Finetune FLOPs  % Finetune Compute
80M     Flan-T5-Small   encoder-decoder  span corruption              1.8E+20          2.9E+18         1.6%
250M    Flan-T5-Base    encoder-decoder  span corruption              6.6E+20          9.1E+18         1.4%
780M    Flan-T5-Large   encoder-decoder  span corruption              2.3E+21          2.4E+19         1.1%
3B      Flan-T5-XL      encoder-decoder  span corruption              9.0E+21          5.6E+19         0.6%
11B     Flan-T5-XXL     encoder-decoder  span corruption              3.3E+22          7.6E+19         0.2%
8B      Flan-PaLM       decoder-only     causal LM                    3.7E+22          1.6E+20         0.4%
62B     Flan-PaLM       decoder-only     causal LM                    2.9E+23          1.2E+21         0.4%
540B    Flan-PaLM       decoder-only     causal LM                    2.5E+24          5.6E+21         0.2%
62B     Flan-cont-PaLM  decoder-only     causal LM                    4.8E+23          1.8E+21         0.4%
540B    Flan-U-PaLM     decoder-only     prefix LM + span corruption  2.5E+24          5.6E+21         0.2%
Table 2: Across several models, instruction finetuning costs only a small amount of compute relative to
pre-training. T5: Raffel et al. (2020). PaLM and cont-PaLM (also known as PaLM 62B at 1.3T tokens):
Chowdhery et al. (2022). U-PaLM: Tay et al. (2022b).
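
As a worked check of the 540B row: 5.6E+21 finetuning FLOPs divided by 2.5E+24 pre-training FLOPs is roughly 0.0022, matching the 0.2% of pre-training compute cited in Section 2.2.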
2.3 Evaluation protocol
Evaluation benchmarks. We focus on performance on held-out tasks which were not included as part of the
finetuning data. We are interested in Flan-PaLM's overall capabilities on world knowledge and reasoning
tasks. Thus, we evaluate the model on a range of different benchmarks, including multilingual ones. We
do not use the evaluation set from Brown et al. (2020), since almost all of those tasks have training sets that
are included in our finetuning mixture. Instead, we use the following challenging benchmarks, for which
current language models still perform well below expert human raters. (1) MMLU (Hendrycks et al., 2020)
includes exam questions from 57 tasks such as mathematics, history, law, and medicine. (2) BBH includes
23 challenging tasks from BIG-Bench (Srivastava et al., 2022) for which PaLM performs below an average
human rater (Suzgun et al., 2022). (3) TyDiQA (Clark et al., 2020) is a question-answering benchmark
across 8 typologically diverse languages. (4) MGSM (Shi et al., 2022) is a multilingual benchmark of math
word problems from Cobbe et al. (2021) manually translated into 10 languages. These benchmarks were also
used in the PaLM paper (Chowdhery et al., 2022), which did not find any meaningful data contamination
with pre-training data, consistent with data contamination analyses in previous work (Brown et al., 2020;
Wei et al., 2021; Du et al., 2022). Responsible AI evaluations are discussed in Appendix C.
Evaluation methods and metrics. For MMLU and BBH, we evaluate the ability to predict the answer both
via direct prompting, where the model directly gives the answer (Brown et al., 2020; Srivastava et al., 2022),
and via chain-of-thought (CoT) prompting, where the model must provide a reasoning chain before giving
the final answer (Wei et al., 2022b). For TyDiQA, we only measure direct prompting exact-match score,
since highlighting the portion of a passage with the correct answer may not require sophisticated reasoning.
For MGSM, we only measure CoT prompting accuracy, since direct prompting has very low performance.
For all benchmarks we use the given few-shot exemplars, with the number of exemplars following prior
work: five-shot for MMLU, three-shot for BBH, one-shot for TyDiQA, and eight-shot for MGSM. For a given
model we also report a single "normalized average" metric, following the "normalized preferred metric"
in BIG-Bench (Srivastava et al., 2022).⁵ Our normalized average metric is the macro-average over six
normalized scores: MMLU-Direct, MMLU-CoT, BBH-Direct, BBH-CoT, TyDiQA-Direct, and MGSM-CoT.
Results for all tasks in each benchmark are reported in Appendix D. Some Responsible AI benchmarks use
additional methods for generative tasks, described in Appendix C.
⁵ A normalized metric scales an evaluation number with respect to a task-specific lower bound, such as the random-guessing baseline
for a multiple-choice question. For example, if random guessing produces 50% accuracy and the maximum accuracy is 100%, then a raw
accuracy of 55% would be normalized to 10%, and a raw accuracy of 45% would be normalized to -10%, since it is worse than random.
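
A minimal sketch of the normalization described in footnote 5, together with the macro-average used for the normalized average metric:

```python
def normalize(raw, baseline, maximum=100.0):
    """Rescale so the task's lower bound maps to 0 and `maximum` to 100."""
    return 100.0 * (raw - baseline) / (maximum - baseline)

# The footnote's example with a 50% random-guessing baseline:
assert normalize(55.0, baseline=50.0) == 10.0
assert normalize(45.0, baseline=50.0) == -10.0

def normalized_average(scores_and_baselines):
    """Macro-average of normalized scores, e.g. over the six benchmark settings."""
    normed = [normalize(raw, base) for raw, base in scores_and_baselines]
    return sum(normed) / len(normed)
```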