LARGE LANGUAGE MODELS CAN SELF-IMPROVE
Jiaxin Huang^1, Shixiang Shane Gu^2, Le Hou^2, Yuexin Wu^2, Xuezhi Wang^2, Hongkun Yu^2, Jiawei Han^1
^1 University of Illinois at Urbana-Champaign    ^2 Google
^1 {jiaxinh3, hanj}@illinois.edu    ^2 {shanegu, lehou, crickwu, xuezhiw, hongkuny}@google.com
ABSTRACT
Large Language Models (LLMs) have achieved excellent performance on various tasks. However, fine-tuning an LLM requires extensive supervision. Humans, on the other hand, can improve their reasoning abilities by self-thinking without external inputs. In this work, we demonstrate that an LLM is also capable of self-improving with only unlabeled datasets. We use a pre-trained LLM to generate “high-confidence” rationale-augmented answers for unlabeled questions using Chain-of-Thought prompting and self-consistency, and fine-tune the LLM using those self-generated solutions as target outputs. We show that our approach improves the general reasoning ability of a 540B-parameter LLM (74.4%→82.1% on GSM8K, 78.2%→83.0% on DROP, 90.0%→94.4% on OpenBookQA, and 63.4%→67.9% on ANLI-A3) and achieves state-of-the-art-level performance, without any ground-truth label. We conduct ablation studies and show that fine-tuning on reasoning is critical for self-improvement.
1 INTRODUCTION
Scaling has enabled Large Language Models (LLMs) to achieve state-of-the-art performance on
a range of Natural Language Processing (NLP) tasks (Wang et al., 2018; 2019; Rajpurkar et al.,
2016). More importantly, new capabilities have emerged from LLMs as they are scaled to hundreds
of billions of parameters (Wei et al., 2022a): in-context few-shot learning (Brown et al., 2020) makes
it possible for an LLM to perform well on a task it was never trained on, given only a handful of examples;
Chain-of-Thought (CoT) prompting (Wei et al., 2022b; Kojima et al., 2022) demonstrates strong
reasoning ability of LLMs across diverse tasks with or without few-shot examples; self-consistency
(Wang et al., 2022b) further improves the performance via self-evaluating multiple reasoning paths.
Despite these incredible capabilities of models trained on large text corpora (Brown et al., 2020; Chowdhery et al., 2022), fundamentally improving model performance beyond few-shot baselines still requires fine-tuning on an extensive amount of high-quality supervised data.
FLAN (Wei et al., 2021; Chung et al., 2022) and T0 (Sanh et al., 2022) curated tens of benchmark
NLP datasets to boost zero-shot task performances on unseen tasks; InstructGPT (Ouyang et al.,
2022) crowd-sourced many human answers for diverse sets of text instructions to better align their
model to human instructions. While significant effort has been committed to collecting high-quality supervised datasets, the human brain, in contrast, is capable of metacognition (Dunlosky & Metcalfe, 2008): we can refine our own reasoning ability without external inputs.
In this paper, we study how an LLM is able to self-improve its reasoning ability without supervised
data. We show that using only input sequences (without ground truth output sequences) from mul-
tiple NLP task datasets, a pre-trained LLM is able to improve its performance on both in-domain
and out-of-domain tasks. Our method is shown in Figure 1: we first sample multiple predictions
using few-shot Chain-of-Thought (CoT) (Wei et al., 2022b) as prompts, filter “high-confidence”
predictions using majority voting (Wang et al., 2022b), and finally finetune the LLM on these high-
confidence predictions. The resulting model shows improved reasoning in both greedy and multi-path evaluations. We call the model fine-tuned in this way Language Model Self-Improved (LMSI).
Work was done during Google internship.
Corresponding author.
arXiv:2210.11610v2 [cs.CL] 25 Oct 2022
Figure 1: Overview of our method. With Chain-of-Thought (CoT) examples as demonstration (Wei
et al., 2022b), the language model generates multiple CoT reasoning paths and answers (temperature
T > 0) for each question. The most consistent answer is selected by majority voting (Wang et al.,
2022b). The “high-confidence” CoT reasoning paths that lead to the majority answer are augmented
by mixed formats as the final training samples to be fed back to the model for fine-tuning.
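As a concrete illustration of the two input formats depicted in Figure 1, the sketch below builds one training sample in the few-shot CoT format and one in the zero-shot “Let’s think step-by-step” format from the figure’s example question. The helper names and exact template strings are assumptions for illustration, not the exact formatting used in our pipeline.

```python
# Illustrative sketch of the two prompt formats in Figure 1; the template
# strings and helper names are assumptions, not the exact pipeline formatting.

COT_DEMOS = (
    "Q: John buys 20 cards and 1/4 are uncommon. How many uncommon cards did he get?\n"
    "A: John gets 20 * 1/4 = 5 uncommon cards. The answer is 5.\n\n"
)

def few_shot_cot_example(question, reasoning):
    """Few-shot format: CoT demonstrations + question as input, reasoning path as target."""
    return {"input": f"{COT_DEMOS}Q: {question}\nA:", "target": f" {reasoning}"}

def zero_shot_cot_example(question, reasoning):
    """Zero-shot format: the 'Let's think step-by-step.' cue replaces the demonstrations."""
    return {"input": f"Q: {question}\nA: Let's think step-by-step.", "target": f" {reasoning}"}

question = "Amy is 10. Jake is 8. Alex's age is right in the middle. How old is Alex?"
reasoning = "Alex's age is in the middle of 8 and 10. (8+10)/2 = 9. The answer is 9."
samples = [few_shot_cot_example(question, reasoning),
           zero_shot_cot_example(question, reasoning)]
```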
This is similar to how a human brain sometimes learns: given a question, think through it multiple times to derive different possible results, conclude on how the question should be solved, and then learn from or memorize its own solution. We empirically verify our method using a pre-trained PaLM-540B LLM: it not only improves training-task performance (74.4%→82.1% on GSM8K, 78.2%→83.0% on DROP, 90.0%→94.4% on OpenBookQA, and 63.4%→67.9% on ANLI-A3), but also improves out-of-domain (OOD) test tasks (AQUA, StrategyQA, MNLI), achieving state-of-the-art performance on many tasks without relying on supervised ground-truth answers.
Lastly, we conduct preliminary studies on self-generating additional input questions and few-shot CoT prompts, which could further reduce the amount of human effort required for model self-improvement, as well as ablation studies on important hyperparameters of our approach. We hope our simple approach and strong empirical results encourage more future work by the community to investigate the optimal performance of pre-trained LLMs without additional human supervision.
Our contributions are summarized as follows:
• We demonstrate that a large language model can self-improve using only datasets without ground-truth outputs, by leveraging CoT reasoning (Wei et al., 2022b) and self-consistency (Wang et al., 2022b), achieving competitive in-domain multi-task performance as well as out-of-domain generalization. We achieve state-of-the-art-level results on the ARC, OpenBookQA, and ANLI datasets.
• We provide detailed ablation studies on training-sample formatting and on sampling temperature after fine-tuning, and identify critical design choices for the most successful self-improvement by LLMs.
• We study two other approaches to self-improvement, in which the model generates additional questions from a finite set of input questions and generates few-shot CoT prompt templates itself. The latter achieves 74.2% on GSM8K, which is state-of-the-art zero-shot performance, compared with 43.0% by Kojima et al. (2022) and 70.1% from its naive extension with Wang et al. (2022b).
2 RELATED WORK
Learning from explanations. Augmenting a machine learning model with explanations has been studied extensively in the existing literature. For example, in the supervised learning setting, a model
can be fine-tuned using human-annotated rationales (Zaidan et al., 2007; Ling et al., 2017b; Narang
et al., 2020; Camburu et al., 2018; Cobbe et al., 2021; Chung et al., 2022). A few works have
also looked at how explanations can help the models in various settings, e.g., in-context learning
(Lampinen et al., 2022) and in distillation (Pruthi et al., 2022). In this paper, we focus more on
the unsupervised learning setting, where we do not assume we have a rationale-augmented training
dataset available, since human-annotated rationales can be expensive.
Few-shot explanations improve reasoning in LLMs. Recently, substantial progress has been made
towards improving LLMs’ reasoning abilities via prompting or in-context learning. Wei et al.
(2022b) propose Chain-of-Thought prompting, which prompts the language model to generate a se-
ries of natural-language-based intermediate steps, and show it can help language models better solve
complex and multi-step reasoning tasks. Wang et al. (2022b) improve Chain-of-Thought prompting
by sampling multiple diverse reasoning paths and finding the most consistent answers via majority
voting. Kojima et al. (2022) propose to prompt the language model with “Let’s think step by step”
to generate reasoning in a zero-shot fashion. Zhou et al. (2022a) further decompose the questions
into multiple sub-questions, and ask the language model to solve each sub-question sequentially.
Refining explanations. More recent work proposes to further refine the generated reasoning paths
as some of them could be unreliable. For example, Ye & Durrett (2022) calibrate model predictions
based on the reliability of the explanations, Jung et al. (2022) show that inducing a tree of expla-
nations and inferring the satisfiability of each explanation can further help judge the correctness of
explanations. Li et al. (2022b) show that sampling a diverse set of prompts from the training data and using a voting verifier can improve a model’s reasoning performance. Zelikman et al. (2022) propose better rationale generation by augmenting ground-truth answers as hints when predicted answers are incorrect. Our work is orthogonal to these lines of work: we utilize refined explanations from Wang et al. (2022b) for fine-tuning the model for self-improvement, and could readily incorporate these other refinement techniques to generate higher-quality self-training data. Our work is similar to Zelikman et al. (2022) in that both fine-tune a model on self-generated CoT data, but our method does not require ground-truth labels and shows stronger empirical results with multi-task generalization.
Self-training models. One related line of work is self-training (see a survey from Amini et al.
(2022)). The key idea is to assign pseudo-labels from a learned classifier to unlabeled data, and to use these pseudo-labeled examples to further improve the original model training (e.g., RoyChowdhury et al., 2019; Xie et al., 2020; He et al., 2020; Chen et al., 2021). Different from such prior work, our
proposed self-improvement framework uses CoT prompting plus self-consistency to obtain high-
confidence solutions on a large set of unlabeled data to augment the fine-tuning process.
Distillation and dark knowledge. Our method also tangentially relates to rich literature on dis-
tillation (Ba & Caruana, 2014; Hinton et al., 2015), where a student network imitates a teacher
network’s classifier predictions on input examples. A key detail is to learn from soft targets instead
of hard predicted labels, as softmax outputs with a high temperature reveal more detailed relative
class likelihoods, colloquially known as dark knowledge (Hinton et al., 2015; Korattikara Balan
et al., 2015). Recent studies (Zelikman et al., 2022; Snell et al., 2022; Eisenstein et al., 2022) show
that dark knowledge within LLMs can be retrieved with more computation at inference time, such as by adding informative instructions to the input sequence or by generating CoT outputs (Wei et al., 2022b; Kojima et al., 2022). In our work, we explicitly show that imperfect CoT reasoning (which may lead to an incorrect answer) can be used directly for self-improving language models, as evidenced by our experiments in Sections 5.2 and 5.3.
3 METHOD
The overview of our method is illustrated in Fig. 1: We are given a pre-trained Large Language Model (LLM) $M$ and a question-only training dataset $D^{\text{train}} = \{x_i\}_{i=1}^{D}$ with few-shot Chain-of-Thought (CoT) examples (Wei et al., 2022b). We apply multiple-path decoding with a sampling temperature $T > 0$ for generating $m$ reasoning paths and answers $\{r_{i1}, r_{i2}, \ldots, r_{im}\}$ for each question $x_i$ in $D^{\text{train}}$, and use majority voting (self-consistency) to select the most consistent, highest-confidence answer (Wang et al., 2022b). We then keep all reasoning paths that lead to the most consistent answer, apply mixed formats of prompts and answers for augmentation, and fine-tune the model on these self-generated reasoning-answer data. We consider our approach as making the model self-improve. In the following sections, we detail important designs within our method, along with additional approaches for the model to self-improve without supervised data.
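To make the overall loop concrete, the following Python sketch outlines the procedure described above. It is a minimal illustration under assumed interfaces: the functions passed in (sample_fn, parse_answer, format_mixed, finetune_fn) are hypothetical placeholders for the underlying LLM sampling, answer parsing, mixed-format augmentation, and fine-tuning machinery, and the default values of m and T are illustrative rather than the exact experimental settings.

```python
from collections import Counter

def self_improve(model, train_questions, sample_fn, parse_answer,
                 format_mixed, finetune_fn, m=32, T=0.7):
    """Sketch of the self-improvement loop (placeholder interfaces only)."""
    self_training_data = []
    for x in train_questions:
        # Multiple-path decoding: sample m CoT reasoning paths with temperature T > 0.
        paths = sample_fn(model, x, num_samples=m, temperature=T)
        # Parse each path's predicted answer (the text after "The answer is").
        answers = [parse_answer(r) for r in paths]
        counts = Counter(a for a in answers if a is not None)
        if not counts:
            continue  # no parseable answer for this question
        # Self-consistency: keep only the paths that reach the majority answer.
        y_tilde, _ = counts.most_common(1)[0]
        kept_paths = [r for r, a in zip(paths, answers) if a == y_tilde]
        # Augment with mixed prompt/answer formats before adding to the training set.
        self_training_data.extend(format_mixed(x, kept_paths))
    # Fine-tune the model on its own high-confidence, rationale-augmented answers.
    return finetune_fn(model, self_training_data)
```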
3.1 GENERATING AND FILTERING MULTIPLE REASONING PATHS
Self-consistency (Wang et al., 2022b) brings large improvements on reasoning tasks (e.g., 56.5%→74.4% on the GSM8K test set), and the gap between greedy decoding and diverse decoding shows that there is potential for further improving the reasoning ability of $M$ by using the self-selected high-confidence reasoning paths as training data.
Question: Stefan goes to a restaurant with his family. They order an appetizer that costs $10 and
4 entrees that are $20 each. If they tip 20% of the total, what is the total amount of money that
they spend?
Multiple Path Decoding:
Output 1: The appetizer costs $10. The entrees cost $20 each so in total 4 * $20 = $80. This means
the total cost is $10 + $80 = $90. They tip 20% of it, so the total amount they spend is $90 * 1.2
= $108. The answer is 108.
Output 2: The appetizer costs $10 and the entrees are $20 each. There are 4 entrees so the sum is
$20 * 4 = $80. The waiter gets 20% of the total. 20% of $80 is $80 * .2 = $16. The answer is $80
+ $16 = $96. (Incorrect reasoning path)
Output 3: The appetizer costs $10. The entrees cost 4 * $20 = $80. The tip is 20% of the total, so
it is 20% of the $90 they have spent. The tip is 0.2 * 90 = $18. The total they spent is $90 + $18
= $108. The answer is 108.
Table 1: Examples of three self-generated CoT reasoning paths for a given question. Outputs 1 and 3 yield the most consistent answer under majority voting and are kept as self-training data.
For each training question $x_i$, we sample $m$ CoT reasoning paths, denoted as $\{r_{i1}, r_{i2}, \ldots, r_{im}\}$ (see Table 1 for examples). Since $M$ is prompted with the CoT examples from Wei et al. (2022b), we apply the same output parsing with “The answer is” to generate their predicted answers $\{y_{i1}, y_{i2}, \ldots, y_{im}\}$. The most consistent answer, which is not necessarily a correct answer, is selected by majority voting, denoted as $\tilde{y}_i = \arg\max_{y_{ij}} \sum_{k=1}^{m} \mathbb{I}(y_{ij} = y_{ik})$. For all the training questions, we filter the CoT reasoning paths that reach $\tilde{y}_i$ as the final answer to be put into the self-training data, denoted as $D^{\text{self-consistent}} = \{(x_i, \tilde{r}_i)\}$, where $\tilde{r}_i = \{r_{ij} \mid 1 \le j \le m,\ y_{ij} = \tilde{y}_i\}$.
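The snippet below is a small sketch of this parsing and voting step: it extracts the answer after the final “The answer is” and implements $\tilde{y}_i$ and $\tilde{r}_i$. The regular expression and function names are illustrative assumptions, not our exact implementation.

```python
import re
from collections import Counter

def parse_answer(reasoning_path):
    """Return the numeric answer y_ij stated after the final "The answer is"."""
    parts = reasoning_path.rsplit("The answer is", 1)
    if len(parts) < 2:
        return None
    numbers = re.findall(r"-?\d[\d,]*\.?\d*", parts[1])
    return numbers[-1].replace(",", "").rstrip(".") if numbers else None

def filter_by_self_consistency(reasoning_paths):
    """Majority-vote the parsed answers and keep the paths that support the winner."""
    answers = [parse_answer(r) for r in reasoning_paths]
    counts = Counter(a for a in answers if a is not None)
    if not counts:
        return None, []
    y_tilde = counts.most_common(1)[0][0]  # most consistent answer (ties broken arbitrarily)
    r_tilde = [r for r, a in zip(reasoning_paths, answers) if a == y_tilde]
    return y_tilde, r_tilde
```

Applied to the three outputs in Table 1, the parsed answers are 108, 96, and 108, so the majority answer $\tilde{y}_i$ is 108 and Outputs 1 and 3 form $\tilde{r}_i$.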
Figure 2: The relation of accu-
racy and confidence of the majority-
voted answer after multiple path de-
coding on GSM8K training-set ques-
tions. Predicted confidence from self-
consistency (Wang et al., 2022b) is well
calibrated (Guo et al., 2017).
Since we do not use any ground-truth labels to filter out cases where $\tilde{y}_i \neq y_i$, it is important that the self-generated CoT reasoning paths are mostly reliable and that incorrect answers do not hurt the self-improvement of the model. We plot the relation between the accuracy and the confidence of self-generated CoT paths for each question in the GSM8K training set in Fig. 2. The confidence is the number of CoT paths leading to $\tilde{y}_i$ divided by the total number of paths $m$. The y-axis shows the accuracy of $\tilde{y}_i$ at a given confidence, and the circle area and color darkness show the number of questions at that confidence. We observe that confident answers are more likely to be correct: when a question has many consistent CoT paths, the corresponding $\tilde{y}_i$ is more likely to be correct. On the other hand, when $\tilde{y}_i$ is wrong, it is likely to be supported by fewer CoT paths, and thus brings little noise into the training samples.
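The confidence statistic and the accuracy-versus-confidence analysis behind Fig. 2 can be reproduced roughly as in the sketch below, assuming reasoning paths have already been sampled and parsed; the function names and binning scheme are illustrative. Note that gold answers enter only this diagnostic analysis, never the self-training data.

```python
from collections import Counter, defaultdict

def majority_answer_and_confidence(answers):
    """Confidence = (#paths voting for the majority answer) / m."""
    counts = Counter(a for a in answers if a is not None)
    if not counts:
        return None, 0.0
    y_tilde, votes = counts.most_common(1)[0]
    return y_tilde, votes / len(answers)

def accuracy_by_confidence(per_question_answers, gold_answers, num_bins=10):
    """Bucket questions by confidence and report the accuracy of y_tilde per bucket.
    Gold answers are used only for this analysis (as in Fig. 2), not for training."""
    buckets = defaultdict(list)
    for answers, gold in zip(per_question_answers, gold_answers):
        y_tilde, conf = majority_answer_and_confidence(answers)
        bin_idx = min(int(conf * num_bins), num_bins - 1)
        buckets[bin_idx].append(y_tilde == gold)
    return {(b / num_bins, (b + 1) / num_bins): sum(hits) / len(hits)
            for b, hits in sorted(buckets.items())}
```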