(Lampinen et al., 2022) and in distillation (Pruthi et al., 2022). In this paper, we focus more on
the unsupervised learning setting, where we do not assume we have a rationale-augmented training
dataset available, since human-annotated rationales can be expensive.
Few-shot explanations improve reasoning in LLMs. Recently, substantial progress has been made towards improving LLMs’ reasoning abilities via prompting or in-context learning. Wei et al.
(2022b) propose Chain-of-Thought prompting, which prompts the language model to generate a se-
ries of natural-language-based intermediate steps, and show it can help language models better solve
complex and multi-step reasoning tasks. Wang et al. (2022b) improve Chain-of-Thought prompting
by sampling multiple diverse reasoning paths and selecting the most consistent answer via majority
voting. Kojima et al. (2022) propose to prompt the language model with “Let’s think step by step”
to generate reasoning in a zero-shot fashion. Zhou et al. (2022a) further decompose the questions
into multiple sub-questions, and ask the language model to solve each sub-question sequentially.
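To make these prompting styles concrete, the sketch below assembles a few-shot CoT prompt and a zero-shot CoT prompt; the exemplar text and function names are illustrative stand-ins rather than the exact prompts used in the cited papers:

```python
# Illustrative CoT exemplar in the style of Wei et al. (2022b); the content is
# a stand-in, not an exemplar taken verbatim from the cited work.
FEW_SHOT_COT_EXEMPLAR = (
    "Q: Roger has 5 tennis balls. He buys 2 more cans of 3 tennis balls each. "
    "How many tennis balls does he have now?\n"
    "A: Roger started with 5 balls. 2 cans of 3 balls each is 6 balls. "
    "5 + 6 = 11. The answer is 11.\n"
)

def few_shot_cot_prompt(question: str) -> str:
    # Few-shot CoT: prepend worked examples whose answers spell out the
    # intermediate reasoning steps before the final answer.
    return f"{FEW_SHOT_COT_EXEMPLAR}\nQ: {question}\nA:"

def zero_shot_cot_prompt(question: str) -> str:
    # Zero-shot CoT (Kojima et al., 2022): no exemplars, just a trigger phrase
    # that elicits step-by-step reasoning.
    return f"Q: {question}\nA: Let's think step by step."
```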
Refining explanations. More recent work proposes to further refine the generated reasoning paths
as some of them could be unreliable. For example, Ye & Durrett (2022) calibrate model predictions based on the reliability of the explanations, while Jung et al. (2022) show that inducing a tree of explanations and inferring the satisfiability of each explanation can further help judge the correctness of explanations. Li et al. (2022b) show that sampling a diverse set of prompts from the training data and using a voting verifier can improve a model’s reasoning performance. Zelikman et al. (2022) propose to improve rationale generation by providing ground-truth answers as hints when predicted answers are incorrect. Our work is orthogonal to these lines of work, as we utilize refined explanations
from Wang et al. (2022b) for fine-tuning the model for self-improvement, and could readily incor-
porate these other refinement techniques for generating higher-quality self-training data. Our work
is similar to Zelikman et al. (2022) in that both propose to fine-tune a model on self-generated CoT data, but our method does not require ground-truth labels and shows stronger empirical results with multi-task generalization.
Self-training models. One related line of work is self-training (see a survey from Amini et al.
(2022)). The key idea is to assign pseudo labels from a learned classifier to unlabeled data, and use these pseudo-labeled examples to further improve training of the original model (e.g., RoyChowdhury et al., 2019; Xie et al., 2020; He et al., 2020; Chen et al., 2021). Different from such prior work, our
proposed self-improvement framework uses CoT prompting plus self-consistency to obtain high-
confidence solutions on a large set of unlabeled data to augment the fine-tuning process.
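As a minimal sketch of this pseudo-labeling idea (assuming a scikit-learn-style classifier; the confidence threshold and single-round loop are illustrative choices, not the procedure of any specific cited work):

```python
import numpy as np

def self_training_round(model, X_labeled, y_labeled, X_unlabeled, threshold=0.9):
    """One round of classic self-training: pseudo-label confident unlabeled
    examples with the current model, then retrain on the enlarged set.

    Assumes a scikit-learn-style classifier exposing fit/predict_proba/classes_;
    the 0.9 confidence threshold is an illustrative choice.
    """
    probs = model.predict_proba(X_unlabeled)            # model beliefs on unlabeled data
    confidence = probs.max(axis=1)                      # highest class probability per example
    pseudo_labels = model.classes_[probs.argmax(axis=1)]  # predicted (pseudo) labels
    keep = confidence >= threshold                      # keep only high-confidence predictions

    X_aug = np.concatenate([X_labeled, X_unlabeled[keep]])
    y_aug = np.concatenate([y_labeled, pseudo_labels[keep]])
    model.fit(X_aug, y_aug)                             # retrain on labeled + pseudo-labeled data
    return model
```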
Distillation and dark knowledge. Our method also tangentially relates to the rich literature on distillation (Ba & Caruana, 2014; Hinton et al., 2015), where a student network imitates a teacher
network’s classifier predictions on input examples. A key detail is to learn from soft targets instead
of hard predicted labels, as softmax outputs with a high temperature reveal more detailed relative
class likelihoods, colloquially known as dark knowledge (Hinton et al., 2015; Korattikara Balan
et al., 2015). Recent studies (Zelikman et al., 2022; Snell et al., 2022; Eisenstein et al., 2022) show
that dark knowledge within LLMs can be retrieved with more computation at inference time, such
as adding informative instructions to the input sequence and generating CoT outputs (Wei et al., 2022b; Kojima et al., 2022). In our work, we explicitly show that imperfect CoT reasoning (which may lead to incorrect answers) can be used directly for self-improving language models, as evidenced by our experiments in Sections 5.2 and 5.3.
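For reference, a minimal PyTorch sketch of the soft-target objective described above (the temperature value is an illustrative choice; this is not the training loss used in our method):

```python
import torch
import torch.nn.functional as F

def distillation_loss(student_logits, teacher_logits, temperature=4.0):
    """Soft-target distillation loss in the spirit of Hinton et al. (2015).

    A high temperature softens both distributions so the student also learns
    the teacher's relative class likelihoods ("dark knowledge"), not just the
    argmax label.
    """
    soft_teacher = F.softmax(teacher_logits / temperature, dim=-1)
    log_soft_student = F.log_softmax(student_logits / temperature, dim=-1)
    # KL divergence between the softened teacher and student distributions;
    # the T^2 factor keeps gradient magnitudes comparable across temperatures.
    return F.kl_div(log_soft_student, soft_teacher, reduction="batchmean") * temperature ** 2
```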
3 METHOD
An overview of our method is illustrated in Fig. 1: we are given a pre-trained Large Language Model (LLM) $M$ and a question-only training dataset $D_{\text{train}} = \{x_i\}_{i=1}^{D}$ with few-shot Chain-of-Thought (CoT) examples (Wei et al., 2022b). We apply multiple-path decoding with a sampling temperature $T > 0$ to generate $m$ reasoning paths and answers $\{r_{i1}, r_{i2}, \ldots, r_{im}\}$ for each question $x_i$ in $D_{\text{train}}$, and use majority voting (self-consistency) to select the most consistent, highest-confidence answer (Wang et al., 2022b).
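A minimal sketch of this sampling-and-voting step, together with the path filtering described next, is shown below; `sample_cot_paths` and `extract_answer` are hypothetical helpers standing in for sampled CoT decoding and final-answer parsing, and the values of $m$ and $T$ are illustrative:

```python
from collections import Counter

def build_self_training_examples(model, questions, cot_prompt, m=32, temperature=0.7):
    """Illustrative sketch: sample m CoT paths per question, majority-vote the
    answer (self-consistency), and keep the paths that agree with the vote.

    `sample_cot_paths(model, prompt, n, temperature)` and `extract_answer(path)`
    are hypothetical helpers for sampled decoding and answer parsing.
    """
    examples = []
    for question in questions:
        prompt = cot_prompt + "\nQ: " + question + "\nA:"
        paths = sample_cot_paths(model, prompt, n=m, temperature=temperature)
        answers = [extract_answer(p) for p in paths]

        # Self-consistency: the most frequent final answer is treated as the
        # high-confidence (pseudo-label) answer for this question.
        voted_answer, _ = Counter(answers).most_common(1)[0]

        # Keep only reasoning paths whose answer matches the voted answer;
        # these become (question, reasoning, answer) fine-tuning examples.
        for path, answer in zip(paths, answers):
            if answer == voted_answer:
                examples.append({"question": question, "reasoning": path, "answer": answer})
    return examples
```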
We then keep all reasoning paths that lead to the most consistent answer, apply mixed formats of prompts and answers for augmentation, and fine-tune the model on these self-generated reasoning-answer data. We consider our approach as making the