Explanations from Large Language Models Make Small Reasoners Better
Shiyang Li1, Jianshu Chen2, Yelong Shen3, Zhiyu Chen1, Xinlu Zhang1, Zekun Li1
Hong Wang1, Jing Qian1, Baolin Peng3, Yi Mao3, Wenhu Chen4 and Xifeng Yan1
1University of California, Santa Barbara
2Tencent AI Lab, 3Microsoft
4University of Waterloo, Vector Institute
{shiyangli,zhiyuchen,xinluzhang,zekunli,hongwang600,jing_qian,xyan}@cs.ucsb.edu
jianshuchen@tencent.com,wenhuchen@uwaterloo.ca
{yelong.shen,bapeng,maoyi}@microsoft.com
Abstract
Integrating free-text explanations into the in-context learning of large language models (LLM) has been shown to elicit strong reasoning capabilities along with reasonable explanations. In this paper, we consider the problem of leveraging the explanations generated by LLM to improve the training of small reasoners, which are more favorable for real-production deployment due to their low cost. We systematically explore three explanation generation approaches from LLM and utilize a multi-task learning framework to facilitate small models in acquiring strong reasoning power together with explanation generation capabilities. Experiments on multiple reasoning tasks show that our method can consistently and significantly outperform finetuning baselines across different settings, and even perform better than finetuning/prompting a 60x larger GPT-3 (175B) model by up to 9.5% in accuracy. As a side benefit, human evaluation further shows that our method can generate high-quality explanations to justify its predictions, moving towards the goal of explainable AI.
1 Introduction
Large language models (LLM) have achieved impressive results with in-context learning; by adding a few demonstrations as prompts, they can solve unseen tasks without any parameter update (Brown et al., 2020; Thoppilan et al., 2022; Chowdhery et al., 2022; Wei et al., 2022a). Recently, it has been shown that adding explanation-augmented prompts can elicit strong performance in various reasoning tasks (Wei et al., 2022b; Lampinen et al., 2022), such as math word problems (Cobbe et al., 2021), symbolic reasoning (Wei et al., 2022b), numerical reasoning (Zhou et al., 2022) and commonsense reasoning tasks (Talmor et al., 2019). In addition, such prompts also enable LLM to generate reasonable explanations to justify the reasoning outcomes.
In this paper, we consider the problem of leveraging these elicited explanations from LLM to improve the training of small reasoners. Small language models (SLM)¹ could be more favorable than LLM in many real situations due to their low cost in both storage and computation. Nevertheless, one important open question is how to close the performance gap with respect to LLM on complicated reasoning tasks, as observed in Zelikman et al. (2022), especially in few-shot settings (Li et al., 2019). Surprisingly, Hase et al. (2020) shows that using human-annotated explanations does not improve performance compared to standard finetuning on T5 (Raffel et al., 2019). One possible reason is that many human-annotated explanations collected via crowdsourcing (Wiegreffe and Marasović, 2021) can be logically inconsistent and grammatically incorrect (Narang et al., 2020), which restricts the amount of available high-quality explanations. On the other hand, explanation-augmented prompts enable LLM to automatically generate decent explanations (Wiegreffe et al., 2021a), making them a plausible way to generate an arbitrary amount of explanations. Therefore, a key question is: can the explanations generated by LLM improve the reasoning capability of SLM?
In this paper, we show that explanations generated by LLM can consistently improve the reasoning capability of SLM. Our framework is shown in Figure 1. Specifically, we first use several examples with human-written explanations as demonstrations for the LLM and then generate explanations for the training set. We systematically explore three approaches to generating explanations. The first approach uses explanations generated through chain of thought prompting; an explanation is adopted if the LLM's prediction is correct and rejected otherwise (Zelikman et al., 2022). The second is to generate explanations via rationalization prompting conditioned on golden labels (Wiegreffe et al., 2021a).
¹ We argue that small and large models are relative concepts; the same model can be small or large depending on the context.
Figure 1: Overview of the proposed framework. Raw data is fed to the LLM via prompting; the decoded explanations are used, together with the raw data, for multi-task learning of T5, which at inference produces both explanations and predictions.
Intuitively, the first approach may generate higher-quality explanations than the second when the LLM's predictions are correct, as incorrect explanations tend to produce incorrect predictions (Wei et al., 2022b). However, the first approach rejects explanations for problems with incorrect predictions, leaving their explanations empty. The second one, on the other hand, explicitly conditions on golden labels and may still generate useful explanations for problems where chain of thought prompting cannot predict correctly. Therefore, we propose a third, hybrid approach: adopting explanations generated by chain of thought prompting when the LLM's predictions are correct and using rationalization prompting otherwise. As we will show in Section 5, all three explanation generation methods can consistently and significantly improve finetuning baselines without explanations, and our hybrid approach achieves the best results on two of three datasets.
We further adopt a multi-task (MT) learning framework, shown in Figure 2, to utilize the LLM-generated explanations, since (1) it naturally allows training with partially generated explanations and (2) the self-rationalizing model (Wiegreffe et al., 2021b), where the golden label and the human-written explanation are linearly concatenated as the target, performs significantly worse than its MT counterpart (Hase et al., 2020). Interestingly, we find that even with the same MT approaches (i.e., MT-Re (Hase et al., 2020) and MT-Ra (Camburu et al., 2018)) as Hase et al. (2020), we can consistently and significantly improve a strong T5 standard finetuning baseline using LLM-generated explanations, which is in stark contrast to the results in Hase et al. (2020), where finetuning T5 with MT-Re and MT-Ra only achieves results on par with standard finetuning when using crowd-sourced explanations. In addition, we further propose MT-CoT, where the small language model is trained to jointly solve two tasks: (i) directly generating the answer and (ii) generating an explanation and then the answer, as shown in Figure 2(c). Unlike MT-Re and MT-Ra, MT-CoT positions the answer after the explanation, so that the model can learn to derive it from the explanation in the style of chain of thought (Wei et al., 2022b). Our results show that all three explanation generation approaches can improve the reasoning capability of small language models under the MT-Ra, MT-Re and MT-CoT setups, and MT-CoT achieves the best results among the three on two of three datasets. In addition, our method can outperform the standard finetuning baseline by up to 8.1% in accuracy and even perform better than finetuning/prompting a 60x larger GPT-3 (175B) model by up to 9.5% in accuracy on CommonsenseQA. Finally, as a side benefit, human evaluation further shows that our method can generate high-quality explanations to justify its predictions, moving towards the goal of explainable AI (Samek et al., 2019).
In a nutshell, we summarize our contributions as follows:

• We show that multi-task learning with explanations from LLM can consistently and significantly improve strong T5 single-task finetuning baselines across various settings.

• We propose a hybrid prompting approach to generating explanations from LLM, together with MT-CoT, to further improve our learning-with-explanations-from-LLM paradigm.

• We demonstrate that our method can perform better than finetuning/prompting a 60x larger GPT-3 model (175B) by up to 9.5% in accuracy on CommonsenseQA, and can generate high-quality explanations to justify its predictions, towards the goal of explainable AI.
2 Related Work
Prompting with Explanations. Recently, a new learning paradigm, in-context learning, where several training examples are used as demonstrations for LLM without any parameter update, has shown promising results on various NLP tasks (Brown et al., 2020). Although promising, LLM still struggle with tasks requiring strong reasoning capability (Wei et al., 2022b). To enable better few-shot in-context learning of LLM on reasoning tasks, Wei et al. (2022b) proposes chain of thought prompting, which provides intermediate reasoning steps as explanations in prompts before answers and has achieved state-of-the-art results in arithmetic, symbolic and commonsense reasoning tasks. Zhou et al. (2022) further extends chain of thought prompting with least-to-most prompting, which decomposes a complex problem into a list of subproblems in natural language and then sequentially solves these subproblems in a recursive fashion. Kojima et al. (2022) moves one step further and shows that LLM are zero-shot reasoners: simply adding "Let's think step by step" to the prompt, without any demonstration, suffices. Unlike these works, Lampinen et al. (2022) explores prompting with explanations after answers, where answers are fed into the LLM before their explanations in prompts, and also observes consistent gains.
There also exists work that utilizes explanations generated from LLM rather than focusing on their final predictions. Wiegreffe et al. (2021a) explores utilizing LLM to annotate explanations for existing datasets and proposes a sample-then-filter paradigm with human annotations. Ye and Durrett (2022) proposes to utilize a calibrator to calibrate GPT-3, as they find that GPT-3 tends to generate consistent but less factual explanations for textual reasoning tasks. However, none of these works explores whether such noisy explanations generated from LLM, without human-involved filtering, can be used to improve SLM reasoning capability. The closest work to ours is STaR (Zelikman et al., 2022). STaR begins by prompting a decent large language model, GPT-J with 6B parameters (Wang, 2021), possibly including answer hints, via chain of thought prompting to generate explanations with incorrect answer rejection. After that, they utilize the filtered training datasets with explanations to finetune GPT-J as a teacher model, and then use the teacher model to generate explanations for the training datasets to train a student GPT-J model iteratively in a self-training fashion until performance plateaus. However, STaR often requires dozens of iterations to converge, which is both time-consuming and compute-intensive for a large 6B model. What's worse, their method may not be applicable to smaller language models, e.g. GPT-2 (Radford et al., 2019), and strong non-autoregressive models, e.g. T5, as they may not generate high-quality explanations with prompting. In addition, they only focus on chain-of-thought-style prompting and finetuning, while our approach can improve SLM across model sizes, explanation generation methods and multi-task finetuning methods.
Learning with Explanations. Learning with explanations has been commonly studied in robotics (Johnson, 1994) and computer vision (Hendricks et al., 2016). Recently, it has received increasing attention in NLP as well. Camburu et al. (2018) proposes MT-Ra for the natural language inference task with LSTM and does not observe gains over single-task finetuning. Narang et al. (2020) utilizes the MT-Ra setup on both T5-base and T5-11B models but mainly focuses on explanation generation. Instead, Rajani et al. (2019) observes improvements with two-stage finetuning using human-annotated explanations for a commonsense reasoning task, where the first stage trains a model for explanation generation with GPT (Radford et al., 2018) and the second uses explanations as input to train a classification model based on BERT (Devlin et al., 2019). However, Hase et al. (2020) finds that both two-stage finetuning and multi-task learning with the MT-Re and MT-Ra setups only obtain results comparable to standard finetuning baselines on T5. We instead show that MT-Re, MT-Ra and our proposed MT-CoT with explanations from LLM can consistently and significantly outperform standard finetuning baselines without an accuracy-explanation trade-off (Jain et al., 2020).
3 Explanation Generation from LLM
Problem setup. Denote $D=\{(x_i, y_i)\}_{i=1}^{N}$ to be a dataset with $N$ training instances, where $x_i$ is a problem and $y_i$ is its answer. Also, we have a handful of human-written instances $E=\{(x^p_i, e^p_i, y^p_i)\}_{i=1}^{M}$, where $e^p_i$ is a free-text explanation of why problem $x^p_i$ has $y^p_i$ as its answer, and $\{(x^p_i, y^p_i)\}_{i=1}^{M} \subset D$ with $M \ll N$ (we set $M=7$ in our experiments). Our goal is to fully leverage the LLM, with $E$ as demonstrations for in-context learning, to generate an explanation $e_i$ for every $(x_i, y_i)$, $1 \le i \le N$, so that we can utilize these generated explanations from LLM to improve SLM reasoning capability.
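To ground the notation, the sketch below shows one plausible in-memory representation of $D$ and $E$; the dictionary keys are our own naming, and the example question and explanation are taken from Figure 2.

```python
# A minimal sketch of the data assumed in the problem setup (field names are illustrative).

# D: N training instances (x_i, y_i) -- problems with answers but no explanations.
train_set = [
    {"x": "The only baggage the woman checked was a drawstring bag, where was she "
          "heading with it? Answer Choices: (a) garbage can (b) military "
          "(c) jewelry store (d) safe (e) airport",
     "y": "(e)"},
    # ... N instances in total
]

# E: M human-written exemplars (M = 7 in our experiments), each adding a free-text
# explanation e^p_i of why x^p_i has y^p_i as its answer; {(x^p_i, y^p_i)} is a subset of D.
exemplars = [
    {"x": train_set[0]["x"],
     "e": "The answer must be a place where someone would check a bag. The only "
          "place where someone would check a bag is at an airport.",
     "y": "(e)"},
    # ... M exemplars in total
]
```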
COTE. A chain of thought is a series of intermediate reasoning steps before the final answer to a problem, mimicking the deliberate thinking process humans use to perform complicated reasoning tasks (Wei et al., 2022b).
Figure 2: The comparison among (a) MT-Re (Hase et al., 2020), (b) MT-Ra (Camburu et al., 2018) and (c) our proposed MT-CoT for multi-task learning with explanations in the text-to-text format of T5, illustrated on a CommonsenseQA example. The left parts are inputs to T5 and the right parts are targets for the different multi-task learning setups. The qta (question to answer) task is trained to directly generate answers in all three modes, while the qtr (question to reason) task is trained to generate the reasoning, rationalization and chain of thought for (a) MT-Re, (b) MT-Ra and (c) MT-CoT, respectively.
Chain of thought prompting provides such intermediate reasoning steps as explanations before the answers in prompts. Formally, for $1 \le i \le N$, we first concatenate all instances in $E$ and $x_i$ as the prompt $\hat{p}_i = (x^p_1, e^p_1, y^p_1, \ldots, x^p_M, e^p_M, y^p_M, x_i)$. We then feed the prompt $\hat{p}_i$ into the LLM and greedily decode until a stop token is generated. After that, we parse the decoded sentence into an explanation part $\hat{e}_i$ and a prediction part $\hat{y}_i$. Intuitively, if $\hat{y}_i \neq y_i$, then $\hat{e}_i$ may not be of high quality, as incorrect explanations tend to produce incorrect predictions (Wei et al., 2022b). Thus, we utilize Chain Of Thought prompting with incorrect answer rEjection (COTE) (Zelikman et al., 2022): we adopt $e_i := \hat{e}_i$ if $\hat{y}_i = y_i$; otherwise, we reject $\hat{e}_i$ and set $e_i$ to none.
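As a minimal sketch of COTE, the snippet below builds a chain-of-thought prompt from $E$, greedily decodes, and keeps an explanation only when the parsed prediction matches the gold answer. The Q/A layout, the "So the answer is" pattern used to split explanation from prediction, and the `llm_complete` callable are illustrative assumptions; the paper only specifies the $(x, e, y)$ exemplar ordering, greedy decoding and incorrect-answer rejection.

```python
import re

def build_cot_prompt(exemplars, x_i):
    # Exemplars are ordered (x, e, y): the explanation precedes the answer,
    # and the new problem x_i is appended at the end.
    parts = [f"Q: {ex['x']}\nA: {ex['e']} So the answer is {ex['y']}.\n" for ex in exemplars]
    parts.append(f"Q: {x_i}\nA:")
    return "\n".join(parts)

def cote(llm_complete, exemplars, train_set):
    """Chain Of Thought prompting with incorrect answer rEjection (COTE)."""
    explanations = []
    for inst in train_set:
        decoded = llm_complete(build_cot_prompt(exemplars, inst["x"]))  # greedy decoding
        # Split the decoded sentence into an explanation part and a prediction part.
        m = re.search(r"(.*)So the answer is\s*(\(\w\))", decoded, re.S)
        e_hat, y_hat = (m.group(1).strip(), m.group(2)) if m else (decoded.strip(), None)
        # Adopt the explanation only if the prediction matches the gold answer.
        explanations.append(e_hat if y_hat == inst["y"] else None)
    return explanations
```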
RP. Since COTE uses the answers in the original datasets to reject explanations with incorrect predictions, those instances no longer have explanations. To alleviate this issue, an alternative is to apply Rationalization Prompting (RP) (Wiegreffe et al., 2021a) to generate explanations for every instance in the training set. Unlike COTE, RP generates explanations given the golden answers. Specifically, for $1 \le i \le N$, we concatenate all instances in $E$ and $(x_i, y_i)$ as the prompt $\bar{p}_i = (x^p_1, y^p_1, e^p_1, \ldots, x^p_M, y^p_M, e^p_M, x_i, y_i)$. We then feed the prompt $\bar{p}_i$ into the LLM and greedily decode until a stop token is generated. The decoded sentence $\bar{e}_i$ is taken as the explanation, i.e. $e_i := \bar{e}_i$, without any filtering.
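A sketch of RP follows; it differs from the COTE sketch only in ordering the exemplars as $(x, y, e)$ and appending the gold answer of the new instance, so the LLM only has to produce an explanation. The "Why?" marker is an illustrative formatting choice, not the paper's exact prompt.

```python
def build_rp_prompt(exemplars, x_i, y_i):
    # Exemplars are ordered (x, y, e): the gold answer is given, the explanation follows.
    parts = [f"Q: {ex['x']}\nA: {ex['y']}\nWhy? {ex['e']}\n" for ex in exemplars]
    parts.append(f"Q: {x_i}\nA: {y_i}\nWhy?")
    return "\n".join(parts)

def rationalization_prompting(llm_complete, exemplars, train_set):
    """RP: generate an explanation for every instance, conditioned on its gold answer."""
    return [llm_complete(build_rp_prompt(exemplars, inst["x"], inst["y"])).strip()
            for inst in train_set]  # no filtering step
```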
CROP. COTE will likely generate relatively high-quality explanations when the LLM predicts the problem at hand correctly, as incorrect explanations tend to produce incorrect predictions (Wei et al., 2022b). However, for problems with incorrect predictions, COTE sets their explanations to none. On the other hand, RP can generate explanations for every instance in the dataset, but we cannot easily assess their quality without human annotation. Therefore, we propose Chain of Thought with Rationalization PrOmpting backuP (CROP): whenever COTE yields none as the explanation, we utilize RP as a backup approach. Intuitively, if the LLM cannot predict a problem correctly under chain of thought prompting, the problem may be difficult (Zelikman et al., 2022), and RP may still provide a meaningful explanation since it can access the golden label during the explanation generation process.
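CROP then simply backfills the instances that COTE rejected with RP explanations; a sketch reusing the two helpers above:

```python
def crop(llm_complete, exemplars, train_set):
    """Chain of Thought with Rationalization PrOmpting backuP (CROP)."""
    explanations = cote(llm_complete, exemplars, train_set)
    for i, (inst, e) in enumerate(zip(train_set, explanations)):
        if e is None:
            # COTE rejected this instance (incorrect prediction), so fall back to RP,
            # which conditions on the gold answer.
            explanations[i] = llm_complete(
                build_rp_prompt(exemplars, inst["x"], inst["y"])).strip()
    return explanations
```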
4 Multi-task Learning with Explanations
In this section, we elaborate on how to utilize explanations generated from LLM to improve SLM reasoning capability with a multi-task learning framework. We detail three methods for multi-task learning with explanations in the following.
MT-Re. Multi-task Learning with Reasoning (MT-Re) is introduced by Hase et al. (2020) (see Figure 2(a)). MT-Re is trained to directly generate predictions for the qta (question to answer) task, exactly as in standard finetuning without explanations, and to generate explanations without explicitly providing the answers for the qtr (question to reason) task. The training objective of MT-Re mixes the loss $\mathcal{L}_{qta}$ for the qta task and $\mathcal{L}_{qtr}$ for the qtr task:

$$\mathcal{L}_{mt} = \alpha \mathcal{L}_{qta} + (1-\alpha)\,\mathcal{L}_{qtr}. \qquad (1)$$
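In a text-to-text setup, this mixed objective can be realized by weighting per-example losses (or, equivalently in expectation, by sampling qta and qtr examples with probabilities α and 1 − α). The sketch below takes the explicit weighting route with a Hugging Face T5 model; the task prefixes and the value of α are illustrative assumptions rather than the exact training configuration.

```python
from transformers import T5ForConditionalGeneration, T5Tokenizer

tokenizer = T5Tokenizer.from_pretrained("t5-base")
model = T5ForConditionalGeneration.from_pretrained("t5-base")
alpha = 0.5  # mixing weight between the qta and qtr losses (illustrative value)

def seq2seq_loss(source, target):
    enc = tokenizer(source, return_tensors="pt", truncation=True)
    labels = tokenizer(target, return_tensors="pt", truncation=True).input_ids
    return model(**enc, labels=labels).loss

def mt_re_loss(question, answer, explanation):
    # qta: question -> answer, the same target as standard finetuning.
    l_qta = seq2seq_loss("qta question: " + question, answer)
    if explanation is None:
        # Partially generated explanations (e.g. COTE rejections) fall back to qta only.
        return l_qta
    # qtr: question -> explanation, without explicitly providing the answer (MT-Re).
    l_qtr = seq2seq_loss("qtr question: " + question, explanation)
    return alpha * l_qta + (1 - alpha) * l_qtr
```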