Explanations from Large Language Models Make Small Reasoners Better
Shiyang Li1, Jianshu Chen2, Yelong Shen3, Zhiyu Chen1, Xinlu Zhang1, Zekun Li1
Hong Wang1, Jing Qian1, Baolin Peng3, Yi Mao3, Wenhu Chen4 and Xifeng Yan1
1University of California, Santa Barbara
2Tencent AI Lab, 3Microsoft
4University of Waterloo, Vector Institute
{shiyangli,zhiyuchen,xinluzhang,zekunli,hongwang600,jing_qian,xyan}@cs.ucsb.edu
jianshuchen@tencent.com,wenhuchen@uwaterloo.ca
{yelong.shen,bapeng,maoyi}@microsoft.com
Abstract
Integrating free-text explanations into the in-context learning of large language models (LLM) has been shown to elicit strong reasoning capabilities along with reasonable explanations. In this paper, we consider the problem of leveraging the explanations generated by LLM to improve the training of small reasoners, which are more favorable for real-production deployment due to their low cost. We systematically explore three explanation generation approaches from LLM and utilize a multi-task learning framework to facilitate small models in acquiring strong reasoning power together with explanation generation capabilities. Experiments on multiple reasoning tasks show that our method can consistently and significantly outperform finetuning baselines across different settings, and even perform better than finetuning/prompting a 60x larger GPT-3 (175B) model by up to 9.5% in accuracy. As a side benefit, human evaluation further shows that our method can generate high-quality explanations to justify its predictions, moving towards the goal of explainable AI.
1 Introduction
Large language models (LLM) have achieved impressive results with in-context learning; by adding a few demonstrations as prompts, they can solve unseen tasks without any parameter update (Brown et al., 2020; Thoppilan et al., 2022; Chowdhery et al., 2022; Wei et al., 2022a). Recently, it has been shown that adding explanation-augmented prompts can elicit strong performance in various reasoning tasks (Wei et al., 2022b; Lampinen et al., 2022), such as math word problems (Cobbe et al., 2021), symbolic reasoning (Wei et al., 2022b), numerical reasoning (Zhou et al., 2022) and commonsense reasoning tasks (Talmor et al., 2019). In addition, such prompts also enable LLM to generate reasonable explanations to justify the reasoning outcomes.
In this paper, we consider the problem of leveraging these elicited explanations from LLM to improve the training of small reasoners. Small language models (SLM)¹ could be more favorable than LLM in many real situations due to their low cost in both storage and computation. Nevertheless, one important open question is how to close the performance gap with respect to LLM on complicated reasoning tasks, as observed in Zelikman et al. (2022), especially in few-shot settings (Li et al., 2019). Surprisingly, Hase et al. (2020) shows that using human-annotated explanations does not improve performance compared to standard finetuning on T5 (Raffel et al., 2019). One possible reason is that many human-annotated explanations collected via crowdsourcing (Wiegreffe and Marasović, 2021) can be logically inconsistent and grammatically incorrect (Narang et al., 2020), which restricts the amount of available high-quality explanations. On the other hand, explanation-augmented prompts enable LLM to automatically generate decent explanations (Wiegreffe et al., 2021a), making them a plausible way to generate an arbitrary amount of explanations. Therefore, a key question is: can the explanations generated by LLM improve the reasoning capability of SLM?
In this paper, we show that explanations generated by LLM can consistently improve the reasoning capability of SLM. Our framework is shown in Figure 1. Specifically, we first use several examples with human-written explanations as demonstrations for the LLM and then generate explanations for the training set. We systematically explore three approaches to generating explanations. The first approach uses explanations generated through chain of thought prompting; an explanation is adopted if the LLM's prediction is correct and rejected otherwise (Zelikman et al., 2022). The second is to generate explanations via rationalization prompting conditioned on golden labels (Wiegreffe et al., 2021a).
¹ We argue that small and large models are relative concepts; the same model can be small or large depending on the context.
Figure 1: Overview of the proposed framework. Raw data is fed to the LLM via prompting; the decoded explanations are used, together with the raw data, for multi-task learning of T5, which at inference produces both explanations and predictions.
Intuitively, the first approach may generate higher-quality explanations than the second when the LLM's predictions are correct, as incorrect explanations tend to produce incorrect predictions (Wei et al., 2022b). However, the first approach rejects explanations for problems with incorrect predictions, leaving their explanations empty. The second one, on the other hand, explicitly conditions on golden labels and may still generate useful explanations for problems where chain of thought prompting cannot predict correctly. Therefore, we propose a third, hybrid approach: adopting explanations generated by chain of thought prompting when the LLM's predictions are correct and using rationalization prompting otherwise. As we will show in Section 5, all three explanation generation methods can consistently and significantly improve finetuning baselines without explanations, and our hybrid approach achieves the best results on two of three datasets.
We further adopt a multi-task (MT) learning framework, shown in Figure 2, to utilize the LLM-generated explanations, since (1) it naturally allows training with partially generated explanations and (2) the self-rationalizing model (Wiegreffe et al., 2021b), where the golden label and the human-written explanation are linearly concatenated as the target, performs significantly worse than its MT counterpart (Hase et al., 2020). Interestingly, we find that even with the same MT approaches (i.e., MT-Re (Hase et al., 2020) and MT-Ra (Camburu et al., 2018)) as Hase et al. (2020), we can consistently and significantly improve a strong T5 standard finetuning baseline using LLM-generated explanations, which is in stark contrast to the results in Hase et al. (2020), where finetuning T5 with MT-Re and MT-Ra only achieves results on par with standard finetuning when using crowd-sourced explanations. In addition, we further propose MT-CoT, where the small language model is trained to jointly solve two tasks: (i) directly generating the answer and (ii) generating an explanation and then the answer, as shown in Figure 2(c). Unlike MT-Re and MT-Ra, MT-CoT positions the answer after the explanation, so that the model can learn to derive it from the explanation in the style of chain of thought (Wei et al., 2022b). Our results show that all three explanation generation approaches can improve the reasoning capability of small language models under the MT-Ra, MT-Re and MT-CoT setups, and MT-CoT achieves the best results among the three on two of three datasets. In addition, our method can outperform the standard finetuning baseline by up to 8.1% in accuracy and even perform better than finetuning/prompting a 60x larger GPT-3 (175B) model by up to 9.5% in accuracy on CommonsenseQA. Finally, as a side benefit, human evaluation further shows that our method can generate high-quality explanations to justify its predictions, moving towards the goal of explainable AI (Samek et al., 2019).
In a nutshell, we summarize our contributions as follows:

• We show that multi-task learning with explanations from LLM can consistently and significantly improve strong T5 single-task finetuning baselines across various settings.

• We propose a hybrid prompting approach to generating explanations from LLM, together with MT-CoT, to further improve our learning-with-explanations-from-LLM paradigm.

• We demonstrate that our method can perform better than finetuning/prompting a 60x larger GPT-3 model (175B) by up to 9.5% in accuracy on CommonsenseQA, and can generate high-quality explanations to justify its predictions, towards the goal of explainable AI.
2 Related Work
Prompting with Explanations. Recently, a new learning paradigm, in-context learning, where several training examples are used as demonstrations for LLM without any parameter update, has shown promising results on various NLP tasks (Brown et al., 2020). Although promising, LLM still struggle with tasks requiring strong reasoning capability (Wei et al., 2022b). To enable better few-shot in-context learning of LLM on reasoning tasks, Wei et al. (2022b) proposes chain of thought prompting, which provides intermediate reasoning steps as explanations in prompts before answers and has achieved state-of-the-art results in arithmetic, symbolic and commonsense reasoning tasks. Zhou et al. (2022) further extends chain of thought prompting with least-to-most prompting, which decomposes a complex problem into a list of subproblems in natural language and then sequentially solves these subproblems in a recursive fashion. Kojima et al. (2022) moves one step further and shows that LLM are zero-shot reasoners: simply adding "Let's think step by step" to the prompt, without any demonstration, suffices. Unlike these works, Lampinen et al. (2022) explores prompting with explanations after answers, where answers are fed into the LLM before their explanations in prompts, and also observes consistent gains.
There also exists work that utilizes explanations generated from LLM rather than focusing on their final predictions. Wiegreffe et al. (2021a) explores utilizing LLM to annotate explanations for existing datasets and proposes a sample-then-filter paradigm with human annotations. Ye and Durrett (2022) proposes to utilize a calibrator to calibrate GPT-3, as they find that GPT-3 tends to generate consistent but less factual explanations for textual reasoning tasks. However, none of these works explores whether such noisy explanations generated from LLM, without human-involved filtering, can be used to improve SLM reasoning capability. The closest work to ours is STaR (Zelikman et al., 2022). STaR begins by prompting a decent large language model, GPT-J with 6B parameters (Wang, 2021), possibly including answer hints, via chain of thought prompting to generate explanations with incorrect answer rejection. After that, they utilize the filtered training datasets with explanations to finetune GPT-J as a teacher model, and then use the teacher model to generate explanations for the training datasets to train a student GPT-J model iteratively in a self-training fashion until performance plateaus. However, STaR often requires dozens of iterations to converge, which is both time-consuming and compute-intensive for a large 6B model. What's worse, their method may not be applicable to smaller language models, e.g. GPT-2 (Radford et al., 2019), and strong non-autoregressive models, e.g. T5, as they may not generate high-quality explanations with prompting. In addition, they only focus on chain-of-thought-style prompting and finetuning, while our approach can improve SLM across model sizes, explanation generation methods and multi-task finetuning methods.
Learning with Explanations. Learning with explanations has been commonly studied in robotics (Johnson, 1994) and computer vision (Hendricks et al., 2016). Recently, it has received increasing attention in NLP as well. Camburu et al. (2018) proposes MT-Ra for the natural language inference task with LSTM and does not observe gains over single-task finetuning. Narang et al. (2020) utilizes the MT-Ra setup on both T5-base and T5-11B models but mainly focuses on explanation generation. Instead, Rajani et al. (2019) observes improvements with two-stage finetuning using human-annotated explanations for a commonsense reasoning task, where the first stage trains a model for explanation generation with GPT (Radford et al., 2018) and the second uses explanations as input to train a classification model based on BERT (Devlin et al., 2019). However, Hase et al. (2020) finds that both two-stage finetuning and multi-task learning with the MT-Re and MT-Ra setups only obtain results comparable to standard finetuning baselines on T5. We instead show that MT-Re, MT-Ra and our proposed MT-CoT with explanations from LLM can consistently and significantly outperform standard finetuning baselines without an accuracy-explanation trade-off (Jain et al., 2020).
3 Explanation Generation from LLM
Problem setup. Denote $D=\{(x_i, y_i)\}_{i=1}^{N}$ to be a dataset with $N$ training instances, where $x_i$ is a problem and $y_i$ is its answer. Also, we have a handful of human-written instances $E=\{(x^p_i, e^p_i, y^p_i)\}_{i=1}^{M}$, where $e^p_i$ is a free-text explanation of why problem $x^p_i$ has $y^p_i$ as its answer, and $\{(x^p_i, y^p_i)\}_{i=1}^{M} \subset D$ with $M \ll N$ (we set $M=7$ in our experiments). Our goal is to fully leverage the LLM, with $E$ as demonstrations for in-context learning, to generate an explanation $e_i$ for every $(x_i, y_i)$, $1 \le i \le N$, so that we can utilize these generated explanations from LLM to improve SLM reasoning capability.
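To ground the notation, the sketch below shows one plausible in-memory representation of $D$ and $E$; the dictionary keys are our own naming, and the example question and explanation are taken from Figure 2.

```python
# A minimal sketch of the data assumed in the problem setup (field names are illustrative).

# D: N training instances (x_i, y_i) -- problems with answers but no explanations.
train_set = [
    {"x": "The only baggage the woman checked was a drawstring bag, where was she "
          "heading with it? Answer Choices: (a) garbage can (b) military "
          "(c) jewelry store (d) safe (e) airport",
     "y": "(e)"},
    # ... N instances in total
]

# E: M human-written exemplars (M = 7 in our experiments), each adding a free-text
# explanation e^p_i of why x^p_i has y^p_i as its answer; {(x^p_i, y^p_i)} is a subset of D.
exemplars = [
    {"x": train_set[0]["x"],
     "e": "The answer must be a place where someone would check a bag. The only "
          "place where someone would check a bag is at an airport.",
     "y": "(e)"},
    # ... M exemplars in total
]
```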
COTE. A chain of thought is a series of intermediate reasoning steps before the final answer to a problem, mimicking the deliberate thinking process humans use to perform complicated reasoning tasks (Wei et al., 2022b).
Figure 2: The comparison among (a) MT-Re (Hase et al., 2020), (b) MT-Ra (Camburu et al., 2018) and (c) our proposed MT-CoT for multi-task learning with explanations in the text-to-text format of T5, illustrated on a CommonsenseQA example. The left parts are inputs to T5 and the right parts are targets for the different multi-task learning setups. The qta (question to answer) task is trained to directly generate answers in all three modes, while the qtr (question to reason) task is trained to generate the reasoning, rationalization and chain of thought for (a) MT-Re, (b) MT-Ra and (c) MT-CoT, respectively.
Chain of thought prompting provides such intermediate reasoning steps as explanations before the answers in prompts. Formally, for $1 \le i \le N$, we first concatenate all instances in $E$ and $x_i$ as the prompt $\hat{p}_i = (x^p_1, e^p_1, y^p_1, \ldots, x^p_M, e^p_M, y^p_M, x_i)$. We then feed the prompt $\hat{p}_i$ into the LLM and greedily decode until a stop token is generated. After that, we parse the decoded sentence into an explanation part $\hat{e}_i$ and a prediction part $\hat{y}_i$. Intuitively, if $\hat{y}_i \neq y_i$, then $\hat{e}_i$ may not be of high quality, as incorrect explanations tend to produce incorrect predictions (Wei et al., 2022b). Thus, we utilize Chain Of Thought prompting with incorrect answer rEjection (COTE) (Zelikman et al., 2022): we adopt $e_i := \hat{e}_i$ if $\hat{y}_i = y_i$; otherwise, we reject $\hat{e}_i$ and set $e_i$ to none.
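As a minimal sketch of COTE, the snippet below builds a chain-of-thought prompt from $E$, greedily decodes, and keeps an explanation only when the parsed prediction matches the gold answer. The Q/A layout, the "So the answer is" pattern used to split explanation from prediction, and the `llm_complete` callable are illustrative assumptions; the paper only specifies the $(x, e, y)$ exemplar ordering, greedy decoding and incorrect-answer rejection.

```python
import re

def build_cot_prompt(exemplars, x_i):
    # Exemplars are ordered (x, e, y): the explanation precedes the answer,
    # and the new problem x_i is appended at the end.
    parts = [f"Q: {ex['x']}\nA: {ex['e']} So the answer is {ex['y']}.\n" for ex in exemplars]
    parts.append(f"Q: {x_i}\nA:")
    return "\n".join(parts)

def cote(llm_complete, exemplars, train_set):
    """Chain Of Thought prompting with incorrect answer rEjection (COTE)."""
    explanations = []
    for inst in train_set:
        decoded = llm_complete(build_cot_prompt(exemplars, inst["x"]))  # greedy decoding
        # Split the decoded sentence into an explanation part and a prediction part.
        m = re.search(r"(.*)So the answer is\s*(\(\w\))", decoded, re.S)
        e_hat, y_hat = (m.group(1).strip(), m.group(2)) if m else (decoded.strip(), None)
        # Adopt the explanation only if the prediction matches the gold answer.
        explanations.append(e_hat if y_hat == inst["y"] else None)
    return explanations
```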
RP. Since COTE uses the answers in the original datasets to reject explanations with incorrect predictions, those instances no longer have explanations. To alleviate this issue, an alternative is to apply Rationalization Prompting (RP) (Wiegreffe et al., 2021a) to generate explanations for every instance in the training set. Unlike COTE, RP generates explanations given the golden answers. Specifically, for $1 \le i \le N$, we concatenate all instances in $E$ and $(x_i, y_i)$ as the prompt $\bar{p}_i = (x^p_1, y^p_1, e^p_1, \ldots, x^p_M, y^p_M, e^p_M, x_i, y_i)$. We then feed the prompt $\bar{p}_i$ into the LLM and greedily decode until a stop token is generated. The decoded sentence $\bar{e}_i$ is taken as the explanation, i.e. $e_i := \bar{e}_i$, without any filtering.
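A sketch of RP follows; it differs from the COTE sketch only in ordering the exemplars as $(x, y, e)$ and appending the gold answer of the new instance, so the LLM only has to produce an explanation. The "Why?" marker is an illustrative formatting choice, not the paper's exact prompt.

```python
def build_rp_prompt(exemplars, x_i, y_i):
    # Exemplars are ordered (x, y, e): the gold answer is given, the explanation follows.
    parts = [f"Q: {ex['x']}\nA: {ex['y']}\nWhy? {ex['e']}\n" for ex in exemplars]
    parts.append(f"Q: {x_i}\nA: {y_i}\nWhy?")
    return "\n".join(parts)

def rationalization_prompting(llm_complete, exemplars, train_set):
    """RP: generate an explanation for every instance, conditioned on its gold answer."""
    return [llm_complete(build_rp_prompt(exemplars, inst["x"], inst["y"])).strip()
            for inst in train_set]  # no filtering step
```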
CROP. COTE will likely generate relatively high-quality explanations when the LLM predicts the problem at hand correctly, as incorrect explanations tend to produce incorrect predictions (Wei et al., 2022b). However, for problems with incorrect predictions, COTE sets their explanations to none. On the other hand, RP can generate explanations for every instance in the dataset, but we cannot easily assess their quality without human annotation. Therefore, we propose Chain of Thought with Rationalization PrOmpting backuP (CROP): whenever COTE yields none as the explanation, we utilize RP as a backup approach. Intuitively, if the LLM cannot predict a problem correctly under chain of thought prompting, the problem may be difficult (Zelikman et al., 2022), and RP may still provide a meaningful explanation since it can access the golden label during the explanation generation process.
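CROP then simply backfills the instances that COTE rejected with RP explanations; a sketch reusing the two helpers above:

```python
def crop(llm_complete, exemplars, train_set):
    """Chain of Thought with Rationalization PrOmpting backuP (CROP)."""
    explanations = cote(llm_complete, exemplars, train_set)
    for i, (inst, e) in enumerate(zip(train_set, explanations)):
        if e is None:
            # COTE rejected this instance (incorrect prediction), so fall back to RP,
            # which conditions on the gold answer.
            explanations[i] = llm_complete(
                build_rp_prompt(exemplars, inst["x"], inst["y"])).strip()
    return explanations
```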
4 Multi-task Learning with Explanations
In this section, we elaborate on how to utilize explanations generated from LLM to improve SLM reasoning capability with a multi-task learning framework. We detail three methods for multi-task learning with explanations in the following.
MT-Re. Multi-task Learning with Reasoning (MT-Re) is introduced by Hase et al. (2020) (see Figure 2(a)). MT-Re is trained to directly generate predictions for the qta (question to answer) task, exactly as in standard finetuning without explanations, and to generate explanations without explicitly providing the answers for the qtr (question to reason) task. The training objective of MT-Re mixes the loss $\mathcal{L}_{qta}$ for the qta task and $\mathcal{L}_{qtr}$ for the qtr task:

$$\mathcal{L}_{mt} = \alpha \mathcal{L}_{qta} + (1-\alpha)\,\mathcal{L}_{qtr}. \qquad (1)$$
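In a text-to-text setup, this mixed objective can be realized by weighting per-example losses (or, equivalently in expectation, by sampling qta and qtr examples with probabilities α and 1 − α). The sketch below takes the explicit weighting route with a Hugging Face T5 model; the task prefixes and the value of α are illustrative assumptions rather than the exact training configuration.

```python
from transformers import T5ForConditionalGeneration, T5Tokenizer

tokenizer = T5Tokenizer.from_pretrained("t5-base")
model = T5ForConditionalGeneration.from_pretrained("t5-base")
alpha = 0.5  # mixing weight between the qta and qtr losses (illustrative value)

def seq2seq_loss(source, target):
    enc = tokenizer(source, return_tensors="pt", truncation=True)
    labels = tokenizer(target, return_tensors="pt", truncation=True).input_ids
    return model(**enc, labels=labels).loss

def mt_re_loss(question, answer, explanation):
    # qta: question -> answer, the same target as standard finetuning.
    l_qta = seq2seq_loss("qta question: " + question, answer)
    if explanation is None:
        # Partially generated explanations (e.g. COTE rejections) fall back to qta only.
        return l_qta
    # qtr: question -> explanation, without explicitly providing the answer (MT-Re).
    l_qtr = seq2seq_loss("qtr question: " + question, explanation)
    return alpha * l_qta + (1 - alpha) * l_qtr
```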