RAINIER: Reinforced Knowledge Introspector
for Commonsense Question Answering
Jiacheng Liu  Skyler Hallinan  Ximing Lu  Pengfei He
Sean Welleck  Hannaneh Hajishirzi  Yejin Choi
Paul G. Allen School of Computer Science & Engineering, University of Washington
Allen Institute for Artificial Intelligence
liujc@cs.washington.edu
Abstract
Knowledge underpins reasoning. Recent research demonstrates that when relevant knowledge is provided as additional context for commonsense question answering (QA), it can substantially enhance performance even on top of state-of-the-art models. The fundamental challenge is where and how to find such knowledge that is high-quality and on point with respect to the question: knowledge retrieved from knowledge bases is incomplete, and knowledge generated by language models is inconsistent. We present RAINIER¹, or Reinforced Knowledge Introspector, which learns to generate contextually relevant knowledge in response to given questions. Our approach starts by imitating knowledge generated by GPT-3, then learns to generate its own knowledge via reinforcement learning, where rewards are shaped based on the increased performance on the resulting question answering. RAINIER demonstrates substantial and consistent performance gains when tested over 9 different commonsense benchmarks, including 5 datasets that are seen during model training and 4 datasets that are kept unseen. Our work is the first to report that knowledge generated by models that are orders of magnitude smaller than GPT-3, even without direct supervision on the knowledge itself, can exceed the quality of commonsense knowledge elicited from GPT-3.
1 Introduction
Figure 1: RAINIER can introspect for commonsense knowledge that underpins the reasoning process, and is trained via reinforcement learning, where the reward is derived from the effectiveness of the knowledge when prompting a frozen, generic QA model.

Commonsense is a significant challenge for modern NLP models, due to the obscurity of the underlying knowledge that grounds the reasoning process. While humans are generally able to introspect the underlying reasons for their conclusions (Mercier and Sperber, 2017), neural models lack the capability to verbalize the premises leading to their predictions. This hinders models' performance and robustness on commonsense tasks, and makes it difficult to inspect their points of failure. Recent research has demonstrated that relevant knowledge can provide useful context for approaching commonsense tasks. Yet these methods either retrieve from in-domain knowledge bases (Mitra et al., 2019; Chang et al., 2020) that do not have good coverage of commonsense, or generate knowledge from neural models (Shwartz et al., 2020; Gu et al., 2022; Liu et al., 2022), which often requires domain-specific engineering and very large models (e.g. GPT-3 (Brown et al., 2020)). It is still an open challenge to systematically find high-quality knowledge.

¹Code, model, and knowledge-extended datasets are available at http://github.com/liujch1998/rainier
In this work, we use a novel, reinforcement-learning-based method to develop RAINIER, a generative neural model that can introspect the underlying knowledge for reasoning about commonsense questions. As illustrated in Figure 1, RAINIER is trained to generate knowledge statements that are both fluent natural language and useful prompts that optimize the performance of a generic question answering (QA) model. Our model (1) demonstrates strong generalization to unseen benchmarks without additional engineering effort (e.g. finetuning), (2) produces commonsense knowledge of high quality and diversity, and (3) is substantially smaller in size than GPT-3, the best knowledge source reported so far.
To train RAINIER, we optimize knowledge introspection for the resulting QA performance rather than with direct supervision, because commonsense datasets usually provide no gold knowledge labels. To ensure that our model learns to generate generically useful knowledge for a broad range of QA models, we train only RAINIER, the knowledge introspector, without finetuning the QA model. Since our desired knowledge consists of sequences of discrete, non-differentiable word tokens, we adapt a reinforcement learning algorithm, Proximal Policy Optimization (PPO) (Schulman et al., 2017; Ouyang et al., 2022), to optimize the knowledge introspector. Specifically, the reward is defined as the effect of RAINIER-generated knowledge on the QA model's prediction. We train RAINIER in a multitask setting on 8 commonsense QA datasets, encompassing general, scientific, physical, and social commonsense, to equip the model with better generalization to unseen benchmarks.

Experiments show that RAINIER substantially improves the performance of QA models on 9 commonsense benchmarks (5 datasets seen during training and 4 unseen datasets), and gives larger and more consistent gains than few-shot GPT-3 (Liu et al., 2022) despite being 16x smaller in parameter count. It also boosts the performance of QA models that it is not trained against, indicating that it generates generically useful knowledge instead of merely hacking the reward given by a single QA model. Knowledge generated by RAINIER can even boost a QA model that is 4x larger than it, showing the promise of using model-generated knowledge as a complement to model scaling for making progress in commonsense reasoning. Our analyses show that the knowledge generated by RAINIER is of high quality, and is diverse in terms of domain (e.g. scientific, social), relation expressed (e.g. part of, member of, purpose), and syntactic property (e.g. negation, comparison). The effect of this knowledge on the QA model also aligns well with human judgments. The success of RAINIER shows that moderately-sized models can serve as a source of high-quality and useful commonsense knowledge that facilitates reasoning. We publicly release the code, the trained RAINIER model, and the commonsense datasets extended with knowledge generated by RAINIER.

Algorithm 1 Training RAINIER
Input: initial policy model $\theta_0$, initial value model $\phi_0$, pretrained QA model $\psi_{\mathrm{QA}}$
  $\mathcal{D}_{\mathrm{imit}} \leftarrow$ Get silver knowledge on $\mathcal{D}_{\mathrm{seen}}$ from GPT-3.
  $\theta_{\mathrm{imit}} \leftarrow$ Optimize $\theta_0$ with Eqn 2 on $\mathcal{D}_{\mathrm{imit}}$.    ▷ Section 2.1
  $\theta_{\mathrm{RAINIER}} \leftarrow$ ReinforcedLearning($\mathcal{D}_{\mathrm{seen}}$, $\theta_{\mathrm{imit}}$, $\phi_0$, $\psi_{\mathrm{QA}}$)    ▷ Section 2.2
Output: $\theta_{\mathrm{RAINIER}}$

procedure ReinforcedLearning($\mathcal{D}_{\mathrm{seen}}$, $\theta$, $\phi$, $\psi_{\mathrm{QA}}$)
  $\theta_{\mathrm{old}} \leftarrow \theta$, $\phi_{\mathrm{old}} \leftarrow \phi$
  for iteration = 1, 2, ... do
    Sample a minibatch from $\mathcal{D}_{\mathrm{seen}}$.
    for step = 1, 2, ..., $s$ do
      Compute $\mathcal{L}_{\mathrm{PPO}}$ on the minibatch with Eqn 3.
      Optimize $\theta$ and $\phi$ with $\mathcal{L}_{\mathrm{PPO}}$ for one step.
    $\theta_{\mathrm{old}} \leftarrow \theta$, $\phi_{\mathrm{old}} \leftarrow \phi$
  return $\theta$
2 Method
Problem Overview. We focus on the task of multiple-choice commonsense QA, consisting of instances of the format $x = (q, A, a^*)$, where $q$ is the question, $A$ is the set of candidate answers, and $a^* \in A$ is the correct answer. For full contextualization, we append the candidate answers $A$ to the question $q$ to form the input to the QA model as follows:

q = {question} (A) {choice_A} (B) {choice_B} ...
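For illustration, a minimal sketch of this input construction (the helper name and example are ours, not from the paper):

from string import ascii_uppercase

def format_question(question: str, choices: list[str]) -> str:
    """Append lettered answer choices to the question, in the format above."""
    labeled = " ".join(f"({ascii_uppercase[i]}) {c}" for i, c in enumerate(choices))
    return f"{question} {labeled}"

# format_question("What do people use to cut paper?", ["scissors", "spoon", "pillow"])
# -> "What do people use to cut paper? (A) scissors (B) spoon (C) pillow"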
Common approaches only train supervised QA models. As a complement, we train a separate model, which we refer to as RAINIER, that can introspect question-specific knowledge that is useful for prompting a fixed QA model. RAINIER is a sequence-to-sequence language model, $p_K(k \mid q; \theta)$, and we expect it to generate knowledge statements ($k$'s) in response to the given question ($q$). However, the challenge is that we have no gold knowledge labels as supervision.
Training. Since we do not have gold knowledge to train RAINIER, we obtain this model by finetuning a pretrained language model in two stages: (I) imitation learning, and then (II) reinforcement learning. In Stage I (§2.1), we get silver knowledge labels on some datasets from GPT-3, and teach our model to imitate this knowledge-generating GPT-3. This equips our model with the basic functionality of knowledge generation. In Stage II (§2.2), we use reinforcement learning to continue training the model obtained in Stage I, making the generated knowledge more useful while keeping it fluent and meaningful. Specifically, we set the reward to be the effect of the generated knowledge on the prediction made by a fixed, generic QA model. We obtain silver knowledge and train RAINIER on the union of multiple QA datasets (which are considered seen during training), i.e. $\mathcal{D}_{\mathrm{seen}} = \bigcup_{d=1}^{|\mathrm{seen}|} \mathcal{D}_d$, where $\mathcal{D}_d = \{(q_j, A_j, a^*_j)\}_{j=1}^{|\mathcal{D}_d|}$. The generic QA model we use may or may not have been trained on these seen datasets. The complete training process is outlined in Algorithm 1.
Inference. The effectiveness of RAINIER is evaluated on a set of unseen QA datasets, $\mathcal{D}_{\mathrm{unseen}}$, in addition to the seen datasets. Note that RAINIER is not trained on any unseen dataset: we neither get silver knowledge from it, nor do imitation learning or reinforcement learning on it. The generic QA model we use was not trained on any unseen dataset either. We discuss details of inference in §2.3.
2.1 Training Stage I: Imitation Learning
In Stage I, we train RAINIER so that it generates fluent and meaningful natural language statements that resemble knowledge. There is no large-scale commonsense QA dataset labeled with high-quality knowledge, but GPT-3 has been shown to be a good generator of relevant knowledge (Liu et al., 2022). Therefore, we get silver knowledge from GPT-3 on our seen datasets. Following Liu et al. (2022), we elicit question-related knowledge by prompting GPT-3 with a task-specific set of few-shot demonstrations (see §C for details on the prompts), and decoding $M$ knowledge statements for each question:
$$K(q) = \big\{k_m : k_m \sim p_G(k \mid \mathrm{prompt}(\mathrm{task}(q)), q)\big\},$$
where $p_G(\cdot \mid \cdot)$ denotes GPT-3 with nucleus sampling with $p = 0.5$ (Holtzman et al., 2020). This yields a silver dataset of question-knowledge pairs:
$$\mathcal{D}_{\mathrm{imit}} = \big\{(q, k) : (q, A, a^*) \in \mathcal{D}_{\mathrm{seen}}, \; k \in K(q)\big\}. \quad (1)$$
We then train RAINIER, starting from a pretrained sequence-to-sequence language model, on this silver dataset with the standard supervised loss:
$$\mathcal{L}_{\mathrm{train}}(\theta) = -\sum_{(q,k) \in \mathcal{D}^{\mathrm{train}}_{\mathrm{imit}}} \log p_K(k \mid q; \theta). \quad (2)$$
The parameterization of the resulting model is denoted as $\theta_{\mathrm{imit}}$.
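A minimal sketch of this imitation stage using Hugging Face Transformers is shown below; the batch size, learning rate, and the toy silver pair are illustrative assumptions rather than the paper's configuration, and the loss is the standard token-averaged negative log-likelihood, corresponding to Eqn 2 up to normalization.

import torch
from torch.utils.data import DataLoader
from transformers import AutoTokenizer, T5ForConditionalGeneration

device = "cuda" if torch.cuda.is_available() else "cpu"
tokenizer = AutoTokenizer.from_pretrained("t5-large")
model = T5ForConditionalGeneration.from_pretrained("t5-large").to(device)
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-5)  # assumed learning rate

# Toy stand-in for D_imit (Eqn 1): (question, silver knowledge) pairs distilled from GPT-3.
silver_pairs = [
    ("Where would you keep milk cold? (A) fridge (B) oven (C) drawer",
     "A refrigerator is used to keep food and drinks cold."),
]

def collate(batch):
    questions, knowledges = zip(*batch)
    enc = tokenizer(list(questions), padding=True, truncation=True, return_tensors="pt")
    dec = tokenizer(list(knowledges), padding=True, truncation=True, return_tensors="pt")
    labels = dec.input_ids.masked_fill(dec.input_ids == tokenizer.pad_token_id, -100)
    return enc.input_ids, enc.attention_mask, labels

loader = DataLoader(silver_pairs, batch_size=16, shuffle=True, collate_fn=collate)

model.train()
for input_ids, attention_mask, labels in loader:
    # Seq2seq cross-entropy on the silver knowledge, i.e. -log p_K(k | q; theta).
    loss = model(input_ids=input_ids.to(device),
                 attention_mask=attention_mask.to(device),
                 labels=labels.to(device)).loss
    loss.backward()
    optimizer.step()
    optimizer.zero_grad()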
2.2 Training Stage II: Reinforcement Learning
As we will see in the empirical results, the imitation model obtained in Stage I does not provide the most beneficial knowledge. Therefore, in Stage II, we continue optimizing RAINIER to generate knowledge that best prompts the QA model, by directly maximizing the reward given by this QA model.
Knowledge generation as reinforcement learning. Since knowledge statements ($k$'s) are discrete and thus non-differentiable, we adopt a reinforcement learning approach, and consider knowledge generation as a sequential decision-making process over the natural language vocabulary space. We consider the generation of a knowledge statement $k$ with $T$ tokens as an episode of length $T$. At step $t \in [1, T]$, the state $s_t = (q, k_{<t})$ is the combination of the question and the knowledge decoded up to the $(t-1)$-th token; the action $a_t = k_t$ is the $t$-th token to decode. The RAINIER model, $p_K(k_t \mid q, k_{<t}; \theta)$, is the policy model that we optimize. We define a reward function $r(x, k)$ that characterizes the effect of the knowledge on the QA model's prediction, and discuss the definition of this reward function in §2.2.1.
To ensure that the generated knowledge stays fluent and meaningful, we would like the learned policy model not to move too far from the initial imitation model. Therefore, we add to the reward an (approximate) KL penalty between the learned policy and the initial policy (Ouyang et al., 2022),
$$R(x, k) = r(x, k) - \beta \log \frac{p_K(k \mid q; \theta)}{p_K(k \mid q; \theta_{\mathrm{imit}})}.$$
Since this reward is computed based on the full knowledge statement, we assign it to the last step of the episode; non-terminal steps are assigned zero reward. Formally,
$$r_T = R(x, k) \;\; (\text{where } T = |k| \text{ and } k_T = \text{[EOS]}); \qquad r_t = 0 \;\; (\text{where } 1 \le t < T).$$
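A sketch of this reward assignment, assuming per-token log-probabilities of the sampled knowledge have already been computed under both the current policy and the frozen imitation policy (the β value is a placeholder, not the paper's setting):

import torch

def per_step_rewards(r_xk: float,
                     logp_policy: torch.Tensor,  # (T,): log p_K(k_t | q, k_<t; theta)
                     logp_imit: torch.Tensor,    # (T,): log p_K(k_t | q, k_<t; theta_imit)
                     beta: float = 0.2) -> torch.Tensor:
    """Zero reward at non-terminal steps; KL-penalized sequence reward at the final step."""
    rewards = torch.zeros_like(logp_policy)
    log_ratio = (logp_policy - logp_imit).sum()  # log [p_K(k|q;theta) / p_K(k|q;theta_imit)]
    rewards[-1] = r_xk - beta * log_ratio
    return rewards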
We employ Proximal Policy Optimization (PPO)² (Schulman et al., 2017) as our reinforcement learning algorithm, and adapt the PPO implementation of Ouyang et al. (2022). Aside from the policy model, PPO additionally uses a value model (parameterized by $\phi$) to estimate the value function for states with incompletely decoded text, i.e. $V(s_t; \phi)$ for any $t$. PPO minimizes a joint loss,
$$\mathcal{L}_{\mathrm{PPO}}(\theta, \phi) = \mathcal{L}_{\mathrm{Policy}}(\theta) + \alpha \cdot \mathcal{L}_{\mathrm{Value}}(\phi), \quad (3)$$
where $\mathcal{L}_{\mathrm{Policy}}(\theta)$ is the loss on the policy model, $\mathcal{L}_{\mathrm{Value}}(\phi)$ is the loss on the value model, and $\alpha$ is a hyperparameter.

²We choose PPO because it has shown successful results in other NLP tasks (Nakano et al., 2021; Stiennon et al., 2020). Our earlier experiments with REINFORCE did not show promising results.
Policy loss. To obtain the policy loss, we first compute the truncated estimated advantage function,
$$\hat{A}_t = \sum_{t'=t}^{T-1} (\gamma\lambda)^{t'-t} \delta_{t'}, \quad \text{where } \delta_{t'} = r_{t'} + \gamma V(s_{t'+1}; \phi) - V(s_{t'}; \phi),$$
and the value functions $V(\cdot)$ are estimated by the value model. PPO then maximizes the empirical expectation of a so-called clipped surrogate objective term,
$$\mathrm{cso}(\hat{A}_t, \nu_t(\theta), \varepsilon) = \min\Big(\nu_t(\theta)\hat{A}_t, \; \mathrm{clip}(\nu_t(\theta), 1-\varepsilon, 1+\varepsilon)\hat{A}_t\Big),$$
where $\nu_t(\theta) = \frac{p_K(k_t \mid q, k_{<t}; \theta)}{p_K(k_t \mid q, k_{<t}; \theta_{\mathrm{old}})}$ is the ratio between the current policy $\theta$ and a lagging policy $\theta_{\mathrm{old}}$. The lagging policy is updated to the current policy at a fixed interval of $s$ training steps, and is kept fixed otherwise. We adapt this to our use case, and define the policy loss as
$$\mathcal{L}_{\mathrm{Policy}}(\theta) = -\hat{\mathbb{E}}\big[\mathrm{cso}(\hat{A}_t, \nu_t(\theta), \varepsilon)\big],$$
where the expectation is taken over all instances in the training data ($x \sim \mathcal{D}^{\mathrm{train}}_{\mathrm{seen}}$), the distribution of model-generated knowledge as determined by the current policy conditioned on the instance's question ($k \sim p_K(k \mid q; \theta)$), and all tokens in the knowledge statement ($t \in [1, |k|]$).
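The sketch below illustrates the advantage estimation and the (negated) clipped surrogate objective for a single episode; the γ, λ, and ε defaults are illustrative assumptions, and advantages and old log-probabilities are treated as constants computed upstream.

import torch

def advantages(rewards: torch.Tensor, values: torch.Tensor,
               gamma: float = 1.0, lam: float = 0.95) -> torch.Tensor:
    """A_t = sum_{t'>=t} (gamma*lam)^(t'-t) * delta_t', with
    delta_t' = r_t' + gamma*V(s_{t'+1}) - V(s_t'); the value past the last state is taken as 0."""
    T = rewards.shape[0]
    adv = torch.zeros(T)
    running = 0.0
    for t in reversed(range(T)):
        next_value = values[t + 1] if t + 1 < T else 0.0
        delta = rewards[t] + gamma * next_value - values[t]
        running = delta + gamma * lam * running
        adv[t] = running
    return adv

def policy_loss(logp_new: torch.Tensor, logp_old: torch.Tensor,
                adv: torch.Tensor, eps: float = 0.2) -> torch.Tensor:
    """Negated clipped surrogate objective, averaged over tokens of the episode."""
    ratio = torch.exp(logp_new - logp_old)          # nu_t(theta)
    clipped = torch.clamp(ratio, 1.0 - eps, 1.0 + eps)
    return -torch.minimum(ratio * adv, clipped * adv).mean()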
Value loss. The value model is trained with an MSE loss with respect to the target value, $V^{\mathrm{targ}}_t$, which in turn is estimated with a lagging value model $\phi_{\mathrm{old}}$:
$$\mathcal{L}_{\mathrm{Value}}(\phi) = \hat{\mathbb{E}}\Big[\big(V(s_t; \phi) - V^{\mathrm{targ}}_t\big)^2\Big], \quad \text{where } V^{\mathrm{targ}}_t = \sum_{t'=t}^{T-1} \gamma^{t'-t} r_{t'} + \gamma^{T-t} V(s_T; \phi_{\mathrm{old}}).$$
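A corresponding sketch of the value targets and value loss, transcribing the formula above via the equivalent recursion $V^{\mathrm{targ}}_t = r_t + \gamma V^{\mathrm{targ}}_{t+1}$ with $V^{\mathrm{targ}}_T = V(s_T; \phi_{\mathrm{old}})$; indexing conventions are simplified here.

import torch

def value_targets(rewards: torch.Tensor, values_old: torch.Tensor,
                  gamma: float = 1.0) -> torch.Tensor:
    """Discounted reward-to-go bootstrapped with the lagging value model's V(s_T; phi_old)."""
    T = rewards.shape[0]
    targets = torch.zeros(T)
    targets[-1] = values_old[-1]                 # V_targ_T = V(s_T; phi_old)
    for t in reversed(range(T - 1)):
        targets[t] = rewards[t] + gamma * targets[t + 1]
    return targets

def value_loss(values_new: torch.Tensor, targets: torch.Tensor) -> torch.Tensor:
    """MSE between current value estimates V(s_t; phi) and the fixed targets."""
    return ((values_new - targets.detach()) ** 2).mean()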
2.2.1 Reward Shaping
We define the reward function in reinforcement learning as the quantified effect of RAINIER's knowledge on the QA model's prediction. Suppose we already have a reasonably good QA model, which assigns a probability score $P_{\mathrm{QA}}(a \mid q)$ to any candidate answer $a \in A$. Since we will use a sequence-to-sequence language model (i.e. UnifiedQA (Khashabi et al., 2020)) as the QA model, we define
$$P_{\mathrm{QA}}(a \mid q) = \frac{\exp S_{\mathrm{QA}}(a \mid q)}{\sum_{a' \in A} \exp S_{\mathrm{QA}}(a' \mid q)}, \quad \text{where } S_{\mathrm{QA}}(a \mid q) = \frac{1}{|a|} \sum_{i=1}^{|a|} \log p_{\mathrm{QA}}(a_i \mid q, a_{<i}; \psi_{\mathrm{QA}}),$$
and $p_{\mathrm{QA}}(a_i \mid q, a_{<i}; \psi_{\mathrm{QA}})$ is the language modeling score received by $a_i$, the $i$-th token of $a$. The naive prediction is the candidate answer that receives the highest $P_{\mathrm{QA}}(a \mid q)$ (or equivalently, the highest $S_{\mathrm{QA}}(a \mid q)$): $\hat{a} = \arg\max_{a \in A} P_{\mathrm{QA}}(a \mid q)$.
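A sketch of this scoring with a seq2seq QA model loaded through Hugging Face Transformers; the checkpoint name is an assumed UnifiedQA release, and batching is omitted for clarity.

import math
import torch
from transformers import AutoTokenizer, AutoModelForSeq2SeqLM

qa_name = "allenai/unifiedqa-t5-large"            # assumed checkpoint name
qa_tokenizer = AutoTokenizer.from_pretrained(qa_name)
qa_model = AutoModelForSeq2SeqLM.from_pretrained(qa_name).eval()

@torch.no_grad()
def s_qa(answer: str, question: str) -> float:
    """Length-normalized log-likelihood S_QA(a | q) of the answer tokens."""
    enc = qa_tokenizer(question, return_tensors="pt")
    labels = qa_tokenizer(answer, return_tensors="pt").input_ids
    out = qa_model(**enc, labels=labels)
    return -out.loss.item()                       # loss is mean token-level NLL, so S_QA = -loss

def p_qa(question: str, choices: list[str]) -> dict[str, float]:
    """Softmax of S_QA over all candidate answers."""
    scores = {a: s_qa(a, question) for a in choices}
    z = sum(math.exp(s) for s in scores.values())
    return {a: math.exp(s) / z for a, s in scores.items()}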
We aim at maximizing $P_{\mathrm{QA}}(a^* \mid q \circ k)$, the probability score received by the correct answer when the QA model is prompted with the knowledge $k$ generated by RAINIER, where $\circ$ denotes text concatenation. One naive definition of the reward function may be
$$r(x, k) = P_{\mathrm{QA}}(a^* \mid q \circ k) - P_{\mathrm{QA}}(a^* \mid q).$$
However, this reward only captures the absolute change of the score, not whether the model prediction changes. To remedy this, we define the reward function as
$$r(x, k) = \frac{1}{2}\Big[\tanh\big(S_{\mathrm{QA}}(a^* \mid q \circ k) - \max_{a' \in A, a' \neq a^*} S_{\mathrm{QA}}(a' \mid q \circ k)\big) - \tanh\big(S_{\mathrm{QA}}(a^* \mid q) - \max_{a' \in A, a' \neq a^*} S_{\mathrm{QA}}(a' \mid q)\big)\Big].$$
Intuitively, this function gives a reward near $+1$ if the naive prediction is incorrect (i.e. $S_{\mathrm{QA}}(a^* \mid q) < \max_{a' \in A, a' \neq a^*} S_{\mathrm{QA}}(a' \mid q)$) while the knowledge-prompted prediction is correct (i.e. $S_{\mathrm{QA}}(a^* \mid q \circ k) > \max_{a' \in A, a' \neq a^*} S_{\mathrm{QA}}(a' \mid q \circ k)$). Similarly, the reward is near $-1$ if the naive prediction is correct but the knowledge-prompted prediction is incorrect. The hyperbolic tangent serves as a smoothed sign function, and provides a soft interpolation between the two polarities of reward values by taking into account the margin of the correct answer.
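A sketch of this reward, written in terms of precomputed S_QA scores for each candidate (e.g. from the s_qa helper sketched above), with and without the knowledge:

import math

def margin(scores: dict[str, float], gold: str) -> float:
    """S_QA of the correct answer minus the best distractor's S_QA."""
    best_wrong = max(s for a, s in scores.items() if a != gold)
    return scores[gold] - best_wrong

def reward(scores_q: dict[str, float],    # S_QA(a | q) for each candidate a
           scores_qk: dict[str, float],   # S_QA(a | q ∘ k) for each candidate a
           gold: str) -> float:
    """0.5 * [tanh(margin with knowledge) - tanh(margin without knowledge)]."""
    return 0.5 * (math.tanh(margin(scores_qk, gold)) - math.tanh(margin(scores_q, gold)))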
We also experiment with some alternative definitions of the reward function; see Table 4.
Reward normalization. To stabilize training, we apply an affine transformation to the rewards so that they are initially normalized. Before starting Stage II training, we use the imitation model to generate a knowledge statement for each training instance, and estimate the population mean and standard deviation of the rewards:
$$R_{\mathrm{init}} = \big\{r(x, k) : x \in \mathcal{D}^{\mathrm{train}}_{\mathrm{seen}}, \; k \sim p_K(\cdot \mid q; \theta_{\mathrm{imit}})\big\}, \quad \mu_0 = \mu(R_{\mathrm{init}}), \; \sigma_0 = \sigma(R_{\mathrm{init}}). \quad (4)$$
In Stage II training, each reward is normalized as
$$r(x, k) \leftarrow \frac{r(x, k) - \mu_0}{\sigma_0}. \quad (5)$$
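A small sketch of this normalization, assuming the rewards of imitation-model rollouts over the training set have been collected into a list:

import statistics

def fit_normalizer(init_rewards: list[float]) -> tuple[float, float]:
    """Estimate mu_0 and sigma_0 from rewards of imitation-model knowledge (Eqn 4)."""
    return statistics.mean(init_rewards), statistics.stdev(init_rewards)

def normalize(r: float, mu0: float, sigma0: float) -> float:
    """Affine transformation applied to every reward during Stage II (Eqn 5)."""
    return (r - mu0) / sigma0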
2.3 Inference: Knowledge Prompting and Aggregation
Following Liu et al. (2022), at inference time we use RAINIER to generate multiple knowledge statements per question, and prompt the QA model by individually concatenating each knowledge statement to the question. The knowledge statements are generated by RAINIER with nucleus sampling with $p = 0.5$ (Holtzman et al., 2020),
$$K(q) = \{\varepsilon\} \cup \big\{k_m : k_m \sim p^{p=0.5}_K(k \mid q; \theta), \; m = 1 \ldots M\big\},$$
where $M$ is the number of knowledge statements per question, and $\varepsilon$ denotes the empty string. We collect a set of outputs by prompting with each knowledge statement. The final prediction is the candidate answer that receives the maximum confidence,
$$\hat{a} = \arg\max_{a \in A} \max_{k \in K(q)} P_{\mathrm{QA}}(a \mid q \circ k),$$
and the prediction is supported by a single knowledge statement, the selected knowledge,
$$\hat{k} = \arg\max_{k \in K(q)} \max_{a \in A} P_{\mathrm{QA}}(a \mid q \circ k).$$
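A sketch of this aggregation, reusing the p_qa scoring helper sketched in §2.2.1 and assuming the M knowledge statements have already been sampled from RAINIER; the concatenation follows the q ∘ k = "{q} \n {k}" convention given in §3.

def predict(question: str, choices: list[str], knowledges: list[str]):
    """Return the answer with maximum confidence over all knowledge prompts, plus the selected knowledge."""
    best_answer, best_knowledge, best_conf = None, "", float("-inf")
    for k in [""] + knowledges:                      # "" corresponds to the empty string epsilon
        prompt = f"{question} \n {k}" if k else question   # q ∘ k
        probs = p_qa(prompt, choices)                # P_QA(a | q ∘ k), see the sketch in §2.2.1
        answer, conf = max(probs.items(), key=lambda kv: kv[1])
        if conf > best_conf:
            best_answer, best_knowledge, best_conf = answer, k, conf
    return best_answer, best_knowledge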
Training-time model selection. In Stage II training, we generate only one knowledge statement per question for the validation set.³ Predictions are made using the same knowledge prompting method as above, and the model checkpoint with the maximal accuracy on the union of all validation sets is selected.

³This is for efficiency purposes. We use greedy decoding here because it is more stable than nucleus sampling when generating only one knowledge statement per question.
3 Experiments
Seen datasets. For both imitation learning and reinforcement learning, we use the 8 multiple-choice datasets that UnifiedQA v2 (Khashabi et al., 2022) uses for training: OpenBookQA (Mihaylov et al., 2018), ARC (Clark et al., 2018), AI2Science (Clark et al., 2018), CommonsenseQA (Talmor et al., 2019), QASC (Khot et al., 2020), PhysicalIQA (Bisk et al., 2020), SocialIQA (Sap et al., 2019), and Winogrande (Sakaguchi et al., 2021).⁴
Unseen datasets. We additionally evaluate our method on the following 4 multiple-choice QA datasets that our model was not trained on: NumerSense (Lin et al., 2020), RiddleSense (Lin et al., 2021), QuaRTz (Tafjord et al., 2019), and HellaSwag (Zellers et al., 2019).
Models. For Stage I training, we get silver knowledge from the GPT-3-Curie (13B) model (Brown et al., 2020). The knowledge introspector is initialized with T5-large (Raffel et al., 2019), which has 0.77B parameters. For Stage II training, we initialize the value model with T5-large, and replace the language modeling head with a value regression head, which is initialized from scratch; we use UnifiedQA-large (UQA-large) (Khashabi et al., 2020) as the QA model that provides the reward, which means the text concatenation function is defined as $q \circ k$ = "{q} \n {k}". We use the same question formatting as UnifiedQA. See Table 7 for hyperparameters.
Baselines. We mainly report performance improvements over the vanilla QA baseline (i.e. direct inference with the UnifiedQA-large model, without prompting with RAINIER-generated knowledge). We also consider using knowledge from:

- Few-shot GPT-3 (Liu et al., 2022), where knowledge statements are elicited from the GPT-3-Curie (13B) model, under the same prompts used for getting silver knowledge in Stage I training (§2.1), and the same hyperparameter setting for decoding ($M = 10$ knowledge statements per question, with nucleus sampling with $p = 0.5$).
- Self-talk (Shwartz et al., 2020), where we generate $M = 10$ knowledge statements per question with GPT-3-Curie and a variety of templates.

⁴We exclude MCTest and RACE because most questions in these reading comprehension datasets are too long to fit into our model's input.