RAINIER: Reinforced Knowledge Introspector
for Commonsense Question Answering
Jiacheng Liu  Skyler Hallinan  Ximing Lu  Pengfei He
Sean Welleck  Hannaneh Hajishirzi  Yejin Choi
Paul G. Allen School of Computer Science & Engineering, University of Washington
Allen Institute for Artificial Intelligence
liujc@cs.washington.edu
Abstract
Knowledge underpins reasoning. Recent research demonstrates that when relevant knowledge is provided as additional context for commonsense question answering (QA), it can substantially enhance performance even on top of state-of-the-art models. The fundamental challenge is where and how to find such knowledge that is high-quality and on point with respect to the question: knowledge retrieved from knowledge bases is incomplete, and knowledge generated by language models is inconsistent. We present RAINIER¹, or Reinforced Knowledge Introspector, which learns to generate contextually relevant knowledge in response to given questions. Our approach starts by imitating knowledge generated by GPT-3, then learns to generate its own knowledge via reinforcement learning, where rewards are shaped based on the increased performance on the resulting question answering. RAINIER demonstrates substantial and consistent performance gains when tested over 9 different commonsense benchmarks, including 5 datasets that are seen during model training and 4 datasets that are kept unseen. Our work is the first to report that knowledge generated by models that are orders of magnitude smaller than GPT-3, even without direct supervision on the knowledge itself, can exceed the quality of commonsense knowledge elicited from GPT-3.
1 Introduction
Figure 1: RAINIER can introspect for commonsense knowledge that underpins the reasoning process, and is trained via reinforcement learning, where the reward is derived from the effectiveness of the knowledge when prompting a frozen, generic QA model.

Commonsense is a significant challenge for modern NLP models, due to the obscurity of the underlying knowledge that grounds the reasoning process. While humans are generally able to introspect the underlying reasons for their conclusions (Mercier and Sperber, 2017), neural models lack the capability to verbalize the premises leading to their predictions. This hinders models' performance and robustness on commonsense tasks, and makes it difficult to inspect their points of failure. Recent research has demonstrated that relevant knowledge can provide useful context for approaching commonsense tasks. Yet these methods either retrieve from in-domain knowledge bases (Mitra et al., 2019; Chang et al., 2020) that do not have good coverage of commonsense, or generate knowledge from neural models (Shwartz et al., 2020; Gu et al., 2022; Liu et al., 2022), which often requires domain-specific engineering and very large models (e.g. GPT-3 (Brown et al., 2020)). It is still an open challenge to systematically find high-quality knowledge.

¹Code, model, and knowledge-extended datasets are available at http://github.com/liujch1998/rainier
In this work, we use a novel, reinforcement-learning-based method to develop RAINIER, a generative neural model that can introspect the underlying knowledge for reasoning about commonsense questions. As illustrated in Figure 1, RAINIER is trained to generate knowledge statements that are both fluent natural language and useful prompts that optimize the performance of a generic question answering (QA) model. Our model (1) demonstrates strong generalization to unseen benchmarks without additional engineering effort (e.g. finetuning), (2) produces commonsense knowledge of high quality and diversity, and (3) is substantially smaller in size than GPT-3, the best knowledge source reported so far.
To train RAINIER, we optimize knowledge introspection for the resulting QA performance rather than with direct supervision, because commonsense datasets usually provide no gold knowledge labels. To ensure that our model learns to generate generically useful knowledge for a broad range of QA models, we train only RAINIER, the knowledge introspector, without finetuning the QA model. Since our desired knowledge consists of sequences of discrete, non-differentiable word tokens, we adapt a reinforcement learning algorithm, Proximal Policy Optimization (PPO) (Schulman et al., 2017; Ouyang et al., 2022), to optimize the knowledge introspector. Specifically, the reward is defined as the effect of RAINIER-generated knowledge on the QA model's prediction. We train RAINIER in a multitask setting on 8 commonsense QA datasets, encompassing general, scientific, physical, and social commonsense, to equip the model with better generalization to unseen benchmarks.

Experiments show that RAINIER substantially improves the performance of QA models on 9 commonsense benchmarks (5 datasets seen during training and 4 unseen datasets), and gives larger and more consistent gains than few-shot GPT-3 (Liu et al., 2022) despite being 16x smaller in parameter count. It also boosts the performance of QA models that it is not trained against, indicating that it generates generically useful knowledge instead of merely hacking the reward given by a single QA model. Knowledge generated by RAINIER can even boost a QA model that is 4x larger than it, showing the promise of using model-generated knowledge as a complement to model scaling for making progress in commonsense reasoning. Our analyses show that the knowledge generated by RAINIER is of high quality, and is diverse in terms of domain (e.g. scientific, social), relation expressed (e.g. part of, member of, purpose), and syntactic property (e.g. negation, comparison). The effect of this knowledge on the QA model also aligns well with human judgments. The success of RAINIER shows that moderately-sized models can serve as a source of high-quality and useful commonsense knowledge that facilitates reasoning. We publicly release the code, the trained RAINIER model, and the commonsense datasets extended with knowledge generated by RAINIER.

Algorithm 1 Training RAINIER
Input: initial policy model $\theta_0$, initial value model $\phi_0$, pretrained QA model $\psi_{\mathrm{QA}}$
  $\mathcal{D}_{\mathrm{imit}} \leftarrow$ Get silver knowledge on $\mathcal{D}_{\mathrm{seen}}$ from GPT-3.
  $\theta_{\mathrm{imit}} \leftarrow$ Optimize $\theta_0$ with Eqn 2 on $\mathcal{D}_{\mathrm{imit}}$.    ▷ Section 2.1
  $\theta_{\mathrm{RAINIER}} \leftarrow$ ReinforcedLearning($\mathcal{D}_{\mathrm{seen}}$, $\theta_{\mathrm{imit}}$, $\phi_0$, $\psi_{\mathrm{QA}}$)    ▷ Section 2.2
Output: $\theta_{\mathrm{RAINIER}}$

procedure ReinforcedLearning($\mathcal{D}_{\mathrm{seen}}$, $\theta$, $\phi$, $\psi_{\mathrm{QA}}$)
  $\theta_{\mathrm{old}} \leftarrow \theta$, $\phi_{\mathrm{old}} \leftarrow \phi$
  for iteration = 1, 2, ... do
    Sample a minibatch from $\mathcal{D}_{\mathrm{seen}}$.
    for step = 1, 2, ..., $s$ do
      Compute $\mathcal{L}_{\mathrm{PPO}}$ on the minibatch with Eqn 3.
      Optimize $\theta$ and $\phi$ with $\mathcal{L}_{\mathrm{PPO}}$ for one step.
    $\theta_{\mathrm{old}} \leftarrow \theta$, $\phi_{\mathrm{old}} \leftarrow \phi$
  return $\theta$
2 Method
Problem Overview. We focus on the task of multiple-choice commonsense QA, consisting of instances of the format $x = (q, A, a^*)$, where $q$ is the question, $A$ is the set of candidate answers, and $a^* \in A$ is the correct answer. For full contextualization, we append the candidate answers $A$ to the question $q$ to form the input to the QA model as follows:

q = {question} (A) {choice_A} (B) {choice_B} ...
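For illustration, a minimal sketch of this input construction (the helper name and example are ours, not from the paper):

from string import ascii_uppercase

def format_question(question: str, choices: list[str]) -> str:
    """Append lettered answer choices to the question, in the format above."""
    labeled = " ".join(f"({ascii_uppercase[i]}) {c}" for i, c in enumerate(choices))
    return f"{question} {labeled}"

# format_question("What do people use to cut paper?", ["scissors", "spoon", "pillow"])
# -> "What do people use to cut paper? (A) scissors (B) spoon (C) pillow"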
Common approaches only train supervised QA models. As a complement, we train a separate model, which we refer to as RAINIER, that can introspect question-specific knowledge that is useful for prompting a fixed QA model. RAINIER is a sequence-to-sequence language model, $p_K(k \mid q; \theta)$, and we expect it to generate knowledge statements ($k$'s) in response to the given question ($q$). However, the challenge is that we have no gold knowledge labels as supervision.
Training. Since we do not have gold knowledge to train RAINIER, we obtain this model by finetuning a pretrained language model in two stages: (I) imitation learning, and then (II) reinforcement learning. In Stage I (§2.1), we get silver knowledge labels on some datasets from GPT-3, and teach our model to imitate this knowledge-generating GPT-3. This equips our model with the basic functionality of knowledge generation. In Stage II (§2.2), we use reinforcement learning to continue training the model obtained in Stage I, making the generated knowledge more useful while keeping it fluent and meaningful. Specifically, we set the reward to be the effect of the generated knowledge on the prediction made by a fixed, generic QA model. We obtain silver knowledge and train RAINIER on the union of multiple QA datasets (which are considered seen during training), i.e. $\mathcal{D}_{\mathrm{seen}} = \bigcup_{d=1}^{|\mathrm{seen}|} \mathcal{D}_d$, where $\mathcal{D}_d = \{(q_j, A_j, a^*_j)\}_{j=1}^{|\mathcal{D}_d|}$. The generic QA model we use may or may not have been trained on these seen datasets. The complete training process is outlined in Algorithm 1.
Inference. The effectiveness of RAINIER is evaluated on a set of unseen QA datasets, $\mathcal{D}_{\mathrm{unseen}}$, in addition to the seen datasets. Note that RAINIER is not trained on any unseen dataset: we neither get silver knowledge from it, nor do imitation learning or reinforcement learning on it. The generic QA model we use was not trained on any unseen dataset either. We discuss details of inference in §2.3.
2.1 Training Stage I: Imitation Learning
In Stage I, we train RAINIER so that it generates fluent and meaningful natural language statements that resemble knowledge. There is no large-scale commonsense QA dataset labeled with high-quality knowledge, but GPT-3 has been shown to be a good generator of relevant knowledge (Liu et al., 2022). Therefore, we get silver knowledge from GPT-3 on our seen datasets. Following Liu et al. (2022), we elicit question-related knowledge by prompting GPT-3 with a task-specific set of few-shot demonstrations (see §C for details on the prompts), and decoding $M$ knowledge statements for each question:
$$K(q) = \big\{k_m : k_m \sim p_G(k \mid \mathrm{prompt}(\mathrm{task}(q)), q)\big\},$$
where $p_G(\cdot \mid \cdot)$ denotes GPT-3 with nucleus sampling with $p = 0.5$ (Holtzman et al., 2020). This yields a silver dataset of question-knowledge pairs:
$$\mathcal{D}_{\mathrm{imit}} = \big\{(q, k) : (q, A, a^*) \in \mathcal{D}_{\mathrm{seen}}, \; k \in K(q)\big\}. \quad (1)$$
We then train RAINIER, starting from a pretrained sequence-to-sequence language model, on this silver dataset with the standard supervised loss:
$$\mathcal{L}_{\mathrm{train}}(\theta) = -\sum_{(q,k) \in \mathcal{D}^{\mathrm{train}}_{\mathrm{imit}}} \log p_K(k \mid q; \theta). \quad (2)$$
The parameterization of the resulting model is denoted as $\theta_{\mathrm{imit}}$.
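A minimal sketch of this imitation stage using Hugging Face Transformers is shown below; the batch size, learning rate, and the toy silver pair are illustrative assumptions rather than the paper's configuration, and the loss is the standard token-averaged negative log-likelihood, corresponding to Eqn 2 up to normalization.

import torch
from torch.utils.data import DataLoader
from transformers import AutoTokenizer, T5ForConditionalGeneration

device = "cuda" if torch.cuda.is_available() else "cpu"
tokenizer = AutoTokenizer.from_pretrained("t5-large")
model = T5ForConditionalGeneration.from_pretrained("t5-large").to(device)
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-5)  # assumed learning rate

# Toy stand-in for D_imit (Eqn 1): (question, silver knowledge) pairs distilled from GPT-3.
silver_pairs = [
    ("Where would you keep milk cold? (A) fridge (B) oven (C) drawer",
     "A refrigerator is used to keep food and drinks cold."),
]

def collate(batch):
    questions, knowledges = zip(*batch)
    enc = tokenizer(list(questions), padding=True, truncation=True, return_tensors="pt")
    dec = tokenizer(list(knowledges), padding=True, truncation=True, return_tensors="pt")
    labels = dec.input_ids.masked_fill(dec.input_ids == tokenizer.pad_token_id, -100)
    return enc.input_ids, enc.attention_mask, labels

loader = DataLoader(silver_pairs, batch_size=16, shuffle=True, collate_fn=collate)

model.train()
for input_ids, attention_mask, labels in loader:
    # Seq2seq cross-entropy on the silver knowledge, i.e. -log p_K(k | q; theta).
    loss = model(input_ids=input_ids.to(device),
                 attention_mask=attention_mask.to(device),
                 labels=labels.to(device)).loss
    loss.backward()
    optimizer.step()
    optimizer.zero_grad()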
2.2 Training Stage II: Reinforcement Learning
As we will see in the empirical results, the imitation model obtained in Stage I does not provide the most beneficial knowledge. Therefore, in Stage II, we continue optimizing RAINIER to generate knowledge that best prompts the QA model, by directly maximizing the reward given by this QA model.
Knowledge generation as reinforcement learning. Since knowledge statements ($k$'s) are discrete and thus non-differentiable, we adopt a reinforcement learning approach, and consider knowledge generation as a sequential decision-making process over the natural language vocabulary space. We consider the generation of a knowledge statement $k$ with $T$ tokens as an episode of length $T$. At step $t \in [1, T]$, the state $s_t = (q, k_{<t})$ is the combination of the question and the knowledge decoded up to the $(t-1)$-th token; the action $a_t = k_t$ is the $t$-th token to decode. The RAINIER model, $p_K(k_t \mid q, k_{<t}; \theta)$, is the policy model that we optimize. We define a reward function $r(x, k)$ that characterizes the effect of the knowledge on the QA model's prediction, and discuss the definition of this reward function in §2.2.1.
To ensure that the generated knowledge stays fluent and meaningful, we would like the learned policy model not to move too far from the initial imitation model. Therefore, we add to the reward an (approximate) KL penalty between the learned policy and the initial policy (Ouyang et al., 2022),
$$R(x, k) = r(x, k) - \beta \log \frac{p_K(k \mid q; \theta)}{p_K(k \mid q; \theta_{\mathrm{imit}})}.$$
Since this reward is computed based on the full knowledge statement, we assign it to the last step of the episode; non-terminal steps are assigned zero reward. Formally,
$$r_T = R(x, k) \;\; (\text{where } T = |k| \text{ and } k_T = \text{[EOS]}); \qquad r_t = 0 \;\; (\text{where } 1 \le t < T).$$
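A sketch of this reward assignment, assuming per-token log-probabilities of the sampled knowledge have already been computed under both the current policy and the frozen imitation policy (the β value is a placeholder, not the paper's setting):

import torch

def per_step_rewards(r_xk: float,
                     logp_policy: torch.Tensor,  # (T,): log p_K(k_t | q, k_<t; theta)
                     logp_imit: torch.Tensor,    # (T,): log p_K(k_t | q, k_<t; theta_imit)
                     beta: float = 0.2) -> torch.Tensor:
    """Zero reward at non-terminal steps; KL-penalized sequence reward at the final step."""
    rewards = torch.zeros_like(logp_policy)
    log_ratio = (logp_policy - logp_imit).sum()  # log [p_K(k|q;theta) / p_K(k|q;theta_imit)]
    rewards[-1] = r_xk - beta * log_ratio
    return rewards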
We employ Proximal Policy Optimization (PPO)² (Schulman et al., 2017) as our reinforcement learning algorithm, and adapt the PPO implementation of Ouyang et al. (2022). Aside from the policy model, PPO additionally uses a value model (parameterized by $\phi$) to estimate the value function for states with incompletely decoded text, i.e. $V(s_t; \phi)$ for any $t$. PPO minimizes a joint loss,
$$\mathcal{L}_{\mathrm{PPO}}(\theta, \phi) = \mathcal{L}_{\mathrm{Policy}}(\theta) + \alpha \cdot \mathcal{L}_{\mathrm{Value}}(\phi), \quad (3)$$
where $\mathcal{L}_{\mathrm{Policy}}(\theta)$ is the loss on the policy model, $\mathcal{L}_{\mathrm{Value}}(\phi)$ is the loss on the value model, and $\alpha$ is a hyperparameter.

²We choose PPO because it has shown successful results in other NLP tasks (Nakano et al., 2021; Stiennon et al., 2020). Our earlier experiments with REINFORCE did not show promising results.
Policy loss. To obtain the policy loss, we first compute the truncated estimated advantage function,
$$\hat{A}_t = \sum_{t'=t}^{T-1} (\gamma\lambda)^{t'-t} \delta_{t'}, \quad \text{where } \delta_{t'} = r_{t'} + \gamma V(s_{t'+1}; \phi) - V(s_{t'}; \phi),$$
and the value functions $V(\cdot)$ are estimated by the value model. PPO then maximizes the empirical expectation of a so-called clipped surrogate objective term,
$$\mathrm{cso}(\hat{A}_t, \nu_t(\theta), \varepsilon) = \min\Big(\nu_t(\theta)\hat{A}_t, \; \mathrm{clip}(\nu_t(\theta), 1-\varepsilon, 1+\varepsilon)\hat{A}_t\Big),$$
where $\nu_t(\theta) = \frac{p_K(k_t \mid q, k_{<t}; \theta)}{p_K(k_t \mid q, k_{<t}; \theta_{\mathrm{old}})}$ is the ratio between the current policy $\theta$ and a lagging policy $\theta_{\mathrm{old}}$. The lagging policy is updated to the current policy at a fixed interval of $s$ training steps, and is kept fixed otherwise. We adapt this to our use case, and define the policy loss as
$$\mathcal{L}_{\mathrm{Policy}}(\theta) = -\hat{\mathbb{E}}\big[\mathrm{cso}(\hat{A}_t, \nu_t(\theta), \varepsilon)\big],$$
where the expectation is taken over all instances in the training data ($x \sim \mathcal{D}^{\mathrm{train}}_{\mathrm{seen}}$), the distribution of model-generated knowledge as determined by the current policy conditioned on the instance's question ($k \sim p_K(k \mid q; \theta)$), and all tokens in the knowledge statement ($t \in [1, |k|]$).
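The sketch below illustrates the advantage estimation and the (negated) clipped surrogate objective for a single episode; the γ, λ, and ε defaults are illustrative assumptions, and advantages and old log-probabilities are treated as constants computed upstream.

import torch

def advantages(rewards: torch.Tensor, values: torch.Tensor,
               gamma: float = 1.0, lam: float = 0.95) -> torch.Tensor:
    """A_t = sum_{t'>=t} (gamma*lam)^(t'-t) * delta_t', with
    delta_t' = r_t' + gamma*V(s_{t'+1}) - V(s_t'); the value past the last state is taken as 0."""
    T = rewards.shape[0]
    adv = torch.zeros(T)
    running = 0.0
    for t in reversed(range(T)):
        next_value = values[t + 1] if t + 1 < T else 0.0
        delta = rewards[t] + gamma * next_value - values[t]
        running = delta + gamma * lam * running
        adv[t] = running
    return adv

def policy_loss(logp_new: torch.Tensor, logp_old: torch.Tensor,
                adv: torch.Tensor, eps: float = 0.2) -> torch.Tensor:
    """Negated clipped surrogate objective, averaged over tokens of the episode."""
    ratio = torch.exp(logp_new - logp_old)          # nu_t(theta)
    clipped = torch.clamp(ratio, 1.0 - eps, 1.0 + eps)
    return -torch.minimum(ratio * adv, clipped * adv).mean()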
Value loss. The value model is trained with an MSE loss with respect to the target value, $V^{\mathrm{targ}}_t$, which in turn is estimated with a lagging value model $\phi_{\mathrm{old}}$:
$$\mathcal{L}_{\mathrm{Value}}(\phi) = \hat{\mathbb{E}}\Big[\big(V(s_t; \phi) - V^{\mathrm{targ}}_t\big)^2\Big], \quad \text{where } V^{\mathrm{targ}}_t = \sum_{t'=t}^{T-1} \gamma^{t'-t} r_{t'} + \gamma^{T-t} V(s_T; \phi_{\mathrm{old}}).$$
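A corresponding sketch of the value targets and value loss, transcribing the formula above via the equivalent recursion $V^{\mathrm{targ}}_t = r_t + \gamma V^{\mathrm{targ}}_{t+1}$ with $V^{\mathrm{targ}}_T = V(s_T; \phi_{\mathrm{old}})$; indexing conventions are simplified here.

import torch

def value_targets(rewards: torch.Tensor, values_old: torch.Tensor,
                  gamma: float = 1.0) -> torch.Tensor:
    """Discounted reward-to-go bootstrapped with the lagging value model's V(s_T; phi_old)."""
    T = rewards.shape[0]
    targets = torch.zeros(T)
    targets[-1] = values_old[-1]                 # V_targ_T = V(s_T; phi_old)
    for t in reversed(range(T - 1)):
        targets[t] = rewards[t] + gamma * targets[t + 1]
    return targets

def value_loss(values_new: torch.Tensor, targets: torch.Tensor) -> torch.Tensor:
    """MSE between current value estimates V(s_t; phi) and the fixed targets."""
    return ((values_new - targets.detach()) ** 2).mean()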
2.2.1 Reward Shaping
We define the reward function in reinforcement learning as the quantified effect of RAINIER's knowledge on the QA model's prediction. Suppose we already have a reasonably good QA model, which assigns a probability score $P_{\mathrm{QA}}(a \mid q)$ to any candidate answer $a \in A$. Since we will use a sequence-to-sequence language model (i.e. UnifiedQA (Khashabi et al., 2020)) as the QA model, we define
$$P_{\mathrm{QA}}(a \mid q) = \frac{\exp S_{\mathrm{QA}}(a \mid q)}{\sum_{a' \in A} \exp S_{\mathrm{QA}}(a' \mid q)}, \quad \text{where } S_{\mathrm{QA}}(a \mid q) = \frac{1}{|a|} \sum_{i=1}^{|a|} \log p_{\mathrm{QA}}(a_i \mid q, a_{<i}; \psi_{\mathrm{QA}}),$$
and $p_{\mathrm{QA}}(a_i \mid q, a_{<i}; \psi_{\mathrm{QA}})$ is the language modeling score received by $a_i$, the $i$-th token of $a$. The naive prediction is the candidate answer that receives the highest $P_{\mathrm{QA}}(a \mid q)$ (or equivalently, the highest $S_{\mathrm{QA}}(a \mid q)$): $\hat{a} = \arg\max_{a \in A} P_{\mathrm{QA}}(a \mid q)$.
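A sketch of this scoring with a seq2seq QA model loaded through Hugging Face Transformers; the checkpoint name is an assumed UnifiedQA release, and batching is omitted for clarity.

import math
import torch
from transformers import AutoTokenizer, AutoModelForSeq2SeqLM

qa_name = "allenai/unifiedqa-t5-large"            # assumed checkpoint name
qa_tokenizer = AutoTokenizer.from_pretrained(qa_name)
qa_model = AutoModelForSeq2SeqLM.from_pretrained(qa_name).eval()

@torch.no_grad()
def s_qa(answer: str, question: str) -> float:
    """Length-normalized log-likelihood S_QA(a | q) of the answer tokens."""
    enc = qa_tokenizer(question, return_tensors="pt")
    labels = qa_tokenizer(answer, return_tensors="pt").input_ids
    out = qa_model(**enc, labels=labels)
    return -out.loss.item()                       # loss is mean token-level NLL, so S_QA = -loss

def p_qa(question: str, choices: list[str]) -> dict[str, float]:
    """Softmax of S_QA over all candidate answers."""
    scores = {a: s_qa(a, question) for a in choices}
    z = sum(math.exp(s) for s in scores.values())
    return {a: math.exp(s) / z for a, s in scores.items()}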
We aim at maximizing $P_{\mathrm{QA}}(a^* \mid q \circ k)$, the probability score received by the correct answer when the QA model is prompted with the knowledge $k$ generated by RAINIER, where $\circ$ denotes text concatenation. One naive definition of the reward function may be
$$r(x, k) = P_{\mathrm{QA}}(a^* \mid q \circ k) - P_{\mathrm{QA}}(a^* \mid q).$$
However, this reward only captures the absolute change of the score, not whether the model prediction changes. To remedy this, we define the reward function as
$$r(x, k) = \frac{1}{2}\Big[\tanh\big(S_{\mathrm{QA}}(a^* \mid q \circ k) - \max_{a' \in A, a' \neq a^*} S_{\mathrm{QA}}(a' \mid q \circ k)\big) - \tanh\big(S_{\mathrm{QA}}(a^* \mid q) - \max_{a' \in A, a' \neq a^*} S_{\mathrm{QA}}(a' \mid q)\big)\Big].$$
Intuitively, this function gives a reward near $+1$ if the naive prediction is incorrect (i.e. $S_{\mathrm{QA}}(a^* \mid q) < \max_{a' \in A, a' \neq a^*} S_{\mathrm{QA}}(a' \mid q)$) while the knowledge-prompted prediction is correct (i.e. $S_{\mathrm{QA}}(a^* \mid q \circ k) > \max_{a' \in A, a' \neq a^*} S_{\mathrm{QA}}(a' \mid q \circ k)$). Similarly, the reward is near $-1$ if the naive prediction is correct but the knowledge-prompted prediction is incorrect. The hyperbolic tangent serves as a smoothed sign function, and provides a soft interpolation between the two polarities of reward values by taking into account the margin of the correct answer.
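A sketch of this reward, written in terms of precomputed S_QA scores for each candidate (e.g. from the s_qa helper sketched above), with and without the knowledge:

import math

def margin(scores: dict[str, float], gold: str) -> float:
    """S_QA of the correct answer minus the best distractor's S_QA."""
    best_wrong = max(s for a, s in scores.items() if a != gold)
    return scores[gold] - best_wrong

def reward(scores_q: dict[str, float],    # S_QA(a | q) for each candidate a
           scores_qk: dict[str, float],   # S_QA(a | q ∘ k) for each candidate a
           gold: str) -> float:
    """0.5 * [tanh(margin with knowledge) - tanh(margin without knowledge)]."""
    return 0.5 * (math.tanh(margin(scores_qk, gold)) - math.tanh(margin(scores_q, gold)))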
We also experiment with some alternative definitions of the reward function; see Table 4.
Reward normalization. To stabilize training, we apply an affine transformation to the rewards so that they are initially normalized. Before starting Stage II training, we use the imitation model to generate a knowledge statement for each training instance, and estimate the population mean and standard deviation of the rewards:
$$R_{\mathrm{init}} = \big\{r(x, k) : x \in \mathcal{D}^{\mathrm{train}}_{\mathrm{seen}}, \; k \sim p_K(\cdot \mid q; \theta_{\mathrm{imit}})\big\}, \quad \mu_0 = \mu(R_{\mathrm{init}}), \; \sigma_0 = \sigma(R_{\mathrm{init}}). \quad (4)$$
In Stage II training, each reward is normalized as
$$r(x, k) \leftarrow \frac{r(x, k) - \mu_0}{\sigma_0}. \quad (5)$$
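A small sketch of this normalization, assuming the rewards of imitation-model rollouts over the training set have been collected into a list:

import statistics

def fit_normalizer(init_rewards: list[float]) -> tuple[float, float]:
    """Estimate mu_0 and sigma_0 from rewards of imitation-model knowledge (Eqn 4)."""
    return statistics.mean(init_rewards), statistics.stdev(init_rewards)

def normalize(r: float, mu0: float, sigma0: float) -> float:
    """Affine transformation applied to every reward during Stage II (Eqn 5)."""
    return (r - mu0) / sigma0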
2.3 Inference: Knowledge Prompting and Aggregation
Following Liu et al. (2022), at inference time we use RAINIER to generate multiple knowledge statements per question, and prompt the QA model by individually concatenating each knowledge statement to the question. The knowledge statements are generated by RAINIER with nucleus sampling with $p = 0.5$ (Holtzman et al., 2020),
$$K(q) = \{\varepsilon\} \cup \big\{k_m : k_m \sim p^{p=0.5}_K(k \mid q; \theta), \; m = 1 \ldots M\big\},$$
where $M$ is the number of knowledge statements per question, and $\varepsilon$ denotes the empty string. We collect a set of outputs by prompting with each knowledge statement. The final prediction is the candidate answer that receives the maximum confidence,
$$\hat{a} = \arg\max_{a \in A} \max_{k \in K(q)} P_{\mathrm{QA}}(a \mid q \circ k),$$
and the prediction is supported by a single knowledge statement, the selected knowledge,
$$\hat{k} = \arg\max_{k \in K(q)} \max_{a \in A} P_{\mathrm{QA}}(a \mid q \circ k).$$
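A sketch of this aggregation, reusing the p_qa scoring helper sketched in §2.2.1 and assuming the M knowledge statements have already been sampled from RAINIER; the concatenation follows the q ∘ k = "{q} \n {k}" convention given in §3.

def predict(question: str, choices: list[str], knowledges: list[str]):
    """Return the answer with maximum confidence over all knowledge prompts, plus the selected knowledge."""
    best_answer, best_knowledge, best_conf = None, "", float("-inf")
    for k in [""] + knowledges:                      # "" corresponds to the empty string epsilon
        prompt = f"{question} \n {k}" if k else question   # q ∘ k
        probs = p_qa(prompt, choices)                # P_QA(a | q ∘ k), see the sketch in §2.2.1
        answer, conf = max(probs.items(), key=lambda kv: kv[1])
        if conf > best_conf:
            best_answer, best_knowledge, best_conf = answer, k, conf
    return best_answer, best_knowledge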
Training-time model selection. In Stage II training, we generate only one knowledge statement per question for the validation set.³ Predictions are made using the same knowledge prompting method as above, and the model checkpoint with the maximal accuracy on the union of all validation sets is selected.

³This is for efficiency purposes. We use greedy decoding here because it is more stable than nucleus sampling when generating only one knowledge statement per question.
3 Experiments
Seen datasets. For both imitation learning and reinforcement learning, we use the 8 multiple-choice datasets that UnifiedQA v2 (Khashabi et al., 2022) uses for training: OpenBookQA (Mihaylov et al., 2018), ARC (Clark et al., 2018), AI2Science (Clark et al., 2018), CommonsenseQA (Talmor et al., 2019), QASC (Khot et al., 2020), PhysicalIQA (Bisk et al., 2020), SocialIQA (Sap et al., 2019), and Winogrande (Sakaguchi et al., 2021).⁴
Unseen datasets. We additionally evaluate our method on the following 4 multiple-choice QA datasets that our model was not trained on: NumerSense (Lin et al., 2020), RiddleSense (Lin et al., 2021), QuaRTz (Tafjord et al., 2019), and HellaSwag (Zellers et al., 2019).
Models. For Stage I training, we get silver knowledge from the GPT-3-Curie (13B) model (Brown et al., 2020). The knowledge introspector is initialized with T5-large (Raffel et al., 2019), which has 0.77B parameters. For Stage II training, we initialize the value model with T5-large, and replace the language modeling head with a value regression head, which is initialized from scratch; we use UnifiedQA-large (UQA-large) (Khashabi et al., 2020) as the QA model that provides the reward, which means the text concatenation function is defined as $q \circ k$ = "{q} \n {k}". We use the same question formatting as UnifiedQA. See Table 7 for hyperparameters.
Baselines. We mainly report performance improvements over the vanilla QA baseline (i.e. direct inference with the UnifiedQA-large model, without prompting with RAINIER-generated knowledge). We also consider using knowledge from:

- Few-shot GPT-3 (Liu et al., 2022), where knowledge statements are elicited from the GPT-3-Curie (13B) model, under the same prompts used for getting silver knowledge in Stage I training (§2.1), and the same hyperparameter setting for decoding ($M = 10$ knowledge statements per question, with nucleus sampling with $p = 0.5$).
- Self-talk (Shwartz et al., 2020), where we generate $M = 10$ knowledge statements per question with GPT-3-Curie and a variety of templates.

⁴We exclude MCTest and RACE because most questions in these reading comprehension datasets are too long to fit into our model's input.