
without additional engineering effort (e.g. fine-
tuning), (2) produces commonsense knowledge of
high quality and diversity, and (3) is substantially
smaller in size compared to GPT-3, the best knowl-
edge source reported so far.
To train RAINIER, we optimize knowledge introspection for the resulting QA performance, rather than with direct supervision, because commonsense datasets usually provide no gold knowledge labels. In order
to ensure that our model learns to generate gener-
ically useful knowledge for a broad range of QA
models, we train only RAINIER, the knowledge in-
trospector, without finetuning the QA model. Since
the desired knowledge statements are sequences of discrete,
non-differentiable word tokens, we adapt a rein-
forcement learning algorithm, Proximal Policy Op-
timization (PPO) (Schulman et al., 2017; Ouyang et al., 2022), to optimize the knowledge introspec-
tor. Specifically, the reward is defined as the ef-
fect of RAINIER-generated knowledge on the QA
model’s prediction. We train RAINIER in a multi-
task setting on 8 commonsense QA datasets – en-
compassing general, scientific, physical, and social
commonsense – to equip the model with better gen-
eralization to unseen benchmarks.
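A minimal sketch of this reward, under stated assumptions: qa_prob is a hypothetical callable that returns the fixed QA model's probability of an answer given an input, the knowledge is simply prepended to the question, and the reward is taken to be the raw change in that probability (the exact shaping used for training may differ).

from typing import Callable

def knowledge_reward(
    qa_prob: Callable[[str, str], float],  # hypothetical: (input_text, answer) -> P(answer | input)
    question_with_choices: str,            # "{question} (A) {choice_A} (B) {choice_B} ..."
    knowledge: str,                        # statement produced by the knowledge introspector
    gold_answer: str,
) -> float:
    # Reward = how much the generated knowledge shifts the fixed QA model's
    # belief in the correct answer (a simplified stand-in for the shaped reward).
    p_without = qa_prob(question_with_choices, gold_answer)
    p_with = qa_prob(f"{knowledge} {question_with_choices}", gold_answer)
    return p_with - p_without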
Experiments show that RAINIER substantially
improves the performance of QA models on 9 com-
monsense benchmarks (5 datasets seen during train-
ing and 4 unseen datasets), and gives larger and
more consistent gains than a few-shot GPT-3 (Liu
et al., 2022) despite being 16x smaller in parameter
size. It also boosts the performance of QA models that it was not trained against, indicat-
ing that it generates generically useful knowledge
instead of merely hacking into the reward given
by a single QA model. Knowledge generated by
RAINIER can even boost a QA model that is 4x
larger than it, showing the promise of using model-
generated knowledge as a complement to model
scaling in making progress in commonsense rea-
soning. Our analyses show that the knowledge
generated by RAINIER is of high quality and is
diverse in terms of domain (e.g. scientific, social),
relation expressed (e.g. part of, member of, pur-
pose), and syntactic property (e.g. negation, com-
parison). The effect of this knowledge on the QA
model also aligns well with human judgments. The
success of RAINIER shows that moderately-sized
models can serve as a source of high-quality and
useful commonsense knowledge that facilitates rea-
soning. We publicly release the code, the trained RAINIER model, and the commonsense datasets extended with knowledge generated by RAINIER.

Algorithm 1 Training RAINIER
Input: initial policy model θ_0, initial value model φ_0, pretrained QA model ψ_QA
  D_imit ← Get silver knowledge on D_seen from GPT-3.
  θ_imit ← Optimize θ_0 with Eqn 2 from D_imit.                     ▷ Section 2.1
  θ_RAINIER ← REINFORCEDLEARNING(D_seen, θ_imit, φ_0, ψ_QA)         ▷ Section 2.2
  procedure REINFORCEDLEARNING(D_seen, θ, φ, ψ_QA)
    θ_old ← θ, φ_old ← φ
    for iteration = 1, 2, . . . do
      Sample a minibatch from D_seen.
      for step = 1, 2, . . . , s do
        Compute L_PPO on the minibatch with Eqn 3.
        Optimize θ and φ with L_PPO for one step.
      θ_old ← θ, φ_old ← φ
    return θ
Output: θ_RAINIER
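For concreteness, the Python sketch below mirrors the outer structure of Algorithm 1. All helper callables (get_silver_knowledge, imitation_finetune, sample_minibatch, ppo_update) are hypothetical stand-ins for Eqn 2, Eqn 3, and the data pipeline, not the released implementation.

import copy
from typing import Any, Callable, Iterable

def train_rainier(
    d_seen: Iterable,                 # seen commonsense QA datasets
    policy: Any,                      # θ_0: initial policy (knowledge introspector)
    value: Any,                       # φ_0: initial value model
    qa_model: Any,                    # ψ_QA: pretrained, frozen QA model
    get_silver_knowledge: Callable,   # hypothetical: queries GPT-3 for silver knowledge
    imitation_finetune: Callable,     # hypothetical: optimizes θ with Eqn 2
    sample_minibatch: Callable,       # hypothetical: draws a minibatch from D_seen
    ppo_update: Callable,             # hypothetical: one gradient step on L_PPO (Eqn 3)
    num_iterations: int = 1000,
    s: int = 4,                       # inner PPO steps per minibatch
):
    # Stage I: imitation learning from GPT-3's silver knowledge (Section 2.1).
    d_imit = get_silver_knowledge(d_seen)
    policy = imitation_finetune(policy, d_imit)

    # Stage II: reinforcement learning against the frozen QA model (Section 2.2).
    policy_old, value_old = copy.deepcopy(policy), copy.deepcopy(value)
    for _ in range(num_iterations):
        batch = sample_minibatch(d_seen)
        for _ in range(s):
            ppo_update(batch, policy, value, policy_old, value_old, qa_model)
        policy_old, value_old = copy.deepcopy(policy), copy.deepcopy(value)
    return policy  # θ_RAINIER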
2 Method
Problem Overview. We focus on the tasks of multiple-choice commonsense QA, consisting of instances of format x = (q, A, a∗), where q is the question, A is the set of candidate answers, and a∗ ∈ A is the correct answer. For full contextualization, we append the candidate answers A to the question q to form the input to the QA model as follows:

q = {question} (A) {choice_A} (B) {choice_B} ...
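A small sketch of this input construction (the letter labels and spacing are assumptions about the exact template):

def format_qa_input(question: str, choices: list[str]) -> str:
    # Append candidate answers to the question, as in the format above (up to 8 choices).
    labels = "ABCDEFGH"
    choice_str = " ".join(f"({labels[i]}) {c}" for i, c in enumerate(choices))
    return f"{question} {choice_str}"

# e.g. format_qa_input("Where can you find a seat on a bus?",
#                      ["garage", "bus stop", "on the bus"])
# returns "Where can you find a seat on a bus? (A) garage (B) bus stop (C) on the bus"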
Common approaches only train supervised QA models. As a complement, we train a separate model, which we refer to as RAINIER, that can introspect question-specific knowledge that is useful for prompting a fixed QA model. RAINIER is a sequence-to-sequence language model, p_K(k | q; θ), and we expect it to generate knowledge statements (k's) in response to the given question (q). However, the challenge is that we have no gold knowledge labels as supervision.
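As a hedged illustration of what knowledge introspection looks like at inference time, the sketch below samples knowledge statements from a generic pretrained sequence-to-sequence model via Hugging Face transformers; the backbone checkpoint ("t5-large"), the prompt format, and the decoding settings are assumptions rather than RAINIER's actual configuration.

from transformers import AutoModelForSeq2SeqLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("t5-large")    # assumed backbone, for illustration only
model = AutoModelForSeq2SeqLM.from_pretrained("t5-large")

def introspect_knowledge(question_with_choices: str, num_knowledge: int = 3) -> list[str]:
    # Sample several knowledge statements k ~ p_K(k | q; θ) for one question.
    inputs = tokenizer(question_with_choices, return_tensors="pt")
    outputs = model.generate(
        **inputs,
        do_sample=True,               # sampling encourages diverse knowledge
        top_p=0.9,
        max_new_tokens=64,
        num_return_sequences=num_knowledge,
    )
    return [tokenizer.decode(o, skip_special_tokens=True) for o in outputs]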
Training.
Since we do not have gold knowledge
to train RAINIER, we obtain this model by fine-
tuning a pretrained language model in two stages:
(I) imitation learning, and then (II) reinforcement
learning. In Stage I (§2.1), we get silver knowledge
labels on some datasets from GPT-3, and teach our
model to imitate this knowledge-generating GPT-3.
This equips our model with the basic functionality
of knowledge generation. In Stage II (§2.2), we
use reinforcement learning to continue training the