However, even with larger LMs, the closed-book methods are not competitive with the open-book methods in terms of accuracy (Lewis et al., 2021).
While it has been shown that large pretrained LMs store an abundant amount of knowledge (Petroni et al., 2019; Roberts et al., 2020), we hypothesize that the accuracy gaps arise largely because the ways of exploiting this parameterized knowledge are not sophisticated enough. Prior work on CBQA either finetunes pretrained LMs on entire QA datasets (Roberts et al., 2020; Ye et al., 2020) or directly prompts them with several few-shot QA pairs (Brown et al., 2020; Radford et al., 2019). In contrast, open-book models use a two-stage pipeline: they first retrieve relevant contexts from an external corpus and then extract the answer from the retrieved contexts.
Therefore, to better exploit the parameterized knowledge in pretrained LMs and bridge the large accuracy gaps between the closed-book and open-book methods, we propose a coarse-to-fine, two-stage method for the CBQA task. The main idea is to leverage generated contexts as an intermediate bridge between the vast amount of parameterized knowledge stored in the LM and the answer that lies within this knowledge. To the best of our knowledge, no previous work has generated context from large pretrained LMs for CBQA and leveraged it to predict answers.
Our proposed framework, CGAP, consists of two stages. It first performs Context Generation relevant to a given question by prompting a pretrained LM. It then prompts the same LM for Answer Prediction using the generated context and the question. In order to improve accuracy and
to reduce context uncertainties, we generate mul-
tiple contexts for each question and predict the
final answer by majority voting. This step does
not increase the inference cost as we generate the
contexts in parallel by batching in a single infer-
ence call. Figure 1 illustrates how our two-stage prompting and majority voting work. For the input question, CGAP generates 3 contexts and 3 predicted answers at the two stages, respectively, and chooses the most-voted answer as the final answer. Note that we do not finetune the large pretrained LMs for context generation or answer prediction. This allows our approach to take advantage of large LMs such as GPT-3 (Brown et al., 2020), PaLM (Chowdhery et al., 2022), or Megatron-Turing NLG 530B (Smith et al., 2022), which are only available through APIs.
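As a concrete illustration of the two prompting stages and the majority vote, consider the minimal sketch below; `lm_generate`, `build_context_prompt`, and `build_answer_prompt` are hypothetical placeholders for a batched LM API call and the few-shot prompt templates, not our actual implementation.

```python
from collections import Counter

def cgap_answer(question, lm_generate, build_context_prompt,
                build_answer_prompt, k=3):
    """Two-stage prompting with majority voting (illustrative sketch only)."""
    # Stage 1: sample k contexts for the question in a single batched LM call.
    context_prompts = [build_context_prompt(question) for _ in range(k)]
    contexts = lm_generate(context_prompts)  # k sampled contexts

    # Stage 2: predict one answer per generated context, again in one batch.
    answer_prompts = [build_answer_prompt(question, c) for c in contexts]
    answers = lm_generate(answer_prompts)    # k candidate answers

    # Majority voting: the most frequent candidate answer is the final answer.
    return Counter(answers).most_common(1)[0][0]
```

Because both stages are executed as batched calls, increasing the number of contexts changes only the batch size, not the number of inference calls.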
We conduct in-depth experimental studies on three open-domain QA benchmarks, Natural Questions (Kwiatkowski et al., 2019), WebQuestions (Berant et al., 2013), and TriviaQA (Joshi et al., 2017), and demonstrate significant improvements from our two-stage prompting method. Our contributions are summarized as follows:
• We propose a simple yet effective few-shot prompting approach for ODQA that does not rely on any external knowledge sources or fine-tuning, but performs significantly better than existing closed-book approaches (e.g., 68.6% vs. 55.3% exact match) and is on par with open-book methods (e.g., 68.6% vs. 68.0%).
• We show that the generated context can improve standard few-shot prompting-based closed-book QA accuracy at various model scales (e.g., from 11.7% to 28.5%), and demonstrate that scaling up the context generation model further enlarges the accuracy gap (e.g., 28.5% with a 357M model vs. 68.6% with a 530B model). To the best of our knowledge, we are the first to leverage generated context from large pretrained LMs for open-domain question answering.
• We show that generating multiple contexts by batching, without increasing the inference cost, can mitigate answer prediction errors caused by variability in the generated context (e.g., from 36.3% to 45.7%).
2 Methodology
Our proposed Context Generation and Answer Prediction (CGAP) framework is illustrated in Figure 2. CGAP consists of two stages. First, it generates context relevant to a given question by prompting a large pretrained LM. In the second stage, it predicts an answer by prompting the same LM with the generated context and the question. To predict the answer accurately, we generate multiple contexts: we run the two stages multiple times in parallel in a batch for the same question, generating a different context each time, and use majority voting to select the final answer.
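For illustration, one way to write this selection step (with notation introduced here, ahead of the formal setup below) is $\hat{a} = \arg\max_{a} \sum_{j=1}^{k} \mathbb{1}[a^{(j)} = a]$, where $a^{(j)}$ denotes the answer predicted from the $j$-th generated context and $k$ is the number of sampled contexts.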
Formally, for our task we have a question $Q$ to be answered, and a support repository $D = \{(c_1, q_1, a_1), \ldots, (c_n, q_n, a_n)\}$ that consists of tuples of question $q_i$ and answer $a_i$ pairs, each mapped to a context $c_i$. In our experiments, we use