Context Generation Improves Open Domain Question Answering
Dan Su‡, Mostofa Patwary§, Shrimai Prabhumoye§, Peng Xu§, Ryan Prenger§,
Mohammad Shoeybi§, Pascale Fung‡, Anima Anandkumar§, Bryan Catanzaro§
‡The Hong Kong University of Science and Technology, §NVIDIA
dsu@connect.ust.hk, mpatwary@nvidia.com
This work was done when the first author was an intern at NVIDIA. Corresponding authors: Dan Su, Mostofa Patwary.
Abstract
Closed-book question answering (QA) requires a model to directly answer an open-domain question without access to any external knowledge. Prior work on closed-book QA either directly finetunes or prompts a pretrained language model (LM) to leverage the stored knowledge. However, these approaches do not fully exploit the parameterized knowledge. To address this inefficiency, we propose a two-stage, closed-book QA framework that employs a coarse-to-fine approach to extract the relevant knowledge and answer a question. We first generate a related context for a given question by prompting a pretrained LM. We then prompt the same LM to generate an answer using the generated context and the question. Additionally, we marginalize over the generated contexts to improve accuracy and reduce context uncertainty. Experimental results on three QA benchmarks show that our method significantly outperforms previous closed-book QA methods. For example, on TriviaQA our method improves exact match accuracy from 55.3% to 68.6%, and is on par with open-book QA methods (68.6% vs. 68.0%). Our results show that our new methodology is able to better exploit the stored knowledge in pretrained LMs without adding extra learnable parameters or needing finetuning, and paves the way for hybrid models that integrate pretrained LMs with external knowledge.
1 Introduction
Open-domain question answering (ODQA) produces an answer to a given question in the form of natural language, and the task has been extensively studied in recent years. Significant progress on ODQA has been made by developing open-book QA methods (Chen et al., 2017; Lewis et al., 2020b; Guu et al., 2020; Izacard and Grave, 2021; Lazaridou et al., 2022) that explicitly exploit an external knowledge corpus via dense retrieval techniques such as DPR (Karpukhin et al., 2020).
Figure 1: An example illustrating our two-stage CGAP framework. For the question "Who had an 80s No 1 hit with Hold On To The Nights?", CGAP generates a more accurate answer (Richard Marx) than standard few-shot prompting (Ross Bagdasarian).
However, learning a good retriever requires substantial resources, such as a large number of domain-specific question-context pairs from the knowledge corpus (Karpukhin et al., 2020) or intensive compute (Lee et al., 2019). In addition, as the size of the knowledge corpus increases, it becomes harder to retrieve accurate contexts due to the high dimensionality of the search space (Reimers and Gurevych, 2021).
Another class of models, known as closed-book question answering (CBQA) models, was recently proposed (Roberts et al., 2020). CBQA tries to directly answer open-domain questions without accessing any external knowledge sources, instead leveraging the parametric knowledge stored in pretrained language models (LMs) (Raffel et al., 2020; Brown et al., 2020; Ye et al., 2020).
However, even with larger LMs, closed-book methods are not competitive with open-book methods in terms of accuracy (Lewis et al., 2021).
While it has been shown that large pretrained LMs store an abundant amount of knowledge (Petroni et al., 2019; Roberts et al., 2020), we hypothesize that the accuracy gaps exist largely because the ways of exploiting the parameterized knowledge are not sophisticated enough. Prior work on CBQA either finetunes pretrained LMs on entire QA datasets (Roberts et al., 2020; Ye et al., 2020) or directly prompts those models with a few QA pairs (Brown et al., 2020; Radford et al., 2019). In contrast, open-book models use a two-stage pipeline: they first retrieve relevant contexts from an external corpus, then extract the answer based on the retrieved contexts.
Therefore, to better exploit the parameterized knowledge in pretrained LMs and bridge the large accuracy gaps between closed-book and open-book methods, we propose a coarse-to-fine, two-stage method for the CBQA task. The main idea is to use generated contexts as an intermediate bridge between the huge amount of parameterized knowledge stored in the LM and the answer that lies within this knowledge. To the best of our knowledge, no previous work has generated context from large pretrained LMs for CBQA and leveraged it to predict the answer.
Our proposed framework, CGAP, consists of two stages. It first performs Context Generation relevant to a given question by prompting a pretrained LM. It then prompts the same LM for Answer Prediction using the generated context and the question. In order to improve accuracy and reduce context uncertainty, we generate multiple contexts for each question and predict the final answer by majority voting. This step does not increase the inference cost, as we generate the contexts in parallel by batching them in a single inference call. Figure 1 illustrates how our two-stage prompting and majority voting work. For the input question, CGAP generates 3 contexts and 3 predicted answers at the two stages respectively, and chooses the most voted answer as the final answer. Note that we do not finetune the large pretrained LMs for context generation or answer prediction. This allows our approach to take advantage of large LMs such as GPT-3 (Brown et al., 2020), PaLM (Chowdhery et al., 2022), or Megatron-Turing NLG 530B (Smith et al., 2022), which are only available through APIs.
We conduct in-depth experimental studies on three open-domain QA benchmarks, Natural Questions (Kwiatkowski et al., 2019), WebQuestions (Berant et al., 2013), and TriviaQA (Joshi et al., 2017), and demonstrate significant improvements from our two-stage prompting method. Our contributions are summarized as follows:
• We propose a simple yet effective few-shot prompting approach for ODQA that does not rely on any external knowledge sources or fine-tuning, yet performs significantly better than existing closed-book approaches (e.g., 68.6% vs. 55.3% exact match) and is on par with open-book methods (e.g., 68.6% vs. 68.0%).
• We show that the generated context can improve standard few-shot prompting based closed-book QA accuracy at various model scales (e.g., from 11.7% to 28.5%), and demonstrate that scaling up the context generation model further widens the accuracy gap (e.g., 28.5% at 357M vs. 68.6% at 530B).
• To the best of our knowledge, we are the first to leverage generated context from large pretrained LMs for open-domain question answering.
• We show that generating multiple contexts, without increasing the inference cost thanks to batching, can mitigate errors in answer prediction caused by variability in the unknown context (e.g., from 36.3% to 45.7%).
2 Methodology
Our proposed Context Generation and Answer Prediction (CGAP) framework is illustrated in Figure 2. CGAP consists of two stages. First, it generates context relevant to a given question by prompting a large pretrained LM. In the second stage, it predicts an answer by prompting the same LM with the generated context and the question. To accurately predict the answer, we generate multiple contexts: we run each of the two stages multiple times in parallel (in a batch) for the same question, generating a different context each time, and use majority voting to select the final answer.
Formally, we have a question Q to be answered and a support repository D = {(c_1, q_1, a_1), ..., (c_n, q_n, a_n)} that consists of question-answer pairs (q_i, a_i), each mapped to a context c_i. In our experiments, we use the training sets of the corresponding datasets as D.
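For concreteness, the sketches accompanying this section assume D is materialized as a plain list of (context, question, answer) tuples. A minimal loader is shown below; the row field names are hypothetical placeholders for whatever format the training set actually uses, not the paper's released code.

```python
def load_repository(training_rows: list[dict]) -> list[tuple[str, str, str]]:
    """Build D as (c_i, q_i, a_i) tuples from a QA training set.

    The 'context', 'question', and 'answers' keys are illustrative
    assumptions about the dataset schema."""
    return [(row["context"], row["question"], row["answers"][0])
            for row in training_rows]
```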
2.1 Context Generation
As shown in Figure 2, in the first stage, given question Q, we select the m context generation prompt samples S = {(q_1, c_1), ..., (q_m, c_m)} from the support repository D. We then use S together with Q to prompt a pretrained LM to generate k contexts, denoted by C_gen = {c_gen^1, c_gen^2, ..., c_gen^k}.
Sample Selection
Selecting appropriate samples for the prompts is key to generating high-quality context relevant to a given question. Previous work has shown that leveraging relevant samples helps the LM generate contextually relevant and factually correct context (Liu et al., 2021, 2022). We therefore use a similarity-based retriever to search for relevant samples S in the corresponding support repository D. We use DPR (Karpukhin et al., 2020) in our framework. In our DPR setup, we represent the question and the samples in D as 768-dimensional dense vectors, computed via BERT-based bi-encoder networks. We rank the samples according to their similarity score, calculated as:

$$\mathrm{Score}(Q, (q_j, c_j)) = \mathrm{BERT}(Q)^{\top} \cdot \mathrm{BERT}(q_j; c_j) \quad (1)$$

where ; denotes concatenation of the tokens of the question q_j and the context c_j. Finally, we obtain S = {(q_1, c_1), ..., (q_m, c_m)}, the top-m retrieved samples for question Q.
We would like to emphasize that the selected samples from D are used as examples in the few-shot prompt to the pretrained LM to generate context, not as a source of external knowledge containing the answer.
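To make the selection step concrete, the following is a minimal sketch of Equation 1. It assumes the 768-dimensional embeddings of the question and of each (q_j; c_j) pair have already been produced by a DPR-style bi-encoder; the function name and array shapes are illustrative assumptions, not the authors' implementation.

```python
import numpy as np

def select_samples(question_vec: np.ndarray,
                   sample_vecs: np.ndarray,
                   repository: list[tuple[str, str, str]],
                   m: int) -> list[tuple[str, str, str]]:
    """Return the top-m (context, question, answer) tuples from D, ranked by
    the dot-product similarity of Equation 1."""
    # sample_vecs has shape (n, 768); question_vec has shape (768,).
    scores = sample_vecs @ question_vec      # one score per tuple in D
    top_m = np.argsort(-scores)[:m]          # indices of the m most similar samples
    return [repository[i] for i in top_m]    # descending similarity order
```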
Prompts Construction
Given the question Q and the set of selected question-context pair samples S, we use few-shot prompting to condition the pretrained LM on the samples. We use a few-shot prompting technique similar to the one used for closed-book QA in Brown et al. (2020), which considers multiple <question, answer> pairs. The template we use to construct prompts is: Q: ... A: .... Thus, the constructed prompt Prompt(Q) for a given question Q becomes:

Prompt(Q) = Q: q_m \n A: c_m \n ... Q: q_1 \n A: c_1 \n Q: Q \n

We use '\n' to separate the question, the context, and the samples. We investigated the order of samples to optimize the prompt and found that using the retrieved samples in reversed order of similarity yields better accuracies across all datasets (a concrete example of Prompt(Q) is shown in Appendix Table 12). We then pass Prompt(Q) through the pretrained LM to generate the context:

c_gen = LM(Prompt(Q))

To generate a set of k contexts {c_gen^1, ..., c_gen^k}, we increase the inference batch size to k and generate all k contexts in parallel in one inference call to the LM. Thus, the overall latency remains the same as using a single context.
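A minimal sketch of this stage is shown below. It relies on a generic `lm_generate(prompts, max_new_tokens)` wrapper around whatever LM inference API is available, returning one sampled completion per prompt; the wrapper name, sampling behavior, and generation length are illustrative assumptions, not the paper's implementation.

```python
def build_context_prompt(question: str,
                         samples: list[tuple[str, str, str]]) -> str:
    """Build Prompt(Q) for context generation from the top-m samples.

    Samples arrive in descending similarity order; iterating them in reverse
    places the most similar sample right before the test question, matching
    the reversed-order finding in the text."""
    parts = []
    for context, q, _answer in reversed(samples):   # least similar first
        parts.append(f"Q: {q}\nA: {context}\n")
    parts.append(f"Q: {question}\nA:")
    return "".join(parts)

def generate_contexts(question: str,
                      samples: list[tuple[str, str, str]],
                      k: int,
                      lm_generate) -> list[str]:
    """Generate k contexts in a single batched inference call.

    The same prompt is replicated k times; with sampling enabled inside
    `lm_generate`, each batch element yields a different context."""
    prompt = build_context_prompt(question, samples)
    return lm_generate([prompt] * k, max_new_tokens=128)
```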
2.2 Answer Prediction
In the second stage, we select the m answer prediction prompt samples S' = {(q_1, c_1, a_1), ..., (q_m, c_m, a_m)} from D, and then prompt the same LM using the generated contexts C_gen from the first stage, along with the question Q and S'. The LM predicts a set of k answers A_p = {a_p^1, a_p^2, ..., a_p^k}, each corresponding to one of the k contexts in C_gen. The final answer A is selected by majority voting on A_p.
Sample Selection
Constrained by the maximum sequence length of the LM, we can feed the LM only a few (c, q, a) samples. Thus, it can be difficult for the LM to learn how to predict the answer for the given question conditioned on the context unless similar examples have been provided. For example, if we were asking the question 'who is the current director of the us mint?', an example that answers 'who is the fbi director of the united states?' from its provided context would be more helpful than an example that answers 'how many episodes are there in Dragon Ball Z?' from its context. We therefore use the same selection criteria for answer prediction as for context generation: we reuse the set of samples selected in the first stage, as described in Equation 1, and denote it as S' = {(q_1, c_1, a_1), ..., (q_m, c_m, a_m)}.
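Continuing the earlier sketches, the answer-prediction examples can simply reuse the tuples retrieved in the first stage; the snippet below is an illustrative assumption about how S and S' are derived from the same retrieval call, not the authors' code.

```python
m = 8  # number of few-shot examples per prompt; illustrative value
samples = select_samples(question_vec, sample_vecs, repository, m)

# S: (question, context) view used in the context generation prompt.
S = [(q, c) for c, q, _a in samples]
# S': (question, context, answer) view used in the answer prediction prompt.
S_prime = [(q, c, a) for c, q, a in samples]
```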
Prompt Construction
We prompt the LM with few-shot examples to predict the answer for the question conditioned on the generated context. To equip the LM with this capability, we construct intuitive prompts from the selected examples and feed them into the LM.
Figure 2: Overview architecture of our CGAP framework. It first performs Context Generation by prompting a large pretrained LM, then further prompts the same LM for Answer Prediction by feeding the generated context to the LM alongside the question. k contexts are generated and the final answer A is chosen by majority voting. (If computation capability allows, multiple (k) LM instances can be prompted in parallel at both stages to speed this up.)
Specifically, the template we use to construct answer prediction prompts is: C: ... Q: ... A: .... Thus, the constructed prompt for a given question Q and the i-th generated context c_gen^i is:

Prompt(c_gen^i, Q) = C: c_m \n Q: q_m \n A: a_m \n ... C: c_1 \n Q: q_1 \n A: a_1 \n C: c_gen^i \n Q: Q \n   (2)

We then feed Prompt(c_gen^i, Q) into the pretrained LM to predict the answer:

a_p^i = LM(Prompt(c_gen^i, Q))   (3)

where a_p^i denotes the i-th answer predicted by the LM. The k generated contexts in C_gen thus yield a set of answers A_p = {a_p^1, ..., a_p^k}.
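The answer-prediction stage mirrors the first stage. The sketch below builds Prompt(c_gen^i, Q) for each generated context and scores all k prompts in one batched call, again using the hypothetical `lm_generate` wrapper introduced earlier.

```python
def build_answer_prompt(question: str,
                        generated_context: str,
                        samples: list[tuple[str, str, str]]) -> str:
    """Build Prompt(c_gen^i, Q) following the C/Q/A template of Equation 2."""
    parts = []
    for context, q, answer in reversed(samples):   # most similar sample last
        parts.append(f"C: {context}\nQ: {q}\nA: {answer}\n")
    parts.append(f"C: {generated_context}\nQ: {question}\nA:")
    return "".join(parts)

def predict_answers(question: str,
                    generated_contexts: list[str],
                    samples: list[tuple[str, str, str]],
                    lm_generate) -> list[str]:
    """Predict one answer per generated context (Equation 3), batched."""
    prompts = [build_answer_prompt(question, c, samples)
               for c in generated_contexts]
    return [ans.strip() for ans in lm_generate(prompts, max_new_tokens=16)]
```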
2.3 Context Marginalization
While a large pretrained LM can generate impressively fluent and relevant context for a given input, it also has a tendency to generate factually incorrect statements, ranging from subtle inaccuracies to wild hallucinations (Shuster et al., 2021; Krishna et al., 2021; Su et al., 2022). Answers conditioned solely on hallucinated or erroneous statements are likely to be incorrect (Equation 3). Thus, we would like to remove the variability in the answer due to any particular generated context.
Ideally, we could marginalize over this unknown context by producing an answer for every possible context, weighting each answer by the probability of the context. Here we approximate this by generating a set of contexts and selecting the final answer by majority voting. Suppose there are T unique answers {A_p^1, ..., A_p^T} among the k predicted answers from Equation 3, where T <= k. We then select the J-th answer, the one that receives the highest number of votes among the T distinct answers, via:

$$J = \operatorname*{argmax}_{j \in \{1, 2, \dots, T\}} \sum_{i=1}^{k} \mathbb{1}(a_p^i = A_p^j) \quad (4)$$

as the final answer A. As k gets larger, the final answer A converges to the answer that would be produced by marginalizing over all possible contexts. We refer to this majority vote over multiple generated contexts as context marginalization.
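Putting the pieces together, Equation 4 reduces to a vote count over the k predicted answers. The sketch below implements the vote with `collections.Counter` and chains the earlier hypothetical helpers into an end-to-end CGAP-style pass; it is an illustrative composition under the assumptions stated above, not the authors' released code.

```python
from collections import Counter

def marginalize(answers: list[str]) -> str:
    """Majority vote over the k predicted answers (Equation 4)."""
    # most_common(1) returns the answer string with the highest vote count.
    return Counter(answers).most_common(1)[0][0]

def cgap_answer(question: str,
                question_vec,
                sample_vecs,
                repository,
                lm_generate,
                m: int = 8,
                k: int = 8) -> str:
    """End-to-end CGAP-style pass: retrieve samples, generate k contexts,
    predict k answers, and marginalize over the generated contexts."""
    samples = select_samples(question_vec, sample_vecs, repository, m)
    contexts = generate_contexts(question, samples, k, lm_generate)
    answers = predict_answers(question, contexts, samples, lm_generate)
    return marginalize(answers)
```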
3 Experimental Setup
3.1 Datasets
We evaluate our method on three open-domain QA benchmark datasets: Natural Questions (NQ) (Kwiatkowski et al., 2019), TriviaQA (TQA) (Joshi et al., 2017), and WebQuestions (WQ) (Berant et al., 2013), using the same train, validation, and test splits as Lee et al. (2019) and Izacard and Grave (2021).
NQ contains questions from Google search queries; TQA contains a collection of questions from trivia and quiz-league websites, for which we use the unfiltered set; the questions in WQ come from the Google Suggest API. For NQ and TQA, we use the processed data provided by Izacard and Grave (2021), in which each question-answer pair is accompanied by a 100-word Wikipedia passage