However, even with larger LMs, the closed-book methods are not competitive with the open-book methods in terms of accuracy (Lewis et al., 2021).
While it has been shown that large pretrained LMs store an abundant amount of knowledge (Petroni et al., 2019; Roberts et al., 2020), we hypothesize that the accuracy gaps arise largely because the ways of exploiting this parameterized knowledge are not sophisticated enough. Prior work on CBQA either finetunes pretrained LMs on entire QA datasets (Roberts et al., 2020; Ye et al., 2020) or directly prompts them with several few-shot QA pairs (Brown et al., 2020; Radford et al., 2019). In contrast, open-book models use a two-stage pipeline: they first retrieve relevant contexts from an external corpus and then extract the answer from the retrieved contexts.
Therefore, to better exploit the parameterized knowledge in pretrained LMs and bridge the large accuracy gaps between the closed-book and open-book methods, we propose a coarse-to-fine, two-stage method for the CBQA task. The main idea is to leverage generated contexts as an intermediate bridge between the vast amount of parameterized knowledge stored in the LM and the answer that lies within this knowledge. To the best of our knowledge, no previous work has generated context from large pretrained LMs for CBQA and leveraged it to predict answers.
Our proposed framework, CGAP, consists of two stages. It first performs Context Generation relevant to a given question by prompting a pretrained LM. It then prompts the same LM for Answer Prediction using the generated context and the question. In order to improve accuracy and
to reduce context uncertainties, we generate mul-
tiple contexts for each question and predict the
final answer by majority voting. This step does
not increase the inference cost as we generate the
contexts in parallel by batching in a single infer-
ence call. Figure 1 illustrates how our two-stage prompting and majority voting work. For the input question, CGAP generates 3 contexts and 3 predicted answers at the two stages, respectively, and chooses the most-voted answer as the final answer. Note that we do not finetune the large pretrained LMs for context generation or answer prediction. This allows our approach to take advantage of large LMs such as GPT-3 (Brown et al., 2020), PaLM (Chowdhery et al., 2022), or Megatron-Turing NLG 530B (Smith et al., 2022), which are only available through APIs.
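As a concrete illustration of the two prompting stages and the majority vote, consider the minimal sketch below; `lm_generate`, `build_context_prompt`, and `build_answer_prompt` are hypothetical placeholders for a batched LM API call and the few-shot prompt templates, not our actual implementation.

```python
from collections import Counter

def cgap_answer(question, lm_generate, build_context_prompt,
                build_answer_prompt, k=3):
    """Two-stage prompting with majority voting (illustrative sketch only)."""
    # Stage 1: sample k contexts for the question in a single batched LM call.
    context_prompts = [build_context_prompt(question) for _ in range(k)]
    contexts = lm_generate(context_prompts)  # k sampled contexts

    # Stage 2: predict one answer per generated context, again in one batch.
    answer_prompts = [build_answer_prompt(question, c) for c in contexts]
    answers = lm_generate(answer_prompts)    # k candidate answers

    # Majority voting: the most frequent candidate answer is the final answer.
    return Counter(answers).most_common(1)[0][0]
```

Because both stages are executed as batched calls, increasing the number of contexts changes only the batch size, not the number of inference calls.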
We conduct in-depth experimental studies on three open-domain QA benchmarks, Natural Questions (Kwiatkowski et al., 2019), WebQuestions (Berant et al., 2013), and TriviaQA (Joshi et al., 2017), and demonstrate significant improvements from our two-stage prompting method. Our contributions are summarized as follows:
• We propose a simple yet effective few-shot prompting approach for ODQA that does not rely on any external knowledge sources or fine-tuning, but performs significantly better than existing closed-book approaches (e.g., 68.6% vs. 55.3% exact match) and is on par with open-book methods (e.g., 68.6% vs. 68.0%).
• We show that the generated context can improve standard few-shot prompting-based closed-book QA accuracy at various model scales (e.g., from 11.7% to 28.5%), and demonstrate that scaling up the context generation model further enlarges the accuracy gap (e.g., 28.5% with a 357M model vs. 68.6% with a 530B model). To the best of our knowledge, we are the first to leverage generated context from large pretrained LMs for open-domain question answering.
• We show that generating multiple contexts by batching, without increasing the inference cost, can mitigate answer prediction errors caused by variability in the generated context (e.g., from 36.3% to 45.7%).
2 Methodology
Our proposed Context Generation and Answer Prediction (CGAP) framework is illustrated in Figure 2. CGAP consists of two stages. First, it generates context relevant to a given question by prompting a large pretrained LM. In the second stage, it predicts an answer by prompting the same LM with the generated context and the question. To predict the answer accurately, we generate multiple contexts: we run the two stages multiple times in parallel in a batch for the same question, generating a different context each time, and use majority voting to select the final answer.
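For illustration, one way to write this selection step (with notation introduced here, ahead of the formal setup below) is $\hat{a} = \arg\max_{a} \sum_{j=1}^{k} \mathbb{1}[a^{(j)} = a]$, where $a^{(j)}$ denotes the answer predicted from the $j$-th generated context and $k$ is the number of sampled contexts.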
Formally, for our task we have a question $Q$ to be answered, and a support repository $D = \{(c_1, q_1, a_1), \ldots, (c_n, q_n, a_n)\}$ that consists of tuples of question $q_i$ and answer $a_i$ pairs, each mapped to a context $c_i$. In our experiments, we use