
Type | Questions (hop1, hop2, and multi-hop) | Answers

Composition
  q1: Return the country where Limonese Creole is spoken. | Costa Rica
  q2: Which continent is Costa Rica located? | North America
  q:  On which continent is Limonese Creole spoken? | North America

Conjunction
  q1: What team is Reggie Bush on 2011? | Miami Dolphins, New Orleans Saints
  q2: Which one of the following is the team won the super bowl XLIV championship: Miami Dolphins, New Orleans Saints? | New Orleans Saints
  q:  What team that won the super bowl XLIV championship was Reggie Bush in 2011? | New Orleans Saints

Superlative
  q1: What countries does the Niger River flow through? | Benin, Guinea, Mali, Niger, Nigeria
  q2: Which one of the following country calling code is smallest: Benin, Guinea, Mali, Niger, Nigeria? | Mali
  q:  What country with the smallest calling code does the Niger River flow through? | Mali

Comparative
  q1: What were Hitler's parents names? | Alois Hitler, Klara Hitler
  q2: Which one of the following person's date of death is after 1903-01-03: Alois Hitler, Klara Hitler? | Klara Hitler
  q:  Which of Hitler's parents died after 3 January 1903? | Klara Hitler
Table 1: Each multi-hop question q from ComplexWebQuestions is decomposed into two single-hop questions q1 and q2. Underlined entities in the second single-hop questions are actually the answer to the first hop.
(Lewis et al., 2020a), which is also an encoder-decoder model that encodes both context and question, and generates answers autoregressively.
2.2 Multi-hop Questions and Decompositions
To understand multi-hop reasoning in generative
QA models, we propose to query models using
both multi-hop questions and their decompositions
into multiple single-hop questions, and perform
analysis based on the predictions.
To this end, we choose the ComplexWebQuestions dataset (Talmor and Berant, 2018) as our major testbed, as it contains multi-hop questions based on simple questions from the WebQuestionsSP dataset (Yih et al., 2016), and we can leverage simple heuristics to obtain decomposed single-hop
questions and corresponding answers. Another ad-
vantage of ComplexWebQuestions is that it con-
tains four types of questions: composition, con-
junction, superlative, and comparative. This allows
us to perform fine-grained analysis over these cate-
gories. Specifically, we follow heuristics in Talmor
and Berant (2018) to generate decompositions. For
the composition type, they use questions from We-
bQuestionsSP as the second hop, and replace an
entity in it with a relational phrase to generate multi-
hop questions. We revert this process to get the
first-hop question. For the other three types, they
use questions from WebQuestionsSP with multiple
answers as the first hop, and add additional condi-
tions to form the multi-hop questions. We extract
those conditions and use the following template
to generate the second hop question: “Which one
of the following [condition]: [candidate answers]”.
Tab. 1 includes examples of multi-hop questions
and their decompositions of four types.
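The template for the conjunction, superlative, and comparative types can be sketched as a small helper; the function name and signature below are our own illustration, not code from the dataset release:

```python
def make_second_hop(condition: str, candidate_answers: list[str]) -> str:
    """Instantiate the template
    "Which one of the following [condition]: [candidate answers]"
    from the extracted condition and the first-hop answer set.
    (Illustrative sketch; not the authors' actual implementation.)"""
    return f"Which one of the following {condition}: {', '.join(candidate_answers)}?"

# The superlative example from Table 1:
q2 = make_second_hop(
    "country calling code is smallest",
    ["Benin", "Guinea", "Mali", "Niger", "Nigeria"],
)
```

Applied to the superlative row of Table 1, this reproduces the second-hop question "Which one of the following country calling code is smallest: Benin, Guinea, Mali, Niger, Nigeria?".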
We also use another small dataset from Tang et al. (2021) to test the generality of models, where a subset of multi-hop questions from HotpotQA (Yang et al., 2018) are manually annotated with decompositions. This dataset only contains a single type of question, which is composition. ComplexWebQuestions has 27,639/3,519 questions in the training/development set, and HotpotQA has 1,000 questions in the development set.2
2.3 Answer Generation and Evaluation
We use q_t, t ∈ {1, ..., T} to denote the t-th decomposed single-hop question for a multi-hop question q with T hops. Correspondingly, we use a_t to denote answers and c_t to denote retrieved context for the single-hop question q_t. Since the last single-hop question always has the same answer as the corresponding multi-hop question, a_T = a. We use â_t/â to denote the predictions from single-/multi-hop questions generated with greedy decoding:

    â = argmax_y P(y | [c,] q; θ),    â_t = argmax_y P(y | [c_t,] q_t; θ).

We query models using all decomposed questions q_t and multi-hop questions q, which are concatenated with the corresponding context (c_t or c) for open-book settings to get predicted answers. All questions from ComplexWebQuestions and HotpotQA have two hops (i.e., T = 2), thus in the following sections we always use T = 2.
Pseudo-gold context for oracle-book models
Previous work clearly demonstrates that a better retrieval component usually implies higher open-book QA performance, as it results in more retrieved contexts with answers (Chen et al., 2017; Lee et al., 2019; Karpukhin et al., 2020). Therefore, we ablate out the influence of the retrieval
2 Since the test sets of both datasets are hidden, we use development sets for evaluation purposes. Break (Wolfson et al., 2020) is another testbed with multi-hop questions and manually decomposed questions. However, the decomposed questions are not annotated with answers, making it less appropriate for our study.