Understanding and Improving Zero-shot Multi-hop Reasoning
in Generative Question Answering
Zhengbao Jiang†, Jun Araki‡, Haibo Ding‡, Graham Neubig†
†Language Technologies Institute, Carnegie Mellon University
‡Bosch Research
{zhengbaj,gneubig}@cs.cmu.edu
{jun.araki,haibo.ding}@us.bosch.com
Haibo Ding is now at Amazon.
Abstract

Generative question answering (QA) models generate answers to questions either solely based on the parameters of the model (the closed-book setting) or by additionally retrieving relevant evidence (the open-book setting). Generative QA models can answer some relatively complex questions, but the mechanism through which they do so is still poorly understood. We perform several studies aimed at better understanding the multi-hop reasoning capabilities of generative QA models. First, we decompose multi-hop questions into multiple corresponding single-hop questions, and find marked inconsistency in QA models' answers on these pairs of ostensibly identical question chains. Second, we find that models lack zero-shot multi-hop reasoning ability: when trained only on single-hop questions, models generalize poorly to multi-hop questions. Finally, we demonstrate that it is possible to improve models' zero-shot multi-hop reasoning capacity through two methods that approximate real multi-hop natural language (NL) questions by training on either concatenations of single-hop questions or logical forms (SPARQL). In sum, these results demonstrate that multi-hop reasoning does not emerge naturally in generative QA models, but can be encouraged by advances in training or modeling techniques.1

1 Code is available at https://github.com/jzbjyb/multihop.
1 Introduction
[Figure 1: Probing generative closed- and open-book QA models with both multi-hop (q) and their component single-hop questions (q1, q2). Example from the figure: q = "Which part of Georgia does the artist that recorded Party Ain't Over live?", with decomposition q1 = "Return the artist who recorded Party Ain't Over." and q2 = "Where in Georgia does Usher live?".]

Empowered by large-scale pre-trained language models (LMs) (Devlin et al., 2019; Liu et al., 2019; Lewis et al., 2020a; Raffel et al., 2020), recent years have seen much progress on generative question answering (QA), where LMs generate answers given questions in an end-to-end fashion. While most works only demonstrate the performance of such generative QA models on simple questions (Joshi et al., 2017; Kwiatkowski et al., 2019), there has been some indication that these models can also answer complex questions that theoretically require multi-hop reasoning (Xiong et al., 2020), sometimes to an impressive degree. For example, Brown et al. (2020) demonstrate strong performance of LMs on multi-hop reasoning tasks such as DROP (Dua et al., 2019), which requires discrete reasoning and numeracy. On the other hand, many argue that LM-based QA models are not actually performing any reasoning, but rather (sophisticated) pattern matching and data memorization (Marcus and Davis, 2020). Simultaneously, in the context of extractive QA models that select answers from the provided context, several works have demonstrated that such models can leverage superficial signals to return correct answers even when the context does not contain all the supporting facts (Chen and Durrett, 2019; Min et al., 2019a).

In this paper, we perform a closer examination of the multi-hop reasoning capabilities of generative QA models. To do so, we take multi-hop questions and their component single-hop questions to directly query generative QA models, studying their multi-hop reasoning ability. Specifically, we use multi-hop questions from the ComplexWebQuestions (Talmor and Berant, 2018) and HotpotQA (Yang et al., 2018; Tang et al., 2021) datasets as the testbed, and generate decomposed single-hop questions using heuristics (§ 2.2). We examine two types of generative QA models, namely closed-book (Roberts et al., 2020; Khashabi et al., 2020) and open-book (Guu et al., 2020; Lewis et al., 2020b; Izacard and Grave, 2021; Xiong et al., 2020) QA models, which do not and do refer to external knowledge when generating the answer, respectively. Specifically, we use UnifiedQA (Khashabi et al., 2020) as a representative closed-book model, and RAG (Lewis et al., 2020b) as a representative open-book model (§ 2.1). We first ask:
RQ1 Is the correctness of decomposed single-hop questions a necessary and sufficient condition for the correctness of multi-hop questions? (§ 3.2) Are answers to multi-hop questions and chains of decomposed questions consistent? (§ 3.3)
RQ2 Do models trained on single-hop questions demonstrate zero-shot generalization to multi-hop questions? (§ 4)
We find that generative QA models, even those close to the state-of-the-art, generally do not demonstrate robust multi-hop reasoning abilities, with success on multi-hop questions largely a result of taking shortcuts rather than true multi-hop reasoning. Zero-shot multi-hop reasoning ability does not emerge naturally from training on single-hop questions, which motivates our final question:
RQ3 Can we improve models' zero-shot multi-hop reasoning capacity by training on approximations of real multi-hop questions? (§ 4)
Motivated by the fact that pre-training on massive text endows LMs with the ability to identify semantically similar expressions, our first method uses concatenated decomposed single-hop questions to approximate real multi-hop questions. Our second method is inspired by recent work teaching LMs complex reasoning capabilities through neural execution of logical forms, e.g., by training neural models to execute SQL queries (Liu et al., 2021). We hypothesize that the ability to perform multi-hop reasoning can also potentially be learned from logical forms without reliance on NL questions. To this end, we propose to use SPARQL, a standard query language over knowledge bases, as our logical form to endow generative QA models with the ability to perform multi-hop reasoning, and we examine whether learning to execute SPARQL transfers to the ability to answer NL multi-hop questions. Both methods lead to significant improvements in zero-shot multi-hop reasoning performance, and further improvements are obtained when both are combined, opening possibilities for future work (§ 6).
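For intuition only, the sketch below shows what a two-hop (logical form, answer) training pair of the kind just described might look like; the predicate and entity identifiers are placeholders rather than the actual Freebase identifiers that accompany ComplexWebQuestions, and the exact input formatting used in our experiments may differ.

```python
# Hypothetical (SPARQL, answer) training pair for the Figure 1 question
# "Which part of Georgia does the artist that recorded Party Ain't Over live?".
# Identifiers are illustrative placeholders, not real Freebase IDs.
sparql_input = (
    "SELECT ?place WHERE { "
    "?artist ex:recorded ex:Party_Aint_Over . "  # hop 1: the recording artist
    "?artist ex:residence ?place . }"            # hop 2: where that artist lives
)
target_answer = "Atlanta"
```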
2 Generative Question Answering
In this section, we briefly introduce generative QA models and multi-hop QA datasets. Then we elaborate on how we use multi-hop and decomposed questions to perform experiments.
2.1 Generative QA Models
There are two main classes of generative QA models: closed-book and open-book. Closed-book QA models usually consist of a sequence-to-sequence model that takes in a question $q$ and calculates the probability of an answer $a$ based on model parameters $\theta$ (Roberts et al., 2020; Khashabi et al., 2020):

$$P(a \mid q; \theta) = \prod_{i=1}^{|a|} P(a_i \mid q, a_{<i}; \theta).$$
Because these models can only refer to model parameters, any relevant information must be stored in the parameters (Roberts et al., 2020). Open-book QA models first retrieve relevant context $c$ from external resources, then generate answers using both the question and the context (Guu et al., 2020; Lewis et al., 2020b; Izacard and Grave, 2021):

$$P(a \mid c, q; \theta) = \prod_{i=1}^{|a|} P(a_i \mid c, q, a_{<i}; \theta).$$
We examine both types of models since we hypothesize that the difference in inputs might lead to different mechanisms of multi-hop reasoning.
Specifically, as our example of a closed-book model we use the UnifiedQA model of Khashabi et al. (2020). UnifiedQA is based on the T5 model (Raffel et al., 2020), an encoder-decoder model trained on the Colossal Clean Crawled Corpus (C4) with a denoising objective. UnifiedQA further fine-tunes T5 on a variety of QA datasets by converting different QA formats into a unified sequence-to-sequence format.
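As a concrete illustration of how the closed-book model is queried, below is a minimal sketch using the Hugging Face transformers library; the checkpoint name and generation settings are assumptions for illustration and may differ from the exact configuration used in our experiments.

```python
from transformers import T5ForConditionalGeneration, T5Tokenizer

# Assumed checkpoint name; UnifiedQA is released in several T5 sizes.
MODEL_NAME = "allenai/unifiedqa-t5-base"
tokenizer = T5Tokenizer.from_pretrained(MODEL_NAME)
model = T5ForConditionalGeneration.from_pretrained(MODEL_NAME)

def answer(question: str) -> str:
    # Closed-book setting: the input is the question alone (UnifiedQA expects
    # lower-cased text); the answer is generated with greedy decoding.
    input_ids = tokenizer(question.lower(), return_tensors="pt").input_ids
    output_ids = model.generate(input_ids, max_length=32)
    return tokenizer.decode(output_ids[0], skip_special_tokens=True)

print(answer("Return the artist who recorded Party Ain't Over."))  # hop 1
print(answer("Where in Georgia does Usher live?"))                 # hop 2
print(answer("Which part of Georgia does the artist that recorded "
             "Party Ain't Over live?"))                            # multi-hop
```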
We use the RAG model of Lewis et al. (2020b) as our example of an open-book QA model, which consists of a retriever for searching relevant passages $p$, and a generator which generates answers $a$ given both $p$ and $q$. The retriever is based on the dense passage retrieval model (DPR) (Karpukhin et al., 2020), and the generator is based on BART (Lewis et al., 2020a), which is also an encoder-decoder model that encodes both the context and the question, and generates answers autoregressively.

Table 1: Each multi-hop question q from ComplexWebQuestions is decomposed into two single-hop questions q1 and q2. Entities highlighted in the second single-hop questions (underlined in the original table) are the answer to the first hop.

Composition
  q1: Return the country where Limonese Creole is spoken. (Answer: Costa Rica)
  q2: Which continent is Costa Rica located? (Answer: North America)
  q:  On which continent is Limonese Creole spoken? (Answer: North America)

Conjunction
  q1: What team is Reggie Bush on 2011? (Answers: Miami Dolphins, New Orleans Saints)
  q2: Which one of the following is the team won the super bowl XLIV championship: Miami Dolphins, New Orleans Saints? (Answer: New Orleans Saints)
  q:  What team that won the super bowl XLIV championship was Reggie Bush in 2011? (Answer: New Orleans Saints)

Superlative
  q1: What countries does the Niger River flow through? (Answers: Benin, Guinea, Mali, Niger, Nigeria)
  q2: Which one of the following country calling code is smallest: Benin, Guinea, Mali, Niger, Nigeria? (Answer: Mali)
  q:  What country with the smallest calling code does the Niger River flow through? (Answer: Mali)

Comparative
  q1: What were Hitler's parents names? (Answers: Alois Hitler, Klara Hitler)
  q2: Which one of the following person's date of death is after 1903-01-03: Alois Hitler, Klara Hitler? (Answer: Klara Hitler)
  q:  Which of Hitler's parents died after 3 January 1903? (Answer: Klara Hitler)
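For the open-book setting, the following is a minimal sketch of querying a RAG-style model with transformers; the checkpoint name and the small dummy retrieval index are illustrative stand-ins rather than the configuration used in our experiments.

```python
from transformers import RagRetriever, RagTokenForGeneration, RagTokenizer

# Assumed checkpoint; a dummy retrieval index keeps the example lightweight.
MODEL_NAME = "facebook/rag-token-nq"
tokenizer = RagTokenizer.from_pretrained(MODEL_NAME)
retriever = RagRetriever.from_pretrained(
    MODEL_NAME, index_name="exact", use_dummy_dataset=True
)
model = RagTokenForGeneration.from_pretrained(MODEL_NAME, retriever=retriever)

# The retriever fetches passages p for the question q; the BART-based generator
# then conditions on both q and the retrieved p to produce the answer a.
question = "Where in Georgia does Usher live?"
inputs = tokenizer(question, return_tensors="pt")
generated_ids = model.generate(input_ids=inputs["input_ids"])
print(tokenizer.batch_decode(generated_ids, skip_special_tokens=True))
```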
2.2 Multi-hop Questions and Decompositions
To understand multi-hop reasoning in generative QA models, we propose to query models using both multi-hop questions and their decompositions into multiple single-hop questions, and to perform analysis based on the predictions.
To this end, we choose the ComplexWebQuestions dataset (Talmor and Berant, 2018) as our major testbed, as it contains multi-hop questions based on simple questions from the WebQuestionsSP dataset (Yih et al., 2016), and we can leverage simple heuristics to obtain decomposed single-hop questions and corresponding answers. Another advantage of ComplexWebQuestions is that it contains four types of questions: composition, conjunction, superlative, and comparative. This allows us to perform fine-grained analysis over these categories. Specifically, we follow the heuristics of Talmor and Berant (2018) to generate decompositions. For the composition type, they use questions from WebQuestionsSP as the second hop, and replace an entity in it with a relational phrase to generate multi-hop questions; we revert this process to get the first-hop question. For the other three types, they use questions from WebQuestionsSP with multiple answers as the first hop, and add additional conditions to form the multi-hop questions; we extract those conditions and use the following template to generate the second-hop question: "Which one of the following [condition]: [candidate answers]". Tab. 1 includes examples of multi-hop questions and their decompositions for all four types.
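A rough sketch of the second-hop template for the non-composition types is shown below; the function name and argument structure are hypothetical, and the actual decomposition heuristics follow Talmor and Berant (2018).

```python
def build_second_hop(condition: str, candidate_answers: list[str]) -> str:
    # Template from § 2.2: "Which one of the following [condition]: [candidate answers]"
    return f"Which one of the following {condition}: {', '.join(candidate_answers)}?"

# Example (superlative type from Table 1):
print(build_second_hop(
    "country calling code is smallest",
    ["Benin", "Guinea", "Mali", "Niger", "Nigeria"],
))
# -> Which one of the following country calling code is smallest: Benin, Guinea, Mali, Niger, Nigeria?
```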
We also use another small dataset from Tang et al. (2021) to test the generality of models, where a subset of multi-hop questions from HotpotQA (Yang et al., 2018) are manually annotated with decompositions. This dataset only contains a single type of question, which is composition. ComplexWebQuestions has 27,639/3,519 questions in the training/development set, and HotpotQA has 1,000 questions in the development set.2
2.3 Answer Generation and Evaluation
We use $q_t$, $t \in \{1, \ldots, T\}$ to denote the $t$-th decomposed single-hop question for a multi-hop question $q$ with $T$ hops. Correspondingly, we use $a_t$ to denote answers and $c_t$ to denote retrieved context for the single-hop question $q_t$. Since the last single-hop question always has the same answer as the corresponding multi-hop question, $a_T = a$. We use $\hat{a}_t$ / $\hat{a}$ to denote the predictions for single-/multi-hop questions generated with greedy decoding:

$$\hat{a} = \arg\max_y P(y \mid [c,]\, q; \theta), \qquad \hat{a}_t = \arg\max_y P(y \mid [c_t,]\, q_t; \theta).$$

We query models using all decomposed questions $q_t$ and multi-hop questions $q$, which are concatenated with the corresponding context ($c_t$ or $c$) in the open-book setting, to get predicted answers. All questions from ComplexWebQuestions and HotpotQA have two hops (i.e., $T = 2$), thus in the following sections we always use $T = 2$.
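As a minimal sketch of this querying procedure, the loop below feeds each decomposed question and the multi-hop question to the model and compares the predictions. It reuses the hypothetical answer() helper from the UnifiedQA sketch in § 2.1, and the data field names are assumptions for illustration.

```python
def probe(example: dict) -> dict:
    # example: {"multihop": q, "hops": [q1, q2], "gold": a}  (assumed field names)
    def norm(s: str) -> str:
        return s.strip().lower()

    hop_preds = [answer(q_t) for q_t in example["hops"]]  # \hat{a}_1, \hat{a}_2
    multi_pred = answer(example["multihop"])              # \hat{a}
    return {
        "hop_preds": hop_preds,
        "multi_pred": multi_pred,
        # The last hop shares its gold answer with the multi-hop question (a_T = a),
        # so consistent reasoning should yield matching predictions.
        "consistent": norm(hop_preds[-1]) == norm(multi_pred),
        "multi_correct": norm(multi_pred) == norm(example["gold"]),
    }
```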
Pseudo-gold context for oracle-book models. Previous work clearly demonstrates that a better retrieval component usually implies higher open-book QA performance, as it results in more retrieved contexts with answers (Chen et al., 2017; Lee et al., 2019; Karpukhin et al., 2020). Therefore, we ablate out the influence of the retrieval

2 Since the test sets of both datasets are hidden, we use development sets for evaluation purposes. Break (Wolfson et al., 2020) is another testbed with multi-hop questions and manually decomposed questions. However, the decomposed questions are not annotated with answers, making it less appropriate for our study.