
Type | Questions (hop1, hop2, and multi-hop) | Answers

Composition
  q1: Return the country where Limonese Creole is spoken. | Costa Rica
  q2: Which continent is Costa Rica located? | North America
  q:  On which continent is Limonese Creole spoken? | North America

Conjunction
  q1: What team is Reggie Bush on 2011? | Miami Dolphins, New Orleans Saints
  q2: Which one of the following is the team won the super bowl XLIV championship: Miami Dolphins, New Orleans Saints? | New Orleans Saints
  q:  What team that won the super bowl XLIV championship was Reggie Bush in 2011? | New Orleans Saints

Superlative
  q1: What countries does the Niger River flow through? | Benin, Guinea, Mali, Niger, Nigeria
  q2: Which one of the following country calling code is smallest: Benin, Guinea, Mali, Niger, Nigeria? | Mali
  q:  What country with the smallest calling code does the Niger River flow through? | Mali

Comparative
  q1: What were Hitler's parents names? | Alois Hitler, Klara Hitler
  q2: Which one of the following person's date of death is after 1903-01-03: Alois Hitler, Klara Hitler? | Klara Hitler
  q:  Which of Hitler's parents died after 3 January 1903? | Klara Hitler
Table 1: Each multi-hop question q from ComplexWebQuestions is decomposed into two single-hop questions q1 and q2. Underlined entities in the second single-hop questions are actually the answer to the first hop.
(Lewis et al., 2020a), which is also an encoder-decoder model that encodes both context and question, and generates answers autoregressively.
2.2 Multi-hop Questions and Decompositions
To understand multi-hop reasoning in generative
QA models, we propose to query models using
both multi-hop questions and their decompositions
into multiple single-hop questions, and perform
analysis based on the predictions.
To this end, we choose the ComplexWebQuestions dataset (Talmor and Berant, 2018) as our major testbed, as it contains multi-hop questions based on simple questions from the WebQuestionsSP dataset (Yih et al., 2016), and we can leverage simple heuristics to obtain decomposed single-hop
questions and corresponding answers. Another ad-
vantage of ComplexWebQuestions is that it con-
tains four types of questions: composition, con-
junction, superlative, and comparative. This allows
us to perform fine-grained analysis over these cate-
gories. Specifically, we follow heuristics in Talmor
and Berant (2018) to generate decompositions. For
the composition type, they use questions from We-
bQuestionsSP as the second hop, and replace an
entity in it with a relational phrase to generate multi-
hop questions. We revert this process to get the
first-hop question. For the other three types, they
use questions from WebQuestionsSP with multiple
answers as the first hop, and add additional condi-
tions to form the multi-hop questions. We extract
those conditions and use the following template
to generate the second hop question: “Which one
of the following [condition]: [candidate answers]”.
Tab. 1 includes examples of multi-hop questions
and their decompositions of four types.
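The template for the conjunction, superlative, and comparative types can be sketched as a small helper; the function name and signature below are our own illustration, not code from the dataset release:

```python
def make_second_hop(condition: str, candidate_answers: list[str]) -> str:
    """Instantiate the template
    "Which one of the following [condition]: [candidate answers]"
    from the extracted condition and the first-hop answer set.
    (Illustrative sketch; not the authors' actual implementation.)"""
    return f"Which one of the following {condition}: {', '.join(candidate_answers)}?"

# The superlative example from Table 1:
q2 = make_second_hop(
    "country calling code is smallest",
    ["Benin", "Guinea", "Mali", "Niger", "Nigeria"],
)
```

Applied to the superlative row of Table 1, this reproduces the second-hop question "Which one of the following country calling code is smallest: Benin, Guinea, Mali, Niger, Nigeria?".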
We also use another small dataset from Tang et al. (2021) to test the generality of models, where a subset of multi-hop questions from HotpotQA (Yang et al., 2018) are manually annotated with decompositions. This dataset only contains a single type of question, which is composition. ComplexWebQuestions has 27,639/3,519 questions in the training/development set, and HotpotQA has 1,000 questions in the development set.2
2.3 Answer Generation and Evaluation
We use q_t, t ∈ {1, ..., T} to denote the t-th decomposed single-hop question for a multi-hop question q with T hops. Correspondingly, we use a_t to denote answers and c_t to denote retrieved context for the single-hop question q_t. Since the last single-hop question always has the same answer as the corresponding multi-hop question, a_T = a. We use â_t/â to denote the predictions from single-/multi-hop questions generated with greedy decoding:

    â = argmax_y P(y | [c,] q; θ),    â_t = argmax_y P(y | [c_t,] q_t; θ).

We query models using all decomposed questions q_t and multi-hop questions q, which are concatenated with the corresponding context (c_t or c) for open-book settings to get predicted answers. All questions from ComplexWebQuestions and HotpotQA have two hops (i.e., T = 2), thus in the following sections we always use T = 2.
Pseudo-gold context for oracle-book models
Previous work clearly demonstrates that a better retrieval component usually implies higher open-book QA performance, as it results in more retrieved contexts with answers (Chen et al., 2017; Lee et al., 2019; Karpukhin et al., 2020). Therefore, we ablate out the influence of the retrieval
2 Since the test sets of both datasets are hidden, we use development sets for evaluation purposes. Break (Wolfson et al., 2020) is another testbed with multi-hop questions and manually decomposed questions. However, the decomposed questions are not annotated with answers, making it less appropriate for our study.