
vide a set of triples as the evidence information. For
example, for the question in Figure 1, the evidence
regards the dates of birth and death of two people,
e.g., (Maceo, date of death, July 4, 2001). We ar-
gue that simply requiring the models to detect a set
of triples, in this case, cannot explain the answer to
the question and cannot describe the full path from
the question to the answer; additional operations,
including calculations and comparisons, need to be
performed to obtain the final answer.
To deal with this issue, we introduce a dataset,
HieraDate (our data and code are available at
https://github.com/Alab-NII/HieraDate),
consisting of three probing tasks.
(1) The extraction task poses sub-questions that are
created by converting evidence triples into natu-
ral language questions. (2) The reasoning task concerns
the combination of triples, requiring comparison and
numerical reasoning, and precisely evaluates the
reasoning path of the main questions.
(3) The robustness task consists of examples gener-
ated by slightly changing the semantics (e.g., born
first to born later) of the original main questions.
The purpose of the robustness task is to ensure that
the models do not exploit superficial features in
answering questions.
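To make the intended reasoning path concrete, the following minimal Python sketch (purely illustrative; the people, dates, and variable names are invented and this is not part of our pipeline) shows how two evidence triples must be combined by comparison and date subtraction before the main question can be answered:

```python
from datetime import date

# Hypothetical evidence triples (subject, relation, object), following the
# (Maceo, date of death, July 4, 2001) format; people and dates are invented.
triples = {
    "Person A": {"date of birth": date(1840, 3, 2), "date of death": date(1901, 7, 4)},
    "Person B": {"date of birth": date(1845, 5, 19), "date of death": date(1899, 11, 23)},
}

# Extraction: each triple corresponds to a sub-question such as
# "When did Person A die?", whose answer is read directly from the triple.

# Comparison reasoning: combine two extracted dates.
born_first = min(triples, key=lambda p: triples[p]["date of birth"])

# Numerical reasoning: subtract dates to compare lifespans.
lifespans = {p: (v["date of death"] - v["date of birth"]).days for p, v in triples.items()}
lived_longer = max(lifespans, key=lifespans.get)

print(born_first, lived_longer)
```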
Our dataset is created by extending two exist-
ing multi-hop datasets, HotpotQA and 2Wiki. As
a first step and proof of concept, we start with
date information in comparison questions because
this information is available and straightforward
to handle. Moreover, based on the classification
of comparison questions in Min et al. (2019a),
all comparison questions on date information
require multi-hop reasoning to answer. We
then use our dataset to evaluate two leading models,
HGN (Fang et al., 2020) and NumNet+ (Ran et al.,
2019) in two settings: with and without fine-tuning
on our dataset. We also conduct experiments to in-
vestigate whether our probing questions are useful
for improving QA performance and whether our
dataset can be used for data augmentation.
Our experimental results reveal that existing
multi-hop models perform well in the extraction
and robustness tasks but fail in the reasoning task
when the models are not fine-tuned on our dataset.
We observe that with fine-tuning, HGN can per-
form well in the comparison reasoning task; mean-
while, NumNet+ struggles with subtracting two
dates, although it can subtract two numbers. Our
analysis shows that questions that require both
numerical and comparison reasoning are more
difficult than questions that require only comparison
reasoning. We also find that training with our
probing questions boosts QA performance in our
dataset, showing improvement from 77.1 to 82.7
F1 in HGN and from 84.6 to 94.9 F1 in NumNet+.
Moreover, our dataset can be used as augmenta-
tion data for HotpotQA, 2Wiki, and DROP (Dua
et al., 2019), which contributes to improving the
robustness of the models trained on these datasets.
Our results suggest that a more complete evaluation
of the reasoning path may be necessary for better
understanding of multi-hop models’ behavior. We
encourage future research to integrate our probing
questions when training and evaluating the models.
2 Related Work
In addition to Tang et al. (2021), Al-Negheimish
et al. (2021) and Geva et al. (2022) are similar to
our study. Al-Negheimish et al. (2021) evaluated
the previous models on the DROP dataset to test
their numerical reasoning ability. However, they
did not investigate the internal reasoning processes
of those models. Geva et al. (2022) proposed a
framework for creating new examples using the
perturbation of the reasoning path. Our work differs
in that their focus was on creating a framework,
which does not necessarily ensure the quality
of all generated perturbation samples. Moreover,
we investigate the QA process in depth, whereas Geva
et al. (2022) do not cover all the detailed question
types (e.g., the extraction task and the comparison
reasoning questions in Figure 1).
3 Dataset Construction
Our dataset is generated by using the two existing
multi-hop datasets, HotpotQA and 2Wiki (more
details are in Appendix B.1).
Obtain Date Questions
We first sampled the
comparison questions in HotpotQA and 2Wiki. We
then used a set of predefined keywords, such as
born first and lived longer, to obtain questions re-
garding date information. From the train and
dev. splits, respectively, we obtained 119 (only
114 samples were used after annotation) and 878 samples
in HotpotQA, and 984 and 8,745 samples in 2Wiki.
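For illustration, a minimal sketch of such a keyword filter (the keyword list and the example questions below are assumptions, not our released code):

```python
# Hypothetical keyword filter for selecting date-related comparison questions.
DATE_KEYWORDS = ("born first", "born later", "lived longer", "died first")

def is_date_comparison(question: str) -> bool:
    """Keep a comparison question only if it mentions a date-related keyword."""
    q = question.lower()
    return any(keyword in q for keyword in DATE_KEYWORDS)

comparison_questions = [
    "Who was born first, Person A or Person B?",           # kept
    "Which film was released in more countries, X or Y?",  # dropped
]
date_questions = [q for q in comparison_questions if is_date_comparison(q)]
print(date_questions)
```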
Generate Sub-questions and Sub-answers
In
2Wiki, we used the evidence in the form of triples
(e.g., (Maceo, date of death, July 4, 2001)) to auto-
matically generate sub-questions and sub-answers
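As an illustration of this triple-to-question conversion, a minimal sketch (the templates below are hypothetical and simplified, not the exact ones used to build the dataset):

```python
# Hypothetical templates mapping a relation to a sub-question pattern.
TEMPLATES = {
    "date of birth": "When was {subject} born?",
    "date of death": "When did {subject} die?",
}

def triple_to_subquestion(subject: str, relation: str, obj: str):
    """Map a (subject, relation, object) triple to a (sub-question, sub-answer) pair."""
    return TEMPLATES[relation].format(subject=subject), obj

print(triple_to_subquestion("Maceo", "date of death", "July 4, 2001"))
# -> ('When did Maceo die?', 'July 4, 2001')
```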