How Well Do Multi-hop Reading Comprehension Models Understand Date Information?
Xanh Ho, Saku Sugawara, and Akiko Aizawa
The Graduate University for Advanced Studies, Kanagawa, Japan
National Institute of Informatics, Tokyo, Japan
{xanh, saku, aizawa}@nii.ac.jp
Abstract
Several multi-hop reading comprehension datasets have been proposed to resolve the issue of reasoning shortcuts, by which questions can be answered without performing multi-hop reasoning. However, the ability of multi-hop models to perform step-by-step reasoning when finding an answer to a comparison question remains unclear. It is also unclear how questions about the internal reasoning process are useful for training and evaluating question-answering (QA) systems. To evaluate the model precisely in a hierarchical manner, we first propose a dataset, HieraDate, with three probing tasks in addition to the main question: extraction, reasoning, and robustness. Our dataset is created by enhancing two previous multi-hop datasets, HotpotQA and 2WikiMultiHopQA, focusing on multi-hop questions on date information that involve both comparison and numerical reasoning. We then evaluate the ability of existing models to understand date information. Our experimental results reveal that the multi-hop models do not have the ability to subtract two dates even when they perform well in date comparison and number subtraction tasks. Other results reveal that our probing questions can help to improve the performance of the models (e.g., by +10.3 F1) on the main QA task and that our dataset can be used for data augmentation to improve the robustness of the models.
1 Introduction
Multi-hop reading comprehension (RC) requires a model to read and aggregate information from multiple paragraphs to answer a given question (Welbl et al., 2018). Several datasets have been proposed for this task, such as HotpotQA (Yang et al., 2018) and 2WikiMultiHopQA (2Wiki; Ho et al., 2020). Although the proposed models show promising performances, previous studies (Jiang and Bansal, 2019; Chen and Durrett, 2019; Min et al., 2019a; Tang et al., 2021) have demonstrated that existing multi-hop datasets contain reasoning shortcuts, in which the model can answer the question without performing multi-hop reasoning.
Figure 1: Example of a question in our dataset.
Question: Who lived longer, Maceo Anderson or Jacek Karpiński?
Paragraph A: Maceo Anderson. [1] Maceo Anderson (September 3, 1910 – July 4, 2001 in Los Angeles, California) expressed an interest in dancing at ... . [2] ...
Paragraph B: Jacek Karpiński. [3] Jacek Karpiński (9 April 1927 – 21 February 2010) was a Polish pioneer in computer engineering and ... . [4] ...
Answer: Maceo Anderson
Extraction Task:
What is the date of birth of Maceo Anderson?
What is the date of death of Maceo Anderson?
What is the date of birth of Jacek Karpiński?
What is the date of death of Jacek Karpiński?
Reasoning Task:
How old was Maceo Anderson when they died?
How old was Jacek Karpiński when they died?
Full-date version: Is a 90-year-10-month-1-day-old person older than a 82-year-10-month-12-day-old person?
Year-only version: Is a 90-year-old person older than a 82-year-old person?
Robustness Task:
Who lived shorter, Maceo Anderson or Jacek Karpiński?
There are two main types of questions in the previous multi-hop datasets: bridge and comparison. Tang et al. (2021) explored sub-questions in the question answering (QA) process for model evaluation. However, they only used the bridge questions in HotpotQA and did not fine-tune the previous multi-hop models on their dataset when performing the evaluation. Therefore, the ability of multi-hop models to perform step-by-step reasoning when finding an answer to a comparison question remains unclear.
HotpotQA provides sentence-level supporting facts (SFs) to explain the answer. However, as discussed in Inoue et al. (2020) and Ho et al. (2020), the sentence-level SFs cannot fully evaluate the reasoning ability of the models; to solve this issue, in addition to sentence-level SFs, these studies provide a set of triples as the evidence information. For example, for the question in Figure 1, the evidence regards the dates of birth and death of two people, e.g., (Maceo, date of death, July 4, 2001). We argue that simply requiring the models to detect a set of triples, in this case, cannot explain the answer to the question and cannot describe the full path from the question to the answer; additional operations, including calculations and comparisons, need to be performed to obtain the final answer.
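For instance, answering the question in Figure 1 requires subtracting each person's date of birth from their date of death and then comparing the two lifespans. The following is a minimal sketch of this reasoning step using Python's standard library; it is purely illustrative and is not the code released with the dataset.

```python
from datetime import date

def lifespan_days(born: date, died: date) -> int:
    """Numerical reasoning step: subtract two dates to get a lifespan in days."""
    return (died - born).days

# Evidence triples from Figure 1, converted to date objects.
anderson = lifespan_days(date(1910, 9, 3), date(2001, 7, 4))
karpinski = lifespan_days(date(1927, 4, 9), date(2010, 2, 21))

# Comparison reasoning step: the longer lifespan determines the answer.
answer = "Maceo Anderson" if anderson > karpinski else "Jacek Karpiński"
print(answer)  # Maceo Anderson
```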
To deal with this issue, we introduce a dataset, HieraDate (our data and code are available at https://github.com/Alab-NII/HieraDate), consisting of three probing tasks. (1) The extraction task poses sub-questions that are created by converting evidence triples into natural language questions. (2) The reasoning task is pertinent to the combination of triples, involving comparison and numerical reasoning that precisely evaluate the reasoning path of the main questions. (3) The robustness task consists of examples generated by slightly changing the semantics (e.g., born first to born later) of the original main questions. The purpose of the robustness task is to ensure that the models do not exploit superficial features in answering questions.
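To make the hierarchy concrete, one way to picture a HieraDate example is as a main question bundled with its probing questions, as sketched below with the content of Figure 1. This JSON-like layout only illustrates the structure described above and is not necessarily the released file format.

```python
example = {
    "main_question": "Who lived longer, Maceo Anderson or Jacek Karpiński?",
    "answer": "Maceo Anderson",
    "extraction": [  # sub-questions derived from the evidence triples
        {"q": "What is the date of birth of Maceo Anderson?", "a": "September 3, 1910"},
        {"q": "What is the date of death of Maceo Anderson?", "a": "July 4, 2001"},
    ],
    "reasoning": [  # combine triples: numerical reasoning, then comparison reasoning
        {"q": "How old was Maceo Anderson when they died?", "a": "90 years, 10 months, 1 day"},
        {"q": "Is a 90-year-old person older than a 82-year-old person?", "a": "yes"},
    ],
    "robustness": [  # the main question with its semantics flipped
        {"q": "Who lived shorter, Maceo Anderson or Jacek Karpiński?", "a": "Jacek Karpiński"},
    ],
}
```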
Our dataset is created by extending two existing multi-hop datasets, HotpotQA and 2Wiki. As the first step of the proof of concept, we start with the date information through comparison questions because this information is available and straightforward to handle. Moreover, based on the classification of comparison questions in Min et al. (2019a), all comparison questions on date information require multi-hop reasoning for answering. We then use our dataset to evaluate two leading models, HGN (Fang et al., 2020) and NumNet+ (Ran et al., 2019), in two settings: with and without fine-tuning on our dataset. We also conduct experiments to investigate whether our probing questions are useful for improving QA performance and whether our dataset can be used for data augmentation.
Our experimental results reveal that existing multi-hop models perform well in the extraction and robustness tasks but fail in the reasoning task when the models are not fine-tuned on our dataset. We observe that with fine-tuning, HGN can perform well in the comparison reasoning task; meanwhile, NumNet+ struggles with subtracting two dates, although it can subtract two numbers. Our analysis shows that questions that require both numerical and comparison reasoning are more difficult than questions that require only comparison reasoning. We also find that training with our probing questions boosts QA performance in our dataset, showing improvement from 77.1 to 82.7 F1 in HGN and from 84.6 to 94.9 F1 in NumNet+. Moreover, our dataset can be used as augmentation data for HotpotQA, 2Wiki, and DROP (Dua et al., 2019), which contributes to improving the robustness of the models trained on these datasets. Our results suggest that a more complete evaluation of the reasoning path may be necessary for a better understanding of multi-hop models' behavior. We encourage future research to integrate our probing questions when training and evaluating the models.
2 Related Work
In addition to Tang et al. (2021), the studies of Al-Negheimish et al. (2021) and Geva et al. (2022) are similar to ours. Al-Negheimish et al. (2021) evaluated previous models on the DROP dataset to test their numerical reasoning ability. However, they did not investigate the internal reasoning processes of those models. Geva et al. (2022) proposed a framework for creating new examples by perturbing the reasoning path. Our work differs in that their focus was on creating a framework, which does not necessarily ensure the quality of all generated perturbation samples. Moreover, we investigate the QA process in depth, whereas Geva et al. (2022) do not include all detailed questions (e.g., they do not include the extraction task and comparison reasoning questions in Figure 1).
3 Dataset Construction
Our dataset is generated from two existing multi-hop datasets, HotpotQA and 2Wiki (more details are in Appendix B.1).
Obtain Date Questions
We first sampled the comparison questions in HotpotQA and 2Wiki. We then used a set of predefined keywords, such as born first and lived longer, to obtain questions regarding date information. From the train and dev. splits, respectively, we obtained 119 (after annotation, only 114 samples were used) and 878 samples in HotpotQA, and 984 and 8,745 samples in 2Wiki.
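A simple way to realize this filtering step is sketched below; it assumes questions are available as plain strings, and the keyword list is only a partial, illustrative one (only "born first" and "lived longer" are named in the text).

```python
# Partial, illustrative keyword list; the paper's full list may differ.
DATE_KEYWORDS = ["born first", "born later", "lived longer", "lived shorter",
                 "died first", "died later"]

def is_date_comparison(question: str) -> bool:
    """Keep only comparison questions that hinge on date information."""
    q = question.lower()
    return any(keyword in q for keyword in DATE_KEYWORDS)

questions = [
    "Who lived longer, Maceo Anderson or Jacek Karpiński?",        # kept
    "Are both Coldplay and Pierre Bouvier from the same country?",  # dropped: not about dates
]
date_questions = [q for q in questions if is_date_comparison(q)]
```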
Generate Sub-questions and Sub-answers
In 2Wiki, we used the evidence in the form of triples (e.g., (Maceo, date of death, July 4, 2001)) to automatically generate sub-questions and sub-answers.
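A minimal sketch of such a template-based conversion is shown below. The template wording follows the extraction questions in Figure 1; the function name and the (entity, relation, value) triple layout are illustrative assumptions rather than the dataset's actual implementation.

```python
# Hypothetical templates keyed by relation name, worded as in Figure 1.
TEMPLATES = {
    "date of birth": "What is the date of birth of {entity}?",
    "date of death": "What is the date of death of {entity}?",
}

def triple_to_subquestion(triple):
    """Turn an evidence triple (entity, relation, value) into a sub-question and its answer."""
    entity, relation, value = triple
    return TEMPLATES[relation].format(entity=entity), value

q, a = triple_to_subquestion(("Maceo Anderson", "date of death", "July 4, 2001"))
# q == "What is the date of death of Maceo Anderson?", a == "July 4, 2001"
```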