
vide a set of triples as the evidence information. For
example, for the question in Figure 1, the evidence
regards the dates of birth and death of two people,
e.g., (Maceo, date of death, July 4, 2001). We ar-
gue that simply requiring the models to detect a set
of triples, in this case, cannot explain the answer to
the question and cannot describe the full path from
the question to the answer; additional operations,
including calculations and comparisons, need to be
performed to obtain the final answer.
To deal with this issue, we introduce a dataset,
HieraDate (our data and code are available at
https://github.com/Alab-NII/HieraDate),
consisting of three probing tasks.
(1) The extraction task poses sub-questions that are
created by converting evidence triples into natu-
ral language questions. (2) The reasoning task concerns
the combination of triples, requiring comparison and
numerical reasoning, and precisely evaluates the
reasoning path of the main questions.
(3) The robustness task consists of examples gener-
ated by slightly changing the semantics (e.g., born
first to born later) of the original main questions.
The purpose of the robustness task is to ensure that
the models do not exploit superficial features in
answering questions.
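To make the intended reasoning path concrete, the following minimal Python sketch (purely illustrative; the people, dates, and variable names are invented and this is not part of our pipeline) shows how two evidence triples must be combined by comparison and date subtraction before the main question can be answered:

```python
from datetime import date

# Hypothetical evidence triples (subject, relation, object), following the
# (Maceo, date of death, July 4, 2001) format; people and dates are invented.
triples = {
    "Person A": {"date of birth": date(1840, 3, 2), "date of death": date(1901, 7, 4)},
    "Person B": {"date of birth": date(1845, 5, 19), "date of death": date(1899, 11, 23)},
}

# Extraction: each triple corresponds to a sub-question such as
# "When did Person A die?", whose answer is read directly from the triple.

# Comparison reasoning: combine two extracted dates.
born_first = min(triples, key=lambda p: triples[p]["date of birth"])

# Numerical reasoning: subtract dates to compare lifespans.
lifespans = {p: (v["date of death"] - v["date of birth"]).days for p, v in triples.items()}
lived_longer = max(lifespans, key=lifespans.get)

print(born_first, lived_longer)
```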
Our dataset is created by extending two exist-
ing multi-hop datasets, HotpotQA and 2Wiki. As
a first step and proof of concept, we start with
date information in comparison questions because
this information is available and straightforward
to handle. Moreover, based on the classification
of comparison questions in Min et al. (2019a),
all comparison questions on date information
require multi-hop reasoning to answer. We
then use our dataset to evaluate two leading models,
HGN (Fang et al., 2020) and NumNet+ (Ran et al.,
2019) in two settings: with and without fine-tuning
on our dataset. We also conduct experiments to in-
vestigate whether our probing questions are useful
for improving QA performance and whether our
dataset can be used for data augmentation.
Our experimental results reveal that existing
multi-hop models perform well in the extraction
and robustness tasks but fail in the reasoning task
when the models are not fine-tuned on our dataset.
We observe that with fine-tuning, HGN can per-
form well in the comparison reasoning task; mean-
while, NumNet+ struggles with subtracting two
dates, although it can subtract two numbers. Our
analysis shows that questions that require both
numerical and comparison reasoning are more
difficult than questions that require only comparison
reasoning. We also find that training with our
probing questions boosts QA performance in our
dataset, showing improvement from 77.1 to 82.7
F1 in HGN and from 84.6 to 94.9 F1 in NumNet+.
Moreover, our dataset can be used as augmenta-
tion data for HotpotQA, 2Wiki, and DROP (Dua
et al., 2019), which contributes to improving the
robustness of the models trained on these datasets.
Our results suggest that a more complete evaluation
of the reasoning path may be necessary for better
understanding of multi-hop models’ behavior. We
encourage future research to integrate our probing
questions when training and evaluating the models.
2 Related Work
In addition to Tang et al. (2021), Al-Negheimish
et al. (2021) and Geva et al. (2022) are similar to
our study. Al-Negheimish et al. (2021) evaluated
the previous models on the DROP dataset to test
their numerical reasoning ability. However, they
did not investigate the internal reasoning processes
of those models. Geva et al. (2022) proposed a
framework for creating new examples using the
perturbation of the reasoning path. Our work differs
in that their focus was on creating a framework,
which does not necessarily ensure the quality
of all generated perturbation samples. Moreover,
we investigate the QA process in depth, whereas Geva
et al. (2022) do not cover all the detailed question
types (e.g., the extraction task and the comparison
reasoning questions in Figure 1).
3 Dataset Construction
Our dataset is generated by using the two existing
multi-hop datasets, HotpotQA and 2Wiki (more
details are in Appendix B.1).
Obtain Date Questions
We first sampled the
comparison questions in HotpotQA and 2Wiki. We
then used a set of predefined keywords, such as
born first and lived longer, to obtain questions re-
garding date information. From the train and
dev. splits, respectively, we obtained 119 (only
114 samples were used after annotation) and 878 samples
in HotpotQA, and 984 and 8,745 samples in 2Wiki.
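For illustration, a minimal sketch of such a keyword filter (the keyword list and the example questions below are assumptions, not our released code):

```python
# Hypothetical keyword filter for selecting date-related comparison questions.
DATE_KEYWORDS = ("born first", "born later", "lived longer", "died first")

def is_date_comparison(question: str) -> bool:
    """Keep a comparison question only if it mentions a date-related keyword."""
    q = question.lower()
    return any(keyword in q for keyword in DATE_KEYWORDS)

comparison_questions = [
    "Who was born first, Person A or Person B?",           # kept
    "Which film was released in more countries, X or Y?",  # dropped
]
date_questions = [q for q in comparison_questions if is_date_comparison(q)]
print(date_questions)
```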
Generate Sub-questions and Sub-answers
In
2Wiki, we used the evidence in the form of triples
(e.g., (Maceo, date of death, July 4, 2001)) to auto-
matically generate sub-questions and sub-answers
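As an illustration of this triple-to-question conversion, a minimal sketch (the templates below are hypothetical and simplified, not the exact ones used to build the dataset):

```python
# Hypothetical templates mapping a relation to a sub-question pattern.
TEMPLATES = {
    "date of birth": "When was {subject} born?",
    "date of death": "When did {subject} die?",
}

def triple_to_subquestion(subject: str, relation: str, obj: str):
    """Map a (subject, relation, object) triple to a (sub-question, sub-answer) pair."""
    return TEMPLATES[relation].format(subject=subject), obj

print(triple_to_subquestion("Maceo", "date of death", "July 4, 2001"))
# -> ('When did Maceo die?', 'July 4, 2001')
```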