on automatic generation results in a highly imbalanced question type distribution; our manual generation method addresses this limitation.
• We validate our dataset on the downstream task and show that it effectively improves the MRC models’ performance, especially in predicting answers to unanswerable questions.
2 Related Work
Existing Indonesian MRC Dataset
While many MRC datasets are available in English (Rajpurkar et al., 2016; Trischler et al., 2017; Rajpurkar et al., 2018), the number of publicly available Indonesian MRC datasets is very limited. A shortcut to obtaining Indonesian MRC data is to machine-translate an English MRC dataset (Muis and Purwarianti, 2020), but this results in translation artifacts. We may avoid this by recruiting human annotators to translate the data manually; still, this leads to translationese, where the translated text appears awkward or unnatural (Clark et al., 2020). FacQA (Purwarianti et al., 2007), part of the IndoNLU benchmark (Wilie et al., 2020), is built from news articles. It has around 3K answerable questions, covering only a limited set of question categories: date, location, name, organization, person, and quantitative. Another dataset, TyDiQA-GoldP (Clark et al., 2020), a multilingual QA dataset constructed from Wikipedia, has about 5K Indonesian samples; it also focuses only on answerable questions. To date, there are no publicly available Indonesian MRC datasets that include unanswerable questions.
Human-Model Dataset Construction
Combining humans and models in dataset construction has mainly been applied to adversarial data, such as AdversarialQA (Bartolo et al., 2020) and AdversarialNLI (Nie et al., 2020). In this dynamic adversarial data collection, human annotators are asked to construct adversarial questions to fool the model. Such a human-model annotation pipeline has not been tried for unanswerable questions. Wang et al. (2021) analyzed the cost of different dataset labeling strategies, including the combination of GPT-3 (Brown et al., 2020) and human labeling. Although they included the MRC task in their experiments, they focused only on SQuAD 1.1 (Rajpurkar et al., 2016), which contains only answerable questions. The effectiveness of human-model labeling in the context of unanswerable questions thus remains unclear.
Unanswerable Question Generation
Various approaches have been proposed for generating answerable questions in English (Heilman and Smith, 2010; Du et al., 2017; Du and Cardie, 2018; Lewis et al., 2019; Klein and Nabi, 2019; Alberti et al., 2019; Puri et al., 2020; Shakeri et al., 2020), Indonesian (Muis and Purwarianti, 2020), and cross- or multi-lingual settings (Kumar et al., 2019; Chi et al., 2020; Shakeri et al., 2021; Riabi et al., 2021). Question generation techniques have also been applied to generate adversarial questions (Bartolo et al., 2021). However, work on unanswerable question generation remains limited. Zhu et al. (2019) proposed a Pair-to-Sequence (Pair2Seq) model that uses separate encoders for the paragraph and the answerable question. They utilized English word embeddings (i.e., GloVe (Pennington et al., 2014)) and character embeddings as features, with a bi-LSTM (Hochreiter and Schmidhuber, 1997) as the encoder. Although their model performed better than the rule-based and TF-IDF baselines, it still relied on traditional, non-contextual word embedding representations. In contrast, we utilize the mT5 model (Xue et al., 2021), which provides contextual representations covering 101 languages, including Indonesian. Our experiment (§5.1) confirms that our model outperforms Pair2Seq, demonstrating the advantage of our approach.
3 Dataset Collection Pipeline
We build the IDK-MRC dataset by combining model-generated unanswerable questions with human-written questions. As shown in Figure 1, our dataset collection pipeline has three stages: automatic generation, validation, and manual generation.
3.1 Automatic Generation
In this stage, we automatically construct unanswerable questions using a Question Generation (QG) model. We use translated SQuAD 2.0 (Rajpurkar et al., 2018) as the training data for the QG model. In the inference step, we use the answerable questions from TyDiQA-GoldP (Clark et al., 2020) as a starting point to add more unanswerable questions to our dataset. Our QG model architecture is illustrated in Figure 2.
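As an illustration, the snippet below is a minimal inference sketch of how unanswerable question candidates could be generated with mT5 through the HuggingFace Transformers library. The checkpoint name (google/mt5-small is only a loadable placeholder), the question/context prompt format, the helper name generate_unanswerable_candidates, and the decoding settings are illustrative assumptions rather than the exact configuration used in our pipeline; in practice, the model would first be fine-tuned on the translated SQuAD 2.0 data described above.

# Minimal sketch (not the exact pipeline code): generate unanswerable
# question candidates with an mT5 checkpoint via HuggingFace Transformers.
from transformers import AutoTokenizer, MT5ForConditionalGeneration

MODEL_NAME = "google/mt5-small"  # placeholder; use a checkpoint fine-tuned on translated SQuAD 2.0

tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
model = MT5ForConditionalGeneration.from_pretrained(MODEL_NAME)

def generate_unanswerable_candidates(answerable_question, passage, num_candidates=5):
    # Input format is an assumption: condition on an answerable question and
    # its paragraph, then decode several unanswerable question candidates.
    input_text = f"question: {answerable_question} context: {passage}"
    inputs = tokenizer(input_text, return_tensors="pt", truncation=True, max_length=512)
    outputs = model.generate(
        **inputs,
        max_length=64,
        num_beams=num_candidates,
        num_return_sequences=num_candidates,
        early_stopping=True,
    )
    return [tokenizer.decode(seq, skip_special_tokens=True) for seq in outputs]

Beam search with multiple returned sequences yields several candidates per answerable question; these candidates are then filtered in the validation stage, and any remaining gaps are covered by manual generation.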
Candidate Generation
We utilize the mT5 model (Xue et al., 2021) to generate unanswerable question candidates. We apply generate unans