IDK-MRC: Unanswerable Questions for Indonesian Machine Reading
Comprehension
Rifki Afina Putri
School of Computing
KAIST, South Korea
rifkiaputri@kaist.ac.kr
Alice Oh
School of Computing
KAIST, South Korea
alice.oh@kaist.edu
Abstract
Machine Reading Comprehension (MRC) has become one of the essential tasks in Natural Language Understanding (NLU), as it is often included in NLU benchmarks (Liang et al., 2020; Wilie et al., 2020). However, most MRC datasets contain only answerable questions, overlooking the importance of unanswerable questions. MRC models trained only on answerable questions will select the span that is most likely to be the answer, even when the answer does not actually exist in the given passage (Rajpurkar et al., 2018). This problem persists especially in medium- to low-resource languages like Indonesian. Existing Indonesian MRC datasets (Purwarianti et al., 2007; Clark et al., 2020) are still inadequate because of their small size and limited question types, i.e., they only cover answerable questions. To fill this gap, we build a new Indonesian MRC dataset called I(n)don’tKnow-MRC (IDK-MRC) by combining automatic and manual unanswerable question generation to minimize the cost of manual dataset construction while maintaining dataset quality. Combined with the existing answerable questions, IDK-MRC consists of more than 10K questions in total. Our analysis shows that our dataset significantly improves the performance of Indonesian MRC models, showing a large improvement for unanswerable questions.¹
1 Introduction
Machine Reading Comprehension (MRC) is a task where a machine is asked to read a given passage and answer a question based on the passage. Several English MRC datasets have been widely used, including SQuAD (Rajpurkar et al., 2016) and NewsQA (Trischler et al., 2017). However, MRC models that do well on those datasets are not guaranteed to be robust. Rajpurkar et al. (2018) highlight a problem with the SQuAD dataset: it contains only answerable questions, so a model trained on it tends to select a span without carefully checking whether the passage actually contains the answer. SQuAD 2.0 was then built by manually adding new unanswerable questions to the existing SQuAD dataset (Rajpurkar et al., 2018).
¹ The code and dataset of IDK-MRC are available at https://github.com/rifkiaputri/IDK-MRC
Figure 1: Our dataset collection pipeline.
While SQuAD 2.0 is widely used for evaluating English models, similar datasets for other languages are still limited, hindering progress on the MRC task for these languages. Indonesian has around 198 million speakers,² but despite its popularity, there are too few Indonesian MRC datasets. For instance, the FacQA dataset (Purwarianti et al., 2007) has only around 3K samples, and the TyDiQA-GoldP dataset (Clark et al., 2020) has around 5K samples. Furthermore, both datasets contain only answerable questions, ignoring the importance of incorporating unanswerable questions. Therefore, building an Indonesian MRC dataset that covers unanswerable questions is necessary.
² https://www.babbel.com/en/magazine/how-many-people-speak-indonesian-where-is-it-spoken (Accessed Jan 2022)
arXiv:2210.13778v1 [cs.CL] 25 Oct 2022

Type: Negation
Description: Negation word inserted or removed
Context: Kambing memiliki lemak dalam kandungan susunya. (Goats have fat in their milk.)
Ans Q: Apakah kandungan yang ada dalam susu kambing? (What are the ingredients in goat’s milk?)
UnAns Q: Apakah kandungan yang tidak ada dalam susu kambing? (What are the ingredients that do not exist in goat’s milk?)

Type: Antonym
Description: Antonym used
Context: Aristokrasi adalah sebuah kelas sosial yang tertinggi di masyarakat. (Aristocracy is the highest social class in society.)
Ans Q: Apakah nama kelas sosial tertinggi? (What is the name of the highest social class?)
UnAns Q: Apa nama kelas sosial terendah? (What is the name of the lowest social class?)

Type: Entity Swap
Description: Entity, date, number, or term replaced with other entity, date, number, or term
Context: Salah satu kandidat standar untuk 4G yang dikomersilkan di dunia yaitu standar Long Term Evolution (LTE) (Swedia sejak 2009). (One of the standards for 4G commercialized in the world is the Long Term Evolution (LTE) standard (Sweden since 2009).)
Ans Q: Di manakah LTE pertama kali diciptakan? (Where was LTE first invented?)
UnAns Q: Di manakah 3G pertama diciptakan? (Where was 3G first invented?)

Type: Question Tag Swap
Description: Question tag replaced with other question tag
Context: Suaka margasatwa Muara Angke adalah sebuah kawasan konservasi di wilayah hutan bakau di pesisir utara Jakarta. (Muara Angke Wildlife Sanctuary is a conservation area in the mangrove forest area on the north coast of Jakarta.)
Ans Q: Di mana Suaka margasatwa Muara Angke dibangun? (Where was Muara Angke Wildlife Sanctuary built?)
UnAns Q: Kapan Suaka margasatwa Muara Angke dibangun? (When was Muara Angke Wildlife Sanctuary built?)

Type: Specific Condition
Description: Asks for specific condition that is not satisfied by the information in the paragraph
Context: Bon Jovi terdiri dari Vokalis Jon Bon Jovi, Keyboardist David Bryan, Drummer Tico Torres, Gitaris Phil X, dan Bassist Hugh McDonald. (Bon Jovi consists of Vocalist Jon Bon Jovi, Keyboardist David Bryan, Drummer Tico Torres, Guitarist Phil X, and Bassist Hugh McDonald.)
Ans Q: Siapa personil Bon Jovi? (Who are the members of Bon Jovi?)
UnAns Q: Siapa personil Bon Jovi yang paling jarang dikenal? (Who is the least known member of Bon Jovi?)

Type: Other
Description: Other cases where the paragraph does not imply any answer
Context: Patrick Star adalah seekor bintang laut yang bersahabat dengan Spongebob. (Patrick Star is a starfish whose best friend is Spongebob.)
Ans Q: Siapakah teman baik karakter SpongeBob SquarePants? (Who is SpongeBob SquarePants’ best friend?)
UnAns Q: Siapa teman kecil karakter Spongebob SquarePants? (Who is Spongebob SquarePants’ childhood friend?)

Table 1: Unanswerable question types that are covered in IDK-MRC.

One alternative for constructing a new dataset is manually adding the unanswerable questions. This, however, is expensive and time-consuming. Several Question Generation (QG) approaches have been proposed to mitigate this, but most focus on generating answerable questions (Heilman and Smith, 2010; Du et al., 2017; Du and Cardie, 2018; Klein and Nabi, 2019; Alberti et al., 2019; Kumar et al., 2019; Puri et al., 2020; Shakeri et al., 2020), with only one generating unanswerable questions (Zhu et al., 2019). These models can quickly generate many questions, but the resulting questions are usually less fluent and less relevant to the passage than human-written questions.
This work intends to combine the best of both worlds by incorporating humans into the automatic dataset generation pipeline. Figure 1 shows our pipeline, which consists of three phases: automatic generation, validation, and manual generation. To sum up, our contributions are as follows:
• We construct a new Indonesian MRC dataset called I(n)don’tKnow-MRC (IDK-MRC), consisting of over 5K unanswerable questions with diverse question types, as shown in Table 1. To the best of our knowledge, IDK-MRC is the first Indonesian MRC dataset covering both answerable and unanswerable questions.
• We propose a simple dataset collection pipeline consisting of automatic and manual dataset generation. We show that relying only on automatic generation results in a highly imbalanced question type distribution; our manual generation method addresses this limitation.
• We validate our dataset on the downstream task and show that it effectively improves the MRC models’ performance, especially in predicting the answer to unanswerable questions.
2 Related Work
Existing Indonesian MRC Dataset
While many MRC datasets are available in English (Rajpurkar et al., 2016; Trischler et al., 2017; Rajpurkar et al., 2018), the number of publicly available Indonesian MRC datasets is very limited. A shortcut to obtaining Indonesian MRC data is machine-translating an English MRC dataset (Muis and Purwarianti, 2020), but this results in translation artifacts. We may avoid this by recruiting human annotators to translate the data manually; still, this leads to translationese, where the translated text appears awkward or unnatural (Clark et al., 2020). FacQA (Purwarianti et al., 2007) is part of the IndoNLU benchmark (Wilie et al., 2020) and is built from news articles. It has around 3K answerable questions, with limited categories of questions: date, location, name, organization, person, and quantitative. Another dataset, TyDiQA-GoldP (Clark et al., 2020), a multilingual QA dataset constructed from Wikipedia, has about 5K Indonesian samples. It also focuses only on answerable questions. To date, there are no publicly available Indonesian MRC datasets that include unanswerable questions.
Human-Model Dataset Construction
Combining humans and models in dataset construction is mainly applied to adversarial data, such as AdversarialQA (Bartolo et al., 2020) and AdversarialNLI (Nie et al., 2020). In this dynamic adversarial data collection, human annotators are asked to construct adversarial questions to fool the model. Such a human-model annotation pipeline has not been tried for unanswerable questions. Wang et al. (2021) analyzed the cost of different dataset labeling strategies, including the combination of GPT-3 (Brown et al., 2020) and human labeling. Although they included the MRC task in their experiment, they only focused on SQuAD 1.1 (Rajpurkar et al., 2016), which only has answerable questions. The effectiveness of human-model labeling in the context of unanswerable questions remains unclear.
Unanswerable Question Generation
Various approaches have been proposed for generating answerable questions in English (Heilman and Smith, 2010; Du et al., 2017; Du and Cardie, 2018; Lewis et al., 2019; Klein and Nabi, 2019; Alberti et al., 2019; Puri et al., 2020; Shakeri et al., 2020), Indonesian (Muis and Purwarianti, 2020), and cross- or multi-lingual settings (Kumar et al., 2019; Chi et al., 2020; Shakeri et al., 2021; Riabi et al., 2021). Question generation has also been applied to generate adversarial questions (Bartolo et al., 2021). However, for unanswerable question generation, the number of works is limited. Zhu et al. (2019) proposed the Pair-to-Sequence (Pair2Seq) model, which uses separate encoders for the paragraph and the answerable question. They utilized English word embeddings (i.e., GloVe (Pennington et al., 2014)) and character embeddings as the features and a bi-LSTM (Hochreiter and Schmidhuber, 1997) as the encoder. Although their model performed better than the rule-based and TF-IDF baselines, it still relied on a traditional word embedding representation as the feature. Differing from their approach, we utilize the mT5 model (Xue et al., 2021), which covers contextual representations of 101 languages, including Indonesian. Our experiment (§5.1) confirms that our model outperforms the Pair2Seq model, demonstrating the advantage of our approach.
3 Dataset Collection Pipeline
We build the IDK-MRC dataset by combining model-generated unanswerable questions with human-written questions. As shown in Figure 1, our dataset collection has three stages: automatic generation, validation, and manual generation.
3.1 Automatic Generation
In this stage, we automatically construct unanswerable questions using a Question Generation (QG) model. We use translated SQuAD 2.0 (Rajpurkar et al., 2018) as the training data of the QG model. In the inference step, we use the answerable questions from TyDiQA-GoldP (Clark et al., 2020) as a starting point to add more unanswerable questions to our dataset. Our QG model architecture is illustrated in Figure 2.
Figure 2: Our proposed question generation model.
Candidate Generation
We utilize the mT5 model (Xue et al., 2021) to generate the unanswerable question candidates. The input is the generate unans prefix, followed by the context, the answerable question, and the answer. Then, using top-p and top-k sampling as the decoding method, the model produces several output candidates.
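The input construction and the sampling step can be sketched in plain Python. This is only an illustration under stated assumptions: the field labels and separators in `build_qg_input`, the function names, and the default values of `top_k` and `top_p` are hypothetical, since the text only specifies the prefix, the field order, and the use of top-p and top-k sampling. A toy next-token distribution stands in for the mT5 decoder.

```python
import random


def build_qg_input(context, answerable_question, answer):
    # Prefix and field order follow the text; the "context:", "question:" and
    # "answer:" labels are hypothetical separators chosen for this sketch.
    return (
        "generate unans "
        f"context: {context} question: {answerable_question} answer: {answer}"
    )


def top_k_top_p_filter(probs, top_k, top_p):
    """Keep the top_k most likely tokens, then the shortest prefix of them
    whose cumulative probability reaches top_p, and renormalize."""
    ranked = sorted(probs.items(), key=lambda kv: kv[1], reverse=True)[:top_k]
    kept, cumulative = [], 0.0
    for token, p in ranked:
        kept.append((token, p))
        cumulative += p
        if cumulative >= top_p:
            break
    total = sum(p for _, p in kept)
    return {token: p / total for token, p in kept}


def sample_token(probs, top_k=50, top_p=0.95, rng=random):
    # Draw one token from the filtered, renormalized distribution.
    filtered = top_k_top_p_filter(probs, top_k, top_p)
    tokens = list(filtered)
    weights = [filtered[t] for t in tokens]
    return rng.choices(tokens, weights=weights, k=1)[0]
```

Because each decoding step draws from the truncated distribution instead of taking the single most likely token, repeated decoding of the same input can yield several distinct candidates, which is what the later filtering steps operate on.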
QA Filter
Since not all output candidates are valid unanswerable questions, we filter out invalid questions using an ensemble of six³ Question Answering (QA) models. We fine-tuned XLM-R (Conneau et al., 2020) on the translated SQuAD 2.0 dataset using different random seeds and used the resulting models as the QA models. Based on the predictions of these models, we keep a question if four or more models give an empty answer (i.e., unanswerable) or if four or more models return non-empty answers and all these answers are different.
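The voting rule can be written down as a small predicate. The following is a minimal sketch, assuming each QA model's prediction is available as a string and that an empty string marks an unanswerable prediction; the function name and the whitespace and case normalization are assumptions made here, not details given in the text.

```python
def keep_candidate(predictions, min_votes=4):
    """Decide whether a generated question passes the QA filter.

    predictions: one answer string per QA model; "" means the model
    predicted that the question is unanswerable.
    """
    empty = [p for p in predictions if p.strip() == ""]
    # Normalization before comparing answers is an assumption of this sketch.
    non_empty = [p.strip().lower() for p in predictions if p.strip() != ""]

    # Case 1: enough models agree that there is no answer.
    if len(empty) >= min_votes:
        return True
    # Case 2: enough models answer, but no two of them give the same answer.
    if len(non_empty) >= min_votes and len(set(non_empty)) == len(non_empty):
        return True
    return False
```

With six models, `keep_candidate(["", "", "", "", "Jakarta", "2009"])` passes through the first case, while a candidate on which several models agree on the same span is rejected, since agreement suggests the passage does contain an answer.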
Similarity Function
Finally, we apply a similarity function to all remaining output candidates to make sure that the final output is relevant to its corresponding paragraph and answerable question. We calculate the similarity between the original answerable question and each remaining question candidate using the BLEU score, which measures n-gram overlap, and pick the candidate with the highest score as the final output.
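As an illustration of this selection step, the sketch below scores each candidate against the answerable question with a simplified sentence-level BLEU and returns the best one. The BLEU variant is an assumption: the text does not state the tokenization, maximum n-gram order, or smoothing, so whitespace tokenization, n-grams up to length 4, and add-one smoothing are used here purely for demonstration.

```python
import math
from collections import Counter


def ngrams(tokens, n):
    # Multiset of all contiguous n-grams in the token list.
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def sentence_bleu(reference, candidate, max_n=4):
    """Simplified BLEU with add-one smoothing (an assumed variant)."""
    ref, cand = reference.lower().split(), candidate.lower().split()
    if not cand:
        return 0.0
    log_precision = 0.0
    for n in range(1, max_n + 1):
        ref_counts, cand_counts = ngrams(ref, n), ngrams(cand, n)
        # Clipped overlap: each candidate n-gram counts at most as often
        # as it appears in the reference.
        overlap = sum(min(c, ref_counts[g]) for g, c in cand_counts.items())
        total = sum(cand_counts.values())
        # Add-one smoothing keeps short questions from scoring zero.
        log_precision += math.log((overlap + 1) / (total + 1)) / max_n
    # Brevity penalty: candidates shorter than the reference are penalized.
    brevity = min(1.0, math.exp(1 - len(ref) / len(cand)))
    return brevity * math.exp(log_precision)


def pick_most_similar(answerable_question, candidates):
    return max(candidates, key=lambda c: sentence_bleu(answerable_question, c))
```

For the Question Tag Swap example in Table 1, a candidate that only replaces "Di mana" with "Kapan" shares most n-grams with the answerable question and would be preferred over an unrelated question.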
³ Six was chosen based on related work in Adversarial QA (Bartolo et al., 2021).
3.2 Validation
After obtaining the automatically generated unanswerable questions, we validate them to ensure that they are free of noise and errors. We recruit four Indonesian native speakers with 2+ years of experience in Indonesian NLP dataset annotation. Each annotator is asked to score the generated questions on three criteria, adopted from Zhu et al. (2019) and re-defined as follows:
• Unanswerability: whether the answer can be found in the given paragraph. The score is 1 if the answer cannot be found, 0 otherwise.
• Relevancy: whether the question is relevant to the paragraph and the answerable question. 3 if the question is relevant to both, 2 if it is only relevant to the paragraph or the answerable question, and 1 if it is not relevant to either.
• Fluency: whether the question is fluent. 3 if the collective quality of all words in the question is fluent and coherent; 2 if the question is semi-coherent, has a minor typo, or grammatical errors; and 1 if the question is incoherent or incomprehensible.
Each question is validated by one annotator, with each annotator validating the same number of questions. Then, we apply a cross-checking method to minimize human errors and to ensure that the criteria are applied consistently across the annotators. Suppose that we have four annotators $(a_1, \ldots, a_4)$ who have evaluated sets of samples $(s_1, \ldots, s_4)$. Each $s_i$ consists of a set of (paragraph, answerable question, unanswerable question) triples, along with the unanswerability, relevancy, and fluency scores of each unanswerable question. In the cross-checking phase, $a_1$ is assigned to check the scores of $s_2$, $a_2$ is assigned to check the scores of $s_1$, and so on. Disagreement⁴ is resolved by discussion among the annotators to ensure that each annotator has the same level of task understanding, resulting in high-quality and consistent annotation.
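One way to read the cross-checking scheme is as a pairwise swap between annotators. The sketch below encodes that reading; it is an assumption, since the text only gives the first pair and "and so on", and the function names are hypothetical. The helper for narrow disagreement treats two scores that differ by exactly one point as narrow.

```python
def cross_check_assignment(annotators):
    """Map each annotator to the annotator whose sample set they re-check.

    Assumes a pairwise swap (first with second, third with fourth, ...),
    which is one possible reading of the scheme described in the text.
    """
    if len(annotators) % 2 != 0:
        raise ValueError("pairwise swapping needs an even number of annotators")
    assignment = {}
    for i in range(0, len(annotators), 2):
        first, second = annotators[i], annotators[i + 1]
        assignment[first] = second   # first checks the set scored by second
        assignment[second] = first   # second checks the set scored by first
    return assignment


def is_narrow_disagreement(original_score, checked_score):
    # Scores one point apart (e.g. 1 vs 2 or 2 vs 3) count as narrow.
    return abs(original_score - checked_score) == 1
```

Under this reading, no annotator ever re-checks their own scores, and every sample set is checked by exactly one other person.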
Finally, we keep the questions with perfect unanswerability, relevancy, and fluency scores, i.e., questions with scores of (1, 3, 3). We also keep the questions with scores of (1, 3, 2) and ask the annotators to make minor corrections to those questions.
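The selection rule can be sketched as a lookup on the score triple. The class name, field names, and returned labels below are hypothetical, and the label "not_kept" only marks questions that fall outside the two cases named in the text; what happens to those questions afterwards is not specified in this section.

```python
from dataclasses import dataclass


@dataclass
class Scores:
    unanswerability: int  # 1 if the answer cannot be found, 0 otherwise
    relevancy: int        # 1 to 3
    fluency: int          # 1 to 3


def triage(scores):
    """Route a validated question according to its score triple."""
    key = (scores.unanswerability, scores.relevancy, scores.fluency)
    if key == (1, 3, 3):
        return "keep"
    if key == (1, 3, 2):
        # Kept, but annotators first make minor corrections.
        return "keep_after_correction"
    return "not_kept"
```

For example, `triage(Scores(1, 3, 2))` routes a relevant but slightly disfluent question to manual correction instead of dropping it.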
⁴ Overall, the disagreement rate is roughly 10–20%, with 84% of the disagreements categorized as narrow disagreement (1 vs 2 or 2 vs 3).