on automatic generation results in a highly imbalanced question type distribution; our manual generation method addresses this limitation.
• We validate our dataset on the downstream task and show that it effectively improves the MRC models’ performance, especially in predicting answers to unanswerable questions.
2 Related Work
Existing Indonesian MRC Dataset
While many MRC datasets are available in English (Rajpurkar et al., 2016; Trischler et al., 2017; Rajpurkar et al., 2018), the number of publicly available Indonesian MRC datasets is very limited. A shortcut to obtaining Indonesian MRC data is to machine-translate an English MRC dataset (Muis and Purwarianti, 2020), but this results in translation artifacts. We may avoid this by recruiting human annotators to translate the data manually; still, this leads to translationese, where the translated text appears awkward or unnatural (Clark et al., 2020). FacQA (Purwarianti et al., 2007), part of the IndoNLU benchmark (Wilie et al., 2020), is built from news articles. It has around 3K answerable questions, covering only a limited set of question categories: date, location, name, organization, person, and quantitative. Another dataset, TyDiQA-GoldP (Clark et al., 2020), a multilingual QA dataset constructed from Wikipedia, has about 5K Indonesian samples; it also focuses only on answerable questions. To date, there are no publicly available Indonesian MRC datasets that include unanswerable questions.
Human-Model Dataset Construction
Combining humans and models in dataset construction has mainly been applied to adversarial data, such as AdversarialQA (Bartolo et al., 2020) and AdversarialNLI (Nie et al., 2020). In this dynamic adversarial data collection, human annotators are asked to construct adversarial questions to fool the model. Such a human-model annotation pipeline has not been tried for unanswerable questions. Wang et al. (2021) analyzed the cost of different dataset labeling strategies, including the combination of GPT-3 (Brown et al., 2020) and human labeling. Although they included the MRC task in their experiments, they focused only on SQuAD 1.1 (Rajpurkar et al., 2016), which contains only answerable questions. The effectiveness of human-model labeling in the context of unanswerable questions thus remains unclear.
Unanswerable Question Generation
Various approaches have been proposed for generating answerable questions in English (Heilman and Smith, 2010; Du et al., 2017; Du and Cardie, 2018; Lewis et al., 2019; Klein and Nabi, 2019; Alberti et al., 2019; Puri et al., 2020; Shakeri et al., 2020), Indonesian (Muis and Purwarianti, 2020), and cross- or multi-lingual settings (Kumar et al., 2019; Chi et al., 2020; Shakeri et al., 2021; Riabi et al., 2021). Question generation techniques have also been applied to generate adversarial questions (Bartolo et al., 2021). However, work on unanswerable question generation remains limited. Zhu et al. (2019) proposed a Pair-to-Sequence (Pair2Seq) model that uses separate encoders for the paragraph and the answerable question. They utilized English word embeddings (i.e., GloVe (Pennington et al., 2014)) and character embeddings as features, with a bi-LSTM (Hochreiter and Schmidhuber, 1997) as the encoder. Although their model performed better than the rule-based and TF-IDF baselines, it still relied on traditional, non-contextual word embedding representations. In contrast, we utilize the mT5 model (Xue et al., 2021), which provides contextual representations covering 101 languages, including Indonesian. Our experiment (§5.1) confirms that our model outperforms Pair2Seq, demonstrating the advantage of our approach.
3 Dataset Collection Pipeline
We build the IDK-MRC dataset by combining model-generated unanswerable questions with human-written questions. As shown in Figure 1, our dataset collection pipeline has three stages: automatic generation, validation, and manual generation.
3.1 Automatic Generation
In this stage, we automatically construct unanswerable questions using a Question Generation (QG) model. We use translated SQuAD 2.0 (Rajpurkar et al., 2018) as the training data for the QG model. In the inference step, we use the answerable questions from TyDiQA-GoldP (Clark et al., 2020) as a starting point to add more unanswerable questions to our dataset. Our QG model architecture is illustrated in Figure 2.
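As an illustration, the snippet below is a minimal inference sketch of how unanswerable question candidates could be generated with mT5 through the HuggingFace Transformers library. The checkpoint name (google/mt5-small is only a loadable placeholder), the question/context prompt format, the helper name generate_unanswerable_candidates, and the decoding settings are illustrative assumptions rather than the exact configuration used in our pipeline; in practice, the model would first be fine-tuned on the translated SQuAD 2.0 data described above.

# Minimal sketch (not the exact pipeline code): generate unanswerable
# question candidates with an mT5 checkpoint via HuggingFace Transformers.
from transformers import AutoTokenizer, MT5ForConditionalGeneration

MODEL_NAME = "google/mt5-small"  # placeholder; use a checkpoint fine-tuned on translated SQuAD 2.0

tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
model = MT5ForConditionalGeneration.from_pretrained(MODEL_NAME)

def generate_unanswerable_candidates(answerable_question, passage, num_candidates=5):
    # Input format is an assumption: condition on an answerable question and
    # its paragraph, then decode several unanswerable question candidates.
    input_text = f"question: {answerable_question} context: {passage}"
    inputs = tokenizer(input_text, return_tensors="pt", truncation=True, max_length=512)
    outputs = model.generate(
        **inputs,
        max_length=64,
        num_beams=num_candidates,
        num_return_sequences=num_candidates,
        early_stopping=True,
    )
    return [tokenizer.decode(seq, skip_special_tokens=True) for seq in outputs]

Beam search with multiple returned sequences yields several candidates per answerable question; these candidates are then filtered in the validation stage, and any remaining gaps are covered by manual generation.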
Candidate Generation
We utilize the mT5 model (Xue et al., 2021) to generate unanswerable question candidates. We apply generate unans