Learning to Reuse Distractors to support
Multiple Choice Question Generation in Education
Semere Kiros Bitew1, Amir Hadifar1, Lucas Sterckx†, Johannes Deleu1, Chris Develder1,
Thomas Demeester1
1IDLab, Ghent University – imec,
Technologiepark Zwijnaarde 126, 9052 Ghent, Belgium
{semerekiros.bitew,firstname.lastname}@ugent.be
{lucassterckx}@gmail.com
Abstract
Multiple choice questions (MCQs) are widely used in digital learning systems, as
they allow for automating the assessment process. However, due to the increased
digital literacy of students and the advent of social media platforms, MCQ tests
are widely shared online, and teachers are continuously challenged to create new
questions, which is an expensive and time-consuming task. A particularly sensitive
aspect of MCQ creation is to devise relevant distractors, i.e., wrong answers that are
not easily identifiable as being wrong. This paper studies how a large existing set of
manually created answers and distractors for questions over a variety of domains,
subjects, and languages can be leveraged to help teachers in creating new MCQs,
by the smart reuse of existing distractors. We built several data-driven models
based on context-aware question and distractor representations, and compared
them with static feature-based models. The proposed models are evaluated with
automated metrics and in a realistic user test with teachers. Both automatic and
human evaluations indicate that context-aware models consistently outperform a
static feature-based approach. For our best-performing context-aware model, on
average 3 distractors out of the 10 shown to teachers were rated as high-quality
distractors. We create a performance benchmark, and make it public, to enable
comparison between different approaches and to introduce a more standardized
evaluation of the task. The benchmark contains a test set of 298 educational questions covering multiple subjects and languages, and a 77k multilingual distractor vocabulary pool for future research.
1 Introduction
Online learning has become an indispensable part of educational institutions. It has emerged as a
necessary resource for students and schools all over the globe. The recent COVID-19 pandemic
has made the transition to online learning even more pressing. One very important aspect of
online learning is the need to generate homework, test, and exam exercises to aid and evaluate
the learning progress of students [14]. Multiple choice questions (MCQs) are the most common form of exercises [18] in online education as they can easily be scored automatically. However, the construction of MCQs is time consuming [12] and there is a need to continuously generate new (variants of) questions, especially for testing, since students tend to share questions and correct answers from MCQs online (e.g., through social media).
Corresponding author
Lucas Sterckx, who is currently at LynxCare, contributed to this work while working in the T2K team,
Ghent University-imec.
Accepted in IEEE Transactions on Learning Technologies, doi:10.1109/TLT.2022.3226523
arXiv:2210.13964v2 [cs.CL] 13 Dec 2022
The rapid digitization of educational resources opens up opportunities to adopt artificial intelligence
(AI) to automate the process of MCQ construction. A substantial number of questions already exist
in a digital format, thus providing the required data as a first step toward building AI systems. The
automation of MCQ construction could support both teachers and learners. Teachers could benefit
from increased efficiency in creating questions, on top of their already high workload. Students' learning experience could improve due to increased practice opportunities based on automatically generated exercises, and if these systems are sufficiently accurate, they could power personalized learning [41].
A crucial step in MCQ creation is the generation of distractors [39]. Distractors are incorrect options that are related to the answer to some degree. The quality of an MCQ heavily depends on the quality of distractors [12]. If the distractors do not sufficiently challenge learners, picking the correct answer becomes easy, ultimately degrading the discriminative power of the question. The automatic suggestion of distractors will be the focus of this paper.
Several works have already proposed distractor generation techniques for automatic MCQ creation,
mostly based on selecting distractors according to their similarity to the correct answer. In general,
two approaches are used to measure the similarity between distractors and an answer: graph-based and
corpus-based methods. Graph-based approaches use the semantic distance between concepts in the
graph as a similarity measure. In language learning applications, typically WordNet [46, 54] is used to generate distractors, while for factoid questions domain-specific ontologies are used to generate distractors [51, 16, 34, 2]. In corpus-based methods, similarity between distractors and answers has been defined as having a similar frequency count [11], belonging to the same POS class [20], having a high co-occurrence likelihood [25], having similar phonetic and morphological features [54], and being nearby in embedding spaces [31, 22, 26]. Other works such as [39, 35, 36, 38] use machine learning models to generate distractors by combining the previous features with other types of information such as tf-idf scores.
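As an illustration of the embedding-based variant of these corpus-based methods, the following sketch ranks candidate distractors by cosine similarity to the answer key. The toy vectors and candidate strings are invented for the example and merely stand in for pretrained word embeddings such as word2vec or GloVe.

```python
import numpy as np

def cosine(u, v):
    """Cosine similarity between two vectors."""
    return float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v)))

def rank_by_embedding(key, candidates, embeddings, top_k=3):
    """Rank candidate distractors by embedding similarity to the answer key."""
    key_vec = embeddings[key]
    scored = [(cand, cosine(key_vec, embeddings[cand])) for cand in candidates]
    return sorted(scored, key=lambda x: x[1], reverse=True)[:top_k]

# Toy vectors standing in for pretrained word embeddings (invented for the example).
embeddings = {
    "Egyptians":  np.array([0.9, 0.1, 0.0]),
    "Sudanese":   np.array([0.8, 0.2, 0.1]),
    "Ethiopians": np.array([0.7, 0.3, 0.1]),
    "bicycles":   np.array([0.0, 0.1, 0.9]),
}
print(rank_by_embedding("Egyptians", ["Sudanese", "Ethiopians", "bicycles"], embeddings))
```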
While the current state-of-the-art in MCQ creation is promising, we see a number of limitations.
First of all, existing models are often domain-specific. Indeed, the proposed techniques are tailored to the application and distractor types. In language learning, such as vocabulary, grammar or tense usage exercises, similarity based on basic syntactic and statistical information (frequency, POS information, etc.) typically works well. In other domains, such as science, health, history, geography, etc., distractors should be selected based on a deeper understanding of context and semantics, and the current methods fail to capture such information.
The second limitation, language dependency, is especially applicable to factoids. Models should be agnostic to language because facts do not change across languages. Moreover, building a new model for each language could be a daunting task, as it would require enough training data for each language.
In this work, we study how the automatic retrieval of distractors can facilitate the efficient construction of MCQs. We use a large, high-quality dataset of (question, answer, distractor) triples that are diverse in terms of language, domain, and type of questions. Our dataset was made available by a commercial organization active in the field of e-assessment (see Section 3.2), and is therefore representative of the educational domain, with a total of 62k MCQs, none of them identical, encompassing only 92k different answers and distractors. Despite an average of 2.4 distractors per question, there is a large reuse of distractors across different questions. This motivates our premise to retrieve and reuse distractors for new questions. We make use of the latest data-driven Natural Language Processing (NLP) techniques to retrieve candidate distractors. We propose context-aware multilingual models, based on deep neural networks, that select distractors by taking into account the context of the question. They are also able to handle a variety of distractors in terms of length and type. We compare our proposed models to a competitive feature-based baseline that relies on classical machine learning methods trained on several handcrafted features.
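To make the retrieval setup concrete, the following is a minimal sketch of context-aware distractor retrieval with an off-the-shelf multilingual sentence encoder. The model name, the query format, and the cosine-similarity scoring are illustrative assumptions, not the exact architecture or training procedure of the models proposed in this paper.

```python
# Sketch of context-aware retrieval: encode the question context (stem + key) and
# every candidate distractor with a multilingual sentence encoder, then rank the
# candidates by cosine similarity. Model name and scoring are illustrative only.
from sentence_transformers import SentenceTransformer, util

encoder = SentenceTransformer("paraphrase-multilingual-MiniLM-L12-v2")

def retrieve_distractors(stem, key, candidate_pool, top_k=10):
    """Return the top_k candidates most similar to the question context."""
    query_emb = encoder.encode(stem + " " + key, convert_to_tensor=True)
    pool_emb = encoder.encode(candidate_pool, convert_to_tensor=True)
    scores = util.cos_sim(query_emb, pool_emb)[0]           # one score per candidate
    ranked = scores.argsort(descending=True)[:top_k].tolist()
    return [(candidate_pool[i], float(scores[i])) for i in ranked]

pool = ["Sudanese", "Ethiopians", "Kenyans", "bicycles"]    # toy candidate pool
print(retrieve_distractors(
    "Which inhabitants are not happy with Ethiopia's plans of the Nile?",
    "Egyptians", pool, top_k=3))
```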
The methods are evaluated for distractor quality using automated metrics and a real-world user
test with teachers. Both the automatic evaluation and the user study with teachers indicate that the
proposed context-aware methods outperform the feature-based baseline. Our contribution can be
summarized as follows:
We built three multilingual Transformer-based distractor retrieval models that suggest distractors to teachers for multiple subjects in different languages. The first model (Section 3.4.1) requires similar distractors to have similar semantic representations, while the second (Section 3.4.2) learns similar representations for similar questions, and the last combines the complementary advantages of these two models (Section 3.4.3).
We performed a user study with teachers to evaluate the quality of distractors proposed by
the models, based on a four-level annotation scheme designed for that purpose.
The evaluation of our best model on in-distribution held-out data reveals an average increase of 20.4% in terms of recall at 10, compared to our baseline model adapted from [36]. The teacher-based annotations on language learning exercises show an increase of 4.3% in the fraction of good distractors among the top 10 results, compared to teacher annotations for the same baseline. For factoid questions, the fraction of quality distractors more than doubles w.r.t. the baseline, with an improvement of 15.3%.
We released a test set of educational questions for 6 subjects, with 50 MCQs per subject and annotated distractors, together with a distractor vocabulary of 77k entries, as a benchmark to stimulate further research (available at https://dx.doi.org/10.21227/gnpy-d910 or https://github.com/semerekiros/dist-retrieval). The dataset, which was created by experts, contains multilingual and multi-domain distractors.
The remainder of the paper is organized as follows: Section 2 describes relevant work on MCQs in general and distractor generation in particular. Section 3 introduces the dataset, explains the details of the proposed methods, and describes the evaluation setup of the user study with teachers. In Section 5, the results of both the user study and the automated evaluations are reported. Finally, in Section 6, we present the conclusion, lines for future work, and limitations of our proposed models.
2 Related work
2.1 MCQs in Education
Multiple choice questions (MCQs) are widely used forms of exercises that require students to select
the best possible answer from a set of given options. They are used both for learning and for assessing learners' knowledge and skills. MCQs are categorized as an objective type of question because they primarily deal with the facts or knowledge embedded in a text rather than subjective opinions [7]. It has been shown that recalling information in response to a multiple-choice test question strengthens memory, which leads to better retention of that information over time. It can also change the way information is represented in memory, potentially resulting in deeper understanding [6] of concepts.
An MCQ item consists of three elements:
stem: the question, statement, or lead-in to the question.
key: the correct answer.
distractors: alternative answers meant to challenge students' understanding of the topic.
For example, consider the MCQ in the first row of Table 3: the stem of the MCQ is "Which inhabitants are not happy with Ethiopia's plans of the Nile?". Four potential answers are given with the question. Among these, the correct answer is "Egyptians", which is the key. The alternatives are the distractors.
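For illustration, these three elements map naturally onto a small data structure. The sketch below represents one MCQ item in Python; the stem and key come from the Table 3 example quoted above, while the distractor values are hypothetical placeholders since Table 3 is not reproduced here.

```python
from dataclasses import dataclass, field
from typing import List

@dataclass
class MCQItem:
    """One multiple choice question: stem, key (correct answer) and distractors."""
    stem: str
    key: str
    distractors: List[str] = field(default_factory=list)

# Stem and key as quoted in the text; distractors are illustrative placeholders.
item = MCQItem(
    stem="Which inhabitants are not happy with Ethiopia's plans of the Nile?",
    key="Egyptians",
    distractors=["Sudanese", "Ethiopians", "Kenyans"],
)
print(item.key, item.distractors)
```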
MCQs are used in several teaching domains such as information technology [66], health [5, 10], historical knowledge [4], etc. They are also commonly used in standardized tests such as the GRE and TOEFL. MCQs are preferred to other question formats because they are easy to score, and students can also answer them relatively quickly since typing responses is not required. Moreover, MCQs enable a high level of test validity if they are drawn from a representative sample of the content areas that make up the pre-determined learning outcomes [10]. The most time-consuming and non-trivial task in constructing MCQs is distractor generation [12, 36]. Distractors should be plausible enough to force learners to put some thought into the question before selecting the correct answer. Preparing good multiple-choice questions is a skill that requires formal training [1, 49]. Moreover, several MCQ item writing guidelines are used by content specialists when they prepare educational tests. These guidelines also include recommendations for developing and using distractors [23, 24, 48]. Despite
these guidelines, inexperienced teachers may still construct poor MCQs due to lack of training and
limited time [63].
Besides reducing teachers' workloads, the automation of distractor generation could potentially correct some minor mistakes made by teachers. For example, one of the rules suggested by [23] says: "the length of distractors and the key should be about the same". Such a property could easily be integrated in the automation process.
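As a hedged sketch of how such a rule could be enforced automatically, the snippet below filters candidate distractors whose token length deviates too much from the key; the tolerance value and the candidate strings are arbitrary illustrative choices, not settings used in this paper.

```python
def similar_length(key: str, distractor: str, tolerance: float = 0.5) -> bool:
    """Heuristic check of the 'about the same length' guideline, on token counts.

    The tolerance value is an arbitrary illustrative choice."""
    k, d = len(key.split()), len(distractor.split())
    return abs(k - d) <= tolerance * max(k, d)

candidates = ["Sudanese", "people living downstream of the dam", "Kenyans"]
# Keep only candidates whose token length roughly matches the key "Egyptians".
print([c for c in candidates if similar_length("Egyptians", c)])
```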
MCQs also have drawbacks; they are typically used to measure lower-order levels of knowledge, and
guesswork can be a factor in answering a question with a limited number of alternatives. Furthermore,
because of a few missing details, learners’ partial understanding of a topic may not be sufficient
to correctly answer a question, resulting in partial knowledge not being credited by MCQs [6].
Nonetheless, MCQs are still extensively utilized in large-scale tests since they are efficient to
administer and easy to score objectively [18].
2.2 Distractor Generation
Many strategies have been developed for generating distractors for a given question. The most
common approach is to select a distractor based on its similarity to the key. Many researchers approximate the similarity between distractor and key according to WordNet [45, 47, 37]. WordNet [44] is a lexical database that groups words into sets of synonyms, and concepts semantically close to the key are used as distractors. The usage of such lexical databases is sound for language or vocabulary learning, but not for factoid-type questions. We instead provide a more general approach that can be used for both tasks, and instead of only using the key as the source of information while suggesting distractors, we also make use of the stem.
For learning factual knowledge, several works rely on the use of a specific domain ontology as a proxy for similarity. Papasalouros et al. [51] employ several ontology-based strategies to generate distractors for MCQ questionnaires. For example, they generate "Brussels is a mountain" as a good distractor for the answer "Everest is a mountain" because the concepts City and Mountain share the parent concept Location. Another very similar work by Lopetegui et al. [40] selects distractors that are declared siblings of the answer in a domain-specific ontology. The work by Leo et al. [34] improves upon the previous works by generating multi-word distractors from an ontology in the medical domain. Other works that rely on knowledge bases apply query relaxation methods, where the queries used to generate the keys are slightly relaxed to generate distractors that share similar features with the key [56, 16, 60]. While the methods in these works are dependent on their respective ontologies, we provide an approach that is ontology-agnostic and instead uses the contextual similarity between distractors and questions.
Another line of work for distractor generation uses machine learning models. Liu et al. [39] use a regression model based on characteristics such as character glyph, phonological, and semantic similarity for generating distractors in Chinese. Liang et al. [36] use two methods to rank distractors in the domain of school sciences. The first method applies machine learning classifiers to manually engineered features (e.g., edit distance, POS similarity, etc.) to rank distractors. The second uses generative adversarial networks to rank distractors. Our baseline method is inspired by their first approach, but was adapted to account for the multilingual nature of our dataset by extending the feature set.
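To illustrate the flavour of such feature-based ranking, the sketch below computes a few simple string-level features for a (key, candidate) pair. The feature choice is illustrative; the actual baseline uses a larger, multilingual feature set and a trained classifier to produce a ranking score.

```python
# Simple string-level features comparing a candidate distractor to the answer key.
from difflib import SequenceMatcher

def features(key: str, candidate: str) -> dict:
    key_l, cand_l = key.lower(), candidate.lower()
    key_toks, cand_toks = set(key_l.split()), set(cand_l.split())
    return {
        "edit_ratio": SequenceMatcher(None, key_l, cand_l).ratio(),   # character-level similarity
        "len_diff": abs(len(key_l.split()) - len(cand_l.split())),    # token length difference
        "token_overlap": len(key_toks & cand_toks) / max(len(key_toks | cand_toks), 1),
    }

print(features("Egyptians", "Ethiopians"))
```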
There have also been a number of works on generating distractors in the context of machine comprehension [33]. Distractor generation strategies that fall into this category assume access to a contextual resource such as a book chapter, an article, or a Wikipedia page from which the MCQ was produced. The aim is then to generate a distractor that takes into account the reading comprehension text and a pair composed of the question and its correct answer that originated from the text [17, 68, 9]. This line of work is not directly comparable to ours, because we do not have access to an external contextual resource from which the questions were prepared.
In this paper, we focus on building one model that is able to suggest candidate distractors for teachers
both in the context of language and factual knowledge learning. Unlike previous methods, we tackle
distractor generation with a multilingual dataset. Our distractors are diverse both in terms of domain
and language. Moreover, the distractors are not limited to single words only.
Table 1: The statistics of our dataset

                                   Train          Validation     Test
# Questions                        61,758         600            500
# Distractors per question         2.4            2.3            2.3
Avg. question length               27.8 tokens    28.1 tokens    27.6 tokens
Avg. distractor length             2.2 tokens     2.3 tokens     2.1 tokens
Avg. answer length                 2.2 tokens     2.3 tokens     2.2 tokens
Total # distractors                94,205         -              -
Total # distractors (<= 6 tokens)  77,505         -              -
3 Methodology
In this section, we formally define distractor generation as a ranking problem, describe our datasets, and describe in detail the feature-based baseline and the proposed context-aware models, including their training strategies and prediction mechanisms.
3.1 Task Definition: Distractor Retrieval
We assume access to a distractor candidate set $\mathcal{D}$ and a training MCQ dataset $\mathcal{M}$. Note that $\mathcal{D}$ can be obtained by pooling all answers (keys and distractors) from $\mathcal{M}$ (as in our experimental setting), but could also be augmented, for example, with keywords extracted from particular source texts. We formally write $\mathcal{M} = \{(s_i, k_i, D_i) \mid i = 1, \dots, N\}$, where for each item $i$ among all $N$ available MCQs, $s_i$ refers to the question stem, $k_i$ is the correct answer key, and $D_i = \{d_i^{(1)}, \dots, d_i^{(m_i)}\} \subseteq \mathcal{D}$ are the distractors in the MCQ linked to $s_i$ and $k_i$. The aim of the distractor retrieval task is to learn a point-wise ranking score $r_i(d): (s_i, k_i, d) \mapsto [0, 1]$ for all $d \in \mathcal{D}$, such that the distractors in $D_i$ are ranked higher than those in $\mathcal{D} \setminus D_i$ when sorted according to the decreasing score $r_i(d)$.
This task definition resembles the metric learning [30] problem in information retrieval. To learn the ranking function, we propose two types of models: feature-based models and context-aware neural networks.
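The following sketch makes the task setup concrete: a point-wise scorer ranks every candidate in the pool for a given stem and key, and recall at k measures how many gold distractors appear among the top-ranked candidates. The scoring function here is a toy stand-in; the feature-based and context-aware models described next learn such a function from data.

```python
from typing import Callable, List, Set

def recall_at_k(score: Callable[[str, str, str], float],
                stem: str, key: str, gold: Set[str],
                pool: List[str], k: int = 10) -> float:
    """Fraction of gold distractors D_i retrieved among the k highest-scored candidates."""
    ranked = sorted(pool, key=lambda d: score(stem, key, d), reverse=True)
    return len(set(ranked[:k]) & gold) / len(gold)

def toy_score(stem: str, key: str, d: str) -> float:
    """Placeholder scorer: counts tokens the candidate shares with the stem and key."""
    context = set((stem + " " + key).lower().split())
    return len(context & set(d.lower().split()))

pool = ["Sudanese", "Ethiopians", "bicycles", "Egyptians of Cairo"]   # toy candidate pool
print(recall_at_k(toy_score,
                  "Which inhabitants are not happy with Ethiopia's plans of the Nile?",
                  "Egyptians", {"Sudanese", "Ethiopians"}, pool, k=2))
```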
3.2 Data
In this section, we describe our datasets, namely: (i) the Televic dataset, a large dataset that we used to train our models, and (ii) the Wezooz dataset, a small-scale external test set used for evaluation.
3.2.1 Televic dataset
This data is gathered through Televic Education's platform assessmentQ (https://www.televic-education.com/en/assessmentq). The tool is a comprehensive online platform for interactive workforce learning and high-stakes exams. It allows teachers to compose their questions and answers for practice and assessment. As a result, the dataset is made up of a large and high-quality set of questions, answers and distractors, manually created by experts in their respective fields. It encompasses a wide range of domains, subjects, and languages, without, however, any metadata on the particular course subjects that apply to the individual items.
We randomly divide our dataset into train/validation/test splits. We discard distractors with more than
6 tokens as they are very rare and unlikely to be reused in different contexts. We keep questions with
at least one distractor. Table 1 summarizes the statistics of our dataset. The dataset contains around
62k MCQs in total. The size of the dataset is relatively large when compared to previously reported
educational MCQ datasets such as SciQ [64] and MCQL [36], which contain 13.7K and 7.1K MCQs respectively. On average, a question has more than 2 distractors and contains exactly one answer. We use all the answer keys and distractors in the preprocessed dataset as the pool of candidate distractors (i.e., a list of 77,505 filtered distractors) for proposing distractors for any new question.
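A minimal sketch of this preprocessing is given below, under the assumption that each raw record exposes a stem, a key, and a list of distractors; the field names and the sample record are hypothetical, since the export format of the platform is not public.

```python
def preprocess(records, max_distractor_tokens=6):
    """Drop over-long distractors, keep questions with >= 1 distractor, build the pool."""
    cleaned, pool = [], set()
    for rec in records:
        kept = [d for d in rec["distractors"]
                if len(d.split()) <= max_distractor_tokens]
        if not kept:                      # keep only questions with at least one distractor
            continue
        cleaned.append({"stem": rec["stem"], "key": rec["key"], "distractors": kept})
        pool.add(rec["key"])              # the pool contains all answer keys ...
        pool.update(kept)                 # ... and all surviving distractors
    return cleaned, sorted(pool)

# Hypothetical record, just to show the expected shape of the input.
mcqs, distractor_pool = preprocess([
    {"stem": "Which inhabitants are not happy with Ethiopia's plans of the Nile?",
     "key": "Egyptians",
     "distractors": ["Sudanese", "a very long distractor that spans well over six tokens"]},
])
print(len(mcqs), distractor_pool)
```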
The distractors in the dataset are not limited to single word distractors. More than 65% of the
distractors contain two or more words as can be seen in Fig. 1a.
We used the ISO 639-1:2002 standard for names of languages.