Learning to Reuse Distractors to support
Multiple Choice Question Generation in Education
Semere Kiros Bitew1, Amir Hadifar1, Lucas Sterckx†, Johannes Deleu1, Chris Develder1,
Thomas Demeester1
1IDLab, Ghent University – imec,
Technologiepark Zwijnaarde 126, 9052 Ghent, Belgium
{semerekiros.bitew,firstname.lastname}@ugent.be
{lucassterckx}@gmail.com
Abstract
Multiple choice questions (MCQs) are widely used in digital learning systems, as
they allow for automating the assessment process. However, due to the increased
digital literacy of students and the advent of social media platforms, MCQ tests
are widely shared online, and teachers are continuously challenged to create new
questions, which is an expensive and time-consuming task. A particularly sensitive
aspect of MCQ creation is to devise relevant distractors, i.e., wrong answers that are
not easily identifiable as being wrong. This paper studies how a large existing set of
manually created answers and distractors for questions over a variety of domains,
subjects, and languages can be leveraged to help teachers in creating new MCQs,
by the smart reuse of existing distractors. We built several data-driven models
based on context-aware question and distractor representations, and compared
them with static feature-based models. The proposed models are evaluated with
automated metrics and in a realistic user test with teachers. Both automatic and
human evaluations indicate that context-aware models consistently outperform a
static feature-based approach. For our best-performing context-aware model, on
average 3 distractors out of the 10 shown to teachers were rated as high-quality
distractors. We create a performance benchmark, and make it public, to enable
comparison between different approaches and to introduce a more standardized
evaluation of the task. The benchmark contains a test set of 298 educational questions covering multiple subjects and languages, and a 77k multilingual distractor vocabulary pool for future research.
1 Introduction
Online learning has become an indispensable part of educational institutions. It has emerged as a
necessary resource for students and schools all over the globe. The recent COVID-19 pandemic
has made the transition to online learning even more pressing. One very important aspect of
online learning is the need to generate homework, test, and exam exercises to aid and evaluate
the learning progress of students [14]. Multiple choice questions (MCQs) are the most common form of exercises [18] in online education as they can easily be scored automatically. However, the construction of MCQs is time consuming [12] and there is a need to continuously generate new (variants of) questions, especially for testing, since students tend to share questions and correct answers from MCQs online (e.g., through social media).
Corresponding author
Lucas Sterckx, who is currently at LynxCare, contributed to this work while working in the T2K team,
Ghent University-imec.
Accepted in IEEE Transactions on Learning Technologies, doi:10.1109/TLT.2022.3226523
arXiv:2210.13964v2 [cs.CL] 13 Dec 2022
The rapid digitization of educational resources opens up opportunities to adopt artificial intelligence
(AI) to automate the process of MCQ construction. A substantial number of questions already exist
in a digital format, thus providing the required data as a first step toward building AI systems. The
automation of MCQ construction could support both teachers and learners. Teachers could benefit
from increased efficiency in creating questions, on top of their already high workload. Students' learning experience could improve due to increased practice opportunities based on automatically generated exercises, and if these systems are sufficiently accurate, they could power personalized learning [41].
A crucial step in MCQ creation is the generation of distractors [39]. Distractors are incorrect options that are related to the answer to some degree. The quality of an MCQ heavily depends on the quality of distractors [12]. If the distractors do not sufficiently challenge learners, picking the correct answer becomes easy, ultimately degrading the discriminative power of the question. The automatic suggestion of distractors will be the focus of this paper.
Several works have already proposed distractor generation techniques for automatic MCQ creation,
mostly based on selecting distractors according to their similarity to the correct answer. In general,
two approaches are used to measure the similarity between distractors and an answer: graph-based and
corpus-based methods. Graph-based approaches use the semantic distance between concepts in the
graph as a similarity measure. In language learning applications, typically WordNet [46, 54] is used to generate distractors, while for factoid questions domain-specific ontologies are used to generate distractors [51, 16, 34, 2]. In corpus-based methods, similarity between distractors and answers has been defined as having a similar frequency count [11], belonging to the same POS class [20], having a high co-occurrence likelihood [25], having similar phonetic and morphological features [54], and being nearby in embedding spaces [31, 22, 26]. Other works such as [39, 35, 36, 38] use machine learning models to generate distractors by combining the previous features with other types of information such as tf-idf scores.
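As an illustration of the embedding-based variant of these corpus-based methods, the following sketch ranks candidate distractors by cosine similarity to the answer key. The toy vectors and candidate strings are invented for the example and merely stand in for pretrained word embeddings such as word2vec or GloVe.

```python
import numpy as np

def cosine(u, v):
    """Cosine similarity between two vectors."""
    return float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v)))

def rank_by_embedding(key, candidates, embeddings, top_k=3):
    """Rank candidate distractors by embedding similarity to the answer key."""
    key_vec = embeddings[key]
    scored = [(cand, cosine(key_vec, embeddings[cand])) for cand in candidates]
    return sorted(scored, key=lambda x: x[1], reverse=True)[:top_k]

# Toy vectors standing in for pretrained word embeddings (invented for the example).
embeddings = {
    "Egyptians":  np.array([0.9, 0.1, 0.0]),
    "Sudanese":   np.array([0.8, 0.2, 0.1]),
    "Ethiopians": np.array([0.7, 0.3, 0.1]),
    "bicycles":   np.array([0.0, 0.1, 0.9]),
}
print(rank_by_embedding("Egyptians", ["Sudanese", "Ethiopians", "bicycles"], embeddings))
```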
While the current state-of-the-art in MCQ creation is promising, we see a number of limitations.
First of all, existing models are often domain-specific. Indeed, the proposed techniques are tailored to the application and distractor types. In language learning, such as vocabulary, grammar or tense usage exercises, similarity based on basic syntactic and statistical information (frequency, POS information, etc.) typically works well. In other domains, such as science, health, history, geography, etc., distractors should be selected based on a deeper understanding of context and semantics, and the current methods fail to capture such information.
The second limitation, language dependency, is especially applicable to factoids. Models should be agnostic to language because facts do not change across languages. Moreover, building a new model for each language could be a daunting task, as it would require enough training data for each language.
In this work, we study how the automatic retrieval of distractors can facilitate the efficient construction of MCQs. We use a large, high-quality dataset of (question, answer, distractor) triples that are diverse in terms of language, domain, and type of questions. Our dataset was made available by a commercial organization active in the field of e-assessment (see Section 3.2), and is therefore representative of the educational domain, with a total of 62k MCQs, none of them identical, encompassing only 92k different answers and distractors. Despite an average of 2.4 distractors per question, there is a large reuse of distractors across different questions. This motivates our premise to retrieve and reuse distractors for new questions. We make use of the latest data-driven Natural Language Processing (NLP) techniques to retrieve candidate distractors. We propose context-aware multilingual models, based on deep neural networks, that select distractors by taking into account the context of the question. They are also able to handle a variety of distractors in terms of length and type. We compare our proposed models to a competitive feature-based baseline that relies on classical machine learning methods trained on several handcrafted features.
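To make the retrieval setup concrete, the following is a minimal sketch of context-aware distractor retrieval with an off-the-shelf multilingual sentence encoder. The model name, the query format, and the cosine-similarity scoring are illustrative assumptions, not the exact architecture or training procedure of the models proposed in this paper.

```python
# Sketch of context-aware retrieval: encode the question context (stem + key) and
# every candidate distractor with a multilingual sentence encoder, then rank the
# candidates by cosine similarity. Model name and scoring are illustrative only.
from sentence_transformers import SentenceTransformer, util

encoder = SentenceTransformer("paraphrase-multilingual-MiniLM-L12-v2")

def retrieve_distractors(stem, key, candidate_pool, top_k=10):
    """Return the top_k candidates most similar to the question context."""
    query_emb = encoder.encode(stem + " " + key, convert_to_tensor=True)
    pool_emb = encoder.encode(candidate_pool, convert_to_tensor=True)
    scores = util.cos_sim(query_emb, pool_emb)[0]           # one score per candidate
    ranked = scores.argsort(descending=True)[:top_k].tolist()
    return [(candidate_pool[i], float(scores[i])) for i in ranked]

pool = ["Sudanese", "Ethiopians", "Kenyans", "bicycles"]    # toy candidate pool
print(retrieve_distractors(
    "Which inhabitants are not happy with Ethiopia's plans of the Nile?",
    "Egyptians", pool, top_k=3))
```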
The methods are evaluated for distractor quality using automated metrics and a real-world user
test with teachers. Both the automatic evaluation and the user study with teachers indicate that the
proposed context-aware methods outperform the feature-based baseline. Our contribution can be
summarized as follows:
We built three multilingual Transformer-based distractor retrieval models that suggest distractors to teachers for multiple subjects in different languages. The first model (Section 3.4.1) requires similar distractors to have similar semantic representations, while the second (Section 3.4.2) learns similar representations for similar questions, and the last combines the complementary advantages of these two models (Section 3.4.3).
We performed a user study with teachers to evaluate the quality of distractors proposed by
the models, based on a four-level annotation scheme designed for that purpose.
The evaluation of our best model on in-distribution held-out data reveals an average increase of 20.4% in terms of recall at 10, compared to our baseline model adapted from [36]. The teacher-based annotations on language learning exercises show an increase of 4.3% in the fraction of good distractors among the top 10 results, compared to teacher annotations for the same baseline. For factoid questions, the fraction of quality distractors more than doubles w.r.t. the baseline, with an improvement of 15.3%.
We released a test set of educational questions for 6 subjects, with 50 MCQs per subject and annotated distractors, together with a distractor vocabulary of 77k entries, as a benchmark to stimulate further research (available at https://dx.doi.org/10.21227/gnpy-d910 or https://github.com/semerekiros/dist-retrieval). The dataset, which was created by experts, contains multilingual and multi-domain distractors.
The remainder of the paper is organized as follows: Section 2 describes relevant work on MCQs in general and distractor generation in particular. Section 3 introduces the dataset, explains the details of the proposed methods, and describes the evaluation setup of the user study with teachers. In Section 5, the results of both the user study and the automated evaluations are reported. Finally, in Section 6, we present the conclusion, lines for future work, and limitations of our proposed models.
2 Related work
2.1 MCQs in Education
Multiple choice questions (MCQs) are widely used forms of exercises that require students to select
the best possible answer from a set of given options. They are used both for learning and for assessing learners' knowledge and skills. MCQs are categorized as an objective type of question because they primarily deal with the facts or knowledge embedded in a text rather than subjective opinions [7]. It has been shown that recalling information in response to a multiple-choice test question strengthens memory, which leads to better retention of that information over time. It can also change the way information is represented in memory, potentially resulting in deeper understanding [6] of concepts.
An MCQ item consists of three elements:
stem: the question, statement, or lead-in to the question.
key: the correct answer.
distractors: alternative answers meant to challenge students' understanding of the topic.
For example, consider the MCQ in the first row of Table 3: the stem of the MCQ is "Which inhabitants are not happy with Ethiopia's plans of the Nile?". Four potential answers are given with the question. Among these, the correct answer is "Egyptians", which is the key. The alternatives are the distractors.
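For illustration, these three elements map naturally onto a small data structure. The sketch below represents one MCQ item in Python; the stem and key come from the Table 3 example quoted above, while the distractor values are hypothetical placeholders since Table 3 is not reproduced here.

```python
from dataclasses import dataclass, field
from typing import List

@dataclass
class MCQItem:
    """One multiple choice question: stem, key (correct answer) and distractors."""
    stem: str
    key: str
    distractors: List[str] = field(default_factory=list)

# Stem and key as quoted in the text; distractors are illustrative placeholders.
item = MCQItem(
    stem="Which inhabitants are not happy with Ethiopia's plans of the Nile?",
    key="Egyptians",
    distractors=["Sudanese", "Ethiopians", "Kenyans"],
)
print(item.key, item.distractors)
```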
MCQs are used in several teaching domains such as information technology [66], health [5, 10], historical knowledge [4], etc. They are also commonly used in standardized tests such as the GRE and TOEFL. MCQs are preferred to other question formats because they are easy to score, and students can also answer them relatively quickly since typing responses is not required. Moreover, MCQs enable a high level of test validity if they are drawn from a representative sample of the content areas that make up the pre-determined learning outcomes [10]. The most time-consuming and non-trivial task in constructing MCQs is distractor generation [12, 36]. Distractors should be plausible enough to force learners to put some thought into the question before selecting the correct answer. Preparing good multiple-choice questions is a skill that requires formal training [1, 49]. Moreover, several MCQ item writing guidelines are used by content specialists when they prepare educational tests. These guidelines also include recommendations for developing and using distractors [23, 24, 48]. Despite
these guidelines, inexperienced teachers may still construct poor MCQs due to lack of training and
limited time [63].
Besides reducing teachers' workloads, the automation of distractor generation could potentially correct some minor mistakes made by teachers. For example, one of the rules suggested by [23] says: "the length of distractors and the key should be about the same". Such a property could easily be integrated in the automation process.
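As a hedged sketch of how such a rule could be enforced automatically, the snippet below filters candidate distractors whose token length deviates too much from the key; the tolerance value and the candidate strings are arbitrary illustrative choices, not settings used in this paper.

```python
def similar_length(key: str, distractor: str, tolerance: float = 0.5) -> bool:
    """Heuristic check of the 'about the same length' guideline, on token counts.

    The tolerance value is an arbitrary illustrative choice."""
    k, d = len(key.split()), len(distractor.split())
    return abs(k - d) <= tolerance * max(k, d)

candidates = ["Sudanese", "people living downstream of the dam", "Kenyans"]
# Keep only candidates whose token length roughly matches the key "Egyptians".
print([c for c in candidates if similar_length("Egyptians", c)])
```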
MCQs also have drawbacks; they are typically used to measure lower-order levels of knowledge, and
guesswork can be a factor in answering a question with a limited number of alternatives. Furthermore,
because of a few missing details, learners’ partial understanding of a topic may not be sufficient
to correctly answer a question, resulting in partial knowledge not being credited by MCQs [6].
Nonetheless, MCQs are still extensively utilized in large-scale tests since they are efficient to
administer and easy to score objectively [18].
2.2 Distractor Generation
Many strategies have been developed for generating distractors for a given question. The most
common approach is to select a distractor based on its similarity to the key. Many researchers approximate the similarity between distractor and key according to WordNet [45, 47, 37]. WordNet [44] is a lexical database that groups words into sets of synonyms, and concepts semantically close to the key are used as distractors. The usage of such lexical databases is sound for language or vocabulary learning, but not for factoid-type questions. We instead provide a more general approach that can be used for both tasks, and instead of only using the key as the source of information while suggesting distractors, we also make use of the stem.
For learning factual knowledge, several works rely on the use of a specific domain ontology as a proxy for similarity. Papasalouros et al. [51] employ several ontology-based strategies to generate distractors for MCQ questionnaires. For example, they generate "Brussels is a mountain" as a good distractor for the answer "Everest is a mountain" because the concepts City and Mountain share the parent concept Location. Another very similar work by Lopetegui et al. [40] selects distractors that are declared siblings of the answer in a domain-specific ontology. The work by Leo et al. [34] improves upon the previous works by generating multi-word distractors from an ontology in the medical domain. Other works that rely on knowledge bases apply query relaxation methods, where the queries used to generate the keys are slightly relaxed to generate distractors that share similar features with the key [56, 16, 60]. While the methods in these works are dependent on their respective ontologies, we provide an approach that is ontology-agnostic and instead uses the contextual similarity between distractors and questions.
Another line of work for distractor generation uses machine learning models. Liu et al. [39] use a regression model based on characteristics such as character glyph, phonological, and semantic similarity for generating distractors in Chinese. Liang et al. [36] use two methods to rank distractors in the domain of school sciences. The first method applies machine learning classifiers to manually engineered features (e.g., edit distance, POS similarity, etc.) to rank distractors. The second uses generative adversarial networks to rank distractors. Our baseline method is inspired by their first approach, but was adapted to account for the multilingual nature of our dataset by extending the feature set.
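To illustrate the flavour of such feature-based ranking, the sketch below computes a few simple string-level features for a (key, candidate) pair. The feature choice is illustrative; the actual baseline uses a larger, multilingual feature set and a trained classifier to produce a ranking score.

```python
# Simple string-level features comparing a candidate distractor to the answer key.
from difflib import SequenceMatcher

def features(key: str, candidate: str) -> dict:
    key_l, cand_l = key.lower(), candidate.lower()
    key_toks, cand_toks = set(key_l.split()), set(cand_l.split())
    return {
        "edit_ratio": SequenceMatcher(None, key_l, cand_l).ratio(),   # character-level similarity
        "len_diff": abs(len(key_l.split()) - len(cand_l.split())),    # token length difference
        "token_overlap": len(key_toks & cand_toks) / max(len(key_toks | cand_toks), 1),
    }

print(features("Egyptians", "Ethiopians"))
```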
There have also been a number of works on generating distractors in the context of machine comprehension [33]. Distractor generation strategies that fall into this category assume access to a contextual resource such as a book chapter, an article, or a Wikipedia page from which the MCQ was produced. The aim is then to generate a distractor that takes into account the reading comprehension text and a pair composed of the question and its correct answer that originated from the text [17, 68, 9]. This line of work is not directly comparable to ours, because we do not have access to an external contextual resource from which the questions were prepared.
In this paper, we focus on building one model that is able to suggest candidate distractors for teachers
both in the context of language and factual knowledge learning. Unlike previous methods, we tackle
distractor generation with a multilingual dataset. Our distractors are diverse both in terms of domain
and language. Moreover, the distractors are not limited to single words only.
Table 1: The statistics of our dataset

                                   Train          Validation     Test
# Questions                        61,758         600            500
# Distractors per question         2.4            2.3            2.3
Avg. question length               27.8 tokens    28.1 tokens    27.6 tokens
Avg. distractor length             2.2 tokens     2.3 tokens     2.1 tokens
Avg. answer length                 2.2 tokens     2.3 tokens     2.2 tokens
Total # distractors                94,205         -              -
Total # distractors (<= 6 tokens)  77,505         -              -
3 Methodology
In this section, we formally define distractor generation as a ranking problem, describe our datasets, and describe in detail the feature-based baseline and the proposed context-aware models, including their training strategies and prediction mechanisms.
3.1 Task Definition: Distractor Retrieval
We assume access to a distractor candidate set $\mathcal{D}$ and a training MCQ dataset $\mathcal{M}$. Note that $\mathcal{D}$ can be obtained by pooling all answers (keys and distractors) from $\mathcal{M}$ (as in our experimental setting), but could also be augmented, for example, with keywords extracted from particular source texts. We formally write $\mathcal{M} = \{(s_i, k_i, D_i) \mid i = 1, \dots, N\}$, where for each item $i$ among all $N$ available MCQs, $s_i$ refers to the question stem, $k_i$ is the correct answer key, and $D_i = \{d_i^{(1)}, \dots, d_i^{(m_i)}\} \subseteq \mathcal{D}$ are the distractors in the MCQ linked to $s_i$ and $k_i$. The aim of the distractor retrieval task is to learn a point-wise ranking score $r_i(d): (s_i, k_i, d) \mapsto [0, 1]$ for all $d \in \mathcal{D}$, such that the distractors in $D_i$ are ranked higher than those in $\mathcal{D} \setminus D_i$ when sorted according to the decreasing score $r_i(d)$.
This task definition resembles the metric learning [30] problem in information retrieval. To learn the ranking function, we propose two types of models: feature-based models and context-aware neural networks.
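The following sketch makes the task setup concrete: a point-wise scorer ranks every candidate in the pool for a given stem and key, and recall at k measures how many gold distractors appear among the top-ranked candidates. The scoring function here is a toy stand-in; the feature-based and context-aware models described next learn such a function from data.

```python
from typing import Callable, List, Set

def recall_at_k(score: Callable[[str, str, str], float],
                stem: str, key: str, gold: Set[str],
                pool: List[str], k: int = 10) -> float:
    """Fraction of gold distractors D_i retrieved among the k highest-scored candidates."""
    ranked = sorted(pool, key=lambda d: score(stem, key, d), reverse=True)
    return len(set(ranked[:k]) & gold) / len(gold)

def toy_score(stem: str, key: str, d: str) -> float:
    """Placeholder scorer: counts tokens the candidate shares with the stem and key."""
    context = set((stem + " " + key).lower().split())
    return len(context & set(d.lower().split()))

pool = ["Sudanese", "Ethiopians", "bicycles", "Egyptians of Cairo"]   # toy candidate pool
print(recall_at_k(toy_score,
                  "Which inhabitants are not happy with Ethiopia's plans of the Nile?",
                  "Egyptians", {"Sudanese", "Ethiopians"}, pool, k=2))
```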
3.2 Data
In this section, we describe our datasets, namely: (i) the Televic dataset, a large dataset that we used to train our models, and (ii) the Wezooz dataset, a small-scale external test set used for evaluation.
3.2.1 Televic dataset
This data is gathered through Televic Education's platform assessmentQ (https://www.televic-education.com/en/assessmentq). The tool is a comprehensive online platform for interactive workforce learning and high-stakes exams. It allows teachers to compose their questions and answers for practice and assessment. As a result, the dataset is made up of a large and high-quality set of questions, answers and distractors, manually created by experts in their respective fields. It encompasses a wide range of domains, subjects, and languages, without, however, any metadata on the particular course subjects that apply to the individual items.
We randomly divide our dataset into train/validation/test splits. We discard distractors with more than
6 tokens as they are very rare and unlikely to be reused in different contexts. We keep questions with
at least one distractor. Table 1 summarizes the statistics of our dataset. The dataset contains around
62k MCQs in total. The size of the dataset is relatively large when compared to previously reported
educational MCQ datasets such as SciQ [64] and MCQL [36], which contain 13.7K and 7.1K MCQs respectively. On average, a question has more than 2 distractors and contains exactly one answer. We use all the answer keys and distractors in the preprocessed dataset as the pool of candidate distractors (i.e., a list of 77,505 filtered distractors) for proposing distractors for any new question.
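A minimal sketch of this preprocessing is given below, under the assumption that each raw record exposes a stem, a key, and a list of distractors; the field names and the sample record are hypothetical, since the export format of the platform is not public.

```python
def preprocess(records, max_distractor_tokens=6):
    """Drop over-long distractors, keep questions with >= 1 distractor, build the pool."""
    cleaned, pool = [], set()
    for rec in records:
        kept = [d for d in rec["distractors"]
                if len(d.split()) <= max_distractor_tokens]
        if not kept:                      # keep only questions with at least one distractor
            continue
        cleaned.append({"stem": rec["stem"], "key": rec["key"], "distractors": kept})
        pool.add(rec["key"])              # the pool contains all answer keys ...
        pool.update(kept)                 # ... and all surviving distractors
    return cleaned, sorted(pool)

# Hypothetical record, just to show the expected shape of the input.
mcqs, distractor_pool = preprocess([
    {"stem": "Which inhabitants are not happy with Ethiopia's plans of the Nile?",
     "key": "Egyptians",
     "distractors": ["Sudanese", "a very long distractor that spans well over six tokens"]},
])
print(len(mcqs), distractor_pool)
```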
The distractors in the dataset are not limited to single word distractors. More than 65% of the
distractors contain two or more words as can be seen in Fig. 1a.
We used the ISO 639-1:2002 standard for names of languages.