QAScore - An Unsupervised Unreferenced Metric for the Question
Generation Evaluation
Tianbo Ji1 Chenyang Lyu2 Gareth Jones1 Liting Zhou2 Yvette Graham3
1ADAPT Centre
2School of Computing, Dublin City University, Ireland
3School of Computer Science and Statistics, Trinity College Dublin, Ireland
{tianbo.ji,yvette.graham,gareth.jones}@adaptcentre.com
chenyang.lyu2@mail.dcu.ie,liting.zhou@dcu.ie
Abstract
Question Generation (QG) aims to automate the task of composing questions for a passage with a set of chosen answers found within the passage. In recent years, the introduction of neural generation models has resulted in substantial improvements in the quality of automatically generated questions, especially compared to traditional approaches that employ manually crafted heuristics. However, the metrics commonly applied in QG evaluation have been criticized for their low agreement with human judgement. We therefore propose QAScore, a new reference-free evaluation metric that has the potential to provide a better mechanism for evaluating QG systems. Instead of fine-tuning a language model to maximize its correlation with human judgements, QAScore evaluates a question by computing the cross entropy according to the probability that the language model can correctly generate the masked words in the answer to that question. Furthermore, we conduct a new crowd-sourced human evaluation experiment for QG to investigate how well QAScore and other metrics correlate with human judgements. Experiments show that QAScore obtains a stronger correlation with the results of our proposed human evaluation method than existing traditional word-overlap-based metrics such as BLEU and ROUGE, as well as the existing pretrained-model-based metric BERTScore.
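As an illustration of the scoring idea described above, the following is a minimal Python sketch assuming a RoBERTa masked language model accessed through the HuggingFace transformers library; the checkpoint choice, the passage-question-answer input template, and the omission of truncation handling are assumptions made for illustration, not the authors' released implementation.

```python
import torch
from transformers import RobertaForMaskedLM, RobertaTokenizer

tokenizer = RobertaTokenizer.from_pretrained("roberta-base")
model = RobertaForMaskedLM.from_pretrained("roberta-base")
model.eval()

def qascore_sketch(passage: str, question: str, answer: str) -> float:
    """Sum of log-probabilities of answer tokens, each masked in turn.

    The negated value is the cross entropy referred to above; a higher
    (less negative) return value means the LM recovers the answer more
    easily given the question, i.e. a better question.
    """
    # Leading space helps RoBERTa's space-sensitive BPE; finer tokenization
    # details are glossed over in this sketch.
    answer_ids = tokenizer(" " + answer, add_special_tokens=False)["input_ids"]
    context_ids = tokenizer(passage + " " + question,
                            add_special_tokens=False)["input_ids"]
    total_log_prob = 0.0
    for i, gold_id in enumerate(answer_ids):
        masked = answer_ids.copy()
        masked[i] = tokenizer.mask_token_id
        # <s> passage+question answer-with-one-mask </s>
        # (truncation to the model's 512-token limit omitted for brevity)
        input_ids = ([tokenizer.cls_token_id] + context_ids
                     + masked + [tokenizer.sep_token_id])
        mask_pos = 1 + len(context_ids) + i
        with torch.no_grad():
            logits = model(input_ids=torch.tensor([input_ids])).logits
        log_probs = torch.log_softmax(logits[0, mask_pos], dim=-1)
        total_log_prob += log_probs[gold_id].item()
    return total_log_prob
```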
1 Introduction
Question Generation (QG) commonly comprises the automatic composition of an appropriate question given a passage of text and an answer located within that text. QG is closely related to the task of machine reading comprehension (MRC), which is a sub-task of question answering (QA) (Du et al., 2017; Xie et al., 2020; Pan et al., 2020a; Puri et al., 2020; Lyu et al., 2021; Chen et al., 2019; Li et al., 2021). Both QG and MRC receive similar input, a (set of) document(s), but the two tasks diverge in the output they produce: QG systems generate questions for a predetermined answer within the text, while MRC systems conversely aim to answer a prescribed set of questions, as sketched below. Recent QG research suggests that the direct employment of MRC datasets for QG tasks is advantageous (Kim et al., 2019; Wang et al., 2020; Cho et al., 2021).
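The contrast between the two tasks' input and output can be summarized with a pair of hypothetical Python signatures; the function names are illustrative only and do not come from the paper or any particular library.

```python
def generate_question(passage: str, answer: str) -> str:
    """QG: compose a question answerable by `answer`, a span chosen from `passage`."""
    raise NotImplementedError  # a QG model goes here

def answer_question(passage: str, question: str) -> str:
    """MRC: produce the answer to a prescribed `question` about `passage`."""
    raise NotImplementedError  # an MRC model goes here
```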
In terms of QG evaluation, widely-applied met-
rics can be categorized into two main classes:
word overlap metrics (e.g., BLEU (Papineni et al.,
2002a) and Answerability (Nema and Khapra,
2018)) and metrics that employ large pre-trained
language models (BLEURT (Sellam et al.,2020)
and BERTScore (Zhang et al.,2020a)). Evalua-
tion via automatic metrics still face a number of
challenges, however. Firstly, most existing metrics
are not specifically designed to evaluate QG sys-
tems as they are borrowed from other NLP tasks.
Since such metrics have been criticized for poor
correlation with human assessment in the evalua-
tion of their own NLP tasks such as machine trans-
lation (MT) and dialogue systems (Reiter,2018;
Graham,2015;Graham and Liu,2016;Ji et al.,
2022), thus that raises questions about the valid-
ity of results based on such metrics designed for
other tasks. Another challenge lies in the fact that
existing automatic evaluation metrics rely on com-
parison of a candidate with a ground-truth refer-
ence. Such approaches ignore the one-to-many
nature of QG ignoring the fact that a QG system is
capable of generating legitimately plausible ques-
tions that will be harshly penalised simply for di-
verging from ground-truth questions. For example,
For example, with a passage describing Ireland, the country located in western Europe, two questions Q1 and Q2, where Q1 = “What is the capital of Ireland?” and Q2 = “Which city in the Leinster province has the largest population?”, can share the same answer “Dublin”. In other words, it is fairly appropriate for a QG system to generate either Q1 or Q2 given that passage.
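To see how harsh this penalty is in practice, the small sketch below scores both questions with sentence-level BLEU from the sacrebleu package; this is an illustrative choice of metric and setup, not the exact configuration used in the experiments reported later.

```python
# Both questions are valid for the answer "Dublin", yet a word-overlap
# metric scores the second near zero when the first is the reference.
import sacrebleu

reference = "What is the capital of Ireland?"
q1 = "What is the capital of Ireland?"
q2 = "Which city in the Leinster province has the largest population?"

for candidate in (q1, q2):
    bleu = sacrebleu.sentence_bleu(candidate, [reference])
    print(f"{bleu.score:5.1f}  {candidate}")
# Q1 matches the reference exactly (BLEU 100); Q2, an equally valid
# question, is harshly penalised despite sharing the same answer.
```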