QAScore - An Unsupervised Unreferenced Metric for the Question
Generation Evaluation
Tianbo Ji1, Chenyang Lyu2, Gareth Jones1, Liting Zhou2, Yvette Graham3
1ADAPT Centre
2School of Computing, Dublin City University, Ireland
3School of Computer Science and Statistics, Trinity College Dublin, Ireland
{tianbo.ji,yvette.graham,gareth.jones}@adaptcentre.com
chenyang.lyu2@mail.dcu.ie,liting.zhou@dcu.ie
Abstract
Question Generation (QG) aims to automate the task of composing questions for a passage with a set of chosen answers found within the passage. In recent years, the introduction of neural generation models has resulted in substantial improvements of automatically generated questions in terms of quality, especially compared to traditional approaches that employ manually crafted heuristics. However, the metrics commonly applied in QG evaluations have been criticized for their low agreement with human judgement. We therefore propose a new reference-free evaluation metric that has the potential to provide a better mechanism for evaluating QG systems, called QAScore. Instead of fine-tuning a language model to maximize its correlation with human judgements, QAScore evaluates a question by computing the cross entropy according to the probability that the language model can correctly generate the masked words in the answer to that question. Furthermore, we conduct a new crowd-sourcing human evaluation experiment for QG evaluation to investigate how QAScore and other metrics correlate with human judgements. Experiments show that QAScore obtains a stronger correlation with the results of our proposed human evaluation method compared to existing traditional word-overlap-based metrics such as BLEU and ROUGE, as well as the existing pretrained-model-based metric BERTScore.
1 Introduction
Question Generation (QG) commonly comprises automatic composition of an appropriate question given a passage of text and an answer located within that text. QG is highly related to the task of machine reading comprehension (MRC), which is a sub-task of question answering (QA) (Du et al., 2017; Xie et al., 2020; Pan et al., 2020a; Puri et al., 2020; Lyu et al., 2021; Chen et al., 2019; Li et al., 2021). Both QG and MRC receive similar input, a (set of) document(s), while the two tasks diverge on the output they produce: QG systems generate questions for a predetermined answer within the text while, conversely, MRC systems aim to answer a prescribed set of questions. Recent QG research suggests that the direct employment of MRC datasets for QG tasks is advantageous (Kim et al., 2019; Wang et al., 2020; Cho et al., 2021).
In terms of QG evaluation, widely-applied metrics can be categorized into two main classes: word overlap metrics (e.g., BLEU (Papineni et al., 2002a) and Answerability (Nema and Khapra, 2018)) and metrics that employ large pre-trained language models (BLEURT (Sellam et al., 2020) and BERTScore (Zhang et al., 2020a)). Evaluation via automatic metrics still faces a number of challenges, however. Firstly, most existing metrics are not specifically designed to evaluate QG systems, as they are borrowed from other NLP tasks. Such metrics have been criticized for poor correlation with human assessment in the evaluation of their own NLP tasks, such as machine translation (MT) and dialogue systems (Reiter, 2018; Graham, 2015; Graham and Liu, 2016; Ji et al., 2022), which raises questions about the validity of results based on metrics designed for other tasks. Another challenge lies in the fact that existing automatic evaluation metrics rely on comparison of a candidate with a ground-truth reference. Such approaches ignore the one-to-many nature of QG, namely the fact that a QG system is capable of generating legitimately plausible questions that will be harshly penalised simply for diverging from ground-truth questions. For example, with a passage describing Ireland, the country located in western Europe, two questions Q1 and Q2, where Q1 = “What is the capital of Ireland?” and Q2 = “Which city in the Leinster province has the largest population?”, can share the same answer “Dublin”. In other words, it is fairly appropriate for a QG system to generate either Q1 or Q2 given the same passage and answer, despite there being little overlap between the meanings of Q1 and Q2. We refer to this as the one-to-many nature of the QG task, as one passage and answer can lead to many meaningful questions. A word-overlap-based metric will however incorrectly assess Q2 with a lower score if it takes Q1 as the reference, because of the lack of word overlap between these two questions.
A potential solution is to pair each answer with a larger number of hand-crafted reference questions. However, the addition of reliable references requires additional resources, usually incurring a high cost, while attempting to include every possible correct question for a given answer is prohibitively expensive and impractical. Another drawback is that pretrained-model-based metrics require extra resources during the fine-tuning process, resulting in a high cost.
Besides the aforementioned evaluation metrics, human evaluation is also widely employed in QG tasks. However, the QG community currently lacks a standard human evaluation approach, as current QG research employs disparate settings of human evaluation (e.g., expert-based or crowd-sourced, binary or 5-point rating scales) (Xie et al., 2020; Ji et al., 2021).
To address the existing shortcomings in QG evaluation, we firstly propose a new automatic metric called QAScore. To investigate whether QAScore can outperform existing automatic evaluation metrics, we additionally devise a new human evaluation approach for QG systems and evaluate its reliability in terms of consistent results for QG systems through self-replication experiments. Details of our contributions are listed as follows:

1. We propose a pretrained-language-model-based evaluation metric called QAScore, which is unsupervised and reference-free. QAScore utilizes the RoBERTa model (Liu et al., 2019), and evaluates a system-generated question using the cross entropy in terms of the probability that RoBERTa can correctly predict the masked words in the answer to that question (see the sketch after this list for an illustration of this scoring scheme).

2. We propose a novel and highly reliable crowd-sourced human evaluation method that can be used as a standard framework for evaluating QG systems. Compared to other human evaluation methods, it is cost-effective and easy to deploy. We further conduct a self-replication experiment showing a correlation of r = 0.955 between two distinct evaluations of the same set of systems. According to the results of the human evaluation experiment, QAScore outperforms all other metrics without supervision steps or fine-tuning, achieving a strong Pearson correlation with human assessment.
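The following minimal sketch illustrates the scoring scheme described in contribution 1, using RoBERTa from the Hugging Face transformers library: each answer token is masked in turn and the cross entropy of recovering it, given the passage and the generated question, is accumulated. The input construction, masking strategy, and normalization here are simplifying assumptions for illustration and may differ from the exact QAScore formulation.

```python
# Minimal QAScore-style sketch: mask each answer token and accumulate the
# cross entropy of RoBERTa recovering it from the passage and the question.
import torch
from transformers import RobertaTokenizer, RobertaForMaskedLM

tokenizer = RobertaTokenizer.from_pretrained("roberta-base")
model = RobertaForMaskedLM.from_pretrained("roberta-base")
model.eval()


def qascore_sketch(passage: str, question: str, answer: str) -> float:
    """Return a negated cross entropy; higher means the answer is easier to
    recover given the passage and the system-generated question."""
    answer_ids = tokenizer(" " + answer, add_special_tokens=False)["input_ids"]
    text_ids = tokenizer(passage + " " + question,
                         add_special_tokens=False)["input_ids"]
    text_ids = text_ids[:400]  # naive truncation to respect RoBERTa's length limit
    total_ce = 0.0
    for i in range(len(answer_ids)):
        # Mask the i-th answer token while keeping the rest of the input intact.
        masked_answer = list(answer_ids)
        masked_answer[i] = tokenizer.mask_token_id
        input_ids = ([tokenizer.cls_token_id] + text_ids + [tokenizer.sep_token_id]
                     + masked_answer + [tokenizer.sep_token_id])
        with torch.no_grad():
            logits = model(torch.tensor([input_ids])).logits
        mask_pos = 1 + len(text_ids) + 1 + i  # position of the masked answer token
        log_probs = torch.log_softmax(logits[0, mask_pos], dim=-1)
        total_ce += -log_probs[answer_ids[i]].item()
    return -total_ce / max(len(answer_ids), 1)
```

With such a sketch, a system-level score could be obtained by averaging over all passage–question–answer triples produced by a QG system.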
2 Background: Question Answering, Question Generation and Evaluation
2.1 Question Answering
Question Answering (QA) aims to provide answers a to the corresponding questions q. Based on the availability of context c, QA can be categorized into open-domain QA (without context) (Chen et al., 2017; Zhu et al., 2021) and machine reading comprehension (with context) (Rajpurkar et al., 2016; Saha et al., 2018). Besides, QA can also be categorized into generative QA (Kočiský et al., 2018; Xu et al., 2022) and extractive QA (Trischler et al., 2017; Lyu et al., 2022; Lewis et al., 2021; Zhang et al., 2020b). Generally, the optimization objective of QA models is to maximize the log likelihood of the ground-truth answer a for the given context c and question q. Therefore the objective function with respect to the parameters θ of QA models is:

$J(\theta) = \log P(a \mid c, q; \theta)$  (1)
2.2 Question Generation
Question Generation (QG) is a task where models receive context passages c and answers a, then generate the corresponding questions q, which are expected to be semantically relevant to the context c and answers a (Pan et al., 2019; Lyu et al., 2021). Thus QG is a reverse/dual task of QA: QA aims to provide answers a to questions q, whereas QG aims to generate questions q for the given answers a. Typically, the architecture of QG systems is a Seq2Seq model (Sutskever et al., 2014), which generates q word by word in an auto-regressive manner. The objective for optimizing the parameters θ of QG systems is to maximize the likelihood of P(q | c, a):

$J(\theta) = \log P(q \mid c, a) = \sum_i \log P(q_i \mid q_{<i}, c, a)$  (2)
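As a concrete illustration of Equation 2, the sketch below computes the token-level log-likelihood of a question given a context and answer with an off-the-shelf sequence-to-sequence model from the transformers library. The choice of BART and the naive concatenation of context and answer are assumptions made for illustration, not the setup of any specific QG system.

```python
# Illustrative computation of the QG objective in Equation 2:
# log P(q | c, a) summed over question tokens, using a generic seq2seq model.
import torch
from transformers import BartTokenizer, BartForConditionalGeneration

tokenizer = BartTokenizer.from_pretrained("facebook/bart-base")
model = BartForConditionalGeneration.from_pretrained("facebook/bart-base")
model.eval()

context = "Dublin is the capital and largest city of Ireland."
answer = "Dublin"
question = "What is the capital of Ireland?"

# Hypothetical input format: context and answer simply concatenated.
inputs = tokenizer(context + " " + answer, return_tensors="pt")
labels = tokenizer(question, return_tensors="pt")["input_ids"]

with torch.no_grad():
    outputs = model(**inputs, labels=labels)

# outputs.loss is the mean token-level negative log-likelihood, so the summed
# log-likelihood log P(q | c, a) can be approximated as follows.
log_likelihood = -outputs.loss.item() * labels.size(1)
print(f"log P(q | c, a) ≈ {log_likelihood:.2f}")
```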
2.3 Automatic evaluation metrics
In the following sections, we introduce the two main categories of automatic evaluation metrics applied to the QG task: word-overlap-based metrics and pretrained-model-based metrics.
2.3.1 Word-overlap-based metrics
Word-overlap-based metrics usually assess the quality of a QG system according to the overlap rate between the words of a system-generated candidate and a reference. Most such metrics, including BLEU, GLEU, ROUGE and METEOR, were initially proposed for other NLP tasks (e.g., BLEU for MT and ROUGE for text summarization), while Answerability is a QG-exclusive evaluation metric.
BLEU Bilingual Evaluation Understudy (BLEU) is a method originally proposed for evaluating the quality of MT systems (Papineni et al., 2002b). For QG evaluation, BLEU computes the level of correspondence between a system-generated question and the reference question by calculating the precision according to the number of matching n-gram segments. These matching segments are considered to be unrelated to their positions in the entire context. The more matching segments there are, the better the quality of the candidate is.
GLEU GLEU (Google-BLEU) is proposed to overcome the drawbacks of BLEU when evaluating individual sentences (Wu et al., 2016). As a variation of BLEU, the GLEU score is reported to correlate highly with the BLEU score at the corpus level. GLEU uses precision and recall scores instead of the modified precision used in BLEU.
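As a rough illustration of how these two scores behave on the Q1/Q2 example from the introduction, the sentence-level implementations in NLTK can be used as follows. This is an illustrative sketch; the implementations and smoothing settings used in this paper's experiments may differ.

```python
# Sentence-level BLEU and GLEU for a candidate question against one reference,
# using NLTK's implementations (smoothing is needed for short sentences).
from nltk.translate.bleu_score import sentence_bleu, SmoothingFunction
from nltk.translate.gleu_score import sentence_gleu

reference = "What is the capital of Ireland ?".lower().split()
candidate = "Which city in the Leinster province has the largest population ?".lower().split()

bleu = sentence_bleu([reference], candidate,
                     smoothing_function=SmoothingFunction().method1)
gleu = sentence_gleu([reference], candidate)

# Both scores are low despite the candidate being a perfectly valid question
# for the same answer, illustrating the one-to-many problem discussed above.
print(f"BLEU: {bleu:.3f}, GLEU: {gleu:.3f}")
```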
ROUGE Recall-Oriented Understudy for Gisting Evaluation (ROUGE) is an evaluation metric developed for assessing the text summarization task, originally conceived as a recall-oriented adaptation of BLEU (Lin, 2004). ROUGE-L is the most popular variant of ROUGE, where L denotes the longest common subsequence (LCS). An LCS is a sequence of words that appears in the same order in both sentences. In contrast with sub-strings (e.g., n-grams), the positions of the words in a sub-sequence are not required to be consecutive in the original sentence. ROUGE-L is then computed as the F_β score according to the number of words in the LCS between a question and a reference.
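A self-contained sketch of the ROUGE-L computation described above is given below; the value of β (a recall-weighted setting common in summarization) is an assumption for illustration and may differ from the configuration used in the experiments.

```python
# ROUGE-L as the F_beta score over the longest common subsequence (LCS)
# between a candidate question and a reference question.
def lcs_length(x: list, y: list) -> int:
    """Dynamic-programming LCS length between two token sequences."""
    table = [[0] * (len(y) + 1) for _ in range(len(x) + 1)]
    for i, xi in enumerate(x, 1):
        for j, yj in enumerate(y, 1):
            table[i][j] = table[i - 1][j - 1] + 1 if xi == yj else max(
                table[i - 1][j], table[i][j - 1])
    return table[len(x)][len(y)]


def rouge_l(candidate: str, reference: str, beta: float = 1.2) -> float:
    cand, ref = candidate.lower().split(), reference.lower().split()
    lcs = lcs_length(cand, ref)
    if lcs == 0:
        return 0.0
    precision = lcs / len(cand)
    recall = lcs / len(ref)
    # F_beta combination of LCS-based precision and recall.
    return (1 + beta ** 2) * precision * recall / (recall + beta ** 2 * precision)


print(rouge_l("address of DCU", "What is the address of DCU ?"))
```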
METEOR Metric for Evaluation of Translation with Explicit ORdering (METEOR) was first proposed to make up for the disadvantages of BLEU, such as its lack of recall and its inaccuracy when assessing a single sentence (Banerjee and Lavie, 2005). METEOR first generates a set of mappings between the question q and the reference r according to a set of stages, including: exact token matching (i.e., two tokens are the same), WordNet synonyms (e.g., well and good), and Porter stemming (e.g., friend and friends). The METEOR score is then computed as the weighted harmonic mean of precision and recall in terms of the number of unigrams in the mappings between a question and a reference.
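For reference, NLTK ships an implementation of METEOR that can be used roughly as follows. This is a sketch only: the required WordNet data must be downloaded, recent NLTK versions expect pre-tokenized inputs as shown, and the implementation used in this paper's experiments may differ.

```python
# METEOR via NLTK (requires WordNet data: nltk.download("wordnet")).
import nltk
from nltk.translate.meteor_score import meteor_score

nltk.download("wordnet", quiet=True)

reference = "What is the address of DCU ?".lower().split()
candidate = "address of DCU".lower().split()

# Single-reference METEOR; multiple references could be passed in the list.
print(f"METEOR: {meteor_score([reference], candidate):.3f}")
```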
Answerability Aside from the aforementioned evaluation methods, which are borrowed from other NLP tasks, an automatic metric called Answerability has been specifically proposed for the QG task (Nema and Khapra, 2018). Nema and Khapra (2018) suggest combining it with other existing metrics, since its aim is to measure how answerable a question is, something not usually targeted by other automatic metrics. For example, given a reference question r: “What is the address of DCU?” and two generated questions q1: “address of DCU” and q2: “What is the address of”, it is obvious that q1 is rather answerable since it contains enough information, while q2 is very confusing. However, any similarity-based metric is certainly prone to conclude that q2 (ROUGE-L: 90.9; METEOR: 41.4; BLEU-1: 81.9) is closer to r than q1 (ROUGE-L: 66.7; METEOR: 38.0; BLEU-1: 36.8). Thus, Answerability is proposed to solve such an issue. In detail, for a system-generated question q and a reference question r, the Answerability score can be computed as shown in Equation 3:
$P = \sum_{i \in E} w_i \cdot \frac{h_i(q, r)}{k_i(q)}$
$R = \sum_{i \in E} w_i \cdot \frac{h_i(q, r)}{k_i(r)}$
$\text{Answerability} = \frac{2 \times P \times R}{P + R}$  (3)

where i ∈ E indexes the element types in E = {R, N, Q, F} (R = Relevant Content Word, N = Named Entity, Q = Question Type, and F = Function Word). w_i is the weight for type i, such that $\sum_{i \in E} w_i = 1$. The function h_i(x, y) returns the number of i-type words in question x that have matching i-type words in question y, and k_i(x) returns the number of i-type words occurring in question x. The final Answerability score is the F1 score of the precision P and the recall R.
Along with using Answerability individually, a common practice is to combine it with other metrics, as suggested when evaluating QG systems (Chen et al., 2020; Lewis et al., 2020b):

$\text{Metric}_{\text{mod}} = \beta \cdot \text{Answerability} + (1 - \beta) \cdot \text{Metric}_{\text{ori}}$  (4)

where Metric_mod is a modified version of an original evaluation metric Metric_ori using Answerability, and β is a hyper-parameter. In this experiment, we combine it with BLEU to generate the Q-BLEU score using the default value of β.
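To make Equations 3 and 4 concrete, the sketch below computes the Answerability score from per-type match and occurrence counts and then interpolates it with a BLEU score. The identification of relevant content words, named entities, question types and function words is omitted, and the counts, weights, and β shown here are purely hypothetical placeholders.

```python
# Answerability (Equation 3) from per-type counts, then Q-BLEU (Equation 4).
# The per-type counts h_i(q, r), k_i(q), k_i(r) are assumed to have been
# extracted beforehand (e.g., via POS tagging and NER); values are illustrative.
TYPES = ["relevant_content", "named_entity", "question_type", "function_word"]

weights = {t: 0.25 for t in TYPES}  # hypothetical weights summing to 1
h = {"relevant_content": 2, "named_entity": 1, "question_type": 0, "function_word": 2}
k_q = {"relevant_content": 2, "named_entity": 1, "question_type": 0, "function_word": 3}
k_r = {"relevant_content": 3, "named_entity": 1, "question_type": 1, "function_word": 4}


def answerability(h, k_q, k_r, weights):
    precision = sum(weights[t] * h[t] / k_q[t] for t in TYPES if k_q[t] > 0)
    recall = sum(weights[t] * h[t] / k_r[t] for t in TYPES if k_r[t] > 0)
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


def q_bleu(answerability_score, bleu_score, beta=0.2):
    # Equation 4: interpolate Answerability with an existing metric score.
    return beta * answerability_score + (1 - beta) * bleu_score


ans = answerability(h, k_q, k_r, weights)
print(f"Answerability: {ans:.3f}, Q-BLEU: {q_bleu(ans, bleu_score=0.35):.3f}")
```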
2.3.2 Pretrained-model-based metrics
BERTScore Zhang et al. (2020a) proposed an automatic metric called BERTScore for evaluating text generation tasks, because word-overlap-based metrics like BLEU fail to account for compositional diversity. Instead, BERTScore computes a similarity score between tokens in a candidate sentence and its reference based on their contextualized representations produced by BERT (Devlin et al., 2019). Given a question q that has m tokens and a reference r that has n tokens, the BERT model first generates the representations of q and r as q = ⟨q_1, q_2, ..., q_m⟩ and r = ⟨r_1, r_2, ..., r_n⟩, where q_i and r_i respectively denote the contextual embeddings of the i-th token in q and r. Then, the BERTScore between the question and the reference can be computed by Equation 5:

$P_{\text{BERT}} = \frac{1}{m} \sum_{q_i \in q} \max_{r_j \in r} q_i^{\top} r_j$
$R_{\text{BERT}} = \frac{1}{n} \sum_{r_i \in r} \max_{q_j \in q} q_j^{\top} r_i$
$\text{BERTScore} = \frac{2 \cdot P_{\text{BERT}} \cdot R_{\text{BERT}}}{P_{\text{BERT}} + R_{\text{BERT}}}$  (5)

where the final BERTScore is the F1 measure computed from the precision P_BERT and the recall R_BERT.
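In practice, BERTScore can be computed with the authors' bert-score package roughly as follows. This is a sketch: the underlying model, baseline rescaling, and language settings are configurable, and the defaults assumed here may differ from the configuration used in this paper's experiments.

```python
# Corpus-level BERTScore with the bert-score package (pip install bert-score).
from bert_score import score

candidates = ["Which city in the Leinster province has the largest population?"]
references = ["What is the capital of Ireland?"]

# Returns per-sentence precision, recall, and F1 tensors.
P, R, F1 = score(candidates, references, lang="en", verbose=False)
print(f"P={P.mean():.3f} R={R.mean():.3f} F1={F1.mean():.3f}")
```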
BLEURT BLEURT is proposed to address the issue that metrics like BLEU may correlate poorly with human judgments (Sellam et al., 2020). It is a trained evaluation metric that takes a candidate and its reference as input and outputs a score indicating to what extent the candidate covers the meaning of the reference. BLEURT uses a BERT-based regression model trained on the human rating data from the WMT Metrics Shared Tasks from 2017 to 2019. Since BLEURT was proposed for evaluating models at the sentence level, and no formal procedure is available for corpus-level evaluation, we directly compute the final BLEURT score of a QG system as the arithmetic mean of all sentence-level BLEURT scores in our QG evaluation experiment, as suggested (see the discussion at https://github.com/google-research/bleurt/issues/10).
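The corpus-level averaging described above can be sketched with the reference BLEURT implementation as follows; the checkpoint path is a placeholder, and whichever checkpoint is used must be downloaded separately from the BLEURT repository.

```python
# Sentence-level BLEURT scores averaged into a corpus-level system score.
# Requires the bleurt package and a downloaded checkpoint (placeholder path below).
from bleurt import score as bleurt_score

checkpoint = "path/to/BLEURT-20"  # placeholder checkpoint directory
scorer = bleurt_score.BleurtScorer(checkpoint)

candidates = ["Which city in the Leinster province has the largest population?"]
references = ["What is the capital of Ireland?"]

sentence_scores = scorer.score(references=references, candidates=candidates)
system_score = sum(sentence_scores) / len(sentence_scores)
print(f"Corpus-level BLEURT: {system_score:.3f}")
```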
2.4 Human evaluation
Although the aforementioned automatic metrics are widely employed for QG evaluation, criticism has been levelled at the ability of n-gram overlap-based metrics to accurately and comprehensively evaluate quality (Yuan et al., 2017). As a single answer can potentially have a large number of corresponding plausible questions, simply computing the overlap rate between an output and a reference does not convincingly reflect the real quality of a QG system. A possible solution is to obtain more correct questions per answer, as n-gram overlap-based metrics would usually benefit from multiple ground-truth references. However, this may elicit new issues: 1) adding additional references over entire corpora requires effort similar to creating a new dataset, incurring expensive time and resource costs; 2) it is not straightforward to formulate how word overlap should contribute to the final score for systems.
Hence, human evaluation is also involved when evaluating newly proposed QG systems. A common approach is to take a set of system-generated questions and ask human raters to score these questions on an n-point Likert scale. Below we introduce and describe recent human evaluations applied to QG systems.
Jia et al. (2021) proposed EQG-RACE to generate examination-type questions for educational purposes. 100 outputs are sampled and three expert raters are required to score these outputs in three dimensions: fluency - whether a question is grammatical and fluent; relevancy - whether the question is semantically relevant to the passage; and answerability - whether the question can be answered by the right answer. A 3-point scale is used for each aspect, and aspects are reported separately without an overall performance score.
KD-QG is a framework with a knowledge base