QAScore - An Unsupervised Unreferenced Metric for the Question
Generation Evaluation
Tianbo Ji1, Chenyang Lyu2, Gareth Jones1, Liting Zhou2, Yvette Graham3
1ADAPT Centre
2School of Computing, Dublin City University, Ireland
3School of Computer Science and Statistics, Trinity College Dublin, Ireland
{tianbo.ji,yvette.graham,gareth.jones}@adaptcentre.com
chenyang.lyu2@mail.dcu.ie,liting.zhou@dcu.ie
Abstract
Question Generation (QG) aims to automate the task of composing questions for a passage with a set of chosen answers found within the passage. In recent years, the introduction of neural generation models has resulted in substantial improvements of automatically generated questions in terms of quality, especially compared to traditional approaches that employ manually crafted heuristics. However, the metrics commonly applied in QG evaluations have been criticized for their low agreement with human judgement. We therefore propose a new reference-free evaluation metric that has the potential to provide a better mechanism for evaluating QG systems, called QAScore. Instead of fine-tuning a language model to maximize its correlation with human judgements, QAScore evaluates a question by computing the cross entropy according to the probability that the language model can correctly generate the masked words in the answer to that question. Furthermore, we conduct a new crowd-sourcing human evaluation experiment for QG evaluation to investigate how QAScore and other metrics correlate with human judgements. Experiments show that QAScore obtains a stronger correlation with the results of our proposed human evaluation method compared to existing traditional word-overlap-based metrics such as BLEU and ROUGE, as well as the existing pretrained-model-based metric BERTScore.
1 Introduction
Question Generation (QG) commonly comprises automatic composition of an appropriate question given a passage of text and an answer located within that text. QG is highly related to the task of machine reading comprehension (MRC), which is a sub-task of question answering (QA) (Du et al., 2017; Xie et al., 2020; Pan et al., 2020a; Puri et al., 2020; Lyu et al., 2021; Chen et al., 2019; Li et al., 2021). Both QG and MRC receive similar input, a (set of) document(s), while the two tasks diverge on the output they produce: QG systems generate questions for a predetermined answer within the text while, conversely, MRC systems aim to answer a prescribed set of questions. Recent QG research suggests that the direct employment of MRC datasets for QG tasks is advantageous (Kim et al., 2019; Wang et al., 2020; Cho et al., 2021).
In terms of QG evaluation, widely-applied metrics can be categorized into two main classes: word overlap metrics (e.g., BLEU (Papineni et al., 2002a) and Answerability (Nema and Khapra, 2018)) and metrics that employ large pre-trained language models (BLEURT (Sellam et al., 2020) and BERTScore (Zhang et al., 2020a)). Evaluation via automatic metrics still faces a number of challenges, however. Firstly, most existing metrics are not specifically designed to evaluate QG systems, as they are borrowed from other NLP tasks. Such metrics have been criticized for poor correlation with human assessment in the evaluation of their own NLP tasks, such as machine translation (MT) and dialogue systems (Reiter, 2018; Graham, 2015; Graham and Liu, 2016; Ji et al., 2022), which raises questions about the validity of results based on metrics designed for other tasks. Another challenge lies in the fact that existing automatic evaluation metrics rely on comparison of a candidate with a ground-truth reference. Such approaches ignore the one-to-many nature of QG, namely the fact that a QG system is capable of generating legitimately plausible questions that will be harshly penalised simply for diverging from ground-truth questions. For example, with a passage describing Ireland, the country located in western Europe, two questions Q1 and Q2, where Q1 = “What is the capital of Ireland?” and Q2 = “Which city in the Leinster province has the largest population?”, can share the same answer “Dublin”. In other words, it is fairly appropriate for a QG system to generate either Q1 or Q2 given the same passage and answer, despite there being little overlap between the meanings of Q1 and Q2. We refer to this as the one-to-many nature of the QG task, as one passage and answer can lead to many meaningful questions. A word-overlap-based metric will however incorrectly assess Q2 with a lower score if it takes Q1 as the reference, because of the lack of word overlap between these two questions.
A potential solution is to pair each answer with a larger number of hand-crafted reference questions. However, the addition of reliable references requires additional resources, usually incurring a high cost, while attempting to include every possible correct question for a given answer is prohibitively expensive and impractical. Another drawback is that pretrained-model-based metrics require extra resources during the fine-tuning process, resulting in a high cost.
Besides the aforementioned evaluation metrics, human evaluation is also widely employed in QG tasks. However, the QG community currently lacks a standard human evaluation approach, as current QG research employs disparate settings of human evaluation (e.g., expert-based or crowd-sourced, binary or 5-point rating scales) (Xie et al., 2020; Ji et al., 2021).
To address the existing shortcomings in QG evaluation, we firstly propose a new automatic metric called QAScore. To investigate whether QAScore can outperform existing automatic evaluation metrics, we additionally devise a new human evaluation approach for QG systems and evaluate its reliability in terms of consistent results for QG systems through self-replication experiments. Details of our contributions are listed as follows:

1. We propose a pretrained-language-model-based evaluation metric called QAScore, which is unsupervised and reference-free. QAScore utilizes the RoBERTa model (Liu et al., 2019), and evaluates a system-generated question using the cross entropy in terms of the probability that RoBERTa can correctly predict the masked words in the answer to that question (see the sketch after this list for an illustration of this scoring scheme).

2. We propose a novel and highly reliable crowd-sourced human evaluation method that can be used as a standard framework for evaluating QG systems. Compared to other human evaluation methods, it is cost-effective and easy to deploy. We further conduct a self-replication experiment showing a correlation of r = 0.955 between two distinct evaluations of the same set of systems. According to the results of the human evaluation experiment, QAScore outperforms all other metrics without supervision steps or fine-tuning, achieving a strong Pearson correlation with human assessment.
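The following minimal sketch illustrates the scoring scheme described in contribution 1, using RoBERTa from the Hugging Face transformers library: each answer token is masked in turn and the cross entropy of recovering it, given the passage and the generated question, is accumulated. The input construction, masking strategy, and normalization here are simplifying assumptions for illustration and may differ from the exact QAScore formulation.

```python
# Minimal QAScore-style sketch: mask each answer token and accumulate the
# cross entropy of RoBERTa recovering it from the passage and the question.
import torch
from transformers import RobertaTokenizer, RobertaForMaskedLM

tokenizer = RobertaTokenizer.from_pretrained("roberta-base")
model = RobertaForMaskedLM.from_pretrained("roberta-base")
model.eval()


def qascore_sketch(passage: str, question: str, answer: str) -> float:
    """Return a negated cross entropy; higher means the answer is easier to
    recover given the passage and the system-generated question."""
    answer_ids = tokenizer(" " + answer, add_special_tokens=False)["input_ids"]
    text_ids = tokenizer(passage + " " + question,
                         add_special_tokens=False)["input_ids"]
    text_ids = text_ids[:400]  # naive truncation to respect RoBERTa's length limit
    total_ce = 0.0
    for i in range(len(answer_ids)):
        # Mask the i-th answer token while keeping the rest of the input intact.
        masked_answer = list(answer_ids)
        masked_answer[i] = tokenizer.mask_token_id
        input_ids = ([tokenizer.cls_token_id] + text_ids + [tokenizer.sep_token_id]
                     + masked_answer + [tokenizer.sep_token_id])
        with torch.no_grad():
            logits = model(torch.tensor([input_ids])).logits
        mask_pos = 1 + len(text_ids) + 1 + i  # position of the masked answer token
        log_probs = torch.log_softmax(logits[0, mask_pos], dim=-1)
        total_ce += -log_probs[answer_ids[i]].item()
    return -total_ce / max(len(answer_ids), 1)
```

With such a sketch, a system-level score could be obtained by averaging over all passage–question–answer triples produced by a QG system.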
2 Background: Question Answering, Question Generation and Evaluation
2.1 Question Answering
Question Answering (QA) aims to provide answers a to the corresponding questions q. Based on the availability of context c, QA can be categorized into open-domain QA (without context) (Chen et al., 2017; Zhu et al., 2021) and machine reading comprehension (with context) (Rajpurkar et al., 2016; Saha et al., 2018). Besides, QA can also be categorized into generative QA (Kočiský et al., 2018; Xu et al., 2022) and extractive QA (Trischler et al., 2017; Lyu et al., 2022; Lewis et al., 2021; Zhang et al., 2020b). Generally, the optimization objective of QA models is to maximize the log likelihood of the ground-truth answer a for the given context c and question q. Therefore the objective function with respect to the parameters θ of QA models is:

$J(\theta) = \log P(a \mid c, q; \theta)$  (1)
2.2 Question Generation
Question Generation (QG) is a task where models receive context passages c and answers a, then generate the corresponding questions q, which are expected to be semantically relevant to the context c and answers a (Pan et al., 2019; Lyu et al., 2021). Thus QG is a reverse/dual task of QA: QA aims to provide answers a to questions q, whereas QG aims to generate questions q for the given answers a. Typically, the architecture of QG systems is a Seq2Seq model (Sutskever et al., 2014), which generates q word by word in an auto-regressive manner. The objective for optimizing the parameters θ of QG systems is to maximize the likelihood of P(q | c, a):

$J(\theta) = \log P(q \mid c, a) = \sum_i \log P(q_i \mid q_{<i}, c, a)$  (2)
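As a concrete illustration of Equation 2, the sketch below computes the token-level log-likelihood of a question given a context and answer with an off-the-shelf sequence-to-sequence model from the transformers library. The choice of BART and the naive concatenation of context and answer are assumptions made for illustration, not the setup of any specific QG system.

```python
# Illustrative computation of the QG objective in Equation 2:
# log P(q | c, a) summed over question tokens, using a generic seq2seq model.
import torch
from transformers import BartTokenizer, BartForConditionalGeneration

tokenizer = BartTokenizer.from_pretrained("facebook/bart-base")
model = BartForConditionalGeneration.from_pretrained("facebook/bart-base")
model.eval()

context = "Dublin is the capital and largest city of Ireland."
answer = "Dublin"
question = "What is the capital of Ireland?"

# Hypothetical input format: context and answer simply concatenated.
inputs = tokenizer(context + " " + answer, return_tensors="pt")
labels = tokenizer(question, return_tensors="pt")["input_ids"]

with torch.no_grad():
    outputs = model(**inputs, labels=labels)

# outputs.loss is the mean token-level negative log-likelihood, so the summed
# log-likelihood log P(q | c, a) can be approximated as follows.
log_likelihood = -outputs.loss.item() * labels.size(1)
print(f"log P(q | c, a) ≈ {log_likelihood:.2f}")
```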
2.3 Automatic evaluation metrics
In the following sections, we introduce the two main categories of automatic evaluation metrics applied to the QG task: word-overlap-based metrics and pretrained-model-based metrics.
2.3.1 Word-overlap-based metrics
Word-overlap-based metrics usually assess the quality of a QG system according to the overlap rate between the words of a system-generated candidate and a reference. Most such metrics, including BLEU, GLEU, ROUGE and METEOR, were initially proposed for other NLP tasks (e.g., BLEU for MT and ROUGE for text summarization), while Answerability is a QG-exclusive evaluation metric.
BLEU Bilingual Evaluation Understudy (BLEU) is a method originally proposed for evaluating the quality of MT systems (Papineni et al., 2002b). For QG evaluation, BLEU computes the level of correspondence between a system-generated question and the reference question by calculating the precision according to the number of matching n-gram segments. These matching segments are considered to be unrelated to their positions in the entire context. The more matching segments there are, the better the quality of the candidate is.
GLEU GLEU (Google-BLEU) is proposed to overcome the drawbacks of BLEU when evaluating individual sentences (Wu et al., 2016). As a variation of BLEU, the GLEU score is reported to correlate highly with the BLEU score at the corpus level. GLEU uses precision and recall scores instead of the modified precision used in BLEU.
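As a rough illustration of how these two scores behave on the Q1/Q2 example from the introduction, the sentence-level implementations in NLTK can be used as follows. This is an illustrative sketch; the implementations and smoothing settings used in this paper's experiments may differ.

```python
# Sentence-level BLEU and GLEU for a candidate question against one reference,
# using NLTK's implementations (smoothing is needed for short sentences).
from nltk.translate.bleu_score import sentence_bleu, SmoothingFunction
from nltk.translate.gleu_score import sentence_gleu

reference = "What is the capital of Ireland ?".lower().split()
candidate = "Which city in the Leinster province has the largest population ?".lower().split()

bleu = sentence_bleu([reference], candidate,
                     smoothing_function=SmoothingFunction().method1)
gleu = sentence_gleu([reference], candidate)

# Both scores are low despite the candidate being a perfectly valid question
# for the same answer, illustrating the one-to-many problem discussed above.
print(f"BLEU: {bleu:.3f}, GLEU: {gleu:.3f}")
```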
ROUGE Recall-Oriented Understudy for Gisting Evaluation (ROUGE) is an evaluation metric developed for assessing the text summarization task, originally conceived as a recall-oriented adaptation of BLEU (Lin, 2004). ROUGE-L is the most popular variant of ROUGE, where L denotes the longest common subsequence (LCS). An LCS is a sequence of words that appears in the same order in both sentences. In contrast with sub-strings (e.g., n-grams), the positions of the words in a sub-sequence are not required to be consecutive in the original sentence. ROUGE-L is then computed as the F_β score according to the number of words in the LCS between a question and a reference.
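A self-contained sketch of the ROUGE-L computation described above is given below; the value of β (a recall-weighted setting common in summarization) is an assumption for illustration and may differ from the configuration used in the experiments.

```python
# ROUGE-L as the F_beta score over the longest common subsequence (LCS)
# between a candidate question and a reference question.
def lcs_length(x: list, y: list) -> int:
    """Dynamic-programming LCS length between two token sequences."""
    table = [[0] * (len(y) + 1) for _ in range(len(x) + 1)]
    for i, xi in enumerate(x, 1):
        for j, yj in enumerate(y, 1):
            table[i][j] = table[i - 1][j - 1] + 1 if xi == yj else max(
                table[i - 1][j], table[i][j - 1])
    return table[len(x)][len(y)]


def rouge_l(candidate: str, reference: str, beta: float = 1.2) -> float:
    cand, ref = candidate.lower().split(), reference.lower().split()
    lcs = lcs_length(cand, ref)
    if lcs == 0:
        return 0.0
    precision = lcs / len(cand)
    recall = lcs / len(ref)
    # F_beta combination of LCS-based precision and recall.
    return (1 + beta ** 2) * precision * recall / (recall + beta ** 2 * precision)


print(rouge_l("address of DCU", "What is the address of DCU ?"))
```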
METEOR Metric for Evaluation of Translation with Explicit ORdering (METEOR) was first proposed to make up for the disadvantages of BLEU, such as its lack of recall and its inaccuracy when assessing a single sentence (Banerjee and Lavie, 2005). METEOR first generates a set of mappings between the question q and the reference r according to a set of stages, including: exact token matching (i.e., two tokens are the same), WordNet synonyms (e.g., well and good), and Porter stemming (e.g., friend and friends). The METEOR score is then computed as the weighted harmonic mean of precision and recall in terms of the number of unigrams in the mappings between a question and a reference.
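For reference, NLTK ships an implementation of METEOR that can be used roughly as follows. This is a sketch only: the required WordNet data must be downloaded, recent NLTK versions expect pre-tokenized inputs as shown, and the implementation used in this paper's experiments may differ.

```python
# METEOR via NLTK (requires WordNet data: nltk.download("wordnet")).
import nltk
from nltk.translate.meteor_score import meteor_score

nltk.download("wordnet", quiet=True)

reference = "What is the address of DCU ?".lower().split()
candidate = "address of DCU".lower().split()

# Single-reference METEOR; multiple references could be passed in the list.
print(f"METEOR: {meteor_score([reference], candidate):.3f}")
```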
Answerability Aside from the aforementioned evaluation methods, which are borrowed from other NLP tasks, an automatic metric called Answerability has been specifically proposed for the QG task (Nema and Khapra, 2018). Nema and Khapra (2018) suggest combining it with other existing metrics, since its aim is to measure how answerable a question is, something not usually targeted by other automatic metrics. For example, given a reference question r: “What is the address of DCU?” and two generated questions q1: “address of DCU” and q2: “What is the address of”, it is obvious that q1 is rather answerable since it contains enough information, while q2 is very confusing. However, any similarity-based metric is certainly prone to conclude that q2 (ROUGE-L: 90.9; METEOR: 41.4; BLEU-1: 81.9) is closer to r than q1 (ROUGE-L: 66.7; METEOR: 38.0; BLEU-1: 36.8). Thus, Answerability is proposed to solve such an issue. In detail, for a system-generated question q and a reference question r, the Answerability score can be computed as shown in Equation 3:
$P = \sum_{i \in E} w_i \cdot \frac{h_i(q, r)}{k_i(q)}$
$R = \sum_{i \in E} w_i \cdot \frac{h_i(q, r)}{k_i(r)}$
$\text{Answerability} = \frac{2 \times P \times R}{P + R}$  (3)

where i ∈ E indexes the element types in E = {R, N, Q, F} (R = Relevant Content Word, N = Named Entity, Q = Question Type, and F = Function Word). w_i is the weight for type i, such that $\sum_{i \in E} w_i = 1$. The function h_i(x, y) returns the number of i-type words in question x that have matching i-type words in question y, and k_i(x) returns the number of i-type words occurring in question x. The final Answerability score is the F1 score of the precision P and the recall R.
Along with using Answerability individually, a common practice is to combine it with other metrics, as suggested when evaluating QG systems (Chen et al., 2020; Lewis et al., 2020b):

$\text{Metric}_{\text{mod}} = \beta \cdot \text{Answerability} + (1 - \beta) \cdot \text{Metric}_{\text{ori}}$  (4)

where Metric_mod is a modified version of an original evaluation metric Metric_ori using Answerability, and β is a hyper-parameter. In this experiment, we combine it with BLEU to generate the Q-BLEU score using the default value of β.
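To make Equations 3 and 4 concrete, the sketch below computes the Answerability score from per-type match and occurrence counts and then interpolates it with a BLEU score. The identification of relevant content words, named entities, question types and function words is omitted, and the counts, weights, and β shown here are purely hypothetical placeholders.

```python
# Answerability (Equation 3) from per-type counts, then Q-BLEU (Equation 4).
# The per-type counts h_i(q, r), k_i(q), k_i(r) are assumed to have been
# extracted beforehand (e.g., via POS tagging and NER); values are illustrative.
TYPES = ["relevant_content", "named_entity", "question_type", "function_word"]

weights = {t: 0.25 for t in TYPES}  # hypothetical weights summing to 1
h = {"relevant_content": 2, "named_entity": 1, "question_type": 0, "function_word": 2}
k_q = {"relevant_content": 2, "named_entity": 1, "question_type": 0, "function_word": 3}
k_r = {"relevant_content": 3, "named_entity": 1, "question_type": 1, "function_word": 4}


def answerability(h, k_q, k_r, weights):
    precision = sum(weights[t] * h[t] / k_q[t] for t in TYPES if k_q[t] > 0)
    recall = sum(weights[t] * h[t] / k_r[t] for t in TYPES if k_r[t] > 0)
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


def q_bleu(answerability_score, bleu_score, beta=0.2):
    # Equation 4: interpolate Answerability with an existing metric score.
    return beta * answerability_score + (1 - beta) * bleu_score


ans = answerability(h, k_q, k_r, weights)
print(f"Answerability: {ans:.3f}, Q-BLEU: {q_bleu(ans, bleu_score=0.35):.3f}")
```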
2.3.2 Pretrained-model-based metrics
BERTScore Zhang et al. (2020a) proposed an automatic metric called BERTScore for evaluating text generation tasks, because word-overlap-based metrics like BLEU fail to account for compositional diversity. Instead, BERTScore computes a similarity score between tokens in a candidate sentence and its reference based on their contextualized representations produced by BERT (Devlin et al., 2019). Given a question q that has m tokens and a reference r that has n tokens, the BERT model first generates the representations of q and r as q = ⟨q_1, q_2, ..., q_m⟩ and r = ⟨r_1, r_2, ..., r_n⟩, where q_i and r_i respectively denote the contextual embeddings of the i-th token in q and r. Then, the BERTScore between the question and the reference can be computed by Equation 5:

$P_{\text{BERT}} = \frac{1}{m} \sum_{q_i \in q} \max_{r_j \in r} q_i^{\top} r_j$
$R_{\text{BERT}} = \frac{1}{n} \sum_{r_i \in r} \max_{q_j \in q} q_j^{\top} r_i$
$\text{BERTScore} = \frac{2 \cdot P_{\text{BERT}} \cdot R_{\text{BERT}}}{P_{\text{BERT}} + R_{\text{BERT}}}$  (5)

where the final BERTScore is the F1 measure computed from the precision P_BERT and the recall R_BERT.
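In practice, BERTScore can be computed with the authors' bert-score package roughly as follows. This is a sketch: the underlying model, baseline rescaling, and language settings are configurable, and the defaults assumed here may differ from the configuration used in this paper's experiments.

```python
# Corpus-level BERTScore with the bert-score package (pip install bert-score).
from bert_score import score

candidates = ["Which city in the Leinster province has the largest population?"]
references = ["What is the capital of Ireland?"]

# Returns per-sentence precision, recall, and F1 tensors.
P, R, F1 = score(candidates, references, lang="en", verbose=False)
print(f"P={P.mean():.3f} R={R.mean():.3f} F1={F1.mean():.3f}")
```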
BLEURT BLEURT is proposed to address the issue that metrics like BLEU may correlate poorly with human judgments (Sellam et al., 2020). It is a trained evaluation metric that takes a candidate and its reference as input and outputs a score indicating to what extent the candidate covers the meaning of the reference. BLEURT uses a BERT-based regression model trained on the human rating data from the WMT Metrics Shared Tasks from 2017 to 2019. Since BLEURT was proposed for evaluating models at the sentence level, and no formal procedure is available for corpus-level evaluation, we directly compute the final BLEURT score of a QG system as the arithmetic mean of all sentence-level BLEURT scores in our QG evaluation experiment, as suggested (see the discussion at https://github.com/google-research/bleurt/issues/10).
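The corpus-level averaging described above can be sketched with the reference BLEURT implementation as follows; the checkpoint path is a placeholder, and whichever checkpoint is used must be downloaded separately from the BLEURT repository.

```python
# Sentence-level BLEURT scores averaged into a corpus-level system score.
# Requires the bleurt package and a downloaded checkpoint (placeholder path below).
from bleurt import score as bleurt_score

checkpoint = "path/to/BLEURT-20"  # placeholder checkpoint directory
scorer = bleurt_score.BleurtScorer(checkpoint)

candidates = ["Which city in the Leinster province has the largest population?"]
references = ["What is the capital of Ireland?"]

sentence_scores = scorer.score(references=references, candidates=candidates)
system_score = sum(sentence_scores) / len(sentence_scores)
print(f"Corpus-level BLEURT: {system_score:.3f}")
```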
2.4 Human evaluation
Although the aforementioned automatic metrics are widely employed for QG evaluation, criticism has been levelled at the ability of n-gram overlap-based metrics to accurately and comprehensively evaluate quality (Yuan et al., 2017). As a single answer can potentially have a large number of corresponding plausible questions, simply computing the overlap rate between an output and a reference does not convincingly reflect the real quality of a QG system. A possible solution is to obtain more correct questions per answer, as n-gram overlap-based metrics would usually benefit from multiple ground-truth references. However, this may elicit new issues: 1) adding additional references over entire corpora requires effort similar to creating a new dataset, incurring expensive time and resource costs; 2) it is not straightforward to formulate how word overlap should contribute to the final score for systems.
Hence, human evaluation is also involved when evaluating newly proposed QG systems. A common approach is to take a set of system-generated questions and ask human raters to score these questions on an n-point Likert scale. Below we introduce and describe recent human evaluations applied to QG systems.
Jia et al. (2021) proposed EQG-RACE to generate examination-type questions for educational purposes. 100 outputs are sampled and three expert raters are required to score these outputs in three dimensions: fluency - whether a question is grammatical and fluent; relevancy - whether the question is semantically relevant to the passage; and answerability - whether the question can be answered by the right answer. A 3-point scale is used for each aspect, and aspects are reported separately without an overall performance score.
KD-QG is a framework with a knowledge base