
which is conditioned on an answer that is always a sub-string of a sentence from the paragraph. We leverage existing QA datasets by compiling them into this unified QG format. All datasets included in QG-Bench are described below.
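To make the unified format concrete, the minimal sketch below shows what a single compiled record could look like. The field names (paragraph, sentence, answer, question) and the example text are illustrative assumptions, not the exact QG-Bench schema.

```python
# A minimal sketch of one record in the unified, answer-conditioned QG format.
# Field names and example text are illustrative assumptions, not the exact QG-Bench schema.
record = {
    "paragraph": "Marie Curie was a physicist and chemist. She was born in Warsaw in 1867 "
                 "and later moved to Paris to continue her studies.",
    "sentence": "She was born in Warsaw in 1867 and later moved to Paris to continue her studies.",
    "answer": "Warsaw",                         # the answer conditioning the generation
    "question": "Where was Marie Curie born?",  # the question to be generated
}

# The answer is always a sub-string of a sentence taken from the paragraph.
assert record["answer"] in record["sentence"]
assert record["sentence"] in record["paragraph"]
```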
SQuAD (English).
We first consider SQuAD v1.1 (Rajpurkar et al., 2016), an extractive QA dataset based on Wikipedia that has been commonly used for QG since Du et al. (2017) and Zhou et al. (2017). As the original test set of SQuAD is not publicly released, we use the same data split as Du et al. (2017).
Domain-specific Datasets (English).
To assess models' domain adaptivity, we consider two domain-specific QA datasets: SQuADShifts (Miller et al., 2020) and SubjQA (Bjerva et al., 2020). SQuADShifts contains questions in the same style as SQuAD but from four additional domains (Amazon/Wikipedia/News/Reddit), while SubjQA, unlike SQuAD, consists of subjective questions and answers (e.g., "how is the hero?" - "the hero was wonderful") across six domains. As the original SQuADShifts consists of a test set only, we created a new training/validation/test split, in which half of the dataset remains in the test set, while the remaining half is split into validation and training sets at a 1:2 ratio.
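As an illustration of this re-splitting procedure, the sketch below divides a test-only dataset under the stated proportions; the function name, seed, and list-based representation are assumptions for illustration, not the exact script used to build QG-Bench.

```python
import random

def resplit_test_only_dataset(examples, seed=42):
    """Turn a test-only dataset into train/validation/test splits:
    half of the data stays as test; the other half goes 2:1 to train/validation."""
    examples = list(examples)
    random.Random(seed).shuffle(examples)

    half = len(examples) // 2
    test, rest = examples[:half], examples[half:]

    n_valid = len(rest) // 3               # validation : training = 1 : 2
    valid, train = rest[:n_valid], rest[n_valid:]
    return train, valid, test
```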
Datasets in Languages other than English.
To investigate multilinguality in QG, we compile the following seven SQuAD-style QA datasets: JAQuAD (So et al., 2022) (Japanese), GerQuAD (Möller et al., 2021) (German), SberQuAD (Efimov et al., 2020) (Russian), KorQuAD (Lim et al., 2019) (Korean), FQuAD (d'Hoffschmidt et al., 2020) (French), Spanish SQuAD (Casimiro Pio et al., 2019) (Spanish), and Italian SQuAD (Croce et al., 2018) (Italian). Since these datasets do not release test sets, we sampled a subset from each training set as the test set, following Du et al. (2017). Each test set contains the same number of questions as its validation set, and the new training/test splits share no paragraphs.
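One way to realize such a paragraph-level hold-out is sketched below: whole paragraphs are drawn out of the training data until roughly the target number of questions is reached, so the resulting train and test splits share no paragraph. The helper name and the assumed "paragraph" field follow the illustrative schema above and are not necessarily the exact procedure used.

```python
import random
from collections import defaultdict

def hold_out_paragraph_level_test(train_examples, n_test, seed=42):
    """Hold out whole paragraphs from the training set as a new test set,
    so that the new train and test splits share no paragraph."""
    by_paragraph = defaultdict(list)
    for ex in train_examples:              # each example is assumed to carry a "paragraph" field
        by_paragraph[ex["paragraph"]].append(ex)

    paragraphs = list(by_paragraph)
    random.Random(seed).shuffle(paragraphs)

    test, kept_paragraphs = [], []
    for p in paragraphs:
        if len(test) < n_test:
            test.extend(by_paragraph[p])   # all questions of this paragraph go to the test set
        else:
            kept_paragraphs.append(p)

    train = [ex for p in kept_paragraphs for ex in by_paragraph[p]]
    return train, test
```

Here, n_test would be set to the number of questions in the corresponding validation set.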
Other Datasets not Included in QG-Bench.
In theory, any extractive QA dataset could be part of our benchmark. However, we decided not to include datasets such as BioASQ (Tsatsaronis et al., 2015) and NewsQA (Trischler et al., 2017) because they have very long input texts, representing another category that needs extra mechanisms to handle long sequences (Izacard and Grave, 2020a,b), which is out of the scope of this paper. In addition, one could leverage multilingual QA benchmarks
(Clark et al., 2020; Artetxe et al., 2020; Lewis et al., 2020b) to obtain multilingual QG datasets, but XQuAD (Artetxe et al., 2020) and MLQA (Lewis et al., 2020b) do not contain training sets, and TydiQA (Clark et al., 2020) contains a very small training set. Instead, we focused on monolingual QA datasets in each language.

Dataset           Data size (train/valid/test)   Avg. character length (para./sent./ques./ans.)
SQuAD             75,722 / 10,570 / 11,877       757 / 179 / 59 / 20
SubjQA
 -Book            637 / 92 / 191                 1,514 / 146 / 28 / 83
 -Elec.           697 / 99 / 238                 1,282 / 129 / 26 / 66
 -Grocery         687 / 101 / 379                896 / 107 / 25 / 49
 -Movie           724 / 101 / 154                1,746 / 146 / 27 / 72
 -Rest.           823 / 129 / 136                1,006 / 104 / 26 / 51
 -Trip            875 / 143 / 397                1,002 / 108 / 27 / 51
SQuADShifts
 -Amazon          3,295 / 1,648 / 4,942          773 / 111 / 43 / 18
 -Wiki            2,646 / 1,323 / 3,969          773 / 184 / 58 / 26
 -News            3,355 / 1,678 / 5,032          781 / 169 / 51 / 20
 -Reddit          3,268 / 1,634 / 4,901          774 / 116 / 45 / 19
Multilingual QG
 -Ja              27,809 / 3,939 / 3,939         424 / 72 / 32 / 6
 -Es              77,025 / 10,570 / 10,570       781 / 122 / 64 / 21
 -De              9,314 / 2,204 / 2,204          1,577 / 165 / 59 / 66
 -Ru              40,291 / 5,036 / 5,036         754 / 174 / 64 / 26
 -Ko              54,556 / 5,766 / 5,766         521 / 81 / 34 / 6
 -It              46,550 / 7,609 / 7,609         807 / 124 / 66 / 16
 -Fr              17,543 / 3,188 / 3,188         797 / 160 / 57 / 23

Table 1: Statistics of all datasets integrated into our question generation benchmark after unification.
3.2 Data Statistics
Table 1 summarizes the statistics of each QG dataset after unification. It can be observed that SubjQA and SQuADShifts have ten to a hundred times less training data than SQuAD. Also, SubjQA's answers are twice as long as SQuAD's, which can be explained by the difference in how their questions are formed (i.e., SubjQA being more subjective in nature). Likewise, except for Spanish, the datasets for languages other than English contain less training data than the original SQuAD, with the amount varying depending on the language.
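For reference, average character lengths of the kind reported in Table 1 can be computed along the following lines; the field names again follow the illustrative schema above rather than the exact QG-Bench data files.

```python
def average_char_lengths(examples):
    """Average character lengths (paragraph / sentence / question / answer),
    as reported in Table 1; field names follow the illustrative schema above."""
    fields = ("paragraph", "sentence", "question", "answer")
    return {
        field: sum(len(ex[field]) for ex in examples) / len(examples)
        for field in fields
    }
```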
4 LMs for Question Generation
In this section, we formalize the QG task from a
language modelling perspective (§ 4.1), including
details on the fine-tuning process (§ 4.2) and the
setup for our experiments with QG-Bench (§ 4.3).
4.1 Task Formulation
Given an input text c, the goal of QG is to generate a natural question q̂ related to the information in