Generative Language Models for Paragraph-Level Question Generation
Asahi Ushio and Fernando Alva-Manchego and Jose Camacho-Collados
Cardiff NLP, School of Computer Science and Informatics, Cardiff University, UK
{UshioA,AlvaManchegoF,CamachoColladosJ}@cardiff.ac.uk
Abstract
Powerful generative models have led to recent progress in question generation (QG). However, it is difficult to measure advances in QG research since there are no standardized resources that allow a uniform comparison among approaches. In this paper, we introduce QG-Bench, a multilingual and multidomain benchmark for QG that unifies existing question answering datasets by converting them to a standard QG setting. It includes general-purpose datasets such as SQuAD (Rajpurkar et al., 2016) for English, datasets from ten domains and two styles, as well as datasets in eight different languages. Using QG-Bench as a reference, we perform an extensive analysis of the capabilities of language models for the task. First, we propose robust QG baselines based on fine-tuning generative language models. Then, we complement automatic evaluation based on standard metrics with an extensive manual evaluation, which in turn sheds light on the difficulty of evaluating QG models. Finally, we analyse both the domain adaptability of these models as well as the effectiveness of multilingual models in languages other than English. QG-Bench is released along with the fine-tuned models presented in the paper,¹ which are also available as a demo.²

¹ https://github.com/asahi417/lm-question-generation
² https://autoqg.net/
1 Introduction
Question generation (QG, Mitkov and Ha, 2003) is the task of generating a question given an input context consisting of a document, a paragraph or a sentence, and an answer where the question is anchored (see Figure 1). QG has been widely studied in natural language processing communities (Du et al., 2017; Zhou et al., 2017; Du and Cardie, 2018), and it has recently been exploited to train question answering (QA) models without human supervision (Lewis et al., 2019; Zhang and Bansal, 2019; Puri et al., 2020), or as a means of data augmentation (Shakeri et al., 2020; Bartolo et al., 2021). It has also been applied to develop educational systems (Heilman and Smith, 2010; Lindberg et al., 2013), information retrieval models (Pyatkin et al., 2021; Lewis et al., 2021), and for model interpretation (Perez et al., 2020; Lee et al., 2020).

Figure 1: Overview of paragraph-level QG.
Despite its success in downstream applications, the development of neural QG models has received less attention. For example, the choice of the base pre-trained model is arbitrary (without proper justification in most cases), as it is not straightforward to compare different models. As a consequence, while ERNIE-GEN (Xiao et al., 2021) and UniLMv2 (Bao et al., 2020) are the current SotA on the SQuAD QG benchmark (Du et al., 2017), T5 (Raffel et al., 2020) and BART (Lewis et al., 2020a) are used in many applications in practice (Paranjape et al., 2021; Bartolo et al., 2021; Lewis et al., 2021; Pyatkin et al., 2021).
A possible reason is the inconsistent evaluation and comparison of QG models, due to the lack of appropriate evaluation protocols and benchmarks. For instance, evaluation of QG models relies on BLEU4 (Papineni et al., 2002), METEOR (Denkowski and Lavie, 2014), and ROUGE-L (Lin, 2004), with human-made questions as references. However, some of these metrics may have low correlation with human judgements, especially when it comes to answerability, since they tend not to take the associated answer into account (Nema and Khapra, 2018). Moreover, QG applications can use different contexts as input, such as sentence-level (Pyatkin et al., 2021; Lewis et al., 2019) vs paragraph-level (Zhang and Bansal, 2019; Puri et al., 2020), or answer-aware (Shakeri et al., 2020; Bartolo et al., 2021) vs answer-free (Lopez et al., 2020). These settings are generally used interchangeably in the literature.
To investigate how to tackle the issues raised above, we introduce QG-Bench, a collection of standard QA datasets unified into a single benchmark, including domain-specific datasets and datasets in eight different languages (§ 3). We then use QG-Bench to fine-tune various generative language models (LMs) by formulating paragraph-level QG as a sequence-to-sequence generation task (§ 4), and measure their performance on in-domain and language-specific data (§ 5). Finally, we present a multi-faceted analysis of our QG models by varying their input context size (§ 6.1), conducting a manual evaluation (§ 6.2), and studying their abilities for domain adaptation (§ 6.3).
2 Related Work
Early work on QG was based on human-engineered templates (Mitkov and Ha, 2003; Rus et al., 2010) and well-designed pipelines (Heilman and Smith, 2010; Labutov et al., 2015), but neural approaches soon took over by generating a question from a text in an end-to-end manner (Du et al., 2017; Zhou et al., 2017; Du and Cardie, 2018). The quality of QG models was later improved by masked LM pre-training (Devlin et al., 2019; Liu et al., 2019), where the encoder of the QG model is fine-tuned from a pre-trained LM (Chan and Fan, 2019; Zhang and Bansal, 2019). Recently, sequence-to-sequence LM pre-training has made it possible to fully fine-tune QG models (both encoder and decoder), achieving SotA performance (Dong et al., 2019; Qi et al., 2020; Bao et al., 2020; Xiao et al., 2021). Following the latest research in the literature, we focus on sequence-to-sequence LM-based QG models.
QG can be applied to domain adaptation (Shakeri et al., 2020), knowledge-enhanced LM pre-training (Jia et al., 2021), adversarial/counterfactual data augmentation (Bartolo et al., 2021; Paranjape et al., 2021), and nearest neighbour QA systems (Lewis et al., 2021). Applications of QG go beyond QA, including semantic role labeling (Pyatkin et al., 2021), visual QA (Krishna et al., 2019), multi-hop question decomposition (Perez et al., 2020), and question rewriting (Lee et al., 2020). Moreover, QG can be applied to unsupervised QA, which consists of training a QA model without any supervision, relying on a QG model to generate questions (Lewis et al., 2019). Puri et al. (2020) showed that, with a carefully-designed QG model, we can generate high-quality QA datasets on which a QA model can even outperform its supervised counterpart. Relatedly, Zhang and Bansal (2019) proposed QA-based evaluation, which connects the quality of a QG model to the accuracy of a QA model trained on the synthetic data generated by the QG model.
While QG models can be applied to this variety of tasks, the comparison across tasks is not always straightforward. For this reason, and given the relevance of QG in current research, in this paper we propose an intrinsic QG benchmark in which we can evaluate different aspects of a QG model in a simple manner, including, but not limited to, the analysis of input types, domain adaptability and multilinguality. The most similar work to ours is the MTG benchmark (Chen et al., 2021), which contains multilingual test sets for four NLG tasks. While QG is part of this benchmark, there are a few major differences from our proposed QG-Bench: (i) we provide training/validation/test sets to allow model training in each language in addition to the evaluation; (ii) MTG's test set consists of parallel sentences across languages obtained by translation from English, while we leverage monolingual datasets; (iii) we include eight languages, while MTG has five; and (iv) QG-Bench includes datasets from different domains and styles.
3 QG-Bench: A Unified Question Generation Benchmark
In this section, we describe our process to construct QG-Bench, including data collection and unification (§ 3.1), and its statistics (§ 3.2).
3.1 Data Collection and Unification
We unified a collection of datasets designed to be used for QG model training and evaluation. All datasets are in the same format, where each entry contains four features: paragraph, sentence, question, and answer. As shown in Figure 1, we treat the question as the output of a QG system, conditioned on an answer that is always a sub-string of a sentence from the paragraph. We leverage existing QA datasets by compiling them into this unified QG format. All datasets included in QG-Bench are described below.
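For concreteness, the sketch below shows what a single entry in this unified format could look like, rendered as a Python dictionary. The field names follow the four features described above; the example values are illustrative, and the released files may use a different serialization or additional metadata.

```python
# A minimal sketch of one QG-Bench-style entry (illustrative values only;
# the released data may use a different serialization or extra fields).
entry = {
    # Full paragraph that provides the context for the question.
    "paragraph": "Beyoncé Giselle Knowles-Carter (born September 4, 1981) is an "
                 "American singer, songwriter, record producer and actress. ...",
    # The single sentence within the paragraph that contains the answer.
    "sentence": "Beyoncé Giselle Knowles-Carter (born September 4, 1981) is an "
                "American singer, songwriter, record producer and actress.",
    # The target question a QG model should generate.
    "question": "When was Beyoncé born?",
    # The answer, always a sub-string of the sentence above.
    "answer": "September 4, 1981",
}

# The answer must be recoverable from the sentence (and hence the paragraph).
assert entry["answer"] in entry["sentence"]
assert entry["sentence"] in entry["paragraph"]
```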
SQuAD (English). We first consider SQuAD v1.1 (Rajpurkar et al., 2016), an extractive QA dataset based on Wikipedia which has been commonly used in QG since Du et al. (2017) and Zhou et al. (2017). As the original test set of SQuAD is not released, we use the same data split as Du et al. (2017).
Domain-specific Datasets (English). To assess models' domain adaptability, we consider two domain-specific QA datasets: SQuADShifts (Miller et al., 2020) and SubjQA (Bjerva et al., 2020). SQuADShifts contains questions in the same style as SQuAD but from four additional domains (Amazon/Wikipedia/News/Reddit), while SubjQA, unlike SQuAD, consists of generally subjective questions and answers (e.g. "how is the hero?" with the answer "the hero was wonderful") across six domains. As the original SQuADShifts consists of a test set only, we created a new training/validation/test split in which half of the dataset remains in the test set, while the remaining half is split between validation and training with a 1:2 ratio.
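A minimal sketch of this re-splitting procedure is given below, assuming the SQuADShifts examples of one domain are already loaded as a list. The random seed, shuffling strategy and loading code are illustrative placeholders, not the exact script used to build the benchmark.

```python
import random

def resplit_squadshifts(examples, seed=0):
    """Split a SQuADShifts domain (originally test-only) into train/valid/test.

    Half of the data is kept as the new test set; the remaining half is split
    between validation and training with a 1:2 ratio (illustrative sketch).
    """
    examples = list(examples)
    random.Random(seed).shuffle(examples)

    half = len(examples) // 2
    test = examples[:half]            # 1/2 of the data -> new test set
    rest = examples[half:]            # remaining 1/2 -> validation + training
    valid_size = len(rest) // 3       # 1:2 ratio between validation and training
    valid = rest[:valid_size]
    train = rest[valid_size:]
    return train, valid, test

# Example with dummy entries: 6,000 examples -> 2,000 train / 1,000 valid / 3,000 test.
train, valid, test = resplit_squadshifts([{"id": i} for i in range(6000)])
print(len(train), len(valid), len(test))  # 2000 1000 3000
```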
Datasets in Languages other than English. To investigate multilinguality in QG, we compile the following seven SQuAD-style QA datasets: JAQuAD (So et al., 2022) (Japanese), GermanQuAD (Möller et al., 2021) (German), SberQuAD (Efimov et al., 2020) (Russian), KorQuAD (Lim et al., 2019) (Korean), FQuAD (d'Hoffschmidt et al., 2020) (French), Spanish SQuAD (Casimiro Pio et al., 2019) (Spanish), and Italian SQuAD (Croce et al., 2018) (Italian). Since these datasets do not release test sets, we sampled a subset from each training set as the test set, following Du et al. (2017). Each test set contains the same number of questions as its validation set, and the new training/test splits have no overlap in terms of paragraphs.
Other Datasets not Included in QG-Bench. In theory, any extractive QA dataset could be part of our benchmark. However, we decided not to include datasets such as BioASQ (Tsatsaronis et al., 2015) and NewsQA (Trischler et al., 2017) because they have very long input texts, representing another category that needs extra mechanisms to handle long sequences (Izacard and Grave, 2020a,b), which is out of the scope of this paper. In addition, one could leverage multilingual QA benchmarks (Clark et al., 2020; Artetxe et al., 2020; Lewis et al., 2020b) to obtain multilingual QG datasets, but XQuAD (Artetxe et al., 2020) and MLQA (Lewis et al., 2020b) do not contain training sets, and TydiQA (Clark et al., 2020) contains a very small training set. Instead, we focused on monolingual QA datasets in each language.

Table 1: Statistics of all datasets integrated into our question generation benchmark after unification. Data size is reported as train/valid/test; average character length as paragraph/sentence/question/answer.

Dataset            Data size (train/valid/test)    Avg. char. length (para./sent./ques./ans.)
SQuAD              75,722 / 10,570 / 11,877        757 / 179 / 59 / 20
SubjQA
  - Book           637 / 92 / 191                  1,514 / 146 / 28 / 83
  - Elec.          697 / 99 / 238                  1,282 / 129 / 26 / 66
  - Grocery        687 / 101 / 379                 896 / 107 / 25 / 49
  - Movie          724 / 101 / 154                 1,746 / 146 / 27 / 72
  - Rest.          823 / 129 / 136                 1,006 / 104 / 26 / 51
  - Trip           875 / 143 / 397                 1,002 / 108 / 27 / 51
SQuADShifts
  - Amazon         3,295 / 1,648 / 4,942           773 / 111 / 43 / 18
  - Wiki           2,646 / 1,323 / 3,969           773 / 184 / 58 / 26
  - News           3,355 / 1,678 / 5,032           781 / 169 / 51 / 20
  - Reddit         3,268 / 1,634 / 4,901           774 / 116 / 45 / 19
Multilingual QG
  - Ja             27,809 / 3,939 / 3,939          424 / 72 / 32 / 6
  - Es             77,025 / 10,570 / 10,570        781 / 122 / 64 / 21
  - De             9,314 / 2,204 / 2,204           1,577 / 165 / 59 / 66
  - Ru             40,291 / 5,036 / 5,036          754 / 174 / 64 / 26
  - Ko             54,556 / 5,766 / 5,766          521 / 81 / 34 / 6
  - It             46,550 / 7,609 / 7,609          807 / 124 / 66 / 16
  - Fr             17,543 / 3,188 / 3,188          797 / 160 / 57 / 23
3.2 Data Statistics
Table 1 summarizes the statistics of each QG dataset after unification. It can be observed that SubjQA and SQuADShifts have ten to a hundred times less training data than SQuAD. Also, SubjQA's answers are twice as long as SQuAD's answers, which can be explained by how they differ in the way questions are formed (i.e., SubjQA being more subjective in nature). Likewise, except for Spanish, the datasets for languages other than English contain less training data than the original SQuAD, with the amount varying depending on the language.
4 LMs for Question Generation
In this section, we formalize the QG task from a language modelling perspective (§ 4.1), including details on the fine-tuning process (§ 4.2) and the setup for our experiments with QG-Bench (§ 4.3).
4.1 Task Formulation
Given an input text $c$, the goal of QG is to generate a natural question $\hat{q}$ related to the information in the input. The task is formulated as conditional sequence generation, and the model is optimized to maximize the conditional log-likelihood $P(q \mid c)$, as in Equation 1:

$$\hat{q} = \arg\max_{q} P(q \mid c) \qquad (1)$$

In practice, the log-likelihood is factorized into word- or subword-level predictions, similar to other sequence-to-sequence learning settings (Sutskever et al., 2014).
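For clarity, this factorization can be written out explicitly. The decomposition below is the standard autoregressive formulation assumed by sequence-to-sequence models, not an equation taken from the original paper:

$$\log P(q \mid c) = \sum_{t=1}^{|q|} \log P\left(q_t \mid q_{<t},\, c\right)$$

where $q_t$ denotes the $t$-th (sub)word token of the question $q$ and $q_{<t}$ its preceding tokens.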
4.2 Language Model Fine-tuning
Fine-tuning sequence-to-sequence LMs on QG can be done in the same way as for Machine Translation or Summarization, where models are trained to predict the output tokens given the input tokens (Dong et al., 2019; Qi et al., 2020; Bao et al., 2020; Xiao et al., 2021). We follow Chan and Fan (2019) by introducing a highlight token <hl> to take into account an answer $a$ within a context $c$ as below:

$$x = [c_1, \dots, \texttt{<hl>}, a_1, \dots, a_{|a|}, \texttt{<hl>}, \dots, c_{|c|}]$$

Instead of a paragraph, we can similarly use a sentence as the context in which the answer is highlighted (sentence-level QG), or highlight a sentence instead of an answer (answer-free QG). We investigate these model variations in our analysis (§ 6.1), but assume the answer-highlighted paragraph as the default input.
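As a concrete illustration, the snippet below builds an answer-highlighted input and feeds it to a T5 checkpoint with HuggingFace transformers. It is a minimal sketch of the input format only: the plain t5-base checkpoint is not fine-tuned for QG, so a meaningful question is only expected after fine-tuning as described in this section, and the exact special-token handling may differ from the released models.

```python
from transformers import AutoModelForSeq2SeqLM, AutoTokenizer

# Base checkpoint only; swap in a QG fine-tuned checkpoint for real use.
tokenizer = AutoTokenizer.from_pretrained("t5-base")
model = AutoModelForSeq2SeqLM.from_pretrained("t5-base")

# Register <hl> as an additional token and resize the embedding matrix.
tokenizer.add_tokens(["<hl>"])
model.resize_token_embeddings(len(tokenizer))

paragraph = ("Beyoncé Giselle Knowles-Carter (born September 4, 1981) is an "
             "American singer, songwriter, record producer and actress.")
answer = "September 4, 1981"

# Wrap the answer span with <hl> inside the paragraph, as in the formula above.
highlighted = paragraph.replace(answer, f"<hl> {answer} <hl>", 1)

inputs = tokenizer(highlighted, return_tensors="pt", truncation=True, max_length=512)
outputs = model.generate(**inputs, num_beams=4, max_length=64)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```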
Note that it is possible to train other types of LMs on QG, but masked LMs were not designed for natural language generation and require a specific decoding technique (Chan and Fan, 2019). Also, recurrent LMs have a poor ability for conditional generation on the answer due to their unidirectional architecture (Lopez et al., 2020). Since they are not as suited for QG as sequence-to-sequence models, they are out of the scope of this paper.
4.3 Experimental Setup
Comparison Models. As sequence-to-sequence LMs, we use T5 (Raffel et al., 2020) and BART (Lewis et al., 2020a) for the English datasets, and mT5 (Xue et al., 2021) and mBART (Liu et al., 2020) for the multilingual experiments. Model weights are taken from HuggingFace (Wolf et al., 2020).³ Previous research reported improvements on QG with more recent LMs (Qi et al., 2020; Xiao et al., 2021; Bao et al., 2020). We tried to replicate these previous works in QG-Bench, but after multiple attempts using their provided code and contacting the authors, this was not possible. Nonetheless, both T5 and BART are widely used in practice and, as we will show, they can still provide strong results with an appropriate configuration.

³ We use t5-small, t5-base, t5-large, facebook/bart-base, facebook/bart-large, and google/mt5-small.
Parameter Optimization. We performed an extensive exploration to find the best combination of hyper-parameters to fine-tune LMs on QG, which consists of a two-phase search. First, we fine-tune a model on every possible configuration from the search space for 2 epochs. The top-5 models in terms of BLEU4 (Papineni et al., 2002) on the validation set are selected to continue fine-tuning until their performance plateaus.⁴ Finally, the model that achieves the highest BLEU4 on the validation set is employed as the final model. We used BLEU4 as the objective metric in our parameter optimization since it is light to compute and follows previous work (Du and Cardie, 2018; Dong et al., 2019; Xiao et al., 2021). However, as we will see in our experiments, future work could also explore alternative metrics for validation. The search space contains 24 configurations, made up of learning rates from [0.0001, 0.00005, 0.00001], label smoothing from [0.0, 0.15], and batch sizes from [64, 128, 256, 512].⁵ Our experiments show that this simple parameter optimization strategy significantly improves all models' performance by robustly finding the best configuration for each one.⁶
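The following sketch illustrates the two-phase selection logic described above. Fine-tuning and BLEU4 evaluation are abstracted behind a placeholder function (fine_tune_and_score) that returns a dummy score, so the snippet shows only the search procedure, not the actual training pipeline used for the benchmark.

```python
import itertools

# Search space from the paper: 3 x 2 x 4 = 24 configurations.
LEARNING_RATES = [0.0001, 0.00005, 0.00001]
LABEL_SMOOTHING = [0.0, 0.15]
BATCH_SIZES = [64, 128, 256, 512]

def fine_tune_and_score(config, epochs=None):
    """Placeholder for the real pipeline: fine-tune with `config` (for `epochs`
    epochs, or until validation BLEU4 plateaus when `epochs` is None) and
    return validation BLEU4. A deterministic dummy score is returned here so
    the selection logic can be run end to end."""
    key = (config["lr"], config["label_smoothing"], config["batch_size"], epochs)
    return (hash(key) % 1000) / 1000.0

def two_phase_search():
    configs = [
        {"lr": lr, "label_smoothing": ls, "batch_size": bs}
        for lr, ls, bs in itertools.product(LEARNING_RATES, LABEL_SMOOTHING, BATCH_SIZES)
    ]

    # Phase 1: short fine-tuning (2 epochs) for every configuration.
    phase1 = [(cfg, fine_tune_and_score(cfg, epochs=2)) for cfg in configs]

    # Keep the top-5 configurations by validation BLEU4.
    top5 = sorted(phase1, key=lambda x: x[1], reverse=True)[:5]

    # Phase 2: continue fine-tuning the top-5 until BLEU4 plateaus,
    # then pick the single best configuration.
    phase2 = [(cfg, fine_tune_and_score(cfg, epochs=None)) for cfg, _ in top5]
    return max(phase2, key=lambda x: x[1])

best_config, best_bleu4 = two_phase_search()
print(best_config, best_bleu4)
```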
We ran the parameter optimization on a machine equipped with two Nvidia Quadro RTX 8000 GPUs. Taking SQuAD as a reference, training and evaluation took around three weeks for T5-LARGE, one week for T5-BASE and mT5-SMALL, three days for T5-SMALL, one week for BART-LARGE, and four days for BART-BASE.

⁴ This two-stage process is introduced due to computation limitations, and we might see further improvements (even if small) if a full validation search is performed.
⁵ Other parameters are fixed: the random seed is 1, the beam size is 4, the input token length is 512, and the output token length is 34 for fine-tuning and 64 for evaluation.
⁶ See the Appendix for the actual parameters found by the optimization procedure as well as more training details.
5 Automatic Evaluation

In this section, we report the main results in QG-Bench (§ 3), using the methodology described in § 4.