
which is conditioned on an answer that is always a sub-string of a sentence from the paragraph. We leverage existing QA datasets by compiling them into this unified QG format. All datasets included in QG-Bench are described below.
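To make the unified format concrete, the minimal sketch below shows what a single compiled record could look like. The field names (paragraph, sentence, answer, question) and the example text are illustrative assumptions, not the exact QG-Bench schema.

```python
# A minimal sketch of one record in the unified, answer-conditioned QG format.
# Field names and example text are illustrative assumptions, not the exact QG-Bench schema.
record = {
    "paragraph": "Marie Curie was a physicist and chemist. She was born in Warsaw in 1867 "
                 "and later moved to Paris to continue her studies.",
    "sentence": "She was born in Warsaw in 1867 and later moved to Paris to continue her studies.",
    "answer": "Warsaw",                         # the answer conditioning the generation
    "question": "Where was Marie Curie born?",  # the question to be generated
}

# The answer is always a sub-string of a sentence taken from the paragraph.
assert record["answer"] in record["sentence"]
assert record["sentence"] in record["paragraph"]
```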
SQuAD (English).
We first consider SQuAD v1.1 (Rajpurkar et al., 2016), an extractive QA dataset based on Wikipedia that has been commonly used for QG since Du et al. (2017) and Zhou et al. (2017). As the original test set of SQuAD is not publicly released, we use the same data split as Du et al. (2017).
Domain-specific Datasets (English).
To assess models' domain adaptivity, we consider two domain-specific QA datasets: SQuADShifts (Miller et al., 2020) and SubjQA (Bjerva et al., 2020). SQuADShifts contains questions in the same style as SQuAD but from four additional domains (Amazon/Wikipedia/News/Reddit), while SubjQA, unlike SQuAD, consists of subjective questions and answers (e.g., "how is the hero?" - "the hero was wonderful") across six domains. As the original SQuADShifts consists of a test set only, we created a new training/validation/test split, in which half of the dataset remains in the test set, while the remaining half is split into validation and training sets at a 1:2 ratio.
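As an illustration of this re-splitting procedure, the sketch below divides a test-only dataset under the stated proportions; the function name, seed, and list-based representation are assumptions for illustration, not the exact script used to build QG-Bench.

```python
import random

def resplit_test_only_dataset(examples, seed=42):
    """Turn a test-only dataset into train/validation/test splits:
    half of the data stays as test; the other half goes 2:1 to train/validation."""
    examples = list(examples)
    random.Random(seed).shuffle(examples)

    half = len(examples) // 2
    test, rest = examples[:half], examples[half:]

    n_valid = len(rest) // 3               # validation : training = 1 : 2
    valid, train = rest[:n_valid], rest[n_valid:]
    return train, valid, test
```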
Datasets in Languages other than English.
To investigate multilinguality in QG, we compile the following seven SQuAD-style QA datasets: JAQuAD (So et al., 2022) (Japanese), GerQuAD (Möller et al., 2021) (German), SberQuAD (Efimov et al., 2020) (Russian), KorQuAD (Lim et al., 2019) (Korean), FQuAD (d'Hoffschmidt et al., 2020) (French), Spanish SQuAD (Casimiro Pio et al., 2019) (Spanish), and Italian SQuAD (Croce et al., 2018) (Italian). Since these datasets do not release test sets, we sampled a subset from each training set as the test set, following Du et al. (2017). Each test set contains the same number of questions as its validation set, and the new training/test splits share no paragraphs.
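One way to realize such a paragraph-level hold-out is sketched below: whole paragraphs are drawn out of the training data until roughly the target number of questions is reached, so the resulting train and test splits share no paragraph. The helper name and the assumed "paragraph" field follow the illustrative schema above and are not necessarily the exact procedure used.

```python
import random
from collections import defaultdict

def hold_out_paragraph_level_test(train_examples, n_test, seed=42):
    """Hold out whole paragraphs from the training set as a new test set,
    so that the new train and test splits share no paragraph."""
    by_paragraph = defaultdict(list)
    for ex in train_examples:              # each example is assumed to carry a "paragraph" field
        by_paragraph[ex["paragraph"]].append(ex)

    paragraphs = list(by_paragraph)
    random.Random(seed).shuffle(paragraphs)

    test, kept_paragraphs = [], []
    for p in paragraphs:
        if len(test) < n_test:
            test.extend(by_paragraph[p])   # all questions of this paragraph go to the test set
        else:
            kept_paragraphs.append(p)

    train = [ex for p in kept_paragraphs for ex in by_paragraph[p]]
    return train, test
```

Here, n_test would be set to the number of questions in the corresponding validation set.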
Other Datasets not Included in QG-Bench.
In theory, any extractive QA dataset could be part of our benchmark. However, we decided not to include datasets such as BioASQ (Tsatsaronis et al., 2015) and NewsQA (Trischler et al., 2017) because they have very long input texts, representing another category that needs extra mechanisms to handle long sequences (Izacard and Grave, 2020a,b), which is out of the scope of this paper. In addition, one could leverage multilingual QA benchmarks
(Clark et al., 2020; Artetxe et al., 2020; Lewis et al., 2020b) to obtain multilingual QG datasets, but XQuAD (Artetxe et al., 2020) and MLQA (Lewis et al., 2020b) do not contain training sets, and TydiQA (Clark et al., 2020) contains a very small training set. Instead, we focused on monolingual QA datasets in each language.

Dataset           Data size (train/valid/test)   Avg. character length (para./sent./ques./ans.)
SQuAD             75,722 / 10,570 / 11,877       757 / 179 / 59 / 20
SubjQA
 -Book            637 / 92 / 191                 1,514 / 146 / 28 / 83
 -Elec.           697 / 99 / 238                 1,282 / 129 / 26 / 66
 -Grocery         687 / 101 / 379                896 / 107 / 25 / 49
 -Movie           724 / 101 / 154                1,746 / 146 / 27 / 72
 -Rest.           823 / 129 / 136                1,006 / 104 / 26 / 51
 -Trip            875 / 143 / 397                1,002 / 108 / 27 / 51
SQuADShifts
 -Amazon          3,295 / 1,648 / 4,942          773 / 111 / 43 / 18
 -Wiki            2,646 / 1,323 / 3,969          773 / 184 / 58 / 26
 -News            3,355 / 1,678 / 5,032          781 / 169 / 51 / 20
 -Reddit          3,268 / 1,634 / 4,901          774 / 116 / 45 / 19
Multilingual QG
 -Ja              27,809 / 3,939 / 3,939         424 / 72 / 32 / 6
 -Es              77,025 / 10,570 / 10,570       781 / 122 / 64 / 21
 -De              9,314 / 2,204 / 2,204          1,577 / 165 / 59 / 66
 -Ru              40,291 / 5,036 / 5,036         754 / 174 / 64 / 26
 -Ko              54,556 / 5,766 / 5,766         521 / 81 / 34 / 6
 -It              46,550 / 7,609 / 7,609         807 / 124 / 66 / 16
 -Fr              17,543 / 3,188 / 3,188         797 / 160 / 57 / 23

Table 1: Statistics of all datasets integrated into our question generation benchmark after unification.
3.2 Data Statistics
Table 1 summarizes the statistics of each QG dataset after unification. It can be observed that SubjQA and SQuADShifts have ten to a hundred times less training data than SQuAD. Also, SubjQA's answers are twice as long as SQuAD's, which can be explained by the difference in how their questions are formed (i.e., SubjQA being more subjective in nature). Likewise, except for Spanish, the datasets for languages other than English contain less training data than the original SQuAD, with the amount varying depending on the language.
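For reference, average character lengths of the kind reported in Table 1 can be computed along the following lines; the field names again follow the illustrative schema above rather than the exact QG-Bench data files.

```python
def average_char_lengths(examples):
    """Average character lengths (paragraph / sentence / question / answer),
    as reported in Table 1; field names follow the illustrative schema above."""
    fields = ("paragraph", "sentence", "question", "answer")
    return {
        field: sum(len(ex[field]) for ex in examples) / len(examples)
        for field in fields
    }
```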
4 LMs for Question Generation
In this section, we formalize the QG task from a
language modelling perspective (§ 4.1), including
details on the fine-tuning process (§ 4.2) and the
setup for our experiments with QG-Bench (§ 4.3).
4.1 Task Formulation
Given an input text c, the goal of QG is to generate a natural question q̂ related to the information in