
Paragraph: It’s springtime of the pandemic. After the trauma of the last year, the quarantined are emerging into sunlight, and beginning to navigate travel, classrooms and restaurants. And they are discovering that when it comes to returning to the old ways, many feel out of sorts. Do they shake hands? Hug? With or without a mask?
Question: How are people adapting to life after the pandemic?
Table 2: Examples of our evaluation data containing paragraphs from news articles with human-written questions. More in Table 9 in Appendix A.
Training Data
Most prior work has successfully trained models for question generation using SQuAD (Rajpurkar et al., 2016), TriviaQA (Joshi et al., 2017), or NQ (Kwiatkowski et al., 2019b) datasets, the answers to which are typically short. To account for the open-ended nature of our desired questions, we rely on the ELI5 (Fan et al., 2019, Explain Like I’m Five) dataset. The dataset comprises 270K English-language threads in simple language from the Reddit forum of the same name (https://www.reddit.com/r/explainlikeimfive/), i.e., easily comprehensible to someone with minimal background knowledge.
Compared to existing datasets, ELI5 comprises diverse questions requiring long-form answers. It contains a significant number of open-ended how/why questions. Interestingly, even what questions tend to require paragraph-length explanations (What is the difference...). As seen in Table 8 in Appendix A, each question is open-ended, inquisitive, and requires an answer that is descriptive in nature. Finally, one of the advantages of the ELI5 dataset is that it covers diverse domains such as science, health, and politics. This quality makes ELI5 an ideal candidate to transfer to the news domain, which similarly covers a diverse range of topics.
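For instance, a single ELI5 thread pairs an open-ended question (the post title) with one or more long-form answers. A quick way to inspect this, assuming the original Hugging Face datasets release of ELI5 and its field names (this snippet is our own illustration, not part of the paper):

```python
from datasets import load_dataset

# Load the ELI5 training split (split and field names follow the
# Hugging Face release of the dataset).
eli5 = load_dataset("eli5", split="train_eli5")

example = eli5[0]
print(example["title"])               # open-ended question, e.g. a how/why question
print(example["answers"]["text"][0])  # paragraph-length, top-scored answer
```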
Evaluation Data
Since our goal is to generate open-ended questions from news articles, we specifically design our evaluation data to reflect the same. To achieve this goal, we obtain English-language articles from The New York Times website from January 2020 to June 2020. We obtained written consent from the copyright holder to use this content for research purposes. An additional advantage of crawling data from The New York Times website is that we can divide news articles by domain, as each news article appears in a specific section of the website. From a given URL such as https://www.nytimes.com/2021/12/10/science/astronaut-wings-faa-bezos-musk.html, we can tell that the article belongs to the Science domain. Additionally, as most pre-trained language models were trained prior to the COVID-19 pandemic, we also test how well they generalize to COVID-19-related news topics.
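As an illustration, the section label can be read directly off the URL path. The following sketch is our own (the paper does not describe its crawler) and assumes the common /YYYY/MM/DD/<section>/<slug>.html pattern of New York Times article URLs:

```python
from urllib.parse import urlparse

def nyt_section(url: str) -> str:
    """Extract the section (domain label) from a NYT article URL.

    Assumes the URL path follows /YYYY/MM/DD/<section>/<slug>.html,
    so the fourth path component names the section.
    """
    parts = urlparse(url).path.strip("/").split("/")
    return parts[3] if len(parts) > 3 else "unknown"

url = "https://www.nytimes.com/2021/12/10/science/astronaut-wings-faa-bezos-musk.html"
print(nyt_section(url))  # -> science
```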
Each news article from a particular domain is segmented into several paragraphs. We randomly sample 529 paragraphs spanning six domains: 55 paragraphs from Science, 66 from Climate, 98 from Technology, 110 from Health, 100 from NYRegion, and 100 from Business. While we understand that selecting standalone paragraphs might sometimes ignore the greater context, or suffer from co-reference issues, we carefully replace any such paragraphs with others from our larger pool.
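A minimal sketch of this per-domain sampling step (our own illustration; the paper does not release this script, and the pool structure here is assumed):

```python
import random

# Sampled paragraph counts per NYT section, as reported above (529 total).
QUOTAS = {"science": 55, "climate": 66, "technology": 98,
          "health": 110, "nyregion": 100, "business": 100}

def sample_paragraphs(paragraphs_by_domain, seed=0):
    """paragraphs_by_domain maps a section name to its pool of paragraphs."""
    rng = random.Random(seed)
    return {domain: rng.sample(paragraphs_by_domain[domain], quota)
            for domain, quota in QUOTAS.items()}
```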
As we do not have gold questions associated with each paragraph, we crowd-source human-written questions for each paragraph on Amazon Mechanical Turk. Each paragraph is shown to a distinct crowdworker, who is instructed to read the paragraph carefully and write an open-ended question that is answered by the entire passage. We recruit 96 distinct crowdworkers for this task. After the questions are collected from the first round of crowd-sourcing, two expert news media employees approve or reject them based on quality. Paragraphs with rejected questions are put up again, and through this iterative process and careful quality control we obtain one high-quality open-ended question associated with each paragraph. Tables 2 and 9 show selected paragraphs from our evaluation set with the associated human-written open-ended questions.
4 CONSISTENT Model
The backbone of our approach is a BART-large (Lewis et al., 2020) model fine-tuned on the ELI5 dataset of question-answer pairs. However, there are two major factors to consider in our end-to-end question generation pipeline: the generated questions (i) must be relevant to and factually consistent with the input paragraph, and (ii) must have their answer self-contained in the input paragraph. Our CONSISTENT model (Figures 2 and 3) addresses these issues as described below.
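For concreteness, the backbone fine-tuning step can be sketched with the Hugging Face transformers library as below. This is our own minimal sketch, not the paper's released code: the direction (ELI5 answer in, question out) follows the requirement that the answer be contained in the input, while the field names assume the Hugging Face ELI5 release and the hyperparameters are illustrative placeholders.

```python
from datasets import load_dataset
from transformers import (BartForConditionalGeneration, BartTokenizerFast,
                          DataCollatorForSeq2Seq, Trainer, TrainingArguments)

tokenizer = BartTokenizerFast.from_pretrained("facebook/bart-large")
model = BartForConditionalGeneration.from_pretrained("facebook/bart-large")

# Train on (answer -> question) pairs: the long-form answer plays the role
# of the input paragraph, and the thread title is the open-ended question.
eli5 = load_dataset("eli5", split="train_eli5")

def to_features(example):
    inputs = tokenizer(example["answers"]["text"][0],
                       max_length=1024, truncation=True)
    labels = tokenizer(text_target=example["title"],
                       max_length=64, truncation=True)
    inputs["labels"] = labels["input_ids"]
    return inputs

train_set = eli5.map(to_features, remove_columns=eli5.column_names)

trainer = Trainer(
    model=model,
    args=TrainingArguments(output_dir="bart-large-eli5-qgen",
                           per_device_train_batch_size=4,
                           num_train_epochs=3),
    data_collator=DataCollatorForSeq2Seq(tokenizer, model=model),
    train_dataset=train_set,
)
trainer.train()
```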
Factual Consistency
To ensure faithfulness to
the input paragraph, we need to design our model