Improving Question Answering with Generation of NQ-like Questions
Saptarashmi Bandyopadhyay
University of Maryland, College Park
saptab1@umd.edu
Shraman Pal
IIT Kharagpur
shramanpal@gmail.com
Hao Zou
University of Minnesota
zou00080@umn.edu
Abhranil Chandra
IIT Kharagpur
abhranil.iitkgp@gmail.com
Jordan Boyd-Graber
University of Maryland, College Park
jbg@umiacs.umd.edu
Abstract
Question Answering (QA) systems require a large amount of annotated data, which is costly and time-consuming to gather. Converting the datasets of existing QA benchmarks is challenging because they differ in format and complexity. To address these issues, we propose an algorithm that automatically generates shorter questions resembling day-to-day human communication, in the style of the Natural Questions (NQ) dataset, from longer trivia questions in the Quizbowl (QB) dataset by leveraging the difference in style between the two datasets. This provides an automated way to generate more data for our QA systems. To ensure quality as well as quantity of data, we detect and remove ill-formed questions using a neural classifier. We demonstrate that in a low-resource setting, using the generated data improves QA performance over the baseline system on both NQ and QB data. Our algorithm improves the scalability of training data while maintaining its quality for QA systems.
1 Introduction
Large-scale data collection is a challenging process in the domains of Question Answering and Information Retrieval because of the need for high-quality annotations, which are scarce and expensive to produce. There are several QA datasets (Joshi et al., 2017; Rajpurkar et al., 2016; Yang et al., 2018; Kwiatkowski et al., 2019; Rodriguez et al., 2021) with significantly different structure and complexity. Large quantities of high-quality data help train more effective machine learning systems.
In this paper, we focus on converting questions normally spoken in a trivia competition into questions resembling day-to-day human communication. Trivia questions span multiple lines, consist of multiple hints given as standalone sentences pointing to an answer, and players can buzz in on any sentence of the question to give an answer. In contrast, questions used in daily human communication are shorter (often a single line). We propose an algorithm that generates multiple short natural questions from every long trivia question by converting each sentence with multiple hints into several shorter questions. We also add a BERT-based (Devlin et al., 2019) quality-control method to filter out ill-formed questions and retain the well-formed ones. We show that our question generation algorithm improves the performance of two question answering (QA) systems in a low-resource setting. We also demonstrate that concatenating the original natural questions with the generated questions improves QA performance. Finally, we show that by using such a method to generate synthetic data, we can achieve higher scores than a system that uses only NQ data.
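To make the overall approach concrete, the following is a minimal sketch of the generate-then-filter loop; the generator and classifier interfaces are hypothetical stand-ins for the question-generation model and the BERT quality classifier, not our exact implementation.

```python
# A minimal sketch of the generate-then-filter pipeline. The `generator` and
# `quality_clf` interfaces are hypothetical stand-ins, not the paper's exact
# implementation.
from typing import List

from nltk.tokenize import sent_tokenize  # requires the NLTK "punkt" tokenizer data

def nq_like_questions(qb_question: str, generator, quality_clf,
                      keep_threshold: float = 0.5) -> List[str]:
    """Turn one multi-sentence QB question into several short NQ-like questions."""
    candidates = []
    # Each sentence of a QB question is a standalone hint toward the answer.
    for hint in sent_tokenize(qb_question):
        # Rewrite each hint as one or more short, natural-sounding questions.
        candidates.extend(generator.generate(hint))
    # Keep only the candidates the classifier judges to be well formed.
    return [q for q in candidates if quality_clf.well_formed_prob(q) >= keep_threshold]
```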
2 Dataset and Data Extraction
We use two popular datasets: the Quizbowl dataset (Rodriguez et al., 2021), henceforth referred to as the QB dataset, and the NQ-Open dataset (Lee et al., 2019), derived from Natural Questions (Kwiatkowski et al., 2019). QB has a total of 119,247 question/answer samples and NQ has 91,434 question/answer samples. For the NQ dataset, we use the same 1,800 dev and 1,769 test question/answer splits as used in the EfficientQA competition (Min et al., 2021).
As our task involves transforming QB questions into NQ-like questions, we extract pairs of questions that are semantically similar. We first extract every possible question-question pair with the same answer using string matching, resulting in 95,651 question-question pairs. From this parallel corpus, we take the last sentence of each QB question and pass it through a pre-trained Sentence-BERT (Reimers and Gurevych, 2019) model along with the corresponding NQ question. We compute the cosine similarity between the [CLS] embeddings to find pairs that are semantically equivalent, setting the threshold to 0.5. From this we extract
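As a concrete illustration of this filtering step, the sketch below uses the sentence-transformers library; the model name is an assumption for illustration, and it uses pooled sentence embeddings as a stand-in for the [CLS] embeddings described above.

```python
# Illustrative sketch of semantic pair filtering with sentence-transformers.
# The model choice and pooled (mean) embeddings are assumptions; the paper
# compares [CLS] embeddings.
from nltk.tokenize import sent_tokenize
from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer("bert-base-nli-mean-tokens")

def filter_pairs(qb_nq_pairs, threshold: float = 0.5):
    """Keep (QB last sentence, NQ question) pairs whose embeddings are similar."""
    kept = []
    for qb_question, nq_question in qb_nq_pairs:
        qb_last = sent_tokenize(qb_question)[-1]  # last sentence of the QB question
        emb_qb, emb_nq = model.encode([qb_last, nq_question], convert_to_tensor=True)
        # Cosine similarity >= 0.5 counts the pair as semantically equivalent.
        if util.cos_sim(emb_qb, emb_nq).item() >= threshold:
            kept.append((qb_last, nq_question))
    return kept
```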