
ity of annotation. Existing datasets in research works either offer limited coverage of all the domains where GenQA can be applied (Bajaj et al., 2018), or are too small to be used as supervised training data (Muller et al., 2021). Generally, collecting a human-authored answer to a question given a context is significantly more expensive than annotating the correctness of an extracted web sentence as an answer to the same question. Consequently, a large number of annotated datasets (Wang et al., 2007; Yang et al., 2015; Garg et al., 2020) are available for the latter type, aimed at training answer sentence selection (AS2) systems.
In this work, we propose a training paradigm for transferring the knowledge learned by a discriminative AS2 ranking model to train an answer generation (GenQA) system. Towards this, we learn a GenQA model using weak supervision provided by a trained AS2 model on an unlabeled dataset comprising questions and answer candidates. Specifically, for each question, the AS2 model is used to rank a set of answer candidates without any correctness/incorrectness labels for answering the question. The top-ranked answer is used as the generation target for the GenQA model, while the question along with the next k top-ranked answers is used as the input for the GenQA model.
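This weakly supervised data construction can be sketched as follows; a minimal illustration, assuming AS2 scores are already computed per candidate (function and variable names are ours, not from the released code):

```python
def build_weak_example(question, candidates, scores, k=3):
    """Build one weakly supervised GenQA training pair from AS2 scores.

    `candidates` are unlabeled answer sentences for `question`; `scores`
    are the AS2 model's confidence for each candidate. No correctness
    labels are used: the top-ranked candidate becomes the generation
    target, and the next k candidates form the input context.
    """
    ranked = sorted(zip(candidates, scores), key=lambda cs: cs[1], reverse=True)
    target = ranked[0][0]                      # top-ranked answer -> generation target
    context = [c for c, _ in ranked[1:k + 1]]  # next-k answers -> input context
    source = " ".join([question] + context)
    return source, target
```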
We supplement the ranking order of answer candidates with the prediction confidence scores provided by the AS2 model for each answer candidate. This is done by modifying our knowledge transfer strategy in two ways. First, we weight the loss of each training instance (question + context, comprised of k answer candidates) by the AS2 model score of the top-ranked answer, which serves as the GenQA target. This allows the GenQA model to selectively learn more from 'good' quality target answers in the weakly supervised training data (AS2 models are calibrated to produce higher confidence scores for correct answers). However, this loss weighting only considers the score of the output target, and does not exploit the scores of the input candidates. To overcome this limitation, we discretize the AS2 scores into l confidence buckets, add these bucket labels to the GenQA vocabulary, and finally prepend the corresponding label to each answer candidate in the input and/or the output. This confidence bucket label provides the GenQA model with an additional signal about the answer quality of each candidate as assigned by the AS2 model.
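The two techniques can be sketched as a small extension of the data construction above; a hypothetical illustration assuming scores in [0, 1] and a `<conf_i>` token format (the token naming and helper functions are our assumptions, not the paper's exact implementation):

```python
def bucket_token(score, num_buckets=5):
    """Map an AS2 confidence in [0, 1] to one of l discrete bucket labels.

    These tokens would be added to the GenQA model's vocabulary.
    """
    i = min(int(score * num_buckets), num_buckets - 1)
    return f"<conf_{i}>"

def build_weighted_example(question, candidates, scores, k=3, num_buckets=5):
    """Build (source, target, loss_weight) for one weakly supervised instance.

    The instance's loss weight is the AS2 score of the top-ranked answer,
    and every input candidate is prefixed with its confidence bucket token.
    """
    ranked = sorted(zip(candidates, scores), key=lambda cs: cs[1], reverse=True)
    top_answer, top_score = ranked[0]
    loss_weight = top_score  # scale this instance's loss by the target's AS2 score
    context = [f"{bucket_token(s, num_buckets)} {c}" for c, s in ranked[1:k + 1]]
    source = " ".join([question] + context)
    target = f"{bucket_token(top_score, num_buckets)} {top_answer}"
    return source, target, loss_weight
```

During training, `loss_weight` would multiply the per-instance cross-entropy loss, so that instances whose targets the AS2 model is confident about contribute more to the gradient.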
We show that both these techniques improve QA accuracy, and can be combined to provide additional improvements.
We empirically evaluate¹ our proposed knowledge transfer technique from AS2 to GenQA on three popular public datasets: MS-MARCO NLG (Bajaj et al., 2018), WikiQA (Yang et al., 2015), and TREC-QA (Wang et al., 2007); and one large-scale industrial QA dataset. Our results show that the GenQA model trained using our paradigm of weak supervision from an AS2 model can surprisingly outperform both the AS2 model used for knowledge transfer (the teacher) and a GenQA model trained on fully supervised data. On small datasets such as WikiQA and TREC-QA, we show that AS2 models trained even on small amounts of labeled data can be effectively used to weakly supervise a GenQA model, which can then outperform its teacher in QA accuracy. Additionally, on MS-MARCO NLG, where fully supervised GenQA training data is available, we show that an initial round of training with our weakly supervised methods yields additional performance improvements over standard supervised GenQA training. Qualitatively, the answers generated by our model are often more directly related to the question being asked, and are stylistically more natural-sounding and suitable as responses than answers from AS2 models, despite being trained only on sentences extracted from the web.
2 Related Work
Our work builds upon recent research in AS2, answer generation for QA, and transfer learning.

Answer Sentence Selection
Early approaches for AS2 use CNNs (Severyn and Moschitti, 2015) or alignment networks (Shen et al., 2017; Tran et al., 2018; Tay et al., 2018) to learn and score question and answer representations. Compare-and-aggregate architectures have also been extensively studied for AS2 (Wang and Jiang, 2017; Bian et al., 2017; Yoon et al., 2019). Tayyar Madabushi et al. (2018) exploited fine-grained question classification to further improve answer selection. Garg et al. (2020) achieved state-of-the-art results by first fine-tuning transformer-based models on a large-scale QA dataset, and then adapting them to smaller AS2 datasets. Matsubara et al.
¹We will release code and all trained model checkpoints at https://github.com/amazon-research/wqa-genqa-knowledge-transfer