On Task-Adaptive Pretraining for Dialogue Response Selection
Tzu-Hsiang Lin1, Ta-Chung Chi1, and Anna Rumshisky2
1Language Technologies Institute, Carnegie Mellon University
2Department of Computer Science, University of Massachusetts Lowell
tzuhsial@alumni.cmu.edu, tachungc@cmu.edu, arum@cs.uml.edu
Abstract
Recent advancements in dialogue response selection (DRS) are based on the task-adaptive pre-training (TAP) approach: a model is first initialized with BERT (Devlin et al., 2019) and then adapted to dialogue data with dialogue-specific or fine-grained pre-training tasks. However, it is uncertain whether BERT is the best initialization choice, or whether the proposed dialogue-specific fine-grained learning tasks are actually better than MLM+NSP. This paper aims to verify the assumptions made in previous works and to understand the source of improvements for DRS. We show that initializing with RoBERTa achieves performance similar to BERT, and that MLM+NSP can outperform all previously proposed TAP tasks, in the process also contributing a new state of the art on the Ubuntu corpus. Additional analyses show that the main source of improvement comes from the TAP step, and that the NSP task is crucial to DRS, unlike for common NLU tasks.
1 Introduction
Recent advances in dialogue response selection (DRS) (Wang et al., 2013; Al-Rfou et al., 2016) have mostly adopted the Task-adaptive Pre-training (TAP) approach (Gururangan et al., 2020), which can be divided into three steps. Step one: initialize the model from a pre-trained language model checkpoint. Step two: perform TAP on the DRS training data with data augmentation. Step three: fine-tune the TAP model on the DRS dataset.
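For concreteness, the three-step pipeline can be sketched with HuggingFace Transformers as follows; the checkpoint names and the run_tap / fine_tune training-loop helpers are illustrative placeholders rather than the exact experimental setup.

```python
# A schematic sketch of the three-step TAP pipeline with HuggingFace Transformers.
from transformers import (BertForPreTraining, BertForSequenceClassification,
                          BertTokenizerFast)

tokenizer = BertTokenizerFast.from_pretrained("bert-base-uncased")

# Step one: initialize from a pre-trained checkpoint (BERT here; RoBERTa is the
# alternative initialization studied in this paper).
model = BertForPreTraining.from_pretrained("bert-base-uncased")  # MLM + NSP heads

# Step two: task-adaptive pre-training (TAP) on the DRS training data, e.g. with
# MLM+NSP over augmented (context, response) pairs.
# run_tap(model, drs_tap_corpus, tokenizer)  # hypothetical MLM/NSP training loop
model.save_pretrained("tap-checkpoint")

# Step three: fine-tune the TAP checkpoint on the DRS task itself, i.e. scoring
# (context, candidate response) pairs with a binary relevance classifier.
drs_model = BertForSequenceClassification.from_pretrained("tap-checkpoint",
                                                          num_labels=2)
# fine_tune(drs_model, drs_train_pairs, tokenizer)  # hypothetical fine-tuning loop
```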
For step one, surprisingly, all previous works have exclusively chosen the original BERT (Devlin et al., 2019). We hypothesize this is because the Next Sentence Prediction (NSP) task in BERT has the same formulation as DRS, leading previous works to assume that BERT contains more DRS-related knowledge. Nonetheless, it is well known in the NLP literature that BERT is under-trained and that removing the NSP task during pre-training improves downstream task performance (Liu et al., 2019).
For step two, while earlier work used MLM+NSP for TAP (Whang et al., 2020; Gu et al., 2020), more recent works have assumed that MLM+NSP is too simple or does not directly model dialogue patterns, and have thus proposed various dialogue-specific learning tasks (Whang et al., 2021; Xu et al., 2021) or fine-grained pre-training objectives to help DRS models better learn dialogue patterns and granular representations (Su et al., 2021). However, Han et al. (2021a) recently used MLM with a simple variant of NSP for TAP and outperformed almost all of these methods, raising the question of whether dialogue-specific fine-grained learning tasks are actually better.
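As a point of reference, MLM+NSP TAP on dialogue data only requires casting each (context, response) pair into BERT's original NSP input format. Below is a minimal sketch assuming a HuggingFace tokenizer; the function and its response-sampling scheme are illustrative rather than the exact construction used by any specific prior work.

```python
# Build an NSP-style TAP example from a dialogue: the context turns form segment A
# and a candidate response forms segment B, with a 50/50 mix of the true next
# response and a randomly sampled one.
import random
from transformers import BertTokenizerFast

tokenizer = BertTokenizerFast.from_pretrained("bert-base-uncased")

def make_nsp_example(context_turns, true_response, response_pool):
    if random.random() < 0.5:
        response, nsp_label = true_response, 0                 # 0 = "B follows A" in BERT's NSP head
    else:
        response, nsp_label = random.choice(response_pool), 1  # 1 = random response
    encoded = tokenizer(
        " [SEP] ".join(context_turns),  # simple turn concatenation; systems differ here
        response,
        truncation=True,
        max_length=512,
    )
    encoded["next_sentence_label"] = nsp_label
    return encoded  # MLM masking can be applied later, e.g. by DataCollatorForLanguageModeling
```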
This paper aims to verify the assumptions made in previous works and to understand the source of improvements from the TAP approach. First, we include RoBERTa (Liu et al., 2019) as an initialization checkpoint in addition to BERT and use MLM+NSP for TAP. Experiments on the Ubuntu, Douban, and E-commerce benchmarks show that (1) BERT and RoBERTa perform similarly, and (2) MLM+NSP can outperform all previously proposed dialogue-specific and fine-grained pre-training tasks. Then, we conduct analyses and show that (3) the main source of improvement for DRS comes from the training time of TAP in step two, which can even compensate for the lack of a good initialization checkpoint, (4) the NSP task is crucial to DRS, as opposed to common NLU tasks that can work with MLM alone, and (5) a low train/test N-gram overlap percentage and a low number of distinct N-grams explain why TAP does not improve Douban and why overfitting occurs for E-commerce.
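For reference, the train/test N-gram overlap statistic referred to in (5) can be computed as in the following minimal sketch, which measures the fraction of distinct test-set N-grams that also occur in the training set; this is an illustrative implementation rather than the exact recipe.

```python
# Train/test N-gram overlap: share of distinct test-set N-grams seen in training.
def ngrams(tokens, n):
    return zip(*(tokens[i:] for i in range(n)))

def distinct_ngrams(texts, n):
    return {g for text in texts for g in ngrams(text.split(), n)}

def ngram_overlap(train_texts, test_texts, n=3):
    train_set = distinct_ngrams(train_texts, n)
    test_set = distinct_ngrams(test_texts, n)
    return len(test_set & train_set) / max(len(test_set), 1)

# A low overlap percentage, together with few distinct N-grams in the corpus,
# indicates little shared surface-level signal for TAP to exploit.
```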
In short, we make the following contributions: (1) Contrary to previous belief, we show that BERT may not be the best initialization checkpoint and that MLM+NSP can outperform all previously proposed dialogue-specific fine-grained TAP tasks. (2) We present a set of analyses that iden-