On Task-Adaptive Pretraining for Dialogue Response Selection
Tzu-Hsiang Lin1, Ta-Chung Chi1, and Anna Rumshisky2
1Language Technologies Institute, Carnegie Mellon University
2Department of Computer Science, University of Massachusetts Lowell
tzuhsial@alumni.cmu.edu, tachungc@cmu.edu, arum@cs.uml.edu
Abstract
Recent advancements in dialogue response selection (DRS) are based on the task-adaptive pre-training (TAP) approach: models are first initialized with BERT (Devlin et al., 2019) and then adapted to dialogue data with dialogue-specific or fine-grained pre-training tasks. However, it is uncertain whether BERT is the best initialization choice, or whether the proposed dialogue-specific fine-grained learning tasks are actually better than MLM+NSP. This paper aims to verify assumptions made in previous works and to understand the source of improvements for DRS. We show that initializing with RoBERTa achieves performance similar to BERT, and that MLM+NSP can outperform all previously proposed TAP tasks, in the process contributing a new state-of-the-art on the Ubuntu corpus. Additional analyses show that the main source of improvement is the TAP step, and that the NSP task is crucial to DRS, unlike common NLU tasks.
1 Introduction
Recent advances in dialogue response selection (DRS) (Wang et al., 2013; Al-Rfou et al., 2016) have mostly adopted the Task-adaptive Pre-training (TAP) approach (Gururangan et al., 2020), which can be divided into three steps. Step one: initialize the model from a pre-trained language model checkpoint. Step two: perform TAP on the DRS training data with data augmentation. Step three: fine-tune the TAP model on the DRS dataset.
For step one, surprisingly, all previous works have exclusively chosen the original BERT (Devlin et al., 2019). We hypothesize this is because the Next Sentence Prediction (NSP) task in BERT has the same formulation as DRS, and previous works have thus assumed that BERT contains more knowledge related to DRS. Nonetheless, it is well known in the NLP literature that BERT is under-trained and that removing the NSP task during pre-training improves downstream task performance (Liu et al., 2019).
For step two, while earlier work uses MLM+NSP for TAP (Whang et al., 2020; Gu et al., 2020), more recent works have assumed that MLM+NSP is too simple or does not directly model dialogue patterns, and have thus proposed various dialogue-specific learning tasks (Whang et al., 2021; Xu et al., 2021) or fine-grained pre-training objectives to help DRS models better learn dialogue patterns and granular representations (Su et al., 2021). However, Han et al. (2021a) recently used MLM with a simple variant of NSP for TAP and outperformed almost all of them, raising questions about whether these dialogue-specific fine-grained learning tasks are actually better.
This paper aims to verify the assumptions made in previous works and to understand the source of improvements from the TAP approach. First, we include RoBERTa (Liu et al., 2019) as an additional initialization checkpoint alongside BERT and use MLM+NSP for TAP. Experiments on the Ubuntu, Douban, and E-commerce benchmarks show that (1) BERT and RoBERTa perform similarly, and (2) MLM+NSP can outperform all previously proposed dialogue-specific and fine-grained pre-training tasks. We then conduct analyses showing that (3) the main source of improvement for DRS is the training time spent on TAP in step two, which can even mitigate the lack of a good initialization checkpoint, (4) the NSP task is crucial to DRS, as opposed to common NLU tasks that work with MLM alone, and (5) the low n-gram train/test overlap percentage and the low number of distinct n-grams explain why TAP does not improve Douban and why overfitting occurs on E-commerce.
In short, we make the following contributions:
(1) Contrary to previous beliefs, we show that BERT may not be the best initialization checkpoint, and that MLM+NSP can outperform all previously proposed dialogue-specific fine-grained TAP tasks. (2) We present a set of analyses that identify the source of improvements from TAP and characterize the DRS benchmarks. (3) We contribute a new state-of-the-art on the Ubuntu corpus.

Dataset      |         Ubuntu               |         Douban                 |       E-commerce
             | Pre-train Train Valid Test   | Pre-train Train Valid Test     | Pre-train Train Valid Test
# pairs      | 5.1m      1m    500k  500k   | 3.3m      1m    50k   50k      | 2.8m      1m    50k   50k
pos:neg      | 1:1       1:1   1:9   1:9    | 1:1       1:1   1:1   1.2:8.8  | 1:1       1:1   1:1   1:9
# avg turns  | 7.48      10.13 10.11 10.11  | 5.36      6.69  6.75  6.48     | 6.31      5.51  5.48  5.64

Table 1: Dataset statistics. The pre-train set is generated from the train set using the method in Section 2.3.1.
2 Task-adaptive Pre-training with BERT
2.1 Task Formulation
Given a multi-turn dialogue $c = \{u_1, u_2, \dots, u_T\}$, where $u_t$ denotes the $t$-th turn utterance, let $r_i$ denote a response candidate and $y_i \in \{0, 1\}$ a label, with $y_i = 1$ indicating that $r_i$ is a proper response for $c$ (otherwise $y_i = 0$). Dialogue Response Selection (DRS) aims to learn a model $f(c, r_i)$ to predict $y_i$. Under the cross-encoder binary classification formulation, DRS shares the exact form of the Next Sentence Prediction (NSP) task used in BERT (Devlin et al., 2019).
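For concreteness, one common instantiation of the cross-encoder $f$ is sketched below; the [CLS] pooling, the end-of-turn separators, and the exact NSP-style classification head are assumptions in the spirit of Whang et al. (2020), not details specified in this section:

\[
\mathbf{h} = \mathrm{BERT}\big([\mathrm{CLS}]\; u_1\,[\mathrm{EOT}]\, u_2\,[\mathrm{EOT}] \cdots u_T\; [\mathrm{SEP}]\; r_i\; [\mathrm{SEP}]\big)_{[\mathrm{CLS}]},
\qquad
f(c, r_i) = P(y_i = 1 \mid c, r_i) = \mathrm{softmax}(W\mathbf{h} + b)_1,
\]

where $W$ and $b$ are the learned parameters of the binary classification head.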
2.2 Step 1: Initialize from Checkpoint
All previous works have exclusively used BERT (Devlin et al., 2019) as their initialization checkpoint and do not consider other open-source pre-trained language models. We hypothesize this is because BERT's NSP task and DRS share the same task formulation, so BERT may learn representations that are more helpful to DRS. However, RoBERTa (Liu et al., 2019) has shown that BERT is under-trained and that removing the NSP task improves downstream task performance, raising questions about whether BERT is the best choice for DRS.
To verify this assumption, we use both BERT and RoBERTa in our experiments. To the best of our knowledge, we are the first to include RoBERTa for DRS and to show that both achieve similar performance.
2.3 Step 2: Task-adaptive Pre-training
2.3.1 Data Augmentation
All previous works have performed data augmentation to generate more data for task-adaptive pre-training. While several works have devised fine-grained data augmentation methods such as utterance insertion/deletion (Whang et al., 2021) or next-session prediction (Xu et al., 2021), we use a standard data augmentation methodology commonly used in the dialogue literature (Mehri et al., 2019; Gunasekara et al., 2019).
Given a context-response pair instance $(c = \{u_1, \dots, u_T\}, r)$ in the original training set, we generate $T-1$ additional context-response pairs $\{(c_1, r_1), \dots, (c_{T-1}, r_{T-1})\}$, where $c_t = \{u_1, \dots, u_t\}$ and $r_t = u_{t+1}$ for $t \in \{1, \dots, T-1\}$, giving a total of $T$ pre-training instances.
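As an illustration, a minimal Python sketch of this augmentation scheme follows; the function name and the representation of a dialogue as a list of utterance strings are our own choices, and the negative responses needed for the 1:1 pos:neg pre-training ratio in Table 1 would be sampled separately.

from typing import List, Tuple

def augment_dialogue(utterances: List[str]) -> List[Tuple[List[str], str]]:
    """Expand one dialogue of T utterances into T-1 (context, response) pairs.

    For t in {1, ..., T-1}, the context c_t is the first t utterances and the
    response r_t is utterance t+1; together with the original pair this yields
    T pre-training instances per dialogue.
    """
    pairs = []
    for t in range(1, len(utterances)):
        context = utterances[:t]   # u_1, ..., u_t
        response = utterances[t]   # u_{t+1} in 1-indexed notation
        pairs.append((context, response))
    return pairs

# Example: a 4-utterance dialogue yields 3 additional pairs.
dialogue = ["my install hangs", "which release?", "20.04", "try the server image"]
for ctx, resp in augment_dialogue(dialogue):
    print(ctx, "->", resp)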
2.3.2 Pre-training Task
While BERT's MLM+NSP objective is a natural default choice for TAP (Whang et al., 2020), most recent works have hypothesized that MLM+NSP is not capable of modeling dialogue patterns and have designed dialogue-specific tasks such as incoherence detection (Xu et al., 2021), order shuffling (Whang et al., 2021), and fine-grained matching (Li et al., 2021), among others. However, Han et al. (2021a) achieved a new cross-encoder state-of-the-art with a simple variant of MLM+NSP, raising questions about whether these dialogue-specific fine-grained tasks actually learn better.
In our experiments, we follow the input representation of Whang et al. (2020), use MLM+NSP for TAP, and achieve new state-of-the-art results on Ubuntu.
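To make the MLM+NSP objective concrete, the sketch below runs one TAP training step with Hugging Face's BertForPreTraining. It illustrates the objective only, with simplified masking (plain 15% [MASK] replacement rather than BERT's 80/10/10 scheme); it is not the authors' exact implementation, and the example strings are invented.

import torch
from transformers import BertForPreTraining, BertTokenizerFast

tokenizer = BertTokenizerFast.from_pretrained("bert-base-uncased")
model = BertForPreTraining.from_pretrained("bert-base-uncased")

# One (context, response) pre-training pair; next_sentence_label 0 means the
# response follows the context, 1 would mark a sampled negative.
context = "how do i upgrade to 20.04 ? run do-release-upgrade"
response = "thanks , that worked"
enc = tokenizer(context, response, return_tensors="pt", truncation=True)

# Simplified MLM masking: mask ~15% of non-special tokens; positions that are
# not masked get label -100 so the MLM loss ignores them.
labels = enc["input_ids"].clone()
special = torch.tensor(tokenizer.get_special_tokens_mask(
    enc["input_ids"][0].tolist(), already_has_special_tokens=True)).bool()
mask = (torch.rand(labels.shape) < 0.15) & ~special.unsqueeze(0)
if not mask.any():      # guard against a draw that masks nothing
    mask[0, 1] = True   # position 1 is the first (non-special) context token
labels[~mask] = -100
enc["input_ids"][mask] = tokenizer.mask_token_id

out = model(**enc, labels=labels, next_sentence_label=torch.tensor([0]))
out.loss.backward()  # out.loss sums the MLM and NSP losses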
2.4 Step 3: Finetuning
Last, the TAP models are fine-tuned on the original datasets on the DRS/NSP task to ensure a fair comparison. Given this shared task formulation, our MLM+NSP objective in step 2 can be viewed as multi-task learning with MLM as an auxiliary task and NSP as the primary task (Ruder, 2017).
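Written out, and assuming an equally weighted sum as in BERT's original pre-training objective (the paper does not state the weights), the two steps optimize

\[
\mathcal{L}_{\mathrm{TAP}} = \mathcal{L}_{\mathrm{MLM}} + \mathcal{L}_{\mathrm{NSP}},
\qquad
\mathcal{L}_{\mathrm{finetune}} = \mathcal{L}_{\mathrm{NSP}} = -\sum_i \Big[ y_i \log f(c, r_i) + (1 - y_i) \log\big(1 - f(c, r_i)\big) \Big],
\]

with $f(c, r_i)$ interpreted as the predicted probability that $r_i$ is a proper response for $c$.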
3 Experimental Setup
3.1 Implementation Details
We used the open-source PyTorch-Lightning framework (Falcon et al., 2019; Paszke et al., 2019; Wolf et al., 2020) to implement our models. We use the BERT-Base model architecture with the Adam optimizer (Kingma and Ba, 2015) and performed grid search over learning rates of {1e-5, 5e-5, 1e-4} for both TAP and fine-tuning. For TAP, we trained for 50 epochs and