On Task-Adaptive Pretraining for Dialogue Response Selection
Tzu-Hsiang Lin1, Ta-Chung Chi1, and Anna Rumshisky2
1Language Technologies Institute, Carnegie Mellon University
2Department of Computer Science, University of Massachusetts Lowell
tzuhsial@alumni.cmu.edu, tachungc@cmu.edu, arum@cs.uml.edu
Abstract
Recent advancements in dialogue response selection (DRS) are based on the task-adaptive pre-training (TAP) approach: models are first initialized with BERT (Devlin et al., 2019) and then adapted to dialogue data with dialogue-specific or fine-grained pre-training tasks. However, it is uncertain whether BERT is the best initialization choice, or whether the proposed dialogue-specific fine-grained learning tasks are actually better than MLM+NSP. This paper aims to verify assumptions made in previous works and to understand the source of improvements for DRS. We show that initializing with RoBERTa achieves performance similar to BERT, and that MLM+NSP can outperform all previously proposed TAP tasks, in the process contributing a new state-of-the-art on the Ubuntu corpus. Additional analyses show that the main source of improvement is the TAP step, and that the NSP task is crucial to DRS, unlike common NLU tasks.
1 Introduction
Recent advances in dialogue response selection (DRS) (Wang et al., 2013; Al-Rfou et al., 2016) have mostly adopted the Task-adaptive Pre-training (TAP) approach (Gururangan et al., 2020), which can be divided into three steps. Step one: initialize the model from a pre-trained language model checkpoint. Step two: perform TAP on the DRS training data with data augmentation. Step three: fine-tune the TAP model on the DRS dataset.
For step one, surprisingly, all previous works have exclusively chosen the original BERT (Devlin et al., 2019). We hypothesize this is because the Next Sentence Prediction (NSP) task in BERT has the same formulation as DRS, and previous works have thus assumed that BERT contains more knowledge related to DRS. Nonetheless, it is well known in the NLP literature that BERT is under-trained and that removing the NSP task during pre-training improves downstream task performance (Liu et al., 2019).
For step two, while earlier work uses MLM+NSP for TAP (Whang et al., 2020; Gu et al., 2020), more recent works have assumed that MLM+NSP is too simple or does not directly model dialogue patterns, and have thus proposed various dialogue-specific learning tasks (Whang et al., 2021; Xu et al., 2021) or fine-grained pre-training objectives to help DRS models better learn dialogue patterns and granular representations (Su et al., 2021). However, Han et al. (2021a) recently used MLM with a simple variant of NSP for TAP and outperformed almost all of them, raising questions about whether these dialogue-specific fine-grained learning tasks are actually better.
This paper aims to verify the assumptions made in previous works and to understand the source of improvements from the TAP approach. First, we include RoBERTa (Liu et al., 2019) as an additional initialization checkpoint alongside BERT and use MLM+NSP for TAP. Experiments on the Ubuntu, Douban, and E-commerce benchmarks show that (1) BERT and RoBERTa perform similarly, and (2) MLM+NSP can outperform all previously proposed dialogue-specific and fine-grained pre-training tasks. We then conduct analyses showing that (3) the main source of improvement for DRS is the training time spent on TAP in step two, which can even mitigate the lack of a good initialization checkpoint, (4) the NSP task is crucial to DRS, as opposed to common NLU tasks that work with MLM alone, and (5) the low n-gram train/test overlap percentage and the low number of distinct n-grams explain why TAP does not improve Douban and why overfitting occurs on E-commerce.
In short, we make the following contributions:
(1) Contrary to previous beliefs, we show that BERT may not be the best initialization checkpoint, and that MLM+NSP can outperform all previously proposed dialogue-specific fine-grained TAP tasks. (2) We present a set of analyses that identify the source of improvements from TAP and characterize the DRS benchmarks. (3) We contribute a new state-of-the-art on the Ubuntu corpus.

Dataset      |         Ubuntu               |         Douban                 |       E-commerce
             | Pre-train Train Valid Test   | Pre-train Train Valid Test     | Pre-train Train Valid Test
# pairs      | 5.1m      1m    500k  500k   | 3.3m      1m    50k   50k      | 2.8m      1m    50k   50k
pos:neg      | 1:1       1:1   1:9   1:9    | 1:1       1:1   1:1   1.2:8.8  | 1:1       1:1   1:1   1:9
# avg turns  | 7.48      10.13 10.11 10.11  | 5.36      6.69  6.75  6.48     | 6.31      5.51  5.48  5.64

Table 1: Dataset statistics. The pre-train set is generated from the train set using the method in Section 2.3.1.
2 Task-adaptive Pre-training with BERT
2.1 Task Formulation
Given a multi-turn dialogue $c = \{u_1, u_2, \dots, u_T\}$, where $u_t$ denotes the $t$-th turn utterance, let $r_i$ denote a response candidate and $y_i \in \{0, 1\}$ a label, with $y_i = 1$ indicating that $r_i$ is a proper response for $c$ (otherwise $y_i = 0$). Dialogue Response Selection (DRS) aims to learn a model $f(c, r_i)$ to predict $y_i$. Under the cross-encoder binary classification formulation, DRS shares the exact form of the Next Sentence Prediction (NSP) task used in BERT (Devlin et al., 2019).
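For concreteness, one common instantiation of the cross-encoder $f$ is sketched below; the [CLS] pooling, the end-of-turn separators, and the exact NSP-style classification head are assumptions in the spirit of Whang et al. (2020), not details specified in this section:

\[
\mathbf{h} = \mathrm{BERT}\big([\mathrm{CLS}]\; u_1\,[\mathrm{EOT}]\, u_2\,[\mathrm{EOT}] \cdots u_T\; [\mathrm{SEP}]\; r_i\; [\mathrm{SEP}]\big)_{[\mathrm{CLS}]},
\qquad
f(c, r_i) = P(y_i = 1 \mid c, r_i) = \mathrm{softmax}(W\mathbf{h} + b)_1,
\]

where $W$ and $b$ are the learned parameters of the binary classification head.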
2.2 Step 1: Initialize from Checkpoint
All previous works have exclusively used BERT (Devlin et al., 2019) as their initialization checkpoint and do not consider other open-source pre-trained language models. We hypothesize this is because BERT's NSP task and DRS share the same task formulation, so BERT may learn representations that are more helpful to DRS. However, RoBERTa (Liu et al., 2019) has shown that BERT is under-trained and that removing the NSP task improves downstream task performance, raising questions about whether BERT is the best choice for DRS.
To verify this assumption, we use both BERT and RoBERTa in our experiments. To the best of our knowledge, we are the first to include RoBERTa for DRS and to show that both achieve similar performance.
2.3 Step 2: Task-adaptive Pre-training
2.3.1 Data Augmentation
All previous works have performed data augmentation to generate more data for task-adaptive pre-training. While several works have devised fine-grained data augmentation methods such as utterance insertion/deletion (Whang et al., 2021) or next-session prediction (Xu et al., 2021), we use a standard data augmentation methodology commonly used in the dialogue literature (Mehri et al., 2019; Gunasekara et al., 2019).
Given a context-response pair instance $(c = \{u_1, \dots, u_T\}, r)$ in the original training set, we generate $T-1$ additional context-response pairs $\{(c_1, r_1), \dots, (c_{T-1}, r_{T-1})\}$, where $c_t = \{u_1, \dots, u_t\}$ and $r_t = u_{t+1}$ for $t \in \{1, \dots, T-1\}$, giving a total of $T$ pre-training instances.
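As an illustration, a minimal Python sketch of this augmentation scheme follows; the function name and the representation of a dialogue as a list of utterance strings are our own choices, and the negative responses needed for the 1:1 pos:neg pre-training ratio in Table 1 would be sampled separately.

from typing import List, Tuple

def augment_dialogue(utterances: List[str]) -> List[Tuple[List[str], str]]:
    """Expand one dialogue of T utterances into T-1 (context, response) pairs.

    For t in {1, ..., T-1}, the context c_t is the first t utterances and the
    response r_t is utterance t+1; together with the original pair this yields
    T pre-training instances per dialogue.
    """
    pairs = []
    for t in range(1, len(utterances)):
        context = utterances[:t]   # u_1, ..., u_t
        response = utterances[t]   # u_{t+1} in 1-indexed notation
        pairs.append((context, response))
    return pairs

# Example: a 4-utterance dialogue yields 3 additional pairs.
dialogue = ["my install hangs", "which release?", "20.04", "try the server image"]
for ctx, resp in augment_dialogue(dialogue):
    print(ctx, "->", resp)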
2.3.2 Pre-training Task
While BERT's MLM+NSP objective is a natural default choice for TAP (Whang et al., 2020), most recent works have hypothesized that MLM+NSP is not capable of modeling dialogue patterns and have designed dialogue-specific tasks such as incoherence detection (Xu et al., 2021), order shuffling (Whang et al., 2021), and fine-grained matching (Li et al., 2021), among others. However, Han et al. (2021a) achieved a new cross-encoder state-of-the-art with a simple variant of MLM+NSP, raising questions about whether these dialogue-specific fine-grained tasks actually learn better.
In our experiments, we follow the input representation of Whang et al. (2020), use MLM+NSP for TAP, and achieve new state-of-the-art results on Ubuntu.
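To make the MLM+NSP objective concrete, the sketch below runs one TAP training step with Hugging Face's BertForPreTraining. It illustrates the objective only, with simplified masking (plain 15% [MASK] replacement rather than BERT's 80/10/10 scheme); it is not the authors' exact implementation, and the example strings are invented.

import torch
from transformers import BertForPreTraining, BertTokenizerFast

tokenizer = BertTokenizerFast.from_pretrained("bert-base-uncased")
model = BertForPreTraining.from_pretrained("bert-base-uncased")

# One (context, response) pre-training pair; next_sentence_label 0 means the
# response follows the context, 1 would mark a sampled negative.
context = "how do i upgrade to 20.04 ? run do-release-upgrade"
response = "thanks , that worked"
enc = tokenizer(context, response, return_tensors="pt", truncation=True)

# Simplified MLM masking: mask ~15% of non-special tokens; positions that are
# not masked get label -100 so the MLM loss ignores them.
labels = enc["input_ids"].clone()
special = torch.tensor(tokenizer.get_special_tokens_mask(
    enc["input_ids"][0].tolist(), already_has_special_tokens=True)).bool()
mask = (torch.rand(labels.shape) < 0.15) & ~special.unsqueeze(0)
if not mask.any():      # guard against a draw that masks nothing
    mask[0, 1] = True   # position 1 is the first (non-special) context token
labels[~mask] = -100
enc["input_ids"][mask] = tokenizer.mask_token_id

out = model(**enc, labels=labels, next_sentence_label=torch.tensor([0]))
out.loss.backward()  # out.loss sums the MLM and NSP losses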
2.4 Step 3: Finetuning
Last, the TAP models are fine-tuned on the original datasets on the DRS/NSP task to ensure a fair comparison. Given this shared task formulation, our MLM+NSP objective in step 2 can be viewed as multi-task learning with MLM as an auxiliary task and NSP as the primary task (Ruder, 2017).
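Written out, and assuming an equally weighted sum as in BERT's original pre-training objective (the paper does not state the weights), the two steps optimize

\[
\mathcal{L}_{\mathrm{TAP}} = \mathcal{L}_{\mathrm{MLM}} + \mathcal{L}_{\mathrm{NSP}},
\qquad
\mathcal{L}_{\mathrm{finetune}} = \mathcal{L}_{\mathrm{NSP}} = -\sum_i \Big[ y_i \log f(c, r_i) + (1 - y_i) \log\big(1 - f(c, r_i)\big) \Big],
\]

with $f(c, r_i)$ interpreted as the predicted probability that $r_i$ is a proper response for $c$.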
3 Experimental Setup
3.1 Implementation Details
We used the open-source PyTorch-Lightning framework (Falcon et al., 2019; Paszke et al., 2019; Wolf et al., 2020) to implement our models. We use the BERT-Base model architecture with the Adam optimizer (Kingma and Ba, 2015) and performed grid search over learning rates of {1e-5, 5e-5, 1e-4} for both TAP and fine-tuning. For TAP, we trained for 50 epochs and