the dynamic flow of dialogue in each utterance.
Meanwhile, the model is optimized with variational
inference by maximizing the evidence lower bound
of the likelihood.
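For reference, a generic sketch of this objective (the specific latent variables and factorization of our model are introduced in Section 3): for an observation $x$ with latent variable $z$, generative parameters $\theta$, and variational parameters $\phi$, the evidence lower bound (ELBO) is
$$\log p_\theta(x) \;\geq\; \mathbb{E}_{q_\phi(z \mid x)}\big[\log p_\theta(x \mid z)\big] \;-\; \mathrm{KL}\big(q_\phi(z \mid x)\,\|\,p_\theta(z)\big),$$
and maximizing it jointly trains the generator $p_\theta$ and the approximate posterior $q_\phi$.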
We conduct experiments on two multi-turn dialogue generation benchmarks, DailyDialog (Li et al., 2017) and ConvAI2 (Dinan et al., 2020). Thanks to the transferable latent structure, our model yields better dialogue responses than four strong baselines in terms of both automatic and human evaluations; in particular, with only about 22%-66% of the parameters, our model delivers a 2x-30x speedup in running time. Moreover, the proposed model is explainable: the discrete latent variables can be visualized.
Our contributions in this paper are three-fold: (1) We present a context-free dialogue structure that captures prior knowledge about state transitions in a large-scale dialogue corpus. Furthermore, with the help of this dialogue structure, our model outperforms the state-of-the-art dialogue pre-training method with far fewer parameters. (2) We propose a disentangled structure learning framework to induce a context-free dialogue structure that enjoys better transferability and interpretability. (3) We empirically verify the effectiveness and efficiency of the proposed model on two benchmarks.
2 Related Work
The success of neural networks in machine translation spurred early research on end-to-end open-domain dialogue generation (Ritter et al., 2011; Shang et al., 2015; Vinyals and Le, 2015). Various adaptations to the vanilla encoder-decoder architecture have been built to model the structure of dialogue contexts (Serban et al., 2016, 2017; Zhang et al., 2019); improve response diversity (Li et al., 2015; Zhao et al., 2017; Tao et al., 2018); introduce external knowledge (Dinan et al., 2019; Zhao et al., 2020a,b); and control response qualities (Xu et al., 2019; Zhou et al., 2017; Zhang et al., 2018; Wang et al., 2018; See et al., 2019).
Large-scale pre-training for open-domain dialogue generation has recently become a promising way to bridge the gap between conversation with existing systems and conversation with humans. Inspired by the success of GPT-2 (Radford et al., 2019), Zhang et al. (2020) propose training transformer models on a very large dialogue dataset to generate informative responses. Bao et al. (2020) further use discrete latent variables to address the one-to-many mapping problem in open-domain dialogue. Despite these successes, the dialogue context is simply concatenated into a long sequence, which may fail to capture the discourse-level coherence among utterances. To this end, Gu et al. (2021) and Li et al. (2021) introduce additional self-supervision objectives to capture discourse-level coherence and the dynamic information flow, respectively.
The concept of dialogue structure has proven
useful in modeling the complicated relationships
between utterances. In the field of task-oriented
dialogue, Shi et al. (2019) propose a discrete varia-
tional recurrent neural network (DVRNN) to learn
the dialogue structure through unsupervised learn-
ing; Qiu et al. (2020) further propose to enhance
prior work with a structured attention mechanism;
and Sun et al. (2021) propose a conversational
graph to represent deterministic dialogue structure,
where nodes and edges represent the utterance and
context information, respectively. In the field of open-domain dialogue, Xu et al. (2021) construct a large dialogue structure graph with around 1.6 million vertices to cover a wide range of topics. This work introduces a disentangled structure learning framework, which can induce a transferable substructure and an interpretable dialogue substructure, to incorporate the structural bias into dialogue pre-training. Thanks to the tailored self-supervised tasks, our latent structure is more general than the dialogue structures in existing work.
3 Approach
3.1 Overview
Let $X = (u_1, u_2, \cdots, u_n)$ denote a dialogue session, with $u_t = (w_{t,1}, w_{t,2}, \cdots, w_{t,m})$ denoting the $t$-th utterance and $w_{t,i}$ the $i$-th token in it. The number of utterances in a session and the number of tokens in each utterance are represented by $n$ and $m$, respectively. The conversational context for $u_t$ is $u_{<t} = (u_1, u_2, \cdots, u_{t-1})$. Our ultimate goal is to develop a generation model $p(u_t \mid u_{<t})$ that can predict the next utterance based on the context of the conversation.
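For concreteness, the following is a minimal sketch of this notation in code; the names (Utterance, Session, ResponseGenerator) are illustrative and not part of the proposed model.

```python
from typing import List

Utterance = List[str]      # u_t = (w_{t,1}, ..., w_{t,m}), a sequence of m tokens
Session = List[Utterance]  # X = (u_1, ..., u_n), a sequence of n utterances


def context(session: Session, t: int) -> Session:
    """Return u_{<t} = (u_1, ..., u_{t-1}), the context of the t-th utterance (1-indexed)."""
    return session[: t - 1]


class ResponseGenerator:
    """Hypothetical interface for a generation model of p(u_t | u_{<t})."""

    def generate(self, ctx: Session) -> Utterance:
        # A concrete model would decode the most likely next utterance here.
        raise NotImplementedError


# Usage: predict the t-th utterance from its conversational context,
# i.e., response = model.generate(context(session, t)).
```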
Figure 1 illustrates the overview of our graphical model, which includes the proposed latent structure consisting of three kinds of latent variables, i.e., $c = [c_1, c_2, \cdots, c_n]$, $z^I = [z^I_1, z^I_2, \cdots, z^I_n]$, and $z^S$. Specifically, $c$ depicts the flow of a conversation, and each $c_i \in \{1, \cdots, N\}$ is a discrete latent variable with $N$ as a hyper-parameter. It is