Towards Efficient Dialogue Pre-training with Transferable and Interpretable Latent Structure

Xueliang Zhao1,3*, Lemao Liu2, Tingchen Fu4, Shuming Shi2, Dongyan Zhao1,3,5†, Rui Yan4†
1Wangxuan Institute of Computer Technology, Peking University
2Tencent AI Lab  3Center for Data Science, AAIS, Peking University
4Gaoling School of Artificial Intelligence, Renmin University of China
5Beijing Institute for General Artificial Intelligence
{xl.zhao,zhaody}@pku.edu.cn  {redmondliu,shumingshi}@tencent.com
lucas.futingchen@gmail.com  ruiyan@ruc.edu.cn
Abstract
With the availability of massive general-domain dialogue data, pre-trained dialogue generation is appealing for transferring knowledge from the general domain to downstream applications. In most existing work, such transferability is mainly obtained by fitting a large model with hundreds of millions of parameters on massive data in an exhaustive way, leading to inefficient running and poor interpretability. This paper proposes a novel dialogue generation model with a latent structure that is easily transferable from the general domain to downstream tasks in a lightweight and transparent way. Experiments on two benchmarks validate the effectiveness of the proposed model. Thanks to the transferable latent structure, our model yields better dialogue responses than four strong baselines in terms of both automatic and human evaluations, and with only about 22% of the parameters it delivers a 5x speedup in running time compared with the strongest baseline. Moreover, the proposed model is explainable by interpreting its discrete latent variables.
1 Introduction
Conversation between humans and machines has long been a goal of artificial intelligence (AI). Building open-domain dialogue systems with data-driven techniques has received considerable attention in the AI and NLP communities in recent years, thanks to breakthroughs in deep learning (Sutskever et al., 2014; Gehring et al., 2017; Vaswani et al., 2017). In particular, with the availability of massive human dialogue data (e.g., Reddit comments) on social media (Adiwardana et al., 2020), pre-trained dialogue generation is appealing as a way to alleviate potential discrepancies between the general domain and downstream applications (Zhang et al., 2020; Bao et al., 2020, 2021; Li et al., 2021).

* This work was done while X. Zhao was an intern at Tencent AI Lab.
† Corresponding authors: Dongyan Zhao and Rui Yan.
The common idea behind pre-trained dialogue generation can be summarized as a two-step pipeline: (a) train a deep neural model on massive general-domain dialogue data, and (b) transfer the model to downstream tasks via fine-tuning or zero-shot learning. Under this pipeline, transferability is mainly obtained by fitting a large model with millions of parameters on massive data in an exhaustive way. Consequently, the downsides of existing works are obvious: they run inefficiently and their outputs are difficult to explain.
This paper therefore aims to build a pre-trained dialogue model that is easily transferable from the general domain to downstream tasks in a lightweight and transparent way. To this end, we propose a novel dialogue model with a latent structure consisting of several latent variables. By using self-supervised tasks to endow its latent variables with prior properties during training, the latent structure makes the knowledge better transferable across different domains. Specifically, we first propose to augment the transformer architecture with a discrete conversation flow. Given a dialogue session, our model sequentially infers the discrete state of each utterance, which provides essential hints for future states and affects the generation of the associated utterance. We further propose a method to disentangle the context-sensitive information from the conversation flow, which is achieved by two disentangled latent variables that capture the context-sensitive information (e.g., topic and persona) and the context-independent information (e.g., the dialogue logic of each utterance), respectively. Through tailor-designed self-supervised tasks, the context-sensitive latent variable captures the holistic information of a dialogue session while the context-independent variable reflects the dynamic flow of dialogue in each utterance.
Meanwhile, the model is optimized with variational inference by maximizing the evidence lower bound of the likelihood.
We conduct experiments on two multi-turn dialogue generation benchmarks, DailyDialog (Li et al., 2017) and ConvAI2 (Dinan et al., 2020). Thanks to the transferable latent structure, our model yields better dialogue responses than four strong baselines in terms of both automatic and human evaluations, and with only about 22% to 66% of the parameters it delivers a 2x to 30x speedup in running time. Moreover, the proposed model is explainable by visualizing the discrete latent variables.
Our contributions in this paper are three-fold: (1) We present a context-free dialogue structure that captures prior knowledge about state transitions in a large-scale dialogue corpus; with the help of this dialogue structure, our model outperforms the state-of-the-art dialogue pre-training method with far fewer parameters. (2) We propose a disentangled structure learning framework to induce a context-free dialogue structure that enjoys better transferability and interpretability. (3) We empirically verify the effectiveness and efficiency of the proposed model on two benchmarks.
2 Related Work
The success of neural networks in machine translation promoted early research on end-to-end open-domain dialogue generation (Ritter et al., 2011; Shang et al., 2015; Vinyals and Le, 2015). Various adaptations of the vanilla encoder-decoder architecture have been built to model the structure of dialogue contexts (Serban et al., 2016, 2017; Zhang et al., 2019); improve response diversity (Li et al., 2015; Zhao et al., 2017; Tao et al., 2018); introduce external knowledge (Dinan et al., 2019; Zhao et al., 2020a,b); and control response qualities (Xu et al., 2019; Zhou et al., 2017; Zhang et al., 2018; Wang et al., 2018; See et al., 2019).
Large-scale pre-training for open-domain dialogue generation has recently become a promising way to bridge the gap between conversations with existing systems and conversations with humans. Inspired by the success of GPT-2 (Radford et al., 2019), Zhang et al. (2020) propose to train transformer models on a very large dialogue dataset to generate informative text. Bao et al. (2020) further use discrete latent variables to address the one-to-many mapping problem in open-domain dialogue. Despite these successes, the dialogue context is simply concatenated into a long sequence, which may fail to capture the discourse-level coherence among utterances. To this end, Gu et al. (2021) and Li et al. (2021) introduce additional self-supervision objectives to capture the discourse-level coherence and the dynamic information flow, respectively.
The concept of dialogue structure has proven useful in modeling the complicated relationships between utterances. In the field of task-oriented dialogue, Shi et al. (2019) propose a discrete variational recurrent neural network (DVRNN) to learn the dialogue structure through unsupervised learning; Qiu et al. (2020) further propose to enhance prior work with a structured attention mechanism; and Sun et al. (2021) propose a conversational graph to represent deterministic dialogue structure, where nodes and edges represent the utterance and context information, respectively. In the field of open-domain dialogue, Xu et al. (2021) construct a large dialogue structure graph with around 1.6 million vertices to cover a wide range of topics. This work introduces a disentangled structure learning framework, which can induce a transferable substructure and an interpretable dialogue substructure, to incorporate structural bias into dialogue pre-training. Thanks to the tailor-designed self-supervised tasks, our latent structure is more general than the dialogue structures in existing work.
3 Approach
3.1 Overview
Let $X = (u_1, u_2, \cdots, u_n)$ denote a dialogue session, with $u_t = (w_{t,1}, w_{t,2}, \cdots, w_{t,m})$ denoting the $t$-th utterance and $w_{t,i}$ the $i$-th token in it. The number of utterances in a session and the number of tokens in each utterance are denoted by $n$ and $m$, respectively. The conversational context for $u_t$ is $u_{<t} = (u_1, u_2, \cdots, u_{t-1})$. Our ultimate goal is to develop a generation model $p(u_t \mid u_{<t})$ that predicts the next utterance based on the conversational context.
Figure 1 illustrates the overview of our graphical model, which includes the proposed latent structure consisting of three kinds of latent variables, i.e., $c = [c_1, c_2, \cdots, c_n]$, $z^I = [z^I_1, z^I_2, \cdots, z^I_n]$, and $z^S$. Specifically, $c$ depicts the flow of a conversation, and each $c_i \in \{1, \cdots, N\}$ is a discrete latent variable with $N$ as a hyper-parameter.

[Figure 1: Graphical illustrations of the generation and inference processes. Left: inference of the approximate posterior as described in Section 3.3. Right: generation of $u$ and computation of the prior as described in Section 3.2. The latent structure consists of $c$, $z^I$, and $z^S$. The initial hidden state $h_0$ is a trainable parameter.]
It is worth noting that $c$ is designed for interpretability: by interpreting these discrete variables, humans are able to understand the logical flow of the conversation, as will be shown in Section 5.4. Moreover, $z^S$ and $z^I$ are two disentangled latent variables that capture the context-sensitive information and the context-independent information in a dialogue session, respectively. In this way, by disentangling $z^S$ and $z^I$ with tailor-designed self-supervised learning objectives (as will be described in Section 3.4), our model is able to capture the intrinsic conversation flow for better generalization to different domains (i.e., transferability).
With our designed latent structure, given a conversational context $u_{<t}$, the generation of the next utterance $u_t$ can be roughly decomposed into two steps: (1) infer the conversation flow $[c_1, c_2, \cdots, c_{t-1}]$ and the context-sensitive variable $z^S$ based on the context information, as shown in Figure 1 (left); (2) compute the priors of $c_t$ and $z^I_t$, and then generate the next utterance $u_t$ with $z^I_t$ and $z^S$, as shown in Figure 1 (right).
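To make this two-step decomposition concrete, the following is one plausible factorization consistent with the graphical model in Figure 1 (right); the exact conditional dependencies (e.g., that the prior of $z^I_t$ depends only on $c_t$) are assumptions for illustration rather than a statement of the authors' final model:

$$
p\big(u_t, c_t, z^I_t \mid u_{<t}, c_{<t}, z^S\big)
= \underbrace{p(c_t \mid c_{<t})}_{\text{conversation flow}}
\;\underbrace{p(z^I_t \mid c_t)}_{\text{context-independent prior}}
\;\underbrace{p(u_t \mid u_{<t}, z^I_t, z^S)}_{\text{utterance generation}} .
$$

Under a factorization of this form, training maximizes an evidence lower bound on the likelihood, with the approximate posterior over $c$, $z^I$, and $z^S$ inferred as sketched in Figure 1 (left).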
3.2 Generation
Context Encoding. We first obtain the contextualized representations of utterances through pre-trained language models (PLMs). Specifically, we exploit GPT-2 (Radford et al., 2019), which is pre-trained with the causal language modeling objective and achieves state-of-the-art results on a range of text generation tasks, as the backbone of our model. Note that our technical novelty lies in the proposal of a disentangled structure learning framework that injects a transferable dialogue structure into PLMs. Given a dialogue session $X = (u_1, u_2, \cdots, u_n)$, we first construct the input $I$ by concatenating all utterances into a single consecutive token sequence:
$$I = \texttt{[BOS]}\, u_1\, \texttt{[EOS]}\, u_2\, \texttt{[EOS]} \cdots \texttt{[EOS]}\, u_n\, \texttt{[EOS]}, \qquad (1)$$
where $\texttt{[BOS]}$ and $\texttt{[EOS]}$ are special tokens designed to separate sentences. The input $I$ is then fed into the PLM, and the contextualized representation for $X$ is defined by the hidden states at the last layer:
$$h_{1,1}, \cdots, h_{t,i}, \cdots, h_{n,m} = f_{\mathrm{trans}}(I) \in \mathbb{R}^{mn \times d}, \qquad (2)$$
where $f_{\mathrm{trans}}(\cdot)$ denotes the transformer model (Vaswani et al., 2017) and $h_{t,i} \in \mathbb{R}^d$ denotes the hidden state corresponding to token $w_{t,i}$. It is notable that we use uni-directional attention, since the learning objectives are applied to all utterances (as illustrated in Section 3.4) and a bi-directional architecture would leak future information.
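As a concrete illustration of Eqs. (1)-(2), below is a minimal sketch assuming a HuggingFace GPT-2 backbone; the special-token registration and the helper function are illustrative assumptions, not the authors' released code:

```python
# Minimal sketch of context encoding (Eqs. 1-2), assuming a HuggingFace GPT-2
# backbone. Special-token handling and the helper below are illustrative.
import torch
from transformers import GPT2Model, GPT2Tokenizer

tokenizer = GPT2Tokenizer.from_pretrained("gpt2")
# GPT-2 has no built-in utterance separators, so we register [BOS]/[EOS]
# as extra special tokens (an assumption about preprocessing).
tokenizer.add_special_tokens({"additional_special_tokens": ["[BOS]", "[EOS]"]})

model = GPT2Model.from_pretrained("gpt2")     # uni-directional (causal) attention
model.resize_token_embeddings(len(tokenizer))

def encode_session(utterances):
    """Build I = [BOS] u1 [EOS] ... un [EOS] and return last-layer hidden states."""
    text = "[BOS]" + "[EOS]".join(utterances) + "[EOS]"
    inputs = tokenizer(text, return_tensors="pt")
    with torch.no_grad():
        outputs = model(**inputs)
    return outputs.last_hidden_state          # shape: (1, sequence_length, d)

hidden = encode_session(["Hi, how are you?", "Great, thanks!", "Any plans today?"])
print(hidden.shape)
```

In practice, the token-level states would then be regrouped by utterance before the attentive pooling described next.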
The vector representation of the $t$-th utterance is obtained through attentive pooling (Wu et al., 2020), which is defined as follows:
$$h_t = \sum_{i=1}^{m} \alpha_{t,i}\, h_{t,i}, \qquad \alpha_{t,i} = \frac{\exp(q \cdot h_{t,i})}{\sum_{j=1}^{m} \exp(q \cdot h_{t,j})}, \qquad (3)$$
where $q \in \mathbb{R}^d$ is the attention query vector.
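A minimal PyTorch sketch of the attentive pooling in Eq. (3) follows; the module name and initialization scheme are illustrative assumptions:

```python
# Minimal sketch of attentive pooling (Eq. 3): a learnable query q scores each
# token state, and the utterance vector is the attention-weighted sum.
import torch
import torch.nn as nn

class AttentivePooling(nn.Module):
    def __init__(self, d_model: int):
        super().__init__()
        # q in R^d: learnable attention query vector.
        self.q = nn.Parameter(torch.randn(d_model) / d_model ** 0.5)

    def forward(self, token_states: torch.Tensor) -> torch.Tensor:
        """token_states: (m, d) hidden states of one utterance -> (d,) utterance vector."""
        scores = token_states @ self.q          # (m,)  q · h_{t,i}
        alpha = torch.softmax(scores, dim=-1)   # (m,)  attention weights
        return alpha @ token_states             # (d,)  h_t = sum_i alpha_{t,i} h_{t,i}

pool = AttentivePooling(d_model=768)
utterance_vec = pool(torch.randn(12, 768))  # e.g., 12 tokens, GPT-2 hidden size 768
print(utterance_vec.shape)                  # torch.Size([768])
```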
Prior of Discrete Latent Variable. The discrete latent variables $[c_1, c_2, \cdots, c_n]$ are used to automatically discover the structural representation of dialogues, which is beneficial for analyzing how a conversation flows from one utterance to the next and promotes interpretability. We exclude the impact of $u_{<t}$ on $c_t$, since there is usually a domain discrepancy between the pre-training and downstream data, which limits the transferability of the learned conversation flow. As a result, we directly model the influence of $c_{<t}$ on $c_t$ in the prior. We employ a transformer model with uni-directional attention to generate the contextualized representation of $c_{<t}$:
$$h^c_1, \cdots, h^c_{t-1} = f_{\mathrm{ctrans}}\big([c_1, \cdots, c_{t-1}]\big) \in \mathbb{R}^{(t-1) \times d}. \qquad (4)$$
Then the probability of predicting $c_t$ is defined as:
$$p(c_t \mid c_{<t}) = \mathrm{Softmax}\big(f_{\mathrm{cmlp}}(h^c_{t-1})\big), \qquad (5)$$
where $f_{\mathrm{cmlp}}(\cdot)$ denotes an MLP network. Different from Shi et al. (2019), our model preserves the
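To illustrate Eqs. (4)-(5), the following is a minimal PyTorch sketch of the prior over the discrete conversation-flow variables: an embedding of the previous states $c_{<t}$, a uni-directional (causal) transformer $f_{\mathrm{ctrans}}$, and an MLP $f_{\mathrm{cmlp}}$ followed by a softmax. Module names, layer sizes, and the omission of positional encodings are illustrative assumptions rather than the authors' implementation:

```python
# Minimal sketch of the prior p(c_t | c_<t) in Eqs. (4)-(5): causal transformer
# over previous discrete states, then an MLP and a softmax over N states.
import torch
import torch.nn as nn

class ConversationFlowPrior(nn.Module):
    def __init__(self, num_states: int, d_model: int = 256, n_layers: int = 2, n_heads: int = 4):
        super().__init__()
        self.state_emb = nn.Embedding(num_states, d_model)
        layer = nn.TransformerEncoderLayer(d_model, n_heads, batch_first=True)
        self.f_ctrans = nn.TransformerEncoder(layer, n_layers)
        self.f_cmlp = nn.Sequential(nn.Linear(d_model, d_model), nn.Tanh(),
                                    nn.Linear(d_model, num_states))

    def forward(self, c_prev: torch.Tensor) -> torch.Tensor:
        """c_prev: (batch, t-1) previous discrete states -> (batch, N) = p(c_t | c_<t)."""
        x = self.state_emb(c_prev)                                   # (batch, t-1, d)
        t = c_prev.size(1)
        # Upper-triangular boolean mask: True positions cannot be attended to,
        # enforcing uni-directional (causal) attention over c_<t.
        causal_mask = torch.triu(torch.ones(t, t, dtype=torch.bool), diagonal=1)
        h_c = self.f_ctrans(x, mask=causal_mask)                     # Eq. (4)
        return torch.softmax(self.f_cmlp(h_c[:, -1]), dim=-1)        # Eq. (5), uses h^c_{t-1}

prior = ConversationFlowPrior(num_states=10)
probs = prior(torch.tensor([[3, 7, 1]]))   # states c_1, c_2, c_3 -> distribution over c_4
print(probs.shape)                         # torch.Size([1, 10])
```

At generation time, one would sample or take the argmax of $c_t$ from this distribution before computing the prior of $z^I_t$, as described in the overview.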