the dynamic flow of dialogue in each utterance.
Meanwhile, the model is optimized with variational
inference by maximizing the evidence lower bound
of the likelihood.
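For reference, a generic sketch of this objective (the specific latent variables and factorization of our model are introduced in Section 3): for an observation $x$ with latent variable $z$, generative parameters $\theta$, and variational parameters $\phi$, the evidence lower bound (ELBO) is
$$\log p_\theta(x) \;\geq\; \mathbb{E}_{q_\phi(z \mid x)}\big[\log p_\theta(x \mid z)\big] \;-\; \mathrm{KL}\big(q_\phi(z \mid x)\,\|\,p_\theta(z)\big),$$
and maximizing it jointly trains the generator $p_\theta$ and the approximate posterior $q_\phi$.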
We conduct experiments on two multi-turn dialogue generation benchmarks, DailyDialog (Li et al., 2017) and ConvAI2 (Dinan et al., 2020). Thanks to the transferable latent structure, our model yields better dialogue responses than four strong baselines in terms of both automatic and human evaluations; in particular, with only about 22%-66% of the parameters, our model delivers a 2x-30x speedup in running time. Moreover, the proposed model is explainable: the discrete latent variables can be visualized.
Our contributions in this paper are three-fold: (1) We present a context-free dialogue structure that captures prior knowledge about state transitions in a large-scale dialogue corpus. Furthermore, with the help of this dialogue structure, our model outperforms the state-of-the-art dialogue pre-training method with far fewer parameters. (2) We propose a disentangled structure learning framework to induce a context-free dialogue structure that enjoys better transferability and interpretability. (3) We empirically verify the effectiveness and efficiency of the proposed model on two benchmarks.
2 Related Work
The success of neural networks in machine translation spurred early research on end-to-end open-domain dialogue generation (Ritter et al., 2011; Shang et al., 2015; Vinyals and Le, 2015). Various adaptations to the vanilla encoder-decoder architecture have been built to model the structure of dialogue contexts (Serban et al., 2016, 2017; Zhang et al., 2019); improve response diversity (Li et al., 2015; Zhao et al., 2017; Tao et al., 2018); introduce external knowledge (Dinan et al., 2019; Zhao et al., 2020a,b); and control response qualities (Xu et al., 2019; Zhou et al., 2017; Zhang et al., 2018; Wang et al., 2018; See et al., 2019).
Large-scale pre-training for open-domain dialogue generation has recently become a promising way to bridge the gap between conversation with existing systems and conversation with humans. Inspired by the success of GPT-2 (Radford et al., 2019), Zhang et al. (2020) propose training transformer models on a very large dialogue dataset to generate informative responses. Bao et al. (2020) further use discrete latent variables to address the one-to-many mapping problem in open-domain dialogue. Despite these successes, the dialogue context is simply concatenated into a long sequence, which may fail to capture the discourse-level coherence among utterances. To this end, Gu et al. (2021) and Li et al. (2021) introduce additional self-supervision objectives to capture discourse-level coherence and the dynamic information flow, respectively.
The concept of dialogue structure has proven
useful in modeling the complicated relationships
between utterances. In the field of task-oriented
dialogue, Shi et al. (2019) propose a discrete varia-
tional recurrent neural network (DVRNN) to learn
the dialogue structure through unsupervised learn-
ing; Qiu et al. (2020) further propose to enhance
prior work with a structured attention mechanism;
and Sun et al. (2021) propose a conversational
graph to represent deterministic dialogue structure,
where nodes and edges represent the utterance and
context information, respectively. In the field of open-domain dialogue, Xu et al. (2021) construct a large dialogue structure graph with around 1.6 million vertices to cover a wide range of topics. This work introduces a disentangled structure learning framework, which can induce a transferable substructure and an interpretable dialogue substructure, to incorporate the structural bias into dialogue pre-training. Thanks to the tailored self-supervised tasks, our latent structure is more general than the dialogue structures in existing work.
3 Approach
3.1 Overview
Let $X = (u_1, u_2, \cdots, u_n)$ denote a dialogue session, with $u_t = (w_{t,1}, w_{t,2}, \cdots, w_{t,m})$ denoting the $t$-th utterance and $w_{t,i}$ the $i$-th token in it. The number of utterances in a session and the number of tokens in each utterance are represented by $n$ and $m$, respectively. The conversational context for $u_t$ is $u_{<t} = (u_1, u_2, \cdots, u_{t-1})$. Our ultimate goal is to develop a generation model $p(u_t \mid u_{<t})$ that can predict the next utterance based on the context of the conversation.
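For concreteness, the following is a minimal sketch of this notation in code; the names (Utterance, Session, ResponseGenerator) are illustrative and not part of the proposed model.

```python
from typing import List

Utterance = List[str]      # u_t = (w_{t,1}, ..., w_{t,m}), a sequence of m tokens
Session = List[Utterance]  # X = (u_1, ..., u_n), a sequence of n utterances


def context(session: Session, t: int) -> Session:
    """Return u_{<t} = (u_1, ..., u_{t-1}), the context of the t-th utterance (1-indexed)."""
    return session[: t - 1]


class ResponseGenerator:
    """Hypothetical interface for a generation model of p(u_t | u_{<t})."""

    def generate(self, ctx: Session) -> Utterance:
        # A concrete model would decode the most likely next utterance here.
        raise NotImplementedError


# Usage: predict the t-th utterance from its conversational context,
# i.e., response = model.generate(context(session, t)).
```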
Figure 1 illustrates the overview of our graphical model, which includes the proposed latent structure consisting of three kinds of latent variables, i.e., $c = [c_1, c_2, \cdots, c_n]$, $z^I = [z^I_1, z^I_2, \cdots, z^I_n]$, and $z^S$. Specifically, $c$ depicts the flow of a conversation, and each $c_i \in \{1, \cdots, N\}$ is a discrete latent variable with $N$ as a hyper-parameter. It is