Transformer-Based Conditioned Variational Autoencoder for Dialogue
Generation
Huihui Yang
Zhejiang University
yanghh0@zju.edu.cn
Abstract
In human dialogue, any one query usually elicits numerous appropriate responses. A Transformer-based dialogue model, being a one-to-one mapping function, tends to produce sentences that occur frequently in the corpus. The conditional variational autoencoder (CVAE) is a technique for reducing such generic replies. In this paper, we build a dialogue model (CVAE-T) based on the Transformer with a CVAE structure. We use a pre-trained masked language model (MLM) to rewrite some key n-grams in responses to obtain a series of negative examples, and introduce a regularization term during training to explicitly guide the latent variable in learning the semantic differences between each pair of positive and negative examples. Experiments suggest that the method we design is capable of producing more informative replies.
1 Introduction
The training data used to train dialogue models contains a great deal of unknown background information, making dialogue a one-to-many problem in which different people can come up with different but reasonable answers to the same question. Generative diversity is therefore a crucial characteristic for building dialogue systems. Zhao et al. (2017) use the CVAE for dialogue modeling and demonstrate that the sentences produced by the CVAE model are more diverse than those produced by a conventional sequence-to-sequence model.
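For reference, the CVAE in this setting conditions on the dialogue context c and generates the response x through a latent variable z, and is trained by maximizing the usual evidence lower bound (written in standard CVAE notation rather than notation specific to this paper):

\mathcal{L}(\theta,\phi;x,c) = \mathbb{E}_{q_\phi(z\mid x,c)}\left[\log p_\theta(x\mid z,c)\right] - \mathrm{KL}\big(q_\phi(z\mid x,c)\,\|\,p_\theta(z\mid c)\big),

where q_\phi is the recognition (posterior) network and p_\theta(z\mid c) is the prior network.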
For the CVAE model, the approximate posterior carries little useful information at the beginning of training, and the decoder tends to fit the target distribution directly without relying on the latent variable, which is known as KL vanishing (Bowman et al., 2016). To alleviate this problem, some researchers introduce dialogue intent labels (Zhao et al., 2017) or sentence function labels (interrogative, declarative and imperative) (Ke et al., 2018) as additional information to supervise the posterior network learning. However, this method has several drawbacks: 1) It is expensive to annotate labels and challenging to expand to large-scale datasets. 2) It only focuses on the attributes of a certain aspect of sentences, and a limited number of tags can hardly cover all the attributes of that aspect. 3) The tags themselves do not carry semantic information, which is not conducive to model learning. We observe that some key words or phrases in a sentence can serve as representations of high-level sentence attributes, obviating the need for additional tags.
We locate the key n-grams in each response using a keyword extraction algorithm and replace each of them with a special token [MASK]. These masked positions are rewritten by a pre-trained MLM to generate a series of negative sentences semantically distinct from the original sentence. A regularization term is used to constrain the prior and posterior distributions during training, helping the latent variable to perceive the difference between positive and negative examples.
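As a concrete illustration of this negative-example construction, the sketch below masks one key n-gram and lets a pre-trained MLM rewrite it. The keyword extractor and the particular MLM are not fixed by this description, so the choice of bert-base-uncased and the example n-gram here are assumptions for illustration only; the regularization term itself is not shown.

# Minimal sketch of the negative-example construction described above.
# Assumptions (illustration only): the key n-gram has already been chosen by
# a keyword-extraction step, and bert-base-uncased serves as the MLM rewriter.
from transformers import pipeline

fill_mask = pipeline("fill-mask", model="bert-base-uncased")

def make_negatives(response: str, key_ngram: str, top_k: int = 5):
    """Mask one key n-gram in the response and let the MLM rewrite it."""
    # For simplicity the whole n-gram is replaced by a single [MASK] token.
    masked = response.replace(key_ngram, fill_mask.tokenizer.mask_token, 1)
    candidates = fill_mask(masked, top_k=top_k)
    # Keep only rewrites whose filled token differs from the original n-gram,
    # so each negative example is semantically distinct from the positive one.
    return [c["sequence"] for c in candidates
            if c["token_str"].strip().lower() != key_ngram.lower()]

# One positive response paired with several negative rewrites.
negatives = make_negatives("i really enjoy playing the guitar", "guitar")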
Dialogue models should be able to handle long dependencies well, because conversation datasets usually contain multiple rounds of sentences and, as the conversation goes on, the dialogue history accumulates into a very long sequence. Transformer-based models (Zhang et al., 2020; Roller et al., 2020) have shown strong generative power when trained on large-scale conversational corpora. Thanks to its self-attention mechanism and excellent parallelism, the Transformer is well suited to processing long sequences. Its hierarchical structure also enables the decoder to incorporate the latent variable in a more flexible manner. We choose the Transformer as the encoder-decoder framework and explore how the CVAE structure can be better integrated with it for dialogue generation.
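One simple way to condition a Transformer decoder on a latent variable, shown below as a PyTorch sketch, is to project the variable to the model dimension and add it to every decoder input embedding; this is a common scheme offered only for illustration and is not necessarily the integration used in this paper.

import torch
import torch.nn as nn

# Hedged sketch (not necessarily this paper's scheme): condition the decoder
# on a latent variable z by projecting it to the model dimension and adding
# it to every decoder input embedding.
class LatentConditioner(nn.Module):
    def __init__(self, d_model: int, d_latent: int):
        super().__init__()
        self.proj = nn.Linear(d_latent, d_model)

    def forward(self, token_embeddings: torch.Tensor, z: torch.Tensor):
        # token_embeddings: (batch, seq_len, d_model); z: (batch, d_latent)
        return token_embeddings + self.proj(z).unsqueeze(1)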
The contributions of this paper can be summarized as follows: