Recurrence Boosts Diversity! Revisiting Recurrent Latent Variable in
Transformer-Based Variational AutoEncoder for Diverse Text Generation
Jinyi Hu1,2,3, Xiaoyuan Yi5, Wenhao Li1,2,3, Maosong Sun1,2,3,4∗, Xing Xie5
1Department of Computer Science and Technology, Tsinghua University, Beijing
2Beijing National Research Center for Information Science and Technology
3Institute for Artificial Intelligence, Tsinghua University, Beijing
4Jiangsu Collaborative Innovation Center for Language Ability, Jiangsu Normal University, Xuzhou
5Microsoft Research Asia
hu-jy21@mails.tsinghua.edu.cn, xiaoyuanyi@microsoft.com
∗Corresponding author. Email: sms@tsinghua.edu.cn
Abstract
Variational Auto-Encoder (VAE) has been widely adopted in text generation. Among many variants, recurrent VAE learns token-wise latent variables with each conditioned on the preceding ones, which captures sequential variability better in the era of RNN. However, it is unclear how to incorporate such recurrent dynamics into the recently dominant Transformer due to its parallelism. In this work, we propose TRACE, a Transformer-based recurrent VAE structure. TRACE imposes recurrence on segment-wise latent variables with arbitrarily separated text segments and constructs the posterior distribution with residual parameterization. Besides, we design an acceleration method by approximating idempotent matrices, which allows parallelism while maintaining the conditional dependence of latent variables. We demonstrate that TRACE could enhance the entanglement of each segment and preceding latent variables and deduce a non-zero lower bound of the KL term, providing a theoretical guarantee of generation diversity. Experiments on various generation tasks show that TRACE achieves significantly improved diversity while maintaining satisfactory generation quality.
1 Introduction
Variational Auto-Encoder (VAE) (Kingma and Welling, 2014; Rezende et al., 2014) has thrived in various text generation tasks due to its ability to learn flexible representations, such as machine translation (Shah and Barber, 2018) and the generation of dialogue (Zhao et al., 2017), story (Yu et al., 2020) and poetry (Yi et al., 2020). To further improve expressive power, diverse VAE variants have been proposed. For example, IWAE (Burda et al., 2016) optimizes a tighter lower bound; Ladder VAE (Sønderby et al., 2016) learns hierarchical latent representations; and vMF-VAE (Xu and Durrett, 2018) replaces Gaussian distributions with von Mises-Fisher distributions.
Among all variants, temporal VAE (Fabius et al., 2015; Chung et al., 2015) was prevalent in the era of RNN; it captures temporal variability by introducing a series of dependent latent variables, each associated with one time step. Such a VAE variant has succeeded in various kinds of sequence modeling tasks, e.g., dialog generation (Kim et al., 2020), audio generation (Franceschi et al., 2020), and time series prediction (Li et al., 2019).
Temporal VAE can be categorized into three paradigms according to the dependency of prior distributions at each time step: a) independent normal distributions (abbr. IND) (Li et al., 2020c); b) context-conditioned Gaussian distributions (abbr. CGD) (Du et al., 2018), which are conditioned on the preceding text; and c) recurrent Gaussian distributions (abbr. RGD), i.e., Recurrent VAE (Chung et al., 2015), which are conditioned on both the preceding text and the preceding latent variables¹. Both IND and CGD ignore the interaction of latent variables, limiting their expressive ability. In comparison, by introducing the dependency among latent variables, RGD can better model the sequential variability and thus greatly improve generation diversity while maintaining satisfactory quality. We provide the theoretical proof of this advantage in Sec. 4.3.
These paradigms can be easily implemented with RNN, benefiting from RNN's natural recurrent structure. Stepping into the age of Transformer (Vaswani et al., 2017), it is promising to adapt temporal VAE to this popular architecture. The IND and CGD paradigms are naturally compatible with Transformer because their latent variables at each time step are independent, so they can simply be combined with the parallel computation of Transformer self-attention via causal and non-causal masks (Lin et al., 2020). However, there are no off-the-shelf solutions for incorporating RGD into Transformer-based VAEs, since recurrence is a natural obstacle to parallelism (recurrent latent variables need to be sampled sequentially), which limits the capacity of this potential VAE paradigm.

¹See Sec. 3.2 for mathematical details of these paradigms.
Could we equip Transformer with such recurrent dynamics for better diversity while keeping the training parallelism? To answer this question, we propose TRACE², a novel Transformer-based recurrent VAE structure. TRACE imposes recurrence on segment-wise (instead of token-wise) latent variables with arbitrary segmentation, e.g., sentences or segments of a specified length. Besides, we construct the posterior distribution using residual parameterization and layer normalization, from which a non-zero lower bound of the KL loss can be derived to alleviate KL vanishing (Bowman et al., 2016). Moreover, to accelerate training, we design a method that recovers the parallelism of Transformer by approximating idempotent parameter matrices for the latent space, leading to improved diversity, satisfactory quality, and faster training.

²TRACE: Transformer Recurrent AutoenCodEr
In summary, our contributions are as follows: We are the first to (i) incorporate recurrent VAE into Transformer with recurrence on segment-wise latent variables, which allows a flexible trade-off between diversity and quality; (ii) propose a method to recover parallelism and accelerate training with comparable performance; (iii) mathematically demonstrate that our model admits a non-zero lower bound of the KL term that mitigates KL vanishing, together with a theoretical interpretation of the diversity improvement; and (iv) validate the effectiveness of our model on two unconditional and one conditional generation tasks.
2 Related Work
VAE has shown great effectiveness in a wide range of NLG tasks, such as storytelling (Yu et al., 2020; Fang et al., 2021), dialogue generation (Serban et al., 2017; Bao et al., 2020) and poetry composition (Yi et al., 2021). To further improve the expressive ability of VAE, researchers have proposed various variants, e.g., vMF-VAE (Xu and Durrett, 2018), which replaces the latent distribution with the von Mises-Fisher distribution; ml-VAE (Bouchacourt et al., 2018), which learns multi-level latent variables; and BN-VAE (Zhu et al., 2020), which utilizes batch normalization to obtain a non-zero KL lower bound.
Among all variants, temporal VAE was the most prevalent one in the era of RNN; it introduces latent variables at each timestep and thus naturally fits the recurrent structure of RNN. Existing temporal VAEs fall into three paradigms according to the parameterization and dependence of the latent variables' prior distributions, namely IND, CGD, and RGD, as mentioned in Sec. 1. For example, TWR-VAE (Li et al., 2020c) applies timestep-wise regularisation through independent latent variables (IND). VAD (Du et al., 2018) incorporates CGD into latent variables and augments the posterior distribution with a backward RNN. Recurrent VAE (Chung et al., 2015) learns token-wise latent variables, each sequentially conditioned on the preceding ones as well as the context (i.e., RGD). By modeling the trajectory of both the observed text sequence and the latent space, recurrent VAE can better capture sequential variability (Goyal et al., 2017; Hajiramezanali et al., 2020). Besides, we will show that such recurrent dynamics can theoretically reinforce the dependence on the stochastic and generalized latent space, thus boosting generation diversity by a large margin.
Recently, with the flourishing of the powerful Transformer architecture, researchers have devoted much effort to combining it with VAE for text modeling and generation (Wang and Wan, 2019; Li et al., 2020a; Fang et al., 2021; Hu et al., 2022). VAEs can promote generation diversity with satisfactory quality, benefiting from the intrinsic randomness of the latent space. Therefore, VAE-based Transformers are essential for various tasks demanding creativity, such as advertising text generation (Shao et al., 2019). Two of the temporal VAE paradigms, IND and CGD, can be easily adapted to Transformer. For instance, SVT (Lin et al., 2020) applies CGD-based VAE to dialogue generation. Nonetheless, the integration of recurrent VAE remains an open challenge due to the conflict between the parallelism of Transformer and the recurrent dependence of recurrent VAE. To fully exploit the expressive power of recurrence, we revisit recurrent VAE in Transformer and propose TRACE, which possesses the advantages of both generation diversity and training parallelism.
3 Preliminaries
3.1 VAE
As one of the representative generative models, VAE has proven to be an effective paradigm for estimating the data distribution by introducing a latent variable $z$ and modeling the joint distribution:
$$p(x, z) = p(x \mid z)\, p(z). \tag{1}$$
The prior distribution $p(z)$ is commonly a standard Gaussian distribution. The conditional distribution $p(x \mid z)$ is generally parameterized by a neural network, known as the generative network (decoder), which recovers the observed data from the latent variable. Directly estimating $p(x \mid z)$ leads to an intractable posterior distribution $p(z \mid x)$. Instead, VAE introduces a variational approximation $q(z \mid x)$ and derives the Evidence Lower BOund (ELBO):
$$\log p(x) \ge \mathcal{L}_{\mathrm{ELBO}}(x) = \mathbb{E}_{q(z \mid x)}\big[\log p(x \mid z)\big] - \mathrm{KL}\big(q(z \mid x) \,\|\, p(z)\big), \tag{2}$$
where $\mathrm{KL}$ denotes the Kullback-Leibler divergence. In practice, the approximate posterior $q(z \mid x)$ is parameterized as a Gaussian distribution $\mathcal{N}(\mu, \mathrm{diag}(\sigma^2))$, where $\mu$ and $\sigma$ are estimated by a neural network, known as the inference network (encoder). The generative network $p(x \mid z)$ and the inference network $q(z \mid x)$ are jointly optimized by maximizing the lower bound in Eq. (2).
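To make Eq. (2) concrete, the following PyTorch-style sketch estimates the ELBO with a single latent sample via the reparameterization trick. The encoder/decoder interfaces and tensor shapes are illustrative assumptions, not the implementation used in this paper.

```python
import torch
import torch.nn.functional as F

def elbo(encoder, decoder, x):
    """Monte-Carlo estimate of L_ELBO(x) with one latent sample.

    Assumes encoder(x) -> (mu, log_var) and decoder(x, z) -> token logits
    of shape (batch, seq_len, vocab); x holds token ids (batch, seq_len).
    """
    mu, log_var = encoder(x)                       # q(z|x) = N(mu, diag(sigma^2))
    # Reparameterization: z = mu + sigma * eps, eps ~ N(0, I)
    eps = torch.randn_like(mu)
    z = mu + torch.exp(0.5 * log_var) * eps
    # Reconstruction term E_q[log p(x|z)] as a (negative) cross-entropy
    logits = decoder(x, z)
    rec = -F.cross_entropy(logits.transpose(1, 2), x, reduction="sum")
    # Closed-form KL(q(z|x) || N(0, I)) for diagonal Gaussians
    kl = -0.5 * torch.sum(1 + log_var - mu.pow(2) - log_var.exp())
    return rec - kl
```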
3.2 Temporal VAE
Unlike the standard VAE, which involves only one latent variable $z$, temporal VAE learns one latent variable at each time step. Denote $z_t \in \mathbb{R}^l$ and $x_t \in \mathbb{R}^h$ as the latent variable and the observed data at the $t$-th step, respectively. Next, we present the mathematical details of the three paradigms of temporal VAE, namely IND, CGD, and RGD.
IND: The prior distribution $p(z_t)$ follows the standard Gaussian distribution $\mathcal{N}(0, I)$, and the posterior one is conditioned on the preceding context as $q(z_t \mid x_{\le t})$. Then, we obtain the ELBO of IND:
$$\sum_{t=1}^{T} \mathbb{E}_{q(z_t \mid x_{\le t})}\big[\log p(x_t \mid z_t, x_{<t})\big] - \mathrm{KL}\big(q(z_t \mid x_{\le t}) \,\|\, p(z_t)\big). \tag{3}$$
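Because the IND prior is $\mathcal{N}(0, I)$ at every step and the latent variables are mutually independent, the KL term of Eq. (3) can be computed for all $T$ steps at once. A minimal PyTorch sketch (the tensor shapes are assumed for illustration, not taken from the paper):

```python
import torch

def ind_kl(mu, log_var):
    """Sum of KL(N(mu_t, diag(sigma_t^2)) || N(0, I)) over all T steps.

    mu, log_var: posterior parameters of shape (batch, T, l); every step is
    handled in parallel because the prior does not depend on other steps.
    """
    return -0.5 * torch.sum(1 + log_var - mu.pow(2) - log_var.exp(), dim=(1, 2))
```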
CGD: CGD constructs the prior distribution conditioned on the observed text, $p(z_t \mid x_{<t})$, and the posterior one based on the complete text $x = \{x_1, \cdots, x_T\}$. The lower bound of CGD is:
$$\sum_{t=1}^{T} \mathbb{E}_{q(z_t \mid x)}\big[\log p(x_t \mid z_t, x_{<t})\big] - \mathrm{KL}\big(q(z_t \mid x) \,\|\, p(z_t \mid x_{<t})\big). \tag{4}$$
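The CGD prior $p(z_t \mid x_{<t})$ depends only on the observed context, so its parameters can still be produced for all steps in parallel, e.g., from the hidden states of a causally masked encoder. A hedged sketch, where the linear `prior_net` head is an illustrative assumption:

```python
import torch
import torch.nn as nn

class CGDPrior(nn.Module):
    """Maps causal hidden states h_t (which only see x_<t) to prior parameters."""

    def __init__(self, hidden_dim: int, latent_dim: int):
        super().__init__()
        self.prior_net = nn.Linear(hidden_dim, 2 * latent_dim)

    def forward(self, causal_hidden):              # (batch, T, hidden_dim)
        mu, log_var = self.prior_net(causal_hidden).chunk(2, dim=-1)
        return mu, log_var                         # parameters of p(z_t | x_<t)
```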
RGD: RGD parameterizes the generative process with the following factorization:
$$p(x_{\le T}, z_{\le T}) = \prod_{t=1}^{T} p(x_t \mid z_t, x_{<t})\, p(z_t \mid z_{<t}, x_{<t}). \tag{5}$$
The latent variable $z_t$ follows the prior distribution $p(z_t \mid z_{<t}, x_{<t})$, and the posterior follows $q(z_t \mid z_{<t}, x_{\le t})$. Then, we obtain the ELBO:
$$\mathbb{E}_{q(z_{\le T} \mid x_{\le T})}\Big[\sum_{t=1}^{T} \log p(x_t \mid z_t, x_{<t}) - \mathrm{KL}\big(q(z_t \mid z_{<t}, x_{\le t}) \,\|\, p(z_t \mid z_{<t}, x_{<t})\big)\Big], \tag{6}$$
where $q(z_{\le T} \mid x_{\le T})$ can be factorized as:
$$q(z_{\le T} \mid x_{\le T}) = \prod_{t=1}^{T} q(z_t \mid z_{<t}, x_{\le t}). \tag{7}$$
We present the detailed derivation of Eq. (6) in Appendix B.1.
In an RNN-like backbone, we can construct the representation of $x_t$ from the hidden state at the $t$-th step and compute the distribution parameters of $z_t$ accordingly.
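The sketch below illustrates this recurrent construction for ancestral sampling under Eq. (5): each $z_t$ is drawn from a prior parameterized by a hidden state that summarizes $(x_{<t}, z_{<t})$, so the steps must run sequentially. The components `rnn_cell`, `prior_net`, and `decode_step` are hypothetical stand-ins, not the architecture proposed in this paper.

```python
import torch

def rgd_sample(rnn_cell, prior_net, decode_step, T, hidden_dim):
    """Ancestral sampling under the RGD factorization of Eq. (5)."""
    h = torch.zeros(1, hidden_dim)                 # summarizes (x_<t, z_<t)
    outputs = []
    for _ in range(T):
        # Prior p(z_t | z_<t, x_<t), parameterized from the recurrent state
        mu, log_var = prior_net(h).chunk(2, dim=-1)
        z_t = mu + torch.exp(0.5 * log_var) * torch.randn_like(mu)
        # Emit x_t from p(x_t | z_t, x_<t), then fold (x_t, z_t) back into h;
        # this z -> h -> z chain is what prevents parallel computation.
        x_t = decode_step(h, z_t)
        h = rnn_cell(torch.cat([x_t, z_t], dim=-1), h)
        outputs.append(x_t)
    return outputs
```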
4 Method
To incorporate recurrent VAE (RGD) into Transformer, we propose TRACE, which learns recurrent segment-wise latent variables, and design an acceleration method to make full use of the parallelism of Transformer. We present the adaptation of recurrent VAE to Transformer and the residual parameterization in Sec. 4.1, describe the parallel training method in Sec. 4.2, and provide a theoretical interpretation of TRACE's effectiveness in boosting diversity in Sec. 4.3.
4.1 Transformer-based Recurrent VAE
Different from the token-wise latent variables used in RNN-based VAEs, TRACE learns a segment-wise latent variable $z_t$ based on the representation of the $t$-th segment $x_t$. We can devise different principles to separate the segments, such as inherent separations like sentences or utterances, or a specified fixed segment length. We add a special token [SEP] to the end of each segment.
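As a concrete illustration of the fixed-length option, the snippet below splits a token sequence into segments and appends [SEP] to each; the segment length of 16 is an arbitrary example, not a value reported in the paper. With sentence-level segmentation, the same idea applies with sentence boundaries in place of the fixed length.

```python
def segment_tokens(tokens, seg_len=16, sep="[SEP]"):
    """Split a token list into consecutive segments, each ending with [SEP]."""
    return [tokens[i:i + seg_len] + [sep]
            for i in range(0, len(tokens), seg_len)]
```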
Fig. 1 depicts the architecture of TRACE. At the encoder, we design two kinds of attention mask matrices. First, we introduce an extra mask matrix, a partitioned lower triangular matrix (the left of Fig. 1), which allows each token to attend to all tokens in the same segment and in previous segments. Second, we design an intra mask matrix, a strictly partitioned matrix that makes each token attend only to the tokens within the same segment. We input the separated text sequence into the Transformer encoder twice, with the extra and intra mask matrices respectively.
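As a rough sketch of these two masks, assuming each token carries a non-decreasing segment id, boolean attention masks can be built by comparing query and key segment ids; `True` marks a position a query token may attend to. This is an illustrative construction, not the paper's exact code.

```python
import torch

def build_segment_masks(segment_ids):
    """segment_ids: LongTensor of shape (seq_len,), e.g. [0, 0, 0, 1, 1]."""
    q = segment_ids.unsqueeze(1)       # segment id of each query position
    k = segment_ids.unsqueeze(0)       # segment id of each key position
    extra_mask = k <= q                # same segment or any previous segment
    intra_mask = k == q                # strictly within the same segment
    return extra_mask, intra_mask

# Example: two segments of lengths 3 and 2
extra, intra = build_segment_masks(torch.tensor([0, 0, 0, 1, 1]))
```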