
Transformer-based VAEs, since recurrence would be a natural obstacle to parallelism (recurrent latent variables need to be sequentially sampled), which limits the capacity of this potential VAE paradigm.
Could we equip Transformer with such recurrent dynamics for better diversity while keeping the training parallelism? To answer this question, we propose TRACE (Transformer Recurrent AutoenCodEr), a novel Transformer-based recurrent VAE structure. TRACE imposes recurrence on segment-wise (instead of token-wise) latent variables under arbitrary segmentation, e.g., sentences or segments of a specified length. In addition, we construct the posterior distribution with residual parameterization and layer normalization, which induces a non-zero lower bound on the KL loss to alleviate KL vanishing (Bowman et al., 2016). Moreover, to accelerate training, we design a method to recover the parallelism of Transformer by approximating idempotent parameter matrices for the latent space, leading to improved diversity, satisfactory quality, and faster training.
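To make the segment-wise recurrence concrete, the following is a minimal PyTorch sketch of the general idea only: one latent variable per segment, a prior and posterior both conditioned on the previous latent, a layer-normalized posterior mean, and a per-segment Gaussian KL term. The module names, shapes, and parameterization choices are illustrative assumptions rather than TRACE's exact formulation (in particular, the residual parameterization details and the idempotent-matrix approximation that restores parallel training are omitted).

```python
import torch
import torch.nn as nn

class SegmentRecurrentLatent(nn.Module):
    """Illustrative segment-wise recurrent latent module: z_s depends on z_{s-1}."""

    def __init__(self, hidden_size: int, latent_size: int):
        super().__init__()
        # Prior network maps the previous latent to Gaussian parameters (schematic).
        self.prior_net = nn.Linear(latent_size, 2 * latent_size)
        # Posterior network additionally sees the encoder summary of segment s.
        self.post_net = nn.Linear(latent_size + hidden_size, 2 * latent_size)
        self.layer_norm = nn.LayerNorm(latent_size)  # normalized posterior mean (schematic)
        self.latent_size = latent_size

    def forward(self, seg_states: torch.Tensor):
        # seg_states: (batch, num_segments, hidden_size), one pooled encoder
        # vector per segment (e.g., per sentence).
        batch, num_segments, _ = seg_states.shape
        z_prev = seg_states.new_zeros(batch, self.latent_size)
        zs, kls = [], []
        for s in range(num_segments):  # sequential only over segments, not tokens
            p_mu, p_logvar = self.prior_net(z_prev).chunk(2, dim=-1)
            q_in = torch.cat([z_prev, seg_states[:, s]], dim=-1)
            q_mu, q_logvar = self.post_net(q_in).chunk(2, dim=-1)
            q_mu = self.layer_norm(q_mu)
            # Reparameterized sample from the posterior.
            z = q_mu + torch.randn_like(q_mu) * (0.5 * q_logvar).exp()
            # KL(q || p) between two diagonal Gaussians, summed over dimensions.
            kl = 0.5 * (
                p_logvar - q_logvar
                + (q_logvar.exp() + (q_mu - p_mu) ** 2) / p_logvar.exp()
                - 1.0
            ).sum(-1)
            zs.append(z)
            kls.append(kl)
            z_prev = z
        return torch.stack(zs, dim=1), torch.stack(kls, dim=1)
```

Note that the sequential loop here runs over segments rather than tokens, which already shortens the recurrence; during training, TRACE further removes this sequential dependence via the idempotent-matrix approximation.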
In summary, our contributions are as follows: (i) We are the first to incorporate recurrent VAE into Transformer, with recurrence on segment-wise latent variables that allows a flexible trade-off between diversity and quality. (ii) We propose a method to recover parallelism and accelerate training with comparable performance. (iii) We mathematically demonstrate that our model admits a non-zero lower bound on the KL loss that mitigates KL vanishing, and provide a theoretical interpretation of the diversity improvement. (iv) We validate the effectiveness of our model on two unconditional generation tasks and one conditional generation task.
2 Related Work
VAE has shown great effectiveness in a wide range of NLG tasks, such as storytelling (Yu et al., 2020; Fang et al., 2021), dialogue generation (Serban et al., 2017; Bao et al., 2020), and poetry composition (Yi et al., 2021). To further improve the expressive ability of VAE, researchers have proposed various variants, e.g., vMF-VAE (Xu and Durrett, 2018), which replaces the latent distribution with a von Mises-Fisher distribution; ml-VAE (Bouchacourt et al., 2018), which learns multi-level latent variables; and BN-VAE (Zhu et al., 2020), which utilizes batch normalization to obtain a non-zero KL lower bound.
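As background for the non-zero KL lower bound mentioned above, the following is a brief recap of the standard argument in generic notation (not the exact statement of Zhu et al., 2020). For a diagonal Gaussian posterior and a standard Gaussian prior,
\[
\mathrm{KL}\big(\mathcal{N}(\mu, \mathrm{diag}(\sigma^2)) \,\|\, \mathcal{N}(0, I)\big)
= \frac{1}{2}\sum_{i=1}^{d}\big(\mu_i^2 + \sigma_i^2 - \log\sigma_i^2 - 1\big)
\;\ge\; \frac{1}{2}\sum_{i=1}^{d}\mu_i^2,
\]
since $\sigma_i^2 - \log\sigma_i^2 - 1 \ge 0$. Any parameterization that keeps the second moment of each $\mu_i$ bounded away from zero across the data, e.g., by normalizing $\mu$ with a fixed scale, therefore guarantees a strictly positive expected KL and prevents it from collapsing.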
Among all variants, temporal VAE was the most prevalent one in the era of RNNs, since it introduces latent variables at each timestep and thus naturally fits the recurrent structure of RNNs. Existing temporal VAEs fall into three paradigms according to the parameterization and dependence of the latent variables' prior distributions, namely IND, CGD, and RGD, as mentioned in Sec. 1 and schematized below. For example, TWR-VAE (Li et al., 2020c) applies a timestep-wise regularization through independent latent variables (IND). VAD (Du et al., 2018) incorporates CGD into the latent variables and augments the posterior distribution with a backward RNN. Recurrent VAE (Chung et al., 2015) learns token-wise latent variables, each sequentially conditioned on the preceding ones as well as the context (i.e., RGD). By modeling the trajectory of both the observed text sequence and the latent space, recurrent VAE captures sequential variability better (Goyal et al., 2017; Hajiramezanali et al., 2020). Moreover, we will show that such recurrent dynamics theoretically reinforce the dependence on the stochastic and generalized latent space, thus boosting generation diversity by a large margin.
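For concreteness, the three paradigms can be schematized by how the prior over the token-wise latent variables $z_{1:T}$ factorizes; the notation below is an informal shorthand rather than a definition taken from the cited works:
\[
\text{IND: } \prod_{t=1}^{T} p(z_t), \qquad
\text{CGD: } \prod_{t=1}^{T} p(z_t \mid x_{<t}), \qquad
\text{RGD: } \prod_{t=1}^{T} p(z_t \mid z_{<t}, x_{<t}).
\]
Under RGD, each latent variable is conditioned on all previously sampled latents, which is exactly what precludes naive parallel sampling and motivates the design of TRACE.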
Recently, with the flourishing of the powerful Transformer architecture, researchers have devoted much effort to combining it with VAE for text modeling and generation (Wang and Wan, 2019; Li et al., 2020a; Fang et al., 2021; Hu et al., 2022). VAEs can promote generation diversity with satisfactory quality, benefiting from the intrinsic randomness of the latent space. Therefore, VAE-based Transformers are essential for various tasks demanding creativity, such as advertising text generation (Shao et al., 2019). Two of the temporal VAE paradigms, IND and CGD, can be easily adapted to Transformer. For instance, SVT (Lin et al., 2020) applies CGD-based VAE to dialogue generation. Nonetheless, integrating recurrent VAE remains an open challenge due to the conflict between the parallelism of Transformer and the recurrent dependence of recurrent VAE. To fully exploit the expressive power of recurrence, we revisit recurrent VAE in Transformer and propose TRACE, which possesses the advantages of both generation diversity and training parallelism.
3 Preliminaries
3.1 VAE
As one of the representative generative models, VAE has proven to be an effective paradigm for estimating the data distribution by introducing a latent variable z and modeling the joint distribution:
\[
p(x, z) = p(x \mid z)\, p(z). \tag{1}
\]
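As a concrete illustration of how this factorization is typically trained, below is a generic, textbook-style PyTorch sketch of the negative ELBO with the reparameterization trick; the encoder/decoder interfaces are placeholders, and this is not TRACE's specific architecture or objective.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

def neg_elbo(encoder: nn.Module, decoder: nn.Module, x: torch.Tensor) -> torch.Tensor:
    """Negative ELBO for a vanilla VAE with a standard Gaussian prior.

    Assumed (placeholder) interfaces:
      encoder(x) -> (mu, logvar) parameterizing q(z|x) = N(mu, diag(exp(logvar)))
      decoder(z) -> logits over the vocabulary, shape (batch, seq_len, vocab)
    """
    mu, logvar = encoder(x)
    # Reparameterization trick: z = mu + sigma * eps with eps ~ N(0, I),
    # which keeps the sampling step differentiable w.r.t. mu and logvar.
    z = mu + torch.randn_like(mu) * (0.5 * logvar).exp()
    # Reconstruction term -E_q[log p(x|z)], estimated with one sample of z;
    # x is assumed to hold token ids of shape (batch, seq_len).
    logits = decoder(z)
    rec = F.cross_entropy(logits.reshape(-1, logits.size(-1)), x.reshape(-1))
    # Closed-form KL(q(z|x) || N(0, I)) for diagonal Gaussians.
    kl = -0.5 * (1 + logvar - mu.pow(2) - logvar.exp()).sum(-1).mean()
    return rec + kl  # minimizing this maximizes the ELBO
```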