
Transformer-based VAEs, since recurrence would be a natural obstacle to parallelism (recurrent latent variables need to be sequentially sampled), which limits the capacity of this potential VAE paradigm.
Could we equip Transformer with such recurrent dynamics for better diversity while keeping the training parallelism? To answer this question, we propose TRACE (Transformer Recurrent AutoenCodEr), a novel Transformer-based recurrent VAE structure. TRACE imposes recurrence on segment-wise (instead of token-wise) latent variables under arbitrary segmentation, e.g., sentences or segments of a specified length. In addition, we construct the posterior distribution with residual parameterization and layer normalization, which induces a non-zero lower bound on the KL loss to alleviate KL vanishing (Bowman et al., 2016). Moreover, to accelerate training, we design a method to recover the parallelism of Transformer by approximating idempotent parameter matrices for the latent space, leading to improved diversity, satisfactory quality, and faster training.
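To make the segment-wise recurrence concrete, the following is a minimal PyTorch sketch of the general idea only: one latent variable per segment, a prior and posterior both conditioned on the previous latent, a layer-normalized posterior mean, and a per-segment Gaussian KL term. The module names, shapes, and parameterization choices are illustrative assumptions rather than TRACE's exact formulation (in particular, the residual parameterization details and the idempotent-matrix approximation that restores parallel training are omitted).

```python
import torch
import torch.nn as nn

class SegmentRecurrentLatent(nn.Module):
    """Illustrative segment-wise recurrent latent module: z_s depends on z_{s-1}."""

    def __init__(self, hidden_size: int, latent_size: int):
        super().__init__()
        # Prior network maps the previous latent to Gaussian parameters (schematic).
        self.prior_net = nn.Linear(latent_size, 2 * latent_size)
        # Posterior network additionally sees the encoder summary of segment s.
        self.post_net = nn.Linear(latent_size + hidden_size, 2 * latent_size)
        self.layer_norm = nn.LayerNorm(latent_size)  # normalized posterior mean (schematic)
        self.latent_size = latent_size

    def forward(self, seg_states: torch.Tensor):
        # seg_states: (batch, num_segments, hidden_size), one pooled encoder
        # vector per segment (e.g., per sentence).
        batch, num_segments, _ = seg_states.shape
        z_prev = seg_states.new_zeros(batch, self.latent_size)
        zs, kls = [], []
        for s in range(num_segments):  # sequential only over segments, not tokens
            p_mu, p_logvar = self.prior_net(z_prev).chunk(2, dim=-1)
            q_in = torch.cat([z_prev, seg_states[:, s]], dim=-1)
            q_mu, q_logvar = self.post_net(q_in).chunk(2, dim=-1)
            q_mu = self.layer_norm(q_mu)
            # Reparameterized sample from the posterior.
            z = q_mu + torch.randn_like(q_mu) * (0.5 * q_logvar).exp()
            # KL(q || p) between two diagonal Gaussians, summed over dimensions.
            kl = 0.5 * (
                p_logvar - q_logvar
                + (q_logvar.exp() + (q_mu - p_mu) ** 2) / p_logvar.exp()
                - 1.0
            ).sum(-1)
            zs.append(z)
            kls.append(kl)
            z_prev = z
        return torch.stack(zs, dim=1), torch.stack(kls, dim=1)
```

Note that the sequential loop here runs over segments rather than tokens, which already shortens the recurrence; during training, TRACE further removes this sequential dependence via the idempotent-matrix approximation.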
In summary, our contributions are as follows: (i) We are the first to incorporate recurrent VAE into Transformer, with recurrence on segment-wise latent variables that allows a flexible trade-off between diversity and quality. (ii) We propose a method to recover parallelism and accelerate training with comparable performance. (iii) We mathematically demonstrate that our model admits a non-zero lower bound on the KL loss that mitigates KL vanishing, and provide a theoretical interpretation of the diversity improvement. (iv) We validate the effectiveness of our model on two unconditional generation tasks and one conditional generation task.
2 Related Work
VAE has shown great effectiveness in a wide range of NLG tasks, such as storytelling (Yu et al., 2020; Fang et al., 2021), dialogue generation (Serban et al., 2017; Bao et al., 2020), and poetry composition (Yi et al., 2021). To further improve the expressive ability of VAE, researchers have proposed various variants, e.g., vMF-VAE (Xu and Durrett, 2018), which replaces the latent distribution with a von Mises-Fisher distribution; ml-VAE (Bouchacourt et al., 2018), which learns multi-level latent variables; and BN-VAE (Zhu et al., 2020), which utilizes batch normalization to obtain a non-zero KL lower bound.
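As background for the non-zero KL lower bound mentioned above, the following is a brief recap of the standard argument in generic notation (not the exact statement of Zhu et al., 2020). For a diagonal Gaussian posterior and a standard Gaussian prior,
\[
\mathrm{KL}\big(\mathcal{N}(\mu, \mathrm{diag}(\sigma^2)) \,\|\, \mathcal{N}(0, I)\big)
= \frac{1}{2}\sum_{i=1}^{d}\big(\mu_i^2 + \sigma_i^2 - \log\sigma_i^2 - 1\big)
\;\ge\; \frac{1}{2}\sum_{i=1}^{d}\mu_i^2,
\]
since $\sigma_i^2 - \log\sigma_i^2 - 1 \ge 0$. Any parameterization that keeps the second moment of each $\mu_i$ bounded away from zero across the data, e.g., by normalizing $\mu$ with a fixed scale, therefore guarantees a strictly positive expected KL and prevents it from collapsing.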
Among all variants, temporal VAE was the most prevalent one in the era of RNNs, since it introduces latent variables at each timestep and thus naturally fits the recurrent structure of RNNs. Existing temporal VAEs fall into three paradigms according to the parameterization and dependence of the latent variables' prior distributions, namely IND, CGD, and RGD, as mentioned in Sec. 1 and schematized below. For example, TWR-VAE (Li et al., 2020c) applies a timestep-wise regularization through independent latent variables (IND). VAD (Du et al., 2018) incorporates CGD into the latent variables and augments the posterior distribution with a backward RNN. Recurrent VAE (Chung et al., 2015) learns token-wise latent variables, each sequentially conditioned on the preceding ones as well as the context (i.e., RGD). By modeling the trajectory of both the observed text sequence and the latent space, recurrent VAE captures sequential variability better (Goyal et al., 2017; Hajiramezanali et al., 2020). Moreover, we will show that such recurrent dynamics theoretically reinforce the dependence on the stochastic and generalized latent space, thus boosting generation diversity by a large margin.
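For concreteness, the three paradigms can be schematized by how the prior over the token-wise latent variables $z_{1:T}$ factorizes; the notation below is an informal shorthand rather than a definition taken from the cited works:
\[
\text{IND: } \prod_{t=1}^{T} p(z_t), \qquad
\text{CGD: } \prod_{t=1}^{T} p(z_t \mid x_{<t}), \qquad
\text{RGD: } \prod_{t=1}^{T} p(z_t \mid z_{<t}, x_{<t}).
\]
Under RGD, each latent variable is conditioned on all previously sampled latents, which is exactly what precludes naive parallel sampling and motivates the design of TRACE.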
Recently, with the flourishing of the powerful Transformer architecture, researchers have devoted much effort to combining it with VAE for text modeling and generation (Wang and Wan, 2019; Li et al., 2020a; Fang et al., 2021; Hu et al., 2022). VAEs can promote generation diversity with satisfactory quality, benefiting from the intrinsic randomness of the latent space. Therefore, VAE-based Transformers are essential for various tasks demanding creativity, such as advertising text generation (Shao et al., 2019). Two of the temporal VAE paradigms, IND and CGD, can be easily adapted to Transformer. For instance, SVT (Lin et al., 2020) applies CGD-based VAE to dialogue generation. Nonetheless, integrating recurrent VAE remains an open challenge due to the conflict between the parallelism of Transformer and the recurrent dependence of recurrent VAE. To fully exploit the expressive power of recurrence, we revisit recurrent VAE in Transformer and propose TRACE, which possesses the advantages of both generation diversity and training parallelism.
3 Preliminaries
3.1 VAE
As one of the representative generative models, VAE has proven to be an effective paradigm for estimating the data distribution by introducing a latent variable z and modeling the joint distribution:
\[
p(x, z) = p(x \mid z)\, p(z). \tag{1}
\]
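As a concrete illustration of how this factorization is typically trained, below is a generic, textbook-style PyTorch sketch of the negative ELBO with the reparameterization trick; the encoder/decoder interfaces are placeholders, and this is not TRACE's specific architecture or objective.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

def neg_elbo(encoder: nn.Module, decoder: nn.Module, x: torch.Tensor) -> torch.Tensor:
    """Negative ELBO for a vanilla VAE with a standard Gaussian prior.

    Assumed (placeholder) interfaces:
      encoder(x) -> (mu, logvar) parameterizing q(z|x) = N(mu, diag(exp(logvar)))
      decoder(z) -> logits over the vocabulary, shape (batch, seq_len, vocab)
    """
    mu, logvar = encoder(x)
    # Reparameterization trick: z = mu + sigma * eps with eps ~ N(0, I),
    # which keeps the sampling step differentiable w.r.t. mu and logvar.
    z = mu + torch.randn_like(mu) * (0.5 * logvar).exp()
    # Reconstruction term -E_q[log p(x|z)], estimated with one sample of z;
    # x is assumed to hold token ids of shape (batch, seq_len).
    logits = decoder(z)
    rec = F.cross_entropy(logits.reshape(-1, logits.size(-1)), x.reshape(-1))
    # Closed-form KL(q(z|x) || N(0, I)) for diagonal Gaussians.
    kl = -0.5 * (1 + logvar - mu.pow(2) - logvar.exp()).sum(-1).mean()
    return rec + kl  # minimizing this maximizes the ELBO
```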