
as input, we aim to learn a language model to calculate the distribution over the text samples 𝒀. Thus, when the condition is specified, the model can generate realistic text samples that fulfill the given condition. In practice, we usually leverage a trained text classifier to distinguish texts with different concepts (see Sec. 4.4.1 for the controllability analysis).
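Concretely, one way to sketch this objective is the standard autoregressive factorization below; the condition symbol c and the length T are our own shorthand here, not notation fixed by the surrounding text.

```latex
% Sketch of the conditional distribution, assuming an autoregressive decoder.
% c denotes the specified condition and y_{<t} the previously generated words
% (both symbols are our shorthand, not the paper's fixed notation).
p_\theta(\boldsymbol{Y} \mid c) \;=\; \prod_{t=1}^{T} p_\theta\!\left(y_t \mid y_{<t},\, c\right)
```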
To support generating sentences that fulfill such a request, recent research is mainly divided into three categories according to the training paradigm: supervised, self-supervised, and semi-supervised. Fully supervised methods widely employ adversarial components such as dedicated discriminators [3,51]. In spite of their high controllability, they require abundant labeled data and enormous computational resources, which is impractical for real-world applications. Self-supervised methods commonly explore the hidden embeddings of LMs [51,47] and attempt to capture the underlying control rules during training, yet they normally produce sequences with a low degree of control.
The third category is semi-supervised, which requires only limited labeled data for controllable generation. SVAE [24], the first semi-supervised VAE model, was initially applied to the visual domain. Duan et al. [6] brought its modeling formulation into the language domain, treating the label embedding as an extended part of the latent variable when label-text pairs are available. Li et al. [28] proposed OPTIMUS with BERT and GPT-2 as encoder and decoder respectively; they conducted controllable text generation via a latent-space adversarial network using two-stage training, which requires labeled data only at the second stage.
Apart from SVAE and OPTIMUS, an important branch named "Pre-train and Plug-in" (also known as plug-and-play, PnP) has been rising recently. Since labeled samples are generally required only at the "Plug-in" stage of PnP models, their training fashion is categorized as semi-supervised. Keskar et al. [22] pre-trained LMs with human-defined "control codes" in order to generate controllable texts, but this needs full-scale fine-tuning. To reduce training time, [4] first proposed the concept of plug-and-play for conditional text generation, which produces controlled sentences by pulling the gradients of LMs along the desired path using extra components with few parameters. However, it was built on large pre-trained language models and still requires hours to be trained. It was followed by PPVAE [6], which can be plugged into any pre-trained AE to create conditional texts. Nevertheless, PPVAE is not equipped with a label infuser that incorporates condition knowledge explicitly into generation, and thus has to train new plug-in VAEs when new conditions come in. Focusing on fine-grained generation, Mai et al. [35] further extended the PnP paradigm to text style transfer, treating target texts as labels and employing a novel "offset" network together with a latent adversarial loss for generation. Other lines of PnP controllable generation either change the prompts/prefixes fed into the base LMs during training [50,30], or shift the output probabilities of trained LMs at inference time [25,39]. These methods are mostly based on large pre-trained models and generally take hours to be fully tamed (sometimes their training time is even longer than fine-tuning) [25,30,15].
Table 1
The main variable denotations in our method.

Variable      Description
𝑿             Input unlabeled text corpus
𝒀             Input labeled text corpus
              A single word from a data point in 𝒀
𝑳             Task label set
𝒍𝒊            The i-th label from the label set
𝒀𝒊            Labeled text corpus with label 𝒍𝒊
𝒁𝒈            Global latent space
𝒛𝒈            Global latent vector from 𝒁𝒈
𝒁𝒍            Local latent space
𝒛𝒍            Local latent vector from 𝒁𝒍
              Label embedding network
𝒆𝒍𝒊           Label embedding of label 𝒍𝒊
(⋅)           The t-th latent transformation network
𝒛𝒍(𝒕)         The local latent vector after the t-th latent transformation
𝒉𝒊            The i-th hidden state of the decoder
(⋅)           The encoder of the models
(⋅)           The latent discriminator of AAE models
(⋅,⋅)         The kernel function
(⋅)           The prior distribution
(⋅)           The posterior distribution
3. PCAE Methodology
We present the main variable denotations in Table 1. The key idea of our framework is to reduce the resource consumption of training a language model with high controllability. The PnP framework, with one full model training and plug-in controllable components, is an efficient and flexible fit for this demand. Our model is thus separated into two disconnected sections, BaseAE and PluginAE, which correspond to the pre-training and plug-in training stages respectively. The model's workflow is shown in Figure 2: the first panel presents the model structure of BaseAE, the second panel shows the structure of PluginAE, and the third panel illustrates the process of controllable text generation, which requires components from both BaseAE and PluginAE.
For the pre-training stage, we use the unlabeled textual data 𝑿 to train the BaseAE language model (trained from scratch for the RNN-based model and fine-tuned for the BART-based model). For plug-in training, we input the text-label pairs {𝒀, 𝑳} = {𝒀𝒊, 𝒍𝒊}, where 𝒀𝒊 is the training corpus from 𝒀 with label 𝒍𝒊. We use these labeled data pairs for conditional training in order to obtain the controllable decoder of PluginAE, which takes the latent variable and the label condition to generate controllable texts. Thus, once PluginAE is trained, we only need to feed the model a global latent vector sampled from its prior, 𝒛𝒈 ∼ 𝒩(0, 𝑰), and a control label (one-hot label) for controlled generation. This training process lets PCAE access labels only at the second stage, which makes it semi-supervised.
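To make the two-stage workflow concrete, below is a minimal PyTorch-style sketch of the data flow only, not the authors' implementation: the module names, dimensions, training details, and the exact way the label embedding is fused with 𝒛𝒈 are illustrative assumptions.

```python
# Minimal sketch of a PCAE-style two-stage workflow (illustrative only).
import torch
import torch.nn as nn

VOCAB, EMB, HID, LATENT, NUM_LABELS = 1000, 64, 128, 32, 3

class BaseAE(nn.Module):
    """Stage 1: an auto-encoder trained on the unlabeled corpus X (training loop omitted)."""
    def __init__(self):
        super().__init__()
        self.emb = nn.Embedding(VOCAB, EMB)
        self.encoder = nn.GRU(EMB, HID, batch_first=True)
        self.to_latent = nn.Linear(HID, LATENT)      # produces the global latent z_g
        self.from_latent = nn.Linear(LATENT, HID)
        self.decoder = nn.GRU(EMB, HID, batch_first=True)
        self.out = nn.Linear(HID, VOCAB)

    def encode(self, tokens):                        # tokens: (batch, seq_len) int ids
        _, h = self.encoder(self.emb(tokens))
        return self.to_latent(h[-1])                 # (batch, LATENT)

    def decode(self, z, tokens):                     # z: (batch, LATENT)
        h0 = self.from_latent(z).unsqueeze(0)        # initial decoder hidden state
        out, _ = self.decoder(self.emb(tokens), h0)
        return self.out(out)                         # (batch, seq_len, VOCAB) logits

class PluginAE(nn.Module):
    """Stage 2: a lightweight plug-in; only this part sees the labeled pairs (Y_i, l_i)."""
    def __init__(self):
        super().__init__()
        self.label_emb = nn.Embedding(NUM_LABELS, LATENT)  # label embedding e_{l_i}
        self.transform = nn.Linear(2 * LATENT, LATENT)     # fuses (z_g, label) into a local latent z_l

    def forward(self, z_g, label):
        return self.transform(torch.cat([z_g, self.label_emb(label)], dim=-1))

# Controlled generation: sample z_g ~ N(0, I), pick a label, decode greedily.
base, plug = BaseAE(), PluginAE()                    # assume both stages are already trained
z_g = torch.randn(1, LATENT)                         # global latent vector from the prior
z_l = plug(z_g, torch.tensor([2]))                   # condition on label index 2
tokens = torch.zeros(1, 1, dtype=torch.long)         # start token (placeholder id 0)
for _ in range(10):                                  # greedy decoding for 10 steps
    logits = base.decode(z_l, tokens)
    tokens = torch.cat([tokens, logits[:, -1:].argmax(dim=-1)], dim=1)
print(tokens)                                        # generated token ids
```

The point of the sketch is the division of labor: the BaseAE never sees labels, while the lightweight PluginAE is the only component trained on labeled pairs, matching the semi-supervised setting described above.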