FAST: Improving Controllability for Text Generation with Feedback Aware Self-Training

Junyi Chai, Reid Pryzant, Victor Ye Dong, Konstantin Golobokov, Chenguang Zhu, Yi Liu
Microsoft Corporation
{juchai,reidpryzant,victordong,kogolobo,chezhu,lewisliu}@microsoft.com
Abstract

Controllable text generation systems often leverage control codes to direct various properties of the output like style and length. Inspired by recent work on causal inference for NLP, this paper reveals a previously overlooked flaw in these control code-based conditional text generation algorithms. Spurious correlations in the training data can lead models to incorrectly rely on parts of the input other than the control code for attribute selection, significantly undermining downstream generation quality and controllability. We demonstrate the severity of this issue with a series of case studies and then propose two simple techniques to reduce these correlations in training sets. The first technique is based on resampling the data according to an example's propensity towards each linguistic attribute (IPS). The second produces multiple counterfactual versions of each example and then uses an additional feedback mechanism to remove noisy examples (feedback aware self-training, FAST). We evaluate on 3 tasks – news headline, meta review, and search ads generation – and demonstrate that FAST can significantly improve the controllability and language quality of generated outputs when compared to state-of-the-art controllable text generation approaches.
1 Introduction

In neural text generation, there is a growing interest in controlling the presence of particular linguistic attributes in the output text, for example sentiment, length, politeness, and topic (Sennrich et al., 2016; Kikuchi et al., 2016; Ficler and Goldberg, 2017; Shen et al., 2022). This is typically accomplished via control codes: categorical variables that represent the desired output property and are prepended to the model inputs during training and testing (Keskar et al., 2019).

This paper builds on recent work in text-based causal inference (Feder et al., 2021; Veitch et al., 2021; Pryzant et al., 2021) to reveal a previously overlooked flaw in control code-based text generation systems: spurious correlations in the data can cause models to incorrectly rely on parts of the input other than the control code for attribute selection, undermining downstream generation performance.

For example, consider a system that generates news headlines while conditioning on article text and a control code for headline length (e.g. long for desktop, short for mobile) as in Murao et al. (2019). We show in §4.1 that among publicly available news datasets, correlations exist between the contents of an article and the length of that article's title. Longer articles or articles about technical topics may be associated with longer titles. This leads NLP models to struggle at generating short titles from "long title"-looking articles.
We show how this phenomenon can introduce confounding statistical relationships in the data, leading to assumption violations and significantly degrading model quality. Then we propose two simple data augmentation techniques for mitigating the issue. Both algorithms operate by breaking these spurious correlations and isolating the statistical relationship between control codes and linguistic attributes. In the first approach, we resample the training set according to an inverse propensity score (IPS, Robins et al. (1994)), boosting the presence of rare context-attribute combinations in the data. In the second approach (FAST) we train a preliminary model, use counterfactual data augmentation to generate all possible attributes for each example, then retrain on the counterfactually balanced dataset, as illustrated in Figure 1.

We conduct experiments in 3 conditional text generation scenarios: generating news headlines from article contents (controlling the headline lengths), generating the next sentence from preceding sentences (controlling the intent), and generating search ad copy from landing pages (controlling the rhetorical appeal of the ad).
[Figure 1: Illustration of the FAST algorithm. The four panels show: (1) train BART for conditional text generation on (control code $c$, context $x$) → target $y$; (2) generate counterfactual targets $y'$, $y''$, $y'''$ from the same context using different control codes $c'$, $c''$, $c'''$; (3) use a classifier $\mathcal{C}$ to detect the attribute of each generated target and filter out noisy examples whose detected attribute does not match its control code; (4) retrain BART on the original data augmented with the retained counterfactuals.]
Our results suggest that FAST can significantly improve the controllability and language quality of state-of-the-art controllable generation systems.

In summary, our contributions are:

• Identifying an important flaw with recent controllable text generation techniques and showing how this flaw can undermine model performance.

• A pair of simple yet effective data augmentation algorithms for dealing with this issue.

• Results and analysis demonstrating the efficacy of the proposed algorithms, and their ability to significantly improve controllability and language quality over state-of-the-art baselines.
2 Spuriously Correlated Control Codes
2.1 Controllable Generation
We focus on the case of conditional text generation, where the training data $D_{tr} = \{(x_1, y_1, a_1), \ldots, (x_n, y_n, a_n)\}$ is a collection of triples consisting of context text $x$, output text $y$, and output linguistic attribute $a$. Note that in many practical scenarios, $a$ is inferred from $y$ by a classifier $C(y)$, e.g. a rule or deep learning model. The goal is to learn a conditional language model (CLM) for $p(y|x, a)$, i.e. a text generation system which conditions on the context and attribute to generate texts that express the desired linguistic attribute.
In practice, the linguistic attributes $a$ are operationalized as control code tokens $c$ which are in one-to-one correspondence with the attributes (e.g. "short", "long") and prepended to the context $x$ before model input. This approach has been shown to be effective in both non-conditional (Keskar et al., 2019; Ficler and Goldberg, 2017) and conditional (Shen et al., 2022; Fan et al., 2018) controllable text generation.
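To make the setup concrete, here is a minimal sketch of how a control code might be prepended to the context before tokenization. The helper function, the § delimiter (borrowed from the BART+CTRL baseline in §3.2), and the example strings are illustrative assumptions rather than the paper's exact preprocessing:

```python
from transformers import BartTokenizer

def build_input(control_code: str, context: str, delimiter: str = "§") -> str:
    """Prepend the control code token to the context (hypothetical helper)."""
    return f"{control_code} {delimiter} {context}"

tokenizer = BartTokenizer.from_pretrained("facebook/bart-base")

# A "short" control code asks the model for a short headline.
model_input = tokenizer(
    build_input("short", "Researchers unveiled a new battery design that ..."),
    return_tensors="pt",
    truncation=True,
)
```

At training time the same format is used with the gold attribute as the control code; at test time the code can be swapped to request a different attribute.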
2.2 Spurious Correlations
In theory, the correspondence between the control code $c$ and linguistic attribute $a$ should cause models to rely on the control code to determine the linguistic properties of the generated output. This paper argues that in practice, parts of the context $x$ may be spuriously correlated with the attribute $a$, undermining the consistency and efficacy of control code-based systems.
These spurious correlations between the contexts and attributes have a causal interpretation that explains how they can undermine model performance. The issue is that $p(a|x) \neq p(a)$, which is similar to a violation of the ignorability assumption in causal inference (Feder et al., 2021). This implies that any spurious correlations between the context $x$ and target attribute $a$ could represent backdoor paths that confound the model's learned relationship between the control code $c$ and the target attributes $a$. Thus, models are likely to depend on context beyond the control code when determining output attributes, making them less likely to generalize to rare context/control-code combinations.
In this paper, we aim to break up these backdoor paths and prevent the model from learning spurious correlations. We accomplish this by modifying the training data in two ways such that $p(a|x) \approx p(a)$, with both techniques isolating the relationship between the control codes and target attributes.
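As a rough diagnostic of whether $p(a|x)$ deviates from $p(a)$ in a given corpus, one can check how well a simple classifier predicts the attribute from the context alone; accuracy well above the majority-class rate signals a spurious correlation. The sketch below is our own illustration (a bag-of-words probe, not a method from the paper) and assumes integer-coded attribute labels:

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

def spurious_correlation_probe(contexts, attributes):
    """Estimate how predictable attributes a are from contexts x alone.

    contexts:   list of raw context strings
    attributes: integer-coded attribute label per context
    """
    X = TfidfVectorizer(max_features=20_000).fit_transform(contexts)
    probe_acc = cross_val_score(
        LogisticRegression(max_iter=1000), X, attributes, cv=5
    ).mean()
    # Accuracy of always predicting the marginal mode, i.e. using p(a) only.
    majority_acc = np.bincount(attributes).max() / len(attributes)
    return probe_acc, majority_acc
```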
2.3 Inverse propensity score (IPS) resampling
The first method we investigate for breaking the aforementioned spurious correlations leverages propensity scores. A propensity score is the conditional probability of an example being assigned to a treatment, given background variables (Rosenbaum and Rubin, 1983). It plays a central role in causal inference for dealing with spurious correlations in observational data and is therefore a natural choice for us to try. In our case, the propensity score for the $i$th example is the conditional probability of the output text exhibiting linguistic attribute $a_i$ given the context $x_i$. This can be written as

$$w_i = p(a = a_i \mid x_i).$$

Intuitively, examples with low propensity scores represent rare attribute-context combinations that are especially important to learn (Tu et al., 2020). Therefore, our procedure works by resampling the data with replacement, setting the sample weight of the $i$th example to $1/w_i$. The procedure should work because the propensity scores of the resampled data should be close to uniform: each example's contribution becomes proportional to $p(a_i \mid x_i) \cdot \frac{1}{w_i} = w_i / w_i = 1$.
In practice, we train a model to estimate propensity scores. For the experiments we fine-tune RoBERTa (Liu et al., 2019) as a sequence classifier using $\{(x_1, a_1), \ldots, (x_n, a_n)\}$. We then use the model's predicted probability for the observed category $a_i$ as the propensity score estimate. We will refer to this estimator as $S(a|x)$.
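A minimal sketch of the resampling step, assuming the propensity estimates $w_i = S(a_i|x_i)$ have already been obtained from the fine-tuned classifier; the clipping threshold is an assumption we add to keep very rare examples from dominating the resampled set:

```python
import numpy as np

def ips_resample(examples, propensities, seed=0, min_propensity=1e-3):
    """Resample the training set with replacement, weighting each example
    by the inverse of its estimated propensity score w_i."""
    rng = np.random.default_rng(seed)
    w = np.clip(np.asarray(propensities, dtype=float), min_propensity, None)
    probs = (1.0 / w) / (1.0 / w).sum()  # normalized inverse-propensity weights
    idx = rng.choice(len(examples), size=len(examples), replace=True, p=probs)
    return [examples[i] for i in idx]
```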
2.4 Feedback aware self-training (FAST)
The above IPS resampling procedure has several shortcomings, including the duplication of examples (Lee et al., 2021; Carlini et al., 2021) and the noise/bias inherent to estimated propensity scores (Pearl, 2009). Therefore our second method, though originating with the same motivation and tackling the same issue, takes an orthogonal approach. First, we use a separately trained model to produce multiple counterfactual target sequences for each context. Next, we filter the data so that each new target sequence expresses its intended linguistic attribute. Then, we retrain on the new counterfactually balanced dataset. In detail, the steps are:
1. Train a conditional language model (CLM) using the standard control code approach on $D_{tr}$, which is denoted as CLM_baseline.

2. Use CLM_baseline to generate multiple outputs for each context $x_i$, one output for every control code except that which corresponds to the ground truth attribute. For example, the set of control codes used for datum $i$ would be $\{c \in \{1, \ldots, K\} : c \neq a_i\}$.

3. Detect the linguistic attribute of the generation outputs with a classifier $C$, and filter out examples where the predicted attribute does not match the inputted control code.

4. Augment the original training set $D_{tr}$ with samples from Step 3 and retrain.
Intuitively, this procedure should also drive the propensity scores of the data towards uniform and break the unwanted correlations between contexts and attributes, since every context becomes paired with multiple targets, each having a unique attribute. Step 3 uses feedback from the classifier $C$ to remove noisy examples, preventing errors from propagating into the final model (§4.3). We experiment with classifiers $C$ that are given a priori, trained on $(y, a)$ pairs from the training data $D_{tr}$, and trained on a separate dataset having similar properties.
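Below is a compact sketch of steps 2 and 3, with step 4 left as a comment. The generation settings, the "§"-delimited input format, and the callable classifier interface are assumptions for illustration; the paper's actual pipeline may differ in its details:

```python
def fast_augment(clm, tokenizer, classify, dataset, control_codes):
    """Counterfactual generation (step 2) plus classifier filtering (step 3).

    clm:           trained CLM_baseline (e.g. a BART model with .generate())
    classify:      callable mapping generated text to a predicted attribute
    dataset:       list of (context x, target y, attribute a) triples
    control_codes: the full set of K attribute codes
    """
    augmented = []
    for x, y, a in dataset:
        for c in control_codes:
            if c == a:
                continue  # only generate for the non-observed attributes
            batch = tokenizer(f"{c} § {x}", return_tensors="pt", truncation=True)
            out = clm.generate(**batch, num_beams=4, max_length=64)
            y_cf = tokenizer.decode(out[0], skip_special_tokens=True)
            if classify(y_cf) == c:  # feedback: drop mismatched generations
                augmented.append((x, y_cf, c))
    # Step 4: retrain the CLM on dataset + augmented.
    return dataset + augmented
```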
3 Experimental Setup
We perform experiments in 3 important controllable generation settings: generating news headlines from article contents (controlling the headline lengths), generating the next sentence of a meta-review from preceding sentences and additional context (controlling the intent), and generating search ad copy from landing pages (controlling the rhetorical appeal of the ad). Our results suggest that the proposed methods can significantly improve the controllability and fluency of state-of-the-art baselines.
3.1 Datasets
We experiment using 3 datasets (Table 1) that reflect important real-world application scenarios for controllable generation systems.
First, we use the PENS dataset released by Microsoft News (Ao et al., 2021). This task involves generating news headlines from news articles, while using a binary control code "short" or "long" to control the length of the generated headline (useful for mobile and desktop rendering). We use a length threshold of 55 to determine the long/short status of existing headlines in the data. We evaluate on these data using (1) random train/dev/test splits, and (2) a "balanced" test set containing equal numbers of long and short headlines per article. These balanced-set headlines were sourced from 103 college students who wrote long or short headlines without seeing the original headlines, for an average of 3.7 headlines per article.
Second, we use the MReD dataset released by Shen et al. (2022). It consists of 4 years of ICLR meta-reviews, with each sentence manually annotated into one of 9 categories.
PENS
Category        train    dev      test rnd.  test bal.
Short           31,245   3,614    4,001      5,509
Long            57,351   6,666    7,074      5,509
Total           88,596   10,280   10,240     11,018

MReD
Category            train   dev     test
Weakness            1,491   200     200
Strength            757     200     200
Decision            716     200     200
Rebuttal process    674     200     200
Abstract            581     200     200
Suggestion          438     200     200
Rating summary      338     159     135
Misc                225     143     150
AC disagreement     24      18      18
Total               5,244   1,520   1,503

Search Ads (counts in thousands)
Category            train   dev     test
Product or Service  1,771   44.6    43.1
Call to action      1,207   37.5    36.6
Location            931     22.0    21.4
Highlight           851     32.0    30.8
Inventory           590     19.0    15.7
Brand name          466     11.9    11.0
Price               367     21.1    18.1
Benefit             309     8.6     8.6
Customer problem    156     3.7     3.9
Total               6,649   200.5   189.2

Table 1: Summary of the PENS (top), MReD (middle), and Search Ads (bottom, in thousands) datasets.
Using these data, our task follows the assisted writing scenario of Chen et al. (2019). We generate the $i$th sentence in the meta-review, controlling the intent of the generated sentence and conditioning on all preceding sentences and additional context (ratings, individual reviews). We reuse the original train/dev/test splits and randomly sample sentences with at least 4 words as the target sequences. For the training set, we pick one sentence per review. For the dev and test sets, we pick multiple sentences per review while ensuring a nearly equal number of samples per category. To detect the categories of generated sentences, we train a RoBERTa-base classifier on 37,252 sentences (a superset of our generation training set), achieving a macro-F1 of 79% on a hold-out test set, implying that it has strong generalization capabilities.
Finally, we use a Search Ads dataset consisting of landing pages, search advertisements for those landing pages, and labels classifying those ads into one of 9 common advertising strategies. Here, the goal is to generate search ads (title and description) from landing pages while controlling the rhetorical appeal of the ad copy (Golobokov et al., 2022). To obtain the category labels, we apply a BERT-base-uncased model (Devlin et al., 2019) trained on a separate dataset of 5,735 manually labeled ad-category pairs. This model achieves a macro-F1 score of 70% on a hold-out test set. Unlike the PENS data, the Search Ads data do not contain a balanced test set. However, the train, dev, and test splits for the ads data contain an average of 1.9, 2.3, and 2.6 ads from different categories per landing page, respectively, so there is a moderate degree of category depth.
3.2 Baselines
We compare against five baselines: an uncontrolled system to establish a lower bound on performance, and four recently published neural controllable generation systems.

Uncontrolled
We train BART-base (Lewis et al., 2020) for uncontrolled generation, where the model is only conditioned on the context.
BART+CTRL
We train BART-base for controllable generation using the standard control code approach (Keskar et al., 2019). The control code is represented as the name of the category ("long", "price", etc.). The paragraph symbol § is used as a delimiter to separate the control code from the context.
PPLM
We aim to enhance the controllability of BART+CTRL by further steering its decoding towards the desired attribute. PPLM achieves this by using gradients from an attribute classifier $p(a|y)$ to update the CLM's hidden representations (Dathathri et al., 2020).
GeDi
This is a state-of-the-art technique for controlling open-ended and non-conditional generation (Krause et al., 2021). We adapt its weighted decoding formula to our conditional generation setting by including a dependency on the context $x$:

$$p_w(y \mid x, c) \propto p(y \mid x) \, p(c \mid y)^{\omega}. \quad (1)$$

The key insight from GeDi is to compute $p(c \mid y)$ using Bayes rule (i.e., leveraging $p(y \mid c)$). We train two BART-base models for $p(y \mid x)$ and $p(y \mid c)$ using the same procedure as the BART+CTRL baseline. We pick $\omega = 4$ for PENS and MReD, and $\omega = 3.5$ for Ads based on a brief hyperparameter search.
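For intuition, here is a single-decoding-step sketch of how the terms in Eq. 1 could be combined in log space, with $p(c \mid y)$ obtained from the class-conditional model via Bayes rule under an assumed uniform prior over codes. GeDi proper accumulates class evidence over all generated tokens; this simplified per-step view is our own illustration, not the paper's implementation:

```python
import torch

def gedi_step_scores(logp_y_given_x, logp_y_given_c, desired_code, omega):
    """Score next-token candidates per Eq. 1, in log space.

    logp_y_given_x: [vocab] next-token log-probs from the p(y|x) model
    logp_y_given_c: [num_codes, vocab] next-token log-probs from the
                    p(y|c) model, one row per control code
    """
    # Bayes rule with a uniform prior over codes:
    # log p(c|y) = log p(y|c) - logsumexp_{c'} log p(y|c')
    log_p_c_given_y = logp_y_given_c - torch.logsumexp(logp_y_given_c, dim=0)
    return logp_y_given_x + omega * log_p_c_given_y[desired_code]
```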
GeDi+x
Our last baseline involves further adapting GeDi to our application domain by conditioning everything on the context $x$ as well as the control code $c$, i.e. we concatenate the control code $c$ and