FAST: Improving Controllability for Text Generation with Feedback Aware Self-Training

Junyi Chai, Reid Pryzant, Victor Ye Dong, Konstantin Golobokov, Chenguang Zhu, Yi Liu
Microsoft Corporation
{juchai,reidpryzant,victordong,kogolobo,chezhu,lewisliu}@microsoft.com
Abstract

Controllable text generation systems often leverage control codes to direct various properties of the output like style and length. Inspired by recent work on causal inference for NLP, this paper reveals a previously overlooked flaw in these control code-based conditional text generation algorithms. Spurious correlations in the training data can lead models to incorrectly rely on parts of the input other than the control code for attribute selection, significantly undermining downstream generation quality and controllability. We demonstrate the severity of this issue with a series of case studies and then propose two simple techniques to reduce these correlations in training sets. The first technique is based on resampling the data according to an example's propensity towards each linguistic attribute (IPS). The second produces multiple counterfactual versions of each example and then uses an additional feedback mechanism to remove noisy examples (feedback aware self-training, FAST). We evaluate on 3 tasks – news headline, meta review, and search ads generation – and demonstrate that FAST can significantly improve the controllability and language quality of generated outputs when compared to state-of-the-art controllable text generation approaches.
1 Introduction

In neural text generation, there is a growing interest in controlling the presence of particular linguistic attributes in the output text, for example sentiment, length, politeness, and topic (Sennrich et al., 2016; Kikuchi et al., 2016; Ficler and Goldberg, 2017; Shen et al., 2022). This is typically accomplished via control codes: categorical variables that represent the desired output property and are prepended to the model inputs during training and testing (Keskar et al., 2019).

This paper builds on recent work in text-based causal inference (Feder et al., 2021; Veitch et al., 2021; Pryzant et al., 2021) to reveal a previously overlooked flaw in control code-based text generation systems: spurious correlations in the data can cause models to incorrectly rely on parts of the input other than the control code for attribute selection, undermining downstream generation performance.

For example, consider a system that generates news headlines while conditioning on article text and a control code for headline length (e.g. long for desktop, short for mobile) as in Murao et al. (2019). We show in §4.1 that among publicly available news datasets, correlations exist between the contents of an article and the length of that article's title. Longer articles or articles about technical topics may be associated with longer titles. This leads NLP models to struggle at generating short titles from "long title"-looking articles.
We show how this phenomenon can introduce confounding statistical relationships in the data, leading to assumption violations and significantly degrading model quality. Then we propose two simple data augmentation techniques for mitigating the issue. Both algorithms operate by breaking these spurious correlations and isolating the statistical relationship between control codes and linguistic attributes. In the first approach, we resample the training set according to an inverse propensity score (IPS, Robins et al. (1994)), boosting the presence of rare context-attribute combinations in the data. In the second approach (FAST) we train a preliminary model, use counterfactual data augmentation to generate all possible attributes for each example, then retrain on the counterfactually balanced dataset, as illustrated in Figure 1.

We conduct experiments in 3 conditional text generation scenarios: generating news headlines from article contents (controlling the headline lengths), generating the next sentence from preceding sentences (controlling the intent), and generating search ad copy from landing pages (controlling the rhetorical appeal of the ad).
[Figure 1: Illustration of the FAST algorithm. The four panels show: (1) train BART for conditional text generation on (control code $c$, context $x$) → target $y$; (2) generate counterfactual targets $y'$, $y''$, $y'''$ from the same context using different control codes $c'$, $c''$, $c'''$; (3) use a classifier $\mathcal{C}$ to detect the attribute of each generated target and filter out noisy examples whose detected attribute does not match its control code; (4) retrain BART on the original data augmented with the retained counterfactuals.]
Our results suggest that FAST can significantly improve the controllability and language quality of state-of-the-art controllable generation systems.

In summary, our contributions are:

• Identifying an important flaw with recent controllable text generation techniques and showing how this flaw can undermine model performance.

• A pair of simple yet effective data augmentation algorithms for dealing with this issue.

• Results and analysis demonstrating the efficacy of the proposed algorithms, and their ability to significantly improve controllability and language quality over state-of-the-art baselines.
2 Spuriously Correlated Control Codes
2.1 Controllable Generation
We focus on the case of conditional text generation, where the training data $D_{tr} = \{(x_1, y_1, a_1), \ldots, (x_n, y_n, a_n)\}$ is a collection of triples consisting of context text $x$, output text $y$, and output linguistic attribute $a$. Note that in many practical scenarios, $a$ is inferred from $y$ by a classifier $C(y)$, e.g. a rule or deep learning model. The goal is to learn a conditional language model (CLM) for $p(y|x, a)$, i.e. a text generation system which conditions on the context and attribute to generate texts that express the desired linguistic attribute.
In practice, the linguistic attributes $a$ are operationalized as control code tokens $c$ which are in one-to-one correspondence with the attributes (e.g. "short", "long") and prepended to the context $x$ before model input. This approach has been shown to be effective in both non-conditional (Keskar et al., 2019; Ficler and Goldberg, 2017) and conditional (Shen et al., 2022; Fan et al., 2018) controllable text generation.
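To make the setup concrete, here is a minimal sketch of how a control code might be prepended to the context before tokenization. The helper function, the § delimiter (borrowed from the BART+CTRL baseline in §3.2), and the example strings are illustrative assumptions rather than the paper's exact preprocessing:

```python
from transformers import BartTokenizer

def build_input(control_code: str, context: str, delimiter: str = "§") -> str:
    """Prepend the control code token to the context (hypothetical helper)."""
    return f"{control_code} {delimiter} {context}"

tokenizer = BartTokenizer.from_pretrained("facebook/bart-base")

# A "short" control code asks the model for a short headline.
model_input = tokenizer(
    build_input("short", "Researchers unveiled a new battery design that ..."),
    return_tensors="pt",
    truncation=True,
)
```

At training time the same format is used with the gold attribute as the control code; at test time the code can be swapped to request a different attribute.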
2.2 Spurious Correlations
In theory, the correspondence between the control code $c$ and linguistic attribute $a$ should cause models to rely on the control code to determine the linguistic properties of the generated output. This paper argues that in practice, parts of the context $x$ may be spuriously correlated with the attribute $a$, undermining the consistency and efficacy of control code-based systems.
These spurious correlations between the contexts and attributes have a causal interpretation that explains how they can undermine model performance. The issue is that $p(a|x) \neq p(a)$, which is similar to a violation of the ignorability assumption in causal inference (Feder et al., 2021). This implies that any spurious correlations between the context $x$ and target attribute $a$ could represent backdoor paths that confound the model's learned relationship between the control code $c$ and the target attributes $a$. Thus, models are likely to depend on context beyond the control code when determining output attributes, making them less likely to generalize to rare context/control-code combinations.
In this paper, we aim to break up these backdoor paths and prevent the model from learning spurious correlations. We accomplish this by modifying the training data in two ways such that $p(a|x) \approx p(a)$, with both techniques isolating the relationship between the control codes and target attributes.
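As a rough diagnostic of whether $p(a|x)$ deviates from $p(a)$ in a given corpus, one can check how well a simple classifier predicts the attribute from the context alone; accuracy well above the majority-class rate signals a spurious correlation. The sketch below is our own illustration (a bag-of-words probe, not a method from the paper) and assumes integer-coded attribute labels:

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

def spurious_correlation_probe(contexts, attributes):
    """Estimate how predictable attributes a are from contexts x alone.

    contexts:   list of raw context strings
    attributes: integer-coded attribute label per context
    """
    X = TfidfVectorizer(max_features=20_000).fit_transform(contexts)
    probe_acc = cross_val_score(
        LogisticRegression(max_iter=1000), X, attributes, cv=5
    ).mean()
    # Accuracy of always predicting the marginal mode, i.e. using p(a) only.
    majority_acc = np.bincount(attributes).max() / len(attributes)
    return probe_acc, majority_acc
```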
2.3 Inverse propensity score (IPS) resampling
The first method we investigate for breaking the aforementioned spurious correlations leverages propensity scores. A propensity score is the conditional probability of an example being assigned to a treatment, given background variables (Rosenbaum and Rubin, 1983). It plays a central role in causal inference for dealing with spurious correlations in observational data and is therefore a natural choice for us to try. In our case, the propensity score for the $i$th example is the conditional probability of the output text exhibiting linguistic attribute $a_i$ given the context $x_i$. This can be written as

$$w_i = p(a = a_i \mid x_i).$$

Intuitively, examples with low propensity scores represent rare attribute-context combinations that are especially important to learn (Tu et al., 2020). Therefore, our procedure works by resampling the data with replacement, setting the sample weight of the $i$th example to $1/w_i$. The procedure should work because the propensity scores of the resampled data should be close to uniform: each example's contribution becomes proportional to $p(a_i \mid x_i) \cdot \frac{1}{w_i} = w_i / w_i = 1$.
In practice, we train a model to estimate propensity scores. For the experiments we fine-tune RoBERTa (Liu et al., 2019) as a sequence classifier using $\{(x_1, a_1), \ldots, (x_n, a_n)\}$. We then use the model's predicted probability for the observed category $a_i$ as the propensity score estimate. We will refer to this estimator as $S(a|x)$.
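A minimal sketch of the resampling step, assuming the propensity estimates $w_i = S(a_i|x_i)$ have already been obtained from the fine-tuned classifier; the clipping threshold is an assumption we add to keep very rare examples from dominating the resampled set:

```python
import numpy as np

def ips_resample(examples, propensities, seed=0, min_propensity=1e-3):
    """Resample the training set with replacement, weighting each example
    by the inverse of its estimated propensity score w_i."""
    rng = np.random.default_rng(seed)
    w = np.clip(np.asarray(propensities, dtype=float), min_propensity, None)
    probs = (1.0 / w) / (1.0 / w).sum()  # normalized inverse-propensity weights
    idx = rng.choice(len(examples), size=len(examples), replace=True, p=probs)
    return [examples[i] for i in idx]
```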
2.4 Feedback aware self-training (FAST)
The above IPS resampling procedure has several shortcomings, including the duplication of examples (Lee et al., 2021; Carlini et al., 2021) and the noise/bias inherent to estimated propensity scores (Pearl, 2009). Therefore our second method, though originating with the same motivation and tackling the same issue, takes an orthogonal approach. First, we use a separately trained model to produce multiple counterfactual target sequences for each context. Next, we filter the data so that each new target sequence expresses its intended linguistic attribute. Then, we retrain on the new counterfactually balanced dataset. In detail, the steps are:
1. Train a conditional language model (CLM) using the standard control code approach on $D_{tr}$, which is denoted as CLM_baseline.

2. Use CLM_baseline to generate multiple outputs for each context $x_i$, one output for every control code except that which corresponds to the ground truth attribute. For example, the set of control codes used for datum $i$ would be $\{c \in \{1, \ldots, K\} : c \neq a_i\}$.

3. Detect the linguistic attribute of the generation outputs with a classifier $C$, and filter out examples where the predicted attribute does not match the inputted control code.

4. Augment the original training set $D_{tr}$ with samples from Step 3 and retrain.
Intuitively, this procedure should also drive the propensity scores of the data towards uniform and break the unwanted correlations between contexts and attributes, since every context becomes paired with multiple targets, each having a unique attribute. Step 3 uses feedback from the classifier $C$ to remove noisy examples, preventing errors from propagating into the final model (§4.3). We experiment with classifiers $C$ that are given a priori, trained on $(y, a)$ pairs from the training data $D_{tr}$, and trained on a separate dataset having similar properties.
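Below is a compact sketch of steps 2 and 3, with step 4 left as a comment. The generation settings, the "§"-delimited input format, and the callable classifier interface are assumptions for illustration; the paper's actual pipeline may differ in its details:

```python
def fast_augment(clm, tokenizer, classify, dataset, control_codes):
    """Counterfactual generation (step 2) plus classifier filtering (step 3).

    clm:           trained CLM_baseline (e.g. a BART model with .generate())
    classify:      callable mapping generated text to a predicted attribute
    dataset:       list of (context x, target y, attribute a) triples
    control_codes: the full set of K attribute codes
    """
    augmented = []
    for x, y, a in dataset:
        for c in control_codes:
            if c == a:
                continue  # only generate for the non-observed attributes
            batch = tokenizer(f"{c} § {x}", return_tensors="pt", truncation=True)
            out = clm.generate(**batch, num_beams=4, max_length=64)
            y_cf = tokenizer.decode(out[0], skip_special_tokens=True)
            if classify(y_cf) == c:  # feedback: drop mismatched generations
                augmented.append((x, y_cf, c))
    # Step 4: retrain the CLM on dataset + augmented.
    return dataset + augmented
```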
3 Experimental Setup
We perform experiments in 3 important controllable generation settings: generating news headlines from article contents (controlling the headline lengths), generating the next sentence of a meta-review from preceding sentences and additional context (controlling the intent), and generating search ad copy from landing pages (controlling the rhetorical appeal of the ad). Our results suggest that the proposed methods can significantly improve the controllability and fluency of state-of-the-art baselines.
3.1 Datasets
We experiment using 3 datasets (Table 1) that reflect important real-world application scenarios for controllable generation systems.
First, we use the PENS dataset released by Microsoft News (Ao et al., 2021). This task involves generating news headlines from news articles, while using a binary control code "short" or "long" to control the length of the generated headline (useful for mobile and desktop rendering). We use a length threshold of 55 to determine the long/short status of existing headlines in the data. We evaluate on these data using (1) random train/dev/test splits, and (2) a "balanced" test set containing equal numbers of long and short headlines per article. These balanced-set headlines were sourced from 103 college students who wrote long or short headlines without seeing the original headlines, for an average of 3.7 headlines per article.
Second, we use the MReD dataset released by Shen et al. (2022). It consists of 4 years of ICLR meta-reviews, with each sentence manually annotated into one of 9 categories.
PENS
Category        train    dev      test rnd.  test bal.
Short           31,245   3,614    4,001      5,509
Long            57,351   6,666    7,074      5,509
Total           88,596   10,280   10,240     11,018

MReD
Category            train   dev     test
Weakness            1,491   200     200
Strength            757     200     200
Decision            716     200     200
Rebuttal process    674     200     200
Abstract            581     200     200
Suggestion          438     200     200
Rating summary      338     159     135
Misc                225     143     150
AC disagreement     24      18      18
Total               5,244   1,520   1,503

Search Ads (counts in thousands)
Category            train   dev     test
Product or Service  1,771   44.6    43.1
Call to action      1,207   37.5    36.6
Location            931     22.0    21.4
Highlight           851     32.0    30.8
Inventory           590     19.0    15.7
Brand name          466     11.9    11.0
Price               367     21.1    18.1
Benefit             309     8.6     8.6
Customer problem    156     3.7     3.9
Total               6,649   200.5   189.2

Table 1: Summary of the PENS (top), MReD (middle), and Search Ads (bottom, in thousands) datasets.
Using these data, our task follows the assisted writing scenario of Chen et al. (2019). We generate the $i$th sentence in the meta-review, controlling the intent of the generated sentence and conditioning on all preceding sentences and additional context (ratings, individual reviews). We reuse the original train/dev/test splits and randomly sample sentences with at least 4 words as the target sequences. For the training set, we pick one sentence per review. For the dev and test sets, we pick multiple sentences per review while ensuring a nearly equal number of samples per category. To detect the categories of generated sentences, we train a RoBERTa-base classifier on 37,252 sentences (a superset of our generation training set), achieving a macro-F1 of 79% on a hold-out test set, implying that it has strong generalization capabilities.
Finally, we use a Search Ads dataset consisting of landing pages, search advertisements for those landing pages, and labels classifying those ads into one of 9 common advertising strategies. Here, the goal is to generate search ads (title and description) from landing pages while controlling the rhetorical appeal of the ad copy (Golobokov et al., 2022). To obtain the category labels, we apply a BERT-base-uncased model (Devlin et al., 2019) trained on a separate dataset of 5,735 manually labeled ad-category pairs. This model achieves a macro-F1 score of 70% on a hold-out test set. Unlike the PENS data, the Search Ads data do not contain a balanced test set. However, the train, dev, and test splits for the ads data contain an average of 1.9, 2.3, and 2.6 ads from different categories per landing page, respectively, so there is a moderate degree of category depth.
3.2 Baselines
We compare against five baselines: an uncontrolled system to establish a lower bound on performance, and four recently published neural controllable generation systems.

Uncontrolled
We train BART-base (Lewis et al., 2020) for uncontrolled generation, where the model is only conditioned on the context.
BART+CTRL
We train BART-base for controllable generation using the standard control code approach (Keskar et al., 2019). The control code is represented as the name of the category ("long", "price", etc.). The paragraph symbol § is used as a delimiter to separate the control code from the context.
PPLM
We aim to enhance the controllability of BART+CTRL by further steering its decoding towards the desired attribute. PPLM achieves this by using gradients from an attribute classifier $p(a|y)$ to update the CLM's hidden representations (Dathathri et al., 2020).
GeDi
This is a state-of-the-art technique for controlling open-ended and non-conditional generation (Krause et al., 2021). We adapt its weighted decoding formula to our conditional generation setting by including a dependency on the context $x$:

$$p_w(y \mid x, c) \propto p(y \mid x) \, p(c \mid y)^{\omega}. \quad (1)$$

The key insight from GeDi is to compute $p(c \mid y)$ using Bayes rule (i.e., leveraging $p(y \mid c)$). We train two BART-base models for $p(y \mid x)$ and $p(y \mid c)$ using the same procedure as the BART+CTRL baseline. We pick $\omega = 4$ for PENS and MReD, and $\omega = 3.5$ for Ads based on a brief hyperparameter search.
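For intuition, here is a single-decoding-step sketch of how the terms in Eq. 1 could be combined in log space, with $p(c \mid y)$ obtained from the class-conditional model via Bayes rule under an assumed uniform prior over codes. GeDi proper accumulates class evidence over all generated tokens; this simplified per-step view is our own illustration, not the paper's implementation:

```python
import torch

def gedi_step_scores(logp_y_given_x, logp_y_given_c, desired_code, omega):
    """Score next-token candidates per Eq. 1, in log space.

    logp_y_given_x: [vocab] next-token log-probs from the p(y|x) model
    logp_y_given_c: [num_codes, vocab] next-token log-probs from the
                    p(y|c) model, one row per control code
    """
    # Bayes rule with a uniform prior over codes:
    # log p(c|y) = log p(y|c) - logsumexp_{c'} log p(y|c')
    log_p_c_given_y = logp_y_given_c - torch.logsumexp(logp_y_given_c, dim=0)
    return logp_y_given_x + omega * log_p_c_given_y[desired_code]
```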
GeDi+x
Our last baseline involves further adapting GeDi to our application domain by conditioning everything on the context $x$ as well as the control code $c$, i.e. we concatenate the control code $c$ and