
On Text Style Transfer via Style Masked Language Models
Sharan Narasimhan, Pooja Shekar, Suvodip Dey, Maunendra Sankar Desarkar
Department of Computer Science and Engineering
Indian Institute of Technology Hyderabad, India
{sharan.n21, poojashekar15}@gmail.com
{cs19resch01003, maunendra}@iith.ac.in
Abstract
Text Style Transfer (TST) can be performed through approaches such as latent space disentanglement, cycle-consistency losses, prototype editing, etc. The prototype editing approach, which is known to be quite successful in TST, involves two key phases: a) masking of source-style-associated tokens and b) reconstruction of this source-style-masked sentence conditioned on the target style. We follow a similar transduction method, in which we transpose the more difficult direct source-to-target TST task into a simpler Style-Masked Language Model (SMLM) task, wherein, similar to BERT (Devlin et al., 2019a), the goal of our model is to reconstruct the source sentence from its style-masked version. We arrive at the SMLM mechanism naturally by formulating prototype editing/transduction methods in a probabilistic framework, where TST resolves into estimating a hypothetical parallel dataset from a partially observed parallel dataset, wherein each domain is assumed to share a common latent style-masked prior. To generate this style-masked prior, we use “Explainable Attention” as our choice of attribution for a more precise style-masking step, and also introduce a cost-effective and accurate “Attribution-Surplus” method of determining the positions of masks from any arbitrary attribution model in O(1) time. We empirically show that this non-generational approach suits the “content preserving” criterion of TST well, even for a complex style like Discourse Manipulation. Our model, the Style MLM, outperforms strong TST baselines and is on par with state-of-the-art TST models, which use complex architectures and orders of magnitude more parameters.
1 Introduction
Text Style Transfer (TST) can be thought of as a subset of the Controllable Language Generation Task (Hu et al., 2017) with tighter criteria. TST involves converting input text possessing some source style into an output possessing the target style, while preserving style-independent content and maintaining fluency. Style is usually defined by the class labels present in an annotated dataset, commonly considered to be Sentiment, Formality, Toxicity, etc. We consider the unsupervised setting where only non-parallel datasets are available, i.e. the style-transferred output is not available. Past work focuses on various common paradigms such as disentanglement (Hu et al., 2017; Shen et al., 2017), cycle-consistency losses (Yi et al., 2020; Luo et al., 2019; Dai et al., 2019; Liu et al., 2021), induction (Narasimhan et al., 2022; Shen et al., 2020), etc. We focus on a sentence-to-sentence “transduction” (or prototype editing) method, a solution that emerges naturally when following a probabilistic formulation consisting of a single transduction model with a latent prior given by a style-absent corpus. Our approach is also inspired by BERT’s MLM (Devlin et al., 2019a)*, modified to incorporate style information to enable TST. This notion can be seen from various viewpoints.
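To make the transduction view concrete, the following toy sketch (purely illustrative, not the paper's implementation) traces a source sentence through a style-masked prototype and a target-style-conditioned reconstruction; the hard-coded token sets and the fill_in rule are hypothetical stand-ins for a trained SMLM.

```python
# Illustrative only: a hand-crafted stand-in for the SMLM transduction pipeline.
STYLE_TOKENS = {"neg": {"bland", "terrible"}, "pos": {"amazing", "delicious"}}

def style_mask(tokens, source_style):
    """Replace source-style-bearing tokens with a [MASK] placeholder."""
    return ["[MASK]" if t in STYLE_TOKENS[source_style] else t for t in tokens]

def fill_in(masked_tokens, target_style):
    """Toy stand-in for the SMLM: fill each mask with a target-style token."""
    fillers = iter(sorted(STYLE_TOKENS[target_style]))
    return [next(fillers, "[MASK]") if t == "[MASK]" else t for t in masked_tokens]

source = "the pasta was bland and the service was terrible".split()
prototype = style_mask(source, source_style="neg")    # style-absent prototype
transferred = fill_in(prototype, target_style="pos")  # target-conditioned fill-in
print(" ".join(prototype))    # the pasta was [MASK] and the service was [MASK]
print(" ".join(transferred))  # the pasta was amazing and the service was delicious
```

Because only the masked positions change between source and output, style-independent content is carried over unchanged by construction.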
Another MLM. By making use of various text attribution models, it is possible to determine the fraction of tokens that contribute significantly to the style label. Masking out these tokens and training a bi-directional transformer encoder with self-attention on the task of reconstructing the original source-style sentence is intuitive and closely resembles BERT-based MLMs (Devlin et al., 2019a; Jin et al., 2020).
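A minimal sketch of this masking step is given below, assuming some per-token attribution scorer is available; attribute(), build_smlm_example() and the mask_frac parameter are hypothetical illustrations rather than the paper's actual procedure.

```python
# A minimal sketch, assuming per-token style-attribution scores from some classifier
# (e.g. attention- or gradient-based). attribute() is a hypothetical placeholder.
def attribute(tokens):
    lexicon = {"terrible": 0.9, "rude": 0.85, "bland": 0.8}
    return [lexicon.get(t, 0.05) for t in tokens]

def build_smlm_example(sentence, style_code, mask_frac=0.2, mask_token="[MASK]"):
    tokens = sentence.split()
    scores = attribute(tokens)
    k = max(1, round(mask_frac * len(tokens)))
    top = set(sorted(range(len(tokens)), key=lambda i: scores[i], reverse=True)[:k])
    masked = [mask_token if i in top else t for i, t in enumerate(tokens)]
    # Encoder input: a style control code (the source style during training, the
    # target style at transfer time) prepended to the style-masked sentence;
    # the reconstruction target is the original sentence.
    return [style_code] + masked, tokens

inp, tgt = build_smlm_example("the waiter was rude and the soup was bland", "<neg>")
print(inp)  # ['<neg>', 'the', 'waiter', 'was', '[MASK]', 'and', 'the', 'soup', 'was', '[MASK]']
print(tgt)  # the original tokens, used as the reconstruction target
```

Trained on such pairs with the source-style code, the model performs standard masked-language-model reconstruction; swapping in a different control code at inference is what turns it into a rough TST model.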
Instead of random masking or perplexity-based masking (Jin et al., 2020), by incorporating style masking and concatenating control codes that inform the encoder about the required target style, the resultant MLM now resembles a rough TST model. This notion of "filling in only the style words" also does well to automatically ensure that
*A key difference from BERT: ours does not use the self-supervised approach of pre-training over large corpora and then fine-tuning for a downstream task.