On Text Style Transfer via Style Masked Language Models
Sharan Narasimhan, Pooja Shekar, Suvodip Dey, Maunendra Sankar Desarkar
Department of Computer Science and Engineering
Indian Institute of Technology Hyderabad, India
{sharan.n21, poojashekar15}@gmail.com
{cs19resch01003, maunendra}@iith.ac.in
Abstract

Text Style Transfer (TST) can be performed through approaches such as latent space disentanglement, cycle-consistency losses, prototype editing, etc. The prototype editing approach, which is known to be quite successful in TST, involves two key phases: a) masking of source-style-associated tokens and b) reconstruction of this source-style-masked sentence conditioned on the target style. We follow a similar transduction method, in which we transpose the more difficult direct source-to-target TST task into a simpler Style-Masked Language Model (SMLM) task, wherein, similar to BERT (Devlin et al., 2019a), the goal of our model is to reconstruct the source sentence from its style-masked version. We arrive at the SMLM mechanism naturally by formulating prototype editing/transduction methods in a probabilistic framework, where TST resolves into estimating a hypothetical parallel dataset from a partially observed parallel dataset, wherein each domain is assumed to have a common latent style-masked prior. To generate this style-masked prior, we use "Explainable Attention" as our choice of attribution for a more precise style-masking step, and also introduce a cost-effective and accurate "Attribution-Surplus" method of determining the positions of masks from any arbitrary attribution model in O(1) time. We empirically show that this non-generative approach suits the "content preserving" criterion well for a task like TST, even for a complex style like Discourse Manipulation. Our model, the Style MLM, outperforms strong TST baselines and is on par with state-of-the-art TST models, which use complex architectures and orders of magnitude more parameters.
1 Introduction

Text Style Transfer (TST) can be thought of as a subset of the Controllable Language Generation task (Hu et al., 2017) with tighter criteria. TST involves converting input text possessing some source style into an output possessing the target style, while preserving style-independent content and maintaining fluency. Style is usually defined by the class labels present in an annotated dataset, commonly considered to be Sentiment, Formality, Toxicity, etc. We consider the unsupervised setting where only non-parallel datasets are available, i.e., the style-transferred output is not available. Past work focuses on various common paradigms such as disentanglement (Hu et al., 2017; Shen et al., 2017), cycle-consistency losses (Yi et al., 2020; Luo et al., 2019; Dai et al., 2019; Liu et al., 2021), induction (Narasimhan et al., 2022; Shen et al., 2020), etc. We focus on a sentence-to-sentence "transduction" (or prototype editing) method, a solution which naturally emerges when following a probabilistic formulation consisting of a single transduction model with a latent prior over a style-absent corpus. Our approach is also inspired by BERT's MLM (Devlin et al., 2019a)*, modified to incorporate style information to enable TST. This notion can be seen from various viewpoints.
Another MLM. By making use of various text attribution models, it is possible to determine the fraction of tokens that contribute significantly to the style label. Masking out these tokens and training a bi-directional transformer encoder with self-attention on the task of reconstructing the original source-style sentence is intuitive and closely resembles BERT-based MLMs (Devlin et al., 2019a; Jin et al., 2020). Instead of random masking or perplexity-based masking (Jin et al., 2020), incorporating style-masking and concatenating control codes that inform the encoder about the desired target style makes the resultant MLM resemble a rough TST model. This notion of "filling in only the style words" also automatically ensures that other content words are preserved and the fluency of the original sentence is maintained. Unlike most generative approaches, this method has an easier objective, inferring what to fill in by looking at both directions, similar to BERT.

* A key difference to BERT: ours does not use a self-supervised approach of pre-training over large corpora and fine-tuning for a downstream task.

Task: Positive to Negative
  Input:         this movie is by far one of the best urban crime dramas i ve seen .
  Style Masked:  this movie is by mask one of the mask urban crime mask i 've seen .
  Output:        this movie is by far one of the worst urban crime garbage i ve seen .

Task: Contradiction to Entailment
  Input:         a woman is sitting outside at a table using a knife to cut into a sandwich . a woman is sitting inside
  Style Masked:  a woman is sitting outside at a table using a knife to cut into a sandwich . a woman mask mask mask
  Output:        a woman is sitting outside at a table using a knife to cut into a sandwich . a woman is a outside

Table 1: Examples of Sentiment and Discourse TST by the SMLM on the IMDb and SNLI datasets, respectively.
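To make the masking-plus-control-code idea concrete, the following is a minimal sketch (not the authors' released code) of how a style-masked MLM input could be assembled. The mask token, control-code format, attribution scores, and threshold value are illustrative assumptions.

```python
from typing import List

MASK_TOKEN = "<mask>"   # illustrative; any reserved vocabulary item would do

def build_smlm_input(tokens: List[str],
                     attributions: List[float],
                     threshold: float,
                     target_style: str) -> List[str]:
    """Replace style-heavy tokens with a mask token and prepend a control code
    telling the encoder which style the reconstruction should carry."""
    masked = [MASK_TOKEN if a >= threshold else t
              for t, a in zip(tokens, attributions)]
    return [f"<{target_style}>"] + masked

# Toy example mirroring the first column of Table 1: "far", "best" and "dramas"
# carry the highest (made-up) attribution scores and get masked; at transfer
# time the control code is simply set to the *target* style, here "negative".
tokens = "this movie is by far one of the best urban crime dramas".split()
attrs  = [0.02, 0.03, 0.02, 0.03, 0.20, 0.03, 0.02, 0.03, 0.45, 0.04, 0.05, 0.30]
print(build_smlm_input(tokens, attrs, threshold=0.15, target_style="negative"))
```

During training the control code would match the source style and the reconstruction target would be the original sentence; only at transfer time is the code swapped.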
Parallel vs Non-Parallel. A "parallel" dataset, in which every instance has associated sentences for all given style labels, is naturally much easier to salvage but harder to come by. In this case, TST resolves into the supervised task of simply learning the input distribution corresponding to this parallel dataset. In reality, most available datasets are non-parallel/unsupervised (e.g., Yelp). TST models must therefore infer the notion of "style" present in the dataset from non-aligned sentences of the same styles, differing in content, with no direct supervision signal.
A Hypothetical Parallel. In this case, TST resolves into the task of estimating the missing portion of a hypothetical parallel dataset without explicit knowledge of the complete input distribution, as in the parallel case. We adopt a probabilistic formulation of TST in which we assume each domain in this hypothetical parallel dataset originates from a common discrete latent prior, a latent style-masked prior in the case of the SMLM. Salvaging a non-parallel dataset to learn a simple "same-style reconstruction" task with the SMLM also indirectly approximates the cross-domain TST task, which we then only need to fine-tune for a single epoch to make it comparable to larger SOTA models.
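As a rough illustration of how this same-style reconstruction task can be derived from a non-parallel corpus, the sketch below (an assumption-laden outline, not the paper's pipeline) builds training pairs that map a source-style control code plus a style-masked sentence back to the original sentence; `mask_style_tokens` stands in for whatever style-masking step is used.

```python
from typing import Callable, List, Tuple

Token = str

def build_reconstruction_pairs(
    corpus: List[Tuple[List[Token], str]],
    mask_style_tokens: Callable[[List[Token], str], List[Token]],
) -> List[Tuple[List[Token], List[Token]]]:
    """Turn a non-parallel corpus of (tokens, style) into same-style
    reconstruction pairs: (style code + masked tokens) -> original tokens."""
    pairs = []
    for tokens, style in corpus:
        masked = mask_style_tokens(tokens, style)   # sample of the style-masked prior
        source = [f"<{style}>"] + masked            # condition on the SAME (source) style
        pairs.append((source, tokens))              # reconstruction target
    return pairs

# Toy usage with a stub masker that hides one hypothetical "style word".
corpus = [("the food was great".split(), "positive"),
          ("the service was awful".split(), "negative")]
stub = lambda toks, style: ["<mask>" if t in {"great", "awful"} else t for t in toks]
print(build_reconstruction_pairs(corpus, stub))
```

At transfer time the same model is queried with the control code of the target style instead, which is what lets a non-parallel corpus stand in for the unseen half of the hypothetical parallel dataset.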
Contributions. a) We introduce the Style Masked Language Model (SMLM), another type of MLM capable of being a strong TST model, obtained by initially pretraining on an unsupervised same-style reconstruction task and then fine-tuning for TST metrics. b) We use an accurate and effective style-masking step and present an analysis to support our choice. c) We empirically show that the SMLM beats strong baselines and compares to the state of the art while using far fewer parameters. d) We adopt a novel "Discourse" TST task by salvaging NLI datasets and show that our style-masking is also capable of handling more complex "inter-sentence" styles to manipulate the flow of logic.
2 Related Work

Jin et al. (2021) and Hu et al. (2020) provide surveys detailing the current state of TST and lay down useful taxonomies to structure the field. In this section, we only discuss recent work similar to ours (prototype editing approaches) assuming the same unsupervised setting.

Li et al. (2018) present the earliest known work using the prototype editing method, in which a "delete" operation is performed on tokens based on simple count-based methods, and the retrieval of the target word is done by considering TF-IDF weighted word overlap. Malmi (2020) first train MLMs for the source and target domains and perform TST by masking the text spans where the two models disagree (in terms of perplexity) the most, and then use the target-domain MLM to fill these spans. Wu et al. (2019b) introduce the Attribute-Conditional MLM, which most closely aligns with the workings of our SMLM: it also uses an attention classifier for attribution scores, a count- and frequency-based method to perform masking, and a pretrained BERT model fine-tuned on the TST task. Lee (2020) and Wu et al. (2020) also follow roughly the same pipeline but use a generative transformer encoder-decoder approach and also fine-tune using signals from a pretrained classifier. Wu et al. (2019a) use a hierarchical reinforced sequence operation method to iteratively revise the words of the original sentence. Madaan et al. (2020) use n-gram TF-IDF based methods to identify style tokens and mark them as either "add" or "replace" TAG tokens, which are then substituted by the decoder to perform TST. Similar to the SMLM, Xu et al. (2018) also use attribution-based methods from a self-attention classifier. However, they use an LSTM-based approach (Hochreiter and Schmidhuber, 1997), with one model to generate sentences for each domain.
3 Method

3.1 Notation

Let $S$ denote the set of all style labels for a supervised dataset $D$ of the form $\{(x_0, l_0), (x_1, l_1), \dots, (x_n, l_n)\}$, where $x_i$ denotes the input sentence and $l_i \in S$ denotes the label corresponding to $x_i$. The set of all sentences of style $s$ in $D$ is denoted by $\hat{x}_s = \{x_j : \forall j \text{ where } (x_j, s) \in D\}$. We use a special meta label $m_s$ to represent the "style-masked" label class having $s$ as the original style. Subsequently, $x_{m_s}$ refers to the set of all style-masked sentences which had source style $s$. The set of all style-masked sentences from $D$ is given by $x_m = \bigcup_{s \in S} x_{m_s}$. Let $\bar{x}_s$ denote the model's estimation of $\hat{x}_s$.
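For readers who prefer code to notation, the following small sketch (illustrative only; names and data are not from the paper) writes the sets above out as plain data structures.

```python
from collections import defaultdict

# A toy supervised dataset D = {(x_i, l_i)}: tokenised sentences with style labels.
D = [("this movie was great".split(), "positive"),
     ("this movie was awful".split(), "negative"),
     ("what a lovely film".split(), "positive")]

S = {l for _, l in D}          # S: the set of all style labels

x_hat = defaultdict(list)      # x_hat[s]: all sentences of style s in D
for x_j, s in D:
    x_hat[s].append(x_j)

# x_{m_s} would hold the style-masked versions of x_hat[s] (masking step omitted
# here), and x_m is the union of x_{m_s} over all styles s in S.
print(S, dict(x_hat))
```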
Probabilistic Overview. We treat the non-parallel dataset as a partially observed parallel dataset, where the SMLM has to estimate the entire joint distribution $P(\bar{x}_s, \hat{x}_s)$. The SMLM assumes that every sentence from each style domain is the result of a) sampling from a latent style-masked prior, $p(x_{m_s})$, which we estimate using the Style-Masking model, $p(x_{m_s} \mid x_s, \theta_{SMM})$ (the conditional posterior), and is then b) conditionally reconstructed to form the sentence with the target style using the SMLM, $q(\bar{x}_{\bar{s}} \mid \bar{s}, x_{m_s})$ (the style-conditioned likelihood). The posterior in this case is easily estimated using various intuitive methods. The likelihood function, represented by the SMLM, has the simple task of performing same-style reconstruction on a non-parallel dataset. Using control codes to guide which domain to sample from enables the SMLM to also estimate the unseen section of the hypothetical parallel dataset and therefore work as a rough TST model. The probabilistic formulation of the overall model is given in Fig. 1.
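Putting the two steps together, one way to write the induced transfer distribution under the assumptions above is sketched below. Since the paper defers the full formulation to Fig. 1, the exact form here is our hedged reconstruction, not a quotation.

```latex
% Sketch of the assumed factorisation: the style-masked sentence x_{m_s}
% acts as a shared latent variable between the observed source-style
% sentence x_s and the estimated target-style sentence \bar{x}_{\bar{s}}.
\begin{equation*}
P\big(\bar{x}_{\bar{s}} \mid x_s\big)
  \;\approx\; \sum_{x_{m_s}}
  \underbrace{q\big(\bar{x}_{\bar{s}} \mid \bar{s},\, x_{m_s}\big)}_{\text{SMLM (likelihood)}}
  \;\underbrace{p\big(x_{m_s} \mid x_s,\, \theta_{SMM}\big)}_{\text{style-masking (posterior)}}
\end{equation*}
```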
3.2 Explainable Attention as Attribution Markers

Many prototype editing methods use Vanilla Attention (VA) scores as attribution scores (Wu et al., 2019b; Zhang et al., 2018; Wu et al., 2020). However, it has been shown that attention is not explanation, at least for the purpose of human-aligned interpretability (Jain and Wallace, 2019): VA does not correlate well with other well-known attribution methods (such as Integrated Gradients; Sundararajan et al., 2017). We instead use "Explainable Attention" (EA) scores from a Diversity-LSTM classifier (Mohankumar et al., 2020; Nema et al., 2017), which have been shown to correlate better with other attribution methods. EA scores used as attribution also correlate better with human judgement than VA. We discuss the intuitive reasoning as to why VA over LSTM hidden states fails as an attribution score, the intuition behind the Diversity-LSTM, its mechanism, and its loss equations in Section 8. We also quantitatively compare the efficacy of the style-masking step between EA and VA in Table 2.
3.3 The Style-Masking Module

Even with accurate attribution scores, effective style-masking requires careful selection of a policy which satisfies certain criteria. The primary criterion is that only tokens which significantly contribute to the style of a sentence must be masked, while other tokens are left untouched, to ensure content is also preserved. Similar to the masking policy in Wu et al. (2020), it is natural to consider a "top $k$ tokens" scheme in which the $k$ tokens with highest attribution are masked. However, this static approach fails for sentences which do not have $k$ style-contributing tokens, leading to either partial style masking or erroneous masking of content tokens. For the same reason, even a sentence-length-aware scheme such as "top 15%" masking fails. Furthermore, all such policies require sorting, leading to $O(n \log n)$ time complexity for style-masking each sentence in a batch.
An "Attention Surplus" policy.
Let
A=
{Ai. . . An}
denote the attention distribution of a
sentence of size
n
. Intuitively, we can reason that
all "special" tokens which might contribute more
to style should have an attribution greater than the
average base attribution of the sentence, given by
Amean = 1/n
. Generalising this further, We refer
to tokens with
AiAbaseline
as tokens with "at-
tention surplus" w.r.t to a sentence-length sensitive
baseline attention Abaseline given by:
Abaseline = (1 + λ)Amean (1)
where
λ
is a hyperparameter of range
01.0
.
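This policy is cheap to implement: a single pass over the attention scores against a per-sentence threshold, with no sorting. Below is a minimal sketch under those assumptions; the mask token name, default $\lambda$, and example scores are illustrative and not taken from the paper.

```python
from typing import List

def attention_surplus_mask(tokens: List[str],
                           attributions: List[float],
                           lam: float = 0.5,
                           mask_token: str = "<mask>") -> List[str]:
    """Mask every token whose attribution meets the sentence-level baseline
    A_baseline = (1 + lam) * (1 / n), i.e. Eq. (1); one pass, no sorting."""
    n = len(tokens)
    a_baseline = (1.0 + lam) * (1.0 / n)   # (1 + lambda) * A_mean
    return [mask_token if a >= a_baseline else t
            for t, a in zip(tokens, attributions)]

# Example: n = 6 and lam = 0.5 give a baseline of 0.25, so only "great" (0.55)
# exceeds its uniform share of attention and gets masked.
tokens = "the food was really great here".split()
attrs  = [0.05, 0.10, 0.05, 0.15, 0.55, 0.10]
print(attention_surplus_mask(tokens, attrs, lam=0.5))
```

Because the attention distribution sums to one, $1/n$ is the share each token would receive under a uniform distribution, so the comparison inside the loop is a per-token "surplus" check rather than a global sort.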
This chosen threshold $A_{baseline}$ is sensitive to the sentence length and ensures that the number of style-significant tokens is determined dynamically, without the need for an elaborate algorithm. As a sanity check, we observe